As of March
2024, there are over 8,900 Airbnb listings in the city of Amsterdam,
the Netherlands. However, the ongoing presence of these listings is not
without controversy. For instance, Geerte
Udo — director of Amsterdam’s tourism promotion agency — argues that
short-term rental properties exacerbate the acute housing shortage that
Amsterdam has been experiencing, while harming the character of local
communities. With housing affordability and supply being a perennial
policy concern for the electorate, both local and national politicians
have called for greater regulatory scrutiny over the operations of
platforms such as Airbnb. Indeed, in line with expectations, the
conversion of residential properties to short-term rental properties has
generated upward pressure on surrounding property values, which can have
an adverse impact on local housing affordability. A study on Airbnb activity in the Netherlands by Morpey (2022), for example, found that a 1% increase in neighborhood-level Airbnb activity leads to a 0.038% increase in residential property values. You can view the presentation introducing our application here: Fairbnb APP Introduction.
“Private holiday rental was harmful in cities like Amsterdam… it took away a lot of houses from the market. We as Amsterdam people have to fight hard to find a house anyway. So that’s what we think: from a city perspective, it takes out social cohesion and it doesn’t add any houses.”
Geerte Udo
amsterdam&partners foundation
2022 Bloomberg CityLab global cities summit
Cities around the world are at a crossroads for short-term rental
regulation, with policies falling on a spectrum from outright bans to
laissez-faire attitudes towards such platforms. More drastic approaches
include that of New York, which effectively bans Airbnbs through Local
Law 18 (Short-Term Rental Registration Law) with exceptionally
stringent compliance rules. Likewise, Singapore bans all short-term
rentals with durations
below three months, while Vienna places extremely tight restrictions on the neighborhoods in which short-term rentals are permitted, with the 2023 update to the Building Code expanding the scope
of such restrictions. In comparison, cities such as London have less
stringent regulations over short-term rentals, often with a cap on
the maximum number of nights permitted in a year.
We believe that the level of stringency should be determined through the democratic choice of local communities and the wider populace. But good democracy requires well-informed citizens. For the vast majority of cities that stop short of a total ban on Airbnb, policymakers and citizens need a versatile decision-making tool to evaluate the impact of Airbnb activity at both the city and neighborhood levels. This forms the basis for Fairbnb — our proposed short-term rental market intelligence tool for policymakers (along with community advocates and interested members of the public) to quickly diagnose short-term rental supply-demand mismatches at the hyper-local level, opening possibilities for policies that respond dynamically to the market, as introduced in the next section.
“Houses aren’t used for what they’re meant to be used: for living, and some neighborhoods are completely taken over by tourists. That’s a severe threat to the livability of our inhabitants.”
Laurens Ivens
Amsterdam’s Deputy Mayor for Housing
2020 interview with Politico
The Fairbnb web application primarily serves policymakers in the housing and planning fields. At its core is the reporting of a Fairness Index — a local, below-neighborhood-level measure of how much Airbnb prices deviate from baseline price levels, as determined by property features and locational characteristics. A value of 0 for the Fairness Index indicates that Airbnb prices in a particular area are in line with market expectations, while a negative value shows that Airbnb prices are lower than what would be expected for that area. The Fairness Index would be derived from a hedonic pricing model for Airbnb prices using a robust set of property-level predictors and controlling for time and neighborhood fixed effects. A fishnet of cells overlaid across Amsterdam then aggregates errors from model predictions to produce the Fairness Index metric.
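As a minimal sketch of the idea (the full implementation appears in the modelling sections below), the Index for a grid cell could be computed as the average prediction error of the listings falling within that cell. The object names hedonic_fit, listings_sf and grid_sf here are hypothetical placeholders, not objects defined elsewhere in this document:
library(dplyr)
library(sf)
# Hypothetical inputs (placeholders for illustration only):
#   hedonic_fit - a fitted lm() of log price on property and locational predictors
#   listings_sf - sf point layer of listings with an observed log_price column
#   grid_sf     - sf polygon fishnet with a cell_id column
listings_sf$error <- listings_sf$log_price - predict(hedonic_fit, newdata = listings_sf)
fairness_by_cell <- listings_sf %>%
  st_join(grid_sf, join = st_within) %>%                 # assign each listing to its cell
  st_drop_geometry() %>%
  group_by(cell_id) %>%
  summarise(fairness_index = mean(error, na.rm = TRUE))  # negative = priced below baseline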
A persistently large, negative Fairness Index may be of particular concern to policymakers, given that its implications are two-fold: it signals that, for a given area (below the neighborhood level), Airbnb prices are below the market equilibrium, indicating an excess supply of listings. This in turn reflects the excessive conversion of residential housing into short-term rentals, housing that could be put to better use as long-term residential tenancies on both efficiency and social justice grounds. The Index thus serves as a starting point for policymakers to further explore the underlying drivers of persistent deviations from the market equilibrium in an area, which will differ from place to place. Indeed, there are myriad possibilities for policy interventions in areas with a consistently large, negative Fairness Index, and here it is apt to introduce the regulatory environment that property owners work within.
As of December 2024, the City of Amsterdam requires homeowners to obtain a permit before renting out their properties to tourists, with each permit priced at a flat rate of €71.00. In addition, the number of nights allowed for short-term rentals is capped at 30 per calendar year, and the nightly guest limit is 4. Homeowners must also notify the municipality (the City of Amsterdam) of each short-term rental prior to the arrival of each guest.
Such regulations have evolved significantly over time. For instance, in January 2018 the city council tightened the limit on the maximum number of days each property can be let out in a year from 60 to 30, a change that came into force on 1 January 2019. The datasets provided contain information scraped from the platform prior to the implementation of this policy update.
In addition, the City of Amsterdam attempted to ban Airbnb listings altogether in three neighborhoods in the red-light district and parts of the canal ring from 2020, on the grounds of nuisance. However, this decision was overturned by the country's highest administrative court (the Council of State) in 2023. The judgment noted that the City should have first explored a reduced quota system or considered withdrawing permits from owners whose properties had received nuisance complaints from neighbors.
Meanwhile, property market analysts have observed that the recently tightened limits on rentals have cut Airbnb supply in Amsterdam by about half, although rising tourist inflows have placed upward pressure on the listing of residential properties for short-term rental.
In the European Union (EU), draft legislation is in place to reconcile the demand for short-term rental platforms such as Airbnb with the adverse impacts they can bring to local communities. This is in response to pressure from 13 EU cities, including Amsterdam and Barcelona, to address the conversion of residential properties intended for long-term leases into short-term rentals.
Under the draft Directive (set to come into force in 2026), companies such as Airbnb are required to share data with local government agencies on a monthly basis regarding 1) the number of nights the unit was rented out, 2) the number of guests, 3) the specific address, and 4) the URL of the listing. This policy is intended to enable local governments to actively monitor market dynamics in the short-term accommodation rental sector, and it is indeed the key proposition for an intelligence tool such as Fairbnb — exploiting granular short-term rental data for more precise policymaking.
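Purely for illustration, such a monthly data share could be represented as a simple table on the municipality's side; the field names and values below are hypothetical and are not taken from the Directive text:
library(tibble)
# Hypothetical structure of one record in a monthly platform data share
monthly_share <- tibble(
  nights_rented = 12,                                     # 1) nights the unit was rented out
  guest_count   = 28,                                     # 2) number of guests
  address       = "Voorbeeldstraat 1, 1011 AB Amsterdam", # 3) specific address (fictional)
  listing_url   = "https://www.airbnb.com/rooms/0000000"  # 4) URL of the listing (placeholder)
)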
With this regulatory context in mind, there are various actions policymakers can take based on the Fairbnb Index. For instance, if a certain part of a neighborhood has a consistently large, negative Fairness Index (indicating an over-supply of Airbnb properties), policymakers could adopt a combination of the following local policies, working along with residents to:
Set a limit on the total number of permits for that particular area while introducing a competitive bidding process for permits, with a pre-determined reserve price. This is in contrast to the flat €71 permit price under existing regulations.
Introduce a minimum quantum of affordable / long-term leasehold units for that specific area, to preserve its character and community.
Levy taxes on vacancy in between short-term rental periods, to disincentivize speculation by Airbnb property owners.
Shorten the maximum allowed number of days for a property to be rented out on a short-term basis.
Implement an additional community tax on short-term rental hosts.
Reinvest all short-term rental related revenue into local community infrastructure and housing programs.
The strength of hyper-local market intelligence, in contrast to city-wide blanket policies, lies in the ability to differentiate policy across and within neighborhoods. The small scale of intervention, in combination with market-based mechanisms, also mitigates unintended spillover effects on supply-demand dynamics in adjacent areas and neighborhoods.
Against this backdrop, the next section introduces the 2018 – 19 Airbnb data available for the present analysis, along with supplementary datasets that are used for the calculation of the Fairbnb Index.
This section introduces the original datasets provided relating to the scraped Airbnb listings for Amsterdam, and some external datasets we have obtained that may improve the predictive power of models of Airbnb rental prices.
For reproducibility, the data wrangling workflow is presented below for both the original datasets and the external datasets obtained.
The datasets provided were scraped by Erik Bruin on 6 Dec 2018, covering daily price data for 20,030 unique Airbnb listings in Amsterdam from 6 Dec 2018 to 6 Dec 2019 (365 days). In total, there are six datasets available for analysis; the key ones are described as follows:
calendar.csv — records the availability of the 20,030 unique Airbnb listings throughout the study period (each listing having 365 records), along with price information for each available day; this forms the primary source of data for this analysis
listings_details.csv — contains information on each listing, such as host name, neighborhood, listing type (private room or entire place), reviews available, and a description
reviews_details.csv — provides the text content for each review under each listing, and the corresponding date
Calendar data
For calendar.csv, the cleaning process involves 1) removing listings where there is no price information available for any day in the 365-day study period and 2) converting the $price variable to numeric.
# Load the calendar data and set appropriate column types
calendar <- read.csv("./data/calendar.csv")
calendar$listing_id <- as.factor(calendar$listing_id)
calendar$available <- as.factor(calendar$available)
calendar$date <- as.POSIXct(calendar$date, format = "%Y-%m-%d", tz = "UTC")
# Convert price strings (e.g. "$1,200.00") to numeric
calendar$price[calendar$price == ""] <- NA
calendar$price <- gsub("\\$", "", calendar$price)
calendar$price <- gsub(",", "", calendar$price)
calendar$price <- as.numeric(calendar$price)
# Drop listings with no price information on any day of the study period
calendar <- calendar %>%
group_by(listing_id) %>%
filter(!all(is.na(price))) %>%
ungroup()
# Keep only priced (available) days and drop the now-redundant availability flag
calendar <- calendar[!is.na(calendar$price), ]
calendar <- calendar %>%
select(-available)
Listings details data
For listings_details.csv, variables for geometry, superhost status, property and room type, bedding, occupancy, and the numbers of bathrooms, bedrooms and beds are extracted.
listings_details <- read.csv("./data/listings_details.csv")
listings_details <- listings_details %>%
select(
id,
latitude,
longitude,
# categorical variables
host_is_superhost,
property_type,
room_type,
bed_type,
# numeric variables
accommodates,
bathrooms,
bedrooms,
beds) %>%
mutate(host_is_superhost = ifelse(is.na(host_is_superhost), FALSE, host_is_superhost))
Reviews data
There are a few limitations to using the reviews data as an input for subsequent models, namely that 1) 12.0% of listings in the original dataset are new listings with 0 reviews, and 2) 31.0% of listings with reviews available have 5 or fewer reviews. It is therefore difficult, and indeed problematic, to assume that listings without (or with very few) reviews share similar price trends with well-reviewed listings, all else being equal. As such, the reviews dataset will not be used as a model input in the present analysis.
reviews <- read.csv("./data/reviews.csv")
reviews_summary <- reviews %>%
group_by(listing_id) %>%
summarise(count_reviews = n())
# Proportion without reviews (full listings dataset)
1 - nrow(reviews_summary) / nrow(listings_details)
# Proportion with 5 or fewer reviews
reviews_summary %>%
summarise(prop_leq_5 = mean(count_reviews <= 5)) %>%
pull(prop_leq_5)
ggplot(reviews_summary, aes(x = count_reviews)) +
geom_histogram(binwidth = 10, fill = "#006e70", color = "white") +
labs(
title = "Histogram of number of reviews per listing",
subtitle = "Data: 17,624 listings where reviews are available",
x = "Number of reviews",
y = "Count"
) +
scale_y_continuous(labels = comma) +
theme_minimal(base_family = "ubuntu")
Beyond the original datasets provided, additional features are incorporated into our analysis, as described below:
attractions.geojson — using Tripadvisor, the 10 most popular places of interest for tourists are obtained, including the Rijksmuseum and the Red Light District
trammetro-2018.geojson — locations of tram and metro stops in 2018, available directly on Amsterdam’s open data portal
restaurants-osm.geojson — locations of 1,801 restaurants located in Amsterdam, obtained using the QuickOSM plugin in QGIS
hotels-osm.geojson — locations of 382 hotels located in Amsterdam, obtained using the QuickOSM plugin in QGIS
holidays-ref.csv — dates of 2018 and 2019 public holidays in the Netherlands, obtained from timeanddate.com
sf.neighborhd25 and sf.neighborhd110 — neighborhood polygon layers downloaded from Amsterdam’s open data portal
Attractions data
The locations of the top 10 most popular places of interest for tourists on Tripadvisor are plotted manually on QGIS and loaded as attractions.geojson.
attraction <- st_read('./data/attractions.geojson')
attraction <- attraction %>%
st_transform('EPSG:28992')
Neighborhood data
The neighborhood layers data are directly queried from Amsterdam’s open data portal in the code chunk below, and neighborhood information is joined to each listing.
neighborhd25.url <- "https://maps.amsterdam.nl/open_geodata/geojson_lnglat.php?KAARTLAAG=INDELING_GEBIED_EXWATER&THEMA=gebiedsindeling"
neighborhd110.url <- "https://maps.amsterdam.nl/open_geodata/geojson_lnglat.php?KAARTLAAG=INDELING_WIJK_EXWATER&THEMA=gebiedsindeling"
sf.neighborhd25 <- st_read(neighborhd25.url) %>%
st_transform('EPSG:28992')
sf.neighborhd110 <- st_read(neighborhd110.url) %>%
st_transform('EPSG:28992')
listings_details <- listings_details %>%
filter(id %in% unique(calendar$listing_id))
listings_details_sf <- st_as_sf(
listings_details,
coords = c("longitude", "latitude"), # Specify the coordinate columns
crs = 4326 # Use the WGS84 CRS (standard for lat/long)
) %>%
st_transform('EPSG:28992')
listings_details_sf <- listings_details_sf %>%
filter(apply(st_intersects(., sf.neighborhd25, sparse = FALSE), 1, any)) %>%
st_join(., sf.neighborhd25[, c("Gebied")],
left = TRUE
) %>%
rename(neighborhood25 = Gebied) %>%
st_join(., sf.neighborhd110[, c("Wijk")],
left = TRUE
) %>%
rename(neighborhood110 = Wijk)
listings_details <- listings_details %>%
filter(id %in% listings_details_sf$id)
Transit data
trammetro-2018.geojson from Amsterdam’s open data portal is directly available and is visualized below:
transit <- st_read("./data/trammetro-2018.php.geojson")
transit <- transit %>%
distinct(geometry, .keep_all = TRUE)%>%
st_transform('EPSG:28992')
ggplot() +
geom_sf(data = sf.neighborhd25, fill = NA, color = "gray80") +
geom_point(data = subset(transit, Modaliteit == "Tram"),
aes(x = st_coordinates(geometry)[,1],
y = st_coordinates(geometry)[,2]),
color = "#00b2b3", alpha = 0.5, size = 1) +
geom_point(data = subset(transit, Modaliteit == "Metro"),
aes(x = st_coordinates(geometry)[,1],
y = st_coordinates(geometry)[,2]),
color = "#e74c3c", alpha = 0.5, size = 1) +
labs(title = "Transit stations in Amsterdam",
subtitle = "The GVB network: Tram (blue) and Metro (red) stops",
x = "Longitude", y = "Latitude") +
theme_minimal(base_family = "ubuntu") +
theme(
axis.line = element_blank(),
axis.text = element_blank(),
axis.ticks = element_blank(),
panel.grid = element_blank(),
axis.title = element_blank()
)
Hotels and restaurants data
As above, hotel and restaurant data are obtained in QGIS using QuickOSM, with the outputs loaded below.
hotels <- st_read("./data/hotels-osm.geojson")
hotels <- hotels %>%
select(full_id, name, geometry) %>%
st_transform('EPSG:28992')
restaurant <- st_read("./data/restaurants-osm.geojson")
restaurant <- restaurant%>%
select(full_id, name, geometry)%>%
st_transform('EPSG:28992')
Holidays data
A $pub_hol dummy variable is created for the calendar dataset to indicate if a given day falls on a public holiday in the Netherlands. A similar variable $pub_hol_week is created to indicate if a given day falls within ±3 days from a public holiday.
holidays <- read.csv("./data/holidays-ref.csv")
calendar$date <- as.POSIXct(calendar$date, format = "%Y-%m-%d", tz = "UTC")
holidays$Date <- as.POSIXct(holidays$Date, format = "%d/%m/%Y", tz = "UTC")
calendar <- calendar %>%
mutate(pub_hol = ifelse(date %in% unique(holidays$Date), 1, 0))
holiday_ranges <- holidays %>%
mutate(Date = as.Date(Date)) %>% # Ensure Date is in Date format
rowwise() %>%
summarise(
Date = list(Date + c(-3, -2, -1, 0, 1, 2, 3)), # Generate date ranges
.groups = "drop"
) %>%
unnest(cols = c(Date)) %>% # Unnest the list of dates
distinct() %>% # Remove duplicates
arrange(Date) # Sort the dates
holiday_ranges <- holiday_ranges %>%
mutate(Date = as.POSIXct(Date, tz = "UTC"))
calendar <- calendar %>%
mutate(pub_hol_week = ifelse(date %in% unique(holiday_ranges$Date), 1, 0))
To account for time fixed effects (seasonality in Airbnb pricing), the variables $month and $day (day of the week) are created.
calendar <- calendar %>%
mutate(
day = format(calendar$date, "%a"), # Day of the week (Mon, Tue, ...)
month = format(calendar$date, "%b") # Month abbreviation (Jan, Feb, ...)
)
## plot mean price by month
# calendar_bymonth <- calendar_cleaned %>%
# group_by(month) %>%
# summarise(mean_price = mean(price, na.rm = T))
#
# calendar_bymonth$month <- as.numeric(as.character(calendar_bymonth$month))
# plot(x = calendar_bymonth$month, y = calendar_bymonth$mean_price, type = "l")
For the 10,510 listings, the daily rate ranges from €21.43 to €8,500, with a mean of €179.90 and a median of €148.75. It is evident that ultra-luxurious units (with particularly high daily rates, given their bespoke features and services) are contained within the dataset. These outliers are likely to distort the conclusions drawn from the subsequent regression analysis. Thus, for the present analysis, we apply the commonly used rule that a data point is an outlier if it lies more than 1.5 x IQR above the third quartile (that is, above Q3 + 1.5 x IQR) or more than 1.5 x IQR below the first quartile (below Q1 - 1.5 x IQR).
Applying this rule to the mean daily rate of each listing yields an upper price threshold of €354.62 per night, which removes 621 (5.9%) of the 10,510 listings. Individual price records are then further filtered to those below twice this threshold (€709.23), as some properties have anomalous prices (such as listing #22807937, which charges €8,070 on a few days and €80 on all other days), likely a minor error introduced during the web scraping process.
listings_details$id <- as.factor(listings_details$id)
listings_details_sf$id <- as.factor(listings_details_sf$id)
calendar <- calendar %>%
filter(listing_id %in% unique(listings_details$id))
calendar_price_summary <- calendar %>%
group_by(listing_id) %>%
summarise(mean_price = mean(price, na.rm = T))
summary(calendar_price_summary$mean_price)
q1 <- quantile(calendar_price_summary$mean_price, 0.25)
q3 <- quantile(calendar_price_summary$mean_price, 0.75)
iqr <- q3 - q1
# Define outlier thresholds
lower_bound <- q1 - 1.5 * iqr
upper_bound <- q3 + 1.5 * iqr
lower_bound
upper_bound
upper_bound * 2
calendar_price_summary %>%
filter(mean_price < lower_bound | mean_price > upper_bound) %>%
nrow()
# [1] 621
calendar_price_summary <- calendar_price_summary %>%
filter(mean_price < 354.62)
calendar <- calendar %>%
filter(listing_id %in% unique(calendar_price_summary$listing_id))
listings_details <- listings_details %>%
filter(id %in% unique(calendar_price_summary$listing_id))
listings_details_sf <- listings_details_sf %>%
filter(id %in% unique(calendar_price_summary$listing_id))
# Remove price records outliers for calendar
calendar <- calendar %>%
filter(price < 709.23)
listings_details <- listings_details %>%
filter(id %in% unique(calendar$listing_id))
listings_details_sf <- listings_details_sf %>%
filter(id %in% unique(calendar$listing_id))
Within the listings dataset, there are 14 listings lacking information on property attributes (such as the number of beds). These are removed from the dataset, leaving us with 9,875 listings.
listings_details %>%
summarise(na_rows = sum(rowSums(is.na(.)) > 0))
listings_details <- na.omit(listings_details)
listings_details_sf <- na.omit(listings_details_sf)
calendar <- calendar %>%
filter(listing_id %in% unique(listings_details$id))
For each listing, the average distances to the three nearest transit stops, the fifteen nearest restaurants and the three nearest popular tourist attractions are calculated.
predictors.unique <- listings_details_sf %>%
mutate(tran_nn3 = nn_function(st_coordinates(listings_details_sf), st_coordinates(transit), 3),
rest_nn15 = nn_function(st_coordinates(listings_details_sf), st_coordinates(restaurant), 15),
attr_nn3 = nn_function(st_coordinates(listings_details_sf), st_coordinates(attraction), 3)) %>%
rename(listing_id = id)
listings_details <- st_drop_geometry(predictors.unique)
listings_details_sf <- predictors.unique
calendar <- calendar %>%
left_join(predictors.unique, by = "listing_id") %>%
na.omit()
nrow(calendar)
# [1] 1059688
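The nn_function() helper used above is not defined in this document; it is assumed to return, for each point in the first layer, the mean distance to its k nearest features in the second layer. A minimal sketch of an equivalent helper, assuming the FNN package, is shown below:
library(FNN)
# Sketch of an nn_function()-style helper (an assumption, not the original implementation):
# for each row of measureFrom (a coordinate matrix), compute the mean distance to its k
# nearest neighbours among the rows of measureTo. With EPSG:28992 coordinates, distances
# are in meters.
nn_function <- function(measureFrom, measureTo, k) {
  nn <- FNN::get.knnx(data = measureTo, query = measureFrom, k = k)
  rowMeans(nn$nn.dist)
}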
Here, hotels, hostels and non-conventional property types such as boats and barns are removed from the dataset, as the focus of the present analysis is on mainstream, residential properties.
calendar <- calendar %>%
mutate(across(where(is.character), as.factor))
# summary(calendar$property_type)
exclude_property_types <- c("Houseboat", "Boutique hotel", "Boat", "Other",
"Hotel", "Cabin", "Cottage", "Hostel", "Camper/RV",
"Barn", "Casa particular (Cuba)", "Aparthotel", "Nature lodge", "Tiny house", "Chalet")
calendar <- calendar %>%
filter(!property_type %in% exclude_property_types)
calendar <- calendar %>%
filter(room_type != "Shared room")
calendar <- calendar %>%
mutate(entire_place = ifelse(room_type == "Entire home/apt", 1, 0))
calendar$room_type = NULL
summary(calendar$property_type)
listings_details <- listings_details %>%
filter(listing_id %in% unique(calendar$listing_id))
listings_details_sf <- listings_details_sf %>%
filter(listing_id %in% unique(calendar$listing_id))
unique(calendar$host_is_superhost)
calendar <- calendar %>%
mutate(superhost = if_else(host_is_superhost %in% c("", "f"), 0, 1))
listings_details <- listings_details %>%
mutate(superhost = if_else(host_is_superhost %in% c("", "f"), 0, 1))
listings_details_sf <- listings_details_sf %>%
mutate(superhost = if_else(host_is_superhost %in% c("", "f"), 0, 1))
calendar <- calendar %>%
mutate(superhost = factor(superhost))
listings_details <- listings_details %>%
mutate(superhost = factor(superhost))
listings_details_sf <- listings_details_sf %>%
mutate(superhost = factor(superhost))
# Convert binary variables to factors
calendar <- calendar %>%
mutate(
entire_place = factor(entire_place),
pub_hol = factor(pub_hol),
pub_hol_week = factor(pub_hol_week)
)
From above, the data wrangling process results in 9,378 Airbnb listings in Amsterdam, associated with 976,204 price records available for regression analysis. Each price record is associated with the following variables:
Dependent variable
price — daily rate of Airbnb listing

Continuous predictor variables
accommodates — number of guests permitted
bathrooms — number of bathrooms in unit
bedrooms — number of bedrooms in unit
beds — number of beds in unit
tran_nn3 — average distance to the three nearest transit stops (metro / tram)
rest_nn15 — average distance to the fifteen nearest restaurants
attr_nn3 — average distance to the three nearest tourist attractions

Categorical predictor variables
property_type — type of property, namely: Apartment, Townhouse, Bed and breakfast, House, Guesthouse, Condominium, Guest suite, Loft, Serviced apartment, Villa, and Bungalow
entire_place — whether guests have the entire place to themselves (1) or just a private room (0)
pub_hol — whether the date falls on a public holiday in the Netherlands
pub_hol_week — whether the date falls within ±3 days from a public holiday in the Netherlands
day — day of the week
month — month of the year
superhost — whether the host is a superhost (1) or not (0)
bed_type — type of beds available, namely: Real Bed, Pull-out Sofa, Futon, or Airbed
neighborhood110 — neighborhood name
Out of the 365-day study period, the 9,378 listings are marked as available for an average of 104 days, well above the pre-2019 yearly limit of 60 days set by the City of Amsterdam. This illuminates one interesting behavioral aspect of property owners — the tendency to speculatively list their properties for a wider date range than what is legally permitted. It is likely that such properties sit vacant during the short periods between bookings — a rather inefficient use of space that forms one key motivation for this analysis.
calendar_count <- calendar %>%
group_by(listing_id) %>%
summarise(record_count = n(), .groups = "drop")
mean(calendar_count$record_count)
# [1] 104.0951
The histogram below shows the distribution of daily rates for the filtered Airbnb listings. As the figure shows a skew to the right, it is useful to perform a log-transformation of Airbnb prices for subsequent analyses.
calendar_price_summary <- calendar %>%
group_by(listing_id) %>%
summarise(mean_price = mean(price, na.rm = T))
ggplot(calendar_price_summary, aes(x = mean_price)) +
geom_histogram(binwidth = 20, fill = "#006e70", color = "white") +
labs(
title = "Histogram of mean daily rates for Airbnb listings in Amsterdam",
subtitle = "Data: 9,378 listings scraped on 6 Dec 2018 (after cleaning)",
x = "Mean daily rate (€)",
y = "Count"
) +
scale_y_continuous(labels = comma) +
theme_minimal(base_family = "ubuntu")
The result of the log-transformation (creating a new variable log_price) is indicated by the histogram below. While there is now a slight left skew, the degree of skew seems to be reduced, with the shape following more closely that of a normal distribution.
calendar <- calendar %>%
mutate(log_price = log(price))
calendar_price_summary <- calendar %>%
group_by(listing_id) %>%
summarise(mean_logged_price = mean(log_price, na.rm = T))
ggplot(calendar_price_summary, aes(x = mean_logged_price)) +
geom_histogram(binwidth = 0.15, fill = "#00999e", color = "white") +
labs(
title = "Log-transformed mean daily rates in Amsterdam",
subtitle = "Data: 9,378 listings scraped on 6 Dec 2018 (after cleaning)",
x = "Mean daily rate (€)",
y = "Count"
) +
scale_y_continuous(labels = comma) +
theme_minimal(base_family = "ubuntu")
The table below shows the summary statistics for the numeric predictor variables in the dataset.
str(calendar)
summ_stats <- calendar %>% st_drop_geometry()
summ_stats <- summ_stats %>%
select(accommodates, bathrooms, bedrooms, beds, tran_nn3, rest_nn15, attr_nn3)
summ_stats <- summ_stats %>%
psych::describe() %>%
dplyr::select(mean:median,min,max) %>%
mutate(Mean = as.character(round(mean)),
SD = as.character(round(sd)),
Median = as.character(round(median)),
Min = as.character(round((min))),
Max = as.character(round(max))) %>%
dplyr::select(-(mean:max)) %>%
# filter(Max > 1) %>%
as.data.frame()
summ_stats$Description <-
c("Occupancy",
"Number of bathrooms",
"Number of bedrooms",
"Number of beds",
"Average distance to 3 nearest transit stations",
"Average distance to 3 nearest restaurants",
"Average distance to 3 nearest tourist attractions")
summ_stats %>%
dplyr::select(Description, Mean, SD, Median, Min, Max) %>%
kable(format = "html", caption = "Summary statistics for numeric predictor variables") %>%
kable_styling(bootstrap_options = c("hover", "striped", "condensed"),
fixed_thead = TRUE)
| Variable | Description | Mean | SD | Median | Min | Max |
|---|---|---|---|---|---|---|
| accommodates | Occupancy | 3 | 1 | 2 | 1 | 12 |
| bathrooms | Number of bathrooms | 1 | 0 | 1 | 0 | 10 |
| bedrooms | Number of bedrooms | 1 | 1 | 1 | 0 | 6 |
| beds | Number of beds | 2 | 1 | 1 | 0 | 12 |
| tran_nn3 | Average distance to 3 nearest transit stations (m) | 480 | 501 | 339 | 132 | 4448 |
| rest_nn15 | Average distance to 15 nearest restaurants (m) | 379 | 413 | 245 | 17 | 3762 |
| attr_nn3 | Average distance to 3 nearest tourist attractions (m) | 2046 | 1729 | 1533 | 189 | 10365 |
Here, histogram plots of the seven numeric predictors are produced to examine any potential skew, which can be mitigated with log-transformation.
numeric_vars <- calendar %>%
st_drop_geometry() %>%
select(accommodates, bathrooms, bedrooms, beds, tran_nn3, rest_nn15, attr_nn3)
ggplot(numeric_vars %>%
gather(key = "variable", value = "value"), aes(x = value)) +
geom_histogram(bins = 15, fill = "#006e70", color = "white", alpha = 1) +
facet_wrap(~ variable, scales = "free", ncol = 2) +
theme_minimal(base_family = "ubuntu") +
labs(
title = "Histograms for numeric predictor variables",
subtitle = "Includes property attributes and locational characteristics",
x = "Value",
y = "Count"
) +
theme(
text = element_text(family = "Ubuntu"),
plot.title = element_text(family = "ubuntu", size = 16),
axis.title.x = element_text(family = "ubuntu"),
axis.title.y = element_text(family = "ubuntu")
) +
scale_x_continuous(labels = comma) +
scale_y_continuous(labels = comma)
Indeed, there is an obvious skew for five of these predictors (attr_nn3, bathrooms, beds, rest_nn15 and tran_nn3), and a log transformation is thus applied to them, with the resulting histograms as follows:
calendar <- calendar %>%
mutate(
log_attr_nn3 = log(attr_nn3),
log_bathrooms = log(bathrooms + 1),
log_beds = log(beds + 1),
log_rest_nn15 = log(rest_nn15),
log_tran_nn3 = log(tran_nn3)
)
numeric_vars <- calendar %>%
st_drop_geometry() %>%
select(log_bathrooms, log_beds, log_tran_nn3, log_rest_nn15, log_attr_nn3)
ggplot(numeric_vars %>%
gather(key = "variable", value = "value"), aes(x = value)) +
geom_histogram(bins = 15, fill = "#00999e", color = "white", alpha = 1) +
facet_wrap(~ variable, scales = "free", ncol = 2) +
theme_minimal(base_family = "ubuntu") +
labs(
title = "Histograms for log-transformed numeric predictor variables",
subtitle = "Log-transformation involving adding 1 to the number of beds and bathrooms",
x = "Value",
y = "Count"
) +
theme(
text = element_text(family = "Ubuntu"),
plot.title = element_text(family = "ubuntu", size = 16),
axis.title.x = element_text(family = "ubuntu"),
axis.title.y = element_text(family = "ubuntu")
) +
scale_x_continuous(labels = comma) +
scale_y_continuous(labels = comma)
The correlation matrix below shows the pairwise Pearson correlation coefficients between numeric variables (with log-transformation, if applicable), including the dependent variable (log-transformed Airbnb prices).
numeric_vars <- calendar %>%
st_drop_geometry() %>%
select(log_price, accommodates, log_bathrooms, bedrooms, log_beds, log_tran_nn3, log_rest_nn15, log_attr_nn3)
corr <- round(cor(numeric_vars), 2)
p <- ggcorrplot(corr,
colors = rev(corr_mat_pal),
lab = FALSE,
type = "lower",
insig = "blank",
tl.cex = 6) +
labs(title = "Correlation matrix for continuous predictor variables") +
theme_minimal(base_family = "Ubuntu") +
theme(
text = element_text(family = "Ubuntu"),
axis.text.x = element_text(size = 5, family = "Ubuntu"),
axis.text.y = element_text(size = 8, family = "Ubuntu"),
plot.title = element_text(size = 14, family = "Ubuntu"),
plot.subtitle = element_text(size = 10, family = "Ubuntu"),
axis.title = element_blank(),
axis.ticks = element_blank(),
panel.grid = element_blank()
)
p +
geom_text(aes(label = value),
family = "Ubuntu",
size = 4,
fontface = "plain")
There appears to be a strong collinearity between log_rest_nn15 and log_attr_nn3. This may be due to the fact that tourist hotspots in Amsterdam are likely to have greater restaurant offerings, given the higher footfall. Likewise, the strong collinearity between log_rest_nn15 and log_tran_nn3 suggests that restaurants are also likely to cluster around transit areas, which intuitively have greater visitor traffic. Unsurprisingly, the correlation between number of beds and number of bedrooms is high. These findings will be taken into account in the subsequent modelling section when making comparisons between different model specifications.
Meanwhile, the figure below presents the correlation of different numeric predictor variables with the dependent variable, in descending order. All numeric predictors have a moderate-to-weak correlation with log-transformed daily rates, suggesting that interaction terms may be helpful in subsequent regression analysis.
numeric_vars %>%
correlate() %>%
select(term, log_price) %>%
mutate(Correlation = if_else(log_price > 0, "Positive", "Negative")) %>%
drop_na() %>%
mutate(term = fct_reorder(term, desc(abs(log_price)))) %>%
ggplot(aes(x = term, y = log_price, fill = Correlation)) +
geom_col(alpha = 0.8, width = 0.5) +
scale_fill_manual(values = c("#e74c3c", "#006e70")) +
theme_minimal(base_family = "ubuntu") +
labs(
title = "Correlation coefficients of numeric variables",
subtitle = "With log-transformed Airbnb daily rates",
x = "Numeric variables",
y = "Correlation with logged daily rates"
) +
theme(
text = element_text(family = "ubuntu"),
plot.title = element_text(family = "ubuntu", size = 16),
axis.text.x = element_text(angle = 45, hjust = 1, family = "ubuntu"),
axis.text.y = element_text(family = "ubuntu"),
axis.title.x = element_text(family = "ubuntu"),
axis.title.y = element_text(family = "ubuntu")
)
As a complement to the correlation values calculated, the figures below present selected scatterplots of numeric predictor variables with log-transformed Airbnb prices.
Distance to transit
ggplot(calendar, aes(x = log_tran_nn3, y = log_price)) +
geom_point(color = "black", alpha = 0.008) +
geom_smooth(method = "lm", color = "#e74c3c", se = FALSE) + # Best fit line
labs(
title = "Scatterplot of logged daily rates vs logged distance to transit",
subtitle = "Best-fit line indicated in red",
x = "Log-transformed distance to transit",
y = "Log-transformed Airbnb daily rates"
) +
theme_minimal(base_family = "ubuntu") +
theme(
text = element_text(family = "ubuntu"),
plot.title = element_text(family = "ubuntu"),
plot.subtitle = element_text(family = "ubuntu"),
axis.title.x = element_text(family = "ubuntu"),
axis.title.y = element_text(family = "ubuntu")
)
Number of beds
ggplot(calendar, aes(x = log_beds, y = log_price)) +
geom_point(color = "black", alpha = 0.008) +
geom_smooth(method = "lm", color = "#e74c3c", se = FALSE) + # Best fit line
labs(
title = "Scatterplot of logged daily rates vs logged number of beds",
subtitle = "Best-fit line indicated in red",
x = "Log-transformed number of beds",
y = "Log-transformed Airbnb daily rates"
) +
theme_minimal(base_family = "ubuntu") +
theme(
text = element_text(family = "ubuntu"),
plot.title = element_text(family = "ubuntu"),
plot.subtitle = element_text(family = "ubuntu"),
axis.title.x = element_text(family = "ubuntu"),
axis.title.y = element_text(family = "ubuntu")
)
The box plots below summarize the (log-transformed) Airbnb daily rates for each level of the categorical predictor variables in the dataset (with the exception of neighborhood110, which has 110 levels).
categorical_vars <- calendar %>%
st_drop_geometry() %>%
select(log_price, property_type, entire_place, pub_hol, pub_hol_week, day, month,
superhost, bed_type)
cat_vars <- c("property_type", "entire_place", "pub_hol", "pub_hol_week", "day", "month",
"superhost", "bed_type")
plot_list <- list()
for (var in cat_vars) {
plot <- ggplot(categorical_vars, aes(x = .data[[var]], y = log_price)) +
geom_boxplot(fill = '#006e70') +
scale_y_continuous(labels = comma) +
labs(title = paste("Logged daily rates by", var),
x = var,
y = "Logged daily rates") +
theme_minimal() +
theme(
axis.text.x = element_text(angle = 45, hjust = 1, family = "ubuntu"),
axis.text.y = element_text(family = "ubuntu"),
axis.title.x = element_text(family = "ubuntu"),
axis.title.y = element_text(size = 8, family = "ubuntu"),
plot.title = element_text(size = 10, family = "ubuntu")
)
plot_list[[var]] <- plot
}
combined_plot <- wrap_plots(plot_list, ncol = 3)
combined_plot
In line with expectations, there is a price premium for public holidays and weeks with public holidays. Having access to the entire place also fetches a large premium.
Meanwhile, on the Airbnb platform, selected property owners are granted the “Superhost” badge signifying exceptional hospitality standards. In the cleaned dataset, 2,034 (21.7%) property owners are classified as Superhosts. Interestingly, the box plots above do not show a price premium for Superhost properties.
# Create a data frame for plotting
host_status <- as.data.frame(table(listings_details$host_is_superhost))
colnames(host_status) <- c("Superhost", "Count")
host_status <- host_status %>%
mutate(Proportion = round(Count / sum(Count), 4))
# Plot
ggplot(host_status, aes(x = Count, y = Superhost)) +
geom_bar(stat = "identity", color = "#006e70", fill = "#006e70", width = 0.5) +
labs(
title = "Superhost status of Airbnb listings",
subtitle = "Data: 9,378 listings scraped on 6 Dec 2018 (after removing outliers)",
x = "Number of listings",
y = "Superhost status"
) +
scale_x_continuous(labels = comma) +
theme_minimal(base_family = "ubuntu")
The map below shows the distribution of average daily rates for Airbnb listings, aggregated at the neighborhood level. With the exception of the business district of Amstel III/Bullewijk in the southeast, average daily rates are highest in the traditional city center and decrease with distance from it, in agreement with the classic bid-rent theory in real estate. The seemingly anomalous observation for Amstel III/Bullewijk could be due to a client base consisting more of business than leisure travelers, given the corporate setting of the overall neighborhood. Another explanation could be the lack of alternative accommodation providers in the vicinity and the lack of competition within the platform (the number of Airbnb listings there being the 3rd lowest among all 110 neighborhoods).
price_summary <- calendar %>%
group_by(neighborhood110) %>%
summarise(
mean_price = mean(price, na.rm = TRUE),
count = n() # Add count of records in each group
)
price_summary_sf <- price_summary %>%
left_join(sf.neighborhd110, by = c("neighborhood110" = "Wijk")) %>%
st_as_sf()
ggplot(price_summary_sf) +
geom_sf(aes(fill = mean_price)) +
scale_fill_viridis(option = "magma", name = "Mean daily rate", direction = -1) +
theme_minimal() +
labs(
title = "Mean daily rate by neighborhood",
subtitle = "Base layer: Amsterdam Open Data Portal"
) +
theme(
text = element_text(family = "ubuntu"),
plot.title = element_text(family = "ubuntu"),
plot.subtitle = element_text(family = "ubuntu"),
axis.title = element_blank(),
axis.text = element_blank(),
panel.grid = element_blank()
)
Prior to variable selection, the cleaned data is divided into a 70% training set and a 30% test set. The createDataPartition function is applied to the neighborhood variable to stratify the split, ensuring that each of the 110 neighborhoods is covered in both the training and test sets. For reproducibility, a seed of 5080 is set.
set.seed(5080)
train_indices <- createDataPartition(calendar$neighborhood110, p = 0.7, list = FALSE)
train_data <- calendar[train_indices, ]
test_data <- calendar[-train_indices, ]
Using the variables described above (including log-transformations when necessary), a base linear model (Model #1) is created containing all predictors. Based on this base specification, Model #2 modifies the neighborhood predictor, using a coarser spatial unit of analysis (25 districts rather than 110 neighborhoods). Following the observations of collinearity among some locational characteristics predictors, Model #3 modifies the base model by dropping the distance to the nearest restaurants (log_rest_nn15). Lastly, Models #4 and #5 experiment with interaction terms, with Model #4 testing whether the interaction between occupancy and property type improves the model, and Model #5 investigating the interaction between occupancy and whether guests have access to the entire place.
These models are fitted on the training set and evaluated on the test set, with the mean absolute error (MAE) and mean absolute percentage error (MAPE) calculated for each regression specification. These two metrics are then compared across the five models, with lower values indicating a better model fit, as presented in the summary table below.
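Concretely, with y_i the observed log-transformed price of test record i and ŷ_i the model prediction, the error metrics computed by calc_metrics() below are:

$$\mathrm{MAE} = \frac{1}{n}\sum_{i=1}^{n}\left|\hat{y}_i - y_i\right|, \qquad \mathrm{MAPE} = \frac{1}{n}\sum_{i=1}^{n}\frac{\left|\hat{y}_i - y_i\right|}{y_i}$$

Note that both metrics are computed on the log scale here, so the MAPE values reported below are percentage errors of log prices rather than of raw euro rates.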
any(is.na(calendar))
# reg1
reg1_formula <- log_price ~ pub_hol + pub_hol_week + day + month + property_type +
bed_type + accommodates + bathrooms + bedrooms + beds +
neighborhood110 + entire_place + superhost +
log_attr_nn3 + log_bathrooms + log_beds + log_rest_nn15 + log_tran_nn3
reg1 <- lm(reg1_formula, data = train_data)
# reg2
reg2_formula <- log_price ~ pub_hol + pub_hol_week + day + month + property_type +
bed_type + accommodates + bathrooms + bedrooms + beds +
neighborhood25 + entire_place + superhost +
log_attr_nn3 + log_bathrooms + log_beds + log_rest_nn15 + log_tran_nn3
reg2 <- lm(reg2_formula, data = train_data)
# reg3
reg3_formula <- log_price ~ pub_hol + pub_hol_week + day + month + property_type +
bed_type + accommodates + bathrooms + bedrooms + beds +
neighborhood110 + entire_place + superhost +
log_attr_nn3 + log_bathrooms + log_beds + log_tran_nn3
reg3 <- lm(reg3_formula, data = train_data)
# reg4 with interaction: accommodates * property_type
reg4_formula <- log_price ~ pub_hol + pub_hol_week + day + month + property_type +
bed_type + accommodates + bathrooms + bedrooms + beds +
neighborhood110 + entire_place + superhost +
log_attr_nn3 + log_bathrooms + log_beds + log_rest_nn15 + log_tran_nn3 +
accommodates:property_type
reg4 <- lm(reg4_formula, data = train_data)
# reg5 with interaction: accommodates * entire_place
reg5_formula <- log_price ~ pub_hol + pub_hol_week + day + month + property_type +
bed_type + accommodates + bathrooms + bedrooms + beds +
neighborhood110 + entire_place + superhost +
log_attr_nn3 + log_bathrooms + log_beds + log_rest_nn15 + log_tran_nn3 +
accommodates:entire_place
reg5 <- lm(reg5_formula, data = train_data)
summary(reg1)
summary(reg2)
summary(reg3)
summary(reg4)
summary(reg5)
reg_models <- list(reg1 = reg1, reg2 = reg2, reg3 = reg3, reg4 = reg4, reg5 = reg5)
calc_metrics <- function(model, test_data) {
test_data <- test_data %>%
mutate(
log_pricePredict = predict(model, test_data),
log_price.Error = log_pricePredict - log_price,
log_price.AbsError = abs(log_price.Error),
log_price.APE = log_price.AbsError / log_price
)
test_data %>%
summarise(
Mean_Error = mean(log_price.Error, na.rm = TRUE),
Mean_Absolute_Error = mean(log_price.AbsError, na.rm = TRUE),
Mean_APE = mean(log_price.APE, na.rm = TRUE)
)
}
metrics_list <- map(reg_models, calc_metrics, test_data = test_data)
metrics_df <- bind_rows(metrics_list, .id = "Model") %>%
mutate(across(where(is.numeric), ~ signif(.x, 4)))
save(metrics_df, file = "comparisons_models.RData")
load("comparisons_models.RData")
metrics_df %>%
kable(format = "html", caption = "Error metrics for different model specifications") %>%
kable_styling(bootstrap_options = c("hover", "striped", "condensed"),
fixed_thead = TRUE)
Model | Mean_Error | Mean_Absolute_Error | Mean_APE |
---|---|---|---|
reg1 | -0.001238 | 0.2444 | 0.04979 |
reg2 | -0.001193 | 0.2489 | 0.05073 |
reg3 | -0.001253 | 0.2445 | 0.04981 |
reg4 | -0.001259 | 0.2436 | 0.04963 |
reg5 | -0.001208 | 0.2427 | 0.04942 |
From the outputs above, one can observe that:
Controlling for more granular neighborhood fixed effects results in lower MAE and MAPE (comparing Models #1 and #2)
Removing distance to the nearest restaurants from the predictors does not improve MAE and MAPE (comparing Models #1 and #3)
Controlling for the interaction between occupancy and entire_place and between occupancy and property_type improves both the MAE and MAPE.
Following this, the final model incorporates:
Neighborhood fixed effects (110 neighborhoods of Amsterdam)
Average distance to the 15 nearest restaurants
Interaction effect between occupancy and entire_place
Interaction effect between occupancy and property_type
reg6_formula <- log_price ~ pub_hol + pub_hol_week + day + month + property_type +
bed_type + accommodates + bathrooms + bedrooms + beds +
neighborhood110 + entire_place + superhost +
log_attr_nn3 + log_bathrooms + log_beds + log_rest_nn15 + log_tran_nn3 +
accommodates:entire_place +
accommodates:property_type
# reg6 <- lm(reg6_formula, data = train_data)
In this section, the goodness of fit of the final model is evaluated with reference to its generalizability (whether it can predict new data well and under different spatial and temporal contexts) and accuracy (whether there is a low degree of error between actual and predicted values).
Here, the final model is cross-validated with k = 100 folds, with the folds stratified by neighborhood so that each fold covers the neighborhoods of Amsterdam present in the training data. For each fold, the mean absolute error (MAE) between actual and predicted (log-transformed) Airbnb daily rates is calculated, and the results are displayed in the histogram below.
calendar$neighborhood110 <- droplevels(calendar$neighborhood110)
train_data$neighborhood110 <- droplevels(train_data$neighborhood110)
test_data$neighborhood110 <- droplevels(test_data$neighborhood110)
# Set formula and train data
reg6_formula <- log_price ~ pub_hol + pub_hol_week + day + month + property_type +
bed_type + accommodates + bathrooms + bedrooms + beds +
neighborhood110 + entire_place + superhost +
log_attr_nn3 + log_bathrooms + log_beds + log_rest_nn15 + log_tran_nn3 +
accommodates:entire_place +
accommodates:property_type
# Create k stratified folds by neighborhood110
set.seed(5080) # For reproducibility
folds <- caret::createFolds(train_data$neighborhood110, k = 100, list = TRUE)
# Perform cross-validation
cv_results <- map_dfr(folds, function(test_indices) {
# Split the data into training and testing
test_data <- train_data[test_indices, ]
train_data_fold <- train_data[-test_indices, ]
# Train the model
reg6 <- lm(reg6_formula, data = train_data_fold)
# Predict on the test set
test_data <- test_data %>%
mutate(
log_pricePredict = predict(reg6, test_data),
log_price.Error = log_pricePredict - log_price,
log_price.AbsError = abs(log_pricePredict - log_price),
log_price.APE = abs(log_pricePredict - log_price) / log_price
)
# Calculate metrics for this fold
metrics <- test_data %>%
summarise(
MSE = mean(log_price.Error^2),
MAE = mean(log_price.AbsError),
MAPE = mean(log_price.APE)
)
return(metrics)
})
# cv_results_summary <- cv_results %>%
# summarise(
# Avg_MSE = mean(MSE),
# Avg_MAE = mean(MAE),
# Avg_MAPE = mean(MAPE)
# )
save(cv_results, file = "cv_results.RData")
load("cv_results.RData")
ggplot(cv_results, aes(x = MAE)) +
geom_histogram(binwidth = 0.001, fill = "#006e70", color = "white") +
theme_minimal(base_family = "Ubuntu") +
labs(
title = "Distribution of Mean Absolute Error of 100 folds",
subtitle = "Final model",
x = "Mean Absolute Error (MAE)",
y = "Frequency"
) +
theme(
text = element_text(family = "Ubuntu"),
plot.title = element_text(size = 16, family = "Ubuntu"),
plot.subtitle = element_text(size = 12, family = "Ubuntu"),
axis.title = element_text(size = 12, family = "Ubuntu"),
axis.text = element_text(size = 10, family = "Ubuntu")
)
The histogram above shows that the final model is able to generalize well for new Airbnb listing data, as the mean absolute errors of the folds cluster tightly together.
To determine if the final model is generalizable across space, a map aggregating the mean absolute percentage errors (MAPE) of the test set predictions at the neighborhood level is produced, as presented in the figure below. If there is no clustering of high or low MAPE in certain regions, then the model is able to generalize well across space.
reg6_formula <- log_price ~ pub_hol + pub_hol_week + day + month + property_type +
bed_type + accommodates + bathrooms + bedrooms + beds +
neighborhood110 + entire_place + superhost +
log_attr_nn3 + log_bathrooms + log_beds + log_rest_nn15 + log_tran_nn3 +
accommodates:entire_place +
accommodates:property_type
reg6 <- lm(reg6_formula, data = train_data)
test_data <- test_data %>%
mutate(
log_pricePredict = predict(reg6, test_data),
log_price.Error = log_pricePredict - log_price,
log_price.AbsError = abs(log_pricePredict - log_price),
log_price.APE = abs(log_pricePredict - log_price) / log_price
)
neighborhood_mean_APE <- test_data %>%
group_by(neighborhood110) %>%
summarise(mean_APE = mean(log_price.APE, na.rm = TRUE))
sf_with_mean_APE <- sf.neighborhd110 %>%
left_join(neighborhood_mean_APE, by = c("Wijk" = "neighborhood110"))
sf_with_mean_APE <- sf_with_mean_APE %>%
filter(!is.na(mean_APE))
ggplot(sf_with_mean_APE) +
geom_sf(aes(fill = mean_APE)) +
scale_fill_viridis(option = "mako", name = "MAPE", direction = -1) +
theme_minimal() +
labs(
title = "Mean absolute percentage error (MAPE) by neighborhood",
subtitle = "Calculations based on the final model"
) +
theme(
text = element_text(family = "ubuntu"),
plot.title = element_text(family = "ubuntu"),
plot.subtitle = element_text(family = "ubuntu"),
axis.title = element_blank(),
axis.text = element_blank(),
panel.grid = element_blank()
)
The map above shows rather decent generalizability across space. The central neighborhoods have a decent spread of MAPE values from 3% to 7%, and even in the city outskirts (areas such as Amsterdam-Zuidoost to the southeast and Amsterdam Nieuw-West to the west) there is a wide range of values from 2% to 9%. Thus, on the whole, one can conclude that the model generalizes well across space.
The accuracy of the model can be determined with reference to the MAE and MAPE values calculated for the test dataset. These are reported in the table below:
mean_abs_error <- mean(test_data$log_price.AbsError, na.rm = TRUE)
mean_ape <- mean(test_data$log_price.APE, na.rm = TRUE)
error_metrics <- data.frame(
Metric = c("Mean Absolute Error (MAE)", "Mean Absolute Percentage Error (MAPE)"),
Value = c(mean_abs_error, mean_ape)
)
error_metrics %>%
kable(
col.names = c("Error Metric", "Value"),
caption = "Measuring model accuracy: MAE and MAPE",
digits = 3
) %>%
kable_styling(
bootstrap_options = c("striped", "hover", "condensed", "responsive"),
full_width = FALSE,
font_size = 12
)
Error Metric | Value |
---|---|
Mean Absolute Error (MAE) | 0.242 |
Mean Absolute Percentage Error (MAPE) | 0.049 |
As the dependent variable (the Airbnb daily rate) is log-transformed, the more appropriate measure for determining accuracy is the MAPE, since the range of Airbnb daily rates can vary quite widely. Here, the MAPE of 4.9% (computed on log-transformed prices) is rather decent, showing that the model is generally accurate.
Leaving out the 110 coefficient rows for the neighborhood fixed effects, the output of the final model is as follows:
reg6_formula <- log_price ~ pub_hol + pub_hol_week + day + month + property_type +
bed_type + accommodates + bathrooms + bedrooms + beds +
neighborhood110 + entire_place + superhost +
log_attr_nn3 + log_bathrooms + log_beds + log_rest_nn15 + log_tran_nn3 +
accommodates:entire_place +
accommodates:property_type
# Note that the data here is calendar
reg6 <- lm(reg6_formula, data = calendar)
reg6_summary <- summary(reg6)
reg6_summary$coefficients <- reg6_summary$coefficients[!grepl("neighborhood110", rownames(reg6_summary$coefficients)), ]
coefficients_df <- as.data.frame(reg6_summary$coefficients)
coefficients_df <- coefficients_df[!grepl("neighborhood110", rownames(coefficients_df)), ]
colnames(coefficients_df)[4] <- "p-value"
coefficients_df[3] <- NULL
kable(coefficients_df, caption = "Final model output") %>%
kable_styling(bootstrap_options = c("hover", "striped", "condensed"))
The regression output shows that all predictors are statistically significant at the 5% level, with the exception of the property_type “Villa” and the interaction between the property_type “Loft” and occupancy.
In this final modelling section, a fishnet grid is produced over the Airbnb listings, with each cell measuring 500 meters by 500 meters, offering a good granularity for bespoke short-term rental policy evaluation and interventions. For each cell, the mean percentage error of the model predictions for its listings is aggregated, producing the Index value.
The curve below shows the cumulative distribution function of the aggregated errors across cells, and reflects a decent spread of values to be represented in the Fairbnb Index.
fishnet <-
st_make_grid(sf.neighborhd110,
cellsize = 500, # 1640.42 feet / 500 meters
square = TRUE) %>%
.[sf.neighborhd110] %>%
st_sf() %>%
mutate(uniqueID = 1:n())
fairbnb_net <-
dplyr::select(listings_details_sf) %>%
mutate(countBnb = 1) %>%
aggregate(., fishnet, sum) %>%
mutate(countBnb = replace_na(countBnb, 0))
fairbnb_net <- fairbnb_net %>%
mutate(sufficient_data = if_else(countBnb >= 10, TRUE, FALSE))
fairbnb_net <- fairbnb_net %>%
mutate(Cell_ID = row_number())
calendar <- calendar %>%
mutate(
log_pricePredict = predict(reg6, calendar),
log_price.Error = log_price - log_pricePredict,
log_price.AbsError = abs(log_price - log_pricePredict),
log_price.PE = (log_price - log_pricePredict) / log_price
)
calendar_sf <- calendar %>%
left_join(
select(listings_details_sf, listing_id, geometry),
by = "listing_id"
)
calendar_sf <- st_as_sf(calendar_sf, sf_column_name = "geometry.y", crs = st_crs(listings_details_sf))
aggregated_data <- calendar_sf %>%
st_join(fairbnb_net, join = st_within) %>%
group_by(Cell_ID) %>%
summarise(
mean_log_price_error = mean(log_price.Error, na.rm = TRUE),
mean_log_price_PE = mean(log_price.PE, na.rm = TRUE),
mean_price = mean(price, na.rm = TRUE)
)
fairbnb_net_with_errors <- fairbnb_net %>%
left_join(aggregated_data %>% st_drop_geometry(), by = "Cell_ID")
fairbnb_net_with_errors <- fairbnb_net_with_errors %>%
filter(sufficient_data == TRUE)
range(fairbnb_net_with_errors$mean_log_price_error, na.rm = TRUE)
range(fairbnb_net_with_errors$mean_log_price_PE, na.rm = TRUE)
# ggplot() +
# geom_sf(data = fairbnb_net, fill = "#ded9ca", color = "white", size = 5) +
# theme_minimal(base_family = "Ubuntu") +
# labs(
# title = 'Base fishnet',
# subtitle = "500m x 500m grids"
# ) +
# theme(
# plot.title = element_text(size = 14, family = "Ubuntu"),
# axis.title = element_blank(),
# axis.text = element_blank(),
# panel.grid = element_blank()
# )
ggplot(fairbnb_net_with_errors, aes(x = mean_log_price_PE)) +
stat_ecdf(geom = "step", color = "#006e70") +
labs(
title = "Cumulative density function of model percentage errors",
subtitle = "Final model outputs aggregated into fishnet cells",
x = "Mean Log Price Error",
y = "Cumulative Density"
) +
theme_minimal() +
theme(
text = element_text(family = "ubuntu"),
plot.title = element_text(family = "ubuntu"),
plot.subtitle = element_text(family = "ubuntu"),
axis.title = element_text(family = "ubuntu"),
axis.text = element_text(family = "ubuntu")
)
Next, the Fairbnb Index is calculated by aggregating the percentage errors in each fishnet cell, and is visualized in the map below.
fairbnb_net_with_errors <- fairbnb_net_with_errors %>%
mutate(Fairbnb_Index = round(100 * mean_log_price_PE, 5))
fairbnb_net_with_errors <- fairbnb_net_with_errors %>%
mutate(Fairbnb_Index_Category = case_when(
Fairbnb_Index < -4 ~ "#c7522a",
Fairbnb_Index >= -4 & Fairbnb_Index < -2 ~ "#e5c185",
Fairbnb_Index >= -2 & Fairbnb_Index < 2 ~ "#c8c7d188",
Fairbnb_Index >= 2 & Fairbnb_Index <= 4 ~ "#74a892",
Fairbnb_Index > 4 ~ "#008585",
TRUE ~ "#FFFFFF" # Default color if not in any range
))
ggplot(fairbnb_net_with_errors) +
geom_sf(data = sf.neighborhd110, fill = NA, color = "#c8c7d188", size = 5) +
geom_sf(aes(fill = Fairbnb_Index_Category)) +
scale_fill_identity() + # Use the colors directly without modification
theme_minimal() +
labs(
title = "Fairbnb Index across Amsterdam",
subtitle = "<span style='color:black; font-weight:bold;'>Color guide:<br>
<span style='color:#c7522a;'>Dark red:</span> Below -4<br>
<span style='color:#e5c185;'>Orange:</span> -4 to -2<br>
<span style='color:black;'>Gray:</span> -2 to 2<br>
<span style='color:#74a892;'>Light green:</span> 2 to 4<br>
<span style='color:#008585;'>Dark green:</span> Above 4"
) +
theme(
text = element_text(family = "ubuntu"),
plot.title = element_text(size = 16, family = "ubuntu"),
plot.subtitle = element_markdown(size = 8),
axis.title = element_blank(),
axis.text = element_blank(),
panel.grid = element_blank()
)
An interactive version of the map, embedded as a Leaflet widget, can be previewed below.
fairbnb_net_with_errors_4326 <- st_transform(fairbnb_net_with_errors, crs = 4326)
leaflet(fairbnb_net_with_errors_4326) %>%
addProviderTiles(providers$CartoDB.Positron) %>% # Base map
addPolygons(
fillColor = ~Fairbnb_Index_Category, # Color based on the Fairbnb Index
fillOpacity = 0.7,
color = "#ffffff", # Border color for polygons
weight = 1, # Border weight
popup = ~paste("Fairbnb Index value: ", Fairbnb_Index), # Popup with Fairbnb Index value
label = ~paste("Fairbnb Index value: ", Fairbnb_Index), # Tooltip with Fairbnb Index value
labelOptions = labelOptions(
direction = "auto", # Tooltip direction
textsize = "15px" # Tooltip font size
)
) %>%
addLegend(
position = "topright",
colors = c("#c7522a", "#e5c185", "#c8c7d188", "#74a892", "#008585"),
labels = c("Below -4", "-4 to -2", "-2 to 2", "2 to 4", "Above 4"),
title = "Fairbnb Index Categories"
)
# Attach neighborhood (Wijk) names to each cell via its centroid
fairbnb_net_with_errors_centroids <- st_centroid(fairbnb_net_with_errors)
fairbnb_net_with_errors_joined <- st_join(fairbnb_net_with_errors_centroids, sf.neighborhd110[, c("Wijk")])
fairbnb_detailed <- fairbnb_net_with_errors %>%
  left_join(fairbnb_net_with_errors_joined %>% st_drop_geometry() %>% select(Cell_ID, Wijk), by = c("Cell_ID" = "Cell_ID")) %>%
  st_as_sf()

# Reproject to WGS84 and swap the semi-transparent neutral color for an opaque one
fairbnb_detailed_4326 <- st_transform(fairbnb_detailed, crs = 4326)
fairbnb_detailed_4326 <- fairbnb_detailed_4326 %>%
  mutate(Fairbnb_Index_Category = ifelse(Fairbnb_Index_Category == "#c8c7d188", "#b6c1d5", Fairbnb_Index_Category))

# Replace NA values in the Wijk column
fairbnb_detailed_4326$Wijk[is.na(fairbnb_detailed_4326$Wijk)] <- "Partially outside City administrative boundary"

# Export the enriched fishnet as GeoJSON for the web application
st_write(fairbnb_detailed_4326, "fairbnb_detailed.geojson")
Meanwhile, a web application prototype has been developed using JavaScript, and can be viewed at the following link:
https://tinyurl.com/musa-5080-final
In summary, the Fairbnb modelling exercise and web application give policymakers in Amsterdam a useful starting point for examining the nuanced effects of short-term rental properties on neighborhoods. In particular, they uncover groups of blocks with an apparent chronic over-supply of short-term rental properties: dwelling spaces that could be put to better use if returned to residential occupation. By tracking the Fairbnb Index (and how it changes over time as new data arrive), policymakers can pinpoint specific areas below the neighborhood level for market interventions, such as setting price floors for permits or restricting the local supply of permits. Such steps help rebalance supply and demand for short-term rentals and stem the loss of residential housing stock in areas where, as the exploratory analysis has shown, market participants tend to behave speculatively.
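As a purely illustrative sketch (not a policy recommendation), cells could be tagged with an intervention tier using the same thresholds as the map categories above; the intervention_tier field and its labels are hypothetical placeholders introduced here for exposition.
# Illustrative only: tag cells with a hypothetical intervention tier,
# reusing the thresholds from the map categories above
fairbnb_net_with_errors <- fairbnb_net_with_errors %>%
  mutate(intervention_tier = case_when(
    Fairbnb_Index < -4                       ~ "Review local permit supply (prices well below baseline)",
    Fairbnb_Index >= -4 & Fairbnb_Index < -2 ~ "Monitor for emerging over-supply",
    Fairbnb_Index >= -2 & Fairbnb_Index < 2  ~ "No action indicated",
    TRUE                                     ~ "Prices above baseline: monitor affordability pressure"
  ))

# Quick count of cells per tier
table(fairbnb_net_with_errors$intervention_tier)
The thresholds and labels would of course need to be set through the deliberative process described below, not hard-coded by analysts.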
A key advantage of the modelling process is its replicability, which supports the long-term sustainability of the Fairbnb platform, particularly given upcoming EU legislation requiring companies like Airbnb to share more data with local and national governments. While the current analysis covers a one-year time frame, the price model can easily be applied to more recent listing data using the same code base contained in this R Markdown file.
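As a minimal sketch of that re-application, assuming a refreshed calendar extract with the same columns used to fit the model, and reusing the fitted reg6 object and the fairbnb_net grid from above (the helper name update_fairbnb_index and the new_calendar input are hypothetical):
# Hypothetical helper: re-score a newer calendar extract and rebuild the Index.
# Assumes new_calendar has the predictor columns used to fit reg6, a log_price
# column, and a listing_id key that matches listings_details_sf.
update_fairbnb_index <- function(new_calendar, model = reg6, grid = fairbnb_net) {
  new_calendar %>%
    mutate(log_pricePredict = predict(model, new_calendar),
           log_price.PE = (log_price - log_pricePredict) / log_price) %>%
    left_join(select(listings_details_sf, listing_id, geometry),
              by = "listing_id") %>%
    st_as_sf() %>%
    st_join(grid, join = st_within) %>%
    st_drop_geometry() %>%
    group_by(Cell_ID) %>%
    summarise(Fairbnb_Index = round(100 * mean(log_price.PE, na.rm = TRUE), 5))
}

# Usage (assuming a freshly downloaded extract named calendar_2025):
# fairbnb_index_2025 <- update_fairbnb_index(calendar_2025)
In practice, the countBnb and sufficient_data screens would also be re-checked before reporting updated Index values for cells whose listing counts have changed.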
There are a few areas where the present analysis could be augmented. For instance, text-mining techniques (including those offered by large language model services such as the OpenAI API) could be used in the feature engineering stage to extract property-level information not captured in the current analysis, such as views from the apartment or special interior features that may fetch a price premium over otherwise similar properties.
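As a rough illustration (not part of the current pipeline), the sketch below derives two binary amenity flags from the free-text listing descriptions using simple keyword matching with stringr; the description column name and the keyword lists are assumptions, and an LLM-based extraction would follow the same pattern of converting free text into structured predictors for the hedonic model.
library(stringr) # for str_to_lower() and str_detect()

# Hypothetical feature engineering step: flag listings whose description
# mentions a water/canal view or a balcony/terrace. Assumes listings_details_sf
# has a character column named `description`.
listings_details_sf <- listings_details_sf %>%
  mutate(
    desc_lower  = str_to_lower(replace_na(description, "")),
    has_view    = str_detect(desc_lower, "canal view|water view|harbour view"),
    has_balcony = str_detect(desc_lower, "balcony|terrace|roof terrace")
  ) %>%
  select(-desc_lower)
The resulting has_view and has_balcony flags could then be added to the model formula alongside the existing property-level predictors.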
As for the modelling phase, the current linear regression approach could be extended with more advanced supervised machine learning algorithms such as decision trees, which may capture the relationships between the predictor variables more robustly. The code block below provides one such example using the open-source machine learning library XGBoost, which fits an ensemble of regression trees to predict Airbnb daily rates. Initial results already show a lower mean absolute error and mean absolute percentage error than our best linear model, demonstrating one strength of these more advanced approaches.
library(xgboost) # gradient-boosted tree models

# Model specification using the engineered predictors from the hedonic analysis
formula <- log_price ~ pub_hol + pub_hol_week + day + month + property_type +
  bed_type + accommodates + bathrooms + bedrooms + beds +
  neighborhood110 + entire_place + superhost +
  log_attr_nn3 + log_bathrooms + log_beds + log_rest_nn15 + log_tran_nn3 +
  accommodates:entire_place + accommodates:property_type

# Create model matrix for features (one-hot encodes factors and interactions)
model_matrix <- model.matrix(formula, data = train_data)
# Extract target variable
target <- train_data$log_price
# Convert to DMatrix (efficient storage for xgboost)
dtrain <- xgb.DMatrix(data = model_matrix, label = target)
# Parameters for xgboost
params <- list(
objective = "reg:squarederror", # Regression task
eval_metric = "rmse", # Root Mean Squared Error
max_depth = 6, # Depth of the tree
eta = 0.3, # Learning rate
subsample = 0.8, # Fraction of data used per iteration
colsample_bytree = 0.8 # Fraction of features used per tree
)
# Train the model
set.seed(123) # For reproducibility
model <- xgb.train(
params = params,
data = dtrain,
nrounds = 100, # Number of boosting rounds
watchlist = list(train = dtrain),
verbose = 1
)
# Build the test design matrix; this assumes test_data shares the training
# factor levels so that its columns line up with the training matrix
test_matrix <- model.matrix(formula, data = test_data)
# Convert test_matrix to DMatrix
dtest <- xgb.DMatrix(data = test_matrix)
# Predict using the trained model
test_predictions <- predict(model, dtest)
# Output predictions
head(test_predictions)
if ("log_price" %in% colnames(test_data)) {
actual <- test_data$log_price
# Root Mean Squared Error (RMSE)
rmse_test <- sqrt(mean((actual - test_predictions)^2))
# Mean Absolute Error (MAE)
mae_test <- mean(abs(actual - test_predictions))
# Mean Absolute Percentage Error (MAPE)
mape_test <- mean(abs((actual - test_predictions) / actual)) * 100
print(paste("Test RMSE:", round(rmse_test, 4)))
print(paste("Test MAE:", round(mae_test, 4)))
print(paste("Test MAPE (%):", round(mape_test, 2)))
}
Regardless of the final modelling approach chosen, it should be up to Amsterdam's citizens to negotiate the final set of regulations governing short-term rental properties in their area: the people priced out by investment properties in their own neighborhoods, the residents who cross paths with short-term visitors in their apartment blocks, the activists concerned about the erosion of their districts' character, and the property owners who profit from the speculative letting of their portfolios. Indeed, Fairbnb is not intended to be wielded as a blunt technocratic tool for intervening in housing and short-term rental markets, but rather as an open platform that empowers policymakers to make informed decisions and enables citizens to scrutinize and influence that decision-making process. After all, it is homes, livelihoods and communities that are at stake.
References
Ioannides, D., Röslmaier, M., & Van Der Zee, E. (2019). Airbnb as an instigator of ‘tourism bubble’ expansion in Utrecht’s Lombok neighbourhood. Tourism Geographies, 21(5), 822-840.
Koolhoven, R. (2018). Regulating Airbnb in the Netherlands. Journal of European Consumer and Market Law, 7, 253.
Morpey, U. (2022). The effect of Airbnb activity on residential real estate values and livability: The case for the Netherlands (Doctoral dissertation).
Image credits:
Cover image — Adrien Olichon via Unsplash
https://unsplash.com/photos/view-of-apartment-complex-buildings-082xic2aif8
Fairbnb logo — Own work