• 1. Introduction
  • 2. Data Wrangling
  • 3. Exploratory Analysis
  • 4. Modeling
  • 5. Accuracy and Generalizability
    • 5.1. Examine Error Metrics for Accuracy
    • 5.2 K-fold Cross Validation
    • 5.3. Space-Time Error Evaluation
  • 6. Discussion

1. Introduction

Bike share programs have become an integral part of urban transportation systems, offering a convenient, sustainable, and cost-effective alternative to private vehicles and public transit. Philadelphia’s Indego bike share network, for instance, serves as a critical mobility option for residents and visitors alike, enabling them to navigate the city while reducing congestion and emissions. However, a common operational challenge for bike share programs is the need to maintain an optimal balance of bikes across stations to meet variable demand patterns. Due to daily commuting habits and various socio-economic factors, some parts of a city may experience an abundance of bikes, while others struggle with shortages. For example, residential neighborhoods may see a surge in bike numbers in the morning as commuters head to work, only to be left with empty stations by evening. On the other hand, business districts may have too many bikes during the day, requiring redistribution for the evening commute. Without efficient re-balancing, the bike share system’s effectiveness is compromised, leading to reduced customer satisfaction and a drop in usage. Without effective re-balancing strategies, stations may experience shortages or surpluses, leading to user frustration and decreased system efficiency.

To address these challenges, we propose a proactive re-balancing plan for Indego, the bike share system in Philadelphia, that combines predictive analytics with operational logistics. Launched in 2015, Indego has 1,400 bikes distributed across 130 stations, mostly concentrated in Center City, West Philadelphia, and South Philadelphia1. Our approach involves managing a fleet of re-balancing trucks to redistribute bikes across the network based on demand forecasts. Additionally, we plan to implement user-focused incentives, such as offering ride discounts for returning bikes to underutilized stations or peak-demand locations. This dual approach enhances operational efficiency while fostering rider participation in maintaining system equilibrium.

Our model take one month (May) as the study period, and can provide estimated bike share traffic for each station, every hour of a day throughout the month, to enable timely and actionable interventions. While it remains to be seen how the model performs in other months of a year. The choice of time lag features such as the previous hour’s trip counts, weather conditions, amenity features such as the distance to the nearest subway station, and historical demand patterns at each station, aligns with this operational horizon. By leveraging these insights, our re-balancing plan ensures a dynamic, data-driven response to changing demand, maximizing both user satisfaction and system performance.

2. Data Wrangling

For our analysis, we used four datasets to build the variables: 1)Indego station data in May, 2022; 2)weather data collected at Philadelphia International Airport in May, 2022, including percipitation, temperature and wind speed; 3)5-year ACS Census Data from 2018 to 2022; 4)Septa Data, the subway system in Philadelphia

2.1. Setup

library(tidyverse)
library(sf)
library(lubridate)
library(tigris)
library(tidycensus)
library(viridis)
library(riem)
library(gridExtra)
library(knitr)
library(kableExtra)
library(RSocrata)
library(gganimate)
library(caret)
library(RColorBrewer)
library(gifski)

Sys.setlocale("LC_TIME", "English")

plotTheme <- theme(
  plot.title =element_text(size=12),
  plot.subtitle = element_text(size=8),
  plot.caption = element_text(size = 6),
  axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
  axis.text.y = element_text(size = 10),
  axis.title.y = element_text(size = 10),
  # Set the entire chart region to blank
  panel.background=element_blank(),
  plot.background=element_blank(),
  # Format the grid
  panel.grid.major=element_line(colour="#D0D0D0",size=.2),
  axis.ticks=element_blank())

mapTheme <- theme(plot.title =element_text(size=12),
                  plot.subtitle = element_text(size=8),
                  plot.caption = element_text(size = 6),
                  axis.line=element_blank(),
                  axis.text.x=element_blank(),
                  axis.text.y=element_blank(),
                  axis.ticks=element_blank(),
                  axis.title.x=element_blank(),
                  axis.title.y=element_blank(),
                  panel.background=element_blank(),
                  panel.border=element_blank(),
                  panel.grid.major=element_line(colour = 'transparent'),
                  panel.grid.minor=element_blank(),
                  legend.direction = "vertical", 
                  legend.position = "right",
                  plot.margin = margin(1, 1, 1, 1, 'cm'),
                  legend.key.height = unit(1, "cm"), legend.key.width = unit(0.2, "cm"))

palette7 <- c("#eff3ff","#bdd7e7","#6baed6","#3182bd","#08519c","darkblue","#123F5A")
palette4 <- c("#D2FBD4","#92BCAB","#527D82","#123F5A")
palette2 <- c("#6baed6","#08519c")

2.2. Import Census Data

Import basic census data in Philadelphia in 2022, including total population, median household income, median age, racial demographics, travel time, and transportation data related to commuting and public transportation usage. Additionally, new calculated columns are created, such as the percentage of the population that is White, the mean commute time, and the percentage of commuters who use public transportation.

PHLCensus <- 
  get_acs(geography = "tract", 
          variables = c("B01003_001", "B19013_001", 
                        "B02001_002", "B08013_001",
                        "B08012_001", "B08301_001", 
                        "B08301_010", "B01002_001"), 
          year = 2022, 
          state = "PA", 
          geometry = TRUE, 
          county=c("Philadelphia"),
          output = "wide") %>%
  rename(Total_Pop =  B01003_001E,
         Med_Inc = B19013_001E,
         Med_Age = B01002_001E,
         White_Pop = B02001_002E,
         Travel_Time = B08013_001E,
         Num_Commuters = B08012_001E,
         Means_of_Transport = B08301_001E,
         Total_Public_Trans = B08301_010E) %>%
  select(Total_Pop, Med_Inc, White_Pop, Travel_Time,
         Means_of_Transport, Total_Public_Trans,
         Med_Age,
         GEOID, geometry) %>%
  mutate(Percent_White = White_Pop / Total_Pop,
         Mean_Commute_Time = Travel_Time / Total_Public_Trans,
         Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport)
PHLTracts <- 
  PHLCensus %>%
  as.data.frame() %>%
  distinct(GEOID, .keep_all = TRUE) %>%
  select(GEOID, geometry) %>% 
  st_sf

2.3. Import Indego Data

Import Philadelphia’s bike-sharing data (Indego), which provides information about bike trips made by users throughout the city. The dataset includes trip start and end times, start and end station locations, user types, and trip durations.

dat<-read.csv("indego-trips-2022-q2.csv")

dat$start_time <- mdy_hm(dat$start_time)
dat$end_time <- mdy_hm(dat$end_time)
# head(dat$start_time)

# only keep records of May
dat_may <- dat %>%
  filter(month(start_time) == 5)
dat2 <- dat_may %>%
  mutate(interval60 = floor_date(ymd_hms(start_time), unit = "hour"),
         interval15 = floor_date(ymd_hms(start_time), unit = "15 mins"),
         week = week(interval60),
         dotw = wday(interval60, label=TRUE))
dat_census <- st_join(
  dat2 %>% 
          filter(is.na(start_lon) == FALSE &
                   is.na(start_lat) == FALSE &
                   is.na(end_lat) == FALSE &
                   is.na(end_lon) == FALSE) %>%
          st_as_sf(., coords = c("start_lon", "start_lat"), crs = 4326),
  PHLTracts %>%
          st_transform(crs=4326),
          join = st_intersects, left = TRUE) %>%
  rename(Origin.Tract = GEOID) %>%
  mutate(start_lon = unlist(map(geometry, 1)),
         start_lat = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  select(-geometry)%>%
  st_as_sf(., coords = c("end_lon", "end_lat"), crs = 4326) %>%
  st_join(., PHLTracts %>%
            st_transform(crs=4326),
          join=st_intersects,
          left = TRUE) %>%
  rename(Destination.Tract = GEOID)  %>%
  mutate(end_lon = unlist(map(geometry, 1)),
         end_lat = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  select(-geometry)%>%
  na.omit()
dat_sf <- st_as_sf(dat%>%na.omit(), coords = c("start_lon", "start_lat"), crs = 4326)
ggplot() +
  geom_sf(data = PHLTracts) +
  geom_sf(data = dat_sf %>%
            group_by(start_station) %>%
            summarize(count = n()),
          color = "#08519c",
          alpha = 0.4) +
  labs(title = "Distribution of Indego Stations in Philadelphia",
       subtitle = "May 2022")+
  mapTheme

2.4. Import Weather Data

Import weather data in Philadelphia, including perception, temperature and wind speed.

weather.Panel <- 
  riem_measures(station = "PHL", date_start = "2022-05-01", date_end = "2022-05-31") %>%
  dplyr::select(valid, tmpf, p01i, sknt)%>%
  replace(is.na(.), 0) %>%
    mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
    mutate(week = week(interval60),
           dotw = wday(interval60, label=TRUE)) %>%
    group_by(interval60) %>%
    summarize(Temperature = max(tmpf),
              Precipitation = sum(p01i),
              Wind_Speed = max(sknt)) %>%
    mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))
grid.arrange(
  ggplot(weather.Panel, aes(interval60,Precipitation)) + geom_line() + 
  labs(title="Percipitation", x="Hour", y="Perecipitation") + plotTheme,
  ggplot(weather.Panel, aes(interval60,Wind_Speed)) + geom_line() + 
    labs(title="Wind Speed", x="Hour", y="Wind Speed") + plotTheme,
  ggplot(weather.Panel, aes(interval60,Temperature)) + geom_line() + 
    labs(title="Temperature", x="Hour", y="Temperature") + plotTheme,
  top="Weather Data - PHL PHL - May, 2022")

2.5 Import Septa Data

Import subway station points in Philadelphia.

el <- st_read("https://opendata.arcgis.com/datasets/8c6e2575c8ad46eb887e6bb35825e1a6_0.geojson")
Broad_St <- st_read("https://opendata.arcgis.com/datasets/2e9037fd5bef406488ffe5bb67d21312_0.geojson")
septaStops <- 
  rbind(
     el %>% 
      mutate(Line = "El") %>%
      dplyr::select(Station, Line),
     Broad_St %>%
      mutate(Line ="Broad_St") %>%
      dplyr::select(Station, Line)) %>%
  st_transform(crs=4326)  

ggplot()+
  geom_sf(data = PHLTracts)+
  geom_sf(data = septaStops,fill = "#08519c", color = "#08519c", alpha = 0.7)+
  labs(title = "Subway Stations in Philadelphia")+
  theme_minimal()

3. Exploratory Analysis

3.1. Bike share trips per hour

Exploring bike share trips per hour, we observed a cyclical pattern in trip numbers throughout the day, with certain hours showing clear peak demand. Additionally, trips on weekends were significantly lower compared to weekdays. However, during the weekend around May 28th (the Memorial Day), there was no noticeable dip in trips, reflecting the impact of holidays on people’s travel behavior.

ggplot(dat_census %>%
         group_by(interval60) %>%
         tally())+
  geom_line(aes(x = interval60, y = n))+
  labs(title="Bike share trips per hr. PHL, May, 2022",
       x="Date", 
       y="Number of trips")+
  plotTheme

Breaking these trends down, we categorized the 24 hours into four time periods: overnight, AM Rush, Mid-Day, and PM Rush, and explored the distribution of the number of trips per hour within these periods. We found that trip counts in all four time slots were mostly concentrated around 1 and 2, with Mid-Day and PM Rush periods tending to have more counts of 2.

dat_census %>%
        mutate(time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
         group_by(interval60, start_station, time_of_day) %>%
         tally()%>%
  group_by(start_station, time_of_day)%>%
  summarize(mean_trips = mean(n))%>%
  ggplot()+
  geom_histogram(aes(mean_trips), binwidth = 1)+
  labs(title="Mean Number of Hourly Trips Per Station. PHL, May, 2022",
       x="Number of trips", 
       y="Frequency")+
  facet_wrap(~time_of_day)+
  plotTheme

ggplot(dat_census %>%
         group_by(interval60, start_station) %>%
         tally())+
  geom_histogram(aes(n), binwidth = 1)+
  labs(title="Bike share trips per hr by station. PHL, May, 2022",
       x="Trip Counts", 
       y="Number of Stations")+
  plotTheme

Exploring bike share trips by day, we can see that the trip counts in Friday and weekends are clearly lower.

ggplot(dat_census %>% mutate(hour = hour(start_time)))+
     geom_freqpoly(aes(hour, color = dotw), binwidth = 1)+
  labs(title="Bike share trips in PHL, by day of the week, May, 2022",
       x="Hour", 
       y="Trip Counts")+
  scale_color_manual(values = brewer.pal(n = 7, name = "Set2") ) +
     plotTheme

ggplot(dat_census %>% 
         mutate(hour = hour(start_time),
                weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday")))+
     geom_freqpoly(aes(hour, color = weekend), binwidth = 1)+
  labs(title="Bike share trips in PHL - weekend vs weekday, May, 2022",
       x="Hour", 
       y="Trip Counts")+
     plotTheme

These patterns are also evident spatially, with the highest ridership per station concentrated in the downtown and University City areas during the afternoon rush hour.

ggplot()+
  geom_sf(data = PHLTracts %>%
          st_transform(crs=4326))+
  geom_point(data = dat_census %>%
               na.omit()%>%
            mutate(hour = hour(start_time),
                weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
                time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
              group_by(start_station, start_lat, start_lon, weekend, time_of_day) %>%
              tally(),
            aes(x=start_lon, y = start_lat, color = n), 
            fill = "transparent", alpha = 0.5, size = 1)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "D")+
  ylim(min(dat_census$start_lat), max(dat_census$start_lat))+
  xlim(min(dat_census$start_lon), max(dat_census$start_lon))+
  facet_grid(weekend ~ time_of_day)+
  labs(title="Bike share trips per hr by station. PHL, May, 2022")+
  mapTheme

3.2 Create Space-Time Panel

Here we create a study panel that has each unique time and start station observations.

study.panel <- 
  expand.grid(interval60=unique(dat_census$interval60), 
              start_station = unique(dat_census$start_station)) %>%
  left_join(., dat_census %>%
              select(start_station, Origin.Tract, start_lon, start_lat )%>%
              distinct() %>%
              group_by(start_station) %>%
              slice(1))
ride.panel <- 
  dat_census %>%
  mutate(Trip_Counter = 1) %>%
  right_join(study.panel) %>% 
  group_by(interval60, start_station, Origin.Tract, start_lon, start_lat) %>%
  summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
  left_join(weather.Panel) %>%
  ungroup() %>%
  filter(is.na(start_station) == FALSE) %>%
  mutate(week = week(interval60),
         dotw = wday(interval60, label = TRUE)) %>%
  filter(is.na(Origin.Tract) == FALSE)

# turn start station id frome numeric to categorical variable
ride.panel$start_station <- factor(ride.panel$start_station)
ride.panel <- 
  left_join(ride.panel, PHLCensus %>%
              as.data.frame() %>%
              select(-geometry), by = c("Origin.Tract" = "GEOID"))

3.3. Create time lags

We created time lags to capture the temporal dependency and better fit the historical impact. Here we created 1, 2, 3, 4 ,12 hours lag, and 1, 2, 3-day holiday lag and lead.

ride.panel <- 
  ride.panel %>% 
  arrange(start_station, interval60) %>% 
  mutate(lagHour = dplyr::lag(Trip_Count,1),
         lag2Hours = dplyr::lag(Trip_Count,2),
         lag3Hours = dplyr::lag(Trip_Count,3),
         lag4Hours = dplyr::lag(Trip_Count,4),
         lag12Hours = dplyr::lag(Trip_Count,12),
         lag1day = dplyr::lag(Trip_Count,24),
         holiday = ifelse(yday(interval60) == 148,1,0)) %>%
   mutate(day = yday(interval60)) %>%
   mutate(holidayLag = case_when(dplyr::lag(holiday, 1) == 1 ~ "PlusOneDay",
                                 dplyr::lag(holiday, 2) == 1 ~ "PlustTwoDays",
                                 dplyr::lag(holiday, 3) == 1 ~ "PlustThreeDays",
                                 dplyr::lead(holiday, 1) == 1 ~ "MinusOneDay",
                                 dplyr::lead(holiday, 2) == 1 ~ "MinusTwoDays",
                                 dplyr::lead(holiday, 3) == 1 ~ "MinusThreeDays"),
         holidayLag = ifelse(is.na(holidayLag) == TRUE, 0, holidayLag))
as.data.frame(ride.panel) %>%
    group_by(interval60) %>% 
    summarise_at(vars(starts_with("lag"), "Trip_Count"), mean, na.rm = TRUE) %>%
    gather(Variable, Value, -interval60, -Trip_Count) %>%
    mutate(Variable = factor(Variable, levels=c("lagHour","lag2Hours","lag3Hours","lag4Hours",
                                                "lag12Hours","lag1day")))%>%
    group_by(Variable) %>%  
    summarize(correlation = round(cor(Value, Trip_Count),2))

3.4 Create Amenity Features

We add an amenity feature, distance to the nearest subway station as one of the variables. Because a large portion of people using shared bikes are heading to the subways.

# Convert ride.panel to an sf object using start_lon and start_lat
ride.panel_sf <- st_as_sf(ride.panel, coords = c("start_lon", "start_lat"), crs = 4326)

# Reproject both dataframes to EPSG:2272
ride.panel_proj <- st_transform(ride.panel_sf, crs = 2272)
septaStops_proj <- st_transform(septaStops, crs = 2272)

# Calculate the nearest distance from each ride.panel point to the closest septaStop
nearest_distances <- st_nearest_feature(ride.panel_proj, septaStops_proj)  # Index of nearest points
nearest_points <- septaStops_proj[nearest_distances, ]  # Get the nearest points
distances <- st_distance(ride.panel_proj, nearest_points, by_element = TRUE)  # Calculate distances

# Add the distances back to ride.panel
ride.panel$nearest_septa_dist <- as.numeric(distances)  # Convert to numeric

# Optional: Add nearest septaStop ID or other attributes
ride.panel$nearest_septaStop_id <- nearest_points$Station  
ride.panel <- na.omit(ride.panel)

Finally, we created an animated plot and note clear spatial and temporal autocorrelation in our data: the animated map indicates that data cluster at certain times of the day in central Philadelphia.

data_5_18 <- ride.panel %>%
  filter(format(interval60, "%Y-%m-%d") == "2022-05-18")  
data_5_18$interval60 <- data_5_18$interval60 + hours(4)


p <- ggplot() +
  geom_sf(data = PHLTracts)+
  geom_point(data = data_5_18, aes(x = start_lon, y = start_lat, color = Trip_Count), alpha = 0.8, size = 2) + 
  scale_color_gradient(low = "lightblue", high = "darkblue") + 
  labs(title = "Bike Share Trips on 2022-05-18", 
       subtitle = "Hour: {format(frame_time, '%Y-%m-%d %H:00:00')}", 
       x = "Longitude", 
       y = "Latitude") +
  theme_minimal() +
  transition_time(interval60) + 
  ease_aes('linear') 

# Export the animation as a GIF
# animated_gif <- animate(p, width = 800, height = 600, fps = 10, renderer = gifski_renderer())
# anim_save("bike_share_trips.gif", animated_gif)
knitr::include_graphics("bike_share_trips.gif")

4. Modeling

4.1. Run Models

We use Week 20, 21, and 22 as train dataset, for it includes the holiday lag in the last 3 weeks. And take the 18th and 19th week as test set.

ride.Train <- filter(ride.panel, week >= 20)
ride.Test <- filter(ride.panel, week < 20)

For our analysis, we used OLS regressions. Although OLS may not be the best method due to the spatial autocorrelation discussed earlier, alternative approaches like spatial lag or geographically weighted regression could enhance performance and generalizability. Nonetheless, for the objectives of this task, OLS proves adequate.

In our analysis, seven regression models are built:

reg1: only time fixed effect

reg2: only space fixed effect

reg3: time - space fixed effect

reg4: reg3 with hour lag

reg5: reg4 with holiday lag

reg6: reg5 with amenity feature

reg7: reg6 with some social indicators (population,median household income and percent of people taking public transprtation)

reg1 <- 
  lm(Trip_Count ~  hour(interval60) + dotw + Temperature,  data=ride.Train)

reg2 <- 
  lm(Trip_Count ~  start_station + dotw + Temperature,  data=ride.Train)

reg3 <- 
  lm(Trip_Count ~  start_station + hour(interval60) + dotw + Temperature + Precipitation, 
     data=ride.Train)

reg4 <- 
  lm(Trip_Count ~  start_station +  hour(interval60) + dotw + Temperature + Precipitation +
                   lagHour + lag2Hours +lag3Hours + lag12Hours + lag1day, 
     data=ride.Train)

reg5 <- 
  lm(Trip_Count ~  start_station + hour(interval60) + dotw + Temperature + Precipitation +
                   lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day + holidayLag + holiday, 
     data=ride.Train)

reg6 <- 
  lm(Trip_Count ~  start_station + hour(interval60) + dotw + Temperature + Precipitation +
                   lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day + holidayLag + holiday + nearest_septa_dist, 
     data=ride.Train)

reg7 <- 
  lm(Trip_Count ~  start_station + hour(interval60) + dotw + Temperature + Precipitation +
                   lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day + holidayLag + holiday + nearest_septa_dist + Total_Pop + Med_Inc + Percent_Taking_Public_Trans, 
     data=ride.Train)

4.2. Predict for test data

ride.Test.weekNest <- 
  ride.Test %>%
  nest(-week) 
model_pred <- function(dat, fit){pred <- predict(fit, newdata = dat)}
week_predictions <- 
  ride.Test.weekNest %>% 
    mutate(ATime_FE = map(.x = data, fit = reg1, .f = model_pred),
           BSpace_FE = map(.x = data, fit = reg2, .f = model_pred),
           CTime_Space_FE = map(.x = data, fit = reg3, .f = model_pred),
           DTime_Space_FE_timeLags = map(.x = data, fit = reg4, .f = model_pred),
           ETime_Space_FE_timeLags_holidayLags = map(.x = data, fit = reg5, .f = model_pred),
           FTime_Space_FE_timeLags_holidayLags_amenity = map(.x = data, fit = reg6, .f = model_pred),
           GTime_Space_FE_timeLags_holidayLags_amenity_society = map(.x = data, fit = reg7, .f = model_pred)
           ) %>% 
    gather(Regression, Prediction, -data, -week) %>%
    mutate(Observed = map(data, pull, Trip_Count),
           Absolute_Error = map2(Observed, Prediction,  ~ abs(.x - .y)),
           MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
           sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))

week_predictions

5. Accuracy and Generalizability

5.1. Examine Error Metrics for Accuracy

Among the seven models we ran, we found that those incorporating both temporal and spatial features, along with various time-lagged variables, performed the best. Interestingly, spatial features appeared to be more influential than temporal ones; including spatial lags in the model led to significant improvements. The addition of holiday lags, amenity features, and social economic variables brought almost no improvement. These features might already be implicitly captured by the pure spatial and temporal characteristics in the dataset. For example, the variable representing income, with its strong spatial autocorrelation, is likely already embedded in the location data; and the distance to the nearest subway station seems to be included in the spatial fixed effect.

week_predictions %>%
  dplyr::select(week, Regression, MAE) %>%
  gather(Variable, MAE, -Regression, -week) %>%
  ggplot(aes(week, MAE)) + 
    geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
    scale_fill_manual(values = palette7) +
    labs(title = "Mean Absolute Errors by model specification and week") +
  plotTheme

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station = map(data, pull, start_station)) %>%
    dplyr::select(interval60, start_station, Observed, Prediction, Regression) %>%
    unnest() %>%
    gather(Variable, Value, -Regression, -interval60, -start_station) %>%
    group_by(Regression, Variable, interval60) %>%
    summarize(Value = sum(Value)) %>%
    ggplot(aes(interval60, Value, colour=Variable)) + 
      geom_line(size = 1.1) + 
      facet_wrap(~Regression, ncol=1) +
      labs(title = "Predicted/Observed bike share time series", subtitle = "PHL; A test set of 2 weeks",  x = "Hour", y= "Station Trips") +
      plotTheme

5.2 K-fold Cross Validation

Moving forward, we stick with reg6, which seems to have the best goodness of fit generally. We used k-fold cross validation to test our model’s accuracy and generalizability. The RMSE is 1.04 and MAE is 0.67, which indicate a moderate level of performance.

folds <- 5

control <- trainControl(method="cv", number=folds)


set.seed(123)

model_cv <- train(Trip_Count ~ start_station + hour(interval60) + dotw + Temperature + Precipitation +
                   lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day + holidayLag + holiday + nearest_septa_dist, 
                  data=ride.panel,
                  method="lm",
                  trControl=control)

cv_df <- data.frame(
  model = "FTime_Space_FE_timeLags_holidayLags_amenity",
  folds = folds,
  rmse = model_cv$results$RMSE,
  mae = model_cv$results$MAE
)
cv_df %>%
    kbl(caption = "Cross-Validation Results") %>%
    kable_paper(bootstrap_options = "striped", full_width = F)
Cross-Validation Results
model folds rmse mae
FTime_Space_FE_timeLags_holidayLags_amenity 5 1.043422 0.6721477

Exploring the mean absolute errors (MAE) by station, we find that there seems to be a spatial pattern to our model.The high MAE values are concemtrated in the downtown area.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station = map(data, pull, start_station), 
           start_lat = map(data, pull, start_lat), 
           start_lon = map(data, pull, start_lon)) %>%
    select(interval60, start_station, start_lon, start_lat, Observed, Prediction, Regression) %>%
    unnest() %>%
  filter(Regression == "FTime_Space_FE_timeLags_holidayLags_amenity") %>%
  group_by(start_station, start_lon, start_lat) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
  geom_sf(data = PHLCensus, color = "grey", fill = "transparent")+
  geom_point(aes(x = start_lon, y = start_lat, color = MAE), 
             fill = "transparent", alpha = 0.5)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "D")+
  ylim(min(dat_census$start_lat), max(dat_census$start_lat))+
  xlim(min(dat_census$start_lon), max(dat_census$start_lon))+
  labs(title="Mean Abs Error, Test Set, Model 6")+
  mapTheme

5.3. Space-Time Error Evaluation

Except for spatial error distribution, we should also consider temporal erros. We plot observed and predicted for different times of day during the week and weekend, and some patterns begin to emerge.

  1. Underprediction: In most of the plots, the predicted values are consistently lower than the observed ones, particularly as the number of trips increases. This is shown by data points lying above the 45-degree line (representing perfect prediction). This suggests that the model has more difficulty with higher-demand stations.

  2. Time Effects: The underprediction seems more pronounced during peak times like the AM and PM rush hours compared to midday or overnight. This indicates the model may not fully capture the patterns of travel during these peak periods.

  3. Outliers: Some data points, especially in the higher trip ranges, are far from the main cluster. These outliers highlight cases where the model significantly underperforms in predicting the correct number of trips.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station = map(data, pull, start_station), 
           start_lat = map(data, pull, start_lat), 
           start_lon = map(data, pull, start_lon),
           dotw = map(data, pull, dotw)) %>%
    select(interval60, start_station, start_lon, 
           start_lat, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "FTime_Space_FE_timeLags_holidayLags_amenity")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
  ggplot()+
  geom_point(aes(x= Observed, y = Prediction))+
    geom_smooth(aes(x= Observed, y= Prediction), method = "lm", se = FALSE, color = "red")+
    geom_abline(slope = 1, intercept = 0)+
  facet_grid(time_of_day~weekend)+
  labs(title="Observed vs Predicted",
       x="Observed trips", 
       y="Predicted trips")+
  plotTheme

We found that the MAE is highest during the PM rush, followed by the AM rush and midday, with the smallest MAE occurring during the overnight period. This reflects the model’s challenges in accurately predicting demand during peak hours. These periods, along with the areas in downtown where the model’s errors are most pronounced, correspond to more developed regions with higher demand. This suggests that the model’s performance, particularly in high-demand, spatially concentrated areas, still requires significant improvement.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station = map(data, pull, start_station), 
           start_lat = map(data, pull, start_lat), 
           start_lon = map(data, pull, start_lon),
           dotw = map(data, pull, dotw) ) %>%
    select(interval60, start_station, start_lon, 
           start_lat, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "FTime_Space_FE_timeLags_holidayLags_amenity")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  group_by(start_station, weekend, time_of_day, start_lon, start_lat) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  ggplot(.)+
  geom_sf(data = PHLCensus, color = "grey", fill = "transparent")+
  geom_point(aes(x = start_lon, y = start_lat, color = MAE), 
             fill = "transparent", alpha = 0.5)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "D")+
  ylim(min(dat_census$start_lat), max(dat_census$start_lat))+
  xlim(min(dat_census$start_lon), max(dat_census$start_lon))+
  facet_grid(weekend~time_of_day)+
  labs(title="Mean Absolute Errors, Test Set")+
  mapTheme

Lastly, we find that our model does not generalize perfectly. Our model performs less accurately in wealthier, whiter neighborhoods, and more accurately in areas with higher public transportation reliance.

week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station = map(data, pull, start_station), 
           start_lat = map(data, pull, start_lat), 
           start_lon = map(data, pull, start_lon),
           dotw = map(data, pull, dotw),
           Percent_Taking_Public_Trans = map(data, pull, Percent_Taking_Public_Trans),
           Med_Inc = map(data, pull, Med_Inc),
           Percent_White = map(data, pull, Percent_White)) %>%
    select(interval60, start_station, start_lon, 
           start_lat, Observed, Prediction, Regression,
           dotw, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
    unnest() %>%
  filter(Regression == "FTime_Space_FE_timeLags_holidayLags_amenity")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  filter(time_of_day == "AM Rush") %>%
  group_by(start_station, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  gather(-start_station, -MAE, key = "variable", value = "value")%>%
  ggplot(.)+
  #geom_sf(data = PHLCensus, color = "grey", fill = "transparent")+
  geom_point(aes(x = value, y = MAE), alpha = 0.4)+
  geom_smooth(aes(x = value, y = MAE), method = "lm", se= FALSE)+
  facet_wrap(~variable, scales = "free")+
  labs(title="Errors as a function of socio-economic variables",
       y="Mean Absolute Error (Trips)")+
  plotTheme

6. Discussion

In summary, we can see that while we are able to track the time components of demand, we miss the peaks and underpredict during periods of high demand. Our model can effectively predict the demand for bike redistribution on the Indego bikeshare system to some extent, with an MAE of approximately 0.6, indicating that the model’s predicted bike demand at any given station is fairly close to the actual demand (within about 1 bike). However, since the base bike demand per station per hour is relatively small (typically under 3 bikes), the accuracy of predictions still needs improvement. Moreover, Our model demonstrates limited generalizability, as the MAE varies across different socio-economic regions. And it remains to be seen how the model performs in other months of a year.

The main issue we face is the underestimation of demand at high-demand stations, and these are precisely the areas that need attention the most. From an operations perspective, underestimating high demand may lead to inadequate resource allocation, especially during peak periods (such as morning and evening rush hours), directly affecting user experience. For example, during these peak periods, stations may run out of bikes, preventing users from renting or returning bikes, which reduces the satisfaction of meeting the demand. As demand increases, a lack of sufficient bikes can lead to longer wait times for users, or even user loss, lowering the overall usage of the bikeshare system. Furthermore, the operator may fail to effectively redistribute bikes to high-demand areas, thus reducing the efficiency and service quality of the entire system.

To reduce errors, we could incorporate more spatial-temporal features. For example, we can use binary variables to account for peak periods and add social variables such as nearby amenities and residential density. Additionally, using more detailed historical data for training could allow for more complex models. For example, the model could differentiate between holidays that generate varying levels of demand (e.g., Memorial Day versus Christmas) or learn from a broader range of historical data. We can also analyze more specific error plots to identify areas or times where the predictions deviate significantly, helping us optimize resource distribution for specific locations or times. In terms of spatial analysis, methods such as geographically weighted regression (GWR) can be applied to explore the demand patterns in different areas, identifying regions where the demand characteristics differ significantly from others, allowing for more precise predictions.

References:

---
title: "Bikeshare Rebalancing Prediction in Philadelphia"
author: "Jingmiao Fei"
date: "2024-11-22"
output:
  html_document:
    toc: true
    toc_float: true
    code_folding: hide
    code_download: true
---

```{r setup, include=FALSE}
knitr::opts_chunk$set(echo = TRUE)
```

# 1. Introduction

Bike share programs have become an integral part of urban transportation systems, offering a convenient, sustainable, and cost-effective alternative to private vehicles and public transit. Philadelphia’s Indego bike share network, for instance, serves as a critical mobility option for residents and visitors alike, enabling them to navigate the city while reducing congestion and emissions. However, a common operational challenge for bike share programs is the need to maintain an optimal balance of bikes across stations to meet variable demand patterns. Due to daily commuting habits and various socio-economic factors, some parts of a city may experience an abundance of bikes, while others struggle with shortages. For example, residential neighborhoods may see a surge in bike numbers in the morning as commuters head to work, only to be left with empty stations by evening. On the other hand, business districts may have too many bikes during the day, requiring redistribution for the evening commute. Without efficient re-balancing, the bike share system's effectiveness is compromised, leading to reduced customer satisfaction and a drop in usage. Without effective re-balancing strategies, stations may experience shortages or surpluses, leading to user frustration and decreased system efficiency. 

To address these challenges, we propose a proactive re-balancing plan for Indego, the bike share system in Philadelphia, that combines predictive analytics with operational logistics. Launched in 2015, Indego has 1,400 bikes distributed across 130 stations, mostly concentrated in Center City, West Philadelphia, and South Philadelphia[^1]. Our approach involves managing a fleet of re-balancing trucks to redistribute bikes across the network based on demand forecasts. Additionally, we plan to implement user-focused incentives, such as offering ride discounts for returning bikes to underutilized stations or peak-demand locations. This dual approach enhances operational efficiency while fostering rider participation in maintaining system equilibrium.

Our model take one month (May) as the study period, and can provide estimated bike share traffic for each station, every hour of a day throughout the month, to enable timely and actionable interventions. While it remains to be seen how the model performs in other months of a year. The choice of time lag features such as the previous hour’s trip counts, weather conditions, amenity features such as the distance to the nearest subway station, and historical demand patterns at each station, aligns with this operational horizon. By leveraging these insights, our re-balancing plan ensures a dynamic, data-driven response to changing demand, maximizing both user satisfaction and system performance.

# 2. Data Wrangling

For our analysis, we used four datasets to build the variables: 1)Indego station data in May, 2022; 2)weather data collected at Philadelphia International Airport in May, 2022, including percipitation, temperature and wind speed; 3)5-year ACS Census Data from 2018 to 2022; 4)Septa Data, the subway system in Philadelphia

## 2.1. Setup

```{r setup_13,  message=FALSE, warning=FALSE, results='hide'}
library(tidyverse)
library(sf)
library(lubridate)
library(tigris)
library(tidycensus)
library(viridis)
library(riem)
library(gridExtra)
library(knitr)
library(kableExtra)
library(RSocrata)
library(gganimate)
library(caret)
library(RColorBrewer)
library(gifski)

Sys.setlocale("LC_TIME", "English")

plotTheme <- theme(
  plot.title =element_text(size=12),
  plot.subtitle = element_text(size=8),
  plot.caption = element_text(size = 6),
  axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
  axis.text.y = element_text(size = 10),
  axis.title.y = element_text(size = 10),
  # Set the entire chart region to blank
  panel.background=element_blank(),
  plot.background=element_blank(),
  # Format the grid
  panel.grid.major=element_line(colour="#D0D0D0",size=.2),
  axis.ticks=element_blank())

mapTheme <- theme(plot.title =element_text(size=12),
                  plot.subtitle = element_text(size=8),
                  plot.caption = element_text(size = 6),
                  axis.line=element_blank(),
                  axis.text.x=element_blank(),
                  axis.text.y=element_blank(),
                  axis.ticks=element_blank(),
                  axis.title.x=element_blank(),
                  axis.title.y=element_blank(),
                  panel.background=element_blank(),
                  panel.border=element_blank(),
                  panel.grid.major=element_line(colour = 'transparent'),
                  panel.grid.minor=element_blank(),
                  legend.direction = "vertical", 
                  legend.position = "right",
                  plot.margin = margin(1, 1, 1, 1, 'cm'),
                  legend.key.height = unit(1, "cm"), legend.key.width = unit(0.2, "cm"))

palette7 <- c("#eff3ff","#bdd7e7","#6baed6","#3182bd","#08519c","darkblue","#123F5A")
palette4 <- c("#D2FBD4","#92BCAB","#527D82","#123F5A")
palette2 <- c("#6baed6","#08519c")
```

```{r install_census_API_key, warning = FALSE, include=FALSE, eval = TRUE}
# Install Census API Key
tidycensus::census_api_key("e6ed37e145b901c8c418a193d5bc174948045488", overwrite = TRUE)
```


## 2.2. Import Census Data

Import basic census data in Philadelphia in 2022, including total population, median household income, median age, racial demographics, travel time, and transportation data related to commuting and public transportation usage. Additionally, new calculated columns are created, such as the percentage of the population that is White, the mean commute time, and the percentage of commuters who use public transportation.

```{r get_census, message=FALSE, warning=FALSE, cache=TRUE, results = 'hide'}
PHLCensus <- 
  get_acs(geography = "tract", 
          variables = c("B01003_001", "B19013_001", 
                        "B02001_002", "B08013_001",
                        "B08012_001", "B08301_001", 
                        "B08301_010", "B01002_001"), 
          year = 2022, 
          state = "PA", 
          geometry = TRUE, 
          county=c("Philadelphia"),
          output = "wide") %>%
  rename(Total_Pop =  B01003_001E,
         Med_Inc = B19013_001E,
         Med_Age = B01002_001E,
         White_Pop = B02001_002E,
         Travel_Time = B08013_001E,
         Num_Commuters = B08012_001E,
         Means_of_Transport = B08301_001E,
         Total_Public_Trans = B08301_010E) %>%
  select(Total_Pop, Med_Inc, White_Pop, Travel_Time,
         Means_of_Transport, Total_Public_Trans,
         Med_Age,
         GEOID, geometry) %>%
  mutate(Percent_White = White_Pop / Total_Pop,
         Mean_Commute_Time = Travel_Time / Total_Public_Trans,
         Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport)
```

```{r extract_geometries }
PHLTracts <- 
  PHLCensus %>%
  as.data.frame() %>%
  distinct(GEOID, .keep_all = TRUE) %>%
  select(GEOID, geometry) %>% 
  st_sf

```


## 2.3. Import Indego Data

Import Philadelphia's bike-sharing data (Indego), which provides information about bike trips made by users throughout the city. The dataset includes trip start and end times, start and end station locations, user types, and trip durations. 

```{r read_dat, message=FALSE, warning=FALSE, results='hide' }
dat<-read.csv("indego-trips-2022-q2.csv")

dat$start_time <- mdy_hm(dat$start_time)
dat$end_time <- mdy_hm(dat$end_time)
# head(dat$start_time)

# only keep records of May
dat_may <- dat %>%
  filter(month(start_time) == 5)

```
```{r time_bins, message=FALSE, warning=FALSE, results='hide' }
dat2 <- dat_may %>%
  mutate(interval60 = floor_date(ymd_hms(start_time), unit = "hour"),
         interval15 = floor_date(ymd_hms(start_time), unit = "15 mins"),
         week = week(interval60),
         dotw = wday(interval60, label=TRUE))
```


```{r add_census_tracts , message = FALSE, warning = FALSE}
dat_census <- st_join(
  dat2 %>% 
          filter(is.na(start_lon) == FALSE &
                   is.na(start_lat) == FALSE &
                   is.na(end_lat) == FALSE &
                   is.na(end_lon) == FALSE) %>%
          st_as_sf(., coords = c("start_lon", "start_lat"), crs = 4326),
  PHLTracts %>%
          st_transform(crs=4326),
          join = st_intersects, left = TRUE) %>%
  rename(Origin.Tract = GEOID) %>%
  mutate(start_lon = unlist(map(geometry, 1)),
         start_lat = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  select(-geometry)%>%
  st_as_sf(., coords = c("end_lon", "end_lat"), crs = 4326) %>%
  st_join(., PHLTracts %>%
            st_transform(crs=4326),
          join=st_intersects,
          left = TRUE) %>%
  rename(Destination.Tract = GEOID)  %>%
  mutate(end_lon = unlist(map(geometry, 1)),
         end_lat = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  select(-geometry)%>%
  na.omit()
```


```{r}
dat_sf <- st_as_sf(dat%>%na.omit(), coords = c("start_lon", "start_lat"), crs = 4326)
ggplot() +
  geom_sf(data = PHLTracts) +
  geom_sf(data = dat_sf %>%
            group_by(start_station) %>%
            summarize(count = n()),
          color = "#08519c",
          alpha = 0.4) +
  labs(title = "Distribution of Indego Stations in Philadelphia",
       subtitle = "May 2022")+
  mapTheme
```

## 2.4. Import Weather Data

Import weather data in Philadelphia, including perception, temperature and wind speed.

```{r import_weather,message=FALSE, warning=FALSE, results='hide'}
weather.Panel <- 
  riem_measures(station = "PHL", date_start = "2022-05-01", date_end = "2022-05-31") %>%
  dplyr::select(valid, tmpf, p01i, sknt)%>%
  replace(is.na(.), 0) %>%
    mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
    mutate(week = week(interval60),
           dotw = wday(interval60, label=TRUE)) %>%
    group_by(interval60) %>%
    summarize(Temperature = max(tmpf),
              Precipitation = sum(p01i),
              Wind_Speed = max(sknt)) %>%
    mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))
```

```{r plot_weather, catche = TRUE, message=FALSE, warning=FALSE, results='hide'}
grid.arrange(
  ggplot(weather.Panel, aes(interval60,Precipitation)) + geom_line() + 
  labs(title="Percipitation", x="Hour", y="Perecipitation") + plotTheme,
  ggplot(weather.Panel, aes(interval60,Wind_Speed)) + geom_line() + 
    labs(title="Wind Speed", x="Hour", y="Wind Speed") + plotTheme,
  ggplot(weather.Panel, aes(interval60,Temperature)) + geom_line() + 
    labs(title="Temperature", x="Hour", y="Temperature") + plotTheme,
  top="Weather Data - PHL PHL - May, 2022")
```

## 2.5 Import Septa Data

Import subway station points in Philadelphia.

```{r septadata, message=FALSE, warning=FALSE, results='hide'}
el <- st_read("https://opendata.arcgis.com/datasets/8c6e2575c8ad46eb887e6bb35825e1a6_0.geojson")
Broad_St <- st_read("https://opendata.arcgis.com/datasets/2e9037fd5bef406488ffe5bb67d21312_0.geojson")
septaStops <- 
  rbind(
     el %>% 
      mutate(Line = "El") %>%
      dplyr::select(Station, Line),
     Broad_St %>%
      mutate(Line ="Broad_St") %>%
      dplyr::select(Station, Line)) %>%
  st_transform(crs=4326)  

ggplot()+
  geom_sf(data = PHLTracts)+
  geom_sf(data = septaStops,fill = "#08519c", color = "#08519c", alpha = 0.7)+
  labs(title = "Subway Stations in Philadelphia")+
  theme_minimal()
```


# 3. Exploratory Analysis

## 3.1. Bike share trips per hour
Exploring bike share trips per hour, we observed a cyclical pattern in trip numbers throughout the day, with certain hours showing clear peak demand. Additionally, trips on weekends were significantly lower compared to weekdays. However, during the weekend around May 28th (the Memorial Day), there was no noticeable dip in trips, reflecting the impact of holidays on people's travel behavior.


```{r trip_timeseries }
ggplot(dat_census %>%
         group_by(interval60) %>%
         tally())+
  geom_line(aes(x = interval60, y = n))+
  labs(title="Bike share trips per hr. PHL, May, 2022",
       x="Date", 
       y="Number of trips")+
  plotTheme
```

Breaking these trends down, we categorized the 24 hours into four time periods: overnight, AM Rush, Mid-Day, and PM Rush, and explored the distribution of the number of trips per hour within these periods. We found that trip counts in all four time slots were mostly concentrated around 1 and 2, with Mid-Day and PM Rush periods tending to have more counts of 2.

```{r mean_trips_hist, warning = FALSE, message = FALSE }
dat_census %>%
        mutate(time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
         group_by(interval60, start_station, time_of_day) %>%
         tally()%>%
  group_by(start_station, time_of_day)%>%
  summarize(mean_trips = mean(n))%>%
  ggplot()+
  geom_histogram(aes(mean_trips), binwidth = 1)+
  labs(title="Mean Number of Hourly Trips Per Station. PHL, May, 2022",
       x="Number of trips", 
       y="Frequency")+
  facet_wrap(~time_of_day)+
  plotTheme
```

```{r trips_station_dotw }
ggplot(dat_census %>%
         group_by(interval60, start_station) %>%
         tally())+
  geom_histogram(aes(n), binwidth = 1)+
  labs(title="Bike share trips per hr by station. PHL, May, 2022",
       x="Trip Counts", 
       y="Number of Stations")+
  plotTheme
```

Exploring bike share trips by day, we can see that the trip counts in Friday and weekends are clearly lower.
```{r trips_hour_dotw }
ggplot(dat_census %>% mutate(hour = hour(start_time)))+
     geom_freqpoly(aes(hour, color = dotw), binwidth = 1)+
  labs(title="Bike share trips in PHL, by day of the week, May, 2022",
       x="Hour", 
       y="Trip Counts")+
  scale_color_manual(values = brewer.pal(n = 7, name = "Set2") ) +
     plotTheme


ggplot(dat_census %>% 
         mutate(hour = hour(start_time),
                weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday")))+
     geom_freqpoly(aes(hour, color = weekend), binwidth = 1)+
  labs(title="Bike share trips in PHL - weekend vs weekday, May, 2022",
       x="Hour", 
       y="Trip Counts")+
     plotTheme
```

These patterns are also evident spatially, with the highest ridership per station concentrated in the downtown and University City areas during the afternoon rush hour.

```{r origin_map }
ggplot()+
  geom_sf(data = PHLTracts %>%
          st_transform(crs=4326))+
  geom_point(data = dat_census %>%
               na.omit()%>%
            mutate(hour = hour(start_time),
                weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
                time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
              group_by(start_station, start_lat, start_lon, weekend, time_of_day) %>%
              tally(),
            aes(x=start_lon, y = start_lat, color = n), 
            fill = "transparent", alpha = 0.5, size = 1)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "D")+
  ylim(min(dat_census$start_lat), max(dat_census$start_lat))+
  xlim(min(dat_census$start_lon), max(dat_census$start_lon))+
  facet_grid(weekend ~ time_of_day)+
  labs(title="Bike share trips per hr by station. PHL, May, 2022")+
  mapTheme
```


## 3.2 Create Space-Time Panel

Here we create a study panel that has each unique time and start station observations.

```{r panel_length_check , message=FALSE, warning=FALSE, results='hide'}

study.panel <- 
  expand.grid(interval60=unique(dat_census$interval60), 
              start_station = unique(dat_census$start_station)) %>%
  left_join(., dat_census %>%
              select(start_station, Origin.Tract, start_lon, start_lat )%>%
              distinct() %>%
              group_by(start_station) %>%
              slice(1))

```

```{r create_panel , message=FALSE, warning=FALSE, results='hide'}
ride.panel <- 
  dat_census %>%
  mutate(Trip_Counter = 1) %>%
  right_join(study.panel) %>% 
  group_by(interval60, start_station, Origin.Tract, start_lon, start_lat) %>%
  summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
  left_join(weather.Panel) %>%
  ungroup() %>%
  filter(is.na(start_station) == FALSE) %>%
  mutate(week = week(interval60),
         dotw = wday(interval60, label = TRUE)) %>%
  filter(is.na(Origin.Tract) == FALSE)

# turn start station id frome numeric to categorical variable
ride.panel$start_station <- factor(ride.panel$start_station)
```

```{r census_and_panel , message=FALSE, warning=FALSE, results='hide'}
ride.panel <- 
  left_join(ride.panel, PHLCensus %>%
              as.data.frame() %>%
              select(-geometry), by = c("Origin.Tract" = "GEOID"))
```

## 3.3. Create time lags

We created time lags to capture the temporal dependency and better fit the historical impact. Here we created 1, 2, 3, 4 ,12 hours lag, and 1, 2, 3-day holiday lag and lead.
```{r time_lags ,message=FALSE, warning=FALSE, results='hide'}
ride.panel <- 
  ride.panel %>% 
  arrange(start_station, interval60) %>% 
  mutate(lagHour = dplyr::lag(Trip_Count,1),
         lag2Hours = dplyr::lag(Trip_Count,2),
         lag3Hours = dplyr::lag(Trip_Count,3),
         lag4Hours = dplyr::lag(Trip_Count,4),
         lag12Hours = dplyr::lag(Trip_Count,12),
         lag1day = dplyr::lag(Trip_Count,24),
         holiday = ifelse(yday(interval60) == 148,1,0)) %>%
   mutate(day = yday(interval60)) %>%
   mutate(holidayLag = case_when(dplyr::lag(holiday, 1) == 1 ~ "PlusOneDay",
                                 dplyr::lag(holiday, 2) == 1 ~ "PlustTwoDays",
                                 dplyr::lag(holiday, 3) == 1 ~ "PlustThreeDays",
                                 dplyr::lead(holiday, 1) == 1 ~ "MinusOneDay",
                                 dplyr::lead(holiday, 2) == 1 ~ "MinusTwoDays",
                                 dplyr::lead(holiday, 3) == 1 ~ "MinusThreeDays"),
         holidayLag = ifelse(is.na(holidayLag) == TRUE, 0, holidayLag))

```

```{r evaluate_lags ,message=FALSE, warning=FALSE, results='hide'}
as.data.frame(ride.panel) %>%
    group_by(interval60) %>% 
    summarise_at(vars(starts_with("lag"), "Trip_Count"), mean, na.rm = TRUE) %>%
    gather(Variable, Value, -interval60, -Trip_Count) %>%
    mutate(Variable = factor(Variable, levels=c("lagHour","lag2Hours","lag3Hours","lag4Hours",
                                                "lag12Hours","lag1day")))%>%
    group_by(Variable) %>%  
    summarize(correlation = round(cor(Value, Trip_Count),2))
```

## 3.4 Create Amenity Features

We add an amenity feature, distance to the nearest subway station as one of the variables. Because a large portion of people using shared bikes are heading to the subways.

```{r}
# Convert ride.panel to an sf object using start_lon and start_lat
ride.panel_sf <- st_as_sf(ride.panel, coords = c("start_lon", "start_lat"), crs = 4326)

# Reproject both dataframes to EPSG:2272
ride.panel_proj <- st_transform(ride.panel_sf, crs = 2272)
septaStops_proj <- st_transform(septaStops, crs = 2272)

# Calculate the nearest distance from each ride.panel point to the closest septaStop
nearest_distances <- st_nearest_feature(ride.panel_proj, septaStops_proj)  # Index of nearest points
nearest_points <- septaStops_proj[nearest_distances, ]  # Get the nearest points
distances <- st_distance(ride.panel_proj, nearest_points, by_element = TRUE)  # Calculate distances

# Add the distances back to ride.panel
ride.panel$nearest_septa_dist <- as.numeric(distances)  # Convert to numeric

# Optional: Add nearest septaStop ID or other attributes
ride.panel$nearest_septaStop_id <- nearest_points$Station  
```

```{r}
ride.panel <- na.omit(ride.panel)
```

Finally, we created an animated plot and note clear spatial and temporal autocorrelation in our data: the animated map indicates that data cluster at certain times of the day in central Philadelphia.

```{r animate_plot, warning=FALSE, message=FALSE}

data_5_18 <- ride.panel %>%
  filter(format(interval60, "%Y-%m-%d") == "2022-05-18")  
data_5_18$interval60 <- data_5_18$interval60 + hours(4)


p <- ggplot() +
  geom_sf(data = PHLTracts)+
  geom_point(data = data_5_18, aes(x = start_lon, y = start_lat, color = Trip_Count), alpha = 0.8, size = 2) + 
  scale_color_gradient(low = "lightblue", high = "darkblue") + 
  labs(title = "Bike Share Trips on 2022-05-18", 
       subtitle = "Hour: {format(frame_time, '%Y-%m-%d %H:00:00')}", 
       x = "Longitude", 
       y = "Latitude") +
  theme_minimal() +
  transition_time(interval60) + 
  ease_aes('linear') 

# Export the animation as a GIF
# animated_gif <- animate(p, width = 800, height = 600, fps = 10, renderer = gifski_renderer())
# anim_save("bike_share_trips.gif", animated_gif)

```

```{r Display the GIF}
knitr::include_graphics("bike_share_trips.gif")
```

# 4. Modeling

## 4.1. Run Models

We use Week 20, 21, and 22 as train dataset, for it includes the holiday lag in the last 3 weeks. And take the 18th and 19th week as test set.
```{r train_test }
ride.Train <- filter(ride.panel, week >= 20)
ride.Test <- filter(ride.panel, week < 20)
```

For our analysis, we used OLS regressions. Although OLS may not be the best method due to the spatial autocorrelation discussed earlier, alternative approaches like spatial lag or geographically weighted regression could enhance performance and generalizability. Nonetheless, for the objectives of this task, OLS proves adequate.

In our analysis, seven regression models are built: 

reg1: only time fixed effect

reg2: only space fixed effect

reg3: time - space fixed effect

reg4: reg3 with hour lag

reg5: reg4 with holiday lag

reg6: reg5 with amenity feature

reg7: reg6 with some social indicators (population,median household income and percent of people taking public transprtation)


```{r seven_models }
reg1 <- 
  lm(Trip_Count ~  hour(interval60) + dotw + Temperature,  data=ride.Train)

reg2 <- 
  lm(Trip_Count ~  start_station + dotw + Temperature,  data=ride.Train)

reg3 <- 
  lm(Trip_Count ~  start_station + hour(interval60) + dotw + Temperature + Precipitation, 
     data=ride.Train)

reg4 <- 
  lm(Trip_Count ~  start_station +  hour(interval60) + dotw + Temperature + Precipitation +
                   lagHour + lag2Hours +lag3Hours + lag12Hours + lag1day, 
     data=ride.Train)

reg5 <- 
  lm(Trip_Count ~  start_station + hour(interval60) + dotw + Temperature + Precipitation +
                   lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day + holidayLag + holiday, 
     data=ride.Train)

reg6 <- 
  lm(Trip_Count ~  start_station + hour(interval60) + dotw + Temperature + Precipitation +
                   lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day + holidayLag + holiday + nearest_septa_dist, 
     data=ride.Train)

reg7 <- 
  lm(Trip_Count ~  start_station + hour(interval60) + dotw + Temperature + Precipitation +
                   lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day + holidayLag + holiday + nearest_septa_dist + Total_Pop + Med_Inc + Percent_Taking_Public_Trans, 
     data=ride.Train)

```

## 4.2. Predict for test data

```{r nest_data ,message=FALSE, warning=FALSE, results='hide'}
ride.Test.weekNest <- 
  ride.Test %>%
  nest(-week) 
```


```{r predict_function,message=FALSE, warning=FALSE, results='hide' }
model_pred <- function(dat, fit){pred <- predict(fit, newdata = dat)}
```


```{r do_predicitons,message=FALSE, warning=FALSE, results='hide'  }
week_predictions <- 
  ride.Test.weekNest %>% 
    mutate(ATime_FE = map(.x = data, fit = reg1, .f = model_pred),
           BSpace_FE = map(.x = data, fit = reg2, .f = model_pred),
           CTime_Space_FE = map(.x = data, fit = reg3, .f = model_pred),
           DTime_Space_FE_timeLags = map(.x = data, fit = reg4, .f = model_pred),
           ETime_Space_FE_timeLags_holidayLags = map(.x = data, fit = reg5, .f = model_pred),
           FTime_Space_FE_timeLags_holidayLags_amenity = map(.x = data, fit = reg6, .f = model_pred),
           GTime_Space_FE_timeLags_holidayLags_amenity_society = map(.x = data, fit = reg7, .f = model_pred)
           ) %>% 
    gather(Regression, Prediction, -data, -week) %>%
    mutate(Observed = map(data, pull, Trip_Count),
           Absolute_Error = map2(Observed, Prediction,  ~ abs(.x - .y)),
           MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
           sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))

week_predictions
```

# 5. Accuracy and Generalizability

## 5.1. Examine Error Metrics for Accuracy

Among the seven models we ran, we found that those incorporating both temporal and spatial features, along with various time-lagged variables, performed the best. Interestingly, spatial features appeared to be more influential than temporal ones; including spatial lags in the model led to significant improvements. The addition of holiday lags, amenity features, and social economic variables brought almost no improvement. These features might already be implicitly captured by the pure spatial and temporal characteristics in the dataset. For example, the variable representing income, with its strong spatial autocorrelation, is likely already embedded in the location data; and the distance to the nearest subway station seems to be included in the spatial fixed effect.

```{r plot_errors_by_model }
week_predictions %>%
  dplyr::select(week, Regression, MAE) %>%
  gather(Variable, MAE, -Regression, -week) %>%
  ggplot(aes(week, MAE)) + 
    geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
    scale_fill_manual(values = palette7) +
    labs(title = "Mean Absolute Errors by model specification and week") +
  plotTheme
```

```{r error_vs_actual_timeseries , warning = FALSE, message = FALSE}
week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station = map(data, pull, start_station)) %>%
    dplyr::select(interval60, start_station, Observed, Prediction, Regression) %>%
    unnest() %>%
    gather(Variable, Value, -Regression, -interval60, -start_station) %>%
    group_by(Regression, Variable, interval60) %>%
    summarize(Value = sum(Value)) %>%
    ggplot(aes(interval60, Value, colour=Variable)) + 
      geom_line(size = 1.1) + 
      facet_wrap(~Regression, ncol=1) +
      labs(title = "Predicted/Observed bike share time series", subtitle = "PHL; A test set of 2 weeks",  x = "Hour", y= "Station Trips") +
      plotTheme
```

## 5.2 K-fold Cross Validation

Moving forward, we stick with `reg6`, which seems to have the best goodness of fit generally. We used k-fold cross validation to test our model's accuracy and generalizability. The RMSE is 1.04 and MAE is 0.67, which indicate a moderate level of performance.

```{r}
folds <- 5

control <- trainControl(method="cv", number=folds)


set.seed(123)

model_cv <- train(Trip_Count ~ start_station + hour(interval60) + dotw + Temperature + Precipitation +
                   lagHour + lag2Hours +lag3Hours +lag12Hours + lag1day + holidayLag + holiday + nearest_septa_dist, 
                  data=ride.panel,
                  method="lm",
                  trControl=control)

cv_df <- data.frame(
  model = "FTime_Space_FE_timeLags_holidayLags_amenity",
  folds = folds,
  rmse = model_cv$results$RMSE,
  mae = model_cv$results$MAE
)
```

```{r}
cv_df %>%
    kbl(caption = "Cross-Validation Results") %>%
    kable_paper(bootstrap_options = "striped", full_width = F)
```



Exploring the mean absolute errors (MAE) by station, we find that there seems to be a spatial pattern to our model.The high MAE values are concemtrated in the downtown area.

```{r errors_by_station, warning = FALSE, message = FALSE }
week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station = map(data, pull, start_station), 
           start_lat = map(data, pull, start_lat), 
           start_lon = map(data, pull, start_lon)) %>%
    select(interval60, start_station, start_lon, start_lat, Observed, Prediction, Regression) %>%
    unnest() %>%
  filter(Regression == "FTime_Space_FE_timeLags_holidayLags_amenity") %>%
  group_by(start_station, start_lon, start_lat) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
  geom_sf(data = PHLCensus, color = "grey", fill = "transparent")+
  geom_point(aes(x = start_lon, y = start_lat, color = MAE), 
             fill = "transparent", alpha = 0.5)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "D")+
  ylim(min(dat_census$start_lat), max(dat_census$start_lat))+
  xlim(min(dat_census$start_lon), max(dat_census$start_lon))+
  labs(title="Mean Abs Error, Test Set, Model 6")+
  mapTheme
```


## 5.3. Space-Time Error Evaluation

Except for spatial error distribution, we should also consider temporal erros. We plot observed and predicted for different times of day during the week and weekend, and some patterns begin to emerge.

1) Underprediction: In most of the plots, the predicted values are consistently lower than the observed ones, particularly as the number of trips increases. This is shown by data points lying above the 45-degree line (representing perfect prediction). This suggests that the model has more difficulty with higher-demand stations.

2) Time Effects: The underprediction seems more pronounced during peak times like the AM and PM rush hours compared to midday or overnight. This indicates the model may not fully capture the patterns of travel during these peak periods.

3) Outliers: Some data points, especially in the higher trip ranges, are far from the main cluster. These outliers highlight cases where the model significantly underperforms in predicting the correct number of trips.

```{r obs_pred_all, warning=FALSE, message = FALSE, cache=TRUE}
week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station = map(data, pull, start_station), 
           start_lat = map(data, pull, start_lat), 
           start_lon = map(data, pull, start_lon),
           dotw = map(data, pull, dotw)) %>%
    select(interval60, start_station, start_lon, 
           start_lat, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "FTime_Space_FE_timeLags_holidayLags_amenity")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
  ggplot()+
  geom_point(aes(x= Observed, y = Prediction))+
    geom_smooth(aes(x= Observed, y= Prediction), method = "lm", se = FALSE, color = "red")+
    geom_abline(slope = 1, intercept = 0)+
  facet_grid(time_of_day~weekend)+
  labs(title="Observed vs Predicted",
       x="Observed trips", 
       y="Predicted trips")+
  plotTheme
```

We found that the MAE is highest during the PM rush, followed by the AM rush and midday, with the smallest MAE occurring during the overnight period. This reflects the model's challenges in accurately predicting demand during peak hours. These periods, along with the areas in downtown where the model's errors are most pronounced, correspond to more developed regions with higher demand. This suggests that the model's performance, particularly in high-demand, spatially concentrated areas, still requires significant improvement.

```{r station_summary, warning=FALSE, message = FALSE }
week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station = map(data, pull, start_station), 
           start_lat = map(data, pull, start_lat), 
           start_lon = map(data, pull, start_lon),
           dotw = map(data, pull, dotw) ) %>%
    select(interval60, start_station, start_lon, 
           start_lat, Observed, Prediction, Regression,
           dotw) %>%
    unnest() %>%
  filter(Regression == "FTime_Space_FE_timeLags_holidayLags_amenity")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  group_by(start_station, weekend, time_of_day, start_lon, start_lat) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  ggplot(.)+
  geom_sf(data = PHLCensus, color = "grey", fill = "transparent")+
  geom_point(aes(x = start_lon, y = start_lat, color = MAE), 
             fill = "transparent", alpha = 0.5)+
  scale_colour_viridis(direction = -1,
  discrete = FALSE, option = "D")+
  ylim(min(dat_census$start_lat), max(dat_census$start_lat))+
  xlim(min(dat_census$start_lon), max(dat_census$start_lon))+
  facet_grid(weekend~time_of_day)+
  labs(title="Mean Absolute Errors, Test Set")+
  mapTheme
  
```

Lastly, we find that our model does not generalize perfectly. Our model performs less accurately in wealthier, whiter neighborhoods, and more accurately in areas with higher public transportation reliance. 

```{r station_summary2, warning=FALSE, message = FALSE }
week_predictions %>% 
    mutate(interval60 = map(data, pull, interval60),
           start_station = map(data, pull, start_station), 
           start_lat = map(data, pull, start_lat), 
           start_lon = map(data, pull, start_lon),
           dotw = map(data, pull, dotw),
           Percent_Taking_Public_Trans = map(data, pull, Percent_Taking_Public_Trans),
           Med_Inc = map(data, pull, Med_Inc),
           Percent_White = map(data, pull, Percent_White)) %>%
    select(interval60, start_station, start_lon, 
           start_lat, Observed, Prediction, Regression,
           dotw, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
    unnest() %>%
  filter(Regression == "FTime_Space_FE_timeLags_holidayLags_amenity")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  filter(time_of_day == "AM Rush") %>%
  group_by(start_station, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  gather(-start_station, -MAE, key = "variable", value = "value")%>%
  ggplot(.)+
  #geom_sf(data = PHLCensus, color = "grey", fill = "transparent")+
  geom_point(aes(x = value, y = MAE), alpha = 0.4)+
  geom_smooth(aes(x = value, y = MAE), method = "lm", se= FALSE)+
  facet_wrap(~variable, scales = "free")+
  labs(title="Errors as a function of socio-economic variables",
       y="Mean Absolute Error (Trips)")+
  plotTheme
  
```

# 6. Discussion

In summary, we can see that while we are able to track the time components of demand, we miss the peaks and underpredict during periods of high demand. Our model can effectively predict the demand for bike redistribution on the Indego bikeshare system to some extent, with an MAE of approximately 0.6, indicating that the model's predicted bike demand at any given station is fairly close to the actual demand (within about 1 bike). However, since the base bike demand per station per hour is relatively small (typically under 3 bikes), the accuracy of predictions still needs improvement. Moreover, Our model demonstrates limited generalizability, as the MAE varies across different socio-economic regions. And it remains to be seen how the model performs in other months of a year.

The main issue we face is the underestimation of demand at high-demand stations, and these are precisely the areas that need attention the most. From an operations perspective, underestimating high demand may lead to inadequate resource allocation, especially during peak periods (such as morning and evening rush hours), directly affecting user experience. For example, during these peak periods, stations may run out of bikes, preventing users from renting or returning bikes, which reduces the satisfaction of meeting the demand. As demand increases, a lack of sufficient bikes can lead to longer wait times for users, or even user loss, lowering the overall usage of the bikeshare system. Furthermore, the operator may fail to effectively redistribute bikes to high-demand areas, thus reducing the efficiency and service quality of the entire system.

To reduce errors, we could incorporate more spatial-temporal features. For example, we can use binary variables to account for peak periods and add social variables such as nearby amenities and residential density. Additionally, using more detailed historical data for training could allow for more complex models. For example, the model could differentiate between holidays that generate varying levels of demand (e.g., Memorial Day versus Christmas) or learn from a broader range of historical data. We can also analyze more specific error plots to identify areas or times where the predictions deviate significantly, helping us optimize resource distribution for specific locations or times. In terms of spatial analysis, methods such as geographically weighted regression (GWR) can be applied to explore the demand patterns in different areas, identifying regions where the demand characteristics differ significantly from others, allowing for more precise predictions.


References:

[^1]: https://en.wikipedia.org/wiki/Indego
