Introduction

Indego is Philadelphia’s official bike share system. Since it began operations in 2015, the system has logged at least five million rides. Following a record ridership in 2021, Indego announced its plans to add over 30 new stations next year, expanding further west, north, and south of Center City and University City where most of its stations are concentrated. Like many other bike share systems, one of Indego’s operational challenge is rebalancing or physically redistributing its bikes to ensure availability. In its 2018 Business Plan, rebalancing was identified as the system’s single largest operating expense. To help balance supply and demand, Indego’s team of dispatchers monitors bike availability and directs drivers to move bikes to and from stations that need rebalancing. Indego also introduced IndeHero, a program that provides riders with credits if bikes are returned to empty stations. As part of its five-year vision, Indego is looking to increase the capacities of more popular stations and expand its service area to help reduce the need for rebalancing.

For this report, I used Indego’s trip data to develop a spatio-temporal model that could help predict ridership demand. By anticipating supply and demand, Indego will be able to balance its system more efficiently and optimize availability. In addition to historical data, I also accounted for weather, holidays, and neighborhood effects. This model will be trained on data from the first three weeks of July 2021 and tested on data from the last week of July and the first week of August 2021.

Data Preparation

I imported Indego’s third quarter of 2021 trip data, its latest available. For this analysis, I am looking at a 5 week duration, from July 1, 2021 to August 5, 2021.

indego <-
  read.csv("~/GitHub/MUSA-508/indego-trips-2021-q3.csv")

indego2 <- indego %>%
  mutate(interval60 = floor_date(mdy_hm(start_time), unit = "hour"),
         interval15 = floor_date(mdy_hm(start_time), unit = "15 mins"),
         week = week(interval60),
         dotw = as.factor(wday(interval60)))%>%
  filter(week>=26 & week < 31)

indego2 <- indego2 %>% mutate(dotw=recode(dotw, '1' = "Mon", '2' = "Tue", '3' = "Wed", '4' = "Thurs", '5' = "Fri", '6' = "Sat", '7' = "Sun"))

Amenities

Indego’s existing stations are mostly located near parks, tourist attractions, universities and rail stations. The geographic locations of these amenities were downloaded from Philadelphia’s Open Data Portal. The distance from a bike share station to its nearest amenity were calculated using the nearest neighbor distance.

universities<-st_read("https://opendata.arcgis.com/api/v3/datasets/8ad76bc179cf44bd9b1c23d6f66f57d1_0/downloads/data?format=geojson&spatialRefId=4326") %>% distinct(NAME, .keep_all = TRUE)%>%
  dplyr::select(geometry) %>%
  na.omit()%>%
  st_transform('EPSG:2263')

universities.sf <- st_centroid(universities)

parks<-st_read("https://opendata.arcgis.com/datasets/d52445160ab14380a673e5849203eb64_0.geojson")%>% filter(PPR_USE== "MISC_PARK" | PPR_USE== "SQUARE_PLAZA"|PPR_USE== "RECREATION_SITE")%>%
 dplyr::select(geometry) %>%
  na.omit()%>%
  st_transform('EPSG:2263')

parks.sf <- st_centroid(parks)

landmarks<-st_read("http://data-phl.opendata.arcgis.com/datasets/5146960d4d014f2396cb82f31cd82dfe_0.geojson") %>% filter(FEAT_TYPE== "Historic Landmark" | FEAT_TYPE== "Monument" | FEAT_TYPE == "Museum") %>% 
 dplyr::select(geometry) %>%
  na.omit()%>%
  st_transform('EPSG:2263')

landmarks.sf <- st_centroid(landmarks)

septaStops <- 
  rbind(
    st_read("https://opendata.arcgis.com/datasets/8c6e2575c8ad46eb887e6bb35825e1a6_0.geojson") %>% 
    mutate(Line = "El") %>%
   dplyr:: select(Station, Line),
      st_read("https://opendata.arcgis.com/datasets/2e9037fd5bef406488ffe5bb67d21312_0.geojson") %>%
      mutate(Line ="Broad_St") %>%
      dplyr:: select(Station, Line)) %>%
dplyr::select(geometry) %>%
   st_transform('EPSG:2263')  

#converting indego df to a gdf
indego_nn <- indego2 %>%
  na.omit() %>%
  st_as_sf(coords = c("start_lon", "start_lat"), crs = 4326, agr = "constant")%>%
  st_transform('EPSG:2263')

st_c <- st_coordinates

indego_nn <-
  indego_nn %>%
  mutate(
    universities.nn =
      nn_function(st_c(indego_nn), st_c(universities.sf),1),
    parks.nn =
      nn_function(st_c(indego_nn), st_c(parks.sf),1),
    landmarks.nn =
      nn_function(st_c(indego_nn), st_c(landmarks.sf),1),
    septaStops.nn =
      nn_function(st_c(indego_nn), st_c(septaStops),1))

indego_nn <-
  indego_nn%>%
  dplyr::select(start_station, end_station, universities.nn, parks.nn, landmarks.nn, septaStops.nn,interval60, interval15, week, dotw)

Census Variables

To capture neighborhood effects, I added demographic variables including the share of white population, share of residents taking public transit, and their mean commute times.

phillyCensus <- 
  get_acs(geography = "tract", 
          variables = c("B01003_001", "B19013_001", 
                        "B02001_002", "B08013_001",
                        "B08012_001", "B08301_001", 
                        "B08301_010", "B01002_001"), 
          year = 2019, 
          state = "PA", 
          geometry = TRUE, 
          county=c("Philadelphia"),
          output = "wide") %>%
  rename(Total_Pop =  B01003_001E,
         Med_Inc = B19013_001E,
         Med_Age = B01002_001E,
         White_Pop = B02001_002E,
         Travel_Time = B08013_001E,
         Num_Commuters = B08012_001E,
         Means_of_Transport = B08301_001E,
         Total_Public_Trans = B08301_010E) %>%
  dplyr::select(Total_Pop, Med_Inc, White_Pop, Travel_Time,
         Means_of_Transport, Total_Public_Trans,
         Med_Age, GEOID, geometry) %>%
  mutate(Percent_White = White_Pop / Total_Pop,
         Mean_Commute_Time = Travel_Time / Total_Public_Trans,
         Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport)

phillyTracts <- 
  phillyCensus %>%
  as.data.frame() %>%
  distinct(GEOID, .keep_all = TRUE) %>%
  dplyr::select(GEOID, geometry) %>% 
  st_sf
  
indego_census <- st_join(indego2 %>% 
          filter(is.na(start_lon) == FALSE &
                   is.na(start_lat) == FALSE &
                   is.na(end_lon) == FALSE &
                   is.na(end_lat) == FALSE) %>%
          st_as_sf(., coords = c("start_lon", "start_lat"), crs = 4326),
        phillyTracts %>%
          st_transform(crs=4326),
        join=st_intersects,
              left = TRUE) %>%
  rename(Origin.Tract = GEOID) %>%
  mutate(start_lon = unlist(map(geometry, 1)),
         start_lat = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  dplyr::select(-geometry)%>%
  st_as_sf(., coords = c("end_lon", "end_lat"), crs = 4326) %>%
  st_join(., phillyTracts %>%
            st_transform(crs=4326),
          join=st_intersects,
          left = TRUE) %>%
  rename(Destination.Tract = GEOID)  %>%
  mutate(end_lon = unlist(map(geometry, 1)),
         end_lat = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  dplyr::select(-geometry)  

Weather

Since bike trips are affected by weather conditions, with ridership usually peaking during the summer months, I included weather data collected from the Philadelphia International Airport.

grid.arrange(
  ggplot(weather.Panel, aes(interval60,Precipitation)) + geom_line(colour="#CA3C97") + 
  labs(title="Precipitation", x="Hour", y="Precipitation") + plotTheme,
  ggplot(weather.Panel, aes(interval60,Wind_Speed)) + geom_line(colour="#CA3C97") + 
    labs(title="Wind Speed", x="Hour", y="Wind Speed") + plotTheme,
  ggplot(weather.Panel, aes(interval60,Temperature)) + geom_line(colour="#CA3C97") + 
    labs(title="Temperature", x="Hour", y="Temperature") + plotTheme,
  top="")

Space-time Panel

I created a space-time panel which contains every possible combination of bike share stations and time (hour/day). Each time period is represented by a row and if no trips originated from a bike share station during a given hour, a 0 is assigned.

study_panel <- 
  expand.grid(interval60=unique(indego_census$interval60), 
              start_station = unique(indego_census$start_station))%>%
 left_join(., indego_census %>%
              dplyr::select(start_station, Origin.Tract, start_lon, start_lat)%>%
              distinct() %>%
              group_by(start_station) %>%
              slice(1))

ride_panel <- 
  indego_census %>%
  mutate(Trip_Counter = 1) %>%
  right_join(study_panel) %>% 
  group_by(interval60, start_station, Origin.Tract, start_lon, start_lat) %>%
  summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
  left_join(weather.Panel)%>%
  ungroup() %>%
  filter(is.na(start_station) == FALSE) %>%
  mutate(week = week(interval60),
         dotw = wday(interval60, label = TRUE)) %>%
  filter(is.na(Origin.Tract) == FALSE)

ride_panel <- 
  left_join(ride_panel, phillyCensus %>%
              as.data.frame() %>%
              dplyr::select(-geometry), by = c("Origin.Tract" = "GEOID"))

ride_panel <-
  ride_panel%>%
  left_join(st_drop_geometry(indego_nn) %>% distinct(start_station, .keep_all = TRUE) %>% dplyr::select(start_station, ends_with(".nn")), by = c("start_station"="start_station"))

Time Lags

Time lag variables were created to provide more information about the demand a few hours or a day before a specific time period. To control for the effects of demand that were likely disrupted by the Fourth of July holiday, I added holiday and holidayLag variables.

ride_panel <- 
  ride_panel %>% 
  arrange(start_station, interval60) %>% 
  mutate(lagHour = dplyr::lag(Trip_Count,1),
         lag2Hours = dplyr::lag(Trip_Count,2),
         lag3Hours = dplyr::lag(Trip_Count,3),
         lag4Hours = dplyr::lag(Trip_Count,4),
         lag12Hours = dplyr::lag(Trip_Count,12),
         lag1day = dplyr::lag(Trip_Count,24),
         holiday = ifelse(yday(interval60)==186,1,0)) %>%
   mutate(day = yday(interval60)) %>%
   mutate(holidayLag = case_when(dplyr::lag(holiday, 1) == 1 ~ "PlusOneDay",
                                 dplyr::lag(holiday, 2) == 1 ~ "PlustTwoDays",
                                 dplyr::lag(holiday, 3) == 1 ~ "PlustThreeDays",
                                 dplyr::lead(holiday, 1) == 1 ~ "MinusOneDay",
                                 dplyr::lead(holiday, 2) == 1 ~ "MinusTwoDays",
                                 dplyr::lead(holiday, 3) == 1 ~ "MinusThreeDays"),
         holidayLag = replace_na(holidayLag, 0))

Exploratory Analysis

The dataset is split into a three-week training set and a two-week test set. To better understand trip patterns, I plotted the number of bike trips per week. Trips generally peak during the middle of the week.

mondays <- 
  mutate(ride_panel,
         monday = ifelse(dotw == "Mon" & hour(interval60) == 1,
                         interval60, 0)) %>%
  filter(monday != 0) 

rbind(
  mutate(ride_Train, Legend = "Training"), 
  mutate(ride_Test, Legend = "Testing")) %>%
  group_by(Legend, interval60) %>% 
  summarize(Trip_Count = sum(Trip_Count)) %>%
  ungroup() %>% 
  ggplot(aes(interval60, Trip_Count, colour = Legend)) + geom_line() +
  geom_vline(data = mondays, aes(xintercept = monday), linetype="dotted") +
  scale_colour_manual(values = palette2) +
  labs(title="Bike Share Trips by Week",
       subtitle="July 1 - August 5, 2021",
       x="Day", y="Trip Count") +
 plotTheme2() 

The mean number of hourly trips per station vary during different times of the day. Ridership is highest during the afternoon rush hours.

indego_census %>%
        mutate(time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
         group_by(interval60, start_station, time_of_day) %>%
         tally()%>%
  group_by(start_station, time_of_day)%>%
  summarize(mean_trips = mean(n))%>%
  ggplot()+
  geom_histogram(aes(mean_trips), binwidth = 1, fill="#CA3697")+
  labs(title="Mean Number of Hourly Trips per Station",
       subtitle= "July 1 - August 5, 2021",
       x="Number of trips", 
       y="Frequency")+
  facet_wrap(~time_of_day)+plotTheme2()

Most bike share stations have less than five trips per hour, but a few stations have much higher demand suggesting the need for rebalancing.

ggplot(indego_census %>%
         group_by(interval60, start_station) %>%
         tally())+
  geom_histogram(aes(n), binwidth = 5, fill="#CA3697")+
  labs(title="Bike Share Trips per Hour by Station",
       subtitle= "July 1 - August 5, 2021",
       x="Trip Counts", 
       y="Number of Stations")+plotTheme2()

The plot below confirms that demand is highest during the afternoon peak hours. More people also use the bike share on Wednesdays, Thursdays, and Fridays. There is also a large discrepancy between weekday and weekend trips.

ggplot(indego_census %>% mutate(hour = hour(mdy_hm(start_time))))+
  geom_freqpoly(aes(hour, color = dotw), binwidth = 1, size=1.2)+
  scale_color_manual(values = c("#20124D","#4B2991","#A3319F", "#EA4F88", "#F89078", "#F6BA7C", "#F8DC9B"))+
  labs(title="Bike Share Trips by Day of the Week",
       subtitle = "July 1 - August 5, 2021",
       x="Hour", 
       y="Trip Counts")+
  plotTheme2()

ggplot(indego_census %>% 
         mutate(hour = hour(mdy_hm(start_time)),
                weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday")))+
     geom_freqpoly(aes(hour, color = weekend), binwidth = 1, size=1.2)+
  scale_color_manual(values = palette2)+
  labs(title="Bike Share Trips - Weekend vs Weekday",
       subtitle= "July 1 - August 5, 2021",
       x="Hour", 
       y="Trip Counts") +
  plotTheme2()

The maps below confirm these observations.

indego_points <-
  indego_census %>% 
  mutate(hour = hour(mdy_hm(start_time)),
         weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
  group_by(start_station, start_lat, start_lon, weekend, time_of_day)

ggplot()+
  geom_sf(data = phillyTracts %>%
            st_transform(crs=4326))+
  geom_point(data = indego_points %>% 
             tally(),
             aes(x=start_lon, y = start_lat, color = n), 
             fill = "transparent", alpha = 0.8, size = 1.0)+
  scale_colour_viridis(direction = -1,
                       discrete = FALSE, option = "C")+
  ylim(min(indego_points$start_lat), max(indego_points$start_lat))+
  xlim(min(indego_points$start_lon), max(indego_points$start_lon))+
  facet_grid(weekend ~ time_of_day)+
  labs(title="Bike Share Trips per Hour by Station",
       subtitle= "July 1 - August 5, 2021")+mapTheme2()

The animation below shows bike share activity from July 9 to 15, 2021.

stations<-
  indego2 %>% 
  dplyr::select(start_station, start_lon, start_lat) %>%
  group_by(start_station) %>%
  distinct()%>%
  filter(is.na(start_lon) == FALSE &
           is.na(start_lat) == FALSE) %>%
  st_as_sf(., coords = c("start_lon", "start_lat"), crs = 4326)

week28 <-
  filter(indego_census , week == 28)

week28.panel <-
  expand.grid(
    interval60 = unique(week28$interval60),
    Origin.Tract = unique(ride_panel$Origin.Tract))

bike.animation.data <-
  mutate(week28, Trip_Counter = 1) %>%
  right_join(week28) %>% 
  group_by(interval60, start_station) %>%
  summarise(Trip_Count = sum(Trip_Counter, na.rm=T)) %>% 
  ungroup() %>% 
  left_join(stations, by=c("start_station" = "start_station")) %>%
  st_sf() %>%
  mutate(Trips = case_when(Trip_Count == 0 ~ "0 trips",
                           Trip_Count == 1 ~ "1 trips",
                           Trip_Count == 2 ~ "2 trips",
                           Trip_Count > 2 & Trip_Count <= 4 ~ "3-4 trips",
                           Trip_Count > 4 ~ "5+ trips")) %>%
  mutate(Trips  = fct_relevel(Trips, "0 trips","1 trips","2 trips",
                              "3-4 trips","5+ trips"))

bike_animation <-
  ggplot() +
  geom_sf(data = phillyTracts %>%
            st_transform(crs=4326))+
  geom_sf(data = bike.animation.data, 
             aes(color = Trips), size = 2) +
  scale_color_manual(values = palette5[2:5]) +
  ylim(min(indego_points$start_lat), max(indego_points$start_lat))+
  xlim(min(indego_points$start_lon), max(indego_points$start_lon))+
  labs(title = "Bike Trips by Start Stations",
       subtitle = "Hourly intervals: {current_frame}") +
  mapTheme2()+
  transition_manual(interval60)

animate(bike_animation, duration=50, renderer = gifski_renderer())

Plotting the bike share trips per week, we can see that the demand is generally stable throughout the month, except for Week 26 / Fourth of July holiday.

ggplot()+
  geom_sf(data = phillyCensus)+
  geom_point(data = indego_census %>%
               group_by(start_station, start_lat, start_lon, week)%>%
               tally(),
             aes(x=start_lon, y = start_lat, color = n), 
             fill = "transparent", alpha = 0.8, size = 1)+
  scale_colour_viridis(direction = -1,
                       discrete = FALSE, option = "C")+
  ylim(min(indego_census$start_lat), max(indego_census$start_lat))+
  xlim(min(indego_census$start_lon), max(indego_census$start_lon))+
  facet_grid(~week)+
  labs(title="Sum of Bike Share Trips by Station and Week",
       subtitle = "July 1 - August 5, 2021")+
  mapTheme2()

Regression Models

Models

I developed five models based on my training data set, adding more and more variables to each model. The first model only includes the hour, day of the week, and temperature; the second focuses on the spatial process; the third combines variables in the first and second models; the fourth includes demographic and amenities features; and the fifth model has time and holiday lags.

reg1 <- #temporal only
  lm(Trip_Count ~  hour(interval60) + dotw + Temperature,  data=ride_Train)

reg2 <- #spatial only
  lm(Trip_Count ~  start_station + dotw + Temperature,  data=ride_Train)

reg3 <- #spatial + temporal
  lm(Trip_Count ~  start_station + hour(interval60) + dotw + Temperature + Precipitation, 
     data=ride_Train)

reg4 <- #spatial + temporal + census + amenities
  lm(Trip_Count ~  start_station +  hour(interval60) + dotw + Temperature + Precipitation + universities.nn + parks.nn + landmarks.nn + septaStops.nn + Percent_White + Mean_Commute_Time + Percent_Taking_Public_Trans + Med_Inc, 
     data=ride_Train)

reg5 <- #above plus time lags
  lm(Trip_Count ~  start_station + hour(interval60) + dotw + Temperature + Precipitation + universities.nn + parks.nn + landmarks.nn + septaStops.nn + Percent_White + Mean_Commute_Time + Percent_Taking_Public_Trans + Med_Inc + lagHour + lag2Hours +lag3Hours + lag12Hours + lag1day + holiday + holidayLag, 
     data=ride_Train) 

stargazer(reg1, reg2, reg3, reg4, reg5, type ="html", font.size = "small", single.row = TRUE)
Dependent variable:
Trip_Count
(1) (2) (3) (4) (5)
hour(interval60) 0.036*** (0.001) 0.037*** (0.001) 0.038*** (0.001) 0.012*** (0.001)
start_station -0.002*** (0.0001) -0.002*** (0.0001) -0.001*** (0.0001) -0.0004*** (0.0001)
dotw.L -0.036** (0.016) 0.006 (0.016) -0.033** (0.016) -0.032** (0.016) 0.001 (0.015)
dotw.Q 0.170*** (0.019) 0.377*** (0.018) 0.161*** (0.019) 0.162*** (0.018) 0.003 (0.017)
dotw.C -0.026* (0.015) -0.040*** (0.015) -0.032** (0.015) -0.034** (0.015) -0.034** (0.015)
dotw4 0.097*** (0.016) 0.067*** (0.016) 0.081*** (0.016) 0.082*** (0.016) 0.073*** (0.016)
dotw5 -0.082*** (0.015) -0.132*** (0.015) -0.079*** (0.015) -0.081*** (0.015) -0.084*** (0.015)
dotw6 -0.054*** (0.015) -0.034** (0.015) -0.058*** (0.015) -0.060*** (0.015) -0.029** (0.014)
Temperature 0.012*** (0.001) 0.034*** (0.001) 0.010*** (0.001) 0.011*** (0.001) -0.003*** (0.001)
Precipitation -0.069*** (0.007) -0.071*** (0.007) -0.034*** (0.006)
universities.nn -0.0001*** (0.00000) -0.00003*** (0.00000)
parks.nn 0.0001*** (0.00001) 0.00003*** (0.00001)
landmarks.nn -0.0001*** (0.00001) -0.00003*** (0.00000)
septaStops.nn -0.00001*** (0.00000) -0.00000 (0.00000)
Percent_White 0.723*** (0.039) 0.295*** (0.036)
Mean_Commute_Time 0.001*** (0.0002) 0.0004*** (0.0002)
Percent_Taking_Public_Trans -0.138 (0.101) -0.058 (0.092)
Med_Inc -0.00000*** (0.00000) -0.00000 (0.00000)
lagHour 0.223*** (0.004)
lag2Hours 0.130*** (0.004)
lag3Hours 0.083*** (0.004)
lag12Hours -0.021*** (0.004)
lag1day 0.196*** (0.004)
holiday 0.290*** (0.084)
holidayLagMinusOneDay -0.422*** (0.081)
holidayLagMinusThreeDays 0.232** (0.099)
holidayLagMinusTwoDays 0.703*** (0.099)
holidayLagPlusOneDay -0.240*** (0.081)
holidayLagPlustThreeDays -0.179* (0.099)
holidayLagPlustTwoDays -0.330*** (0.099)
Constant -0.559*** (0.084) 3.565*** (0.256) 5.035*** (0.256) 2.619*** (0.265) 1.570*** (0.242)
Observations 56,880 56,880 56,880 55,440 55,416
R2 0.046 0.033 0.057 0.108 0.270
Adjusted R2 0.046 0.032 0.057 0.108 0.269
Residual Std. Error 1.372 (df = 56871) 1.382 (df = 56871) 1.364 (df = 56869) 1.338 (df = 55421) 1.211 (df = 55385)
F Statistic 346.576*** (df = 8; 56871) 238.989*** (df = 8; 56871) 341.715*** (df = 10; 56869) 374.144*** (df = 18; 55421) 681.311*** (df = 30; 55385)
Note: p<0.1; p<0.05; p<0.01

Model Evaluation

MAE

I ran the predictions on the two-week test set. Adding the time lags to the fifth model significantly reduced its MAE.

week_predictions %>%
  dplyr::select(week, Regression, MAE) %>%
  gather(Variable, MAE, -Regression, -week) %>%
  ggplot(aes(week, MAE)) + 
  geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
  scale_fill_manual(values = palette6) +
  labs(title = "Mean Absolute Errors by Model Specification and Week") +
  plotTheme2()

The plot of the predicted and observed trips for Model E confirms that including the time lags improves the accuracy of our model.

week_predictions %>% 
  mutate(interval60 = map(data, pull, interval60),
         from_station_id = map(data, pull, start_station)) %>%
  dplyr::select(interval60, from_station_id, Observed, Prediction, Regression) %>%
  unnest() %>%
  na.omit() %>%
  gather(Variable, Value, -Regression, -interval60, -from_station_id) %>%
  group_by(Regression, Variable, interval60) %>%
  summarize(Value = sum(Value)) %>%
  ggplot(aes(interval60, Value, colour=Variable)) + 
  geom_line(size = 1.1) + 
  scale_color_manual(values = palette2)+
  facet_wrap(~Regression, ncol=1) +
  labs(title = "Predicted/Observed Bike Share Time Series", x = "Hour", y= "Station Trips") +
  plotTheme2()

However, the prediction errors are much larger for stations in Center City, which are some of the busiest commuter stations in the system.

week_predictions %>% 
  mutate(interval60 = map(data, pull, interval60),
         start_station = map(data, pull, start_station), 
         start_lat = map(data, pull, start_lat), 
         start_lon = map(data, pull, start_lon)) %>%
  select(interval60, start_station, start_lat, start_lon, Observed, Prediction, Regression) %>%
  unnest() %>%
  filter(Regression == "E_Time_Space_Amenities_TimeLags") %>%
  group_by(start_station, start_lat, start_lon) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  ggplot(.)+
  geom_sf(data = phillyCensus, color = "grey")+
  geom_point(aes(x = start_lon, y = start_lat, color = MAE), 
             fill = "transparent", alpha = 1)+
  scale_colour_viridis(direction = -1,
                       discrete = FALSE, option = "C")+
  ylim(min(indego_census$start_lat), max(indego_census$start_lat))+
  xlim(min(indego_census$start_lon), max(indego_census$start_lon))+
  labs(title="Mean Absolute Error for Model E",
       subtitle="Test Set")+
  mapTheme2()

Space Time Error Evaluation

The plots and maps below show the observed vs predicted values for different periods. The model underpredicts for higher trip counts.

week_predictions %>% 
  mutate(interval60 = map(data, pull, interval60),
         start_station = map(data, pull, start_station), 
         start_lat = map(data, pull, start_lat), 
         start_lon = map(data, pull, start_lon),
         dotw = map(data, pull, dotw)) %>%
  select(interval60, start_station, start_lat, start_lon, Observed, Prediction, Regression,
         dotw) %>%
  unnest() %>%
  filter(Regression == "E_Time_Space_Amenities_TimeLags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"))%>%
  ggplot()+
  geom_point(aes(x= Observed, y = Prediction), color="#666666")+
  geom_smooth(aes(x= Observed, y= Prediction), method = "lm", se = FALSE, color = "#CA3C97")+
  geom_abline(slope = 1, intercept = 0, color="#262626")+
  facet_grid(time_of_day~weekend)+
  labs(title="Observed vs Predicted",
       x="Observed trips", 
       y="Predicted trips")+
  plotTheme2()

The errors in the Center City stations are highest for the afternoon peak hours on weekdays and weekends as well as during the mid-day on weekends. Unsurprisingly, weekend patterns are generally less predictable since events that are happening in the city could impact demand.

week_predictions %>% 
  mutate(interval60 = map(data, pull, interval60),
         start_station = map(data, pull, start_station), 
         start_lat = map(data, pull, start_lat), 
         start_lon = map(data, pull, start_lon),
         dotw = map(data, pull, dotw) ) %>%
  select(interval60, start_station, start_lat, start_lon,Observed, Prediction, Regression,
         dotw) %>%
  unnest() %>%
  filter(Regression == "E_Time_Space_Amenities_TimeLags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  group_by(start_station, weekend, time_of_day, start_lat, start_lon) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  ggplot(.)+
  geom_sf(data = phillyCensus, color = "grey", fill = "transparent")+
  geom_point(aes(x = start_lon, y = start_lat, color = MAE), 
             fill = "transparent", size = 1, alpha = 1)+
  scale_colour_viridis(direction = -1,
                       discrete = FALSE, option = "C")+
  ylim(min(indego_census$start_lat), max(indego_census$start_lat))+
  xlim(min(indego_census$start_lon), max(indego_census$start_lon))+
  facet_grid(weekend~time_of_day)+
  labs(title="Mean Absolute Errors",
       subtitle="Test Set")+
  mapTheme2()

I focused on the afternoon rush hours to check how the model performs relative to selected demographic variables. The plots below suggest that the errors are larger for stations in areas with high median incomes, share of white residents, and low public transit ridership.

week_predictions %>% 
  mutate(interval60 = map(data, pull, interval60),
         start_station = map(data, pull, start_station), 
         start_lat = map(data, pull, start_lat), 
         start_lon = map(data, pull, start_lon),
         dotw = map(data, pull, dotw),
         Percent_Taking_Public_Trans = map(data, pull, Percent_Taking_Public_Trans),
         Med_Inc = map(data, pull, Med_Inc),
         Percent_White = map(data, pull, Percent_White)) %>%
  select(interval60, start_station, start_lat, 
         start_lon, Observed, Prediction, Regression,
         dotw, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
  unnest() %>%
  filter(Regression == "E_Time_Space_Amenities_TimeLags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush")) %>%
  filter(time_of_day == "PM Rush") %>%
  group_by(start_station, Percent_Taking_Public_Trans, Med_Inc, Percent_White) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  gather(-start_station, -MAE, key = "variable", value = "value")%>%
  ggplot(.)+
  geom_point(aes(x = value, y = MAE), alpha = 0.4)+
  geom_smooth(aes(x = value, y = MAE), color= "#CA3C97", method = "lm", se= FALSE)+
  facet_wrap(~variable, scales = "free")+
  labs(title="Errors as a Function of Socio-economic Variables",
       subtitle="PM Rush Hour",
       y="Mean Absolute Error (Trips)")+
  plotTheme2()

Cross Validation

I performed a Leave one group out cross validation (LOGO-CV), where each station takes a turn as a hold-out. The histograms below show that the errors are not clustered tightly together, indicating that the model’s predictions are inconsistent.

reg.vars <- c("dotw", "Temperature", "Precipitation", "universities.nn", "parks.nn", "landmarks.nn", "septaStops.nn", "Percent_White", "Mean_Commute_Time", "Percent_Taking_Public_Trans", "lagHour", "lag2Hours", "lag3Hours", "lag12Hours",  "lag1day", "holiday", "holidayLag") #"hour"

crossValidate <- function(dataset, id, dependentVariable, indVariables) {
  
  allPredictions <- data.frame()
  cvID_list <- unique(dataset[[id]])
  
  for (i in cvID_list) {
    
    thisFold <- i
    cat("This hold out fold is", thisFold, "\n")
    
    fold.train <- filter(dataset, dataset[[id]] != thisFold) %>% as.data.frame() %>% 
      dplyr::select(id, indVariables, dependentVariable)
    fold.test  <- filter(dataset, dataset[[id]] == thisFold) %>% as.data.frame() %>% 
      dplyr::select(id, indVariables, dependentVariable)
    
    regression <-
      glm(Trip_Count ~ ., family = "poisson", 
          data = fold.train %>% 
            dplyr::select(-id))
    
    thisPrediction <- 
      mutate(fold.test, Prediction = predict(regression, fold.test, type = "response"))
    
    allPredictions <-
      rbind(allPredictions, thisPrediction)
    
  }
  return(allPredictions)
}

reg.cv <- crossValidate(
  dataset = ride_panel,
  id = "start_station",
  dependentVariable = "Trip_Count",
  indVariables = reg.vars) %>%
  dplyr::select(cvID = start_station, Trip_Count, Prediction)

reg.summary <- mutate(reg.cv, Error = Prediction - Trip_Count,
                      Regression = "cross validation on the 5 week panel")

error_by_reg_and_fold <- 
  reg.summary %>%
  group_by(Regression, cvID) %>% 
  summarize(Mean_Error = mean(Prediction - Trip_Count, na.rm = T),
            MAE = mean(abs(Mean_Error), na.rm = T),
            SD_MAE = mean(abs(Mean_Error), na.rm = T)) %>%
  ungroup()
error_by_reg_and_fold.long <-
  error_by_reg_and_fold%>%
  gather(Vriable, Value, -cvID, -Regression)%>%
  unnest()
ggplot(error_by_reg_and_fold.long, aes(x = Value)) +
  geom_histogram(bins = 30, fill = "#CA3C97") +
  facet_wrap(~Vriable, ncol = 3, scales = "free") +
  plotTheme2()

Conclusion

From the exploratory analysis, we know that Center City stations have the highest demand and that the peak number of trips occur from around 4 pm to 7 pm, Wednesdays to Fridays. From the models developed, it is clear that adding time lags significantly improves our predictions. However, the model still underpredicts, especially for stations where the trip demand is higher. Our model also predicts poorly for weekend trips, possibly because of summer events that make it difficult to anticipate trip patterns. To balance its system, Indego should assume that demand is higher during the evening rush hour especially in Center City stations. More drivers should be dispatched to move bikes around during this period. Indego could also consider doubling rewards for IndeHeros who bring the bikes to stations that need it most.