Space-Time Prediction of Bike Share Demand in NYC

Executive Summary

In this report, a prediction model for bike share ridership is built based on bike share data of Citi Bike in NYC, taking not only spatial and temporal effects but also time lags and holiday effects into account. Generally, out model performs well in predicting the demand in the bike share system and can track the spatio-temporal component of bike usage. However, we must admit that the model yields rather high errors for active stations, and underpredict the activity especially for periods of high demand. With the result, the overall usage could be easily obtained, but we do not recommend predicting the system demand in specific stations that observe active bike activity in high-demand hours like peak. Otherwise, the underprediction may cause the insufficient docks as well as bikes in hot station, and therefore, the user dissatisfaction and loss.

1 Introduction

Bike share System gains popularity as a healthy, and low-carbon mode of travel. It is economical, energy-saving, environmentally-friendly, flexible and convenient. However, when the planning and maintaining the system, there appears great amounts of problems like bicycles re-balancing. Knowing the demand of the system ahead could be helpful to mitigate the unbanlance. In this report, a time-space predicting model is built to predict the bike activity of the system, with which, administrators can easily gain actionable insight into demand change in time and space, and therefore could better allocate the resources to those crowded stations.

2 Data Preprocessing & Exploratory Analyses

2.1 Setup

Before the analysis, some libraries are loaded and some graphic themes are set.

setwd("E:/GRAD/MUSA507/WEEK10")
library(tidyverse)
library(sf)
library(lubridate)
library(tigris)
library(tidycensus)
library(viridis)
library(riem)
library(gridExtra)
library(knitr)
library(kableExtra)
library(RSocrata)
library(dplyr)
library(gganimate)
library(gifski)
library(png)

plotTheme <- theme(
  plot.title =element_text(size=12),
  plot.subtitle = element_text(size=8),
  plot.caption = element_text(size = 6),
  axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
  axis.text.y = element_text(size = 10),
  axis.title.y = element_text(size = 10),
  # Set the entire chart region to blank
  panel.background=element_blank(),
  plot.background=element_blank(),
  #panel.border=element_rect(colour="#F0F0F0"),
  # Format the grid
  panel.grid.major=element_line(colour="#D0D0D0",size=.2),
  axis.ticks=element_blank())

mapTheme <- theme(plot.title =element_text(size=12),
                  plot.subtitle = element_text(size=8),
                  plot.caption = element_text(size = 6),
                  axis.line=element_blank(),
                  axis.text.x=element_blank(),
                  axis.text.y=element_blank(),
                  axis.ticks=element_blank(),
                  axis.title.x=element_blank(),
                  axis.title.y=element_blank(),
                  panel.background=element_blank(),
                  panel.border=element_blank(),
                  panel.grid.major=element_line(colour = 'transparent'),
                  panel.grid.minor=element_blank(),
                  legend.direction = "vertical", 
                  legend.position = "right",
                  plot.margin = margin(1, 1, 1, 1, 'cm'),
                  legend.key.height = unit(1, "cm"), legend.key.width = unit(0.2, "cm"))

palette5 <- c("#eff3ff","#bdd7e7","#6baed6","#3182bd","#08519c")

2.2 Data

2.2.1 Bike Share Data

In this report, the space-time prediction is conducted based on the bike share data in New York City in September, 2019, when the system has the highest usage in 2019. In September, the weather in New York City is relatively pleasant and recidents may use bike not only to commute but also for leisure trips.

Bike share data are download from https://s3.amazonaws.com/tripdata/index.html as .csv file. Our focus here is the station in Manhattan, and thus some data preprocessing are done. Bike trips data in .csv is first converted to simple feature with st_as_sf based on the longitude and latitude fields. By applying st_intersections to the simple feature and New York City census tracts data, bike trips happened in stations in Manhattan could be extracted. The whole processes are omitted.

#read data
bikeData <- read.csv("E:/GRAD/MUSA507/WEEK10/citibike-201909-NYC.csv")
bikeData <- bikeData %>%
  dplyr::select(-X)

In terms of the time patterns of the bike trips, we care about the hourly and daily change. Besides, a field recorded week information are created for data splitting in model building.

bikeData <- bikeData %>%
  mutate(interval60 = floor_date(ymd_hms(starttime), unit = "hour"),
         interval15 = floor_date(ymd_hms(starttime), unit = "15 mins"), #For animation
         week = week(interval60),
         dotw = wday(interval60, label=TRUE))

glimpse(bikeData)

## Observations: 1,929,874
## Variables: 19
## $ starttime               <fct> 2019-09-01 00:00:01.9580, 2019-09-01 0...
## $ stoptime                <fct> 2019-09-01 00:05:29.3410, 2019-09-01 0...
## $ tripduration            <int> 327, 1293, 1753, 613, 1571, 571, 2000,...
## $ start.station.id        <int> 3733, 3168, 3299, 486, 387, 440, 3295,...
## $ start.station.name      <fct> Avenue C & E 18 St, Central Park West ...
## $ end.station.id          <int> 504, 423, 3160, 478, 3440, 505, 492, 1...
## $ end.station.name        <fct> 1 Ave & E 16 St, W 54 St & 9 Ave, Cent...
## $ end.station.latitude    <dbl> 40.73222, 40.76585, 40.77897, 40.76030...
## $ end.station.longitude   <dbl> -73.98166, -73.98691, -73.97375, -73.9...
## $ bikeid                  <int> 39213, 15242, 38760, 32094, 38711, 316...
## $ usertype                <fct> Subscriber, Customer, Subscriber, Subs...
## $ birth.year              <int> 1968, 1969, 1990, 1992, 1969, 1995, 19...
## $ gender                  <int> 1, 0, 1, 1, 0, 1, 1, 1, 0, 1, 1, 1, 1,...
## $ start.station.latitude  <dbl> 40.73056, 40.78473, 40.78813, 40.74620...
## $ start.station.longitude <dbl> -73.97398, -73.96962, -73.95206, -73.9...
## $ interval60              <dttm> 2019-09-01, 2019-09-01, 2019-09-01, 2...
## $ interval15              <dttm> 2019-09-01, 2019-09-01, 2019-09-01, 2...
## $ week                    <dbl> 35, 35, 35, 35, 35, 35, 35, 35, 35, 35...
## $ dotw                    <ord> Sun, Sun, Sun, Sun, Sun, Sun, Sun, Sun...

2.2.2 Census Info

Census geography and variables are download with tidycensus package. Census geography is used for mapping, with which, the spatial patterns of the bike activity and performance of prediction model in each station could be easily known. Census variables are used for generalizability test.

NYCCensus <- 
  get_acs(geography = "tract", 
          variables = c("B01003_001", "B19013_001", 
                        "B02001_002", "B08013_001",
                        "B08301_001", "B08301_010"), 
          year = 2017, 
          state = "NY", 
          geometry = TRUE, 
          county=c("New York"),
          output = "wide") %>%
  rename(Total_Pop =  B01003_001E,
         Med_Inc = B19013_001E,
         Travel_Time = B08013_001E,
         White_Pop = B02001_002E,
         Means_of_Transport = B08301_001E,
         Total_Public_Trans = B08301_010E) %>%
  select(Total_Pop, Med_Inc, Travel_Time, 
         White_Pop, Means_of_Transport, 
         Total_Public_Trans, GEOID, geometry) %>%
  mutate(Percent_White = White_Pop / Total_Pop,
         Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport)

NYCTracts <- 
  NYCCensus %>%
  as.data.frame() %>%
  distinct(GEOID, .keep_all = TRUE) %>%
  select(GEOID, geometry) %>% 
  st_sf

ggplot() +
  geom_sf(data = NYCCensus, color = "grey20", fill = "transparent") + 
  labs(title="NYC Census Tracts")+
  mapTheme

Spatial information is then added to ridership data, the GEOID of census tract of start and end stations for spatial analyses.

bikeData <- st_join(bikeData %>% 
                        filter(is.na(start.station.longitude) == FALSE &
                                 is.na(start.station.latitude) == FALSE &
                                 is.na(end.station.longitude) == FALSE &
                                 is.na(end.station.latitude) == FALSE) %>%
                        st_as_sf(., coords = c("start.station.longitude", "start.station.latitude"), crs = 4326),
                      NYCTracts %>%
                        st_transform(crs=4326),
                      join=st_intersects,
                      left = TRUE) %>%
  rename(Origin.Tract = GEOID) %>%
  mutate(start.station.longitude = unlist(map(geometry, 1)),
         start.station.latitude = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  select(-geometry)%>%
  st_as_sf(., coords = c("end.station.longitude", "end.station.latitude"), crs = 4326) %>%
  st_join(., NYCTracts %>%
            st_transform(crs=4326),
          join=st_intersects,
          left = TRUE) %>%
  rename(Destination.Tract = GEOID)  %>%
  mutate(end.station.longitude = unlist(map(geometry, 1)),
         end.station.latitude = unlist(map(geometry, 2)))%>%
  as.data.frame() %>%
  select(-geometry)

2.2.3 Weather Data

Weather data in NYC is downloaded by riem_measures. The location of the station is in central park. Weather data including temperature, precipitation and wind speed on an hourly basis are obtained and shown in plot below.

As shown in the figure, the weather in September in NYC is rather pleasant, with a temperature approximately ranges from 60 to 80 degrees Fahrenheit (about 15 to 27 degree centigrade), and little rain.

weather.Panel <- 
  riem_measures(station = "NYC", date_start = "2019-09-01", date_end = "2019-09-30") %>%
  replace(is.na(.), 0) %>%
  mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
  mutate(week = week(interval60),
         dotw = wday(interval60, label=TRUE)) %>%
  group_by(interval60) %>%
  summarize(Temperature = max(tmpf),
            Precipitation = sum(p01i),
            Wind_Speed = max(sknt)) %>%
  mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))

#plot 
grid.arrange(
  ggplot(weather.Panel, aes(interval60,Precipitation)) + geom_line() + 
    labs(title="Percipitation", x="Hour", y="Perecipitation") + plotTheme,
  ggplot(weather.Panel, aes(interval60,Wind_Speed)) + geom_line() + 
    labs(title="Wind Speed", x="Hour", y="Wind Speed") + plotTheme,
  ggplot(weather.Panel, aes(interval60,Temperature)) + geom_line() + 
    labs(title="Temperature", x="Hour", y="Temperature") + plotTheme,
  top="Weather Data - New York City - September, 2019")

2.3 Create Space-time Panel

Here a space-time panel is created. In the panel, the study.panel is like an empty table to record the activity of each station for each time intervals.

study.panel <- 
  expand.grid(interval60=unique(bikeData$interval60), 
              start.station.id = unique(bikeData$start.station.id)) %>%
  left_join(., bikeData %>%
              select(start.station.id, start.station.name, Origin.Tract, start.station.longitude, start.station.latitude )%>%
              distinct() %>%
              group_by(start.station.id))

The full panel is then created by summarizing counts by station for each time interval. The weather data is joined to the panel by time interval info. Census data is added into the panel based on the census information.

ride.panel <- 
  bikeData %>%
  mutate(Trip_Counter = 1) %>%
  right_join(study.panel) %>% 
  group_by(interval60, start.station.id, start.station.name, Origin.Tract, start.station.longitude, start.station.latitude) %>%
  summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
  left_join(weather.Panel, by='interval60') %>%
  ungroup() %>%
  filter(is.na(start.station.id) == FALSE) %>%
  mutate(week = week(interval60),
         dotw = wday(interval60, label = TRUE)) %>%
  filter(is.na(Origin.Tract) == FALSE)

## Joining, by = c("start.station.id", "start.station.name", "interval60", "Origin.Tract", "start.station.longitude", "start.station.latitude")

ride.panel <- 
  left_join(ride.panel, NYCCensus %>%
              as.data.frame() %>%
              select(-geometry), by = c("Origin.Tract" = "GEOID"))

2.4 Create Time Lags

Time lags are added to help us gain insight into the nuance about the bike activity in certain time slots. Besides, a dummy variable holidayLag is created to control the holiday influence on the demand during the three day weekend due to the Labor day. Since it is on Sep 2^nd, we consider two days before and after the holiday.

ride.panel <- 
  ride.panel %>% 
  arrange(start.station.id, interval60) %>% 
  mutate(lagHour = dplyr::lag(Trip_Count,1),
         lag2Hours = dplyr::lag(Trip_Count,2),
         lag3Hours = dplyr::lag(Trip_Count,3),
         lag4Hours = dplyr::lag(Trip_Count,4),
         lag12Hours = dplyr::lag(Trip_Count,12),
         lag1day = dplyr::lag(Trip_Count,24),
         holiday = ifelse(yday(interval60) == 245,1,0)) %>%
  mutate(day = yday(interval60)) %>%
  mutate(holidayLag = case_when(dplyr::lag(holiday, 1) == 1 ~ "PlusOneDay",
                                dplyr::lag(holiday, 2) == 1 ~ "PlustTwoDays",
                                dplyr::lead(holiday, 1) == 1 ~ "MinusOneDay",
                                dplyr::lead(holiday, 2) == 1 ~ "MinusTwoDays"),
         holidayLag = replace_na(holidayLag, 0))

We then evaluate the correlation between these time lags and the trips count, and it is obvious that there’s a Pearson’s R of 0.84 for lagHour and lag1day respectively, and a Pearson’s R of 0.57 for lag2Hours, which indicates that the relationships are vary strong. And thus, we would like to take them into accounts.

as.data.frame(ride.panel) %>%
  group_by(interval60) %>% 
  summarise_at(vars(c(starts_with("lag")), "Trip_Count"), mean, na.rm = TRUE) %>%
  gather(Variable, Value, -interval60, -Trip_Count) %>%
  mutate(Variable = factor(Variable, levels=c("lagHour","lag2Hours","lag3Hours","lag4Hours",
                                              "lag12Hours","lag1day")))%>%
  group_by(Variable) %>%  
  summarize(Correlation = round(cor(Value, Trip_Count),2)) %>%
  kable(caption = "MAE by models") %>%
  kable_styling("striped", full_width = F,position = "center") %>%
  row_spec(2, color = "white", background = "#33638dff") %>%
  row_spec(4, color = "white", background = "#33638dff") %>%
  row_spec(6, color = "white", background = "#33638dff")

2.5 Exploratory Analyses

In this section, we explore the data and gain insight into the spatial extent, temporal pattern and the relationship of bike activity to one or more dependent variables.

First, let’s focus on the overtime pattern of ridership in September, 2019. There is a clear periodicity, where there are peak in rush hours on weekday and lull periods on weekends. Besides, there is a obvious reduce in system usage on the Labor day holiday.

ggplot(bikeData %>%
         group_by(interval60) %>%
         tally())+
  geom_line(aes(x = interval60, y = n),size=.7)+
  labs(title="Bike share trips per hr. NYC, September, 2019",
       x="Date", 
       y="Number of trips")+
  plotTheme

Then, let’s dive into the demand on weekday and weekend. From the plot here, we can see there is two huge peaks in rush hours and small peak at midnight and noon on weekday. The peak at PM lasts longer than AM.

bikeData <- mutate(bikeData, hr = hour(starttime),
                   weekend=ifelse(bikeData$dotw %in% c('Sat', 'Sun'), 'Weekend', 'Weekday'))

grid.arrange(
ggplot(data=bikeData) +
  geom_freqpoly(aes(hr, col=dotw), binwidth=1, size=1) +
  labs(title="Bike share trips in NYC, by day of the week, September, 2019",
       x="Hour", 
       y="Trip Counts")+
  plotTheme,
ggplot(data=bikeData) +
  geom_freqpoly(aes(hr, col=weekend), binwidth=1, size=1) +
  labs(title="Bike share trips in NYC - weekend vs weekday, September, 2019",
       x="Hour", 
       y="Trip Counts")+
  plotTheme,
ncol=2)

Let’s now look at the distribution of trip volume by station for different times of the day. It is clear that at daytime, the number of staions with low bike activity is much smaller than that at night. In peak hours, there are more stations who have high number of trips than in other time of the day.

bikeData <- mutate(bikeData, time_of_day = 
                     case_when(hour(interval60) < 7 | hour(interval60) > 19 ~ "Overnight",
                               hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                               hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                               hour(interval60) >= 15 & hour(interval60) <= 19 ~ "PM Rush"))

bikeData %>%
  group_by(interval60, start.station.name, time_of_day) %>%
  tally()%>%
  group_by(start.station.name, time_of_day)%>%
  summarize(mean_trips = mean(n))%>%
  ggplot()+
  geom_histogram(aes(mean_trips), binwidth = 1, fill="#1d3863")+
  labs(title="Mean Number of Hourly Trips Per Station. NYC, September, 2019",
       x="Number of trips", 
       y="Frequency")+
  facet_wrap(~time_of_day)+
  plotTheme

How about the spatial distribution of those high-demand stations? Figure below tells the answer. In short, the bike activity varies across the space and time. High demand in center and lower Manhattan in rush hours in weekday. Here are some details.

Most of the crowded stations cluster in center and lower Manhattan. Notably, there is a dark point overnight indicating the high demand at this station, and after checking the map, it is in Chealsea. For the bike activity on weekend, more trips are taken on mid-day, unlike that in weekday.

ggplot()+
  geom_sf(data = NYCCensus %>%
            st_transform(crs=4326))+
  geom_point(data = bikeData %>% 
               group_by(start.station.id, start.station.latitude, start.station.longitude, weekend, time_of_day) %>%
               tally(),
             aes(x=start.station.longitude, y = start.station.latitude, color = n), 
             fill = "transparent", alpha = 1, size = 1)+
  scale_colour_viridis(direction = -1,
                       discrete = FALSE, option = "D")+
  ylim(min(bikeData$start.station.latitude), max(bikeData$start.station.latitude))+
  xlim(min(bikeData$start.station.longitude), max(bikeData$start.station.longitude))+
  facet_grid(weekend ~ time_of_day)+
  labs(title="Bike share trips per hr by station. NYC, September, 2019")+
  mapTheme

An animate is created based on the 15-minute bike acitivity on census tract level on a Tuesday. From the gif, we can easily see the spatio-temporal patterns of Citi Bike usage in NYC. More trips taken in rush hours and in central city.

ride.panel2 <- 
  bikeData %>%
  mutate(Trip_Counter = 1) %>%
  right_join(study.panel) %>% 
  group_by(interval15, start.station.id, start.station.name, Origin.Tract, start.station.longitude, start.station.latitude) %>%
  summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
  ungroup() %>%
  filter(is.na(start.station.id) == FALSE) %>%
  mutate(week = week(interval15),
         dotw = wday(interval15, label = TRUE)) %>%
  filter(is.na(Origin.Tract) == FALSE)

ride.panel2 <- 
  left_join(ride.panel2, NYCCensus %>%
              as.data.frame() %>%
              select(-geometry), by = c("Origin.Tract" = "GEOID"))

week37 <-
  ride.panel2 %>%
  filter(week == 37 & dotw == "Tue")

week37.panel <-
  expand.grid(
    interval15=unique(week37$interval15),
    Origin.Tract = unique(week37$Origin.Tract))

ride.animation.data <-
  week37 %>%
  left_join(NYCTracts,by=c("Origin.Tract" = "GEOID")) %>%
  st_sf() %>%
  na.omit() %>%
  mutate(Trips = case_when(Trip_Count == 0 ~ "0 trips",
                           Trip_Count > 0 & Trip_Count <= 5 ~ "1-5 trips",
                           Trip_Count > 5 & Trip_Count <= 10 ~ "6-10 trips",
                           Trip_Count > 10 & Trip_Count <= 20 ~ "11-20 trips",
                           Trip_Count > 20 ~ "21+ trips")) %>%
  mutate(Trips  = factor(Trips, levels=c("0 trips","1-5 trips","6-10 trips","11-20 trips","21+ trips")))

rideshare_animation <-
  ggplot() +
  geom_sf(data = NYCCensus, color = "grey30", fill = "transparent")+
  geom_sf(data=ride.animation.data, aes(fill=Trips)) +
  scale_fill_manual(values = palette5) +
  labs(title = "Rideshare pickups for one day in November 2019, NYC",
       subtitle = "15 minute intervals: {current_frame}") +
  transition_manual(interval15) +
  mapTheme

animate(rideshare_animation, duration=20)

It is easy to think about the great influence weather posing on the bike activity. Plots above explore the relationship between demand and wind speed, temperature and precipitation, respectively. To better see the influence, I reclassify the incontinuous numerical variable into category variables. (Categories only used for plotting)

From the plots, we can concluded that:

Demand decreases as wind speed increase, many trips were completed when the wind is light, i.e., wind speed < 5.5 m/s.
Demand greatly increases when temperature is pleasant, ranging from 60 to 80 degree Fahrenheit. People prefer riding in warm rather than in cold weather.
Demand decreases during rainy days, even though the precipitation is small.

Since there are little rain in September in 2019, I only focus on the Temperature and Wind_Speed when building models.

grid.arrange(
ride.panel %>% 
  na.omit() %>%
  group_by(Wind_Speed, Trip_Count) %>% 
  summarise(count=sum(Trip_Count)) %>%
  mutate(Windms = Wind_Speed*0.5144,
         Wind = case_when(Wind_Speed<=5.5~"Light",
                          Wind_Speed>5.5&Wind_Speed<=11~"Moderate",
                          Wind_Speed>11~"Gale")) %>%
  ggplot(aes(Wind, count)) +
  geom_histogram(position = "stack", stat = "summary", fun.y = "mean", na.rm = TRUE, fill="#336387ff") +
  scale_x_discrete(limits=c("Light", "Moderate", "Gale")) +
  plotTheme,

ride.panel %>% 
  na.omit() %>%
  mutate(Temp = case_when(Temperature<60.0~"Cold",Temperature>=60.0&Temperature<70.0~"Cool",
                             Temperature>=70.0&Temperature<80.0~"Mild",Temperature>=80.0~"Hot")) %>%
  group_by(Temp, Trip_Count) %>% 
  summarise(count=sum(Trip_Count)) %>%
  ggplot(aes(Temp, count)) +
  geom_histogram(position = "stack", stat = "summary", fun.y = "mean", na.rm = TRUE, fill="#29af7fff")+
  scale_x_discrete(limits=c("Cold", "Cool", "Mild", "Hot")) +
  plotTheme,

ride.panel %>% 
  na.omit() %>%
  group_by(Precipitation, Trip_Count) %>% 
  summarise(count=sum(Trip_Count)) %>% 
  mutate(Preci = case_when(Precipitation<0.001~"No rain",Precipitation>=0.001~"Wet")) %>%
  ggplot(aes(Preci, count)) +
  geom_histogram(position = "stack", stat = "summary", fun.y = "mean", na.rm = TRUE, fill="#dce319ff") +
  plotTheme, 
ncol=3)

3 Modeling

Before running the model, we first split the dataset into traning and test dataset, by the week info. Here, since we would include the holidayLag in our model and the holiday is on Sep 2^nd, to avoid the error in prediction caused by the unique fixed effects, we use the beginning of the month to predict the end of the month.

ride.Train <- filter(ride.panel, week < 38)
ride.Test <- filter(ride.panel, week >= 38)

3.1 Models Building

From what has been discussed in section 2.6, we find that the bike activity shows obvious spatial and temporal patterns. Based on the observations, we build in total 5 models.

In the reg1 model, we only focus on the time effects, including hour fixed effects and week fixed effects. Temperature is included.
In the reg2 model, spatial pattern is added by replacing the hour fixed effect by the station info start.station.id.
In the reg3 model, both time and space fixed effects are included.
In the reg4 model, time lags are added into the model.
In the reg5 model, holiday and holidayLag effects are taken into accounts.

#timeOnly
reg1 <- 
  lm(Trip_Count ~  hour(interval60) + dotw + Temperature,  data=ride.Train)
#spaceOnly
reg2 <- 
  lm(Trip_Count ~  start.station.name + dotw + Temperature,  data=ride.Train)
#time and space
reg3 <- 
  lm(Trip_Count ~  start.station.name + hour(interval60) + dotw + Temperature + Precipitation, 
     data=ride.Train)
#time lag added
reg4 <- 
  lm(Trip_Count ~  start.station.name +  hour(interval60) + dotw + Temperature + Wind_Speed +
       lagHour + lag2Hours + lag12Hours + lag1day, 
     data=ride.Train)
#holiday lag added
reg5 <- 
  lm(Trip_Count ~  Origin.Tract + hour(interval60) + dotw + Temperature + Wind_Speed +
       lagHour + lag2Hours +lag12Hours + lag1day + holidayLag + holiday, 
     data=ride.Train)

3.2 Prediction

To better handle the great dataset, nested data frame of test data is created by week. A new function model_pred is created which we can map onto each data frame in the nested structure.

#predictTestData,warning
ride.Test.weekNest <- 
  ride.Test %>%
  nest(-week) 

#predict function
model_pred <- function(dat, fit){
  pred <- predict(fit, newdata = dat)}

week_predictions <- 
  ride.Test.weekNest %>% 
  mutate(ATime_FE = map(.x = data, fit = reg1, .f = model_pred),
         BSpace_FE = map(.x = data, fit = reg2, .f = model_pred),
         CTime_Space_FE = map(.x = data, fit = reg3, .f = model_pred),
         DTime_Space_FE_timeLags = map(.x = data, fit = reg4, .f = model_pred),
         ETime_Space_FE_timeLags_holidayLags = map(.x = data, fit = reg5, .f = model_pred)) %>% 
  gather(Regression, Prediction, -data, -week) %>%
  mutate(Observed = map(data, pull, Trip_Count),
         Absolute_Error = map2(Observed, Prediction,  ~ abs(.x - .y)),
         MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
         sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))

3.3 Models Comparison

Both model 4 and 5 perform well. Since there is no holiday in 38 and 39 weeks, the model with holiday-related variables fail to play its role.

There is an obvious reduce in MAE after adding time lags indicating it is a strong predictor.

week_predictions %>%
    dplyr::select(week, Regression, MAE) %>%
    gather(Variable, MAE, -Regression, -week) %>%
    ggplot(aes(week, MAE)) + 
    geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
    scale_fill_manual(values = palette5) +
    labs(title = "Mean Absolute Errors by model specification and week") +
    plotTheme

Table below offer MAE (Mean Absolute Error) values of 5 models, and in terms of the MAE, we can tell that the model 4 DTime_Space_FE_timeLages accounting for time and space effects, as well as time lags is the better one. The MAE value of this model is 3.15 and 3.01 for week 38 and 39; the average observed trips count is 6.99 and 6.71 relatively.

week_predictions %>%
  dplyr::select(week, Regression, MAE, sd_AE) %>%
  group_by(Regression, week) %>%
  kable(caption = "MAE by models") %>%
  kable_styling("striped", full_width = F) %>%
  row_spec(c(1,2), color = "white", background = "#19b3b1") %>%
  row_spec(c(3,4), color = "white", background = "#449cd4") %>%
  row_spec(c(5,6), color = "white", background = "#216583") %>%
  row_spec(c(7,8), color = "white", background = "#484085") %>%
  row_spec(c(9,10), color = "white", background = "#283b42")

MAE by models
week	Regression	MAE	sd_AE
38	ATime_FE	5.956063	7.637216
39	ATime_FE	5.918637	7.263762
38	BSpace_FE	5.446165	7.087116
39	BSpace_FE	5.321141	6.785044
38	CTime_Space_FE	5.109810	6.925424
39	CTime_Space_FE	5.045905	6.615511
38	DTime_Space_FE_timeLags	3.150144	4.567026
39	DTime_Space_FE_timeLags	3.012247	4.063679
38	ETime_Space_FE_timeLags_holidayLags	3.181455	4.551873
39	ETime_Space_FE_timeLags_holidayLags	3.016955	4.060218

Here, we do not consider MAPE (Mean Absolute Percentage Error) as acriterion for model selection. MAPE is calculated by dividing the error by the predicted value, which makes it misleading and biased. To put it simple, in our prediction, there are some predicted values very close to 0, while makes the MAPE much higher than 1 and thus the MAPE make no sense. Besides, for too low forecast, the MAPE would be infinite. What’s more, for example, when the error is 50, and the predicted value is 100 and 150, the MAPE would be so different, and this criterion push people to overpredciting model. Whereas, MAE, together with the average observed value, could provide a relative fair comparison.

Time-series predicted and observed bike activity of each model are plotted. Again, time lags are crucial predictors in the prediction model. However, it is easily to note that model 4 and 5 perform badly when predicting the ride trips in rush hours, or it generate greater errors for high demand predicting. For some low-demand hours, the prediction is also relatively inaccurate.

week_predictions %>% 
  mutate(interval60 = map(data, pull, interval60),
         start.station.id = map(data, pull, start.station.id)) %>%
  dplyr::select(interval60, start.station.id, Observed, Prediction, Regression) %>%
  unnest() %>%
  gather(Variable, Value, -Regression, -interval60, -start.station.id) %>%
  group_by(Regression, Variable, interval60) %>%
  summarize(Value = sum(Value)) %>%
  ggplot(aes(interval60, Value, colour=Variable)) + 
  geom_line(size = 1.1) + 
  facet_wrap(~Regression, ncol=1) +
  labs(title = "Predicted/Observed bike share time series", subtitle = "NYC; A test set of 2 weeks",  x = "Hour", y= "Station Trips") +
  plotTheme

Obviously, the comparison results tell us that the model considering the space and time effects and time lags is the best model. But given that model 4 and 5 have similar MAEs and the practical prediction for bike activity in other months usually fail to avoid the holiday effects, model 5 would be a better choice. In the next section, we will evaluate the performance of model 5 ETime_Space_FE_timeLages_holidayLags in bike trips prediction.

4 Model Evaluation

In this section, we will evaluate our model by exploring its spatial, temporal and demographic dimensions of the error, and the generalizability is tested with census data. Moreover, some improvement could be done is also discussed here.

4.1 Spatio-temperal patterns of MAEs

We first plot predicted values against observed values for different times of day during the week and weekend. It is obvious that our model underpredict the ridership. Specifically, the model perform well for bike activity in less busy hours like overnight and mide-day usage. It fails to handle high-demand prediction. But how the error varies across space?

week_predictions %>% 
  mutate(interval60 = map(data, pull, interval60),
         start.station.id = map(data, pull, start.station.id), 
         start.station.latitude = map(data, pull, start.station.latitude), 
         start.station.longitude = map(data, pull, start.station.longitude),
         dotw = map(data, pull, dotw)) %>%
  select(interval60, start.station.id, start.station.longitude, 
         start.station.latitude, Observed, Prediction, Regression,
         dotw) %>%
  unnest() %>%
  filter(Regression == "ETime_Space_FE_timeLags_holidayLags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 19 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 19 ~ "PM Rush"))%>%
  ggplot()+
  geom_point(aes(x= Observed, y = Prediction))+
  geom_smooth(aes(x= Observed, y= Prediction), method = "lm", se = FALSE, color = "#48bcfa")+
  geom_abline(slope = 1, intercept = 0)+
  facet_grid(time_of_day~weekend)+
  labs(title="Observed vs Predicted",
       x="Observed trips", 
       y="Predicted trips")+
  plotTheme

Figure below show the spatial dimensions of the MAE. Recall the bike activity by station we mapped before. Comparing two maps, it is obvious that the stations that are observed in high demand seem to have a high MAE. Thus, we can conclude that our model perform badly for those active stations.

week_predictions %>% 
  mutate(interval60 = map(data, pull, interval60),
         start.station.id = map(data, pull, start.station.id), 
         start.station.latitude = map(data, pull, start.station.latitude), 
         start.station.longitude = map(data, pull, start.station.longitude)) %>%
  select(interval60, start.station.id, start.station.longitude, start.station.latitude, Observed, Prediction, Regression) %>%
  unnest() %>%
  filter(Regression == "ETime_Space_FE_timeLags_holidayLags") %>%
  group_by(start.station.id, start.station.longitude, start.station.latitude) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  ggplot(.)+
  geom_sf(data = NYCCensus, color = "grey30", fill = "transparent")+
  geom_point(aes(x = start.station.longitude, y = start.station.latitude, color = MAE), 
             fill = "transparent", alpha = 0.8)+
  scale_colour_viridis(direction = -1,
                       discrete = FALSE, option = "D")+
  ylim(min(bikeData$start.station.latitude), max(bikeData$start.station.latitude))+
  xlim(min(bikeData$start.station.longitude), max(bikeData$start.station.longitude))+
  labs(title="Mean Abs Error, Test Set, Model 5")+
  mapTheme

Now let’s look at the spatio-temporal distribution of the MAEs. The MAEs are particularly high for prediction of hot stations in rush hours on weekday, and in daytime on weekend. These inaccurate prediction happen in mid and lower Manhattan.

week_predictions %>% 
  mutate(interval60 = map(data, pull, interval60),
         start.station.id = map(data, pull, start.station.id), 
         start.station.latitude = map(data, pull, start.station.latitude), 
         start.station.longitude = map(data, pull, start.station.longitude),
         dotw = map(data, pull, dotw) ) %>%
  select(interval60, start.station.id, start.station.longitude, 
         start.station.latitude, Observed, Prediction, Regression,
         dotw) %>%
  unnest() %>%
  filter(Regression == "ETime_Space_FE_timeLags_holidayLags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 19 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 19 ~ "PM Rush")) %>%
  group_by(start.station.id, weekend, time_of_day, start.station.longitude, start.station.latitude) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  ggplot(.)+
  geom_sf(data = NYCCensus, color = "grey30", fill = "transparent")+
  geom_point(aes(x = start.station.longitude, y = start.station.latitude, color = MAE), 
             fill = "transparent", size = 1, alpha = 0.8)+
  scale_colour_viridis(direction = -1,
                       discrete = FALSE, option = "D")+
  ylim(min(bikeData$start.station.latitude), max(bikeData$start.station.latitude))+
  xlim(min(bikeData$start.station.longitude), max(bikeData$start.station.longitude))+
  facet_grid(weekend~time_of_day)+
  labs(title="Mean Absolute Errors, Test Set")+
  mapTheme

4.2 Generalizability Test

Let’s focus on the morning commuters. Basically, there appear trends that model performs relatively worse for high-income, low-transit-usage, majority-White, and high-travel-time communities.

week_predictions %>% 
  mutate(interval60 = map(data, pull, interval60),
         start.station.id = map(data, pull, start.station.id), 
         start.station.latitude = map(data, pull, start.station.latitude), 
         start.station.longitude = map(data, pull, start.station.longitude),
         dotw = map(data, pull, dotw),
         Percent_Taking_Public_Trans = map(data, pull, Percent_Taking_Public_Trans),
         Med_Inc = map(data, pull, Med_Inc),
         Percent_White = map(data, pull, Percent_White),
         Travel_Time = map(data, pull, Travel_Time)) %>%
  select(interval60, start.station.id, start.station.longitude, 
         start.station.latitude, Observed, Prediction, Regression,
         dotw, Percent_Taking_Public_Trans, Med_Inc, Percent_White,Travel_Time) %>%
  unnest() %>%
  filter(Regression == "ETime_Space_FE_timeLags_holidayLags")%>%
  mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
         time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 19 ~ "Overnight",
                                 hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
                                 hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
                                 hour(interval60) >= 15 & hour(interval60) <= 19 ~ "PM Rush")) %>%
  filter(time_of_day == "AM Rush") %>%
  group_by(start.station.id, Percent_Taking_Public_Trans, Med_Inc, Percent_White,Travel_Time) %>%
  summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
  gather(-start.station.id, -MAE, key = "variable", value = "value")%>%
  ggplot(.)+
  geom_point(aes(x = value, y = MAE), alpha = 0.4)+
  geom_smooth(aes(x = value, y = MAE), method = "lm", se= FALSE)+
  facet_wrap(~variable, scales = "free")+
  labs(title="Errors as a function of socio-economic variables",
       y="Mean Absolute Error (Trips)")+
  plotTheme

In general, our model fail to be generalizable in space, time and demography. Thus, readers are recommend to predict the usage in stations with low demand, in time slots when system is less crowded. The model generate acceptable results in community where the percent of White is low; people have low income, high transit usage, and low travel time.

5 Discussion

From what has been discussed above, the model underaccess for periods of high demand but could sense the spatio-temporal cahnge in bike activity. Thus, the model should be used to predict station with moderate usage and should be improved if to used for prediction in rush hours. Besides, it works well in communities where people are mainly in minority and have low median household income, high transit usage, as well as low travel time.

Here provided some possible improvements:

For each stations, considering the user group and add user information as new predictor.
More city information about the NYC and the land use map would be helpful in understanding the sptial patterns of the errors.
Bike share could help solve ‘the last mile’ problems and thus, the distance of the station to transit stops may be important predictor.
Taking distance of the station to city infrastructure, like shopping malls, parks into accout.
Build different models for prediction in stations of different demand, in different time slots, or in weekday and weekend based on one’s focus.
To use more complexed regression models.