In this report, a prediction model for bike share ridership is built based on bike share data of Citi Bike in NYC, taking not only spatial and temporal effects but also time lags and holiday effects into account. Generally, out model performs well in predicting the demand in the bike share system and can track the spatio-temporal component of bike usage. However, we must admit that the model yields rather high errors for active stations, and underpredict the activity especially for periods of high demand. With the result, the overall usage could be easily obtained, but we do not recommend predicting the system demand in specific stations that observe active bike activity in high-demand hours like peak. Otherwise, the underprediction may cause the insufficient docks as well as bikes in hot station, and therefore, the user dissatisfaction and loss.
Bike share System gains popularity as a healthy, and low-carbon mode of travel. It is economical, energy-saving, environmentally-friendly, flexible and convenient. However, when the planning and maintaining the system, there appears great amounts of problems like bicycles re-balancing. Knowing the demand of the system ahead could be helpful to mitigate the unbanlance. In this report, a time-space predicting model is built to predict the bike activity of the system, with which, administrators can easily gain actionable insight into demand change in time and space, and therefore could better allocate the resources to those crowded stations.
Before the analysis, some libraries are loaded and some graphic themes are set.
setwd("E:/GRAD/MUSA507/WEEK10")
library(tidyverse)
library(sf)
library(lubridate)
library(tigris)
library(tidycensus)
library(viridis)
library(riem)
library(gridExtra)
library(knitr)
library(kableExtra)
library(RSocrata)
library(dplyr)
library(gganimate)
library(gifski)
library(png)
plotTheme <- theme(
plot.title =element_text(size=12),
plot.subtitle = element_text(size=8),
plot.caption = element_text(size = 6),
axis.text.x = element_text(size = 10, angle = 45, hjust = 1),
axis.text.y = element_text(size = 10),
axis.title.y = element_text(size = 10),
# Set the entire chart region to blank
panel.background=element_blank(),
plot.background=element_blank(),
#panel.border=element_rect(colour="#F0F0F0"),
# Format the grid
panel.grid.major=element_line(colour="#D0D0D0",size=.2),
axis.ticks=element_blank())
mapTheme <- theme(plot.title =element_text(size=12),
plot.subtitle = element_text(size=8),
plot.caption = element_text(size = 6),
axis.line=element_blank(),
axis.text.x=element_blank(),
axis.text.y=element_blank(),
axis.ticks=element_blank(),
axis.title.x=element_blank(),
axis.title.y=element_blank(),
panel.background=element_blank(),
panel.border=element_blank(),
panel.grid.major=element_line(colour = 'transparent'),
panel.grid.minor=element_blank(),
legend.direction = "vertical",
legend.position = "right",
plot.margin = margin(1, 1, 1, 1, 'cm'),
legend.key.height = unit(1, "cm"), legend.key.width = unit(0.2, "cm"))
palette5 <- c("#eff3ff","#bdd7e7","#6baed6","#3182bd","#08519c")
Census geography and variables are download with tidycensus
package. Census geography is used for mapping, with which, the spatial patterns of the bike activity and performance of prediction model in each station could be easily known. Census variables are used for generalizability test.
NYCCensus <-
get_acs(geography = "tract",
variables = c("B01003_001", "B19013_001",
"B02001_002", "B08013_001",
"B08301_001", "B08301_010"),
year = 2017,
state = "NY",
geometry = TRUE,
county=c("New York"),
output = "wide") %>%
rename(Total_Pop = B01003_001E,
Med_Inc = B19013_001E,
Travel_Time = B08013_001E,
White_Pop = B02001_002E,
Means_of_Transport = B08301_001E,
Total_Public_Trans = B08301_010E) %>%
select(Total_Pop, Med_Inc, Travel_Time,
White_Pop, Means_of_Transport,
Total_Public_Trans, GEOID, geometry) %>%
mutate(Percent_White = White_Pop / Total_Pop,
Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport)
NYCTracts <-
NYCCensus %>%
as.data.frame() %>%
distinct(GEOID, .keep_all = TRUE) %>%
select(GEOID, geometry) %>%
st_sf
ggplot() +
geom_sf(data = NYCCensus, color = "grey20", fill = "transparent") +
labs(title="NYC Census Tracts")+
mapTheme
Spatial information is then added to ridership data, the GEOID of census tract of start and end stations for spatial analyses.
bikeData <- st_join(bikeData %>%
filter(is.na(start.station.longitude) == FALSE &
is.na(start.station.latitude) == FALSE &
is.na(end.station.longitude) == FALSE &
is.na(end.station.latitude) == FALSE) %>%
st_as_sf(., coords = c("start.station.longitude", "start.station.latitude"), crs = 4326),
NYCTracts %>%
st_transform(crs=4326),
join=st_intersects,
left = TRUE) %>%
rename(Origin.Tract = GEOID) %>%
mutate(start.station.longitude = unlist(map(geometry, 1)),
start.station.latitude = unlist(map(geometry, 2)))%>%
as.data.frame() %>%
select(-geometry)%>%
st_as_sf(., coords = c("end.station.longitude", "end.station.latitude"), crs = 4326) %>%
st_join(., NYCTracts %>%
st_transform(crs=4326),
join=st_intersects,
left = TRUE) %>%
rename(Destination.Tract = GEOID) %>%
mutate(end.station.longitude = unlist(map(geometry, 1)),
end.station.latitude = unlist(map(geometry, 2)))%>%
as.data.frame() %>%
select(-geometry)
Weather data in NYC is downloaded by riem_measures
. The location of the station is in central park. Weather data including temperature, precipitation and wind speed on an hourly basis are obtained and shown in plot below.
As shown in the figure, the weather in September in NYC is rather pleasant, with a temperature approximately ranges from 60 to 80 degrees Fahrenheit (about 15 to 27 degree centigrade), and little rain.
weather.Panel <-
riem_measures(station = "NYC", date_start = "2019-09-01", date_end = "2019-09-30") %>%
replace(is.na(.), 0) %>%
mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
mutate(week = week(interval60),
dotw = wday(interval60, label=TRUE)) %>%
group_by(interval60) %>%
summarize(Temperature = max(tmpf),
Precipitation = sum(p01i),
Wind_Speed = max(sknt)) %>%
mutate(Temperature = ifelse(Temperature == 0, 42, Temperature))
#plot
grid.arrange(
ggplot(weather.Panel, aes(interval60,Precipitation)) + geom_line() +
labs(title="Percipitation", x="Hour", y="Perecipitation") + plotTheme,
ggplot(weather.Panel, aes(interval60,Wind_Speed)) + geom_line() +
labs(title="Wind Speed", x="Hour", y="Wind Speed") + plotTheme,
ggplot(weather.Panel, aes(interval60,Temperature)) + geom_line() +
labs(title="Temperature", x="Hour", y="Temperature") + plotTheme,
top="Weather Data - New York City - September, 2019")
Here a space-time panel is created. In the panel, the study.panel
is like an empty table to record the activity of each station for each time intervals.
study.panel <-
expand.grid(interval60=unique(bikeData$interval60),
start.station.id = unique(bikeData$start.station.id)) %>%
left_join(., bikeData %>%
select(start.station.id, start.station.name, Origin.Tract, start.station.longitude, start.station.latitude )%>%
distinct() %>%
group_by(start.station.id))
The full panel is then created by summarizing counts by station for each time interval. The weather data is joined to the panel by time interval info. Census data is added into the panel based on the census information.
ride.panel <-
bikeData %>%
mutate(Trip_Counter = 1) %>%
right_join(study.panel) %>%
group_by(interval60, start.station.id, start.station.name, Origin.Tract, start.station.longitude, start.station.latitude) %>%
summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
left_join(weather.Panel, by='interval60') %>%
ungroup() %>%
filter(is.na(start.station.id) == FALSE) %>%
mutate(week = week(interval60),
dotw = wday(interval60, label = TRUE)) %>%
filter(is.na(Origin.Tract) == FALSE)
## Joining, by = c("start.station.id", "start.station.name", "interval60", "Origin.Tract", "start.station.longitude", "start.station.latitude")
Time lags are added to help us gain insight into the nuance about the bike activity in certain time slots. Besides, a dummy variable holidayLag
is created to control the holiday influence on the demand during the three day weekend due to the Labor day. Since it is on Sep 2nd, we consider two days before and after the holiday.
ride.panel <-
ride.panel %>%
arrange(start.station.id, interval60) %>%
mutate(lagHour = dplyr::lag(Trip_Count,1),
lag2Hours = dplyr::lag(Trip_Count,2),
lag3Hours = dplyr::lag(Trip_Count,3),
lag4Hours = dplyr::lag(Trip_Count,4),
lag12Hours = dplyr::lag(Trip_Count,12),
lag1day = dplyr::lag(Trip_Count,24),
holiday = ifelse(yday(interval60) == 245,1,0)) %>%
mutate(day = yday(interval60)) %>%
mutate(holidayLag = case_when(dplyr::lag(holiday, 1) == 1 ~ "PlusOneDay",
dplyr::lag(holiday, 2) == 1 ~ "PlustTwoDays",
dplyr::lead(holiday, 1) == 1 ~ "MinusOneDay",
dplyr::lead(holiday, 2) == 1 ~ "MinusTwoDays"),
holidayLag = replace_na(holidayLag, 0))
We then evaluate the correlation between these time lags and the trips count, and it is obvious that there’s a Pearson’s R of 0.84 for lagHour
and lag1day
respectively, and a Pearson’s R of 0.57 for lag2Hours
, which indicates that the relationships are vary strong. And thus, we would like to take them into accounts.
as.data.frame(ride.panel) %>%
group_by(interval60) %>%
summarise_at(vars(c(starts_with("lag")), "Trip_Count"), mean, na.rm = TRUE) %>%
gather(Variable, Value, -interval60, -Trip_Count) %>%
mutate(Variable = factor(Variable, levels=c("lagHour","lag2Hours","lag3Hours","lag4Hours",
"lag12Hours","lag1day")))%>%
group_by(Variable) %>%
summarize(Correlation = round(cor(Value, Trip_Count),2)) %>%
kable(caption = "MAE by models") %>%
kable_styling("striped", full_width = F,position = "center") %>%
row_spec(2, color = "white", background = "#33638dff") %>%
row_spec(4, color = "white", background = "#33638dff") %>%
row_spec(6, color = "white", background = "#33638dff")
In this section, we explore the data and gain insight into the spatial extent, temporal pattern and the relationship of bike activity to one or more dependent variables.
First, let’s focus on the overtime pattern of ridership in September, 2019. There is a clear periodicity, where there are peak in rush hours on weekday and lull periods on weekends. Besides, there is a obvious reduce in system usage on the Labor day holiday.
ggplot(bikeData %>%
group_by(interval60) %>%
tally())+
geom_line(aes(x = interval60, y = n),size=.7)+
labs(title="Bike share trips per hr. NYC, September, 2019",
x="Date",
y="Number of trips")+
plotTheme
Then, let’s dive into the demand on weekday and weekend. From the plot here, we can see there is two huge peaks in rush hours and small peak at midnight and noon on weekday. The peak at PM lasts longer than AM.
bikeData <- mutate(bikeData, hr = hour(starttime),
weekend=ifelse(bikeData$dotw %in% c('Sat', 'Sun'), 'Weekend', 'Weekday'))
grid.arrange(
ggplot(data=bikeData) +
geom_freqpoly(aes(hr, col=dotw), binwidth=1, size=1) +
labs(title="Bike share trips in NYC, by day of the week, September, 2019",
x="Hour",
y="Trip Counts")+
plotTheme,
ggplot(data=bikeData) +
geom_freqpoly(aes(hr, col=weekend), binwidth=1, size=1) +
labs(title="Bike share trips in NYC - weekend vs weekday, September, 2019",
x="Hour",
y="Trip Counts")+
plotTheme,
ncol=2)
Let’s now look at the distribution of trip volume by station for different times of the day. It is clear that at daytime, the number of staions with low bike activity is much smaller than that at night. In peak hours, there are more stations who have high number of trips than in other time of the day.
bikeData <- mutate(bikeData, time_of_day =
case_when(hour(interval60) < 7 | hour(interval60) > 19 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 19 ~ "PM Rush"))
bikeData %>%
group_by(interval60, start.station.name, time_of_day) %>%
tally()%>%
group_by(start.station.name, time_of_day)%>%
summarize(mean_trips = mean(n))%>%
ggplot()+
geom_histogram(aes(mean_trips), binwidth = 1, fill="#1d3863")+
labs(title="Mean Number of Hourly Trips Per Station. NYC, September, 2019",
x="Number of trips",
y="Frequency")+
facet_wrap(~time_of_day)+
plotTheme
How about the spatial distribution of those high-demand stations? Figure below tells the answer. In short, the bike activity varies across the space and time. High demand in center and lower Manhattan in rush hours in weekday. Here are some details.
Most of the crowded stations cluster in center and lower Manhattan. Notably, there is a dark point overnight indicating the high demand at this station, and after checking the map, it is in Chealsea. For the bike activity on weekend, more trips are taken on mid-day, unlike that in weekday.
ggplot()+
geom_sf(data = NYCCensus %>%
st_transform(crs=4326))+
geom_point(data = bikeData %>%
group_by(start.station.id, start.station.latitude, start.station.longitude, weekend, time_of_day) %>%
tally(),
aes(x=start.station.longitude, y = start.station.latitude, color = n),
fill = "transparent", alpha = 1, size = 1)+
scale_colour_viridis(direction = -1,
discrete = FALSE, option = "D")+
ylim(min(bikeData$start.station.latitude), max(bikeData$start.station.latitude))+
xlim(min(bikeData$start.station.longitude), max(bikeData$start.station.longitude))+
facet_grid(weekend ~ time_of_day)+
labs(title="Bike share trips per hr by station. NYC, September, 2019")+
mapTheme
An animate is created based on the 15-minute bike acitivity on census tract level on a Tuesday. From the gif, we can easily see the spatio-temporal patterns of Citi Bike usage in NYC. More trips taken in rush hours and in central city.
ride.panel2 <-
bikeData %>%
mutate(Trip_Counter = 1) %>%
right_join(study.panel) %>%
group_by(interval15, start.station.id, start.station.name, Origin.Tract, start.station.longitude, start.station.latitude) %>%
summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
ungroup() %>%
filter(is.na(start.station.id) == FALSE) %>%
mutate(week = week(interval15),
dotw = wday(interval15, label = TRUE)) %>%
filter(is.na(Origin.Tract) == FALSE)
ride.panel2 <-
left_join(ride.panel2, NYCCensus %>%
as.data.frame() %>%
select(-geometry), by = c("Origin.Tract" = "GEOID"))
week37 <-
ride.panel2 %>%
filter(week == 37 & dotw == "Tue")
week37.panel <-
expand.grid(
interval15=unique(week37$interval15),
Origin.Tract = unique(week37$Origin.Tract))
ride.animation.data <-
week37 %>%
left_join(NYCTracts,by=c("Origin.Tract" = "GEOID")) %>%
st_sf() %>%
na.omit() %>%
mutate(Trips = case_when(Trip_Count == 0 ~ "0 trips",
Trip_Count > 0 & Trip_Count <= 5 ~ "1-5 trips",
Trip_Count > 5 & Trip_Count <= 10 ~ "6-10 trips",
Trip_Count > 10 & Trip_Count <= 20 ~ "11-20 trips",
Trip_Count > 20 ~ "21+ trips")) %>%
mutate(Trips = factor(Trips, levels=c("0 trips","1-5 trips","6-10 trips","11-20 trips","21+ trips")))
rideshare_animation <-
ggplot() +
geom_sf(data = NYCCensus, color = "grey30", fill = "transparent")+
geom_sf(data=ride.animation.data, aes(fill=Trips)) +
scale_fill_manual(values = palette5) +
labs(title = "Rideshare pickups for one day in November 2019, NYC",
subtitle = "15 minute intervals: {current_frame}") +
transition_manual(interval15) +
mapTheme
animate(rideshare_animation, duration=20)
It is easy to think about the great influence weather posing on the bike activity. Plots above explore the relationship between demand and wind speed, temperature and precipitation, respectively. To better see the influence, I reclassify the incontinuous numerical variable into category variables. (Categories only used for plotting)
From the plots, we can concluded that:
Demand decreases as wind speed increase, many trips were completed when the wind is light, i.e., wind speed < 5.5 m/s.
Demand greatly increases when temperature is pleasant, ranging from 60 to 80 degree Fahrenheit. People prefer riding in warm rather than in cold weather.
Demand decreases during rainy days, even though the precipitation is small.
Since there are little rain in September in 2019, I only focus on the Temperature
and Wind_Speed
when building models.
grid.arrange(
ride.panel %>%
na.omit() %>%
group_by(Wind_Speed, Trip_Count) %>%
summarise(count=sum(Trip_Count)) %>%
mutate(Windms = Wind_Speed*0.5144,
Wind = case_when(Wind_Speed<=5.5~"Light",
Wind_Speed>5.5&Wind_Speed<=11~"Moderate",
Wind_Speed>11~"Gale")) %>%
ggplot(aes(Wind, count)) +
geom_histogram(position = "stack", stat = "summary", fun.y = "mean", na.rm = TRUE, fill="#336387ff") +
scale_x_discrete(limits=c("Light", "Moderate", "Gale")) +
plotTheme,
ride.panel %>%
na.omit() %>%
mutate(Temp = case_when(Temperature<60.0~"Cold",Temperature>=60.0&Temperature<70.0~"Cool",
Temperature>=70.0&Temperature<80.0~"Mild",Temperature>=80.0~"Hot")) %>%
group_by(Temp, Trip_Count) %>%
summarise(count=sum(Trip_Count)) %>%
ggplot(aes(Temp, count)) +
geom_histogram(position = "stack", stat = "summary", fun.y = "mean", na.rm = TRUE, fill="#29af7fff")+
scale_x_discrete(limits=c("Cold", "Cool", "Mild", "Hot")) +
plotTheme,
ride.panel %>%
na.omit() %>%
group_by(Precipitation, Trip_Count) %>%
summarise(count=sum(Trip_Count)) %>%
mutate(Preci = case_when(Precipitation<0.001~"No rain",Precipitation>=0.001~"Wet")) %>%
ggplot(aes(Preci, count)) +
geom_histogram(position = "stack", stat = "summary", fun.y = "mean", na.rm = TRUE, fill="#dce319ff") +
plotTheme,
ncol=3)
Before running the model, we first split the dataset into traning and test dataset, by the week info. Here, since we would include the holidayLag
in our model and the holiday is on Sep 2nd, to avoid the error in prediction caused by the unique fixed effects, we use the beginning of the month to predict the end of the month.
From what has been discussed in section 2.6, we find that the bike activity shows obvious spatial and temporal patterns. Based on the observations, we build in total 5 models.
In the reg1
model, we only focus on the time effects, including hour fixed effects and week fixed effects. Temperature is included.
In the reg2
model, spatial pattern is added by replacing the hour fixed effect by the station info start.station.id
.
In the reg3
model, both time and space fixed effects are included.
In the reg4
model, time lags are added into the model.
In the reg5
model, holiday
and holidayLag
effects are taken into accounts.
#timeOnly
reg1 <-
lm(Trip_Count ~ hour(interval60) + dotw + Temperature, data=ride.Train)
#spaceOnly
reg2 <-
lm(Trip_Count ~ start.station.name + dotw + Temperature, data=ride.Train)
#time and space
reg3 <-
lm(Trip_Count ~ start.station.name + hour(interval60) + dotw + Temperature + Precipitation,
data=ride.Train)
#time lag added
reg4 <-
lm(Trip_Count ~ start.station.name + hour(interval60) + dotw + Temperature + Wind_Speed +
lagHour + lag2Hours + lag12Hours + lag1day,
data=ride.Train)
#holiday lag added
reg5 <-
lm(Trip_Count ~ Origin.Tract + hour(interval60) + dotw + Temperature + Wind_Speed +
lagHour + lag2Hours +lag12Hours + lag1day + holidayLag + holiday,
data=ride.Train)
To better handle the great dataset, nested data frame of test data is created by week. A new function model_pred
is created which we can map onto each data frame in the nested structure.
#predictTestData,warning
ride.Test.weekNest <-
ride.Test %>%
nest(-week)
#predict function
model_pred <- function(dat, fit){
pred <- predict(fit, newdata = dat)}
week_predictions <-
ride.Test.weekNest %>%
mutate(ATime_FE = map(.x = data, fit = reg1, .f = model_pred),
BSpace_FE = map(.x = data, fit = reg2, .f = model_pred),
CTime_Space_FE = map(.x = data, fit = reg3, .f = model_pred),
DTime_Space_FE_timeLags = map(.x = data, fit = reg4, .f = model_pred),
ETime_Space_FE_timeLags_holidayLags = map(.x = data, fit = reg5, .f = model_pred)) %>%
gather(Regression, Prediction, -data, -week) %>%
mutate(Observed = map(data, pull, Trip_Count),
Absolute_Error = map2(Observed, Prediction, ~ abs(.x - .y)),
MAE = map_dbl(Absolute_Error, mean, na.rm = TRUE),
sd_AE = map_dbl(Absolute_Error, sd, na.rm = TRUE))
Both model 4 and 5 perform well. Since there is no holiday in 38 and 39 weeks, the model with holiday-related variables fail to play its role.
There is an obvious reduce in MAE after adding time lags indicating it is a strong predictor.
week_predictions %>%
dplyr::select(week, Regression, MAE) %>%
gather(Variable, MAE, -Regression, -week) %>%
ggplot(aes(week, MAE)) +
geom_bar(aes(fill = Regression), position = "dodge", stat="identity") +
scale_fill_manual(values = palette5) +
labs(title = "Mean Absolute Errors by model specification and week") +
plotTheme
Table below offer MAE (Mean Absolute Error) values of 5 models, and in terms of the MAE, we can tell that the model 4 DTime_Space_FE_timeLages
accounting for time and space effects, as well as time lags is the better one. The MAE value of this model is 3.15 and 3.01 for week 38 and 39; the average observed trips count is 6.99 and 6.71 relatively.
week_predictions %>%
dplyr::select(week, Regression, MAE, sd_AE) %>%
group_by(Regression, week) %>%
kable(caption = "MAE by models") %>%
kable_styling("striped", full_width = F) %>%
row_spec(c(1,2), color = "white", background = "#19b3b1") %>%
row_spec(c(3,4), color = "white", background = "#449cd4") %>%
row_spec(c(5,6), color = "white", background = "#216583") %>%
row_spec(c(7,8), color = "white", background = "#484085") %>%
row_spec(c(9,10), color = "white", background = "#283b42")
week | Regression | MAE | sd_AE |
---|---|---|---|
38 | ATime_FE | 5.956063 | 7.637216 |
39 | ATime_FE | 5.918637 | 7.263762 |
38 | BSpace_FE | 5.446165 | 7.087116 |
39 | BSpace_FE | 5.321141 | 6.785044 |
38 | CTime_Space_FE | 5.109810 | 6.925424 |
39 | CTime_Space_FE | 5.045905 | 6.615511 |
38 | DTime_Space_FE_timeLags | 3.150144 | 4.567026 |
39 | DTime_Space_FE_timeLags | 3.012247 | 4.063679 |
38 | ETime_Space_FE_timeLags_holidayLags | 3.181455 | 4.551873 |
39 | ETime_Space_FE_timeLags_holidayLags | 3.016955 | 4.060218 |
Here, we do not consider MAPE (Mean Absolute Percentage Error) as acriterion for model selection. MAPE is calculated by dividing the error by the predicted value, which makes it misleading and biased. To put it simple, in our prediction, there are some predicted values very close to 0, while makes the MAPE much higher than 1 and thus the MAPE make no sense. Besides, for too low forecast, the MAPE would be infinite. What’s more, for example, when the error is 50, and the predicted value is 100 and 150, the MAPE would be so different, and this criterion push people to overpredciting model. Whereas, MAE, together with the average observed value, could provide a relative fair comparison.
Time-series predicted and observed bike activity of each model are plotted. Again, time lags are crucial predictors in the prediction model. However, it is easily to note that model 4 and 5 perform badly when predicting the ride trips in rush hours, or it generate greater errors for high demand predicting. For some low-demand hours, the prediction is also relatively inaccurate.
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start.station.id = map(data, pull, start.station.id)) %>%
dplyr::select(interval60, start.station.id, Observed, Prediction, Regression) %>%
unnest() %>%
gather(Variable, Value, -Regression, -interval60, -start.station.id) %>%
group_by(Regression, Variable, interval60) %>%
summarize(Value = sum(Value)) %>%
ggplot(aes(interval60, Value, colour=Variable)) +
geom_line(size = 1.1) +
facet_wrap(~Regression, ncol=1) +
labs(title = "Predicted/Observed bike share time series", subtitle = "NYC; A test set of 2 weeks", x = "Hour", y= "Station Trips") +
plotTheme
Obviously, the comparison results tell us that the model considering the space and time effects and time lags is the best model. But given that model 4 and 5 have similar MAEs and the practical prediction for bike activity in other months usually fail to avoid the holiday effects, model 5 would be a better choice. In the next section, we will evaluate the performance of model 5 ETime_Space_FE_timeLages_holidayLags
in bike trips prediction.
In this section, we will evaluate our model by exploring its spatial, temporal and demographic dimensions of the error, and the generalizability is tested with census data. Moreover, some improvement could be done is also discussed here.
We first plot predicted values against observed values for different times of day during the week and weekend. It is obvious that our model underpredict the ridership. Specifically, the model perform well for bike activity in less busy hours like overnight and mide-day usage. It fails to handle high-demand prediction. But how the error varies across space?
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start.station.id = map(data, pull, start.station.id),
start.station.latitude = map(data, pull, start.station.latitude),
start.station.longitude = map(data, pull, start.station.longitude),
dotw = map(data, pull, dotw)) %>%
select(interval60, start.station.id, start.station.longitude,
start.station.latitude, Observed, Prediction, Regression,
dotw) %>%
unnest() %>%
filter(Regression == "ETime_Space_FE_timeLags_holidayLags")%>%
mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 19 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 19 ~ "PM Rush"))%>%
ggplot()+
geom_point(aes(x= Observed, y = Prediction))+
geom_smooth(aes(x= Observed, y= Prediction), method = "lm", se = FALSE, color = "#48bcfa")+
geom_abline(slope = 1, intercept = 0)+
facet_grid(time_of_day~weekend)+
labs(title="Observed vs Predicted",
x="Observed trips",
y="Predicted trips")+
plotTheme
Figure below show the spatial dimensions of the MAE. Recall the bike activity by station we mapped before. Comparing two maps, it is obvious that the stations that are observed in high demand seem to have a high MAE. Thus, we can conclude that our model perform badly for those active stations.
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start.station.id = map(data, pull, start.station.id),
start.station.latitude = map(data, pull, start.station.latitude),
start.station.longitude = map(data, pull, start.station.longitude)) %>%
select(interval60, start.station.id, start.station.longitude, start.station.latitude, Observed, Prediction, Regression) %>%
unnest() %>%
filter(Regression == "ETime_Space_FE_timeLags_holidayLags") %>%
group_by(start.station.id, start.station.longitude, start.station.latitude) %>%
summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
geom_sf(data = NYCCensus, color = "grey30", fill = "transparent")+
geom_point(aes(x = start.station.longitude, y = start.station.latitude, color = MAE),
fill = "transparent", alpha = 0.8)+
scale_colour_viridis(direction = -1,
discrete = FALSE, option = "D")+
ylim(min(bikeData$start.station.latitude), max(bikeData$start.station.latitude))+
xlim(min(bikeData$start.station.longitude), max(bikeData$start.station.longitude))+
labs(title="Mean Abs Error, Test Set, Model 5")+
mapTheme
Now let’s look at the spatio-temporal distribution of the MAEs. The MAEs are particularly high for prediction of hot stations in rush hours on weekday, and in daytime on weekend. These inaccurate prediction happen in mid and lower Manhattan.
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start.station.id = map(data, pull, start.station.id),
start.station.latitude = map(data, pull, start.station.latitude),
start.station.longitude = map(data, pull, start.station.longitude),
dotw = map(data, pull, dotw) ) %>%
select(interval60, start.station.id, start.station.longitude,
start.station.latitude, Observed, Prediction, Regression,
dotw) %>%
unnest() %>%
filter(Regression == "ETime_Space_FE_timeLags_holidayLags")%>%
mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 19 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 19 ~ "PM Rush")) %>%
group_by(start.station.id, weekend, time_of_day, start.station.longitude, start.station.latitude) %>%
summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
ggplot(.)+
geom_sf(data = NYCCensus, color = "grey30", fill = "transparent")+
geom_point(aes(x = start.station.longitude, y = start.station.latitude, color = MAE),
fill = "transparent", size = 1, alpha = 0.8)+
scale_colour_viridis(direction = -1,
discrete = FALSE, option = "D")+
ylim(min(bikeData$start.station.latitude), max(bikeData$start.station.latitude))+
xlim(min(bikeData$start.station.longitude), max(bikeData$start.station.longitude))+
facet_grid(weekend~time_of_day)+
labs(title="Mean Absolute Errors, Test Set")+
mapTheme
Let’s focus on the morning commuters. Basically, there appear trends that model performs relatively worse for high-income, low-transit-usage, majority-White, and high-travel-time communities.
week_predictions %>%
mutate(interval60 = map(data, pull, interval60),
start.station.id = map(data, pull, start.station.id),
start.station.latitude = map(data, pull, start.station.latitude),
start.station.longitude = map(data, pull, start.station.longitude),
dotw = map(data, pull, dotw),
Percent_Taking_Public_Trans = map(data, pull, Percent_Taking_Public_Trans),
Med_Inc = map(data, pull, Med_Inc),
Percent_White = map(data, pull, Percent_White),
Travel_Time = map(data, pull, Travel_Time)) %>%
select(interval60, start.station.id, start.station.longitude,
start.station.latitude, Observed, Prediction, Regression,
dotw, Percent_Taking_Public_Trans, Med_Inc, Percent_White,Travel_Time) %>%
unnest() %>%
filter(Regression == "ETime_Space_FE_timeLags_holidayLags")%>%
mutate(weekend = ifelse(dotw %in% c("Sun", "Sat"), "Weekend", "Weekday"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 19 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 19 ~ "PM Rush")) %>%
filter(time_of_day == "AM Rush") %>%
group_by(start.station.id, Percent_Taking_Public_Trans, Med_Inc, Percent_White,Travel_Time) %>%
summarize(MAE = mean(abs(Observed-Prediction), na.rm = TRUE))%>%
gather(-start.station.id, -MAE, key = "variable", value = "value")%>%
ggplot(.)+
geom_point(aes(x = value, y = MAE), alpha = 0.4)+
geom_smooth(aes(x = value, y = MAE), method = "lm", se= FALSE)+
facet_wrap(~variable, scales = "free")+
labs(title="Errors as a function of socio-economic variables",
y="Mean Absolute Error (Trips)")+
plotTheme
In general, our model fail to be generalizable in space, time and demography. Thus, readers are recommend to predict the usage in stations with low demand, in time slots when system is less crowded. The model generate acceptable results in community where the percent of White is low; people have low income, high transit usage, and low travel time.
From what has been discussed above, the model underaccess for periods of high demand but could sense the spatio-temporal cahnge in bike activity. Thus, the model should be used to predict station with moderate usage and should be improved if to used for prediction in rush hours. Besides, it works well in communities where people are mainly in minority and have low median household income, high transit usage, as well as low travel time.
Here provided some possible improvements:
For each stations, considering the user group and add user information as new predictor.
More city information about the NYC and the land use map would be helpful in understanding the sptial patterns of the errors.
Bike share could help solve ‘the last mile’ problems and thus, the distance of the station to transit stops may be important predictor.
Taking distance of the station to city infrastructure, like shopping malls, parks into accout.
Build different models for prediction in stations of different demand, in different time slots, or in weekday and weekend based on one’s focus.
To use more complexed regression models.