Four dockless vehicle operators (Bird, Lime, Bolt, and Spin) have launched operations in the city of Louisville since August 2018. Their app-based rental scooters are designed to carry individuals over the last mile or so of their commutes across the city. More than a year of operation has exposed several problems:
Shortages of scooters often occur where demand is high, inconveniencing users and costing companies ridership;
Where there are few potential users, scooters are oversupplied and pile up, and depreciation without any revenue sharply cuts the rate of return;
Because the system is dockless, riders commonly park scooters in the wrong places, producing unsightly streetscapes and traffic chaos.
One of the biggest operational challenges of a dockless scooter system is getting vehicles to areas that are anticipated to have high demand but currently have no scooters. In this report, we predict scooter supply and demand from existing trip data. By detecting the real demand-supply situation, we hope this model can support:
Better decision making for future investment, since Louisville is now allowing more companies to enter the market and scooter companies are considering doubling their fleets;
A successful rebalancing system with larger market share and greater potential benefits, including a concrete estimate of the money that could be saved with more detailed preparation;
A win-win outcome for both providers and users, through higher public transportation accessibility and less reliance on personal cars;
Suggestions for virtual dock siting, if needed in the future.
For more about our app Sco-Porter, here is a video!
Before the analysis, some libraries are loaded and some graphic themes are set.
library(tidyverse)
library(sf)
library(lubridate)
library(tigris)
library(tidycensus)
library(viridis)
library(riem)
library(gridExtra)
library(knitr)
library(kableExtra)
library(RSocrata)
library(dplyr)
library(prob)
library(FNN)
library(osmdata)
library(gganimate)
library(gifski)
library(png)
plotTheme <- theme(
  plot.title = element_text(size = 15),
  plot.subtitle = element_text(size = 8),
  plot.caption = element_text(size = 6),
  axis.text.x = element_text(size = 10, hjust = 1),
  axis.text.y = element_text(size = 10),
  axis.title.y = element_text(size = 10),
  # Set the entire chart region to blank
  panel.background = element_blank(),
  plot.background = element_blank(),
  #panel.border = element_rect(colour = "#F0F0F0"),
  # Format the grid
  panel.grid.major = element_line(colour = "#D0D0D0", size = .2),
  axis.ticks = element_blank())
mapTheme <- function(base_size = 12) {
  theme(
    text = element_text(color = "black"),
    plot.title = element_text(size = 14, colour = "black"),
    plot.subtitle = element_text(face = "italic"),
    plot.caption = element_text(hjust = 0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    axis.title = element_blank(),
    axis.text = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill = NA, size = 2)
  )
}
We compiled our dataset from several sources. The study area is defined by Louisville's dockless vehicle service area shapefile:
studyArea <- read_sf("C:/Users/sumyeem/Downloads/Dockless Vehicle Service Area/Dockless_Vehicle_Service_Area.shp") %>%
st_transform(4326)
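The trip records themselves, used below as the scooters data frame, come from Louisville Metro's open dockless vehicle trip dataset, which carries the StartDate/StartTime, coordinate, DayOfWeek and HourNum fields used throughout this report. A minimal load sketch, assuming a local CSV export at a placeholder path:
#scooter trip records (placeholder path; fields per Louisville's open dockless trip data)
scooters <- read.csv("data/DocklessTripOpenData.csv", stringsAsFactors = FALSE)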
Holiday and event data are collected from web resources;
Census tract data are obtained with the tidycensus package. The tract boundary shapefile is downloaded from the TIGER/Line Shapefiles dataset and has been clipped to fit the study area.
#census data
LVCensus <-
get_acs(geography = "tract",
variables = c("B01003_001", "B19013_001",
"B02001_002", "B08013_001",
"B08012_001", "B08301_001",
"B08301_010", "B01002_001",
"B08014_001", "B08014_002"),
year = 2017,
state = "KY",
geometry = TRUE,
county=c("Jefferson"),
output = "wide") %>%
rename(Total_Pop = B01003_001E,
Med_Inc = B19013_001E,
Med_Age = B01002_001E,
White_Pop = B02001_002E,
Vehicle_own = B08014_001E,
No_vehicle = B08014_002E,
Travel_Time = B08013_001E,
Num_Commuters = B08012_001E,
Means_of_Transport = B08301_001E,
Total_Public_Trans = B08301_010E) %>%
select(Total_Pop, Med_Inc, White_Pop, Travel_Time,
Means_of_Transport, Total_Public_Trans,
Med_Age,Vehicle_own,No_vehicle,
GEOID, geometry) %>%
mutate(Percent_White = White_Pop / Total_Pop,
Mean_Commute_Time = Travel_Time / Total_Public_Trans,
Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport,
Percent_vehicle_available = 1-No_vehicle/Vehicle_own)
#census tract boundary
censusTract <- read_sf("E:/GRAD/MUSA507/finalProject/censusTratcs.shp") %>%
st_transform(4326)
Weather data are collected with the riem package, including temperature, precipitation and wind speed; station 'LOU' is selected here:
weather.Panel <-
riem_measures(station = "LOU", date_start = "2019-03-01", date_end = "2019-09-01") %>%
replace(is.na(.), 0) %>%
mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
mutate(week = week(interval60),
dotw = wday(interval60, label=TRUE)) %>%
group_by(interval60) %>%
summarize(Temperature = max(tmpf),
Precipitation = sum(p01i),
Wind_Speed = max(sknt)) %>%
mutate(Temperature = ifelse(Temperature == 0, 42, Temperature)) # impute zero (missing) readings with a plausible temperature
#plot
grid.arrange(
ggplot(weather.Panel, aes(interval60,Precipitation)) + geom_line() +
labs(title="Percipitation", x="Hour", y="Perecipitation") + plotTheme,
ggplot(weather.Panel, aes(interval60,Wind_Speed)) + geom_line() +
labs(title="Wind Speed", x="Hour", y="Wind Speed") + plotTheme,
ggplot(weather.Panel, aes(interval60,Temperature)) + geom_line() +
labs(title="Temperature", x="Hour", y="Temperature") + plotTheme,
top="Weather Data - Louisville - March to August, 2019")
Note that this is the final dataset, narrowed down to a set of best predictors through exploratory analysis and the modeling process. That process included various transformations of the data to optimize the predictive ability of each variable. For the time dimension, we labeled the start and end time of each trip and binned them into hourly intervals.
#Start
scooterData <- st_intersection(scooters %>%
st_as_sf(.,coords = c("StartLongitude", "StartLatitude"), crs = 4326),
studyArea)
#delete the NA row
scooterData <- scooterData[-1,]
scooterData$StartDateTime <- paste(scooterData$StartDate,scooterData$StartTime,sep=' ')
scooterData$StartDateTime <- as.POSIXct(scooterData$StartDateTime,format = '%Y-%m-%d %H:%M')
scooterData <- mutate(scooterData,
interval60 = floor_date(ymd_hms(StartDateTime), unit = "hour"),
interval15 = floor_date(ymd_hms(StartDateTime), unit = "15 mins"))
#End
scooterData_end <- st_intersection(scooters %>%
st_as_sf(.,coords = c("EndLongitude", "EndLatitude"), crs = 4326),
studyArea)
scooterData_end <- scooterData_end[-1,]
scooterData_end$EndDateTime <- paste(scooterData_end$EndDate,scooterData_end$EndTime,sep=' ')
scooterData_end$EndDateTime <- as.POSIXct(scooterData_end$EndDateTime,format = '%Y-%m-%d %H:%M')
scooterData_end <- mutate(scooterData_end,
interval60 = floor_date(ymd_hms(EndDateTime), unit = "hour"),
interval15 = floor_date(ymd_hms(EndDateTime), unit = "15 mins"))
We will focus on trips generated from March to August.
scooterData38_start <- subset(scooterData, year(interval60)==2019&month(interval60)>2 & month(interval60)<9)
scooterData38_start <- mutate(scooterData38_start, StartLongitude = unlist(map(geometry,1)),
StartLatitude = unlist(map(geometry,2)))
scooterData38_end <- subset(scooterData_end, year(interval60)==2019&month(interval60)>2 & month(interval60)<9)
scooterData38_end <- mutate(scooterData38_end, EndLongitude = unlist(map(geometry,1)),
EndLatitude = unlist(map(geometry,2)))
Census tracts are used as the spatial unit to start with, and we project scooter trips onto the tracts. In this way, trip data are aggregated from points to polygons.
scooterData_in_ct_s <- st_join(scooterData38_start, censusTract %>% select(GEOID),st_within, left=FALSE)
scooterData_in_ct_s <- scooterData_in_ct_s %>%
st_set_geometry(NULL) %>%
merge(.,censusTract %>% select(GEOID, geometry), by='GEOID') %>%
st_sf()
scooterData_in_ct_e <- st_join(scooterData38_end, censusTract %>% select(GEOID),st_within, left=FALSE)
scooterData_in_ct_e <- scooterData_in_ct_e %>%
st_set_geometry(NULL) %>%
merge(.,censusTract %>% select(GEOID, geometry), by='GEOID') %>%
st_sf()
Here, two space-time panels are created based on the labeled start and end times. The study.panel is like an empty table in which each row records scooter activity for one census tract in one hour.
study.panel.start <-
expand.grid(interval60=unique(scooterData_in_ct_s$interval60),
GEOID = unique(scooterData_in_ct_s$GEOID))
study.panel.end <-
expand.grid(interval60=unique(scooterData_in_ct_e$interval60),
GEOID = unique(scooterData_in_ct_e$GEOID))
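As a quick sanity check (a sketch using the objects above), each panel should contain exactly one row per census tract per hourly interval:
stopifnot(nrow(study.panel.start) ==
length(unique(scooterData_in_ct_s$interval60)) * length(unique(scooterData_in_ct_s$GEOID)))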
Two full panels are then created by summarizing trip counts by census tract for each time interval. Weather data are joined to the panels, and census tract data are added as spatial context.
ride.panel.start <-
scooterData_in_ct_s %>%
mutate(Trip_Counter = 1) %>%
right_join(study.panel.start) %>%
st_set_geometry(NULL) %>%
group_by(interval60, GEOID) %>%
summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
left_join(censusTract,by='GEOID') %>%
left_join(weather.Panel, by='interval60') %>%
ungroup() %>%
mutate(week = week(interval60),
dotw = wday(interval60, label = TRUE),
Preci = case_when(Precipitation<0.001~"No rain",Precipitation>=0.001~"Wet"),
Wind = case_when(Wind_Speed<=3~"Light", Wind_Speed> 3 & Wind_Speed <= 6~"Moderate", Wind_Speed > 6& Wind_Speed < 9~"Gale",
Wind_Speed >= 9~"Stormy"),
time_GEOID=paste(as.character(interval60),GEOID))
ride.panel.end <-
scooterData_in_ct_e %>%
mutate(Trip_Counter = 1) %>%
right_join(study.panel.end) %>%
st_set_geometry(NULL) %>%
group_by(interval60, GEOID) %>%
summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
left_join(censusTract,by='GEOID') %>%
left_join(weather.Panel, by='interval60') %>%
ungroup() %>%
mutate(week = week(interval60),
dotw = wday(interval60, label = TRUE),
Preci = case_when(Precipitation<0.001~"No rain",Precipitation>=0.001~"Wet"),
Wind = case_when(Wind_Speed<=3~"Light", Wind_Speed>3&Wind_Speed<=6~"Moderate",Wind_Speed>6&Wind_Speed<9~"Gale",
Wind_Speed>=9~"Stormy"),
time_GEOID=paste(as.character(interval60),GEOID))
Time lags are added to help us understand how scooter trips differ across time slots. In addition, a dummy variable holidayLag is created to control for the influence that three-day weekends have on demand. Two holidays fall in the selected period, Memorial Day on May 27th and Independence Day on July 4th, so the holiday lag should be considered. We also divide the six months into high-, medium- and low-demand categories based on monthly trip volume, labeled in the monthCat column.
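For reference, the monthly volumes behind this grouping can be tallied directly from the trip data (a quick sketch; st_drop_geometry just speeds up the count):
scooterData38_start %>%
st_drop_geometry() %>%
count(month = month(interval60))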
#start----
ride.panel.start <-
ride.panel.start %>%
arrange(GEOID, interval60) %>%
mutate(lagHour = dplyr::lag(Trip_Count,1),
lag2Hours = dplyr::lag(Trip_Count,2),
lag3Hours = dplyr::lag(Trip_Count,3),
lag4Hours = dplyr::lag(Trip_Count,4),
lag12Hours = dplyr::lag(Trip_Count,12),
lag1day = dplyr::lag(Trip_Count,24),
holiday = ifelse(yday(interval60) == 147|yday(interval60) == 185,1,0), # Memorial Day (day 147) and Independence Day (day 185)
weekday = ifelse(dotw != 'Sun' & dotw!='Sat',1,0),
events = ifelse(yday(interval60) %in% c(76,97,103,117,124,165,185,227,236),1,0), # dates of city events and festivals
day = yday(interval60)) %>%
mutate(holidayLag = case_when(dplyr::lag(holiday, 1) == 1 ~ "PlusOneDay",
dplyr::lag(holiday, 2) == 1 ~ "PlusTwoDays",
dplyr::lag(holiday, 3) == 1 ~ "PlusThreeDays",
dplyr::lead(holiday, 1) == 1 ~ "MinusOneDay",
dplyr::lead(holiday, 2) == 1 ~ "MinusTwoDays",
dplyr::lead(holiday, 3) == 1 ~ "MinusThreeDays"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"),
monthCat = case_when(month(interval60)<5 | month(interval60)>9 ~ 'Low-demand month',
month(interval60)==6|month(interval60)==5 ~ 'Medium-demand month',
month(interval60)==7|month(interval60)==8 ~ 'High-demand month'),
holidayLag = replace_na(holidayLag, '0'))
#end----
ride.panel.end <-
ride.panel.end %>%
arrange(GEOID, interval60) %>%
mutate(lagHour = dplyr::lag(Trip_Count,1),
lag2Hours = dplyr::lag(Trip_Count,2),
lag3Hours = dplyr::lag(Trip_Count,3),
lag4Hours = dplyr::lag(Trip_Count,4),
lag12Hours = dplyr::lag(Trip_Count,12),
lag1day = dplyr::lag(Trip_Count,24),
holiday = ifelse(yday(interval60) == 147|yday(interval60) ==185,1,0),
weekday = ifelse(dotw != 'Sun' & dotw!='Sat',1,0),
events = ifelse(yday(interval60) %in% c(76,97,103,117,124,165,185,227,236),1,0), # dates of city events and festivals
day = yday(interval60)) %>%
mutate(holidayLag = case_when(dplyr::lag(holiday, 1) == 1 ~ "PlusOneDay",
dplyr::lag(holiday, 2) == 1 ~ "PlusTwoDays",
dplyr::lag(holiday, 3) == 1 ~ "PlusThreeDays",
dplyr::lead(holiday, 1) == 1 ~ "MinusOneDay",
dplyr::lead(holiday, 2) == 1 ~ "MinusTwoDays",
dplyr::lead(holiday, 3) == 1 ~ "MinusThreeDays"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"),
monthCat = case_when(month(interval60)<5 | month(interval60)>9 ~ 'Low-demand month',
month(interval60)==6|month(interval60)==5 ~ 'Medium-demand month',
month(interval60)==7|month(interval60)==8 ~ 'High-demand month'),
holidayLag = replace_na(holidayLag, '0'))
The correlation between each of these time lags and the trip count is evaluated. Pearson's R is at or above roughly 0.5 for lagHour, lag2Hours, lag3Hours, lag4Hours and lag1day, indicating strong correlations, so we take these lags into account.
plotData.lag.start <-
as.data.frame(ride.panel.start) %>%
group_by(interval60) %>%
summarise_at(vars(starts_with("lag"), "Trip_Count"), mean, na.rm = TRUE) %>%
gather(Variable, Value, -interval60, -Trip_Count) %>%
mutate(Variable = factor(Variable, levels=c("lagHour","lag2Hours","lag3Hours","lag4Hours",
"lag12Hours","lag1day")))
correlation.lag.start <-
plotData.lag.start %>%
group_by(Variable) %>%
summarize(correlation = round(cor(Value, Trip_Count),2))
correlation.lag.start %>%
kable(caption = "Pearson's R (Pickup)") %>%
kable_styling("striped", full_width = F,position = "center")
Variable | correlation |
---|---|
lagHour | 0.92 |
lag2Hours | 0.82 |
lag3Hours | 0.69 |
lag4Hours | 0.56 |
lag12Hours | -0.31 |
lag1day | 0.73 |
plotData.lag.end <-
as.data.frame(ride.panel.end) %>%
group_by(interval60) %>%
summarise_at(vars(starts_with("lag"), "Trip_Count"), mean, na.rm = TRUE) %>%
gather(Variable, Value, -interval60, -Trip_Count) %>%
mutate(Variable = factor(Variable, levels=c("lagHour","lag2Hours","lag3Hours","lag4Hours",
"lag12Hours","lag1day")))
correlation.lag.end <-
plotData.lag.end %>%
group_by(Variable) %>%
summarize(correlation = round(cor(Value, Trip_Count),2))
correlation.lag.end %>%
kable(caption = "Pearson's R (Return)") %>%
kable_styling("striped", full_width = F,position = "center")
Variable | correlation |
---|---|
lagHour | 0.92 |
lag2Hours | 0.82 |
lag3Hours | 0.70 |
lag4Hours | 0.56 |
lag12Hours | -0.31 |
lag1day | 0.73 |
In this section, we explore the data to gain insight into the temporal and spatial patterns of scooter activity, as well as the relationship between trip counts and the independent variables.
First, we do some general analysis to get a rough picture of scooter trips in Louisville. The figure below shows that scooters are generally used for short trips, usually under one mile.
scooterData38_start %>%
mutate(distCat = case_when(TripDistance<1~'< 1 mile',
TripDistance<2 & TripDistance>=1~'1 ~ 2 miles',
TripDistance<3 & TripDistance>=2~'2 ~ 3 miles',
TripDistance<4 & TripDistance>=3~'3 ~ 4 miles',
TripDistance<5 & TripDistance>=4~'4 ~ 5 miles',
TripDistance>=5~'> 5 miles'),
scooterTripCount = 1) %>%
group_by(distCat, scooterTripCount) %>%
summarise(count=sum(scooterTripCount)) %>%
ggplot(aes(distCat, count)) +
geom_histogram(position = "stack", stat = "summary", fun.y = "mean", na.rm = TRUE, fill="#238A8DFF") +
xlab("Trip Distance")+
scale_x_discrete(limits=c("< 1 mile", "2 ~ 3 miles", "3 ~ 4 miles", "4 ~ 5 miles", "> 5 mile")) +
labs(title = "Scooter trips count by distance", subtitle = "Lousiville, March to August, 2019") +
plotTheme
The histograms below show that weather affects activity: there are more trips when wind speed is low, the temperature is comfortable and it is not raining.
grid.arrange(
ride.panel.start %>%
na.omit() %>%
group_by(Wind_Speed, Trip_Count) %>%
summarise(count=sum(Trip_Count)) %>%
mutate(Windms = Wind_Speed*0.5144,
Wind = case_when(Wind_Speed<=5.5~"Light",
Wind_Speed>5.5 & Wind_Speed<=11~"Moderate",
Wind_Speed>11~"Gale")) %>%
ggplot(aes(Wind, count)) +
geom_histogram(position = "stack", stat = "summary", fun.y = "mean", na.rm = TRUE, fill="#336387ff") +
scale_x_discrete(limits=c("Light", "Moderate", "Gale")) +
labs(title = "Wind Speed") +
plotTheme,
ride.panel.start %>%
na.omit() %>%
mutate(Temp = case_when(Temperature<60.0~"Cold",Temperature>=60.0&Temperature<70.0~"Cool",
Temperature>=70.0&Temperature<80.0~"Mild",Temperature>=80.0~"Hot")) %>%
group_by(Temp, Trip_Count) %>%
summarise(count=sum(Trip_Count)) %>%
ggplot(aes(Temp, count)) +
geom_histogram(position = "stack", stat = "summary", fun.y = "mean", na.rm = TRUE, fill="#29af7fff")+
scale_x_discrete(limits=c("Cold", "Cool", "Mild", "Hot")) +
labs(title = "Temperature") +
plotTheme,
ride.panel.start %>%
na.omit() %>%
group_by(Precipitation, Trip_Count) %>%
summarise(count=sum(Trip_Count)) %>%
mutate(Preci = case_when(Precipitation<0.001~"No rain",Precipitation>=0.001~"Wet")) %>%
ggplot(aes(Preci, count)) +
geom_histogram(position = "stack", stat = "summary", fun.y = "mean", na.rm = TRUE, fill="#dce319ff") +
labs(title = 'Precipitation') +
plotTheme,
ncol=3)
Next, let's focus on the hour-by-hour pattern of ridership in July 2019, when the number of trips is highest. There is a clear periodicity: trips peak during the daytime, and more trips were made on weekends, especially Saturdays, than on weekdays. Moreover, trip counts rose on July 4th, a Thursday holiday, compared with the average Thursday.
ggplot(scooterData38_start %>%
mutate(month=month(interval60)) %>%
subset(month == 7) %>%
group_by(interval60) %>%
tally())+
geom_line(aes(x = interval60, y = n),size=.8, col='#42B5AD')+
labs(title="Scooter trips per hour. Louisville, July, 2019",
x="Date",
y="Number of trips")+
plotTheme
For more detail, we plot scooter trips by day of the week. Usage is higher on Friday and Saturday. In terms of hours, people usually ride scooters during the daytime, especially after noon.
ggplot(data=scooterData38_start %>%
mutate(month=month(interval60),
dotw=weekdays(interval60))) +
#subset(month == 7)) +
geom_freqpoly(aes(HourNum, col=dotw), binwidth=1, size=1) +
scale_colour_viridis(discrete = T, name="Day of the week") +
labs(title="Scooter trips in Louisville, by day of the week, March to August, 2019",
x="Hour",
y="Trip Counts")+
plotTheme
It is natural to assume that most scooter usage happens in the CBD. To validate this thought, we plot the distribution of scooter pickups below. In short, scooter activity varies across space and time. High demand occurs downtown, in the old city and in the districts around the University of Louisville, suggesting that city residents and students are likely users.
ggplot()+
geom_point(data = scooterData38_start %>%
group_by(StartLongitude, StartLatitude) %>%
tally(),
aes(x=StartLongitude, y = StartLatitude, color = n),
fill = "transparent", alpha = 0.5, size = 1)+
geom_sf(data=censusTract,fill='transparent') +
scale_colour_gradient(low='#bdebda',high='#0076be')+
ylim(min(scooterData38_start$StartLatitude), max(scooterData38_start$StartLatitude))+
xlim(min(scooterData38_start$StartLongitude), max(scooterData38_start$StartLongitude))+
labs(title="Scooter trips by time of the day. \nLousiville, March to August, 2019")+
mapTheme()
Then we plot the distribution of trips across four periods of the day: AM Rush from 7 am to 10 am, Mid-Day from 10 am to 3 pm, PM Rush from 3 pm to 6 pm, and Overnight for the rest. Since the maps show no significant difference in pattern among the periods, people in Louisville do not appear to use scooters widely for commuting.
scooterData38_start <- mutate(scooterData38_start, time_of_day = case_when(HourNum < 7 | HourNum > 18 ~ "Overnight",
HourNum >= 7 & HourNum < 10 ~ "AM Rush",
HourNum >= 10 & HourNum < 15 ~ "Mid-Day",
HourNum >= 15 & HourNum <= 18 ~ "PM Rush"))
scooterData38_end <- mutate(scooterData38_end, time_of_day = case_when(HourNum < 7 | HourNum > 18 ~ "Overnight",
HourNum >= 7 & HourNum < 10 ~ "AM Rush",
HourNum >= 10 & HourNum < 15 ~ "Mid-Day",
HourNum >= 15 & HourNum <= 18 ~ "PM Rush"))
scooterData38_start <- mutate(scooterData38_start,weekend = ifelse(scooterData38_start$DayOfWeek==6 | scooterData38_start$DayOfWeek==7, 'Weekend', 'Weekday'))
scooterData38_end <- mutate(scooterData38_end,weekend = ifelse(scooterData38_end$DayOfWeek==6 | scooterData38_end$DayOfWeek==7,'Weekend', 'Weekday'))
ggplot()+
geom_point(data = scooterData38_start %>%
group_by(StartLongitude, StartLatitude, weekend, time_of_day) %>%
tally(),
aes(x=StartLongitude, y = StartLatitude, color = n),
fill = "transparent", alpha = 0.5, size = 1)+
geom_sf(data=censusTract,fill='transparent') +
scale_colour_gradient(low='#bdebda',high='#0076be') +
ylim(min(scooterData38_start$StartLatitude), max(scooterData38_start$StartLatitude))+
xlim(min(scooterData38_start$StartLongitude), max(scooterData38_start$StartLongitude))+
facet_grid(weekend~time_of_day)+
labs(title="Scooter pickup by time of the day. Lousiville, March to August, 2019")+
mapTheme()
The choropleth map tells the same story as the point map.
ggplot(data = scooterData_in_ct_s %>%
mutate(weekend = ifelse(scooterData_in_ct_s$DayOfWeek==6|scooterData_in_ct_s$DayOfWeek==7,
'Weekend','Weekday'),
time_of_day = case_when(HourNum < 7 | HourNum > 18 ~ "Overnight",
HourNum >= 7 & HourNum < 10 ~ "AM Rush",
HourNum >= 10 & HourNum < 15 ~ "Mid-Day",
HourNum >= 15 & HourNum <= 18 ~ "PM Rush")) %>%
group_by(GEOID, weekend, time_of_day) %>%
summarise(Trip_count = n()))+
geom_sf(aes(fill = Trip_count))+
scale_fill_gradient(low='#bdebda',high='#0076be') +
facet_grid(weekend~time_of_day)+
labs(title="Scooter trips on census-tract level by time of the day", subtitle = "Louisville, March to August, 2019")+
mapTheme()
The animation below presents scooter activity by census tract over the course of a Saturday in July. We can see a cluster of trips in the center-city census tracts.
ride.panel.start.gif <-
scooterData_in_ct_s %>%
mutate(Trip_Counter = 1) %>%
right_join(study.panel.start) %>%
st_set_geometry(NULL) %>%
group_by(interval15, GEOID) %>%
summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
ungroup() %>%
mutate(week = week(interval15),
dotw = weekdays(interval15)) %>%
left_join(censusTract,by='GEOID')
week27 <-
ride.panel.start.gif %>%
subset((week == 27) & (dotw == "Saturday"))
ride.animation.data <-
week27 %>%
st_sf() %>%
na.omit() %>%
mutate(Trips = case_when(Trip_Count > 0 & Trip_Count <= 5 ~ "1-5 trips",
Trip_Count > 5 & Trip_Count <= 10 ~ "6-10 trips",
Trip_Count > 10 & Trip_Count <= 20 ~ "11-20 trips",
Trip_Count > 20 ~ "21+ trips")) %>%
mutate(Trips = factor(Trips, levels=c("1-5 trips","6-10 trips","11-20 trips","21+ trips")))
palette4 <- c("#cceff8","#9fe2bf","#20b3ab","#016690")
rideshare_animation <-
ggplot() +
geom_sf(data = censusTract, color = "grey30", fill = "transparent")+
geom_sf(data=ride.animation.data, aes(fill=Trips)) +
scale_fill_manual(values = palette4) +
labs(title = "Scooter pickups for a Saturday in November 2019, NYC",
subtitle = "15 minute intervals: {current_frame}") +
transition_manual(interval15) +
mapTheme()
animate(rideshare_animation, duration=20)
In this part, we build two space-time models, one predicting scooter demand based on pickups (trip starts) and one predicting supply based on returns (trip ends). The difference between the two predictions reflects whether supply and demand are balanced in each census tract.
To that end, we first split the trip data: the training set contains trips from March, May and July, and the test set contains trips from April, June and August.
ride.train.s <- subset(ride.panel.start, month(interval60) == 3|month(interval60) == 5|month(interval60) == 7)
ride.test.s <- subset(ride.panel.start, month(interval60) == 4|month(interval60) == 6|month(interval60) == 8)
ride.train.e <- subset(ride.panel.end, month(interval60) == 3|month(interval60) == 5|month(interval60) == 7)
ride.test.e <- subset(ride.panel.end, month(interval60) == 4|month(interval60) == 6|month(interval60) == 8)
The exploratory analysis above shows that scooter activity has clear spatial and temporal patterns, concentrated downtown and on weekends. We therefore include the following variables in the models:
Time effects: month, day of week, time of day, events, the time-lag family (1-hr, 2-hr, 3-hr, 4-hr, 12-hr, 1-day), holiday and holiday lag. When building the first model we noticed usage peaks on some weekdays; on investigation these turned out to be caused by events and festivals held in the city, so we encode those dates in an events dummy.
Space effect: census tract GEOID.
Other: temperature, wind speed and precipitation.
reg.s <-
lm(Trip_Count ~ GEOID + monthCat + time_of_day + events + Temperature + Wind_Speed + Preci + lagHour + lag2Hours +lag3Hours + lag4Hours + lag1day + holiday +holidayLag +weekday,
data=ride.train.s)
reg.e <-
lm(Trip_Count ~ GEOID + monthCat +time_of_day + events + Temperature + Wind_Speed + Preci + lagHour + lag2Hours +lag3Hours + lag4Hours + lag1day + holiday + holidayLag + weekday,
data=ride.train.e)
ride.test.s$pred <- round(predict(reg.s, newdata = ride.test.s))
ride.test.s$error <- ride.test.s$pred - ride.test.s$Trip_Count
ride.test.e$pred <- round(predict(reg.e, newdata = ride.test.e))
ride.test.e$error <- ride.test.e$pred - ride.test.e$Trip_Count
In the final part of the model, we focus on detecting the real demand (pickup) versus supply (return) situation, calculated as the difference between the predicted pickup and return counts. From the result we find that, on average, a census tract needs about 5 rebalancings, and the model's error is 0.96, meaning we over- or under-predict about 1 rebalancing per census tract on average. The final map shows that:
For most census tracts in Louisville, scooter demand and supply are well balanced;
Downtown, scooters are considerably undersupplied, which makes it a focal area for scooter rebalancing;
Excess demand also appears in the Wyandotte district;
The northern part of Louisville may suffer from wasted oversupply.
return_by_CT <- ride.test.e %>%
mutate(day=yday(interval60)) %>%
group_by(GEOID, day) %>%
summarise(return = sum(Trip_Count)) %>%
right_join(censusTract)
pickup_by_CT <- ride.test.s %>%
mutate(day=yday(interval60)) %>%
group_by(GEOID, day) %>%
summarise(pickup = sum(Trip_Count)) %>%
right_join(censusTract)
return_by_CT$subtraction <- pickup_by_CT$pickup - return_by_CT$return
return_by_CT_pred <- ride.test.e %>%
mutate(day=yday(interval60)) %>%
group_by(GEOID,day) %>%
summarise(return = sum(pred)) %>%
right_join(censusTract)
pickup_by_CT_pred <- ride.test.s %>%
mutate(day=yday(interval60)) %>%
group_by(GEOID, day) %>%
summarise(pickup = sum(pred)) %>%
right_join(censusTract)
return_by_CT$subtraction_pred <- pickup_by_CT_pred$pickup - return_by_CT_pred$return
return_by_CT$error_sub <- return_by_CT$subtraction - return_by_CT$subtraction_pred
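The figures cited above can be derived from these columns (a sketch): the mean absolute net flow per tract-day approximates the daily rebalancing need, and the mean absolute difference between observed and predicted net flow is the model's rebalancing error.
mean(abs(return_by_CT$subtraction), na.rm=T) # average daily rebalancing need per tract (~5)
mean(abs(return_by_CT$error_sub), na.rm=T) # average rebalancing prediction error (~0.96)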
ggplot() +
geom_sf(data=return_by_CT %>% st_sf(),aes(fill = subtraction_pred), legend=T) +
scale_fill_gradient(low='#bdebda',high='#0076be', name="Demand - Supply")+
labs(title = "Excess supply and excess demand", subtitle = "Blue for excess demand, light cyan for excess supply.") +
geom_sf(data=censusTract, color='black', fill='transparent', size=.8) +
mapTheme()
Mean Absolute Error (MAE) is used to evaluate the accuracy of the models. For pickup and return activity, MAE is 0.378 and 0.468 respectively, so the pickup model performs better than the return model.
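These overall MAEs come directly from the test-set errors computed above; a minimal sketch:
mean(abs(ride.test.s$error), na.rm=T) # pickup MAE, ~0.378
mean(abs(ride.test.e$error), na.rm=T) # return MAE, ~0.468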
MAEs for the month categories are listed below. Our models perform better when predicting low- to medium-demand months such as April and June, and generate larger errors for August: the root problem is that the model is weak at predicting high demand.
ride.test.s %>%
group_by(monthCat) %>%
summarise(MAE = mean(abs(error))) %>%
kable(caption = "MAE (pickup activity)") %>%
kable_styling("striped", full_width = F,position = "center")
monthCat | MAE |
---|---|
High-demand month | 0.5455064 |
Low-demand month | 0.3641645 |
Medium-demand month | 0.4790297 |
ride.test.e %>%
group_by(monthCat) %>%
summarise(MAE = mean(abs(error))) %>%
kable(caption = "MAE (return activity)") %>%
kable_styling("striped", full_width = F,position = "center")
monthCat | MAE |
---|---|
High-demand month | 0.5510436 |
Low-demand month | 0.3614943 |
Medium-demand month | 0.4888506 |
As mentioned above, the months fall into three categories: low demand in March and April, medium demand in May and June, and high demand in July and August. Below are the predicted results for the three scenarios.
grid.arrange(
#low----
ggplot(data=ride.test.s %>%
subset(monthCat=='Low-demand month') %>%
dplyr::select(interval60, GEOID, Trip_Count, pred) %>%
gather(Variable, Value, -interval60, -GEOID) %>%
group_by(Variable, interval60) %>%
summarize(Value = sum(Value)), aes(interval60, Value, colour=Variable)) +
geom_line(size=1.1) +
labs(title = "Pickup in Louisville; Low-demand month", x = "Hour", y= "Station Trips") +
plotTheme,
ggplot(ride.test.e %>%
subset(monthCat=='Low-demand month') %>%
dplyr::select(interval60, GEOID, Trip_Count, pred) %>%
gather(Variable, Value, -interval60, -GEOID) %>%
group_by(Variable, interval60) %>%
summarize(Value = sum(Value)), aes(interval60, Value, colour=Variable)) +
geom_line(size=1.1) +
labs(title = "Return in Louisville; Low-demand month", x = "Hour", y= "Station Trips") +
plotTheme,
#mild----
ggplot(ride.test.s %>%
subset(monthCat=='Medium-demand month') %>%
dplyr::select(interval60, GEOID, Trip_Count, pred) %>%
gather(Variable, Value, -interval60, -GEOID) %>%
group_by(Variable, interval60) %>%
summarize(Value = sum(Value)), aes(interval60, Value, colour=Variable)) +
geom_line(size=1.1) +
labs(title = "Pickup in Louisville; Medium-demand month", x = "Hour", y= "Station Trips") +
plotTheme,
ggplot(ride.test.e %>%
subset(monthCat=='Medium-demand month') %>%
dplyr::select(interval60, GEOID, Trip_Count, pred) %>%
gather(Variable, Value, -interval60, -GEOID) %>%
group_by(Variable, interval60) %>%
summarize(Value = sum(Value)), aes(interval60, Value, colour=Variable)) +
geom_line(size=1.1) +
labs(title = "Return in Louisville; Medium-demand month", x = "Hour", y= "Station Trips") +
plotTheme,
#high----
ggplot(ride.test.s %>%
subset(monthCat=='High-demand month') %>%
dplyr::select(interval60, GEOID, Trip_Count, pred) %>%
gather(Variable, Value, -interval60, -GEOID) %>%
group_by(Variable, interval60) %>%
summarize(Value = sum(Value)), aes(interval60, Value, colour=Variable)) +
geom_line(size=1.1) +
labs(title = "Pickup in Louisville; High-demand month", x = "Hour", y= "Station Trips") +
plotTheme,
ggplot(ride.test.e %>%
subset(monthCat=='High-demand month') %>%
dplyr::select(interval60, GEOID, Trip_Count, pred) %>%
gather(Variable, Value, -interval60, -GEOID) %>%
group_by(Variable, interval60) %>%
summarize(Value = sum(Value)), aes(interval60, Value, colour=Variable)) +
geom_line(size=1.1) +
labs(title = "Return in Louisville; High-demand month", x = "Hour", y= "Station Trips") +
plotTheme,
top='Predicted/Observed scooter time series', ncol=1
)
Then we map the distribution of MAE by census tract. For both the demand and the supply model, the larger errors come from the downtown and university areas, where scooter demand is most intensive.
meanError.s <- ride.test.s %>%
group_by(GEOID) %>%
summarise(MAE = mean(abs(error))) %>%
left_join(censusTract %>% st_transform(4326), by='GEOID') %>%
st_sf()
meanError.e <- ride.test.e %>%
group_by(GEOID) %>%
summarise(MAE = mean(abs(error))) %>%
left_join(censusTract %>% st_transform(4326), by='GEOID') %>%
st_sf()
grid.arrange(
ggplot() +
geom_sf(data=studyArea,fill='transparent')+
geom_sf(data=meanError.s,aes(fill = MAE)) +
scale_fill_gradient(low='#bdebda',high='#0076be')+
labs(title='MAE by census tract, pickup')+
mapTheme(),
ggplot() +
geom_sf(data=studyArea,fill='transparent')+
geom_sf(data=meanError.e,aes(fill = MAE)) +
scale_fill_gradient(low='#bdebda',high='#0076be')+
labs(title='MAE by census tract, return')+
mapTheme(), ncol=2
)
A further generalizability test is conducted. Race data are used here to characterize differences in Louisville's urban context: tracts where more than 50% of the population is white are labeled 'Majority White'.
Although there are some missing values in the census data, the results still give us insight into whether the model generalizes to different urban contexts.
Although averaging across positive and negative errors is not the best indicator of accuracy (MAE is better), we use the raw error here to see whether we over- or under-predict.
LVCensus_sa <- mutate(LVCensus %>% st_transform(4326),
raceContext = ifelse(Percent_White > 0.5, "Majority_White", "Majority_Non_White")) %>%
.[censusTract,]
ggplot() +
geom_sf(data = LVCensus_sa, aes(fill = raceContext)) +
geom_sf(data=studyArea, fill='transparent', color='white', size=1)+
scale_fill_manual(values = c('#7fcdbb','#2c7fb8'),
name="Race Context") +
labs(title = "Race Context", name="Race Context") +
mapTheme()
Given the small difference in error between 'Majority White' and 'Majority Non-White' census tracts, we can conclude that our model generalizes well with respect to racial context. The model slightly underpredicts in both contexts, a little more so in majority-White tracts.
ct_race <- left_join(LVCensus_sa %>% st_set_geometry(NULL), ride.test.s, by='GEOID')
ct_race %>%
na.omit() %>%
group_by(raceContext) %>%
summarize(Error = mean(error)) %>%
kable(caption = "Error (by race context)") %>%
kable_styling("striped", full_width = F,position = "center")
raceContext | Error |
---|---|
Majority_Non_White | -0.0079663 |
Majority_White | -0.0175292 |
Recall that we built two models to predict scooter demand and supply for each census tract. However, such a large spatial unit may weaken the model's effectiveness and make it less useful in practice. To get more detailed supply-demand information within a census tract, we focus on the downtown census tract as an example, where MAE is highest and scooters are considerably undersupplied overall.
tract <- subset(censusTract,GEOID == '21111004900')
scooterData900_start <- st_intersection(scooterData38_start %>%
st_as_sf(.,coords = c("StartLongitude", "StartLatitude"), crs = 4326),
tract)
scooterData900_end <- st_intersection(scooterData38_end %>%
st_as_sf(.,coords = c("EndLongitude", "EndLatitude"), crs = 4326),
tract)
Based on the census tract boundary, we create a fishnet with a cell size of 0.003 × 0.003 degrees. A uniqueID is generated for each cell for spatial joins, and trip points are projected onto the fishnet. Meanwhile, cvIDs are created for later cross-validation.
fishnet <-
st_make_grid(tract, cellsize = 0.003) %>%
st_sf()
fishnet <-
fishnet[tract,] %>%
mutate(uniqueID = as.character(rownames(.)),
cvID = sample(round(nrow(fishnet) / 10), size=nrow(.), replace = TRUE)) %>%
dplyr::select(uniqueID,cvID)
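A quick plot to verify that the grid covers the tract (a sketch using the map theme defined earlier):
ggplot() +
geom_sf(data=fishnet, fill='transparent', colour='grey50') +
geom_sf(data=tract, fill='transparent', colour='black', size=1) +
mapTheme()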
As before, we project the point data onto the fishnet.
Data_in_fishnet_s <- st_join(scooterData900_start, fishnet, st_intersects)
Data_in_fishnet_s <- Data_in_fishnet_s %>%
st_set_geometry(NULL) %>%
dplyr::select(-cvID) %>%
left_join(.,fishnet, by='uniqueID') %>%
st_sf()
Data_in_fishnet_e <- st_join(scooterData900_end, fishnet, st_intersects)
Data_in_fishnet_e <- Data_in_fishnet_e %>%
st_set_geometry(NULL) %>%
dplyr::select(-cvID) %>%
left_join(.,fishnet, by='uniqueID') %>%
st_sf()
Then, two space-time panels are created based on start and end times for each grid cell in every hour. Since the feature engineering and time-lag steps closely mirror the previous model, their description is omitted.
study.panel.start.FN <-
expand.grid(interval60=unique(Data_in_fishnet_s$interval60),
uniqueID = unique(Data_in_fishnet_s$uniqueID))
study.panel.end.FN <-
expand.grid(interval60=unique(Data_in_fishnet_e$interval60),
uniqueID = unique(Data_in_fishnet_e$uniqueID))
ride.panel.start.FN <-
Data_in_fishnet_s %>%
mutate(Trip_Counter = 1) %>%
right_join(study.panel.start.FN) %>%
st_set_geometry(NULL) %>%
group_by(interval60, uniqueID) %>%
summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
left_join(fishnet,by='uniqueID') %>%
left_join(weather.Panel, by='interval60') %>%
ungroup() %>%
mutate(week = week(interval60),
dotw = wday(interval60, label = TRUE),
Preci = case_when(Precipitation<0.001~"No rain",
Precipitation >= 0.001~"Wet"),
Wind = case_when(Wind_Speed<=3~"Light",
Wind_Speed> 3 & Wind_Speed <= 6~"Moderate",
Wind_Speed > 6 & Wind_Speed < 9~"Gale",
Wind_Speed >= 9~"Stormy"),
time_uniqueID=paste(as.character(interval60),uniqueID))
ride.panel.end.FN <-
Data_in_fishnet_e %>%
mutate(Trip_Counter = 1) %>%
right_join(study.panel.end.FN) %>%
st_set_geometry(NULL) %>%
group_by(interval60, uniqueID) %>%
summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
left_join(fishnet,by='uniqueID') %>%
left_join(weather.Panel, by='interval60') %>%
ungroup() %>%
mutate(week = week(interval60),
dotw = wday(interval60, label = TRUE),
Preci = case_when(Precipitation < 0.001~"No rain",
Precipitation >= 0.001~"Wet"),
Wind = case_when(Wind_Speed <= 3~"Light",
Wind_Speed > 3 & Wind_Speed <= 6~"Moderate",
Wind_Speed > 6 & Wind_Speed < 9~"Gale",
Wind_Speed >= 9~"Stormy"),
time_uniqueID=paste(as.character(interval60),uniqueID))
ride.panel.start.FN <-
ride.panel.start.FN %>%
arrange(uniqueID, interval60) %>%
mutate(lagHour = dplyr::lag(Trip_Count,1),
lag2Hours = dplyr::lag(Trip_Count,2),
lag3Hours = dplyr::lag(Trip_Count,3),
lag4Hours = dplyr::lag(Trip_Count,4),
lag12Hours = dplyr::lag(Trip_Count,12),
lag1day = dplyr::lag(Trip_Count,24),
holiday = ifelse(yday(interval60) == 147|yday(interval60) == 185,1,0), # Memorial Day (day 147) and Independence Day (day 185)
weekday = ifelse(dotw != 'Sun'&dotw!='Sat',1,0)) %>%
mutate(day = yday(interval60)) %>%
mutate(holidayLag = case_when(dplyr::lag(holiday, 1) == 1 ~ "PlusOneDay",
dplyr::lag(holiday, 2) == 1 ~ "PlusTwoDays",
dplyr::lag(holiday, 3) == 1 ~ "PlusThreeDays",
dplyr::lead(holiday, 1) == 1 ~ "MinusOneDay",
dplyr::lead(holiday, 2) == 1 ~ "MinusTwoDays",
dplyr::lead(holiday, 3) == 1 ~ "MinusThreeDays"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"),
monthCat = case_when(month(interval60)<5 | month(interval60)>9 ~ 'Low-demand month',
month(interval60)==6|month(interval60)==5 ~ 'Medium-demand month',
month(interval60)==7|month(interval60)==8 ~ 'High-demand month'),
events = ifelse(yday(interval60) %in% c(76,97,103,117,124,165,185,227,236),1,0), # dates of city events and festivals
holidayLag = replace_na(holidayLag, '0'))
ride.panel.end.FN <-
ride.panel.end.FN %>%
arrange(uniqueID, interval60) %>%
mutate(lagHour = dplyr::lag(Trip_Count,1),
lag2Hours = dplyr::lag(Trip_Count,2),
lag3Hours = dplyr::lag(Trip_Count,3),
lag4Hours = dplyr::lag(Trip_Count,4),
lag12Hours = dplyr::lag(Trip_Count,12),
lag1day = dplyr::lag(Trip_Count,24),
holiday = ifelse(yday(interval60) == 147|yday(interval60) == 185,1,0), # Memorial Day (day 147) and Independence Day (day 185)
weekday = ifelse(dotw != 'Sun'&dotw!='Sat',1,0)) %>%
mutate(day = yday(interval60)) %>%
mutate(holidayLag = case_when(dplyr::lag(holiday, 1) == 1 ~ "PlusOneDay",
dplyr::lag(holiday, 2) == 1 ~ "PlusTwoDays",
dplyr::lag(holiday, 3) == 1 ~ "PlusThreeDays",
dplyr::lead(holiday, 1) == 1 ~ "MinusOneDay",
dplyr::lead(holiday, 2) == 1 ~ "MinusTwoDays",
dplyr::lead(holiday, 3) == 1 ~ "MinusThreeDays"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"),
monthCat = case_when(month(interval60)<5 | month(interval60)>9 ~ 'Low-demand month',
month(interval60)==6|month(interval60)==5 ~ 'Medium-demand month',
month(interval60)==7|month(interval60)==8 ~ 'High-demand month'),
events = ifelse(yday(interval60) %in% c(76,97,103,117,124,165,185,227,236),1,0), # dates of city events and festivals
holidayLag = replace_na(holidayLag, '0'))
The training/testing split and the predictors are the same as in the previous model and are omitted here. The only difference is that the fishnet uniqueID now accounts for the spatial effects.
ride.train.s.FN <- subset(ride.panel.start.FN, month(interval60) ==3 | month(interval60) == 5|month(interval60) == 7)
ride.test.s.FN <- subset(ride.panel.start.FN, month(interval60) ==4 | month(interval60) == 6|month(interval60) == 8)
ride.train.e.FN <- subset(ride.panel.end.FN, month(interval60) ==3 | month(interval60) == 5|month(interval60) == 7)
ride.test.e.FN <- subset(ride.panel.end.FN, month(interval60) ==4 | month(interval60) == 6|month(interval60) == 8)
reg.s.FN <-
lm(Trip_Count ~ uniqueID + monthCat + time_of_day + events + Temperature + Wind_Speed + Preci + lagHour + lag2Hours + lag3Hours + lag4Hours + lag1day + holiday + holidayLag + weekday,
data=ride.train.s.FN)
reg.e.FN <-
lm(Trip_Count ~ uniqueID + monthCat + time_of_day + events + Temperature + Wind_Speed + Preci + lagHour + lag2Hours + lag3Hours + lag4Hours + lag1day + holiday + holidayLag + weekday,
data=ride.train.e.FN)
ride.test.s.FN$pred <- round(predict(reg.s.FN, newdata = ride.test.s.FN))
ride.test.s.FN$error <- ride.test.s.FN$pred - ride.test.s.FN$Trip_Count
ride.test.e.FN$pred <- round(predict(reg.e.FN, newdata = ride.test.e.FN))
ride.test.e.FN$error <- ride.test.e.FN$pred - ride.test.e.FN$Trip_Count
As before, we focus on detecting the real demand (pickup) versus supply (return) situation. From the result we find that, on average, the tract needs about 5 rebalancings per day, and the model's error is 2, meaning we over- or under-predict about 2 rebalancings per grid cell on average. Based on the map below, scooter demand and supply in the city center are poorly balanced.
return_by_FN <- ride.test.e.FN %>%
mutate(day=yday(interval60)) %>%
group_by(uniqueID, day) %>%
summarise(return = sum(Trip_Count)) %>%
right_join(fishnet)
pickup_by_FN <- ride.test.s.FN %>%
mutate(day=yday(interval60)) %>%
group_by(uniqueID, day) %>%
summarise(pickup = sum(Trip_Count)) %>%
right_join(fishnet)
return_by_FN$subtraction <- pickup_by_FN$pickup - return_by_FN$return
return_by_FN_pred <- ride.test.e.FN %>%
mutate(day=yday(interval60)) %>%
group_by(uniqueID,day) %>%
summarise(return = sum(pred)) %>%
right_join(fishnet)
pickup_by_FN_pred <- ride.test.s.FN %>%
mutate(day=yday(interval60)) %>%
group_by(uniqueID, day) %>%
summarise(pickup = sum(pred)) %>%
right_join(fishnet)
return_by_FN$subtraction_pred <- pickup_by_FN_pred$pickup - return_by_FN_pred$return
return_by_FN$error_sub <- return_by_FN$subtraction - return_by_FN$subtraction_pred
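As at the tract level, the cell-level figures cited above can be checked from these columns (a sketch):
mean(abs(return_by_FN$subtraction), na.rm=T) # average daily net flow per cell
mean(abs(return_by_FN$error_sub), na.rm=T) # average rebalancing prediction error (~2)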
ggplot() +
geom_sf(data=na.omit(return_by_FN) %>% st_sf(),aes(fill = subtraction_pred), legend=T) +
scale_fill_gradient(low='#bdebda',high='#0076be', name="Demand - Supply")+
labs(title = "Excess supply and excess demand", subtitle = "Blue for excess demand, light cyan for excess supply.") +
geom_sf(data=tract, color='black', fill='transparent', size=1) +
mapTheme()
For pickup and return activity, MAE is 0.700 and 0.649 respectively, so here the return model performs slightly better than the pickup model.
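A minimal sketch of the overall MAE calculation for the fishnet models:
mean(abs(ride.test.s.FN$error), na.rm=T) # pickup MAE, ~0.700
mean(abs(ride.test.e.FN$error), na.rm=T) # return MAE, ~0.649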
Both models again perform better when predicting pickup and return activity in low- to medium-demand months; comparing MAEs across the categories, our model is most reliable when predicting scooter activity in low-usage months.
ride.test.s.FN %>%
group_by(monthCat) %>%
summarise(mae = mean(abs(error))) %>%
kable(caption = "MAE (pickup activity)") %>%
kable_styling("striped", full_width = F,position = "center")
monthCat | mae |
---|---|
High-demand month | 0.7446504 |
Low-demand month | 0.4195220 |
Medium-demand month | 0.6484296 |
ride.test.e.FN %>%
group_by(monthCat) %>%
summarise(mae = mean(abs(error))) %>%
kable(caption = "MAE (return activity)") %>%
kable_styling("striped", full_width = F,position = "center")
monthCat | mae |
---|---|
High-demand month | 0.6930751 |
Low-demand month | 0.3832812 |
Medium-demand month | 0.5984686 |
For more detail, we plot the predicted versus observed scooter time series and find that the model fails to capture the high demand during the daytime.
grid.arrange(
ride.test.s.FN %>%
subset(monthCat=='Low-demand month') %>%
dplyr::select(interval60, uniqueID, Trip_Count, pred, monthCat) %>%
gather(Variable, Value, -interval60, -uniqueID, -monthCat) %>%
group_by(Variable, interval60, monthCat) %>%
summarize(Value = sum(Value)) %>%
ggplot(aes(interval60, Value, colour=Variable)) +
geom_line(size=1.1) +
labs(title = "Pickup in Louisville; Low-demand month", x = "Hour", y= "Station Trips") +
plotTheme,
ride.test.e.FN %>%
subset(monthCat=='Low-demand month') %>%
dplyr::select(interval60, uniqueID, Trip_Count, pred) %>%
gather(Variable, Value, -interval60, -uniqueID) %>%
group_by(Variable, interval60) %>%
summarize(Value = sum(Value)) %>%
ggplot(aes(interval60, Value, colour=Variable)) +
geom_line(size=1.1) +
labs(title = "Return in Louisville; Low-demand month", x = "Hour", y= "Station Trips") +
plotTheme,
ride.test.s.FN %>%
subset(monthCat=='Medium-demand month') %>%
dplyr::select(interval60, uniqueID, Trip_Count, pred, monthCat) %>%
gather(Variable, Value, -interval60, -uniqueID, -monthCat) %>%
group_by(Variable, interval60, monthCat) %>%
summarize(Value = sum(Value)) %>%
ggplot(aes(interval60, Value, colour=Variable)) +
geom_line(size=1.1) +
labs(title = "Pickup in Louisville; Medium-demand month", x = "Hour", y= "Station Trips") +
plotTheme,
ride.test.e.FN %>%
subset(monthCat=='Medium-demand month') %>%
dplyr::select(interval60, uniqueID, Trip_Count, pred) %>%
gather(Variable, Value, -interval60, -uniqueID) %>%
group_by(Variable, interval60) %>%
summarize(Value = sum(Value)) %>%
ggplot(aes(interval60, Value, colour=Variable)) +
geom_line(size=1.1) +
labs(title = "Return in Louisville; Medium-demand month", x = "Hour", y= "Station Trips") +
plotTheme,
ride.test.s.FN %>%
subset(monthCat=='High-demand month') %>%
dplyr::select(interval60, uniqueID, Trip_Count, pred) %>%
gather(Variable, Value, -interval60, -uniqueID) %>%
group_by(Variable, interval60) %>%
summarize(Value = sum(Value)) %>%
ggplot(aes(interval60, Value, colour=Variable)) +
geom_line(size=1.1) +
labs(title = "Pickup in Louisville; High-demand month", x = "Hour", y= "Station Trips") +
plotTheme,
ride.test.e.FN %>%
subset(monthCat=='High-demand month') %>%
dplyr::select(interval60, uniqueID, Trip_Count, pred) %>%
gather(Variable, Value, -interval60, -uniqueID) %>%
group_by(Variable, interval60) %>%
summarize(Value = sum(Value)) %>%
ggplot(aes(interval60, Value, colour=Variable)) +
geom_line(size=1.1) +
labs(title = "Return in Louisville; High-demand month", x = "Hour", y= "Station Trips") +
plotTheme, top='Predicted/Observed scooter time series', ncol=1)
Both models perform better in low-demand areas and fail to account for the high demand in the center of the census tract.
meanError.s.FN <- ride.test.s.FN %>%
group_by(uniqueID) %>%
summarise(MAE = mean(abs(error))) %>%
left_join(fishnet, by='uniqueID') %>%
st_sf()
meanError.e.FN <- ride.test.e.FN %>%
group_by(uniqueID) %>%
summarise(MAE = mean(abs(error))) %>%
left_join(fishnet, by='uniqueID') %>%
st_sf()
grid.arrange(
ggplot() +
geom_sf(data=meanError.s.FN %>%st_sf(),aes(fill = MAE)) +
geom_sf(data=tract,fill='transparent',colour='black', size=1) +
scale_fill_gradient(low='#bdebda',high='#0076be')+
labs(title='MAE by fishnet, pickup')+
mapTheme(),
ggplot() +
geom_sf(data=meanError.e.FN,aes(fill = MAE)) +
geom_sf(data=tract,fill='transparent',colour='black', size=1) +
scale_fill_gradient(low='#bdebda',high='#0076be')+
labs(title='MAE by fishnet, return')+
mapTheme(), ncol=2
)
From what has been discussed above, the model under-predicts periods of high demand but does capture the spatio-temporal pattern of scooter activity. With the prediction results, a simple calculation tells us more about excess demand and supply across the city. However, this model is far from mature, and here are several steps to take in the future:
A more advanced prediction model (see the sketch after this list);
Additional predictive factors;
A more appropriate spatial unit.
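For instance, because Trip_Count is a count variable, one natural refinement (a sketch we have not estimated in this report) is to swap the OLS specification for a Poisson regression with the same predictors:
reg.s.pois <-
glm(Trip_Count ~ GEOID + monthCat + time_of_day + events + Temperature + Wind_Speed + Preci + lagHour + lag2Hours + lag3Hours + lag4Hours + lag1day + holiday + holidayLag + weekday,
family = poisson(link = "log"), data = ride.train.s)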
Based on our prediction results, here are some suggestions we think would help the four providers:
Virtual docks. Another problem with scooter share is that some users park scooters in the wrong places. Virtual docks would not only solve this problem but also make the fleets easier for the four companies to manage; with these relatively fixed "stations", predictions would also be more accurate.
Profit analysis. Based on the prediction results, we calculate the net revenue if this rebalancing system operated normally and every scooter demand were met. We conclude that 91,383 more trips would have been completed in the test months of April, June and August. According to industry economics, each ride generates about $0.95 after deducting the costs of charging, repairs, credit card fees, customer support and insurance. The four companies would therefore collect a total additional profit of roughly $86,814.
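The back-of-the-envelope arithmetic behind that figure (the trip count and per-ride margin are the values cited above):
unmet_trips <- 91383 # additional trips completed if all demand were met
margin_per_ride <- 0.95 # net revenue per ride, in dollars
unmet_trips * margin_per_ride # roughly $86,814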