Four dockless vehicle operators (Bird, Lime, Bolt, and Spin) have launched operations in the city of Louisville since August 2018. Their app-based rental scooters are designed to carry individuals over the last mile or so of their commutes across the city. More than a year of operation has exposed several problems:
Shortages of scooters often occur where demand is high, inconveniencing users and costing companies ridership;
Where there are few potential users, scooters are oversupplied and pile up, and depreciation without any revenue sharply cuts the rate of return;
Because the system is dockless, riders commonly park scooters in the wrong places, producing unsightly streetscapes and traffic chaos.
One of the biggest operational challenges of a dockless scooter system is getting vehicles to areas that are anticipated to have high demand but currently have no scooters. In this report, we predict scooter supply and demand from existing trip data. By detecting the real demand-supply situation, we hope this model can support:
Better decision making for future investment, since Louisville is now allowing more companies to enter the market and scooter companies are considering doubling their fleets;
A successful rebalancing system with larger market share and greater potential benefits, including a concrete estimate of the money that could be saved with more detailed preparation;
A win-win outcome for both providers and users, through higher public transportation accessibility and less reliance on personal cars;
Suggestions for virtual dock siting, if needed in the future.
For more about our app Sco-Porter, here is a video!
Before the analysis, some libraries are loaded and some graphic themes are set.
library(tidyverse)
library(sf)
library(lubridate)
library(tigris)
library(tidycensus)
library(viridis)
library(riem)
library(gridExtra)
library(knitr)
library(kableExtra)
library(RSocrata)
library(dplyr)
library(prob)
library(FNN)
library(osmdata)
library(gganimate)
library(gifski)
library(png)
plotTheme <- theme(
  plot.title = element_text(size = 15),
  plot.subtitle = element_text(size = 8),
  plot.caption = element_text(size = 6),
  axis.text.x = element_text(size = 10, hjust = 1),
  axis.text.y = element_text(size = 10),
  axis.title.y = element_text(size = 10),
  # Set the entire chart region to blank
  panel.background = element_blank(),
  plot.background = element_blank(),
  #panel.border = element_rect(colour = "#F0F0F0"),
  # Format the grid
  panel.grid.major = element_line(colour = "#D0D0D0", size = .2),
  axis.ticks = element_blank())
mapTheme <- function(base_size = 12) {
  theme(
    text = element_text(color = "black"),
    plot.title = element_text(size = 14, colour = "black"),
    plot.subtitle = element_text(face = "italic"),
    plot.caption = element_text(hjust = 0),
    axis.ticks = element_blank(),
    panel.background = element_blank(),
    axis.title = element_blank(),
    axis.text = element_blank(),
    panel.grid.minor = element_blank(),
    panel.border = element_rect(colour = "black", fill = NA, size = 2)
  )
}
We compiled our dataset from several sources. The study area is defined by Louisville's dockless vehicle service area shapefile:
studyArea <- read_sf("C:/Users/sumyeem/Downloads/Dockless Vehicle Service Area/Dockless_Vehicle_Service_Area.shp") %>%
st_transform(4326)
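The trip records themselves, used below as the scooters data frame, come from Louisville Metro's open dockless vehicle trip dataset, which carries the StartDate/StartTime, coordinate, DayOfWeek and HourNum fields used throughout this report. A minimal load sketch, assuming a local CSV export at a placeholder path:
#scooter trip records (placeholder path; fields per Louisville's open dockless trip data)
scooters <- read.csv("data/DocklessTripOpenData.csv", stringsAsFactors = FALSE)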
Holiday and event data are collected from web resources;
Census tract data are obtained with the tidycensus package. The tract boundary shapefile is downloaded from the TIGER/Line Shapefiles dataset and has been clipped to fit the study area.
#census data
LVCensus <-
get_acs(geography = "tract",
variables = c("B01003_001", "B19013_001",
"B02001_002", "B08013_001",
"B08012_001", "B08301_001",
"B08301_010", "B01002_001",
"B08014_001", "B08014_002"),
year = 2017,
state = "KY",
geometry = TRUE,
county=c("Jefferson"),
output = "wide") %>%
rename(Total_Pop = B01003_001E,
Med_Inc = B19013_001E,
Med_Age = B01002_001E,
White_Pop = B02001_002E,
Vehicle_own = B08014_001E,
No_vehicle = B08014_002E,
Travel_Time = B08013_001E,
Num_Commuters = B08012_001E,
Means_of_Transport = B08301_001E,
Total_Public_Trans = B08301_010E) %>%
select(Total_Pop, Med_Inc, White_Pop, Travel_Time,
Means_of_Transport, Total_Public_Trans,
Med_Age,Vehicle_own,No_vehicle,
GEOID, geometry) %>%
mutate(Percent_White = White_Pop / Total_Pop,
Mean_Commute_Time = Travel_Time / Total_Public_Trans,
Percent_Taking_Public_Trans = Total_Public_Trans / Means_of_Transport,
Percent_vehicle_available = 1-No_vehicle/Vehicle_own)
#census tract boundary
censusTract <- read_sf("E:/GRAD/MUSA507/finalProject/censusTratcs.shp") %>%
st_transform(4326)
Weather data are collected with the riem package, including temperature, precipitation and wind speed; station 'LOU' is selected here:
weather.Panel <-
riem_measures(station = "LOU", date_start = "2019-03-01", date_end = "2019-09-01") %>%
replace(is.na(.), 0) %>%
mutate(interval60 = ymd_h(substr(valid,1,13))) %>%
mutate(week = week(interval60),
dotw = wday(interval60, label=TRUE)) %>%
group_by(interval60) %>%
summarize(Temperature = max(tmpf),
Precipitation = sum(p01i),
Wind_Speed = max(sknt)) %>%
mutate(Temperature = ifelse(Temperature == 0, 42, Temperature)) # impute zero (missing) readings with a plausible temperature
#plot
grid.arrange(
ggplot(weather.Panel, aes(interval60,Precipitation)) + geom_line() +
labs(title="Percipitation", x="Hour", y="Perecipitation") + plotTheme,
ggplot(weather.Panel, aes(interval60,Wind_Speed)) + geom_line() +
labs(title="Wind Speed", x="Hour", y="Wind Speed") + plotTheme,
ggplot(weather.Panel, aes(interval60,Temperature)) + geom_line() +
labs(title="Temperature", x="Hour", y="Temperature") + plotTheme,
top="Weather Data - Louisville - March to August, 2019")
Note that this is the final dataset, narrowed down to a set of best predictors through exploratory analysis and the modeling process. That process included various transformations of the data to optimize the predictive ability of each variable. For the time dimension, we labeled the start and end time of each trip and binned them into hourly intervals.
#Start
scooterData <- st_intersection(scooters %>%
st_as_sf(.,coords = c("StartLongitude", "StartLatitude"), crs = 4326),
studyArea)
#delete the NA row
scooterData <- scooterData[-1,]
scooterData$StartDateTime <- paste(scooterData$StartDate,scooterData$StartTime,sep=' ')
scooterData$StartDateTime <- as.POSIXct(scooterData$StartDateTime,format = '%Y-%m-%d %H:%M')
scooterData <- mutate(scooterData,
interval60 = floor_date(ymd_hms(StartDateTime), unit = "hour"),
interval15 = floor_date(ymd_hms(StartDateTime), unit = "15 mins"))
#End
scooterData_end <- st_intersection(scooters %>%
st_as_sf(.,coords = c("EndLongitude", "EndLatitude"), crs = 4326),
studyArea)
scooterData_end <- scooterData_end[-1,]
scooterData_end$EndDateTime <- paste(scooterData_end$EndDate,scooterData_end$EndTime,sep=' ')
scooterData_end$EndDateTime <- as.POSIXct(scooterData_end$EndDateTime,format = '%Y-%m-%d %H:%M')
scooterData_end <- mutate(scooterData_end,
interval60 = floor_date(ymd_hms(EndDateTime), unit = "hour"),
interval15 = floor_date(ymd_hms(EndDateTime), unit = "15 mins"))
We will focus on trips generated from March to August.
scooterData38_start <- subset(scooterData, year(interval60)==2019&month(interval60)>2 & month(interval60)<9)
scooterData38_start <- mutate(scooterData38_start, StartLongitude = unlist(map(geometry,1)),
StartLatitude = unlist(map(geometry,2)))
scooterData38_end <- subset(scooterData_end, year(interval60)==2019&month(interval60)>2 & month(interval60)<9)
scooterData38_end <- mutate(scooterData38_end, EndLongitude = unlist(map(geometry,1)),
EndLatitude = unlist(map(geometry,2)))
Census tracts are used as the spatial unit to start with, and we project scooter trips onto the tracts. In this way, trip data are aggregated from points to polygons.
scooterData_in_ct_s <- st_join(scooterData38_start, censusTract %>% select(GEOID),st_within, left=FALSE)
scooterData_in_ct_s <- scooterData_in_ct_s %>%
st_set_geometry(NULL) %>%
merge(.,censusTract %>% select(GEOID, geometry), by='GEOID') %>%
st_sf()
scooterData_in_ct_e <- st_join(scooterData38_end, censusTract %>% select(GEOID),st_within, left=FALSE)
scooterData_in_ct_e <- scooterData_in_ct_e %>%
st_set_geometry(NULL) %>%
merge(.,censusTract %>% select(GEOID, geometry), by='GEOID') %>%
st_sf()
Here, two space-time panels are created based on the labeled start and end times. The study.panel is like an empty table in which each row records scooter activity for one census tract in one hour.
study.panel.start <-
expand.grid(interval60=unique(scooterData_in_ct_s$interval60),
GEOID = unique(scooterData_in_ct_s$GEOID))
study.panel.end <-
expand.grid(interval60=unique(scooterData_in_ct_e$interval60),
GEOID = unique(scooterData_in_ct_e$GEOID))
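As a quick sanity check (a sketch using the objects above), each panel should contain exactly one row per census tract per hourly interval:
stopifnot(nrow(study.panel.start) ==
length(unique(scooterData_in_ct_s$interval60)) * length(unique(scooterData_in_ct_s$GEOID)))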
Two full panels are then created by summarizing trip counts by census tract for each time interval. Weather data are joined to the panels, and census tract data are added as spatial context.
ride.panel.start <-
scooterData_in_ct_s %>%
mutate(Trip_Counter = 1) %>%
right_join(study.panel.start) %>%
st_set_geometry(NULL) %>%
group_by(interval60, GEOID) %>%
summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
left_join(censusTract,by='GEOID') %>%
left_join(weather.Panel, by='interval60') %>%
ungroup() %>%
mutate(week = week(interval60),
dotw = wday(interval60, label = TRUE),
Preci = case_when(Precipitation<0.001~"No rain",Precipitation>=0.001~"Wet"),
Wind = case_when(Wind_Speed<=3~"Light", Wind_Speed> 3 & Wind_Speed <= 6~"Moderate", Wind_Speed > 6& Wind_Speed < 9~"Gale",
Wind_Speed >= 9~"Stormy"),
time_GEOID=paste(as.character(interval60),GEOID))
ride.panel.end <-
scooterData_in_ct_e %>%
mutate(Trip_Counter = 1) %>%
right_join(study.panel.end) %>%
st_set_geometry(NULL) %>%
group_by(interval60, GEOID) %>%
summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
left_join(censusTract,by='GEOID') %>%
left_join(weather.Panel, by='interval60') %>%
ungroup() %>%
mutate(week = week(interval60),
dotw = wday(interval60, label = TRUE),
Preci = case_when(Precipitation<0.001~"No rain",Precipitation>=0.001~"Wet"),
Wind = case_when(Wind_Speed<=3~"Light", Wind_Speed>3&Wind_Speed<=6~"Moderate",Wind_Speed>6&Wind_Speed<9~"Gale",
Wind_Speed>=9~"Stormy"),
time_GEOID=paste(as.character(interval60),GEOID))
Time lags are added to help us understand how scooter trips differ across time slots. In addition, a dummy variable holidayLag is created to control for the influence that three-day weekends have on demand. Two holidays fall in the selected period, Memorial Day on May 27th and Independence Day on July 4th, so the holiday lag should be considered. We also divide the six months into high-, medium- and low-demand categories based on monthly trip volume, labeled in the monthCat column.
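For reference, the monthly volumes behind this grouping can be tallied directly from the trip data (a quick sketch; st_drop_geometry just speeds up the count):
scooterData38_start %>%
st_drop_geometry() %>%
count(month = month(interval60))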
#start----
ride.panel.start <-
ride.panel.start %>%
arrange(GEOID, interval60) %>%
mutate(lagHour = dplyr::lag(Trip_Count,1),
lag2Hours = dplyr::lag(Trip_Count,2),
lag3Hours = dplyr::lag(Trip_Count,3),
lag4Hours = dplyr::lag(Trip_Count,4),
lag12Hours = dplyr::lag(Trip_Count,12),
lag1day = dplyr::lag(Trip_Count,24),
holiday = ifelse(yday(interval60) == 147|yday(interval60) == 185,1,0), # Memorial Day (day 147) and Independence Day (day 185)
weekday = ifelse(dotw != 'Sun' & dotw!='Sat',1,0),
events = ifelse(yday(interval60) %in% c(76,97,103,117,124,165,185,227,236),1,0), # dates of city events and festivals
day = yday(interval60)) %>%
mutate(holidayLag = case_when(dplyr::lag(holiday, 1) == 1 ~ "PlusOneDay",
dplyr::lag(holiday, 2) == 1 ~ "PlusTwoDays",
dplyr::lag(holiday, 3) == 1 ~ "PlusThreeDays",
dplyr::lead(holiday, 1) == 1 ~ "MinusOneDay",
dplyr::lead(holiday, 2) == 1 ~ "MinusTwoDays",
dplyr::lead(holiday, 3) == 1 ~ "MinusThreeDays"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"),
monthCat = case_when(month(interval60)<5 | month(interval60)>9 ~ 'Low-demand month',
month(interval60)==6|month(interval60)==5 ~ 'Medium-demand month',
month(interval60)==7|month(interval60)==8 ~ 'High-demand month'),
holidayLag = replace_na(holidayLag, '0'))
#end----
ride.panel.end <-
ride.panel.end %>%
arrange(GEOID, interval60) %>%
mutate(lagHour = dplyr::lag(Trip_Count,1),
lag2Hours = dplyr::lag(Trip_Count,2),
lag3Hours = dplyr::lag(Trip_Count,3),
lag4Hours = dplyr::lag(Trip_Count,4),
lag12Hours = dplyr::lag(Trip_Count,12),
lag1day = dplyr::lag(Trip_Count,24),
holiday = ifelse(yday(interval60) == 147|yday(interval60) ==185,1,0),
weekday = ifelse(dotw != 'Sun' & dotw!='Sat',1,0),
events = ifelse(yday(interval60) %in% c(76,97,103,117,124,165,185,227,236),1,0), # dates of city events and festivals
day = yday(interval60)) %>%
mutate(holidayLag = case_when(dplyr::lag(holiday, 1) == 1 ~ "PlusOneDay",
dplyr::lag(holiday, 2) == 1 ~ "PlusTwoDays",
dplyr::lag(holiday, 3) == 1 ~ "PlusThreeDays",
dplyr::lead(holiday, 1) == 1 ~ "MinusOneDay",
dplyr::lead(holiday, 2) == 1 ~ "MinusTwoDays",
dplyr::lead(holiday, 3) == 1 ~ "MinusThreeDays"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"),
monthCat = case_when(month(interval60)<5 | month(interval60)>9 ~ 'Low-demand month',
month(interval60)==6|month(interval60)==5 ~ 'Medium-demand month',
month(interval60)==7|month(interval60)==8 ~ 'High-demand month'),
holidayLag = replace_na(holidayLag, '0'))
The correlation between each of these time lags and the trip count is evaluated. Pearson's R is at or above roughly 0.5 for lagHour, lag2Hours, lag3Hours, lag4Hours and lag1day, indicating strong correlations, so we take these lags into account.
plotData.lag.start <-
as.data.frame(ride.panel.start) %>%
group_by(interval60) %>%
summarise_at(vars(starts_with("lag"), "Trip_Count"), mean, na.rm = TRUE) %>%
gather(Variable, Value, -interval60, -Trip_Count) %>%
mutate(Variable = factor(Variable, levels=c("lagHour","lag2Hours","lag3Hours","lag4Hours",
"lag12Hours","lag1day")))
correlation.lag.start <-
plotData.lag.start %>%
group_by(Variable) %>%
summarize(correlation = round(cor(Value, Trip_Count),2))
correlation.lag.start %>%
kable(caption = "Pearson's R (Pickup)") %>%
kable_styling("striped", full_width = F,position = "center")
Variable | correlation |
---|---|
lagHour | 0.92 |
lag2Hours | 0.82 |
lag3Hours | 0.69 |
lag4Hours | 0.56 |
lag12Hours | -0.31 |
lag1day | 0.73 |
plotData.lag.end <-
as.data.frame(ride.panel.end) %>%
group_by(interval60) %>%
summarise_at(vars(starts_with("lag"), "Trip_Count"), mean, na.rm = TRUE) %>%
gather(Variable, Value, -interval60, -Trip_Count) %>%
mutate(Variable = factor(Variable, levels=c("lagHour","lag2Hours","lag3Hours","lag4Hours",
"lag12Hours","lag1day")))
correlation.lag.end <-
plotData.lag.end %>%
group_by(Variable) %>%
summarize(correlation = round(cor(Value, Trip_Count),2))
correlation.lag.end %>%
kable(caption = "Pearson's R (Return)") %>%
kable_styling("striped", full_width = F,position = "center")
Variable | correlation |
---|---|
lagHour | 0.92 |
lag2Hours | 0.82 |
lag3Hours | 0.70 |
lag4Hours | 0.56 |
lag12Hours | -0.31 |
lag1day | 0.73 |
In this section, we explore the data to gain insight into the temporal and spatial patterns of scooter activity, as well as the relationship between trip counts and the independent variables.
First, we do some general analysis to get a rough picture of scooter trips in Louisville. The figure below shows that scooters are generally used for short trips, usually under one mile.
scooterData38_start %>%
mutate(distCat = case_when(TripDistance<1~'< 1 mile',
TripDistance<2 & TripDistance>=1~'1 ~ 2 miles',
TripDistance<3 & TripDistance>=2~'2 ~ 3 miles',
TripDistance<4 & TripDistance>=3~'3 ~ 4 miles',
TripDistance<5 & TripDistance>=4~'4 ~ 5 miles',
TripDistance>=5~'> 5 miles'),
scooterTripCount = 1) %>%
group_by(distCat, scooterTripCount) %>%
summarise(count=sum(scooterTripCount)) %>%
ggplot(aes(distCat, count)) +
geom_histogram(position = "stack", stat = "summary", fun.y = "mean", na.rm = TRUE, fill="#238A8DFF") +
xlab("Trip Distance")+
scale_x_discrete(limits=c("< 1 mile", "2 ~ 3 miles", "3 ~ 4 miles", "4 ~ 5 miles", "> 5 mile")) +
labs(title = "Scooter trips count by distance", subtitle = "Lousiville, March to August, 2019") +
plotTheme
The histograms below show that weather affects activity: there are more trips when wind speed is low, the temperature is comfortable and it is not raining.
grid.arrange(
ride.panel.start %>%
na.omit() %>%
group_by(Wind_Speed, Trip_Count) %>%
summarise(count=sum(Trip_Count)) %>%
mutate(Windms = Wind_Speed*0.5144,
Wind = case_when(Wind_Speed<=5.5~"Light",
Wind_Speed>5.5 & Wind_Speed<=11~"Moderate",
Wind_Speed>11~"Gale")) %>%
ggplot(aes(Wind, count)) +
geom_histogram(position = "stack", stat = "summary", fun.y = "mean", na.rm = TRUE, fill="#336387ff") +
scale_x_discrete(limits=c("Light", "Moderate", "Gale")) +
labs(title = "Wind Speed") +
plotTheme,
ride.panel.start %>%
na.omit() %>%
mutate(Temp = case_when(Temperature<60.0~"Cold",Temperature>=60.0&Temperature<70.0~"Cool",
Temperature>=70.0&Temperature<80.0~"Mild",Temperature>=80.0~"Hot")) %>%
group_by(Temp, Trip_Count) %>%
summarise(count=sum(Trip_Count)) %>%
ggplot(aes(Temp, count)) +
geom_histogram(position = "stack", stat = "summary", fun.y = "mean", na.rm = TRUE, fill="#29af7fff")+
scale_x_discrete(limits=c("Cold", "Cool", "Mild", "Hot")) +
labs(title = "Temperature") +
plotTheme,
ride.panel.start %>%
na.omit() %>%
group_by(Precipitation, Trip_Count) %>%
summarise(count=sum(Trip_Count)) %>%
mutate(Preci = case_when(Precipitation<0.001~"No rain",Precipitation>=0.001~"Wet")) %>%
ggplot(aes(Preci, count)) +
geom_histogram(position = "stack", stat = "summary", fun.y = "mean", na.rm = TRUE, fill="#dce319ff") +
labs(title = 'Precipitation') +
plotTheme,
ncol=3)
Next, let's focus on the hour-by-hour pattern of ridership in July 2019, when the number of trips is highest. There is a clear periodicity: trips peak during the daytime, and more trips were made on weekends, especially Saturdays, than on weekdays. Moreover, trip counts rose on July 4th, a Thursday holiday, compared with the average Thursday.
ggplot(scooterData38_start %>%
mutate(month=month(interval60)) %>%
subset(month == 7) %>%
group_by(interval60) %>%
tally())+
geom_line(aes(x = interval60, y = n),size=.8, col='#42B5AD')+
labs(title="Scooter trips per hour. Louisville, July, 2019",
x="Date",
y="Number of trips")+
plotTheme
For more detail, we plot scooter trips by day of the week. Usage is higher on Friday and Saturday. In terms of hours, people usually ride scooters during the daytime, especially after noon.
ggplot(data=scooterData38_start %>%
mutate(month=month(interval60),
dotw=weekdays(interval60))) +
#subset(month == 7)) +
geom_freqpoly(aes(HourNum, col=dotw), binwidth=1, size=1) +
scale_colour_viridis(discrete = T, name="Day of the week") +
labs(title="Scooter trips in Louisville, by day of the week, March to August, 2019",
x="Hour",
y="Trip Counts")+
plotTheme
It is natural to assume that most scooter usage happens in the CBD. To validate this thought, we plot the distribution of scooter pickups below. In short, scooter activity varies across space and time. High demand occurs downtown, in the old city and in the districts around the University of Louisville, suggesting that city residents and students are likely users.
ggplot()+
geom_point(data = scooterData38_start %>%
group_by(StartLongitude, StartLatitude) %>%
tally(),
aes(x=StartLongitude, y = StartLatitude, color = n),
fill = "transparent", alpha = 0.5, size = 1)+
geom_sf(data=censusTract,fill='transparent') +
scale_colour_gradient(low='#bdebda',high='#0076be')+
ylim(min(scooterData38_start$StartLatitude), max(scooterData38_start$StartLatitude))+
xlim(min(scooterData38_start$StartLongitude), max(scooterData38_start$StartLongitude))+
labs(title="Scooter trips by time of the day. \nLousiville, March to August, 2019")+
mapTheme()
Then we plot the distribution of trips across four periods of the day: AM Rush from 7 am to 10 am, Mid-Day from 10 am to 3 pm, PM Rush from 3 pm to 6 pm, and Overnight for the rest. Since the maps show no significant difference in pattern among the periods, people in Louisville do not appear to use scooters widely for commuting.
scooterData38_start <- mutate(scooterData38_start, time_of_day = case_when(HourNum < 7 | HourNum > 18 ~ "Overnight",
HourNum >= 7 & HourNum < 10 ~ "AM Rush",
HourNum >= 10 & HourNum < 15 ~ "Mid-Day",
HourNum >= 15 & HourNum <= 18 ~ "PM Rush"))
scooterData38_end <- mutate(scooterData38_end, time_of_day = case_when(HourNum < 7 | HourNum > 18 ~ "Overnight",
HourNum >= 7 & HourNum < 10 ~ "AM Rush",
HourNum >= 10 & HourNum < 15 ~ "Mid-Day",
HourNum >= 15 & HourNum <= 18 ~ "PM Rush"))
scooterData38_start <- mutate(scooterData38_start,weekend = ifelse(scooterData38_start$DayOfWeek==6 | scooterData38_start$DayOfWeek==7, 'Weekend', 'Weekday'))
scooterData38_end <- mutate(scooterData38_end,weekend = ifelse(scooterData38_end$DayOfWeek==6 | scooterData38_end$DayOfWeek==7,'Weekend', 'Weekday'))
ggplot()+
geom_point(data = scooterData38_start %>%
group_by(StartLongitude, StartLatitude, weekend, time_of_day) %>%
tally(),
aes(x=StartLongitude, y = StartLatitude, color = n),
fill = "transparent", alpha = 0.5, size = 1)+
geom_sf(data=censusTract,fill='transparent') +
scale_colour_gradient(low='#bdebda',high='#0076be') +
ylim(min(scooterData38_start$StartLatitude), max(scooterData38_start$StartLatitude))+
xlim(min(scooterData38_start$StartLongitude), max(scooterData38_start$StartLongitude))+
facet_grid(weekend~time_of_day)+
labs(title="Scooter pickup by time of the day. Lousiville, March to August, 2019")+
mapTheme()
The choropleth map tells the same story as the point map.
ggplot(data = scooterData_in_ct_s %>%
mutate(weekend = ifelse(scooterData_in_ct_s$DayOfWeek==6|scooterData_in_ct_s$DayOfWeek==7,
'Weekend','Weekday'),
time_of_day = case_when(HourNum < 7 | HourNum > 18 ~ "Overnight",
HourNum >= 7 & HourNum < 10 ~ "AM Rush",
HourNum >= 10 & HourNum < 15 ~ "Mid-Day",
HourNum >= 15 & HourNum <= 18 ~ "PM Rush")) %>%
group_by(GEOID, weekend, time_of_day) %>%
summarise(Trip_count = n()))+
geom_sf(aes(fill = Trip_count))+
scale_fill_gradient(low='#bdebda',high='#0076be') +
facet_grid(weekend~time_of_day)+
labs(title="Scooter trips on census-tract level by time of the day", subtitle = "Louisville, March to August, 2019")+
mapTheme()
The animation below presents scooter activity by census tract over the course of a Saturday in July. We can see a cluster of trips in the center-city census tracts.
ride.panel.start.gif <-
scooterData_in_ct_s %>%
mutate(Trip_Counter = 1) %>%
right_join(study.panel.start) %>%
st_set_geometry(NULL) %>%
group_by(interval15, GEOID) %>%
summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
ungroup() %>%
mutate(week = week(interval15),
dotw = weekdays(interval15)) %>%
left_join(censusTract,by='GEOID')
week27 <-
ride.panel.start.gif %>%
subset((week == 27) & (dotw == "Saturday"))
ride.animation.data <-
week27 %>%
st_sf() %>%
na.omit() %>%
mutate(Trips = case_when(Trip_Count > 0 & Trip_Count <= 5 ~ "1-5 trips",
Trip_Count > 5 & Trip_Count <= 10 ~ "6-10 trips",
Trip_Count > 10 & Trip_Count <= 20 ~ "11-20 trips",
Trip_Count > 20 ~ "21+ trips")) %>%
mutate(Trips = factor(Trips, levels=c("1-5 trips","6-10 trips","11-20 trips","21+ trips")))
palette4 <- c("#cceff8","#9fe2bf","#20b3ab","#016690")
rideshare_animation <-
ggplot() +
geom_sf(data = censusTract, color = "grey30", fill = "transparent")+
geom_sf(data=ride.animation.data, aes(fill=Trips)) +
scale_fill_manual(values = palette4) +
labs(title = "Scooter pickups for a Saturday in November 2019, NYC",
subtitle = "15 minute intervals: {current_frame}") +
transition_manual(interval15) +
mapTheme()
animate(rideshare_animation, duration=20)
In this part, we build two space-time models, one predicting scooter demand based on pickups (trip starts) and one predicting supply based on returns (trip ends). The difference between the two predictions reflects whether supply and demand are balanced in each census tract.
To that end, we first split the trip data: the training set contains trips from March, May and July, and the test set contains trips from April, June and August.
ride.train.s <- subset(ride.panel.start, month(interval60) == 3|month(interval60) == 5|month(interval60) == 7)
ride.test.s <- subset(ride.panel.start, month(interval60) == 4|month(interval60) == 6|month(interval60) == 8)
ride.train.e <- subset(ride.panel.end, month(interval60) == 3|month(interval60) == 5|month(interval60) == 7)
ride.test.e <- subset(ride.panel.end, month(interval60) == 4|month(interval60) == 6|month(interval60) == 8)
The exploratory analysis above shows that scooter activity has clear spatial and temporal patterns, concentrated downtown and on weekends. We therefore include the following variables in the models:
Time effects: month, day of week, time of day, events, the time-lag family (1-hr, 2-hr, 3-hr, 4-hr, 12-hr, 1-day), holiday and holiday lag. When building the first model we noticed usage peaks on some weekdays; on investigation these turned out to be caused by events and festivals held in the city, so we encode those dates in an events dummy.
Space effect: census tract GEOID.
Other: temperature, wind speed and precipitation.
reg.s <-
lm(Trip_Count ~ GEOID + monthCat + time_of_day + events + Temperature + Wind_Speed + Preci + lagHour + lag2Hours +lag3Hours + lag4Hours + lag1day + holiday +holidayLag +weekday,
data=ride.train.s)
reg.e <-
lm(Trip_Count ~ GEOID + monthCat +time_of_day + events + Temperature + Wind_Speed + Preci + lagHour + lag2Hours +lag3Hours + lag4Hours + lag1day + holiday + holidayLag + weekday,
data=ride.train.e)
ride.test.s$pred <- round(predict(reg.s, newdata = ride.test.s))
ride.test.s$error <- ride.test.s$pred - ride.test.s$Trip_Count
ride.test.e$pred <- round(predict(reg.e, newdata = ride.test.e))
ride.test.e$error <- ride.test.e$pred - ride.test.e$Trip_Count
In the final part of the model, we focus on detecting the real demand (pickup) versus supply (return) situation, calculated as the difference between the predicted pickup and return counts. From the result we find that, on average, a census tract needs about 5 rebalancings, and the model's error is 0.96, meaning we over- or under-predict about 1 rebalancing per census tract on average. The final map shows that:
For most census tracts in Louisville, scooter demand and supply are well balanced;
Downtown, scooters are considerably undersupplied, which makes it a focal area for scooter rebalancing;
Excess demand also appears in the Wyandotte district;
The northern part of Louisville may suffer from wasted oversupply.
return_by_CT <- ride.test.e %>%
mutate(day=yday(interval60)) %>%
group_by(GEOID, day) %>%
summarise(return = sum(Trip_Count)) %>%
right_join(censusTract)
pickup_by_CT <- ride.test.s %>%
mutate(day=yday(interval60)) %>%
group_by(GEOID, day) %>%
summarise(pickup = sum(Trip_Count)) %>%
right_join(censusTract)
return_by_CT$subtraction <- pickup_by_CT$pickup - return_by_CT$return
return_by_CT_pred <- ride.test.e %>%
mutate(day=yday(interval60)) %>%
group_by(GEOID,day) %>%
summarise(return = sum(pred)) %>%
right_join(censusTract)
pickup_by_CT_pred <- ride.test.s %>%
mutate(day=yday(interval60)) %>%
group_by(GEOID, day) %>%
summarise(pickup = sum(pred)) %>%
right_join(censusTract)
return_by_CT$subtraction_pred <- pickup_by_CT_pred$pickup - return_by_CT_pred$return
return_by_CT$error_sub <- return_by_CT$subtraction - return_by_CT$subtraction_pred
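The figures cited above can be derived from these columns (a sketch): the mean absolute net flow per tract-day approximates the daily rebalancing need, and the mean absolute difference between observed and predicted net flow is the model's rebalancing error.
mean(abs(return_by_CT$subtraction), na.rm=T) # average daily rebalancing need per tract (~5)
mean(abs(return_by_CT$error_sub), na.rm=T) # average rebalancing prediction error (~0.96)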
ggplot() +
geom_sf(data=return_by_CT %>% st_sf(),aes(fill = subtraction_pred), legend=T) +
scale_fill_gradient(low='#bdebda',high='#0076be', name="Demand - Supply")+
labs(title = "Excess supply and excess demand", subtitle = "Blue for excess demand, light cyan for excess supply.") +
geom_sf(data=censusTract, color='black', fill='transparent', size=.8) +
mapTheme()
Mean Absolute Error (MAE) is used to evaluate the accuracy of the models. For pickup and return activity, MAE is 0.378 and 0.468 respectively, so the pickup model performs better than the return model.
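These overall MAEs come directly from the test-set errors computed above; a minimal sketch:
mean(abs(ride.test.s$error), na.rm=T) # pickup MAE, ~0.378
mean(abs(ride.test.e$error), na.rm=T) # return MAE, ~0.468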
MAEs for the month categories are listed below. Our models perform better when predicting low- to medium-demand months such as April and June, and generate larger errors for August: the root problem is that the model is weak at predicting high demand.
ride.test.s %>%
group_by(monthCat) %>%
summarise(MAE = mean(abs(error))) %>%
kable(caption = "MAE (pickup activity)") %>%
kable_styling("striped", full_width = F,position = "center")
monthCat | MAE |
---|---|
High-demand month | 0.5455064 |
Low-demand month | 0.3641645 |
Medium-demand month | 0.4790297 |
ride.test.e %>%
group_by(monthCat) %>%
summarise(MAE = mean(abs(error))) %>%
kable(caption = "MAE (return activity)") %>%
kable_styling("striped", full_width = F,position = "center")
monthCat | MAE |
---|---|
High-demand month | 0.5510436 |
Low-demand month | 0.3614943 |
Medium-demand month | 0.4888506 |
As mentioned above, the months fall into three categories: low demand in March and April, medium demand in May and June, and high demand in July and August. Below are the predicted results for the three scenarios.
grid.arrange(
#low----
ggplot(data=ride.test.s %>%
subset(monthCat=='Low-demand month') %>%
dplyr::select(interval60, GEOID, Trip_Count, pred) %>%
gather(Variable, Value, -interval60, -GEOID) %>%
group_by(Variable, interval60) %>%
summarize(Value = sum(Value)), aes(interval60, Value, colour=Variable)) +
geom_line(size=1.1) +
labs(title = "Pickup in Louisville; Low-demand month", x = "Hour", y= "Station Trips") +
plotTheme,
ggplot(ride.test.e %>%
subset(monthCat=='Low-demand month') %>%
dplyr::select(interval60, GEOID, Trip_Count, pred) %>%
gather(Variable, Value, -interval60, -GEOID) %>%
group_by(Variable, interval60) %>%
summarize(Value = sum(Value)), aes(interval60, Value, colour=Variable)) +
geom_line(size=1.1) +
labs(title = "Return in Louisville; Low-demand month", x = "Hour", y= "Station Trips") +
plotTheme,
#mild----
ggplot(ride.test.s %>%
subset(monthCat=='Medium-demand month') %>%
dplyr::select(interval60, GEOID, Trip_Count, pred) %>%
gather(Variable, Value, -interval60, -GEOID) %>%
group_by(Variable, interval60) %>%
summarize(Value = sum(Value)), aes(interval60, Value, colour=Variable)) +
geom_line(size=1.1) +
labs(title = "Pickup in Louisville; Medium-demand month", x = "Hour", y= "Station Trips") +
plotTheme,
ggplot(ride.test.e %>%
subset(monthCat=='Medium-demand month') %>%
dplyr::select(interval60, GEOID, Trip_Count, pred) %>%
gather(Variable, Value, -interval60, -GEOID) %>%
group_by(Variable, interval60) %>%
summarize(Value = sum(Value)), aes(interval60, Value, colour=Variable)) +
geom_line(size=1.1) +
labs(title = "Return in Louisville; Medium-demand month", x = "Hour", y= "Station Trips") +
plotTheme,
#high----
ggplot(ride.test.s %>%
subset(monthCat=='High-demand month') %>%
dplyr::select(interval60, GEOID, Trip_Count, pred) %>%
gather(Variable, Value, -interval60, -GEOID) %>%
group_by(Variable, interval60) %>%
summarize(Value = sum(Value)), aes(interval60, Value, colour=Variable)) +
geom_line(size=1.1) +
labs(title = "Pickup in Louisville; High-demand month", x = "Hour", y= "Station Trips") +
plotTheme,
ggplot(ride.test.e %>%
subset(monthCat=='High-demand month') %>%
dplyr::select(interval60, GEOID, Trip_Count, pred) %>%
gather(Variable, Value, -interval60, -GEOID) %>%
group_by(Variable, interval60) %>%
summarize(Value = sum(Value)), aes(interval60, Value, colour=Variable)) +
geom_line(size=1.1) +
labs(title = "Return in Louisville; High-demand month", x = "Hour", y= "Station Trips") +
plotTheme,
top='Predicted/Observed scooter time series', ncol=1
)
Then we map the distribution of MAE by census tract. For both the demand and the supply model, the larger errors come from the downtown and university areas, where scooter demand is most intensive.
meanError.s <- ride.test.s %>%
group_by(GEOID) %>%
summarise(MAE = mean(abs(error))) %>%
left_join(censusTract %>% st_transform(4326), by='GEOID') %>%
st_sf()
meanError.e <- ride.test.e %>%
group_by(GEOID) %>%
summarise(MAE = mean(abs(error))) %>%
left_join(censusTract %>% st_transform(4326), by='GEOID') %>%
st_sf()
grid.arrange(
ggplot() +
geom_sf(data=studyArea,fill='transparent')+
geom_sf(data=meanError.s,aes(fill = MAE)) +
scale_fill_gradient(low='#bdebda',high='#0076be')+
labs(title='MAE by census tract, pickup')+
mapTheme(),
ggplot() +
geom_sf(data=studyArea,fill='transparent')+
geom_sf(data=meanError.e,aes(fill = MAE)) +
scale_fill_gradient(low='#bdebda',high='#0076be')+
labs(title='MAE by census tract, return')+
mapTheme(), ncol=2
)
A further generalizability test is conducted. Race data are used here to characterize differences in Louisville's urban context: tracts where more than 50% of the population is white are labeled 'Majority White'.
Although there are some missing values in the census data, the results still give us insight into whether the model generalizes to different urban contexts.
Although averaging across positive and negative errors is not the best indicator of accuracy (MAE is better), we use the raw error here to see whether we over- or under-predict.
LVCensus_sa <- mutate(LVCensus %>% st_transform(4326),
raceContext = ifelse(Percent_White > 0.5, "Majority_White", "Majority_Non_White")) %>%
.[censusTract,]
ggplot() +
geom_sf(data = LVCensus_sa, aes(fill = raceContext)) +
geom_sf(data=studyArea, fill='transparent', color='white', size=1)+
scale_fill_manual(values = c('#7fcdbb','#2c7fb8'),
name="Race Context") +
labs(title = "Race Context", name="Race Context") +
mapTheme()
Given the small difference in error between 'Majority White' and 'Majority Non-White' census tracts, we can conclude that our model generalizes well with respect to racial context. The model slightly underpredicts in both contexts, a little more so in majority-White tracts.
ct_race <- left_join(LVCensus_sa %>% st_set_geometry(NULL), ride.test.s, by='GEOID')
ct_race %>%
na.omit() %>%
group_by(raceContext) %>%
summarize(Error = mean(error)) %>%
kable(caption = "Error (by race context)") %>%
kable_styling("striped", full_width = F,position = "center")
raceContext | Error |
---|---|
Majority_Non_White | -0.0079663 |
Majority_White | -0.0175292 |
Recall that we built two models to predict scooter demand and supply for each census tract. However, such a large spatial unit may weaken the model's effectiveness and make it less useful in practice. To get more detailed supply-demand information within a census tract, we focus on the downtown census tract as an example, where MAE is highest and scooters are considerably undersupplied overall.
tract <- subset(censusTract,GEOID == '21111004900')
scooterData900_start <- st_intersection(scooterData38_start %>%
st_as_sf(.,coords = c("StartLongitude", "StartLatitude"), crs = 4326),
tract)
scooterData900_end <- st_intersection(scooterData38_end %>%
st_as_sf(.,coords = c("EndLongitude", "EndLatitude"), crs = 4326),
tract)
Based on the census tract boundary, we create a fishnet with a cell size of 0.003 × 0.003 degrees. A uniqueID is generated for each cell for spatial joins, and trip points are projected onto the fishnet. Meanwhile, cvIDs are created for later cross-validation.
fishnet <-
st_make_grid(tract, cellsize = 0.003) %>%
st_sf()
fishnet <-
fishnet[tract,] %>%
mutate(uniqueID = as.character(rownames(.)),
cvID = sample(round(nrow(fishnet) / 10), size=nrow(.), replace = TRUE)) %>%
dplyr::select(uniqueID,cvID)
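A quick plot to verify that the grid covers the tract (a sketch using the map theme defined earlier):
ggplot() +
geom_sf(data=fishnet, fill='transparent', colour='grey50') +
geom_sf(data=tract, fill='transparent', colour='black', size=1) +
mapTheme()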
As before, we project the point data onto the fishnet.
Data_in_fishnet_s <- st_join(scooterData900_start, fishnet, st_intersects)
Data_in_fishnet_s <- Data_in_fishnet_s %>%
st_set_geometry(NULL) %>%
dplyr::select(-cvID) %>%
left_join(.,fishnet, by='uniqueID') %>%
st_sf()
Data_in_fishnet_e <- st_join(scooterData900_end, fishnet, st_intersects)
Data_in_fishnet_e <- Data_in_fishnet_e %>%
st_set_geometry(NULL) %>%
dplyr::select(-cvID) %>%
left_join(.,fishnet, by='uniqueID') %>%
st_sf()
Then, two space-time panels are created based on start and end times for each grid cell in every hour. Since the feature engineering and time-lag steps closely mirror the previous model, their description is omitted.
study.panel.start.FN <-
expand.grid(interval60=unique(Data_in_fishnet_s$interval60),
uniqueID = unique(Data_in_fishnet_s$uniqueID))
study.panel.end.FN <-
expand.grid(interval60=unique(Data_in_fishnet_e$interval60),
uniqueID = unique(Data_in_fishnet_e$uniqueID))
ride.panel.start.FN <-
Data_in_fishnet_s %>%
mutate(Trip_Counter = 1) %>%
right_join(study.panel.start.FN) %>%
st_set_geometry(NULL) %>%
group_by(interval60, uniqueID) %>%
summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
left_join(fishnet,by='uniqueID') %>%
left_join(weather.Panel, by='interval60') %>%
ungroup() %>%
mutate(week = week(interval60),
dotw = wday(interval60, label = TRUE),
Preci = case_when(Precipitation<0.001~"No rain",
Precipitation >= 0.001~"Wet"),
Wind = case_when(Wind_Speed<=3~"Light",
Wind_Speed> 3 & Wind_Speed <= 6~"Moderate",
Wind_Speed > 6 & Wind_Speed < 9~"Gale",
Wind_Speed >= 9~"Stormy"),
time_uniqueID=paste(as.character(interval60),uniqueID))
ride.panel.end.FN <-
Data_in_fishnet_e %>%
mutate(Trip_Counter = 1) %>%
right_join(study.panel.end.FN) %>%
st_set_geometry(NULL) %>%
group_by(interval60, uniqueID) %>%
summarize(Trip_Count = sum(Trip_Counter, na.rm=T)) %>%
left_join(fishnet,by='uniqueID') %>%
left_join(weather.Panel, by='interval60') %>%
ungroup() %>%
mutate(week = week(interval60),
dotw = wday(interval60, label = TRUE),
Preci = case_when(Precipitation < 0.001~"No rain",
Precipitation >= 0.001~"Wet"),
Wind = case_when(Wind_Speed <= 3~"Light",
Wind_Speed > 3 & Wind_Speed <= 6~"Moderate",
Wind_Speed > 6 & Wind_Speed < 9~"Gale",
Wind_Speed >= 9~"Stormy"),
time_uniqueID=paste(as.character(interval60),uniqueID))
ride.panel.start.FN <-
ride.panel.start.FN %>%
arrange(uniqueID, interval60) %>%
mutate(lagHour = dplyr::lag(Trip_Count,1),
lag2Hours = dplyr::lag(Trip_Count,2),
lag3Hours = dplyr::lag(Trip_Count,3),
lag4Hours = dplyr::lag(Trip_Count,4),
lag12Hours = dplyr::lag(Trip_Count,12),
lag1day = dplyr::lag(Trip_Count,24),
holiday = ifelse(yday(interval60) == 147|yday(interval60) == 185,1,0), # Memorial Day (day 147) and Independence Day (day 185)
weekday = ifelse(dotw != 'Sun'&dotw!='Sat',1,0)) %>%
mutate(day = yday(interval60)) %>%
mutate(holidayLag = case_when(dplyr::lag(holiday, 1) == 1 ~ "PlusOneDay",
dplyr::lag(holiday, 2) == 1 ~ "PlusTwoDays",
dplyr::lag(holiday, 3) == 1 ~ "PlusThreeDays",
dplyr::lead(holiday, 1) == 1 ~ "MinusOneDay",
dplyr::lead(holiday, 2) == 1 ~ "MinusTwoDays",
dplyr::lead(holiday, 3) == 1 ~ "MinusThreeDays"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"),
monthCat = case_when(month(interval60)<5 | month(interval60)>9 ~ 'Low-demand month',
month(interval60)==6|month(interval60)==5 ~ 'Medium-demand month',
month(interval60)==7|month(interval60)==8 ~ 'High-demand month'),
events = ifelse(yday(interval60) %in% c(76,97,103,117,124,165,185,227,236),1,0), # dates of city events and festivals
holidayLag = replace_na(holidayLag, '0'))
ride.panel.end.FN <-
ride.panel.end.FN %>%
arrange(uniqueID, interval60) %>%
mutate(lagHour = dplyr::lag(Trip_Count,1),
lag2Hours = dplyr::lag(Trip_Count,2),
lag3Hours = dplyr::lag(Trip_Count,3),
lag4Hours = dplyr::lag(Trip_Count,4),
lag12Hours = dplyr::lag(Trip_Count,12),
lag1day = dplyr::lag(Trip_Count,24),
holiday = ifelse(yday(interval60) == 147|yday(interval60) == 185,1,0), # Memorial Day (day 147) and Independence Day (day 185)
weekday = ifelse(dotw != 'Sun'&dotw!='Sat',1,0)) %>%
mutate(day = yday(interval60)) %>%
mutate(holidayLag = case_when(dplyr::lag(holiday, 1) == 1 ~ "PlusOneDay",
dplyr::lag(holiday, 2) == 1 ~ "PlusTwoDays",
dplyr::lag(holiday, 3) == 1 ~ "PlusThreeDays",
dplyr::lead(holiday, 1) == 1 ~ "MinusOneDay",
dplyr::lead(holiday, 2) == 1 ~ "MinusTwoDays",
dplyr::lead(holiday, 3) == 1 ~ "MinusThreeDays"),
time_of_day = case_when(hour(interval60) < 7 | hour(interval60) > 18 ~ "Overnight",
hour(interval60) >= 7 & hour(interval60) < 10 ~ "AM Rush",
hour(interval60) >= 10 & hour(interval60) < 15 ~ "Mid-Day",
hour(interval60) >= 15 & hour(interval60) <= 18 ~ "PM Rush"),
monthCat = case_when(month(interval60)<5 | month(interval60)>9 ~ 'Low-demand month',
month(interval60)==6|month(interval60)==5 ~ 'Medium-demand month',
month(interval60)==7|month(interval60)==8 ~ 'High-demand month'),
events = ifelse(yday(interval60) %in% c(76,97,103,117,124,165,185,227,236),1,0), # dates of city events and festivals
holidayLag = replace_na(holidayLag, '0'))
The training/testing split and the predictors are the same as in the previous model and are omitted here. The only difference is that the fishnet uniqueID now accounts for the spatial effects.
ride.train.s.FN <- subset(ride.panel.start.FN, month(interval60) ==3 | month(interval60) == 5|month(interval60) == 7)
ride.test.s.FN <- subset(ride.panel.start.FN, month(interval60) ==4 | month(interval60) == 6|month(interval60) == 8)
ride.train.e.FN <- subset(ride.panel.end.FN, month(interval60) ==3 | month(interval60) == 5|month(interval60) == 7)
ride.test.e.FN <- subset(ride.panel.end.FN, month(interval60) ==4 | month(interval60) == 6|month(interval60) == 8)
reg.s.FN <-
lm(Trip_Count ~ uniqueID + monthCat + time_of_day + events + Temperature + Wind_Speed + Preci + lagHour + lag2Hours + lag3Hours + lag4Hours + lag1day + holiday + holidayLag + weekday,
data=ride.train.s.FN)
reg.e.FN <-
lm(Trip_Count ~ uniqueID + monthCat + time_of_day + events + Temperature + Wind_Speed + Preci + lagHour + lag2Hours + lag3Hours + lag4Hours + lag1day + holiday + holidayLag + weekday,
data=ride.train.e.FN)
ride.test.s.FN$pred <- round(predict(reg.s.FN, newdata = ride.test.s.FN))
ride.test.s.FN$error <- ride.test.s.FN$pred - ride.test.s.FN$Trip_Count
ride.test.e.FN$pred <- round(predict(reg.e.FN, newdata = ride.test.e.FN))
ride.test.e.FN$error <- ride.test.e.FN$pred - ride.test.e.FN$Trip_Count
As before, we focus on detecting the real demand (pickup) versus supply (return) situation. From the result we find that, on average, the tract needs about 5 rebalancings per day, and the model's error is 2, meaning we over- or under-predict about 2 rebalancings per grid cell on average. Based on the map below, scooter demand and supply in the city center are poorly balanced.
return_by_FN <- ride.test.e.FN %>%
mutate(day=yday(interval60)) %>%
group_by(uniqueID, day) %>%
summarise(return = sum(Trip_Count)) %>%
right_join(fishnet)
pickup_by_FN <- ride.test.s.FN %>%
mutate(day=yday(interval60)) %>%
group_by(uniqueID, day) %>%
summarise(pickup = sum(Trip_Count)) %>%
right_join(fishnet)
return_by_FN$subtraction <- pickup_by_FN$pickup - return_by_FN$return
return_by_FN_pred <- ride.test.e.FN %>%
mutate(day=yday(interval60)) %>%
group_by(uniqueID,day) %>%
summarise(return = sum(pred)) %>%
right_join(fishnet)
pickup_by_FN_pred <- ride.test.s.FN %>%
mutate(day=yday(interval60)) %>%
group_by(uniqueID, day) %>%
summarise(pickup = sum(pred)) %>%
right_join(fishnet)
return_by_FN$subtraction_pred <- pickup_by_FN_pred$pickup - return_by_FN_pred$return
return_by_FN$error_sub <- return_by_FN$subtraction - return_by_FN$subtraction_pred
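As at the tract level, the cell-level figures cited above can be checked from these columns (a sketch):
mean(abs(return_by_FN$subtraction), na.rm=T) # average daily net flow per cell
mean(abs(return_by_FN$error_sub), na.rm=T) # average rebalancing prediction error (~2)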
ggplot() +
geom_sf(data=na.omit(return_by_FN) %>% st_sf(),aes(fill = subtraction_pred), legend=T) +
scale_fill_gradient(low='#bdebda',high='#0076be', name="Demand - Supply")+
labs(title = "Excess supply and excess demand", subtitle = "Blue for excess demand, light cyan for excess supply.") +
geom_sf(data=tract, color='black', fill='transparent', size=1) +
mapTheme()
For pickup and return activity, MAE is 0.700 and 0.649 respectively, so here the return model performs slightly better than the pickup model.
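A minimal sketch of the overall MAE calculation for the fishnet models:
mean(abs(ride.test.s.FN$error), na.rm=T) # pickup MAE, ~0.700
mean(abs(ride.test.e.FN$error), na.rm=T) # return MAE, ~0.649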
Both models again perform better when predicting pickup and return activity in low- to medium-demand months; comparing MAEs across the categories, our model is most reliable when predicting scooter activity in low-usage months.
ride.test.s.FN %>%
group_by(monthCat) %>%
summarise(mae = mean(abs(error))) %>%
kable(caption = "MAE (pickup activity)") %>%
kable_styling("striped", full_width = F,position = "center")
monthCat | mae |
---|---|
High-demand month | 0.7446504 |
Low-demand month | 0.4195220 |
Medium-demand month | 0.6484296 |
ride.test.e.FN %>%
group_by(monthCat) %>%
summarise(mae = mean(abs(error))) %>%
kable(caption = "MAE (return activity)") %>%
kable_styling("striped", full_width = F,position = "center")
monthCat | mae |
---|---|
High-demand month | 0.6930751 |
Low-demand month | 0.3832812 |
Medium-demand month | 0.5984686 |
For more detail, we plot the predicted versus observed scooter time series and find that the model fails to capture the high demand during the daytime.
grid.arrange(
ride.test.s.FN %>%
subset(monthCat=='Low-demand month') %>%
dplyr::select(interval60, uniqueID, Trip_Count, pred, monthCat) %>%
gather(Variable, Value, -interval60, -uniqueID, -monthCat) %>%
group_by(Variable, interval60, monthCat) %>%
summarize(Value = sum(Value)) %>%
ggplot(aes(interval60, Value, colour=Variable)) +
geom_line(size=1.1) +
labs(title = "Pickup in Louisville; Low-demand month", x = "Hour", y= "Station Trips") +
plotTheme,
ride.test.e.FN %>%
subset(monthCat=='Low-demand month') %>%
dplyr::select(interval60, uniqueID, Trip_Count, pred) %>%
gather(Variable, Value, -interval60, -uniqueID) %>%
group_by(Variable, interval60) %>%
summarize(Value = sum(Value)) %>%
ggplot(aes(interval60, Value, colour=Variable)) +
geom_line(size=1.1) +
labs(title = "Return in Louisville; Low-demand month", x = "Hour", y= "Station Trips") +
plotTheme,
ride.test.s.FN %>%
subset(monthCat=='Medium-demand month') %>%
dplyr::select(interval60, uniqueID, Trip_Count, pred, monthCat) %>%
gather(Variable, Value, -interval60, -uniqueID, -monthCat) %>%
group_by(Variable, interval60, monthCat) %>%
summarize(Value = sum(Value)) %>%
ggplot(aes(interval60, Value, colour=Variable)) +
geom_line(size=1.1) +
labs(title = "Pickup in Louisville; Medium-demand month", x = "Hour", y= "Station Trips") +
plotTheme,
ride.test.e.FN %>%
subset(monthCat=='Medium-demand month') %>%
dplyr::select(interval60, uniqueID, Trip_Count, pred) %>%
gather(Variable, Value, -interval60, -uniqueID) %>%
group_by(Variable, interval60) %>%
summarize(Value = sum(Value)) %>%
ggplot(aes(interval60, Value, colour=Variable)) +
geom_line(size=1.1) +
labs(title = "Return in Louisville; Medium-demand month", x = "Hour", y= "Station Trips") +
plotTheme,
ride.test.s.FN %>%
subset(monthCat=='High-demand month') %>%
dplyr::select(interval60, uniqueID, Trip_Count, pred) %>%
gather(Variable, Value, -interval60, -uniqueID) %>%
group_by(Variable, interval60) %>%
summarize(Value = sum(Value)) %>%
ggplot(aes(interval60, Value, colour=Variable)) +
geom_line(size=1.1) +
labs(title = "Pickup in Louisville; High-demand month", x = "Hour", y= "Station Trips") +
plotTheme,
ride.test.e.FN %>%
subset(monthCat=='High-demand month') %>%
dplyr::select(interval60, uniqueID, Trip_Count, pred) %>%
gather(Variable, Value, -interval60, -uniqueID) %>%
group_by(Variable, interval60) %>%
summarize(Value = sum(Value)) %>%
ggplot(aes(interval60, Value, colour=Variable)) +
geom_line(size=1.1) +
labs(title = "Return in Louisville; High-demand month", x = "Hour", y= "Station Trips") +
plotTheme, top='Predicted/Observed scooter time series', ncol=1)
Both models perform better in low-demand areas and fail to account for the high demand in the center of the census tract.
meanError.s.FN <- ride.test.s.FN %>%
group_by(uniqueID) %>%
summarise(MAE = mean(abs(error))) %>%
left_join(fishnet, by='uniqueID') %>%
st_sf()
meanError.e.FN <- ride.test.e.FN %>%
group_by(uniqueID) %>%
summarise(MAE = mean(abs(error))) %>%
left_join(fishnet, by='uniqueID') %>%
st_sf()
grid.arrange(
ggplot() +
geom_sf(data=meanError.s.FN %>%st_sf(),aes(fill = MAE)) +
geom_sf(data=tract,fill='transparent',colour='black', size=1) +
scale_fill_gradient(low='#bdebda',high='#0076be')+
labs(title='MAE by fishnet, pickup')+
mapTheme(),
ggplot() +
geom_sf(data=meanError.e.FN,aes(fill = MAE)) +
geom_sf(data=tract,fill='transparent',colour='black', size=1) +
scale_fill_gradient(low='#bdebda',high='#0076be')+
labs(title='MAE by fishnet, return')+
mapTheme(), ncol=2
)
From what has been discussed above, the model under-predicts periods of high demand but does capture the spatio-temporal pattern of scooter activity. With the prediction results, a simple calculation tells us more about excess demand and supply across the city. However, this model is far from mature, and here are several steps to take in the future:
A more advanced prediction model (see the sketch after this list);
Additional predictive factors;
A more appropriate spatial unit.
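For instance, because Trip_Count is a count variable, one natural refinement (a sketch we have not estimated in this report) is to swap the OLS specification for a Poisson regression with the same predictors:
reg.s.pois <-
glm(Trip_Count ~ GEOID + monthCat + time_of_day + events + Temperature + Wind_Speed + Preci + lagHour + lag2Hours + lag3Hours + lag4Hours + lag1day + holiday + holidayLag + weekday,
family = poisson(link = "log"), data = ride.train.s)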
Based on our prediction results, here are some suggestions we think would help the four providers:
Virtual docks. Another problem with scooter share is that some users park scooters in the wrong places. Virtual docks would not only solve this problem but also make the fleets easier for the four companies to manage; with these relatively fixed "stations", predictions would also be more accurate.
Profit analysis. Based on the prediction results, we calculate the net revenue if this rebalancing system operated normally and every scooter demand were met. We conclude that 91,383 more trips would have been completed in the test months of April, June and August. According to industry economics, each ride generates about $0.95 after deducting the costs of charging, repairs, credit card fees, customer support and insurance. The four companies would therefore collect a total additional profit of roughly $86,814.
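The back-of-the-envelope arithmetic behind that figure (the trip count and per-ride margin are the values cited above):
unmet_trips <- 91383 # additional trips completed if all demand were met
margin_per_ride <- 0.95 # net revenue per ride, in dollars
unmet_trips * margin_per_ride # roughly $86,814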