International Conference on Transportation Information and Safety (ICTIS), August, Banff, Canada

Short-Term Forecasting of Emerging On-Demand Ride Services
Jiaokun Liu, Erjia Cui, Haoqiang Hu, Xiaowei Chen, Xiqun (Michael) Chen*
College of Civil Engineering and Architecture
Zhejiang University
Hangzhou 310058, China
Email: chenxiqun@zju.edu.cn

Feng Chen
Key Laboratory of Road & Traffic Engineering of the Ministry of Education
Tongji University
4800 Cao'an Road, Shanghai 201804, China
Email: fengchen@tongji.edu.cn

*Corresponding author

Abstract—In the last few years, on-demand ride services have boomed worldwide, and different modes of ridesourcing services have emerged as well. However, there have been few qualitative and quantitative analyses of these ride service patterns, partially due to the lack or unavailability of detailed on-demand ride service data. In this paper, we analyze real-world individual-level order and trip data extracted from DiDi's on-demand mobility platform in Hangzhou, China. This study intends to understand the temporal and spatial travel patterns of passenger demand and of four types of ride services, i.e., Taxi Hailing, Private Car Service, Hitch, and Express. We study the relationship between the different service modes of drivers in a selected region during specific time periods. To predict the travel demand of these on-demand ride services, we utilize LASSO (least absolute shrinkage and selection operator) to rank features of the on-demand platform data (e.g., distance, fee, and waiting time). An on-demand ride prediction model is established based on the random forest (RF), which is then compared with the autoregressive integrated moving average (ARIMA) and support vector regression (SVR). The results show that RF outperforms the other models, and it is utilized to provide insight into forecasting the demand of distinct on-demand ride service patterns. To the best of the authors' knowledge, this paper is among the first attempts to learn the temporal and spatial travel patterns of emerging on-demand ride services and to forecast their demand.

Keywords—On-demand ride services; ridesourcing; spatial and temporal demand pattern; LASSO; random forest; support vector regression; ARIMA

I. INTRODUCTION

Traffic data have exploded during the last few years, and we have gradually entered the era of big data for transportation. Accordingly, our lives have been influenced by the emerging on-demand ride services. For example, Uber is a worldwide online ridesourcing company, or transportation network company (TNC), and as of August 2016 its service was available in over 66 countries and 545 cities. Since Uber's launch, several other companies have replicated its business model. DiDi is currently the largest TNC that provides on-demand ride services for close to 300 million users across over 400 cities in China [1]. It provides services including Taxi Hailing, Private Car Service, Hitch (i.e., social ridesplitting), DiDi Chauffeur, DiDi Bus, DiDi Test Drive, DiDi Car Rental, and DiDi Enterprise Solutions to users in China via an integrated smartphone application.

Although several studies have explored on-demand car service patterns (for instance, an intercept survey was utilized to understand the usage and impact of on-demand ride services [2-3]), few have offered a precise understanding of on-demand ride services in a real-world large-scale network, primarily due to the lack of accurate data. The contribution of this study is to understand emerging on-demand ride services quantitatively based on big data extracted from DiDi's on-demand mobility platform. In this study, we explore the platform users of four burgeoning ride services temporally and spatially based on real-world on-demand ride service data in Hangzhou, China. To the best of our knowledge, this paper is one of the first studies to explore the travel patterns of the emerging on-demand ride services, including Taxi Hailing, Private Car Service, Hitch and Express. Furthermore, we employ several prediction models, e.g., random forest (RF), autoregressive integrated moving average (ARIMA), and support vector regression (SVR), to forecast the short-term (e.g., 30 min) on-demand rides for the aforementioned services.

The rest of this paper is organized as follows. Section II describes the data used in this paper and presents the spatial and temporal distributions of on-demand ride services. In Section III, the ARIMA, RF and SVR models based on different sizes of training sets are established and compared for on-demand ride prediction. Section IV draws conclusions and outlines prospects for further research.

II. DATA PREPARATION AND DESCRIPTION

A. Data Preparation

The data were randomly sampled at an approximate rate of 20% from the on-demand mobility platform of DiDi between November 1 and 30, 2015, in Hangzhou, China. The number of successful on-demand ride orders (completed trips) during the period was 251344.

The region of our study is the central area of Hangzhou, China. The origin and destination of each trip were obtained by converting the received GPS data into a planar coordinate system. The study region measures 30 km x 43 km and is divided into 1290 squares, each with an area of 1 km2. The boundaries are set as 120.00 and 120.45 degrees east longitude and 30.12 and 30.39 degrees north latitude.

The four aforementioned types of on-demand ride services, i.e., Taxi Hailing, Private Car Service, Hitch and Express, are the focus of this paper. Private Car Service, Hitch and Express on the on-demand platform are travel modes that have emerged in recent years. Although much research has been devoted to exploring taxi travel patterns [4], we compare the conventional taxi mode with the other three on-demand ride services in order to better understand the relationship among them. We utilize the directly measurable explanatory variables in the original datasets and obtain indirect variables at both the individual and aggregate levels. The selected explanatory variables are listed in Table I.

TABLE I. EXPLANATORY VARIABLES

ID Variable Abbreviation Level Unit Range/Set Mean

1 Air quality index a AQI individual NA b (0,500] 67.22


2 Origin longitude OriLongitude population NA [0,44] 18.44
3 Origin latitude OriLatitude population NA [0,30] 16.5
4 Destination longitude DestLongitude population NA [0,44] 18.42
5 Destination latitude DestLatitude population NA [0,30] 16.5
6 Waiting time WaitTime individual sec [1,7200] 2398.49
7 Striving time StrTime individual sec [1,7200] 299.5
8 Travel time TravelTime individual sec [1,7200] 1379.33
9 Distance Distance individual km (0,500] 8.24

10 Real fee RealFee individual RMB (0,300] 16.94


11 Speed Speed individual km/h (0,150] 23.87
12 Origin daily trips OriDailyTrips population trips [0,+∞) 144.32
13 Origin average distances to different destinations OriAveDisToOtherDest population km [0,200] 7
14 Destination daily trips DestDailyTrips population trips [0,+∞) 146.07
15 Destination average distances to different destinations DestAveDisToOtherDest population km [0,200] 6.96
16 Origin average waiting time OriAveWaitTime population sec [1,7200] 2385.1
17 Origin average striving time OriAveStrTime population sec [1,7200] 298.38
18 Destination average waiting time DestAveWaitTime population sec [1,7200] 1800.88
19 Destination average striving time DestAveStrTime population sec [1,7200] 218.82
20 OD trip ODTrips population trips [0,+∞) 71.4
21 Order origin longitude OrdOriLongitude individual degree E [119.57,120.46] 120.18
22 Order origin latitude OrdOriLatitude individual degree N [29.85,30.39] 30.26
23 Order destination longitude OrdDestLongitude individual degree E [119.57,120.46] 120.18
24 Order destination latitude OrdDestLatitude individual degree N [29.85,30.39] 30.26
25 Order time OrderCreationTime individual sec [0,86399] 54451
26 Order created by day of month Date individual date [1,30] 17.97
27 Order created by day of week DayOfWeek individual day {Monday,…,Sunday} NA
28 Willing to carpool CarpoolWill individual NA {0: unwilling; 1: willing} NA
29 Carpool is matched or not CarpoolMatching individual NA {0: not matched; 1: matched} NA

a. AQI is distributed by day, which is provided by China's Ministry of Environmental Protection based on the level of 6 atmospheric pollutants (SO2, NO2, PM10, PM2.5,
CO and O3); b. NA: not applicable.
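
The grid-level variables in Table I (e.g., OriLongitude and OriLatitude) index the 1 km2 cells of the study region. As a rough illustration only, the sketch below maps a GPS point to a cell by linear scaling of the bounding box. This is a simplification and not the planar coordinate conversion used in the paper; the function name `to_cell` and the orientation of the grid (43 columns east-west, 30 rows north-south) are assumptions.

```python
# Bounding box and grid size quoted in the Data Preparation subsection.
LON_MIN, LON_MAX = 120.00, 120.45
LAT_MIN, LAT_MAX = 30.12, 30.39
N_COLS, N_ROWS = 43, 30  # assumed orientation: 43 cells east-west, 30 north-south


def to_cell(lon, lat):
    """Return (col, row) of the grid cell containing a point, or None if outside."""
    if not (LON_MIN <= lon <= LON_MAX and LAT_MIN <= lat <= LAT_MAX):
        return None
    # Linear scaling of the bounding box (an approximation of a planar projection).
    col = int((lon - LON_MIN) / (LON_MAX - LON_MIN) * N_COLS)
    row = int((lat - LAT_MIN) / (LAT_MAX - LAT_MIN) * N_ROWS)
    # Points exactly on the upper boundary fall into the last cell.
    return min(col, N_COLS - 1), min(row, N_ROWS - 1)
```

Orders whose origin or destination returns `None` would lie outside the study region and could be dropped before computing the population-level variables.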


B. Spatial Distribution of On-Demand Service

The on-demand rides of each mode have a considerable influence on travel demand management and traffic control. In this section, we preliminarily investigate the spatial distributions of the four on-demand ride services.

We select all of the order records whose starting time is between 17:00 and 18:00 in our dataset and plot the spatial distribution of the order origins in Fig. 1. A few characteristics of the spatial distributions are as follows:

First, the ride densities of the four on-demand ride services differ. The total demand for Express is 66337, the largest among the four on-demand ride services, since its price is relatively lower than that of the other modes and it is convenient to use. The demands for Taxi Hailing and Hitch are 27676 and 27655 between 17:00 and 18:00, respectively. Since individuals tend to use Hitch in the evening peak hour, the number of Hitch rides approaches that of Taxi Hailing. Moreover, the Express demand is more than twice that of either Taxi Hailing or Hitch according to the randomly selected orders in this time period. In reality there is more taxi demand than recorded, because some taxi drivers do not use the on-demand platform application. The total demand for Private Car Service is only 7116, much smaller than that of the other modes, because of its higher price and because some drivers require reservations in advance.

Second, all four on-demand ride services cluster in several areas of Hangzhou, which include the commercial, educational and industrial centers of the city. A few rides are scattered in the suburbs, indicating low demand for on-demand ride services in these areas.

Fig. 1. Origin distribution of on-demand ride services: (a) Taxi Hailing; (b) Private Car Service; (c) Hitch; (d) Express.

C. Demand Temporal Distribution

In order to better understand the temporal distribution of the four types of ride services, we select four time periods for demonstrative purposes (i.e., 7:00-8:00, 11:00-12:00, 17:00-18:00 and 23:00-24:00) and use the normalized rides as an indicator of the temporal variations of each type of service. The normalized ride is defined as the share of one-hour trips in the whole day's trips. As shown in Fig. 2, we use the average normalized rides over the 30 days in our dataset. Note that the height of each column only represents the normalized value between 0 and 1. The total baseline rides of Taxi Hailing, Private Car Service, Hitch and Express are 479888, 80613, 228562 and 76627 in our database, respectively.

As shown in Fig. 2, there are several temporal distinctions among the four types of ride services. First, the normalized ride variations of Private Car Service and Hitch are pronounced in the evening peak hour (17:00-18:00), while those of Express and Taxi Hailing are less noticeable. Since residents tend to book Hitch rides to and/or from home or work, a majority of Hitch rides occur in the morning and evening peak hours. Second, taxi drivers usually work day and night in alternating shifts, so the normalized ride demand of Taxi Hailing around midnight (23:00-24:00) is larger than that of the other three types of services. Nevertheless, the four types of ride services share some common features. All of them have relatively fewer rides on weekend mornings, indicating that fewer individuals commute to work on weekends.

Fig. 2. Traffic demand temporal distribution.
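
The normalized ride indicator plotted in Fig. 2 can be sketched as follows. This is a minimal illustration, assuming orders are available as (day, seconds since midnight) pairs that mirror the Date and OrderCreationTime variables of Table I; the function name and the input layout are hypothetical.

```python
from collections import defaultdict


def normalized_rides(orders, hour):
    """Average over days of (trips in `hour`) / (trips in the whole day).

    `orders` is an iterable of (day, seconds_since_midnight) pairs,
    a hypothetical layout mirroring Date and OrderCreationTime.
    """
    total = defaultdict(int)    # day -> trips in the whole day
    in_hour = defaultdict(int)  # day -> trips created within the selected hour
    for day, sec in orders:
        total[day] += 1
        if sec // 3600 == hour:
            in_hour[day] += 1
    # Share of the selected hour in each day, then the mean across days.
    shares = [in_hour[d] / total[d] for d in total]
    return sum(shares) / len(shares)
```

For example, calling `normalized_rides(orders, 17)` on the orders of one service type would give its average share of daily trips created between 17:00 and 18:00, a value between 0 and 1.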

III. TRAFFIC DEMAND PREDICTION

Since traffic is a complex phenomenon that is influenced by the interactions among different vehicle-driver combinations and by exogenous factors such as weather and roadway conditions, it often fluctuates intensely across different periods and under various conditions. As a result, it is usually difficult to use closed-form equations to represent and predict traffic. With the growing popularity of machine learning, however, data-driven approaches have become promising for modeling and predicting short-term traffic flows.

In this section, we employ several prediction models to forecast the total on-demand rides citywide and further analyze different ride services using the calibrated models. First, we divide the entire 30 days equally into 1440 intervals of half an hour each. Second, we select 15 explanatory variables as predictor variables and use the number of on-demand rides within each interval as the response variable. Afterwards, we adopt RF, ARIMA and SVR to build prediction models. Finally, we compare the performance of the different models and choose RF to further study the discrepancies among the distinct ride services.

A. Prediction Models

The ARIMA model is one of the widely recognized benchmark models for short-term traffic flow forecasting. A non-seasonal ARIMA model is denoted ARIMA(p, d, q), where p refers to the autoregressive part, d to the integrated (differencing) part, and q to the moving average part [5]. The basic idea of the ARIMA-based prediction method is to difference a non-stationary time series d times until it becomes stationary and then to fit an ARMA(p, q) model to the differenced sequence. The parameters (p, q) are determined according to the Akaike Information Criterion (AIC) [6]. Given a set of candidate prediction models, the preferred model is the one with the minimum AIC value.

Random forests for regression are an ensemble learning approach that constructs a multitude of decision trees on the training set and outputs the mean prediction of the individual trees [7]. Random forests correct for decision trees' tendency to overfit their training set.

Support vector regression is also a machine learning model used for regression analysis [8]. The idea of SVR is to compute a linear regression function in a high-dimensional feature space into which the input data are mapped via a nonlinear function. Therefore, SVR can be utilized to build a regression model for prediction.

B. Measures of Effectiveness (MoE)

In order to comprehensively evaluate the prediction performance of the ARIMA, RF and SVR models, three MoE are employed.

The mean absolute percentage error (MAPE) is defined by

$$\mathrm{MAPE} = \frac{1}{n} \sum_{i=1}^{n} \left| \frac{A_i - F_i}{A_i} \right| \times 100\%$$

where $A_i$ is the $i$th measured value, $F_i$ is the $i$th predicted value, and $n$ is the size of the test set.

Although the concept of MAPE is simple and convincing, the measure is not defined when the actual value is zero [9]. Other MoE overcome this issue. The symmetric mean absolute percentage error (SMAPE) is an accuracy measure based on relative errors, given by

$$\mathrm{SMAPE} = \frac{\sum_{i=1}^{n} \left| A_i - F_i \right|}{\sum_{i=1}^{n} \left( A_i + F_i \right)} \times 100\%$$

Another measure is the normalized root mean square error (NRMSE), given by

$$\mathrm{NRMSE} = \sqrt{\frac{\sum_{i=1}^{n} \left( A_i - F_i \right)^2}{\sum_{i=1}^{n} A_i^2}} \times 100\%$$

C. Feature Selection

In data mining applications, the input predictor variables are seldom equally relevant; only a few of them have a substantial influence on the response. Feature selection is therefore conducted to mitigate overfitting. Feature selection refers to the process of identifying the most important variables or parameters and using only this subset as features in the prediction model [10]. It has become a strategy for data dimension reduction. It simplifies the models so that they are easier for researchers and users to interpret, shortens training times and enhances generalization.

LASSO (least absolute shrinkage and selection operator) is a regression analysis method that performs variable selection so as to enhance the prediction accuracy and the interpretability of the statistical model it produces [11]. We use 80% of the whole dataset as the training set to compare the performance of LASSO before and after feature selection. Table II shows that the MoE after feature selection, i.e., MAPE, NRMSE and SMAPE, are all smaller than those before feature selection. Therefore, LASSO is a favorable method and is used for feature selection in our study.

TABLE II. PREDICTION ERRORS BEFORE AND AFTER FEATURE SELECTION

Feature selection MAPE NRMSE SMAPE
After feature selection 24.86% 23.01% 11.45%
Before feature selection 40.19% 25.41% 13.12%

Fig. 3 shows the feature importance ranking returned by LASSO. The relative importance or influence of each variable is scaled so that the importances of all input variables sum to 100. As a result, the 15 most important features are selected as predictor variables in our model, balancing lower prediction error against a smaller number of variables.

The results show that the most influential predictor is Date (the day when the order is created), which indicates that on-demand rides are strongly affected by the order date in our model. This may be because the demands for Express and Private Car Service are not available in the first week of November 2015 in our dataset, so the travel demand fluctuates markedly across dates. After Date, Speed, CarpoolMatching (whether the carpool is matched) and destination average striving time are also highly significant. These factors directly reflect traffic congestion. This is reasonable since the number of on-demand rides would vary with the average speed, the carpool matching rate and the destination average striving time. In contrast, AQI and real fee have relatively little influence because these factors usually have little effect on travel demand in the real world.

Fig. 3. Feature importance ranking by LASSO.

D. Model Optimization

We split the data into training, validation and test sets. Typically, most of the data are used for training, and a smaller portion is used for validation and testing.

For RF and SVR, parameter tuning is carried out as follows: the original database is split into the training set, validation set and test set according to the order creation time of the rides. For demonstrative purposes, the 18-day data (November 1-18, 2015) are used as the training set for parameter tuning, the 3-day data (November 19-21, 2015) are used as the validation set, and the remaining data (November 22-30, 2015) are used as the test set. The objective function is the NRMSE of the validation set for both RF and SVR, and the best parameters are those with the minimum validation error. The errors are shown in Table III.

TABLE III. MODEL VALIDATION DEMONSTRATION

MoE     RF validation error   RF test error   SVR validation error   SVR test error
NRMSE   9.52%                 27.96%          23.99%                 29.88%
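
The three measures of effectiveness of Section III-B, including the NRMSE reported in Table III, can be computed along the following lines. This is a minimal sketch: the SMAPE and NRMSE forms (ratio of sums, with a square root for NRMSE) are our reading of the definitions in the text and should be treated as assumptions, and the function names are illustrative.

```python
import math


def mape(actual, forecast):
    """Mean absolute percentage error; undefined if any actual value is 0."""
    n = len(actual)
    return 100.0 * sum(abs((a - f) / a) for a, f in zip(actual, forecast)) / n


def smape(actual, forecast):
    """Symmetric MAPE: summed absolute errors over summed (actual + forecast)."""
    num = sum(abs(a - f) for a, f in zip(actual, forecast))
    den = sum(a + f for a, f in zip(actual, forecast))
    return 100.0 * num / den


def nrmse(actual, forecast):
    """Root of the summed squared errors normalized by the summed squared actuals."""
    num = sum((a - f) ** 2 for a, f in zip(actual, forecast))
    den = sum(a ** 2 for a in actual)
    return 100.0 * math.sqrt(num / den)
```

Note that `mape` raises `ZeroDivisionError` when an actual value is zero, which matches the limitation discussed in Section III-B; `smape` and `nrmse` remain defined as long as the sums in their denominators are nonzero.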

In the ARIMA model, the parameters (p, q) are determined by using the Akaike Information Criterion (AIC). We start with a set of candidate parameters (p, q) and compute the AIC value of each corresponding model. The parameters (1, 49) yield the minimum AIC value and are chosen as the optimal parameters regardless of the training set size.

Fig. 4. Prediction errors with different testing set percentage.
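
The AIC-based choice of (p, q) for the ARIMA model can be illustrated with a small sketch. It assumes the maximized log-likelihood of each fitted candidate is already available; the log-likelihood values, the helper names, and the parameter count k = p + q + 1 are illustrative assumptions, not values or conventions taken from the paper.

```python
def aic(log_likelihood, n_params):
    """Akaike Information Criterion: 2k - 2 ln(L)."""
    return 2 * n_params - 2 * log_likelihood


def select_order(candidates):
    """Pick the (p, q) pair whose fitted model has the smallest AIC.

    `candidates` maps (p, q) -> maximized log-likelihood of the fitted model.
    The parameter count k = p + q + 1 (AR terms, MA terms, noise variance)
    is an assumed convention; software packages may count differently.
    """
    return min(candidates, key=lambda pq: aic(candidates[pq], pq[0] + pq[1] + 1))


# Hypothetical log-likelihoods, for illustration only (not from the paper).
fits = {(1, 0): -520.0, (1, 1): -510.5, (2, 1): -510.2}
best = select_order(fits)  # (1, 1): its small likelihood gain over (1, 0) outweighs the penalty
```

In this toy example (2, 1) has the highest likelihood, but its extra parameter is penalized, so (1, 1) is preferred.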

Model Comparison
In this study, we vary the training set size to compare the prediction performance of the three models and to test their sensitivity to the data split. The ratio of the training set ranges from 40% to 85% of the whole dataset in steps of 5%, with a fixed validation set size (10%), and the rest of the data is used as the test set. We carry out parameter tuning every time we change the training set. After a model has been fitted on the training set, we test it by comparing its predictions against the true values of the test set. The three MoE, i.e., MAPE, SMAPE and NRMSE, are used to compare the prediction accuracy of the three models. The results are shown in Fig. 4. RF displays the best performance among the three models; thus, we use RF to predict the demand for Express, Hitch and Private Car Service, respectively.
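
The sequence of data splits used in this comparison can be sketched as index boundaries over the 1440 half-hour intervals. The sketch assumes a chronological split in which the validation block immediately follows the training block, consistent with the splitting by order creation time described under Model Optimization; the function and variable names are illustrative.

```python
N_INTERVALS = 1440     # 30 days x 48 half-hour intervals
VALIDATION_SHARE = 10  # percent of the whole dataset, fixed


def split_bounds(train_share, n=N_INTERVALS):
    """Return (train_end, valid_end) indices for a chronological split.

    Intervals [0, train_end) train the model, [train_end, valid_end)
    validate it, and [valid_end, n) are held out for testing.
    Shares are integer percentages to avoid floating-point rounding.
    """
    train_end = n * train_share // 100
    valid_end = n * (train_share + VALIDATION_SHARE) // 100
    return train_end, valid_end


# Training shares from 40% to 85% in steps of 5%.
splits = {share: split_bounds(share) for share in range(40, 90, 5)}
```

With an 85% training share, for instance, only the last 72 intervals (5% of the data) remain for testing, so the test error at the largest training sizes rests on comparatively few observations.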
E. Prediction Results of Different On-Demand Ride Services

Since the four emerging types of on-demand ride services differ from each other, it is worthwhile to predict rides type by type. As mentioned above, RF shows the best performance when predicting citywide rides in the short term, so we further apply RF to compare the different ride services. The training set is 80% of the whole dataset. The results are shown in Table IV and Fig. 5.

In Fig. 5, the x-axis represents the 288 intervals in the six days of the test set. Each interval represents 30 min. The solid line stands for the actual on-demand rides and the dashed line represents the predicted on-demand rides. Private Car Service and Express are not available in 16 intervals (midnight and early morning) each day, so we assume that the on-demand rides are 0 during these time periods. In contrast, the Hitch data are intact and vary periodically. The predicted Hitch rides match the real-world rides well. Furthermore, the NRMSE and SMAPE of Hitch are both smaller than those of Private Car Service and Express, as shown in Table IV, which indicates better prediction accuracy. Although the prediction accuracies vary among the three ride services, the prediction errors are generally acceptable for field application.

TABLE IV. PREDICTION ERRORS OF DIFFERENT ON-DEMAND RIDE SERVICES

On-demand ride services MAPE NRMSE SMAPE
Express 13.65% 24.25% 11.78%
Hitch 30.19% 17.66% 7.38%
Private Car Service 12.17% 23.30% 10.36%

Fig. 5. Prediction results of different on-demand ride services in the last week of November, 2015: (a) Express; (b) Hitch; (c) Private Car Service.

IV. CONCLUSION

This paper preliminarily investigates four types of emerging on-demand ride services based on the largest ridesourcing company, or transportation network company, in China, i.e., DiDi. The travel patterns of Taxi Hailing, Express, Private Car Service and Hitch share some generally common but locally distinctive features temporally and spatially, resulting from the different on-demand ride service strategies. Regarding the short-term prediction of on-demand rides, the random forest model outperforms ARIMA and SVR based on the MoE, i.e., MAPE, NRMSE, and SMAPE. When RF is utilized to predict the traffic demand of Express, Private Car Service and Hitch, respectively, Hitch demonstrates the best forecasting results in terms of NRMSE and SMAPE.

The three aforementioned models could have other applications based on the on-demand mobility platform, including the assessment and prediction of gaps between travel demand (passengers) and supply (drivers). One issue regarding the application in travel demand prediction is related to parameter tuning. As mentioned in the model optimization section, the performance of the three models is largely influenced by their parameters. Therefore, there is a need to test the optimal combination of variables when developing the three models.

Ongoing research includes exploring the travel patterns of on-demand ride services in more depth and balancing supply and demand via strategies such as surge pricing. Some traffic congestion problems could be mitigated if the vehicles on the ridesourcing network were properly dispatched.

ACKNOWLEDGEMENT

This research is financially supported by Zhejiang Provincial Natural Science Foundation of China (LR17E080002), Key Laboratory of Road & Traffic Engineering of the Ministry of Education (TJDDZHCX004), National Natural Science Foundation of China (51508505, 71771198, 51338008), and the Fundamental Research Funds for the Central Universities (2017QNA4025).

REFERENCES

[1] DiDi. http://www.xiaojukeji.com/news/newslisten (accessed December 24, 2016).
[2] L. Rayle, S. Shaheen, N. Chan, D. Dai, and R. Cervero, "App-based, on-demand ride services: comparing taxi and ridesourcing trips and user characteristics in San Francisco," University of California Transportation Center, UCTC-FR-2014-08, 2014.
[3] X. Chen, M. Zahiri, and S. Zhang, "Understanding ridesplitting behavior of on-demand ride services: An ensemble learning approach," Transp. Res. Part C, vol. 76, pp. 51-70, January 2017.
[4] X. Liu, L. Gong, Y. Gong, and Y. Liu, "Revealing travel patterns and city structure with taxi trip data," J. Transp. Geogr., vol. 43, pp. 78-90, February 2015.
[5] R.S. Tsay, Analysis of Financial Time Series, John Wiley & Sons, 2005.
[6] K. Aho, D. Derryberry, and T. Peterson, "Model selection for ecologists: the worldviews of AIC and BIC," Ecology, vol. 95, no. 3, pp. 631-636, March 2014.
[7] T.K. Ho, "The random subspace method for constructing decision forests," IEEE Trans. Pattern Anal. Mach. Intell., vol. 20, no. 8, pp. 832-844, August 1998.
[8] C. Cortes and V. Vapnik, "Support-vector networks," Mach. Learn., vol. 20, no. 3, pp. 273-297, September 1995.
[9] C. Tofallis, "A better measure of relative prediction accuracy for model selection and model estimation," J. Oper. Res. Soc., vol. 66, no. 8, pp. 1352-1362, August 2015.
[10] G. James, D. Witten, T. Hastie, and R. Tibshirani, An Introduction to Statistical Learning, New York: Springer, 2013.
[11] R. Tibshirani, "Regression shrinkage and selection via the lasso," J. R. Stat. Soc. Ser. B-Stat. Methodol., vol. 58, no. 1, pp. 267-288, 1996.