The Thirty-Second AAAI Conference on Artificial Intelligence (AAAI-18)

Deep Multi-View Spatial-Temporal Network for Taxi Demand Prediction


Huaxiu Yao,∗ Fei Wu (Pennsylvania State University) {huaxiuyao, fxw133}@ist.psu.edu
Jintao Ke∗ (Hong Kong University of Science and Technology) jke@connect.ust.hk
Xianfeng Tang (Pennsylvania State University) xianfeng@ist.psu.edu
Yitian Jia, Siyu Lu, Pinghua Gong, Jieping Ye (Didi Chuxing) {jiayitian, lusiyu, gongpinghua, yejieping}@didichuxing.com
Zhenhui Li (Pennsylvania State University) jessieli@ist.psu.edu

∗The paper was done when these authors were interns at Didi Chuxing.
Copyright © 2018, Association for the Advancement of Artificial Intelligence (www.aaai.org). All rights reserved.
Abstract

Taxi demand prediction is an important building block for enabling intelligent transportation systems in a smart city. An accurate prediction model can help the city pre-allocate resources to meet travel demand and reduce empty taxis on streets, which waste energy and worsen traffic congestion. With the increasing popularity of taxi-requesting services such as Uber and Didi Chuxing (in China), we are able to collect large-scale taxi demand data continuously. How to utilize such big data to improve demand prediction is an interesting and critical real-world problem. Traditional demand prediction methods mostly rely on time series forecasting techniques, which fail to model the complex non-linear spatial and temporal relations. Recent advances in deep learning have shown superior performance on traditionally challenging tasks such as image classification by learning complex features and correlations from large-scale data. This breakthrough has inspired researchers to explore deep learning techniques on traffic prediction problems. However, existing methods on traffic prediction have only considered spatial relation (e.g., using CNN) or temporal relation (e.g., using LSTM) independently. We propose a Deep Multi-View Spatial-Temporal Network (DMVST-Net) framework to model both spatial and temporal relations. Specifically, our proposed model consists of three views: a temporal view (modeling correlations between future demand values and near time points via LSTM), a spatial view (modeling local spatial correlation via local CNN), and a semantic view (modeling correlations among regions sharing similar temporal patterns). Experiments on large-scale real taxi demand data demonstrate the effectiveness of our approach over state-of-the-art methods.

Introduction

Traffic is the pulse of a city that impacts the daily life of millions of people. One of the most fundamental questions for future smart cities is how to build an efficient transportation system. To address this question, a critical component is an accurate demand prediction model. The better we can predict demand on travel, the better we can pre-allocate resources to meet the demand and avoid unnecessary energy consumption. Currently, with the increasing popularity of taxi-requesting services such as Uber and Didi Chuxing, we are able to collect massive demand data at an unprecedented scale. The question of how to utilize big data to better predict traffic demand has drawn increasing attention in AI research communities.

In this paper, we study the taxi demand prediction problem: how to predict the number of taxi requests for a region at a future timestamp by using historical taxi-requesting data. In the literature, there has been a long line of studies in traffic data prediction, including traffic volume, taxi pick-ups, and traffic in/out flow volume. To predict traffic, time series prediction methods have frequently been used. Representatively, autoregressive integrated moving average (ARIMA) and its variants have been widely applied for traffic prediction (Li et al. 2012; Moreira-Matias et al. 2013; Shekhar and Williams 2008). Based on the time series prediction method, recent studies further consider spatial relations (Deng et al. 2016; Tong et al. 2017) and external context data (e.g., venue, weather, and events) (Pan, Demiryurek, and Shahabi 2012; Wu, Wang, and Li 2016). While these studies show that prediction can be improved by considering various additional factors, they still fail to capture the complex nonlinear spatial-temporal correlations.

Recent advances in deep learning have enabled researchers to model complex nonlinear relationships and have shown promising results in the computer vision and natural language processing fields (LeCun, Bengio, and Hinton 2015). This success has inspired several attempts to use deep learning techniques on traffic prediction problems. Recent studies (Zhang, Zheng, and Qi 2017; Zhang et al. 2016) propose to treat the traffic in a city as an image and the traffic volume for a time period as pixel values. Given a set of historical traffic images, the model predicts the traffic image for the next timestamp. A convolutional neural network (CNN) is applied to model the complex spatial correlation. Yu et al. (2017) propose to use Long Short-Term Memory (LSTM) networks to predict loop sensor readings, and show that the proposed LSTM model is capable of modeling complex sequential interactions. These pioneering attempts show superior performance compared with previous methods based on traditional time series prediction. However, none of them consider spatial relation and temporal sequential relation simultaneously.
In this paper, we harness the power of CNN and LSTM in a joint model that captures the complex nonlinear relations of both space and time. However, we cannot simply apply CNN and LSTM to the demand prediction problem: if we treat the demand over an entire city as an image and apply CNN to this image, we fail to achieve the best result. We find that including regions with weak correlations to predict a target region actually hurts the performance. To address this issue, we propose a novel local CNN method which only considers spatially nearby regions. This local CNN method is motivated by the First Law of Geography: "near things are more related than distant things" (Tobler 1970), and it is also supported by observations from real data that demand patterns are more correlated for spatially close regions.

While the local CNN method filters weakly correlated remote regions, it fails to consider the case where two locations are spatially distant but similar in their demand patterns (i.e., in the semantic space). For example, residential areas may have high demands in the morning when people transit to work, and commercial areas may have high demands on weekends. We propose to use a graph of regions to capture this latent semantics, where an edge represents the similarity of demand patterns for a pair of regions. Regions are then encoded into vectors via a graph embedding method, and such vectors are used as context features in the model. In the end, a fully connected neural network component is used for prediction.

Our method is validated on large-scale real-world taxi demand data from Didi Chuxing. The dataset contains taxi demand requests through the Didi service in the city of Guangzhou in China over a two-month span, with about 300,000 requests per day on average. We conducted extensive experiments to compare with state-of-the-art methods and demonstrated the superior performance of our proposed method.

In summary, our contributions are as follows:
• We propose a unified multi-view model that jointly considers the spatial, temporal, and semantic relations.
• We propose a local CNN model that captures local characteristics of regions in relation to their neighbors.
• We construct a region graph based on the similarity of demand patterns in order to model correlated but spatially distant regions. The latent semantics of regions are learnt through graph embedding.
• We conducted extensive experiments on a large-scale taxi request dataset from Didi Chuxing. The results show that our method consistently outperforms the competing baselines.

Related Work

Problems of traffic prediction can involve predicting any traffic-related data, such as traffic volume (collected from GPS or loop sensors), taxi pick-ups or drop-offs, traffic flow, and taxi demand (our problem). The problem formulation process for these different types of traffic data is the same: essentially, the aim is to predict a traffic-related value for a location at a timestamp. In this section, we discuss the related work on traffic prediction problems.

The traditional approach is to use time series prediction methods. Representatively, autoregressive integrated moving average (ARIMA) and its variants have been widely used for traffic prediction (Shekhar and Williams 2008; Li et al. 2012; Moreira-Matias et al. 2013).

Recent studies further explore the utility of external context data, such as venue types, weather conditions, and event information (Pan, Demiryurek, and Shahabi 2012; Wu, Wang, and Li 2016; Wang et al. 2017; Tong et al. 2017). In addition, various techniques have been introduced to model spatial interactions. For example, Deng et al. (2016) used matrix factorization on road networks to capture correlations among road-connected regions for predicting traffic volume. Several studies (Tong et al. 2017; Idé and Sugiyama 2011; Zheng and Ni 2013) also propose to smooth the prediction differences for nearby locations and time points via regularization for close space and time dependency. These studies assume that traffic in nearby locations should be similar. However, all of these methods are based on time series prediction methods and fail to model the complex nonlinear relations of space and time.

Recently, the success of deep learning in the fields of computer vision and natural language processing (LeCun, Bengio, and Hinton 2015; Krizhevsky, Sutskever, and Hinton 2012) has motivated researchers to apply deep learning techniques to traffic prediction problems. For instance, Wang et al. (2017) designed a neural network framework using context data from multiple sources to predict the gap between taxi supply and demand. The method uses extensive features, but does not model the spatial and temporal interactions.

One line of studies applies CNN to capture spatial correlation by treating the entire city's traffic as images. For example, Ma et al. (2017) utilized CNN on images of traffic speed for the speed prediction problem. Zhang et al. (2016) and Zhang, Zheng, and Qi (2017) proposed to use residual CNN on images of traffic flow. These methods simply apply CNN to the whole city and use all regions for prediction. We observe that utilizing irrelevant regions (e.g., remote regions) for prediction of the target region may actually hurt the performance. In addition, while these methods do use traffic images of historical timestamps for prediction, they do not explicitly model the temporal sequential dependency.

Another line of studies uses LSTM for modeling sequential dependency. Yu et al. (2017) proposed to apply a Long Short-Term Memory (LSTM) network and an autoencoder to capture the sequential dependency for predicting traffic under extreme conditions, particularly for peak-hour and post-accident scenarios. However, they do not consider the spatial relation.

In summary, the biggest difference of our proposed method compared with the literature is that we consider both spatial relation and temporal sequential relation in a joint deep learning model.
Preliminaries

In this section, we first fix some notation and define the taxi demand prediction problem. We follow previous studies (Zhang, Zheng, and Qi 2017; Wang et al. 2017) and define the set of non-overlapping locations L = {l_1, l_2, ..., l_i, ..., l_N} as rectangular partitions of a city, and the set of time intervals as I = {I_0, I_1, ..., I_t, ..., I_T}; 30 minutes is set as the length of a time interval. Alternatively, more sophisticated ways of partitioning can also be used, such as partitioning space by road network (Deng et al. 2016) or hexagonal partitioning. However, this is not the focus of this paper, and our methodology can still be applied. Given the set of locations L and time intervals I, we further define the following.

Taxi request: A taxi request o is defined as a tuple (o.t, o.l, o.u), where o.t is the timestamp, o.l represents the location, and o.u is the user identification number. The requester identifications are used for filtering duplicated and spammer requests.

Demand: The demand is defined as the number of taxi requests at one location per time interval, i.e.,

    y_t^i = |{o : o.t ∈ I_t ∧ o.l ∈ l_i}|,

where |·| denotes the cardinality of the set. For simplicity, for the rest of the paper we use the index of time intervals t to represent I_t and the index of locations i to represent l_i.
Demand prediction problem: The demand prediction problem aims to predict the demand at time interval t+1, given the data up to time interval t. In addition to historical demand data, we can also incorporate context features such as temporal features, spatial features, and meteorological features (refer to the Dataset Description section for more details). We denote the context features for a location i and a time interval t as a vector e_t^i ∈ R^r, where r is the number of features. Therefore, our final goal is to predict

    y_{t+1}^i = F(Y_{t−h,...,t}^L, E_{t−h,...,t}^L)

for i ∈ L, where Y_{t−h,...,t}^L are the historical demands and E_{t−h,...,t}^L are the context features for all locations L for time intervals from t−h to t, with t−h denoting the starting time interval. We define our prediction function F(·) on all regions and previous time intervals up to t−h to capture the complex spatial and temporal interactions among them.
Proposed DMVST-Net Framework

In this section, we provide details of our proposed Deep Multi-View Spatial-Temporal Network (DMVST-Net) framework, i.e., our prediction function F. Figure 1 shows the architecture of our proposed method. Our proposed model has three views: a spatial view, a temporal view, and a semantic view.

[Figure 1: The Architecture of DMVST-Net. (a) The spatial component uses a local CNN to capture spatial dependency among nearby regions. The local CNN includes several convolutional layers; a fully connected layer is used at the end to obtain a low-dimensional representation. (b) The temporal view employs an LSTM model, which takes the representations from the spatial view and concatenates them with context features at corresponding times. (c) The semantic view first constructs a weighted graph of regions (with weights representing functional similarity). Nodes are encoded into vectors, and a fully connected layer is used at the end for joint training. Finally, a fully connected neural network is used for prediction.]

Spatial View: Local CNN

As mentioned earlier, including regions with weak correlations to predict a target region actually hurts the performance. To address this issue, we propose a local CNN method which only considers spatially nearby regions. Our intuition is motivated by the First Law of Geography (Tobler 1970): "near things are more related than distant things".

As shown in Figure 1(a), at each time interval t, we treat one location i with its surrounding neighborhood as an S × S image (e.g., a 7 × 7 image in Figure 1(a)) having one channel of demand values (with i at the center of the image), where the size S controls the spatial granularity. We use zero padding for locations at the boundaries of the city. As a result, for each location i and time interval t we have an image as a one-channel tensor Y_t^i ∈ R^{S×S×1}. The local CNN takes Y_t^i as input Y_t^{i,0} and feeds it into K convolutional layers. The transformation at each layer k is defined as follows:

    Y_t^{i,k} = f(Y_t^{i,k−1} ∗ W_t^k + b_t^k),    (1)

where ∗ denotes the convolution operation and f(·) is an activation function. In this paper, we use the rectifier function as the activation, i.e., f(z) = max(0, z). W_t^k and b_t^k are the two sets of parameters in the k-th convolutional layer. Note that the parameters W_t^{1,...,K} and b_t^{1,...,K} are shared across all regions i ∈ L to make the computation tractable.

After K convolutional layers, a flatten layer transforms the output Y_t^{i,K} ∈ R^{S×S×λ} into a feature vector s_t^i ∈ R^{S²λ} for region i and time interval t. Finally, we use a fully connected layer to reduce the dimension of the spatial representation s_t^i:

    ŝ_t^i = f(W_t^{fc} s_t^i + b_t^{fc}),    (2)

where W_t^{fc} and b_t^{fc} are two learnable parameter sets at time interval t. For each time interval t, we thus obtain ŝ_t^i ∈ R^d as the representation for region i.
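The spatial view maps directly onto a few Keras layers. The sketch below is a minimal, uncompiled illustration of Eqs. (1)-(2) using the hyperparameters reported in the experiment section (S = 9, K = 3, 3 × 3 filters, λ = 64, d = 64); it is an assumed reconstruction, not the authors' released code.

```python
import tensorflow as tf
from tensorflow.keras import layers

S, K, LAMBDA, D = 9, 3, 64, 64  # S x S local image, K conv layers, λ filters, d output dims

def build_local_cnn():
    # Y_t^{i,0}: one-channel S x S demand image centered on region i.
    inp = layers.Input(shape=(S, S, 1))
    x = inp
    for _ in range(K):
        # Eq. (1): Y_t^{i,k} = f(Y_t^{i,k-1} * W_t^k + b_t^k), with ReLU as f.
        x = layers.Conv2D(LAMBDA, kernel_size=3, padding="same", activation="relu")(x)
        x = layers.BatchNormalization()(x)  # batch norm is used in the local CNN component
    x = layers.Flatten()(x)                 # s_t^i in R^{S^2 * λ}
    # Eq. (2): fully connected layer reducing s_t^i to the d-dimensional ŝ_t^i.
    out = layers.Dense(D, activation="relu")(x)
    return tf.keras.Model(inp, out, name="local_cnn")

local_cnn = build_local_cnn()
local_cnn.summary()
```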
Temporal View: LSTM

The temporal view models sequential relations in the demand time series. We propose to use a Long Short-Term Memory (LSTM) network as our temporal view component. LSTM (Hochreiter and Schmidhuber 1997) is a type of neural network structure which provides a good way to model sequential dependencies by recursively applying a transition function to the hidden state vector of the input. It was proposed to address the exploding or vanishing gradient problem of the classic Recurrent Neural Network (RNN) in long-sequence training (Hochreiter et al. 2001).

LSTM learns sequential correlations stably by maintaining a memory cell c_t at time interval t, which can be regarded as an accumulation of previous sequential information. At each time interval, the LSTM in this work takes an input g_t^i, along with h_{t−1}^i and c_{t−1}^i, and all information is accumulated into the memory cell when the input gate i_t^i is activated. In addition, LSTM has a forget gate f_t^i: if the forget gate is activated, the network can forget the previous memory cell c_{t−1}^i. The output gate o_t^i controls the output of the memory cell. In this study, we follow the version proposed by Graves (2013), which is defined as follows:

    i_t^i = σ(W_i g_t^i + U_i h_{t−1}^i + b_i),
    f_t^i = σ(W_f g_t^i + U_f h_{t−1}^i + b_f),
    o_t^i = σ(W_o g_t^i + U_o h_{t−1}^i + b_o),    (3)
    θ_t^i = tanh(W_g g_t^i + U_g h_{t−1}^i + b_g),
    c_t^i = f_t^i ◦ c_{t−1}^i + i_t^i ◦ θ_t^i,
    h_t^i = o_t^i ◦ tanh(c_t^i),

where ◦ denotes the Hadamard product and tanh is the hyperbolic tangent function; both are element-wise. W_a, U_a, b_a (a ∈ {i, f, o, g}) are all learnable parameters. The number of time intervals in the LSTM is h, and the output for region i after h time intervals is h_t^i.

As Figure 1(b) shows, the temporal component takes the representations from the spatial view and concatenates them with context features. More specifically, we define:

    g_t^i = ŝ_t^i ⊕ e_t^i,    (4)

where ⊕ denotes the concatenation operator; therefore, g_t^i ∈ R^{r+d}.
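Here is a minimal sketch of the temporal view: at each of the h time steps the d-dimensional spatial representation ŝ_t^i is concatenated with the r context features e_t^i (Eq. (4)) and the sequence is fed to an LSTM. The values h = 8 and d = 64 follow the experiment section, while r = 10 and the LSTM width of 64 units are illustrative placeholders.

```python
import tensorflow as tf
from tensorflow.keras import layers

H, D, R = 8, 64, 10  # h time steps; d-dim spatial codes; r context features (placeholder)

# Per-step spatial representations and per-step context features for one region.
spatial_seq = layers.Input(shape=(H, D))  # [ŝ_{t-h+1}, ..., ŝ_t]
context_seq = layers.Input(shape=(H, R))  # [e_{t-h+1}, ..., e_t]

# Eq. (4): g_t^i = ŝ_t^i ⊕ e_t^i, applied at every time step.
g = layers.Concatenate(axis=-1)([spatial_seq, context_seq])

# LSTM over the h steps; the final hidden state h_t^i summarizes the sequence.
h_t = layers.LSTM(units=64)(g)

temporal_view = tf.keras.Model([spatial_seq, context_seq], h_t, name="temporal_view")
temporal_view.summary()
```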
m̂i = f (Wf e mi + bf e ), (6)
Semantic View: Structural Embedding

Intuitively, locations sharing similar functionality may have similar demand patterns; e.g., residential areas may have a high number of demands in the morning when people transit to work, and commercial areas may expect high demands on weekends. Similar regions are not necessarily close in space. Therefore, we construct a graph of locations representing functional (semantic) similarity among regions.

We define the semantic graph of locations as G = (V, E, D), where the set of locations L are the nodes V = L, E ⊆ V × V is the edge set, and D is the set of similarities on all edges. We use Dynamic Time Warping (DTW) to measure the similarity ω_ij between node (location) i and node (location) j:

    ω_ij = exp(−α · DTW(i, j)),    (5)

where α is a parameter that controls the decay rate of the distance (in this paper, α = 1), and DTW(i, j) is the dynamic time warping distance between the demand patterns of the two locations. We use the average weekly demand time series as the demand patterns; the average is computed on the training data in the experiment. The graph is fully connected because every two regions can be reached.

In order to encode each node into a low-dimensional vector while maintaining the structural information, we apply a graph embedding method to the graph; in this paper, we use LINE (Tang et al. 2015) for generating the embeddings. For each node i (location), the embedding method outputs the embedded feature vector m_i. In addition, in order to co-train the embedded m_i with our whole network architecture, we feed the feature vector m_i to a fully connected layer, defined as:

    m̂_i = f(W^{fe} m_i + b^{fe}),    (6)

where W^{fe} and b^{fe} are both learnable parameters.
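A small self-contained sketch of Eq. (5) follows: a textbook O(nm) dynamic-programming DTW over two average weekly demand series, turned into an edge weight with α = 1. The paper does not specify its DTW implementation, so this is an illustrative stand-in.

```python
import numpy as np

def dtw_distance(a: np.ndarray, b: np.ndarray) -> float:
    """Classic dynamic-programming DTW between two 1-D demand series."""
    n, m = len(a), len(b)
    cost = np.full((n + 1, m + 1), np.inf)
    cost[0, 0] = 0.0
    for i in range(1, n + 1):
        for j in range(1, m + 1):
            d = abs(a[i - 1] - b[j - 1])
            cost[i, j] = d + min(cost[i - 1, j],      # insertion
                                 cost[i, j - 1],      # deletion
                                 cost[i - 1, j - 1])  # match
    return float(cost[n, m])

def edge_weight(pattern_i, pattern_j, alpha=1.0):
    # Eq. (5): ω_ij = exp(-α · DTW(i, j)), with α = 1 as in the paper.
    return np.exp(-alpha * dtw_distance(pattern_i, pattern_j))

# Toy average weekly demand patterns for two regions (illustrative values).
p1 = np.array([20.0, 35.0, 30.0, 15.0])
p2 = np.array([18.0, 33.0, 28.0, 14.0])
print(edge_weight(p1, p2))
```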
Prediction Component

Recall that our goal is to predict the demand at time interval t+1 given the data up to t. We join the three views together by concatenating m̂_i with the output h_t^i of the LSTM:

    q_t^i = h_t^i ⊕ m̂_i.    (7)

Note that the output of the LSTM, h_t^i, contains the effects of both the temporal and spatial views. We then feed q_t^i to a fully connected network to obtain the final prediction value ŷ_{t+1}^i for each region. We define our final prediction function as:

    ŷ_{t+1}^i = σ(W^{ff} q_t^i + b^{ff}),    (8)

where W^{ff} and b^{ff} are learnable parameters and σ(x) is the Sigmoid function, σ(x) = 1/(1 + e^{−x}). The output of our model is in [0, 1], as the demand values are normalized; we later denormalize the prediction to obtain the actual demand values.
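A minimal sketch of how the three views are joined (Eqs. (6)-(8)): the precomputed LINE embedding m_i passes through a fully connected layer, is concatenated with the LSTM output h_t^i, and a Sigmoid unit produces the normalized prediction. The embedding and semantic-view sizes (32 and 6) follow the experiment section; the exact wiring is an assumption consistent with Figure 1, not the released code.

```python
import tensorflow as tf
from tensorflow.keras import layers

EMB_DIM, SEM_DIM, HID_DIM = 32, 6, 64  # LINE embedding, semantic-view output, LSTM state size

h_t = layers.Input(shape=(HID_DIM,))   # h_t^i from the temporal view
m_i = layers.Input(shape=(EMB_DIM,))   # precomputed LINE embedding of region i

# Eq. (6): m̂_i = f(W^{fe} m_i + b^{fe}); ReLU per the paper's other FC layers.
m_hat = layers.Dense(SEM_DIM, activation="relu")(m_i)

# Eq. (7): q_t^i = h_t^i ⊕ m̂_i.
q_t = layers.Concatenate()([h_t, m_hat])

# Eq. (8): ŷ_{t+1}^i = σ(W^{ff} q_t^i + b^{ff}), in [0, 1] since demand is normalized.
y_hat = layers.Dense(1, activation="sigmoid")(q_t)

prediction_head = tf.keras.Model([h_t, m_i], y_hat, name="prediction_component")
prediction_head.summary()
```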
Loss Function

In this section, we provide details of the loss function used for jointly training our proposed model. The loss function is defined as:

    L(θ) = Σ_{i=1}^{N} ( (y_{t+1}^i − ŷ_{t+1}^i)² + γ ((y_{t+1}^i − ŷ_{t+1}^i) / y_{t+1}^i)² ),    (9)

where θ denotes all learnable parameters in DMVST-Net and γ is a hyperparameter. The loss function consists of two parts: a mean squared loss and the square of the mean absolute percentage loss. In practice, the mean squared error is more sensitive to large values; to avoid the training being dominated by large-value samples, we additionally minimize the mean absolute percentage loss. Note that, in the experiments, all compared regression methods use the same loss function as defined in Eq. (9) for fair comparison. The training pipeline is outlined in Algorithm 1. We use Adam (Kingma and Ba 2014) for optimization and implement our proposed model in TensorFlow and Keras (Chollet et al. 2015).
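Eq. (9) translates directly into a custom Keras loss. The sketch below assumes targets bounded away from zero (the experiments filter samples with demand below 10) and uses an illustrative γ, since the paper does not report its value; it also averages over the batch rather than summing, which only rescales the gradient.

```python
import tensorflow as tf

GAMMA = 0.5  # hyperparameter γ; illustrative value, not reported in the paper

def dmvst_loss(y_true, y_pred):
    # Eq. (9): squared error plus γ times the squared percentage error.
    mse = tf.square(y_true - y_pred)
    mape_sq = tf.square((y_true - y_pred) / y_true)  # targets are filtered away from zero
    return tf.reduce_mean(mse + GAMMA * mape_sq)     # batch mean instead of the paper's sum

# Usage sketch: model.compile(optimizer=tf.keras.optimizers.Adam(), loss=dmvst_loss)
```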
Algorithm 1: Training Pipeline of DMVST-Net
Input: Historical observations Y_{1,...,t}^L; context features E_{t−h,...,t}^L; region structure graph G = (V, E, D); length of the time period h.
Output: Learned DMVST-Net model.
1:  Initialization;
2:  for all i ∈ L do
3:      use LINE on G to get the embedding result m_i;
4:      for all t ∈ [h, T] do
5:          S_spa = [Y_{t−h+1}^i, Y_{t−h+2}^i, ..., Y_t^i];
6:          S_cox = [e_{t−h+1}^i, e_{t−h+2}^i, ..., e_t^i];
7:          append ⟨{S_spa, S_cox, m_i}, y_{t+1}^i⟩ to Ω;
8:      end for
9:  end for
10: Initialize all learnable parameters θ in DMVST-Net;
11: repeat
12:     randomly select a batch of instances Ω_b from Ω;
13:     optimize θ by minimizing the loss function Eq. (9) on Ω_b;
14: until the stopping criterion is met
Experiment norm regularization).
Dataset Description • Multiple layer perceptron (MLP): We compare our
In this paper, we use a large-scale online taxi request dataset method with a neural network of four fully connected lay-
collected from Didi Chuxing, which is one of the largest on- ers. The number of hidden units are 128, 128, 64, and 64
line car-hailing companies in China. The dataset contains respectively.
taxi requests from 02/01/2017 to 03/26/2017 for the city • XGBoost (Chen and Guestrin 2016): XGBoost is a pow-
of Guangzhou. There are 20 × 20 regions in our data. The erful boosting tree based method and is widely used in
size of each region is 0.7km × 0.7km. There are about data mining applications.

2592
Evaluation Metric

We use the Mean Average Percentage Error (MAPE) and the Root Mean Square Error (RMSE) to evaluate our algorithm, defined as follows:

    MAPE = (1/ξ) Σ_{i=1}^{ξ} |ŷ_{t+1}^i − y_{t+1}^i| / y_{t+1}^i,    (10)

    RMSE = sqrt( (1/ξ) Σ_{i=1}^{ξ} (ŷ_{t+1}^i − y_{t+1}^i)² ),    (11)

where ŷ_{t+1}^i and y_{t+1}^i are the predicted value and the real value of region i for time interval t+1, and ξ is the total number of samples.
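Eqs. (10) and (11) transcribe directly into NumPy; a minimal sketch with toy values:

```python
import numpy as np

def mape(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # Eq. (10): mean absolute percentage error over ξ samples.
    return float(np.mean(np.abs(y_pred - y_true) / y_true))

def rmse(y_true: np.ndarray, y_pred: np.ndarray) -> float:
    # Eq. (11): root mean squared error over ξ samples.
    return float(np.sqrt(np.mean((y_pred - y_true) ** 2)))

y_true = np.array([25.0, 40.0, 12.0])  # illustrative demand values
y_pred = np.array([22.0, 44.0, 11.0])
print(mape(y_true, y_pred), rmse(y_true, y_pred))
```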
Methods for Comparison

We compared our model with the following methods, tuning the parameters for all of them and reporting the best performance.

• Historical average (HA): Predicts the demand using the average of previous demand values at the same location in the same relative time interval (i.e., the same time of day).
• Autoregressive integrated moving average (ARIMA): A well-known model for forecasting time series that combines moving average and autoregressive components.
• Linear regression (LR): We compare our method with different versions of linear regression: ordinary least squares regression (OLSR), Ridge regression (i.e., with ℓ2-norm regularization), and Lasso (i.e., with ℓ1-norm regularization).
• Multiple layer perceptron (MLP): A neural network of four fully connected layers, with 128, 128, 64, and 64 hidden units, respectively.
• XGBoost (Chen and Guestrin 2016): A powerful boosting-tree-based method widely used in data mining applications.
• ST-ResNet (Zhang, Zheng, and Qi 2017): A deep learning based approach for traffic prediction. The method constructs a city's traffic density maps at different times as images, and CNN is used to extract features from the historical images.

We used the same context features for all the regression methods above. For fair comparison, all methods (except ARIMA and HA) use the same loss function as our method, defined in Eq. (9).

We also studied the effect of the different view components proposed in our method:

• Temporal view: This variant uses only the LSTM, with context features as inputs. Note that if we use no context features but only the demand value of the last timestamp as input, the LSTM does not perform well; context features are necessary to enable the LSTM to model the complex sequential interactions among these features.
• Temporal view + Semantic view: Captures both temporal dependency and semantic information.
• Temporal view + Spatial (Neighbors) view: Uses the demand values of nearby regions at time interval t as ŝ_t^i and combines them with context features as the input of the LSTM. This variant demonstrates that simply using neighboring regions as features cannot model the complex spatial relations as well as our proposed local CNN method.
• Temporal view + Spatial (LCNN) view: Considers both the temporal and local spatial views, with the spatial view using the proposed local CNN for the neighboring relation. Note that when the local CNN uses a local window large enough to cover the whole city, it is the same as the global CNN method. We studied the performance for different parameters and show that if the size is too large, the performance is worse, which indicates the importance of locality.
• DMVST-Net: Our proposed model, which combines the spatial, temporal, and semantic views.
Preprocessing and Parameters

We normalized the demand values for all locations to [0, 1] using Max-Min normalization on the training set. We used one-hot encoding to transform discrete features (e.g., holidays and weather conditions) and Max-Min normalization to scale the continuous features (e.g., the average demand value in the last four time intervals). As our method outputs a value in [0, 1], we applied the inverse of the Max-Min transformation obtained on the training set to recover the demand value.

All experiments were run on a cluster with four NVIDIA P100 GPUs. The size of each neighborhood was set to 9 × 9 (i.e., S = 9), which corresponds to 6 km × 6 km rectangles. For the spatial view, we set K = 3 (number of layers), τ = 3 × 3 (filter size), λ = 64 (number of filters), and d = 64 (output dimension). For the temporal component, we set the sequence length to h = 8 (i.e., 4 hours) for the LSTM. The output dimension of the graph embedding is set to 32, and the output dimension of the semantic view is set to 6. We used the Sigmoid function as the activation for the fully connected layer in the final prediction component; activation functions in the other fully connected layers are ReLU. Batch normalization is used in the local CNN component. The batch size was set to 64. The first 90% of the training samples were used for training each model, and the remaining 10% formed the validation set for parameter tuning. We also used early stopping in all experiments, with the early-stop round and the maximum number of epochs set to 10 and 100, respectively.
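The Max-Min normalization and its inverse, fit on the training set only, can be sketched as follows:

```python
import numpy as np

def fit_max_min(train_values: np.ndarray):
    """Compute min/max on the training set only, as in the paper."""
    return float(train_values.min()), float(train_values.max())

def normalize(values, v_min, v_max):
    # Scale demand to [0, 1] to match the Sigmoid output of the model.
    return (values - v_min) / (v_max - v_min)

def denormalize(scaled, v_min, v_max):
    # Inverse transform used to recover actual demand values from predictions.
    return scaled * (v_max - v_min) + v_min

train = np.array([10.0, 55.0, 120.0, 80.0])  # illustrative training demands
v_min, v_max = fit_max_min(train)
pred_scaled = 0.42                           # example model output in [0, 1]
print(denormalize(pred_scaled, v_min, v_max))
```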
Performance Comparison

Comparison with state-of-the-art methods. Table 1 shows the performance of the proposed method compared to all other competing methods. DMVST-Net achieves the lowest MAPE (0.1616) and the lowest RMSE (9.642) among all methods, a 12.17% (MAPE) and 3.70% (RMSE) relative improvement over the best-performing baseline.

Table 1: Comparison with Different Baselines

Method                              MAPE     RMSE
Historical average                  0.2513   12.167
ARIMA                               0.2215   11.932
Ordinary least squares regression   0.2063   10.234
Ridge regression                    0.2061   10.224
Lasso                               0.2091   10.327
Multiple layer perceptron           0.1840   10.609
XGBoost                             0.1953   10.012
ST-ResNet                           0.1971   10.298
DMVST-Net                           0.1616   9.642

More specifically, HA and ARIMA perform poorly (with MAPEs of 0.2513 and 0.2215, respectively), as they rely purely on historical demand values for prediction. The regression methods (OLSR, Lasso, Ridge, MLP, and XGBoost) further consider context features and therefore achieve better performance. Note that the regression methods use the same loss function as our method, defined in Eq. (9); however, they do not model the temporal and spatial dependency. Consequently, our proposed method significantly outperforms them.

Furthermore, our proposed method achieves an 18.01% (MAPE) and 6.37% (RMSE) relative improvement over ST-ResNet. Compared with ST-ResNet, our method further utilizes LSTM to model the temporal dependency while also considering context features. In addition, our use of the local CNN and the semantic view better captures the correlations among regions.

Comparison with variants of our proposed method. Table 2 shows the performance of DMVST-Net and its variants.

Table 2: Comparison with Variants of DMVST-Net

Method                                MAPE     RMSE
Temporal view                         0.1721   9.812
Temporal + Semantic view              0.1708   9.789
Temporal + Spatial (Neighbor) view    0.1710   9.796
Temporal + Spatial (LCNN) view        0.1640   9.695
DMVST-Net                             0.1616   9.642

First, both Temporal view + Spatial (Neighbor) view and Temporal view + Spatial (LCNN) view achieve a lower MAPE than the temporal view alone (reductions of 0.63% and 6.10%, respectively), demonstrating the effectiveness of considering the neighboring spatial dependency. Furthermore, Temporal view + Spatial (LCNN) view outperforms Temporal view + Spatial (Neighbor) view significantly, as the local CNN can better capture the nonlinear relations. On the other hand, Temporal view + Semantic view achieves a lower MAPE (0.1708) and RMSE (9.789) than the temporal view only, demonstrating the effectiveness of our semantic view. Lastly, the performance is best when all views are combined.

Performance on Different Days

Figure 2 shows the performance of different methods on different days of the week. Due to space limitations, we only show MAPE here; we reach the same conclusions with RMSE. We exclude the results of HA and ARIMA, as they perform poorly, and show Ridge regression as it performs best among the linear regression models. The figure shows that our proposed DMVST-Net consistently outperforms the other methods on all seven days, demonstrating that our method is robust.

[Figure 2: The Results of Different Days. MAPE per day of the week (Mon-Sun) for Ridge, MLP, XGBoost, ST-ResNet, Temporal, and DMVST-Net.]

Moreover, we can see that predictions on weekends are generally worse than on weekdays. Since the average number of demand requests is similar (45.42 and 43.76 for weekdays and weekends, respectively), we believe the prediction task is harder on weekends because the demand patterns are less regular. For example, we can expect residential areas to have high demands in the morning hours on weekdays, as people need to transit to work; such regular patterns are less likely on weekends. To evaluate the robustness of our method, we look at the relative increase in prediction error on weekends compared to weekdays, defined as |w̄k − w̄d| / w̄d, where w̄d and w̄k are the average prediction errors on weekdays and weekends, respectively. The results are shown in Table 3. For our proposed method, the relative increase in error is the smallest, at 4.04%.

Table 3: Relative Increase in Error (RIE) on Weekends to Weekdays

Method   Ridge    MLP      XGBoost   ST-ResNet   Temporal   DMVST-Net
RIE      14.91%   10.71%   16.08%    4.41%       4.77%      4.04%

At the same time, considering the temporal view only (LSTM) yields a relative increase in error of 4.77%, while the increase is more than 10% for Ridge regression, MLP, and XGBoost. The more stable performance of the LSTM can be attributed to its modeling of the temporal dependency. ST-ResNet shows a more consistent performance (a relative increase in error of 4.41%), as the method further models the spatial dependency. Finally, our proposed method is more robust than ST-ResNet.

Influence of Sequence Length for LSTM

In this section, we study how the sequence length for the LSTM affects the performance. Figure 3a shows the prediction error (MAPE) with respect to the length. We can see that when the length is 4 hours, our method achieves the best performance. The decreasing trend in MAPE as the length increases shows the importance of considering the temporal dependency. Furthermore, as the length increases beyond 4 hours, the performance degrades slightly but remains largely stable. One potential reason is that considering a longer temporal dependency requires learning more parameters, which makes the training harder.

Influence of Input Size for Local CNN

Our intuition was that applying CNN locally avoids learning relations among weakly related locations. We verified this intuition by varying the input size S of the local CNN: as S becomes larger, the model may fit relations over a larger area. Figure 3b shows the performance of our method with respect to the size of the surrounding neighborhood map. We can see that with three convolutional layers and a 9 × 9 map, the method achieves the best performance. The prediction error increases as the size decreases to 5 × 5, which may be because locally correlated neighboring locations are not fully covered. Furthermore, the prediction error increases significantly (by more than 3.46%) as the size increases to 13 × 13 (where each area covers approximately more than 40% of the space in Guangzhou). This result suggests that locally significant correlations may be averaged out as the size increases. We also increased the number of convolutional layers to four and five, as the CNN needs more depth to cover a larger area; however, we observed similar trends in the prediction error, as shown in Figure 3b. The input size at which the local CNN performs best remains consistent (i.e., a 9 × 9 map).

[Figure 3: (a) MAPE with respect to sequence length for LSTM. (b) MAPE with respect to the input size for local CNN, shown for three, four, and five convolutional layers.]
Conclusion and Discussion
In this paper, we proposed a novel Deep Multi-View Spatial-Temporal Network (DMVST-Net) for predicting taxi demand.
Our approach integrates the spatial, temporal, and semantic views, which are modeled by local CNN, LSTM, and semantic graph embedding, respectively. We evaluated our model on a large-scale taxi demand dataset. The experimental results show that our proposed method significantly outperforms several competing methods.

As deep learning methods are often difficult to interpret, it is important to understand what contributes to the improvement; this is particularly important for policy makers. For future work, we plan to further investigate the performance of our approach with respect to interpretability. In addition, since the semantic information is modeled only implicitly in this paper, we plan to incorporate more explicit information (e.g., POI information) in future work.

Acknowledgments

The work was supported in part by NSF awards #1544455, #1652525, #1618448, and #1639150. The views and conclusions contained in this paper are those of the authors and should not be interpreted as representing any funding agencies.

References

Chen, T., and Guestrin, C. 2016. XGBoost: A scalable tree boosting system. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining, 785–794. ACM.

Chollet, F., et al. 2015. Keras. https://github.com/fchollet/keras.

Deng, D.; Shahabi, C.; Demiryurek, U.; Zhu, L.; Yu, R.; and Liu, Y. 2016. Latent space model for road networks to predict time-varying traffic. In Proceedings of the 22nd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM.

Graves, A. 2013. Generating sequences with recurrent neural networks. arXiv preprint arXiv:1308.0850.

Hochreiter, S., and Schmidhuber, J. 1997. Long short-term memory. Neural Computation 9(8):1735–1780.

Hochreiter, S.; Bengio, Y.; Frasconi, P.; and Schmidhuber, J. 2001. Gradient flow in recurrent nets: the difficulty of learning long-term dependencies.

Idé, T., and Sugiyama, M. 2011. Trajectory regression on road networks. In Proceedings of the Twenty-Fifth AAAI Conference on Artificial Intelligence, 203–208. AAAI Press.

Kingma, D., and Ba, J. 2014. Adam: A method for stochastic optimization. arXiv preprint arXiv:1412.6980.

Krizhevsky, A.; Sutskever, I.; and Hinton, G. E. 2012. ImageNet classification with deep convolutional neural networks. In Advances in Neural Information Processing Systems, 1097–1105.

LeCun, Y.; Bengio, Y.; and Hinton, G. 2015. Deep learning. Nature 521(7553):436–444.

Li, X.; Pan, G.; Wu, Z.; Qi, G.; Li, S.; Zhang, D.; Zhang, W.; and Wang, Z. 2012. Prediction of urban human mobility using large-scale taxi traces and its applications. Frontiers of Computer Science 6(1):111–121.

Ma, X.; Dai, Z.; He, Z.; Ma, J.; Wang, Y.; and Wang, Y. 2017. Learning traffic as images: a deep convolutional neural network for large-scale transportation network speed prediction. Sensors 17(4):818.

Moreira-Matias, L.; Gama, J.; Ferreira, M.; Mendes-Moreira, J.; and Damas, L. 2013. Predicting taxi–passenger demand using streaming data. IEEE Transactions on Intelligent Transportation Systems 14(3):1393–1402.

Pan, B.; Demiryurek, U.; and Shahabi, C. 2012. Utilizing real-world transportation data for accurate traffic prediction. In Data Mining (ICDM), 2012 IEEE 12th International Conference on, 595–604. IEEE.

Shekhar, S., and Williams, B. 2008. Adaptive seasonal time series models for forecasting short-term traffic flow. Transportation Research Record: Journal of the Transportation Research Board (2024):116–125.

Tang, J.; Qu, M.; Wang, M.; Zhang, M.; Yan, J.; and Mei, Q. 2015. LINE: Large-scale information network embedding. In Proceedings of the 24th International Conference on World Wide Web, 1067–1077. International World Wide Web Conferences Steering Committee.

Tobler, W. R. 1970. A computer movie simulating urban growth in the Detroit region. Economic Geography 46(sup1):234–240.

Tong, Y.; Chen, Y.; Zhou, Z.; Chen, L.; Wang, J.; Yang, Q.; and Ye, J. 2017. The simpler the better: A unified approach to predicting original taxi demands on large-scale online platforms. In Proceedings of the 23rd ACM SIGKDD International Conference on Knowledge Discovery and Data Mining. ACM.

Wang, D.; Cao, W.; Li, J.; and Ye, J. 2017. DeepSD: Supply-demand prediction for online car-hailing services using deep neural networks. In Data Engineering (ICDE), 2017 IEEE 33rd International Conference on, 243–254. IEEE.

Wu, F.; Wang, H.; and Li, Z. 2016. Interpreting traffic dynamics using ubiquitous urban data. In Proceedings of the 24th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, 69. ACM.

Yu, R.; Li, Y.; Demiryurek, U.; Shahabi, C.; and Liu, Y. 2017. Deep learning: A generic approach for extreme condition traffic forecasting. In Proceedings of the SIAM International Conference on Data Mining.

Zhang, J.; Zheng, Y.; Qi, D.; Li, R.; and Yi, X. 2016. DNN-based prediction model for spatio-temporal data. In Proceedings of the 24th ACM SIGSPATIAL International Conference on Advances in Geographic Information Systems, 92. ACM.

Zhang, J.; Zheng, Y.; and Qi, D. 2017. Deep spatio-temporal residual networks for citywide crowd flows prediction. In Proceedings of the Thirty-First AAAI Conference on Artificial Intelligence.

Zheng, J., and Ni, L. M. 2013. Time-dependent trajectory regression on road networks via multi-task learning. In Proceedings of the Twenty-Seventh AAAI Conference on Artificial Intelligence, 1048–1055. AAAI Press.