Download as pdf or txt
Download as pdf or txt
You are on page 1of 20

1

Machine Learning for Spatiotemporal


Sequence Forecasting: A Survey
Xingjian Shi, Dit-Yan Yeung, Senior Member, IEEE

Abstract—Spatiotemporal systems are common in the real-world. Forecasting the multi-step future of these
spatiotemporal systems based on the past observations, or, Spatiotemporal Sequence Forecasting (STSF), is a
significant and challenging problem. Although lots of real-world problems can be viewed as STSF and many
arXiv:1808.06865v1 [cs.LG] 21 Aug 2018

research works have proposed machine learning based methods for them, no existing work has summarized and
compared these methods from a unified perspective. This survey aims to provide a systematic review of machine
learning for STSF. In this survey, we define the STSF problem and classify it into three subcategories: Trajectory
Forecasting of Moving Point Cloud (TF-MPC), STSF on Regular Grid (STSF-RG) and STSF on Irregular Grid
(STSF-IG). We then introduce the two major challenges of STSF: 1) how to learn a model for multi-step
forecasting and 2) how to adequately model the spatial and temporal structures. After that, we review the existing
works for solving these challenges, including the general learning strategies for multi-step forecasting, the
classical machine learning based methods for STSF, and the deep learning based methods for STSF. We also
compare these methods and point out some potential research directions.

Index Terms—Spatiotemporal Sequence Forecasting, Machine Learning, Deep Learning, Data Mining.

1 I NTRODUCTION

M ANY real-world phenomena are spatiotemporal,


such as the traffic flow, the diffusion of air
pollutants and the regional rainfall. Correctly predicting
algorithms can further be used to guide other tasks like
classification and control. For example, controlling a
robotic arm will be much easier once you can anticipate
the future of these spatiotemporal systems based on how it will move after performing specific actions [8].
the past observations is essential for a wide range of
In this paper, we review these machine learning
scientific studies and real-life applications like traffic
based methods for spatiotemporal sequence forecast-
management, precipitation nowcasting, and typhoon
ing. We define a length-T spatiotemporal sequence as
alert systems.
a series of matrices X1:T = [X1 , X2 , ...XT ]. Each
If there exists an accurate numerical model of the
Xt ∈ RK×(D+E) contains a set of coordinates and
dynamical system, which usually happen in the case
their corresponding measurements. Here, K is the
that we have fully understood its rules, forecasting
number of coordinates, D is the number of measure-
can be achieved by first finding the initial condition
ments, and E is the dimension of the coordinate. We
of the model and then simulate it to get the future
further denote the measurement part and coordinate
predictions [1]. However, for some complex spatiotem-
part of Xt as Mt ∈ RK×D and Ct ∈ RK×E .
poral dynamical systems like atmosphere, crowd, and
For many forecasting problems, we can use auxiliary
natural videos, we are still not entirely clear about their
information to enhance the prediction accuracy. For ex-
inner mechanisms. In these situations, machine learn-
ample, regarding wind speed forecasting, the latitudes
ing based methods, by which we can try to learn the
and past temperatures are also helpful for predicting the
systems’ inner “laws” based on the historical data, have
future wind speeds. We denote the auxiliary informa-
proven helpful for making accurate predictions [2], [3],
tion available at timestamp t as At . The Spatiotemporal
[4], [5], [6]. Moreover, recent studies [7], [8] have
Sequence Forecasting (STSF) problem is to predict the
shown that the “laws” inferred by machine learning
length-L (L > 1) sequence in the future given the
previous observations plus the auxiliary information
• Xingjian Shi is with the Department of Computer Science
and Engineering, the Hong Kong University of Science and
that is allowed to be empty. The mathematical form
Technology. E-mail: xshiab@cse.ust.hk is given in (1).
• Dit-Yan Yeung is with the Department of Computer Science
and Engineering, the Hong Kong University of Science and X̂t+1:t+L = arg max p(Xt+1:t+L | X1:t , At ). (1)
Technology. E-mail: dyyeung@cse.ust.hk X t+1:t+L

Here, we limit STSF to only contain problems where


2

TABLE 1
Types of spatiotemporal sequence forecasting problems.

Problem Name Coordinates Measurements


Trajectory Forecasting of Moving Point Cloud Changing Fixed/Changing
Spatiotemporal Forecasting on Regular Grid Fixed regular grid Changing
Spatiotemporal Forecasting on Irregular Grid Fixed irregular grid Changing

the input and output are both sequences with multiple objective while the IMS approach learns a one-step-
elements. According to this definition, problems like ahead forecaster and iteratively applies it to gener-
video generation using a static image [9] and dense ate multi-step predictions. However, choosing between
optical flow prediction [10], in which the input or the DMS and IMS involves a trade-off among forecasting
output contain a single element, will not be counted. bias, estimation variance, the length of the prediction
Based on the characteristics of the coordinate se- horizon and the model’s nonlinearity [20]. Recent stud-
quence C1:T and the measurement sequence M1:T , we ies have tried to find the mid-ground between these two
classify STSF into three categories as shown in Table 1. approaches [20], [21]. The second challenge is how to
If the coordinates are changing, the STSF problem is model the spatial and temporal structures within the
called Trajectory Forecasting of Moving Point Cloud data adequately. In fact, the number of free parameters
(TF-MPC). Problems like human trajectory predic- of a length-T spatiotemporal sequence can be up to
tion [4], [11] and human dynamics prediction [12], O(K T DT ) where K is usually above thousands for
[13], [14] fall into this category. We need to emphasize STSF-RG problems like video prediction. Forecasting
here that the entities in moving point cloud can not only in such a high dimensional space is impossible if
be human but also be general moving objects. If the the model cannot well capture the data’s underlying
coordinates are fixed, we only need to predict the future structure. Therefore, machine learning based methods
measurements. Based on whether the coordinates lie in for STSF, including both classical methods and deep
a regular grid, we can get the other two categories, learning based methods, have special designs in their
called STSF on Regular Grid (STSF-RG) and STSF model architectures for capturing these spatiotemporal
on Irregular Grid (STSF-IG). Problems like precipita- correlations.
tion nowcasting [5], crowd density prediction [15] and In this paper, we organize the literature about STSF
video prediction [2], in which we have dense observa- following these two challenges. Section 2 covers the
tions, are STSF-RG problems. Problems like air quality general learning strategies for multi-step forecasting
forecasting [3], influenza prediction [16] and traffic including IMS, DMS, boosting strategy and scheduled
speed forecasting [17], in which the monitoring stations sampling. Section 3 reviews the classical methods
(or sensors) spread sparsely across the city, are STSF- for STSF including the feature-based methods, state-
IG problems. We need to emphasize that although the space models, and Gaussian process models. Section 4
measurements in TF-MPC can also be time-variant due reviews the deep learning based methods for STSF
to factors like appearance change, most models for TF- including deep temporal generative models and feed-
MPC deal with the fixed-measurement case and only forward and recurrent neural network-based methods.
predict the coordinates. We thus focus on the fixed- Section 5 summarizes the survey and discusses some
measurement case in this survey. future research directions.
Having both the characteristics of spatial data like
images and temporal data like sentences and audios,
2 L EARNING S TRATEGIES FOR M ULTI -
spatiotemporal sequences contain information about
STEP F ORECASTING
“what”, “when”, and “where” and provide a com-
prehensive view of the underlying dynamical system. Compared with single-step forecasting, learning a
However, due to the complex spatial and temporal model for multi-step forecasting is more challenging
relationships within the data and the potential long because the forecasting output X̂t+1:t+L is a sequence
forecasting horizon, spatiotemporal sequence forecast- with non-i.i.d elements. In this section, we first intro-
ing imposes new challenges to the machine learn- duce and compare two basic strategies called IMS and
ing community. The first challenge is how to learn DMS [19] and then review two extensions that bridge
a model for multi-step forecasting. Early researches the gap between these two strategies. To simplify the
in time-series forecasting [18], [19] have investigated notation, we keep the following notation rules in this
two approaches: Direct Multi-step (DMS) estimation and the following sections. We denote the information
and Iterated Multi-step (IMS) estimation. The DMS available at timestamp t as Ft = {X1:t , At }. Learning
approach directly optimizes the multi-step forecasting an STSF model is thus to train a model, which can
3

be either probabilistic or deterministic, to make the the deterministic forecaster example above, the optimal
L-step-ahead prediction X̂t+1:t+L based on Ft . We muti-step-aheadhP forecaster, which can be obtained
i by
also denote the ground-truth at timestamp t as X̃t L (h)
minimizing E h=1 d(X̃t+h , m (Ft ; φ)) , is not
and recursively applying the basic function f (x) for the same as recursively applying the optimal one-step-
h times as f (h) (x). Setting the k th subscript to be ahead forecaster defined in (3) when m is nonlinear.
“:” means to select all elements at the k th dimension. This is because the forecasting error at earlier times-
Setting the k th subscript to a letter i means to select tamps will propagate to later timestamps [23].
the ith element at the k th dimension. For example,
for matrix X, Xi,: means the ith row, X:,j means the
2.2 Direct Multi-step Estimation
j th column, and Xi,j means the (i, j)th element. The
flattened version of matrices A, B and C is denoted as The main motivation behind DMS is to avoid the error
vec(A, B, C). drifting problem in IMS by directly minimizing the
long-term prediction error. Instead of training a single
model, DMS trains a different model mh for each
2.1 Iterative Multi-step Estimation forecasting horizon h. There can thus be L models in
The IMS strategy trains a one-step-ahead forecasting the DMS approach. When the forecasters are determin-
model and iteratively feeds the generated samples to istic, the forecasting process and training objective are
the one-step-ahead forecaster to get the multi-step- shown in (5) and (6) respectively where d(·, ·) is the
ahead prediction. There are two types of IMS mod- distance measure.
els, the deterministic one-step-ahead forecasting model
X̂t+h = mh (Ft ; φL ), ∀1 ≤ h ≤ L. (5)
Xt = m(Ft ; φ) and the probabilistic one-step-ahead
forecasting model p(Xt | Ft ; φ). For the deterministic
h i
φ?1 , ..., φ?L = arg min E d(X̃t+1:t+L , X̂t+1:t+L ) .
forecasting model, the forecasting process and the φ1 ,...,φL
training objective of the deterministic forecaster [20] (6)
L
" #
are shown in (2) and (3) respectively where d(x̃, x) X
φ? = arg min E d(X̃t+h , m(h) (Ft ; φ)) . (7)
measures the distance between x̃ and x. φ h=1
X̂t+h = m(h) (Ft ; φ), ∀1 ≤ h ≤ L. (2) To disentangle the model size from the number of
h i forecasting steps L, we can also construct mh by
φ? = arg min E d(X̃t+1 , m(Ft ; φ)) . (3) recursively applying the one-step-ahead forecaster m1 .
φ
In this case, the model parameters {φ1 , ..., φL } are
For the probabilistic forecasting model, we use the shared and the optimization problem turns into (7). We
chain-rule to get the probability of the length-L predic- need to emphasize here that (7), which optimizes the
tion, which is given in (4), and estimate the parameters multi-step-ahead forecasting error, is different from (3),
by Maximum Likelihood Estimation (MLE). which only minimizes the one-step-ahead forecasting
p(Xt+1:t+L | Ft ; φ) = p(Xt+1 | Ft ; φ) error. This construction method is widely adopted in
L deep learning based methods [7], [24].
Y (4)
p(Xt+i | Ft , Xt+1:t+i−1 ; φ).
i=2 2.3 Comparison of IMS and DMS
When the one-step-ahead forecaster is probabilistic and Chevillon [19] compared the IMS strategy and DMS
nonlinear, the MLE of (4) is generally intractable. In strategy. According to the paper, DMS can lead to more
this case, we can draw samples from p(Xt+1:t+L | accurate forecasts when 1) the model is misspecified, 2)
Ft ; φ) by techniques like Sequential Monte-Carlo the sequences are non-stationary, or 3) the training set
(SMC) [22] and predict based on these samples. is too small. However, DMS is more computationally
There are two advantages of the IMS approach: 1) expensive than IMS. For DMS, if the φh s are not
The one-step-ahead forecaster is easy to train because shared, we need to store and train L models. If the
it only needs to consider the one-step-ahead forecasting φh s are shared, we need to recursively apply m1
error and 2) we can generate predictions of an arbitrary for O(L) steps [19], [21], [25]. Both cases require
length by recursively applying the basic forecaster. larger memory storage or longer running time than
However, there is a discrepancy between training and solving the IMS objective. On the other hand, the
testing in IMS. In the training process, we use the training process of IMS is easy to parallelize since each
ground-truths of the previous L−1 steps to generate the forecasting horizon can be trained in isolation from the
Lth-step prediction. While in the testing process, we others [26]. Moreover, directly optimizing a L-step-
feed back the model predictions instead of the ground- ahead forecasting loss may fail when the forecasting
truths to the forecaster. This makes the model prone to model is highly nonlinear, or the parameters are not
accumulative errors in the forecasting process [21]. For well-initialized [21].
4

Due to these trade-offs, models for STSF problems size which will be very large for complex and nonlinear
choose the strategies that best match the problems’ models. The variance of the boosting strategy is smaller
characteristics. Generally speaking, DMS is preferred because the basic model is linear and the weak-learners
for short-term prediction while IMS is preferred for only consider the interaction between two elements.
long-term forecast. For example, the IMS strategy is Although the author only proposed the boosting
widely adopted in solving TF-MPC problems where strategy for the univariate case, the method, which lies
the forecasting horizons are long, which are generally in the mid-ground between IMS and DMS, should ap-
above 12 [4], [13] and can be as long as 100 [14]. The ply to the STSF problems with multiple measurements,
DMS strategy is used in STSF-RG problems where which leads to potential future works as discussed in
the dimensionality of data is usually very large and Section 5.
only short-term predictions are required [5], [6]. For
STSF-IG problems, some works adopt DMS when the
2.5 Scheduled Sampling
overall dimensionality of the forecasting sequence is
acceptable for the direct approach [3], [27]. The idea of Scheduled Sampling (SS) [21] is first to
Overall, IMS is easier to train but less accurate train the model with IMS and then gradually replaces
for multi-step forecasting while DMS is more difficult the ground-truths in the objective function with sam-
to train but more accurate. Since IMS and DMS are ples generated by the model. When all ground-truth
somewhat complementary, later studies have tried to samples are replaced with model-generated samples,
bridge the gap between these two approaches. We will the training objective becomes the DMS objective. The
introduce two representative strategies in the following generation process of SS is described in (9):
two sections. ∀1 ≤ h ≤ L,

2.4 Boosting Strategy X̂t+h ∼ p(X | Ft , X̄t+1:t+h−1 ; φ),


(9)
The boosting strategy proposed in [20] combines the X̄t+h = (1 − τt+h )X̂t+h + τt+h X̃t+h ,
IMS and DMS for univariate time series forecasting. τt+h ∼ B(1, k ).
The strategy boosts the linear forecaster trained by IMS
with several small and nonlinear adjustments trained by Here, X̂t+h and X̃t+h are the generated sample and
DMS. The model assumption is shown in (8). the ground-truth at forecasting horizon h, correspond-
ingly. p(X | Ft , X̄t+1:t+h−1 ; φ) is the basic one-step-
Ih
X ahead forecasting model. τt+h is a random variable
x̃t+h = m(h) (Ft ; φ)+ νli (x̃t−j , x̃t−k ; ψi )+et,h . following the binomial distribution, which controls
i=1
(8) whether to use the ground-truth or the generated sam-
Here, m(Ft ; φ) is the basic one-step-ahead forecasting ple. k is the probability of choosing the ground-truth
model, li (x̃t−j , x̃t−k ; ψi ) is the chosen weak learner at the k th iteration. In the training phase, SS minimizes
at step i, Ih is the boosting iteration number, ν is the the distance between X̂t+1:t+L and X̃t+1:t+L , which
shrinkage factor and et,h is the error term. The author is shown in (10). In the testing phase, τ is fixed to 0,
set the base linear learner to be the auto-regressive meaning that the model-generated samples are always
model and set the weak nonlinear learners to be the used.
 
penalized regression splines. Also, the weak learners Ep(X̂t+1:t+L |Ft ;φ) d(X̂t+1:t+L , X̃t+1:t+L )
were designed to only account for the interaction be- 
tween two elements in the observation history. The = Ep(τt+1:t+L−1 ) d(X̂t+1:t+L , X̃t+1:t+L )
L
(10)
training method is similar to the gradient boosting Y 
algorithm [28]. m(Ft ; φ) is first trained by minimizing p(X̂t+1 | Ft , X̄t+1:t+h−1 ; φ) .
the IMS objective (3). After that, the weak-learners are h=1

fit iteratively to the gradient residuals. If k equals to 1, the ground-truth will always be
The author also performed a theoretical analysis of chosen and the objective function will be the same as
the bias and variance of different models trained with in the IMS strategy. If k is 0, the generated sample
DMS, IMS and the boosting strategy for 2-step-ahead will always be chosen and the optimization objective
prediction. The result shows that the estimation bias of will be the same as in the DMS strategy. Based on this
DMS can be arbitrarily small while that of the IMS will observation, the author proposed to gradually decay
be amplified for larger forecasting horizon. The model k during training to make the optimization objective
estimated by the boosting strategy also has a bias, but shift smoothly from IMS to DMS, which is a type
it is much smaller than the IMS approach due to the of curriculum learning [29]. The paper also provided
compensation of the weak learners trained by DMS. some ways to change the sampling ratio.
Also, the variance of the DMS approach depends on One issue of SS is that the expectation over τt+h s
the variability of the input history Ft and the model in (10) makes the loss function non-differentiable. The
5

original paper [21] obviates the problem by treating performance relies heavily on the quality of the human-
X̂t+h s as constants, which does not optimize the true engineered features. We will briefly introduce two
objective [25], [30]. representative feature-based methods for two STSF-IG
The SS strategy is widely adopted in DL models problems to illustrate the approach.
for STSF. In [31], the author applied SS to video Ohashi and Torgo [34] proposed a feature-based
prediction and showed that the model trained via SS method for wind-speed forecasting. The author de-
outperformed the model trained via IMS. In [24], the signed a set of features called spatiotemporal indicators
author proposed a training strategy similar to SS except and applied several regression algorithms including
that the training curriculum is adapted from IMS to regression tree, support vector regression, and random
DMS by controlling the number of the prediction steps forest. There are two kinds of spatiotemporal indicators
instead of controlling the frequency of sampling the defined in the paper: the average or weighted average
ground-truth. In [32], [33], the author used SS to train of the observed historical values within a space-time
models for traffic speed forecasting and showed it neighborhood and the ratio of two average values calcu-
outperformed IMS. lated with different distance thresholds. The regression
model takes the spatiotemporal indicators as the input
to generate the one-step-ahead prediction. Experiment
3 C LASSICAL M ETHODS FOR STSF
results show that the spatiotemporal indicators are
In this section, we will review the classical methods helpful for enhancing the forecasting accuracy.
for STSF, including feature-based methods, state space Zheng et al. [3] proposed a feature-based method
models, and Gaussian process models. These methods for air quality forecasting. The author adopted the DMS
are generally based on the shallow models or spa- approach and combined multiple predictors, including
tiotemporal kernels. Since a spatiotemporal sequence the temporal predictor, the spatial predictor, and the in-
can be viewed as a multivariate sequence if we flatten flection predictor, to generate the features at location k
the Xt s into vectors or treat the observations at every and timestamp t. The temporal predictor focuses on the
location independently, methods designed for the gen- temporal side of the data and uses the measurements
eral multivariate time-series forecasting are naturally within a temporal sliding window to generate the fea-
applicable to STSF. However, directly doing so ignores tures. The spatial predictor focuses on the spatial side
the spatiotemporal structure within the data. There- of the data and uses the previous measurements within
fore, most algorithms introduced in this section have a nearby region to generate the features. Also, to deal
extended this basic approach by utilizing the domain with the sparsity and irregularity of the coordinates, the
knowledge of the specific task or considering the inter- author partitioned the neighborhood region into several
actions among different measurements and locations. groups and aggregated the inner measurements within
a group. The inflection predictor extracts the sudden
3.1 Feature-based Methods changes within the observations. For the regression
Feature-based methods solve the STSF by training a model, the temporal predictor uses the linear regres-
regression model based on some human-engineered sion model, the spatial predictor uses a 2-hidden-layer
spatiotemporal features. To reduce the problem size, neural network and the inflection predictor used the
most feature-based methods treat the measurements at regression tree model. A regression tree model further
the K locations to be independent given the extracted combines these predictors.
features. The high-level IMS and DMS models of
the feature-based methods are given in (11) and (12)
respectively. 3.2 State Space Models

X̂t+1,k = fIMS (Ψk (Ft ); φ), (11) The State Space Model (SSM) adopts a generative
viewpoint of the sequence. For the observed sequence
X̂t+1:t+L,k = fDMS (Ψk (Ft ; φ). (12)
X1:T , SSM assumes that each element Xt is generated
Here, Ψk (Ft ) represents the features at location k . by a hidden state Ht and these hidden states, which
Once the feature extraction method Ψk is defined, are allowed to be discrete in our definition, form a
any regression model f can be applied. The feature- Markov process. The general form of the state space
based methods are usually adopted in solving STSF- model [35] is given in (13). Here, g is the transition
IG problems. Because the observations are sparsely model, f is the observation model, Ht is the hidden
and irregularly distributed in STSF-IG, it is sometimes state, t is the noise of the transition model, and σt
more straightforward to extract human-engineered fea- is the noise of the observation model. The SSM has
tures than to develop a purely machine learning based a long history in forecasting [36] and many classical
model. Despite their straightforwardness, the feature- time-series models like Markov Model (MM), Hidden
based methods have a major drawback. The model’s Markov Model (HMM), Vector Autoregression (VAR),
6

and Autoregressive Integrated Moving Average Model (EM) algorithm, respectively [35]. It is worth empha-
(ARIMA) can be written as SSMs. sizing that the posterior of HMM is tractable even
for non-Gaussian observation models. This is because
Ht = g(Ht−1 , t ), (13)
the states of HMM are discrete and it is possible to
Xt = f (Ht , σt ). enumerate all the possibilities.
Linear-Gaussian SSM: When the states are con-
Given the underlying hidden state Ht , the obser-
tinuous and both g and f in (13) are linear-Gaussian,
vation Xt is independent concerning the historical
the SSM becomes the Linear-Gaussian State Space
information Ft . Thus, the posterior distribution of the
Model (LG-SSM) [35]:
sequence for the next L steps can be written as follows:
Z Ht = At Ht−1 + t ,
p(Xt+1:t+L |Ft ) = p(Ht | Ft ) Xt = Ct Ht + σt ,
(15)
t+L
Y (14) t ∼ N (0, Qt ),
p(Xi | Hi )p(Hi | Hi−1 )dHt:t+L σt ∼ N (0, Rt ),
i=t+1
where Qt and Rt are covariance matrices of the
The advantage of SSM for solving STSF is that
Gaussian noises. A large number of classical time-
it naturally models the uncertainty of the dynamical
series models like VAR and ARIMA can be written in
system due to its Bayesian formulation. The posterior
the form of LG-SSM [35], [38]. As both the transition
distribution (14) encodes the probability of different
model and the observation model are linear-Gaussian,
future outcomes. We can not only get the most likely
the posterior is also a linear-Gaussian and we can use
future but also know the confidence of our prediction
the Kalman Filtering (KF) algorithm [39] to calculate
by Bayesian inference. Common types of SSMs used
the mean and variance. Similar to HMM, the model
in STSF literatures have either discrete hidden states
parameters can also be learned by EM.
or linear transition models and have limited represen-
tational power [37]. In this section, we will focus on
these common types of SSMs and leave the review of 3.2.2 Group-based Methods
more advanced SSMs, which are usually coupled with Group-based methods divide the N locations into non-
deep architectures, in Section 4.2. We will first briefly intersected groups and train an independent SSM for
review the basics of three common classes of SSMs each group. To be more specific, for the location
and then introduce their representative spatiotemporal set S = {1, 2, ..., N }, the group-based methods as-
extensions, including the group-based method, Space- sume that it can be decomposed as S = ∪K i=1 Gi
time Autoregressive Moving Average (STARIMA), and where Gi ∩ Gj = ∅, ∀i 6= j . The measurement
the low-rank tensor auto-regression. sequences that belong to the same group, i.e., Di =
{{X1,j , ..., Xt,j }, j ∈ Gi }, are used to fit a single
SSM. If K is equal to 1, it reduces to the simplest case
3.2.1 Common Types of SSMs
where the measurement sequences at all locations share
In STSF literature, the following three classes of SSMs the same model.
are most common. This technique has mainly been applied to TF-MPC
First-order Markov Model: When we set Ht problems. Asahara et al. [40] proposed a group-based
to be Xt−1 in (13) and constrain it to be discrete, extension of MM called Mixed-Markov-chain Model
the SSM reduces to the first-order MM. Assuming (MMM) for pedestrian-movement prediction. MMM
Xt s have C choices, the first-order MM is deter- assumes that the people belonging to a specific group
mined by the transition matrix A ∈ RC×C where satisfy the same first-order MM. To use the first-order
Ai,j = p(Xt+1 = i | Xt = j) and the prior MM, the author manually discretized the coordinates
distribution π ∈ RC where πi = p(X = i). The max- of each person. The group-assignments along with the
imum posterior estimator can be obtained via dynamic model parameters are learned by EM. Experiments
programming and the optimal model parameters can be show that MMM outperforms MM and HMM in this
solved in closed form [35]. task. Mathew et al. [41] extended the work by using
Hidden Markov Model: When the states are HMM as the base model. Also, in [41], the groups are
discrete, the SSM turns into the HMM. Assuming determined by off-line clustering algorithms instead of
the Ht s have C choices, the HMM is determined by EM.
the transition matrix A ∈ RC×C , the state prior π Group-based methods capture the intra-group rela-
and the class conditional density p(Xt | Ht = i). tionships by model sharing. However, it cannot capture
For HMM, the maximum posterior estimator and the the inter-group correlations. Also, only a single group
optimal model parameters can be calculated via the assignment is considered while there may be multi-
forward algorithm and the Expectation Maximization ple grouping relationships in the spatiotemporal data.
7

Moreover, the latent state transitions of SSMs have


some internal grouping effects. It is often not necessary
to explicitly divide the locations into different groups.
Due to these limitations, group-based methods are less
mentioned in later research works.

3.2.3 STARIMA
Fig. 1. Illustration of the spatial order relationships in
STARIMA [42], [43], [44] is a classical spatiotem- STARIMA. From left to right are spatial order equals to 1,
poral extension of the univariate ARIMA model [45]. 2 and 3. The blue circles are neighbors of the white circle in
STARIMA emphasizes the impact of a location’s spa- the middle.
tial neighborhoods on its future values. The funda-
mental assumption is that the future measurements
it is designed for problems where there is only one
at a specific location i is not only related to the
measurement at each location and has not considered
past measurements observed at this location but also
the case of multiple correlated measurements.
related to the past measurements observed at its local
neighborhoods. STARIMA is designed for univariate
3.2.4 Low-rank Tensor Auto-regression
STSF-RG and STSF-IG problems where there is a
single measurement at each location. If we denote the Bahadori et al. [46] proposed to train an auto-regression
measurements at timestamp t as xt ∈ RN , the model model with low-rank constraint to better capture the la-
of STARIMA is shown in the following: tent structure of the spatiotemporal data. This method is
designed for STSF-IG and STSF-RG problems where
p X
λk the coordinates are fixed. Assuming the observed mea-
surements Xt ∈ RN ×D satisfy the VAR(J ) model,
X
∆d x t = φkl W(l) ∆d xt−k + t
k=1 l=0 the optimization problem of the low-rank tensor auto-
p Xmk
(16) regression is given in (18):
X
(l)
− θkl W t−k .
k=1 l=0 T T
2
X X
Here, ∆ is the temporal difference operator where min kX̂t − Xt kF + µ tr(X̂Tt LX̂t ),
W
∆xt = xt − xt−1 and ∆k xt = ∆{∆k−1 xt }. φkl t=J+1 t=1
(18)
and θkl are respectively the autoregressive parameter s.t. r(W) ≤ ρ,
and the moving average parameter at temporal lag k X̂t+1,k = Wk Xt−J+1:t , ∀t, k.
and spatial lag l. λk , mk are maximum spatial orders.
W(l) is a predefined N × N matrix of spatial order l. Here, W ∈ RN ×D×JD is the three-dimensional
t is the normally distributed error vector at time t that weight tensor. L ∈ RD×D is the spatial Laplacian
satisfies the following relationships: matrix calculated from the fixed coordinate matrix
C ∈ RN ×E . r(W) means the rank of the weight
tensor and ρ is the maximum rank. The regularization
E[t ] = 0, term in the objective function constrains the predicted
( results to be spatially smooth. µ is the strength of this
h i σ 2 I, s = 0
E t Tt+s = , (17)
spatial similarity constraint.
0, otherwise Constraining the rank of the weight tensor imposes
h i
E xt Tt+s = 0, ∀s > 0. an implicit spatiotemporal structure in the data. When
the weight tensor is low-rank, the observed sequences
can be described with a few latent factors.
The optimization problem (18) is non-convex and
The spatial order matrix W(l) is the key in NP-hard in general [47] due to the low-rank con-
STARIMA for modeling the spatiotemporal correla- straint. One standard approach is to relax the rank
(l)
tions. Wi,j is only non-zero if the site j is the lth constraint to a nuclear norm constraint and optimize
order neighborhood of site i. W(0) is defined to be the relaxed problem. However, this approach is not
the identity matrix. The spatial order relationship in computationally feasible for solving (18) because of the
a two-dimensional coordinate system is illustrated in high dimensionality of the spatiotemporal data. Thus,
Figure 1. However, the W(l) s need to be predefined researchers [46], [48], [49] proposed optimization al-
by the model builder and cannot vary over-time. Also, gorithms to accelerate the learning process.
the STARIMA still belongs to the class of LG-SSM. It The primary limitation of the low-rank tensor auto-
is thus not suitable for modeling nonlinear spatiotem- regression is that the regression model is still linear
poral systems. Another problem of STARIMA is that even if we have perfectly learned the weights. We
8

cannot directly apply it to solve forecasting problems also a Gaussian distribution as shown in (21).
that have strong nonlinearities.
p(f ? | X ? , X D , Y D ) = N (µ? , Σ? ),
K = κ(X D , X D ) + σ 2 I
µ? = κ(X ? , X D )K−1 Y D ,
3.3 Gaussian Process Σ? = κ(X ? , X ? ) − κ(X ? , X D )K−1 κ(X ? , X D ),
(21)
Gaussian Process (GP) is a non-parametric Bayesian If the observation model is non-Gaussian, which is
model. GP defines a prior p(f ) in the function space common for STSF problems [53], Laplacian approx-
and tries to infer the posterior p(f | D) given the imation can be utilized to approximate the poste-
observed data [35]. GP assumes the joint distribu- rior [35]. The inference process of GP is computa-
tion of the function values of an arbitrary finite set tionally expensive due to the involvement of K−1
3
{x1 , x2 , ..., xn }, i.e., p(f (x1 ), ..., f (xn )), is a Gaus- and the naive approach requires O(|D| ) computations
2
sian distribution. A typical form of the GP prior is and O(|D| ) storage. There are many acceleration
shown in (19): techniques to speed up the inference [53]. We will not
investigate them in detail in this survey and readers can
f ∼ GP(m(x), κ(x, x0 )), (19) refer to [51], [53].
where m(x) is the mean function and κ(x, x0 ) is the The key ingredient of GP is the kernel function.
kernel function. For STSF problems, the kernel function should take
In geostatistics, GP regression is also known as both the spatial and temporal correlations into account.
kriging [50], [51], which is used to interpolate the The common strategy for defining complex kernels is
missing values of the data. When applied to STSF, GP to combine multiple basic kernels by operations like
is usually used in STSF-IG and STSF-RG problems summation and multiplication. Thus, GP models for
where the coordinates are fixed. The general formula STSF generally use different kernels to model different
of the GP based models is shown in (20): aspects of the spatiotemporal sequence and ensemble
them together to form the final kernel function. In [53],
f ∼ GP(m(X), κθ ), the author proposed to decompose the spatiotempo-
(20)
Y ∼ p(Y | f (X)). ral kernel as κst ((s, t), (s0 , t0 )) = κs (s, s0 )κt (t, t0 )
where κs is the spatial kernel and κt is the temporal
Here, X contains the space-time positions of the
kernel [53]. If we denote the kernel matrix corre-
samples, which is the concatenation of the spatial
sponding to κst , κs , κt as Kst , Ks , Kt , this decom-
coordinates and the temporal index. For example,
position results in Kst = Ks ⊗ Kt where ⊗ is
Xk = (tk , ik , jk ) where tk is the timestamp and
the Kronecker product. Based on this observation, the
(ik , jk ) is the spatial coordinate. κθ is the kernel
author utilized the computational benefits of Kronecker
function parameterized by θ. p(Y | f (X)) is the
algebra to scale up the inference of the GP models [54].
observation model. Like SSM, GP is also a generative
In [16], the author combined two spatial kernels and
model. The difference is that GP directly generates
one temporal kernel to construct a GP model for
the function describing the relationship between the
seasonal influenza prediction. The author defined the
observation and the future while SSM generates the
basic spatial kernels as the average of multiple RBF
future measurements one-by-one through the Markov
kernels. The temporal kernel was defined to be the sum
assumption. We need to emphasize here that GP can
of the periodic kernel, the Pacioreks kernel, and the
also be applied to solve TF-MPC problems. A well-
exponential kernel. The final kernel function used in
known example is the Interacting Gaussian Processes
the paper was κfinal = κspace + κtime + κtime × κspace2 .
(IGP) [52], in which a GP models the movement of
each person, and a potential function defined over the
positions of different people models their interactions. 3.4 Remarks
Nevertheless, the method for calculating the posterior In this section, we reviewed three types of classi-
of IGP is similar to that of (20) and we will stick to this cal methods for solving the STSF problems, feature-
model formulation. based methods, SSMs, and GP models. The feature-
To predict the measurements given a GP model, based methods are simple to implement and can pro-
we are required to calculate the predictive posterior vide useful predictions in some practical situations.
p(f ? | X ? , X D , Y D ) where {X D , Y D } are the ob- However, these methods rely heavily on the human-
served data points and X ? is the candidate space- engineered features and may not be applied without
time coordinates to infer the measurements. If the the help of domain experts. The SSMs assume that the
observation model is a Gaussian distribution, i.e., p(Y | observations are generated by Markovian hidden states.
f (X)) = N (f (X), σ 2 ), the posterior distribution is We introduced three common types of SSMs: first-
9

order MM, HMM and LG-SSM. Group-based methods TABLE 2


extend SSMs by creating groups among the states or Activations that appear in models reviewed in this survey.
observations. STARIMA extends ARIMA by explicitly
modeling the impact of nearby locations on the future. Name Formula
Low-rank tensor autoregression extends the basic au- σ (sigmoid) σ(x) = 1+e1−x
x −x
toregression model by adding low-rank constraints on tanh tanh(x) = eex −e
+e−x
the model parameters. The advantage of these common ReLU [58] ReLU(x) = max(0, x)
types of SSMs for STSF is that we can calculate Leaky ReLU [59] LReLU(x) = max(−αx, x)
the exact posterior distribution and conduct Bayesian
inference. However, the overall nonlinearity of these
models are limited and may not be suitable for some
joint distribution and the conditional distributions are
complex STSF problems like video prediction. GP
defined in (22) [26]:
models the inner characteristics of the spatiotemporal
sequences by spatiotemporal kernels. Although being 1
p(v, h) = exp(−E(v, h)),
very powerful and flexible, GP has high computational Z
and storage complexity. Without any optimization, the E(v, h) = −bT v − cT h − vT Wh,
3
inference of GP requires O(|D| ) computations and nh
Y  
2
O(|D| ) storage, which makes it not suitable for cases p(h | v) = σ (2h − 1) ◦ (c + WT v) ,
j
when a huge number of training instances are available. j=1
Ynv
p(v | h) = σ ((2v − 1) ◦ (b + Wh))j .
j=1
(22)
4 D EEP L EARNING M ETHODS FOR STSF Here, b is the bias of the visible state, c is the bias of
the hidden state, W is the weight between visible states
and hidden states, σ is the sigmoid function defined in
Deep Learning (DL) is an emerging new area in ma- Table 2, and Z is the normalization constant for energy-
chine learning that studies the problem of how to learn based models, which is also known as the partition
a hierarchically-structured model to directly map the function [35]. RBM can be learned by Contrastive Di-
raw inputs to the desired outputs [26]. Usually, DL vergence (CD) [56], which approximates the gradient
model stacks some basic learnable blocks, or layers, to of log Z by drawing samples from k cycles of MCMC.
form a deep architecture. The overall network is then
The basic form of RBM has been extended in var-
trained end-to-end. In recent years, a growing number
ious ways. Gaussian-Bernoulli RBM (GRBM) [57] as-
of researchers have proposed DL models for STSF. To
sumes the visible variables are continuous and replaces
better understand these models, we will start with an
the Bernoulli distribution with a Gaussian distribution.
introduction of the relevant background of DL and then
Spike and Slab RBM (ssRBM) maintains two types of
review the models for STSF.
hidden variables to model the covariance.
Activations: Activations refer to the element-
wise nonlinear mappings h = f (x). Activation func-
4.1 Preliminaries tions that are used in the reviewed models are listed in
Table 2.
4.1.1 Basic Building Blocks Sigmoid Belief Network: Sigmoid Belief Net-
work (SBN) is another building block of DGM. Unlike
In this section, we will review the basic blocks that RBM which is undirected, SBN is a directed graphical
have been used to build DL models for STSF. Since model. SBN contains a hidden binary state h and a
there is a vast literature on DL, the introduction of this visible binary state v. The basic generation process of
part serves only as a brief introduction. Readers can SBN is given in the following:
refer to [26] for more details. T
p(vm | h) = σ(wm h + cm ),
Restricted Boltzmann Machine: Restricted (23)
p(hj ) = σ(bj ).
Boltzmann Machine (RBM) [55] is the basic building
block of Deep Generative Models (DGMs). RBMs are Here, bj , cm are biases and wm is the weight. Be-
undirected graphical models that contain a layer of cause the posterior distribution is intractable, SBN is
observable variables v and another layer of hidden learned by methods like Neural Variational Inference
variables h. These two layers are interconnected and and Learning (NVIL) [60] and Wake-Sleep (WS) [61]
form a bipartite graph. In its simplest form, h and algorithm, which train an inference network along with
v are all binary and contain nh and nv nodes. The the generation network.
10

Fully-connected Layer: The Fully-Connected Graph Convolution: Graph convolution general-


(FC) layer is the very basic building block of DL izes the standard convolution, which is operated over
models. It refers to the linear-mapping from the input a regular grid topology, to convolution over graph
x to the hidden state h, i.e, h = Wx + b, where W structures [33], [66], [67], [68]. There are two inter-
is the weight and b is the bias. pretations of graph convolution: the spectral interpre-
Convolution and Pooling: The convolution layer tation and the spatial interpretation. In the spectral
and pooling layer are first proposed to take advantage interpretation [66], [69], graph convolution is defined
of the translation invariance property of image data. by the convolution theorem, i.e., the Fourier transform
The convolution layer computes the output by scanning of convolution is the point-wise product of Fourier
over the input and applying the same set of linear filters. transforms:
Although the input can have an arbitrary dimensional- x ∗G y = F −1 (F (x) ◦ F (y)) = U(UT x ◦ UT y).
ity [62], we will just introduce 2D convolution and 2D (27)
pooling in this section. For input X ∈ RCi ×Hi ×Wi , Here, F (x) = UT x is the graph Fourier transform.
the output of the convolution layer H ∈ RCo ×Ho ×Wo , The transformation matrix U is defined based on
which is also known as feature map, will be calculated the eigendecomposition of the graph Laplacian. Def-
as the following: ferrard et al. [69] defined U as the eigenvectors of
1 1
H = W ∗ X + b. (24) I − D− 2 AD− 2 , in which D is the diagonal degree
matrix, A is the adjacency matrix, and I is the identity
Here, W ∈ RCi ×Co ×Kh ×Kw is the weight, Kh and matrix.
Kw are the kernel sizes, “∗” denotes convolution, and In the spatial interpretation, graph convolution is
b is the bias. If we denote the feature vector at the defined by a localized parameter-sharing operator that
(i, j) position as H:,i,j , the flattened weight as W = aggregates the features of the neighboring nodes, which
vec(W), and the flattened version of the local region can also be termed as graph aggregator [33], [70]:
inputs as xN (i,j) = vec([X:,s,t | (s, t) ∈ N (i, j)]),
(24) can be rewritten as (25): hi = γθ (xi , {zNi }), (28)

H:,i,j = WxN (i,j) + b. (25) where hi is the output feature vector of node i, γθ (·)
is the mapping function parameterized by θ, xi is the
This shows that convolution layer can also be viewed input feature of node i, and zNi contains the features of
as applying multiple FC layers with shared weight node i’s neighborhoods. The mapping function needs
and bias on local regions of the input. Here, the to be permutation invariant and dynamically resizing.
ordered neighborhood set N (i, j) is determined by One example is the pooling aggregator defined in (29):
hyper-parameters like the kernel size, stride, padding,
and dilation [63]. For example, if the kernel size of hi = φo (xi ⊕ poolj∈Ni (φv (zj ))), (29)
the convolution filter is 3 × 3, we have N (i, j) = where φo and φv are mapping functions and ⊕ means
{(i + s, j + t) | −1 ≤ s, t ≤ 1}. Readers can concatenation. Compared with the spectral interpre-
refer to [26], [63] for details about how to calculate tation, the spatial interpretation does not require the
the neighborhood set N (i, j) based on these hyper- expensive eigendecomposition and is more suitable for
parameters. Also, the out-of-boundary regions of the large graphs. Moreover, Defferrard et al. [69] and Kipf
input are usually set to zero, which is known as zero- & Welling [68] proposed accelerated versions of the
padding. spectral graph convolution that could be interpreted
Like convolution layer, the pooling layer also scans from the spatial perspective.
over the input and applies the same mapping function.
Pooling layer generally has no parameter and uses 4.1.2 Types of DL Models for STSF
some reduction operations like max, sum, and average Based on the characteristic of the model formulation,
for mapping. The formula of the pooling layer is given we can categorize DL models for STSF into two major
in the following: types: Deep Temporal Generative Models (DTGMs)
Hk,i,j = g({Xk,s,t | (s, t) ∈ N (i, j)}), (26) and Feedforward Neural Network (FNN) and Recurrent
Neural Network (RNN) based methods. In the follow-
where g is the element-wise reduction operation. ing, we will briefly introduce the basics about DTGM,
Deconvolution and Unpooling: The deconvolu- FNN, and RNN.
tion [64] and unpooling layer [65] are first proposed to DTGM: DTGMs are DGMs for temporal se-
serve as the “inverse” operation of the convolution and quences. DTGM adopts a generative viewpoint of the
the pooling layer. Computing the forward pass of these sequence and tries to define the probability distribu-
two layers is equivalent to computing the backward tion p(X1:T ). The aforementioned RBM and SBN are
pass of the convolution and pooling layers [64]. building-blocks of DTGM.
11

FNN: FNNs, also called deep feedforward net- is turned on. Also, the cell can be updated or cleared
works [26], refer to deep learning models that construct when it or ft is activated. By using the memory cell
the mapping y = f (x; θ) by stacking various of the and gates to control information flow, the gradient
aforementioned basic blocks such as FC layer, convolu- will be trapped in the cell and will not vanish over
tion layer, deconvolution layer, graph convolution layer, time, which is also known as the Constant Error
and activation layer. Common types of FNNs include Carousel (CEC) [75]. LSTM belongs to a broader
Multilayer Perceptron (MLP) [26], [71], which stacks category of RNNs called gated RNNs [26]. We will
multiple FC layers and nonlinear activations, Convo- not introduce other architectures like Gated Recur-
lutional Neural Network (CNN) [72], which stacks rent Unit (GRU) in this survey because LSTM is the
multiple convolution and pooling layers, and graph most representative example. To simplify the notation,
CNN [69], which stacks multiple graph convolution we will also denote the transition rule of LSTM as
layers. The parameters of FNN are usually estimated by ht , ct = LSTM(ht−1 , ct−1 , xt ; W).
minimizing the loss function plus some regularization
terms. Usually, the optimization problem is solved via
4.2 Deep Temporal Generative Models
stochastic gradient-based methods [26], in which the
gradient is computed by backpropagation [73]. In this section, we will review DTGMs for STSF.
Similar to SSM and GP model, DTGM also adopts a
RNN: RNN generalizes the structure of FNN by
generative viewpoint of the sequences. The character-
allowing cyclic connections in the network. These re-
istics of DTGMs is that models belonging to this class
current connections make RNN especially suitable for
can all be stacked to form a deep architecture, which
modeling sequential data. Here, we will only introduce
enables them to model complex system dynamics. We
the discrete version of RNN which updates the hidden
will survey four DTGMs in the following. The general
states using ht = f (ht−1 , xt ; θ) [26]. These hidden
structures of these models are shown in Figure 2.
states can be further used to compute the output and
Temporal Restricted Boltzmann Machine:
define the loss function. One example of an RNN
Sutskever & Hinton [76] proposed Temporal Restricted
model is shown in (30). The model uses the tanh
Boltzmann Machine (TRBM), which is a temporal
activation and the fully-connected layer for recurrent
extension of RBM. In TRBM, the bias of the RBM in
connections. It uses another full-connected layer to get
the current timestamp is determined by the previous m
the output:
RBMs. The conditional probability of TRBM is given
ht = tanh(Wh ht−1 + Wx xt + b), in the following:
(30)
ot = Wo ht + c. 1  
p(vt , ht | v̂t , ĥt ) = exp −E(vt , ht | v̂t , ĥt ) ,
Similar to FNNs, RNNs are also trained by stochastic Z
gradient based methods. To calculate the gradient, we E(vt , ht | v̂t , ĥt ) = −fbv (v̂t )T vt − fbh (v̂t , ĥt )T ht
can first unfold an RNN to a FNN and then perform − vtT Wht ,
backpropagation on the unfolded computational graph. Xm
This algorithm is called Backpropagation Through fbv (v̂t ) = a + ATi vt−i ,
Time (BPTT) [26]. i=1
m m
Directly training the RNN shown in (30) with X X
fbh (v̂t , ĥt ) =b + BTi vt−i + CTi ht−i ,
BPTT will cause the vanishing and exploding gradient
i=1 i=1
problems [74]. One way to solve the vanishing gra- (32)
dient problem is to use the Long-Short Term Memory where v̂t = vt−m:t−1 and ĥt = ht−m:t−1 . The
(LSTM) [75], which has a cell unit that is designed joint distribution is modeled as the multiplication of
to capture the long-term dependencies. The formula of the conditional probabilities, i.e., p(v1:T , h1:T ) =
QT
LSTM is given in the following: i=1 p(vt , ht | v̂t , ĥt ).
it = σ(Wxi xt + Whi ht−1 + bi ), The inference of TRBM is difficult and involves
evaluating the ratio of two partition functions [77].
ft = σ(Wxf xt + Whf ht−1 + bf ),
The original paper adopted a heuristic inference pro-
ct = ft ◦ ct−1 + cedure [76]. The results on a synthetic bouncing ball
(31)
it ◦ tanh(Wxc xt + Whc ht−1 + bc ), dataset show that a two-layer TRBM outperforms a
ot = σ(Wxo xt + Who ht−1 + bo ), single-layer TRBM in this task.
Recurrent Temporal Restricted Boltzmann Ma-
ht = ot ◦ tanh(ct ).
chine: Later, Sutskever et al. [77] proposed the
Here, ct is the memory cell and it , ft , ot are the input Recurrent Temporal Restricted Boltzmann Machine
gate, the forget gate, and the output gate, respectively. (RTRBM) that extends the idea of TRBM by condition-
The memory cell is accessed by the next layer when ot ing the parameters of the RBM at timestamp t on the
12

output of an RNN. The model of RTRBM is described


in the following: h1
U
h2 h3 h4 h1 h2 h3 h4

1 W W

p(vt , ht | rt−1 ) = exp (−E(vt , ht | rt−1 )) , v1 v2 v3 v4 v1 v2 v3 v4

Z U
U

(a) TRBM
E(vt , ht | rt−1 ) = −hTt Wvt − cT vt − bT ht r1 r2 r3 r4

− hTt Urt−1 , h1 h2 h3 h4
(b) RTRBM
(
σ(Wvt + Urt−1 + b), t > 1
rt = . v1 v2 v3 v4

σ(Wvt + binit ), t=1 (d) TSBN (1­layer) MW ∘ W


h1 h2 h3 h4

(33) v1 v2 v3 v4

Here, rt encodes the visible states v1:t . The inference h


(2)

1
h
(2)

2
h
(2)

3
h
(2)

4
MU ∘ U
MU ∘ U

is simpler in RTRBM than TRBM because the model (1) (1) (1) (1)
r1 r2 r3 r4

h h h h

is the same as the standard RBM once rt−1 is given. 1 2 3 4

(c) SRTRBM

Similar to RBM, RTRBM can also be learned by CD. v1 v2 v3 v4

The author also extended RTRBM to use GRBM for (e) TSBN (2­layers)

modeling continuous time-series. Qualitative experi-


ments on the synthetic bouncing ball dataset show that
RTRBM can generate more realistic sequences than Fig. 2. Structure of the surveyed four types of deep temporal
generative models: TRBM, RTRBM, SRTRBM and TSBN.
TRBM.
Structured Recurrent Temporal Restricted
Boltzmann Machine: Mittelman et al. [78] proposed update the mask. The author also gives the ssRBM
Structured Recurrent Temporal Restricted Boltzmann version of SRTBM and RTRBM by replacing RBM
Machine (SRTRBM) that improves upon RTRBM by with ssRBM while fixing the part of the energy func-
learning the connection structure based on the spa- tion related to rt to be unchanged. Experiments on
tiotemporal patterns of the input. SRTRBM uses two human motion prediction, which is a TF-MPC task,
block-masking matrices to mask the weights between and climate data prediction, which is an STSF-IG task,
the hidden and visible nodes. The formula of SRTRBM prove that SRTRBM achieves smaller one-step-ahead
is similar to RTRBM except for the usage of two prediction error than RTRBM and RNN. This shows
masking matrices MW and MU : that using a masked weight matrix to model the spatial
and temporal correlations is useful for STSF problems.
E(vt , ht | rt−1 ) = −hTt Ŵvt − cT vt
− bT ht − hTt Ûrt−1 , Temporal Sigmoid Belief Network: Gan et
( al. [79] proposed a temporal extension of SBN called
σ(Ŵvt + Ûrt−1 + b), t > 1
rt = , (34) Temporal Sigmoid Belief Network (TSBN). TSBN con-
σ(Ŵvt + binit ), t=1 nects a sequence of SBNs in a way that the bias
Ŵ = W ◦ MW , of the SBN at timestamp t depends on the state of
the previous SBNs. For a length-T binary sequence
Û = U ◦ MU .
v1:T , TSBN describes the following joint probability,
Here, MW and MU are block-masking matrices gen- in which we have omitted the biases for simplicity:
erated by a graph G = (V ; E). To construct the graph,
pθ (v1:T , h1:T ) = p(h1 )p(v1 | h1 )
the visible units and the hidden units are divided into
T
|V | nonintersecting subsets. Each node  in the graph Y
p(ht | ht−1 , vt−1 )p(vt | ht , vt−1 ),
is assigned to a subset of hidden units Ch and a subset
t=2
of visible units Cv . The ways to separate and assign
the units are based on the spatiotemporal coordinates
p(ht,j = 1 | ht−1 , vt−1 ) = σ(ATj ht−1 + BTj vt−1 ),
of the data. If there is an edge between two nodes in p(vt,j = 1 | ht , vt−1 ) = σ(CTj ht + DTj vt−1 ).
the graph, the corresponding hidden and visible subsets (35)
are connected in SRTRBM. Also, the author extended TSBN to the continuous
domain by using a conditional Gaussian for the visible
If the masks are fixed, the learning process of
nodes:
SRTRBM is the same as RTRBM. To learn the mask,
the paper proposed to use the sigmoid function σ(νij ) p(vt | ht , vt−1 ) = N (µt , diag(σt2 )),
as a soft mask and learn the parameters νij . Also, µt = AT ht−1 + BT vt−1 + e, (36)
a sparsity regularizer is added to νij s to encourage 0T 0T 0
log(σt2 ) = A ht−1 + B vt−1 + e .
sparse connections. The training process is first to fix
the mask while updating the other parameters and then Multiple single-layer TSBNs can be stacked to
13

form a deep architecture. For a multi-layer TSBN, the features from images and videos [62], [72]. Therefore,
hidden states of the lower layers are conditioned on the researchers have used CNNs to build the encoder and
states of the higher layers. The connection structure of the forecaster. In [83], two input frames are encoded
a multi-layer TSBN is given in Figure 2(e). The author independently to two state vectors by a 2D CNN.
investigated both the stochastic and deterministic con- The future frame is generated by linearly extrapolating
ditioning for multi-layer models and showed that the these two state vectors and then mapping the extrap-
deep model with stochastic conditioning works better. olated vector back to the image space with a series
Experiment results on the human motion prediction of 2D deconvolution and unpooling layers. Also, to
task show a multi-layer TSBN with stochastic condi- better capture the spatial features, the author proposed
tioning outperforms variants of SRTRBM and RTRBM. the phase pooling layer, which keeps both the pooled
values and pooled indices in the feature maps. In [6],
multi-scale 2D CNNs with ReLU activations were used
4.3 FNN and RNN based Models
as the encoder and the forecaster. Rather than encoding
One shortcoming of DTGMs for STSF is that the the frames independently, the author concatenated the
learning and inference processes are often complicated input frames along the channel dimension to generate
and time-consuming. Recently, there is a growing trend a single input tensor. The input tensor is rescaled to
of using FNN and RNN based methods for STSF. Com- multiple resolutions and encoded by a multi-scale 2D
pared with DTGM, FNN and RNN based methods are CNN. To better capture the spatiotemporal correlations,
simpler in the learning and inference processes. More- Vondrick et al. [9] proposed to use 3D CNNs as the
over, these methods have achieved good performance encoder and forecaster. Also, the author proposed to
in many STSF problems like video prediction [31], use a 2D CNN to predict the background image and
[80], [81], precipitation nowcasting [5], [82], traffic only use the 3D CNN to generate the foreground.
speed forecasting [17], [32], [33], human trajectory Later, Shi et al. [82] compared the performance of
prediction [4], and human motion prediction [14]. In 2D CNN and 3D CNN for precipitation nowcasting
the following, we will review these methods. and showed that 3D CNN outperformed 2D CNN in
the experiments. Kalchbrenner et al. [84] proposed the
4.3.1 Encoder-Forecaster Structure Video Pixel Network (VPN) that uses multiple causal
Recall that the goal of STSF is to predict the length-L convolution layers [85] in the forecaster. Unlike the
sequence in the future given the past observations. The previous 2D CNN and 3D CNN models, of which the
Encoder-Forecaster (EF) structure [5], [7] first encodes forecaster directly generates the multi-step predictions,
the observations into a finite-dimensional vector or VPN generates the elements in the spatiotemporal se-
a probability distribution using the encoder and then quence one-by-one with the help of causal convolution.
generates the one-step-ahead prediction or multi-step Experiments show that VPN outperforms the baseline
predictions using the forecaster: which uses a 2D CNN to generate the predictions
directly. Recently, Xu et al. [86] proposed the PredCNN
s = f (Ft ; θ1 ),
(37) network that stacks multiple dilated causal convolution
X̂t+1:t+L = g(s; θ2 ). layers to encode the input frames and predict the future.
Here, f is the encoder and g is the forecaster. In Experiments show that PredCNN achieves state-of-the-
our definition, we also allow probabilistic encoder and art performance in video prediction.
forecaster, where s and X̂t+1:t+L turn into random Besides using CNNs, researchers have also used
variables: RNNs to build models for STSF-RG. The advantage of
s ∼ f (Ft ; θ1 ), using RNN for STSF problems is that it naturally mod-
(38) els the temporal correlations within the data. Srivastava
X̂t+1:t+L ∼ πg (s; θ2 ).
et al. [7] proposed to use multi-layer LSTM networks
Using probabilistic encoder and forecaster is a way to as the encoder and the forecaster. To use the LSTM, the
handle uncertainty and we will review the details in author directly flattened the images into vectors. Oh et
Section 4.3.5. All of the surveyed FNN and RNN based al. [24] proposed to use 2D CNNs to encode the image
methods fit into the EF framework. In the following, we frames before feeding into LSTM layers. However,
will review how previous works design the encoder and the state-state connection structure of Fully-Connected
the forecaster for different types of STSF problems, i.e., LSTM (FC-LSTM) is redundant for STSF and is not
STSF-RG, STSF-IG, and TF-MPC. suitable for capturing the spatiotemporal correlations
in the data [5]. To solve the problem, Shi et al. [5]
4.3.2 Models for STSF-RG proposed the Convolutional LSTM (ConvLSTM). Con-
For STSF-RG problems, the spatiotemporal sequences vLSTM uses convolution instead of full-connection in
lie in a regular grid and can naturally be modeled both the feed-forward and recurrent connections of
by CNNs, which have proved effective for extracting LSTM. The formula of ConvLSTM is given in the
14

following:
It = σ(Wxi ∗ Xt + Whi ∗ Ht−1 + bi ),
Ft = σ(Wxf ∗ Xt + Whf ∗ Ht−1 + bf ),
Ct = Ft ◦ Ct−1 +
It ◦ tanh(Wxc ∗ Xt + Whc ∗ Ht−1 + bc ), (a) Structure of ConvRNN.
Ot = σ(Wxo ∗ Xt + Who ∗ Ht−1 + bo ),
Ht = Ot ◦ tanh(Ct ).
(39)
Here, all the states, cells, and gates are three-
dimensional tensors. The author used two ConvLSTM
layers as the encoder and the forecaster and showed that
1) ConvLSTM outperforms FC-LSTM when applied (b) Structure of TrajRNN.
to the precipitation nowcasting problem and 2) the
two-layer model performs better than the single-layer Fig. 3. Illustration of the connection structures of ConvRNN,
TrajRNN. Links with the same color share the same transi-
model. Similar to the way of extending LSTM to tion weights. Source: [82].
ConvLSTM, other types of RNNs, like GRU, can be ex-
tended to Convolutional RNNs (ConvRNNs) by using
convolution in state-state transitions. Since ConvRNN
ory. Unlike ConvLSTM, of which the memory states
combines the advantage of CNN and RNN, it is widely
are constrained inside each LSTM layer, PredRNN
adopted in later researches [31], [80], [87], [88], [89]
maintains another global memory state to store the
as a basic block for building DL models for STSF-
spatiotemporal information in the lower layers and in
RG. There are also attempts to improve the structure
the previous timestamps. In PredRNN, memory states
of ConvRNN. Shi et al. [82] proposed the Trajectory
are allowed to zigzag across the RNN layers. The
GRU (TrajGRU) model that improves ConvGRU by
structure of PredRNN is illustrated in Figure 4(b).
actively learning the recurrent connection structure.
Here, Mlt s are the global spatiotemporal memory
Because the convolutional recurrence structure adopted
states and are updated in a similar way as the cell
in ConvRNNs is location-invariant, it is not suitable for
states in ConvLSTM except that it is updated in a
modeling location-variant motions, e.g., rotation and
zigzag order. Based on PredRNN, Wang et al. [81]
scaling. Thus, TrajGRU uses a learned subnetwork to
proposed the PredRNN++ structure which added more
output the recurrent connection structures:
nonlinearities when updating Mlt s. Experiments show
Ut , Vt = γ(Xt , Ht−1 ), that PredRNN++ outperforms both PredRNN and other
L
X baseline models including TrajGRU and variants of
l ConvLSTM.
Zt = σ(Wxz ∗ Xt + Whz ∗ Ht−1,Ut,l ,Vt,l ),
l=1 Apart from studying the basic building blocks like
L
X CNNs and RNNs, another line of research tries to
l
Rt = σ(Wxr ∗ Xt + Whr ∗ Ht−1,Ut,l ,Vt,l ), disentangle the motion and content of the spatiotem-
l=1 poral sequence to better forecast the future. Jia et
Ht0 = f (Wxh ∗ Xt + al. [80] and Finn et al. [31] proposed to predict how
L
X the frames will transform instead of directly predict-
l
Rt ◦ ( Whh ∗ Ht−1,Ut,l ,Vt,l )), ing the future frames. Finn et al. [31] proposed two
l=1 transformation methods named Convolutional Dynamic
Ht = (1 − Zt ) ◦ Ht0 + Zt ◦ Ht−1 . Neural Advection (CDNA) and Spatial Transformer
(40) Predictors (STP). To predict the frame at timestamp
Here, γ is a convolutional subnetwork that out- t + 1, CDNA generates N 2D convolutional filters
puts L couples of flow fields Ut , Vt ∈ RL×H×W . {Wt1 , Wt2 , ..., WtNP
} and a masking tensor M ∈
Ht−1,Ut,l ,Vt,l means to shift the elements in Ht−1 N
RN ×H×W where c=1 Mc,i,j = 1. The previous
through the flow field Ut,l , Vt,l by bilinear warp- frame Xt is transformed N times by these filters and
ing [90]. The connection structures of ConvRNN and these N transformed frames are linearly combined by
TrajRNN are illustrated in Figure 3. Experiments on the masking tensor to generate the final prediction.
both the synthetic dataset and the real-world HKO-7 STP is similar to CDNA except that the Wtn s become
benchmark show that TrajGRU outperforms 2D CNN, the parameters of different affine transformations [90].
3D CNN, and ConvGRU. Wang et al. [91] proposed Similar to CDNA and STP, Jia et al. [80] proposed the
the Predictive RNN (PredRNN), which extends Con- Dynamic Filter Network (DFN) that directly outputs
vLSTM by adding a new type of spatiotemporal mem- the parameters of a K × K local filter for every
15

forecasting. To convert the traffic stations into a graph,


the author generated the adjacency matrix based on
the thresholded pairwise road network distances. After
that, DCRNN is used to encode the sequence of graphi-
cal data and predict the future. The formula of DCRNN
is similar to ConvRNN except that the Diffusion Con-
volution (DC) is used in the input-state and state-state
(a) Structure of a stack of three ConvLSTM
layers. transitions. DC is a type of graph convolution that
considers the direction information in the graph, which
has the following formula:
K−1
X
Y= (θk,1 (D−1 k −1 T k
O A) ) + θk,2 (DI A ) )X,
k=0
(41)
where X ∈ RN ×C is the input features, A is the
adjacency matrix, DO is the diagonal out-degree ma-
(b) Structure of PredRNN with three layers. trix, DI is the diagonal in-degree matrix, θk,1 , θk,2
are the parameters, K is the number of diffusion
Fig. 4. Comparison of multi-layer ConvLSTMs and the Pre- steps, N is the number of nodes in the graph, and
dRNN. Mt is the global spatiotemporal memory that is C is the number of the features. Unlike other graph
updated along the red links.
convolution operators like ChebNet [69] which oper-
ates on undirected graphs, DC operates on a directed
location. The prediction is generated by applying these graph and is more suitable for real-life scenarios.
filters to the previous frame. Actually, DFN can be Experiments on two real-world traffic datasets show
viewed as a special case of CDNA. If we manually that DCRNN outperforms FC-LSTM and the ChebNet-
fix the 2-dimensional convolutional filters in CDNA based Graph Convolutional Recurrent Neural Network
to have K 2 elements, of which only a single ele- (GCRNN) [32]. Later, Zhang et al. [33] proposed the
ment is non-zero, CDNA turns into DFN. Villegas et Graph GRU (GGRU) network that generalizes the idea
al. [87] proposed Motion-Content Network (MCnet) of DCRNN. Instead of proposing a single model, the
which uses two separated networks to encode the author proposed a unified method for constructing an
motion and the content. In MCnet, the input of the RNN based on an arbitrary graph convolution operator.
motion encoder is the sequence of difference images The formula of GGRU is given in the following:
{X2 − X1 , ..., Xt − Xt−1 } and the input of the content Ut =σ(ΓΘxu (Xt , Xt ; G)
encoder is the last observed image Xt . The encoded
+ ΓΘhu (Xt ⊕ Ht−1 , Ht−1 ; G)),
motion features and content features are fused to-
gether to generate the prediction. Experiments show Rt =σ(ΓΘxr (Xt , Xt ; G)
that disentangling the motion and the content helps + ΓΘhr (Xt ⊕ Ht−1 , Ht−1 ; G)),
improve the performance. This architecture has been H0t =h(ΓΘxh (Xt , Xt ; G)
used in other papers [89], [92] to enhance the prediction
+ Rt ◦ ΓΘhh (Xt ⊕ Ht−1 , Ht−1 ; G)),
performance.
Ht =(1 − Ut ) ◦ H0t + Ut ◦ Ht−1 .
4.3.3 Models for STSF-IG (42)
Unlike STSF-RG problems, the measurements in Here, Ut , Rt are the update gate and reset gate that
STSF-IG problems are observed in some sparsely dis- controls the memory flow, Ht is the hidden state, Xt
tributed stations which are more challenging to deal is the input, h(·) is the activation function, and Γ(·, ·)
with. The simple strategy is to model the measurements is an arbitrary graph convolution operator. The author
observed at each station independently, which has been also proposed a new multi-head gated graph attention
adopted in Yu et al. [17]. However, this strategy has not aggregator that uses neural attention to aggregate the
considered the correlation among different stations. A neighboring features. Experiments show that GGRU
recent trend in solving this problem is to convert these outperforms DCRNN if the gated attention aggregator
sparsely distributed stations to a graph and utilize graph is used. Yu et al. [93] proposed the Spatio-Temporal
convolution to build the encoder and the forecaster. We Graph Convolutional Network (STGCN) that applies
will mainly introduce these graph convolution-based graph convolution in a different manner. The author
methods in this section. directly stacked multiple Spatio-Temporal Convolution
Li et al. [32] proposed the Diffusion Convolutional (ST-Conv) blocks to build the network. The ST-Conv
Recurrent Neural Network (DCRNN) for traffic speed block is a concatenation of two temporal convolution
16

layers and one graph convolution layer. For the graph 4.3.5 Methods for Handling Uncertainty
convolution operator, the paper tested the ChebNet Most real-world dynamical systems, e.g., atmosphere,
and its first-order approximation [68] and found that are inherently stochastic and unpredictable. Simply as-
ChebNet performed better. The major advantage of suming the outcome is deterministic, which is common
STGCN over RNN-based methods is it is faster in the in FNN and RNN based models, will generate blurry
training phase. predictions. Thus, recent studies [9], [83], [95] have
proposed ways for handling uncertainty.
4.3.4 Models for TF-MPC In [13], the author proposed to use a probabilistic
forecaster that outputs the parameters of a Gaussian
The goal of TF-MPC is to predict the future trajectories Mixture Model (GMM). This enables the network to
of all objects in the moving point cloud. Fragkiadaki generate stochastic predictions. The author found that
et al. [13] proposed a basic RNN based method for the smallest forecasting error was always produced
this problem. In the model, the coordinate sequence by the most probable sample, which was similar to
of each human joint is encoded by the same LSTM. the output of a model trained by the Euclidean loss.
The drawback of this basic method is that it does not Apart from using a probabilistic forecaster, the author
model the correlation among the individuals. To solve also proposed a curriculum learning strategy which
the problem, Alahi et al. proposed the SocialLSTM gradually adds noise to the input during training. This
model [4] that uses a set of interacted LSTMs to can be viewed as regularizing the model not to be
model the trajectory of the crowd. In SocialLSTM, overconfident about its predictions. This technique has
the position of each person is modeled by an LSTM also been adopted in later works [14]. Goroshin et
and a social pooling layer interconnects these LSTMs. al. [83] proposed to add a latent variable δ in the
Social pooling aggregates the hidden states of the network to represent the uncertainty of the prediction.
nearby LSTMs at the previous timestamp and use The latent variable δi is not determined by the input
this aggregated state to help decide the current state. Xi and can be adjusted in the learning process to
The transition rule of the SocialLSTM is given in the minimize the loss. Before updating the parameters of
following: the network, the algorithm first updates δi for k steps
i j i j i where k is a hyperparameter. At test time, the δi s is
Nt = {j | j ∈ Ni , |xt − xt | ≤ m, |yt − yt | ≤ n},
X j sampled based on its distribution on the training set.
Hit = ht−1 , Another approach to handle uncertainty is to utilize
j∈Nti the recently popularized conditional Generative Adver-
eit = φ1 (xit , yti ; We ), sarial Network (GAN) [96], [97]. Mathieu et al. [6]
proposed to use conditional GAN as the loss function
ait = φ2 (Hit ; Wl ),
to train the multi-scale 2D CNN. The GAN loss used
hit , cit = LSTM(hit−1 , cit−1 , vec(eit , ait ); Wl ). in the paper is given as follows:
(43)
Here, Hit is the aggregated state vector in the social min max EY∼pdata (Y|X ) [log DθD (X , Y)]
j j θG θD (44)
pooling step, (xt , yt ) is the coordinate of the j th
+ Ez∼pz [log(1 − DθD (X , GθG (X , z)))].
person at timestamp t, φ1 , φ2 are mapping functions,
and m, n are distance thresholds. Experiments show Here, DθD is the discriminator, GθG is the forecasting
that adding the social pooling layer is helpful for model, X is the observed sequence, Y is the ground-
the TF-MPC task. In [94], the author proposed an truth sequence, and p(z) is the noise distribution. The
improved version of social pooling called View Frus- discriminator is trained to differentiate between the
tum Of Attention (VFOA) social pooling. This method ground-truth output and the prediction generated by the
defines the neighborhood set in the social pooling layer forecaster. On the other hand, the forecasting model is
based on the calculated VFOA interest region. Jain et trained to fool the discriminator. The final loss function
al. [14] proposed the Structural-RNN (S-RNN) network in the paper combined the GAN loss, the L2 loss, and
to solve the human motion prediction problem. S-RNN the image gradient difference loss. Experiments show
imposes structures in the recurrent connections based that the model trained with GAN generates sharper
on a given spatiotemporal graph. For human motion predictions. Vondrick et al. [9] used a similar GAN
prediction, this spatiotemporal graph is constructed by loss to train the 3D CNN. Jang et al. [89] proposed to
the relationship between different human joints. S- use two discriminators to respectively discriminate the
RNN maintains two types of RNNs, nodeRNN and motion and appearance part of the generated samples.
edgeRNN. The outputs of edgeRNNs are fed to the Besides GAN, the Variational Auto-encoder
corresponding nodeRNNs. Experiments show that the (VAE) [98] has also been adopted for dealing with the
S-RNN achieves the state-of-the-art result in the human uncertainty in STSF. Babaeizadeh et al. [95] proposed
motion prediction problem. Stochastic Variational Video Prediction (SV2P) which
17

uses the following objective function: gies for multi-step sequence forecasting, which contain
the IMS, DMS, boosting strategy, and scheduled sam-
l = −Eqφ (Z|X1:T ) [log pθ (Xt:T | X1:t , Z)]
(45) pling. We showed that IMS and DMS have their ad-
+ DKL (qφ (Z | X1:T )||p(Z)), vantages and disadvantages. The boosting strategy and
where qφ is the variational inference network that scheduled sampling can be viewed as the mid-grounds
approximates the posterior of the latent variable Z and between these two approaches. Next, we reviewed the
pθ is an RNN generator. The experiments show that specific methods for STSF. We divided these methods
adding VAE greatly alleviates the blurry problem of into two major categories: classical methods and DL
the predictions. Later, Denton & Fergus [99] proposed based methods. As mentioned in 3.4 and 4.4, the exist-
the Stochastic Video Generator with Learned Prior ing classical methods and deep learning methods have
(SVG-LP) which uses a learned time-varying prior their own trade-offs when solving the STSF problem.
pψ (Zt | X1:t−1 ) to replace the fixed prior p(Z) The overall summarization of the reviewed methods is
in (45). The extension in their paper is similar to given in Table 3.
the Variational RNN (VRNN) [100]. Learning a time- There are many future research directions for ma-
variant prior can be interpreted as learning a predic- chine learning for STSF. 1) We can enhance the
tive model of uncertainty. The author showed in the strategies for the multi-step sequence forecasting. One
experiments that using a learned prior improves the direction is to design new training curriculum to shift
forecasting performance. from IMS to DMS. For example, we can extend the
forecasting horizon exponentially from 1 to N to better
4.4 Remarks model longer sequences, which is common in TF-
MPC problems. Also, we can generalize the boost-
In this section, we reviewed two types of DL based
ing method described in Section 2.4 to the general
methods for STSF: DTGMs and FNN and RNN based
STSF problems. In addition, we can reduce the com-
methods. The majority of the reviewed DTGMs treat
plexity of DMS from O(h) to O(log h) by training
STSF as a general multivariate sequence forecasting
problem and SRTRBM is the only model that gives spe-
O(log h) models that generate the predictions of hori-
zons 20 , 21 , ..., 2log h−1 . We can use these models to
cial treatment of the spatiotemporal sequences. Also,
make prediction for any horizon up to h in the spirit
learning a DTGM requires approximation and sam-
of binary coding. Moreover, the objective function
pling techniques and is more complicated compared
mentioned in Section 2.5 will not be differentiable
with the FNN and RNN based methods. However,
due to the sampling process and the original paper has
DTGM can well capture the uncertainty in the data
just ignored this problem. This problem can be solved
due to its stochastic generative process. On the other
by approximating the gradient through techniques in
hand, the FNN and RNN based models are easy to
reinforcement learning [101]. 2) We can propose new
train and fast for prediction. We reviewed the repre-
FNN and RNN based methods for STSF-IG problems.
sentative methods for STSF-RG, STSF-IG, and TF-
Compared with STSF-RG, DL for STSF-IG is less
MPC. Although these three types of STSF problems
mature and we can bring the ideas in STSF-RG, like
involve different tasks, the proposed methods for these
decomposing motion and content, adding a global
problems have strong relationships. For example, Con-
memory structure, and using GAN/VAE to handle
vLSTM, DCRNN, and SocialLSTM, which are re-
uncertainty, to STSF-IG. 3) We can combine DTGMs
spectively proposed for STSF-RG, STSF-IG, and TF-
with FNN and RNN based methods to get the best of
MPC, improve upon RNN by introducing structured
both worlds. The generation flavor of DTGMs has its
recurrent connections. To be more specific, ConvL-
role to play in FNN and RNN based models particularly
STM uses convolution in recurrent transitions, DCRNN
for handling uncertainty. One feasible approach is to
uses the graph convolution, and SocialLSTM uses the
use the idea of Bayesian Deep Learning (BDL) [102],
social pooling layer. Therefore, the success of these
in which FNN and RNN based models are used for
models greatly relies on the appropriate design of the
perception and DTGMs are used for inference. 4) We
basic network architecture. Also, FNN and RNN based
can study the type of TF-MPC problem where both the
methods are usually deterministic and are not well-
measurements and coordinates are changing. We can
designed for capturing the uncertainty in the data. We
consider to solve the “video object prediction” problem
thus reviewed some recently popularized techniques for
where we have the bounding boxes of the objects at
handling uncertainty including GAN and VAE, which
the previous frames and the task is to predict both the
have improved the forecasting performance.
objects’ future appearances and bounding boxes. 5) We
can study new learning scenarios like online learning
5 S UMMARY AND F UTURE W ORKS and transfer learning for STSF. For the STSF problem,
In this survey, we reviewed the machine learning based the online learning scenario is essential because spa-
methods for STSF. We first examined the general strate- tiotemporal data, like regional rainfall, usually arrive in
18

√ TABLE 3
Summary of the reviewed methods for STSF. We√put “ ” under STSF-RG, STSF-IG, and TF-MPC to show the method has
been extended to solve the problem. We put “ ” under “Unc.” to show some of the methods have handled uncertainty.

Category Subcategory Methods STSF-RG STSF-IG TF-MPC Unc.



Spatiotemporal indicator [34] √
Feature-based
Multiple predictors [3] √ √
Group-based [40], [41] √ √ √
Classical
SSM STARIMA [42], [43], [44] √ √ √
Low-rankness [46], [48], [49] √ √ √ √
GP GP [16], [52], [53], [54]
√ √ √
TRBM [76] √ √ √
RTRBM [77] √ √ √ √
DTGM
SRTRBM [78] √ √ √
TSBN [79] √ √ √ √
FC-LSTM √ √
2D CNN [6], [24] & 3D CNN [9] √
ConvLSTM [5] & TrajGRU [82] √
PredRNN [81], [91] √
DL VPN [84] & PredCNN [86] √
Learn transformation [31], [80] √ √
FNN & RNN Motion + Content [87], [89], [92] √
DCRNN [32] & GGRU [33] √
STGCN [93] √ √
Social pooling [4], [94] √ √
S-RNN [14] √ √
δ -gradient [83] √ √
SV2P [95] & SVG-LP [99]

an online fashion [82]. The transfer learning scenario [7] N. Srivastava, E. Mansimov, and R. Salakhutdinov,
is also essential because STSF can act as a pretraining “Unsupervised Learning of Video Representations using
LSTMs,” in ICML, 2015.
method. The model trained by STSF can be transferred
[8] C. Finn and S. Levine, “Deep visual foresight for planning
to solve other tasks like classification and control [8], robot motion,” in ICRA, 2017.
[31]. Designing a better transfer learning method for [9] C. Vondrick, H. Pirsiavash, and A. Torralba, “Generating
STSF will improve the performance of these tasks. videos with scene dynamics,” in NIPS, 2016.
[10] J. Walker, A. Gupta, and M. Hebert, “Dense optical flow
prediction from a static image,” in ICCV, 2015.
[11] K. Yamaguchi, A. C. Berg, L. E. Ortiz, and T. L. Berg,
R EFERENCES “Who are you with and where are you going?” in CVPR,
2011.
[1] B. R. Hunt, E. J. Kostelich, and I. Szunyogh, “Efficient [12] G. W. Taylor, G. E. Hinton, and S. T. Roweis, “Modeling
data assimilation for spatiotemporal chaos: A local en- human motion using binary latent variables,” in NIPS,
semble transform kalman filter,” Physica D: Nonlinear 2006.
Phenomena, vol. 230, no. 1, pp. 112–126, 2007. [13] K. Fragkiadaki, S. Levine, P. Felsen, and J. Malik, “Re-
[2] M. Ranzato, A. Szlam, J. Bruna, M. Mathieu, R. Col- current network models for human dynamics,” in ICCV,
lobert, and S. Chopra, “Video (language) modeling: A 2015.
baseline for generative models of natural videos,” CoRR,
[14] A. Jain, A. R. Zamir, S. Savarese, and A. Sax-
vol. abs/1412.6604, 2014.
ena, “Structural-RNN: Deep learning on spatio-temporal
[3] Y. Zheng, X. Yi, M. Li, R. Li, Z. Shan, E. Chang, and T. Li,
graphs,” in CVPR, 2016.
“Forecasting fine-grained air quality based on big data,” in
KDD, 2015. [15] A. Zonoozi, J. Kim, X. Li, and G. Cong, “Periodic-CRN: A
[4] A. Alahi, K. Goel, V. Ramanathan, A. Robicquet, L. Fei- convolutional recurrent model for crowd density prediction
Fei, and S. Savarese, “Social LSTM: Human trajectory with recurring periodic patterns,” in IJCAI, 2018.
prediction in crowded spaces,” in CVPR, 2016. [16] R. Senanayake, S. OCallaghan, and F. Ramos, “Predicting
[5] X. Shi, Z. Chen, H. Wang, D.-Y. Yeung, W.-k. Wong, and spatio-temporal propagation of seasonal influenza using
W.-c. Woo, “Convolutional LSTM network: A machine variational gaussian process regression,” in AAAI, 2016.
learning approach for precipitation nowcasting,” in NIPS, [17] R. Yu, Y. Li, C. Shahabi, U. Demiryurek, and Y. Liu,
2015. “Deep learning: A generic approach for extreme condition
[6] M. Mathieu, C. Couprie, and Y. LeCun, “Deep multi- traffic forecasting,” in SDM, 2017.
scale video prediction beyond mean square error,” in ICLR, [18] D. R. Cox, “Prediction by exponentially weighted moving
2016. averages and related methods,” Journal of the Royal Sta-
19

tistical Society. Series B (Methodological), pp. 414–422, [42] A. Cliff and J. Ord, “Model building and the analysis of
1961. spatial pattern in human geography,” Journal of the Royal
[19] G. Chevillon, “Direct multi-step estimation and forecast- Statistical Society. Series B (Methodological), vol. 37,
ing,” Journal of Economic Surveys, vol. 21, no. 4, pp. 746– no. 3, pp. 297–348, 1975.
785, 2007. [43] A. Cliff and J. K. Ord, “Space-time modelling with an
[20] S. B. Taieb and R. Hyndman, “Boosting multi-step autore- application to regional forecasting,” Transactions of the
gressive forecasts,” in ICML, 2014. Institute of British Geographers, pp. 119–128, 1975.
[21] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, “Sched- [44] P. E. Pfeifer and S. J. Deutsch, “A starima model-building
uled sampling for sequence prediction with recurrent neu- procedure with application to description and regional
ral networks,” in NIPS, 2015. forecasting,” Transactions of the Institute of British Ge-
[22] P. Del Moral, A. Doucet, and A. Jasra, “Sequential monte ographers, pp. 330–349, 1980.
carlo samplers,” Journal of the Royal Statistical Society: [45] G. E. Box, G. M. Jenkins, G. C. Reinsel, and G. M. Ljung,
Series B (Methodological), vol. 68, no. 3, pp. 411–436, Time series analysis: Forecasting and control. John Wiley
2006. & Sons, 2015.
[23] J.-L. Lin and C. W. Granger, “Forecasting from non-linear [46] M. T. Bahadori, Q. R. Yu, and Y. Liu, “Fast multivariate
models in practice,” Journal of Forecasting, vol. 13, no. 1, spatio-temporal analysis via low rank tensor learning,” in
pp. 1–9, 1994. NIPS, 2014.
[24] J. Oh, X. Guo, H. Lee, R. L. Lewis, and S. Singh, “Action- [47] B. Recht, M. Fazel, and P. A. Parrilo, “Guaranteed
conditional video prediction using deep networks in atari minimum-rank solutions of linear matrix equations via
games,” in NIPS, 2015. nuclear norm minimization,” SIAM Review, vol. 52, no. 3,
[25] A. M. Lamb, A. G. ALIAS PARTH GOYAL, Y. Zhang, pp. 471–501, 2010.
S. Zhang, A. C. Courville, and Y. Bengio, “Professor [48] R. Yu, D. Cheng, and Y. Liu, “Accelerated online low-rank
forcing: A new algorithm for training recurrent networks,” tensor learning for multivariate spatio-temporal streams,”
in NIPS, 2016. in ICML, 2015.
[26] I. Goodfellow, Y. Bengio, and A. Courville, “Deep [49] R. Yu and Y. Liu, “Learning from multiway data: Simple
learning,” 2016, book in preparation for MIT press. and efficient tensor regression,” in ICML, 2016.
[Online]. Available: http://www.deeplearningbook.org [50] A. G. Journel and C. J. Huijbregts, Mining geostatistics.
[27] M. Wytock and J. Z. Kolter, “Sparse gaussian conditional Academic press, 1978.
random fields: Algorithms, theory, and application to en- [51] C. E. Rasmussen, “Gaussian processes for machine learn-
ergy forecasting.” in ICML, 2013. ing,” 2006.
[28] J. Friedman, T. Hastie, and R. Tibshirani, The elements of [52] P. Trautman and A. Krause, “Unfreezing the robot: Navi-
statistical learning. Springer series in statistics Springer, gation in dense, interacting crowds,” in IROS, 2010.
Berlin, 2001, vol. 1. [53] S. Flaxman, “Machine learning in space and time,” PhD
[29] Y. Bengio, J. Louradour, R. Collobert, and J. Weston, dissertation, Carnegie Mellon University, 2015.
“Curriculum learning,” in ICML, 2009. [54] S. Flaxman, A. G. Wilson, D. B. Neill, H. Nickisch,
[30] F. Huszar, “How (not) to train your generative model: and A. J. Smola, “Fast Kronecker inference in Gaussian
Scheduled sampling, likelihood, adversary?” CoRR, vol. processes with non-Gaussian likelihoods,” in ICML, 2015.
abs/1511.05101, 2015. [55] P. Smolensky, “Information processing in dynamical sys-
[31] C. Finn, I. Goodfellow, and S. Levine, “Unsupervised tems: Foundations of harmony theory,” DTIC Document,
learning for physical interaction through video prediction,” Tech. Rep., 1986.
in NIPS, 2016. [56] G. E. Hinton, “Training products of experts by minimiz-
[32] Y. Li, R. Yu, C. Shahabi, and Y. Liu, “Diffusion con- ing contrastive divergence,” Neural Computation, vol. 14,
volutional recurrent neural network: Data-driven traffic no. 8, pp. 1771–1800, 2002.
forecasting,” in ICLR, 2018. [57] M. Welling, M. Rosen-Zvi, and G. E. Hinton, “Exponential
[33] J. Zhang, X. Shi, J. Xie, H. Ma, I. King, and D.-Y. Yeung, family harmoniums with an application to information
“GaAN: Gated attention networks for learning on large and retrieval,” in NIPS, 2004.
spatiotemporal graphs,” in UAI, 2018. [58] X. Glorot, A. Bordes, and Y. Bengio, “Deep sparse rectifier
[34] O. Ohashi and L. Torgo, “Wind speed forecasting using neural networks.” in AISTATS, 2011.
spatio-temporal indicators.” in ECAI, 2012. [59] A. L. Maas, A. Y. Hannun, and A. Y. Ng, “Rectifier
[35] K. P. Murphy, Machine learning: A probabilistic perspec- nonlinearities improve neural network acoustic models,”
tive. MIT press, 2012. in ICML, 2013.
[36] J. G. De Gooijer and R. J. Hyndman, “25 years of time [60] A. Mnih and K. Gregor, “Neural variational inference and
series forecasting,” International Journal of Forecasting, learning in belief networks,” in ICML, 2014.
vol. 22, no. 3, pp. 443–473, 2006. [61] G. E. Hinton, P. Dayan, B. J. Frey, and R. M. Neal,
[37] M. Fraccaro, S. r. K. Sø nderby, U. Paquet, and O. Winther, “The ”wake-sleep” algorithm for unsupervised neural net-
“Sequential neural models with stochastic layers,” in NIPS, works,” Science, vol. 268, no. 5214, p. 1158, 1995.
2016. [62] D. Tran, L. Bourdev, R. Fergus, L. Torresani, and
[38] K. H. Brodersen, F. Gallusser, J. Koehler, N. Remy, S. L. M. Paluri, “Learning spatiotemporal features with 3d con-
Scott et al., “Inferring causal impact using bayesian struc- volutional networks,” in ICCV, 2015.
tural time-series models,” The Annals of Applied Statistics, [63] F. Yu and V. Koltun, “Multi-scale context aggregation by
vol. 9, no. 1, pp. 247–274, 2015. dilated convolutions,” in ICLR, 2016.
[39] R. E. Kalman, “A new approach to linear filtering and pre- [64] M. D. Zeiler, D. Krishnan, G. W. Taylor, and R. Fergus,
diction problems,” Journal of Basic Engineering, vol. 82, “Deconvolutional networks,” in CVPR, 2010.
no. 1, pp. 35–45, 1960. [65] M. D. Zeiler and R. Fergus, “Visualizing and understand-
[40] A. Asahara, K. Maruyama, A. Sato, and K. Seto, ing convolutional networks,” in ECCV, 2014.
“Pedestrian-movement prediction based on mixed markov- [66] J. Bruna, W. Zaremba, A. Szlam, and Y. LeCun, “Spectral
chain model,” in SIGSPATIAL, 2011. networks and locally connected networks on graphs,” in
[41] W. Mathew, R. Raposo, and B. Martins, “Predicting fu- ICLR, 2014.
ture locations with hidden markov models,” in UbiComp. [67] D. K. Duvenaud, D. Maclaurin, J. Iparraguirre, R. Bom-
ACM, 2012. barell, T. Hirzel, A. Aspuru-Guzik, and R. P. Adams,
20

“Convolutional networks on graphs for learning molecular [93] B. Yu, H. Yin, and Z. Zhu, “Spatio–temporal graph con-
fingerprints,” in NIPS, 2015. volutional networks: A deep learning framework for traffic
[68] T. N. Kipf and M. Welling, “Semi-supervised classification forecasting,” in IJCAI, 2018.
with graph convolutional networks,” in ICLR, 2017. [94] I. Hasan, F. Setti, T. Tsesmelis, A. Del Bue, F. Galasso,
[69] M. Defferrard, X. Bresson, and P. Vandergheynst, “Con- and M. Cristani, “MX-LSTM: Mixing tracklets and vislets
volutional neural networks on graphs with fast localized to jointly forecast trajectories and head poses,” in CVPR,
spectral filtering,” in NIPS, 2016. 2018.
[70] W. Hamilton, Z. Ying, and J. Leskovec, “Inductive repre- [95] M. Babaeizadeh, C. Finn, D. Erhan, R. H. Campbell, and
sentation learning on large graphs,” in NIPS, 2017. S. Levine, “Stochastic variational video prediction,” in
[71] F. Rosenblatt, Perceptions and the theory of brain mecha- ICLR, 2018.
nisms. Spartan books, 1962. [96] I. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu,
[72] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “ImageNet D. Warde-Farley, S. Ozair, A. Courville, and Y. Bengio,
classification with deep convolutional neural networks,” in “Generative adversarial nets,” in NIPS, 2014.
NIPS, 2012. [97] M. Mirza and S. Osindero, “Conditional generative adver-
[73] J. Nocedal and S. Wright, Numerical optimization. sarial nets,” CoRR, vol. abs/1411.1784, 2014.
Springer Science & Business Media, 2006. [98] D. P. Kingma and M. Welling, “Auto-encoding variational
bayes,” in ICLR, 2014.
[74] R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty
[99] E. Denton and R. Fergus, “Stochastic video generation
of training recurrent neural networks.” in ICML, 2013.
with a learned prior,” in ICML, 2018.
[75] S. Hochreiter and J. Schmidhuber, “Long short-term mem-
[100] J. Chung, K. Kastner, L. Dinh, K. Goel, A. C. Courville,
ory,” Neural Computation, vol. 9, no. 8, pp. 1735–1780,
and Y. Bengio, “A recurrent latent variable model for
1997.
sequential data.” in NIPS, 2015.
[76] I. Sutskever and G. E. Hinton, “Learning multilevel dis- [101] R. S. Sutton, D. A. McAllester, S. P. Singh, Y. Mansour
tributed representations for high-dimensional sequences.” et al., “Policy gradient methods for reinforcement learning
in AISTATS, 2007. with function approximation.” in NIPS, 1999.
[77] I. Sutskever, G. E. Hinton, and G. W. Taylor, “The re- [102] H. Wang and D.-Y. Yeung, “Towards bayesian deep learn-
current temporal restricted boltzmann machine,” in NIPS, ing: A framework and some existing methods,” IEEE
2009. Transaction on Knowledge and Data Engineering, vol. 28,
[78] R. Mittelman, B. Kuipers, S. Savarese, and H. Lee, “Struc- no. 12, pp. 3395–3408, Dec. 2016.
tured recurrent temporal restricted boltzmann machines,”
in ICML, 2014.
[79] Z. Gan, C. Li, R. Henao, D. E. Carlson, and L. Carin,
“Deep temporal sigmoid belief networks for sequence
modeling,” in NIPS, 2015. Xingjian Shi received his BEng
[80] X. Jia, B. De Brabandere, T. Tuytelaars, and L. V. Gool, degree in information security from
“Dynamic filter networks,” in NIPS, 2016. Shanghai Jiao Tong University,
[81] Y. Wang, Z. Gao, M. Long, J. Wang, and P. S. Yu, China. He is currently a Ph.D
“PredRNN++: Towards a resolution of the deep-in-time PLACE student at the Department of
dilemma in spatiotemporal predictive learning,” in ICML, PHOTO Computer Science and Engineering
2018. HERE of Hong Kong University of Science
[82] X. Shi, Z. Gao, L. Lausen, H. Wang, D.-Y. Yeung, W.-k. and Technology. His research
Wong, and W.-c. WOO, “Deep learning for precipitation interests are in deep learning
nowcasting: A benchmark and a new model,” in NIPS, and spatiotemporal analysis. He
2017. received the Hong Kong PhD
[83] R. Goroshin, M. F. Mathieu, and Y. LeCun, “Learning to Fellowship in 2014.
linearize under uncertainty,” in NIPS, 2015.
[84] N. Kalchbrenner, A. v. d. Oord, K. Simonyan, I. Danihelka,
O. Vinyals, A. Graves, and K. Kavukcuoglu, “Video pixel
networks,” in ICML, 2016.
[85] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, Dit-Yan Yeung received his BEng
O. Vinyals, A. Graves, N. Kalchbrenner, A. W. Senior, degree in electrical engineering and
and K. Kavukcuoglu, “WaveNet: A generative model for MPhil degree in computer science
raw audio,” CoRR, vol. abs/1609.03499, 2016. from the University of Hong Kong,
[86] Z. Xu, Y. Wang, M. Long, and J. Wang, “PredCNN: PLACE and PhD degree in computer sci-
Predictive learning with cascade convolutions,” in IJCAI, PHOTO ence from the University of South-
2018. HERE ern California. He started his aca-
[87] R. Villegas, J. Yang, S. Hong, X. Lin, and H. Lee, “De- demic career as an assistant profes-
composing motion and content for natural video sequence sor at the Illinois Institute of Tech-
prediction,” in ICLR, 2017. nology in Chicago. He then joined
[88] A. Bhattacharyya, B. Schiele, and M. Fritz, “Accurate and the Hong Kong University of Science
diverse sampling of sequences based on a best of many and Technology where he is now a full professor and the
sample objective,” in CVPR, 2018. Acting Head of the Department of Computer Science and
[89] Y. Jang, G. Kim, and Y. Song, “Video prediction with Engineering. His research interests are in computational
appearance and motion conditions,” in ICML, 2018. and statistical approaches to machine learning and artificial
[90] M. Jaderberg, K. Simonyan, A. Zisserman et al., “Spatial intelligence.
transformer networks,” in NIPS, 2015.
[91] Y. Wang, M. Long, J. Wang, Z. Gao, and P. S. Yu, “Pre-
dRNN: Recurrent neural networks for predictive learning
using spatiotemporal lstms,” in NIPS, 2017.
[92] J. Xu, B. Ni, Z. Li, S. Cheng, and X. Yang, “Structure
preserving video prediction,” in CVPR, 2018.

You might also like