Deep Learning for Time Series Forecasting:


The Electric Load Case
Alberto Gasparin, Slobodan Lukovic, and Cesare Alippi, Fellow, IEEE

A. Gasparin is with the Faculty of Informatics, Università della Svizzera Italiana, 6900 Lugano, Switzerland (e-mail: alberto.gasparin@usi.ch). S. Lukovic is with the Faculty of Informatics, Università della Svizzera Italiana, 6900 Lugano, Switzerland (e-mail: slobodan.lukovic@usi.ch). C. Alippi is with the Department of Electronics, Information, and Bioengineering, Politecnico di Milano, 20133 Milan, Italy, and also with the Faculty of Informatics, Università della Svizzera Italiana, 6900 Lugano, Switzerland (e-mail: cesare.alippi@polimi.it; cesare.alippi@usi.ch).

Abstract—Management and efficient operations in critical infrastructure such as Smart Grids take huge advantage of accurate power load forecasting which, due to its nonlinear nature, remains a challenging task. Recently, deep learning has emerged in the machine learning field achieving impressive performance in a vast range of tasks, from image classification to machine translation. Applications of deep learning models to the electric load forecasting problem are gaining interest among researchers as well as the industry, but a comprehensive and sound comparison among different architectures is not yet available in the literature. This work aims at filling the gap by reviewing and experimentally evaluating, on two real-world datasets, the most recent trends in electric load forecasting, contrasting deep learning architectures on short term forecasts (one day ahead prediction). Specifically, we focus on feedforward and recurrent neural networks, sequence to sequence models and temporal convolutional neural networks, along with architectural variants which are known in the signal processing community but are novel to the load forecasting one.

Index Terms—smart grid, electric load forecasting, time-series prediction, multi-step ahead forecasting, deep learning, recurrent neural network, LSTM, GRU, temporal convolutional neural network, sequence to sequence models

I. INTRODUCTION

Smart grids aim at creating automated and efficient energy delivery networks which improve power delivery reliability and quality, along with network security, energy efficiency, and demand-side management aspects [1]. Modern power distribution systems are supported by advanced monitoring infrastructures that produce immense amounts of data, thus enabling fine-grained analytics and improved forecasting performance. In particular, electric load forecasting emerges as a critical task in the energy field, as it enables useful support for decision making, supporting optimal pricing strategies, seamless integration of renewables and maintenance cost reductions. Load forecasting is carried out at different time horizons, ranging from milliseconds to years, depending on the specific problem at hand.

In this work we focus on the day-ahead prediction problem, also referred to in the literature as short term load forecasting (STLF) [2]. Since deregulation of electric energy distribution and the wide adoption of renewables strongly affect daily market prices, STLF emerges to be of fundamental importance for an efficient power supply [3]. Furthermore, we differentiate forecasting on the granularity level at which it is applied. For instance, in the individual household scenario, load prediction is rather difficult as power consumption patterns are highly volatile. On the contrary, aggregated load consumption, i.e., that associated with a neighborhood, a region, or even an entire state, is normally easier to predict as the resulting signal exhibits slower dynamics.

Historical power loads are time-series affected by several external time-variant factors, such as weather conditions, human activities, and temporal and seasonal characteristics, that make their prediction a challenging problem. A large variety of prediction methods has been proposed for electric load forecasting over the years and only the most relevant ones are reviewed in this section. Autoregressive moving average models (ARMA) were among the first model families used in short-term load forecasting [4], [5]. Soon they were replaced by ARIMA and seasonal ARIMA models [6] to cope with the time variance often exhibited by load profiles. In order to include exogenous variables like temperature into the forecasting method, model families were extended to ARMAX [7], [8] and ARIMAX [9]. The main shortcoming of these system identification families is the linearity assumption for the system being observed, a hypothesis that does not generally hold. In order to overcome this limitation, nonlinear models like Feed Forward Neural Networks were proposed and became attractive for those scenarios exhibiting significant nonlinearity, as in load forecasting tasks [3], [10]–[13]. The intrinsic sequential nature of time series data was then exploited by considering sophisticated techniques ranging from advanced feed forward architectures with residual connections [14] to convolutional approaches [15], [16] and Recurrent Neural Networks [17], [18], along with their many variants such as Echo-State Networks [18]–[20], Long Short-Term Memory [18], [21]–[23] and Gated Recurrent Units [18], [24]. Moreover, some hybrid architectures have also been proposed aiming to capture the temporal dependencies in the data with recurrent networks while performing a more general feature extraction operation with convolutional layers [25], [26].

Different reviews address the load forecasting topic by means of (not necessarily deep) neural networks. In [36] the authors focus on the use of some deep learning architectures for load forecasting. However, this review lacks a comprehensive comparative study of performance verified on common load forecasting benchmarks. The absence of a valid cost-performance metric does not allow the report to make conclusive statements.

Reference Predictive Family of Models Time Horizon Exogenous Variables Dataset (Location)
[18] LSTM, GRU, ERNN, NARX, ESN D - Rome, Italy
[18] LSTM, GRU, ERNN, NARX, ESN D T New England [27]
[28] ERNN H T, H, P, other∗ Palermo, Italy
[17] ERNN H T, W, H Hubli, India
[20] ESN 15min to 1Y - Sceaux, France [29]
[21] LSTM, NARX D - Unknown
[22] LSTM D(?) C, TI Australia [30]
[23] LSTM 2W to 4M T, W, H, C, TI France
[31] LSTM 2D T, P, H, C, TI Unknown [32]
[33] LSTM, seq2seq-LSTM 60 H C, TI Sceaux, France [29]
[34] LSTM, seq2seq-LSTM 12 H T, C, TI New England [35]
[24] GRU D T, C, other∗∗ Dongguan, China
[15] CNN D C, TI USA
[16] CNN D C, TI Sceaux, France
[25] CNN + LSTM D T, C, TI North-China
[26] CNN + LSTM D - North-Italy
Table I
Time Horizon: H (hour), D (day), W (week), M (month), Y (year), ? (not explicitly stated, thus inferred from the text). Exogenous variables: T (temperature), W (wind speed), H (humidity), P (pressure), C (calendar, including date and holidays information), TI (time); * other input features were created for this dataset; ** categorical weather information is used (e.g., sunny, cloudy). Dataset: the data source; a link is provided whenever available.

In [18] an exhaustive overview of recurrent neural networks for short term load forecasting is presented. That very detailed work, however, considers only one-layer (not deep) recurrent networks. A comprehensive summary of the most relevant research dealing with STLF employing recurrent neural networks, convolutional neural networks and seq2seq models is presented in Table I. It emerges that most of the works have been performed on different datasets, making it rather difficult - if not impossible - to assess their absolute performance and, consequently, recommend the best state-of-the-art solutions for load forecasting.

In this survey we consider the most relevant -and recent- deep architectures and contrast them in terms of performance accuracy on open-source benchmarks. The considered architectures include recurrent neural networks, sequence to sequence models and temporal convolutional neural networks. The experimental comparison is performed on two different real-world datasets which are representative of two distinct scenarios. The first one considers power consumption at the individual household level, with a signal characterized by high frequency components, while the second one takes into account the aggregation of several consumers. Our contributions consist in:

• A comprehensive review. The survey provides a comprehensive investigation of deep learning architectures known to the smart grid literature as well as novel recent ones suitable for electric load forecasting.
• A multi-step prediction strategy comparison for recurrent neural networks: we study and compare how different prediction strategies can be applied to recurrent neural networks. To the best of our knowledge this has not been done yet for deep recurrent neural networks.
• A relevant performance assessment. To the best of our knowledge, the present work provides the first systematic experimental comparison of the most relevant deep learning architectures for the electric load forecasting problems of individual and aggregated electric demand. It should be noted that the envisaged architectures are domain independent and, as such, can be applied in different forecasting scenarios.

The rest of this paper is organized as follows. In Section II we formally introduce the forecasting problem along with the notation that will be used in this work. In Section III we introduce Feed Forward Neural Networks (FNNs) and the main concepts relevant to the learning task; we also provide a short review of the literature regarding the use of FNNs for the load forecasting problem. In Section IV we provide a general overview of Recurrent Neural Networks (RNNs) and their most advanced architectures: Long Short-Term Memory and Gated Recurrent Unit networks. In Section V Sequence To Sequence architectures (seq2seq) are discussed as a general improvement over recurrent neural networks; we present both simple and advanced models built on the sequence to sequence paradigm. In Section VI Convolutional Neural Networks are introduced and one of their most recent variants, the temporal convolutional network (TCN), is presented as the state-of-the-art method for univariate time-series prediction. In Section VII the real-world datasets used for model comparison are presented; for each dataset, we provide a description of the preprocessing operations and the techniques that have been used to validate the models' performance. Finally, in Section VIII we draw conclusions based on the performed assessments.

II. PROBLEM DESCRIPTION

In basic multi-step ahead electric load forecasting a univariate time series s = [s[0], s[1], . . . , s[T]] that spans several years is given. In this work, input data are presented to the different predictive families of models as a regressor vector composed of fixed time-lagged data associated with a window of length nT which slides over the time series. Given this fixed-length view of past values, a predictor f aims at forecasting the next nO values of the time series. In this work the forecasting problem is studied as a supervised learning problem.

Figure 1. A sliding window approach is used to frame the forecasting problem into a supervised machine learning problem. The target signal s is split into multiple input-output pairs (xt, yt), ∀t ∈ {nT, nT + 1, . . . , T − nO}.

As such, given the input vector at discrete time t, defined as xt = [s[t − nT + 1], . . . , s[t]] ∈ IR^nT, the forecasting problem requires to infer the next nO measurements yt = [s[t + 1], . . . , s[t + nO]] ∈ IR^nO, or a subset of them. To ease the notation we express the input and output vectors in the reference system of the time window instead of that of the time series. By following this approach, the input vector at discrete time t becomes xt = [xt[0], . . . , xt[nT − 1]] ∈ IR^nT, with xt[i] = s[i + 1 + t − nT], and the corresponding output vector is yt = [yt[nT − 1], . . . , yt[nT + nO − 2]] ∈ IR^nO. yt characterizes the real output values, defined as yt[t] = xt[t + 1], ∀t ∈ T. Similarly, we denote by ŷt = f(xt; Θ̂) ∈ IR^nO the prediction vector provided by a predictive model f whose parameter vector Θ has been estimated by optimizing a performance function.

Without loss of generality, in the remainder of the paper we drop the subscript t from the inner elements of xt and yt. The introduced notation, along with the sliding window approach, is depicted in Figure 1.

In certain applications we will additionally be provided with d − 1 exogenous variables (e.g., the temperatures), each of which represents a univariate time series aligned in time with the electricity demand data. In this scenario the components of the regressor vector become vectors themselves, i.e., xt = [x[0], . . . , x[nT − 1]] ∈ IR^{nT×d}. Indeed, each element of the input sequence is represented as x[t] = [x[t], z0[t], . . . , zd−2[t]] ∈ IR^d, where x[t] ∈ IR is the scalar load measurement at time t, while zk[t] ∈ IR is the scalar value of the k-th exogenous feature.

The nomenclature used in this work is given in Table II.

Table II: THE NOMENCLATURE USED IN THIS WORK.
nT — window size of the regressor vector
nO — time horizon of the forecast
d − 1 — number of exogenous variables
u — scalar value
u — vector/matrix
u^T — vector/matrix transposed
⊙ — elementwise product
* — convolution operator
*_d — dilated convolution operator
ℓ — index of the ℓ-th layer
xt — regressor vector at discrete time t (reference system: time series); xt ∈ IR^nT or xt ∈ IR^{nT×d}
yt — true output for the input sequence at time t (reference system: time series); yt ∈ IR^nO
ŷt — predicted output for the input sequence at time t (reference system: time series); ŷt ∈ IR^nO
x[t] — input vector of load and other features at time t (reference system: time window); x[t] ∈ IR or x[t] ∈ IR^d
y[t] — value of the load time-series at time t + 1 (reference system: time window); y[t] ∈ IR
zt — exogenous features vector at time t (reference system: time series); zt ∈ IR^{nT×(d−1)}
z[t] — exogenous features vector at time t (reference system: time window); z[t] ∈ IR^{d−1}
h — hidden state vector
Θ — model's vector of parameters
nH — number of hidden neurons
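As an illustration of the sliding-window framing introduced above (not part of the original formulation), the construction of the (xt, yt) pairs can be sketched in a few lines of Python; the function name, toy series and parameter values below are ours.

```python
import numpy as np

def make_windows(s, n_T, n_O):
    """Split a univariate series s into input windows of length n_T
    and output windows holding the next n_O values (stride of one step)."""
    X, Y = [], []
    # t indexes the last observed sample of each input window
    for t in range(n_T - 1, len(s) - n_O):
        X.append(s[t - n_T + 1: t + 1])   # x_t = [s[t-n_T+1], ..., s[t]]
        Y.append(s[t + 1: t + 1 + n_O])   # y_t = [s[t+1], ..., s[t+n_O]]
    return np.asarray(X), np.asarray(Y)

# toy example: 15-minute samples, forecast one day ahead (96 steps)
load = np.sin(np.linspace(0, 60, 10_000)) + 0.1 * np.random.randn(10_000)
X, Y = make_windows(load, n_T=4 * 96, n_O=96)
print(X.shape, Y.shape)   # (n_samples, 384) (n_samples, 96)
```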



III. FEED FORWARD NEURAL NETWORKS

Feed Forward Neural Networks (FNNs) are parametric model families characterized by the universal function approximation property [37]. Their computational architectures are composed of a layered structure consisting of three main building blocks: the input layer, the hidden layer(s) and the output layer. The number of hidden layers (L > 1) determines the depth of the network, while the size of each layer, i.e., the number nH,ℓ of hidden units of the ℓ-th layer, defines its complexity in terms of neurons. FNNs provide only direct forward connections between two consecutive layers, each connection associated with a trainable parameter; note that, given the feedforward nature of the computation, no recursive feedback is allowed. More in detail, given a vector x ∈ IR^nT fed at the network input, the FNN's computation can be expressed as:

a_ℓ = W_ℓ^T h_{ℓ−1} + b_ℓ,   ℓ = 1, . . . , L    (1)
h_ℓ = φ_ℓ(a_ℓ)    (2)

where h_0 = xt ∈ IR^nT and ŷt = h_L ∈ IR^nO. Each layer ℓ is characterized by its own parameter matrix W_ℓ ∈ IR^{nH,ℓ−1 × nH,ℓ} and bias vector b_ℓ ∈ IR^{nH,ℓ}. Hereafter, in order to ease the notation, we incorporate the bias term in the weight matrix, i.e., W_ℓ = [W_ℓ; b_ℓ] and h_ℓ = [h_ℓ; 1]. Θ = [W_1, . . . , W_L] groups all the network's parameters.

Given a training set of N input-output vectors in the (xi, yi) form, i = 1, . . . , N, the learning procedure aims at identifying a suitable configuration of the parameters Θ̂ that minimizes a loss function L evaluating the discrepancy between the estimated values f(xt; Θ) and the measurements yt:

Θ̂ = arg min_Θ L(Θ)

The mean squared error

L(Θ) = (1/N) Σ_{t=1}^{N} (yt − f(xt; Θ))^2    (3)

is a very popular loss function for time series prediction and, not rarely, a regularization penalty term is introduced to prevent overfitting and improve the generalization capabilities of the model:

L(Θ) = (1/N) Σ_{t=1}^{N} (yt − f(xt; Θ))^2 + Ω(Θ)    (4)

The most used regularization scheme controlling model complexity is the L2 regularization Ω(Θ) = λ‖Θ‖_2^2, with λ a suitable hyper-parameter controlling the regularization strength.

As Equation 4 is not convex, the solution cannot be obtained in closed form with linear equation solvers or convex optimization techniques. Parameter estimation (the learning procedure) operates iteratively, e.g., by leveraging the gradient descent approach:

Θ_k = Θ_{k−1} − η ∇_Θ L(Θ)|_{Θ=Θ_{k−1}}    (5)

where η is the learning rate and ∇_Θ L(Θ) is the gradient w.r.t. Θ. Stochastic Gradient Descent (SGD), RMSProp [38], Adagrad [39] and Adam [40] are popular learning procedures. The learning procedure yields the estimate Θ̂ = Θ_k associated with the predictive model f(xt; Θ̂).

In our work, deep FNNs are the baseline model architectures. In multi-step ahead prediction the output layer dimension coincides with the forecasting horizon nO > 1. The dimension of the input vector also depends on the presence of exogenous variables; this aspect is further discussed in Section VII.

A. Related Work

The use of Feed Forward Neural Networks in short term load forecasting dates back to the 90s. The authors in [11] propose a shallow neural network with a single hidden layer to provide a 24-hour forecast using both load and temperature information. In [10] a one-day-ahead forecast is implemented using two different prediction strategies: one network provides all 24 forecast values in a single shot (MIMO strategy), while another, single-output, network provides the day-ahead prediction by recursively feeding back its last value estimate (recurrent strategy). The recurrent strategy proves to be more efficient in terms of both training time and forecasting accuracy. In [41] the authors present a feed forward neural network to forecast electric loads on a weekly basis. The sparsely connected feed forward architecture receives the load time-series, temperature readings, as well as the time and day of the week. It is shown that the extra information improves the forecast accuracy compared to an ARIMA model trained on the same task. [12] presents one of the first multi-layer FNNs to forecast the hourly load of a power system.

A detailed review concerning applications of artificial neural networks in short-term load forecasting can be found in [3]. However, this survey dates back to the early 2000s and does not discuss deep models. More recently, architectural variants of feed forward neural networks have been used; for example, in [14] a ResNet [42] inspired model is used to provide day-ahead forecasts by leveraging a very deep architecture. The article shows a significant improvement on aggregated load forecasting when compared to other (non-neural) regression models on different datasets.
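A minimal Keras sketch of a feed-forward baseline of the kind described in this section (MIMO output of size nO, MSE loss of Equation 4, L2 penalty) is given below for illustration; the layer sizes, learning rate and number of epochs are assumptions of ours, not the configurations tuned in the experiments.

```python
import numpy as np
from tensorflow import keras
from tensorflow.keras import layers, regularizers

n_T, n_O = 4 * 96, 96              # illustrative window length and horizon
l2 = regularizers.l2(1e-4)         # Omega(Theta) = lambda * ||Theta||^2 penalty

model = keras.Sequential([
    keras.Input(shape=(n_T,)),
    layers.Dense(128, activation="relu", kernel_regularizer=l2),
    layers.Dense(128, activation="relu", kernel_regularizer=l2),
    layers.Dense(n_O),             # MIMO strategy: the whole horizon in one shot
])
model.compile(optimizer=keras.optimizers.Adam(1e-3), loss="mse")

# X, Y built as in the windowing sketch of Section II (random data here)
X = np.random.rand(1024, n_T).astype("float32")
Y = np.random.rand(1024, n_O).astype("float32")
model.fit(X, Y, validation_split=0.2, epochs=5, batch_size=64, verbose=0)
```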
IV. RECURRENT NEURAL NETWORKS

In this section we overview recurrent neural networks and, in particular, the Elman network architecture [43], Long Short-Term Memory [44] and Gated Recurrent Unit [45] networks. Afterwards, we introduce deep recurrent neural networks and discuss different strategies to perform multi-step ahead forecasting. Finally, we present related work in short-term load forecasting that leverages recurrent networks.

A. Elman RNNs (ERNN)

Elman Recurrent Neural Networks (ERNN) were proposed in [43] to generalize feedforward neural networks for better handling ordered data sequences like time-series.

The reason behind the effectiveness of RNNs in dealing with sequences of data comes from their ability to learn a compact representation of the input sequence xt by means of a recurrent function f that implements the following mapping:

h[t] = f(h[t − 1], x[t]; Θ)    (6)

By expanding Equation 6, and given a sequence of inputs xt = [x[0], . . . , x[nT − 1]], x[t] ∈ IR^d, the computation becomes:

a[t] = W^T h[t − 1] + U^T x[t]    (7)
h[t] = φ(a[t])    (8)
y[t] = ψ(V^T h[t])    (9)

where W ∈ IR^{nH×nH}, U ∈ IR^{d×nH}, V ∈ IR^{nH×nO} are the weight matrices for the hidden-hidden, input-hidden and hidden-output connections respectively, φ(·) is an activation function (generally the hyperbolic tangent) and ψ(·) is normally a linear function. The computation of a single module in an Elman recurrent neural network is depicted in Figure 3.

Figure 3. A simple ERNN block with one cell implementing Equations 7-8 once rewritten as a matrix concatenation: a[t] = [W, U]^T [h[t − 1], x[t]], h[t] = φ(a[t]), with [W, U] ∈ IR^{(nH+d)×nH} and [h[t − 1], x[t]] ∈ IR^{nH+d}. Usually φ(·) is the hyperbolic tangent.

It can be noted that an ERNN processes one element of the sequence at a time, preserving its inherent temporal order. After reading an element from the input sequence x[t] ∈ IR^d, the network updates its internal state h[t] ∈ IR^{nH} using both (a transformation of) the latest state h[t − 1] and (a transformation of) the current input (Equation 6). The described process can be better visualized as an acyclic graph obtained from the original cyclic graph (left side of Figure 2) via an operation known as time unfolding (right side of Figure 2). It is of fundamental importance to point out that all nodes in the unfolded network share the same parameters, as they are just replicas distributed over time.

Figure 2. (Left) A simple RNN with a single input. The black box represents the delay operator which leads to Equation 6. (Right) The network after unfolding. Note that the structure reminds that of a (deep) feed forward neural network but, here, each layer is constrained to share the same weights. h_init is the initial state of the network, which is usually set to zero.

The parameters of the network Θ = [W, U, V] are usually learned via Backpropagation Through Time (BPTT) [46], [47], a generalized version of standard Backpropagation. In order to apply gradient-based optimization, the recurrent neural network has to be transformed through the unfolding procedure shown in Figure 2. In this way, the network is converted into a FNN having as many layers as time intervals in the input sequence, and each layer is constrained to have the same weight matrices. In practice Truncated Backpropagation Through Time [48], TBPTT(τb, τf), is used. The method processes an input window of length nT one timestep at a time and runs BPTT for τb timesteps every τf steps. Notice that having τb < nT does not limit the memory capacity of the network, as the hidden state incorporates information taken from the whole sequence. Despite that, setting τb to a very low number may result in poor performance. In the literature, BPTT is considered equivalent to TBPTT(τb = nT, τf = 1). In this work we used epoch-wise Truncated BPTT, i.e., TBPTT(τb = nT, τf = nT), to indicate that the weight update is performed once a whole sequence has been processed.

Despite the model's simplicity, Elman RNNs are hard to train due to the ineffectiveness of gradient (back)propagation. In fact, it emerges that the propagation of the gradient is effective for short-term connections but is very likely to fail for long-term ones, where the gradient norm usually shrinks to zero or diverges. These two behaviours are known as the vanishing gradient and the exploding gradient problems [49], [50] and were extensively studied in the machine learning community.

B. Long Short-Term Memory (LSTM)

Recurrent neural networks with Long Short-Term Memory (LSTM) were introduced to cope with the vanishing and exploding gradient problems occurring in ERNNs and, more in general, in standard RNNs [44]. LSTM networks maintain the same topological structure of the ERNN but differ in the composition of the inner module - or cell.

Figure 4. Long-Short Term Memory block with one cell.

Each LSTM cell has the same input and output as an ordinary ERNN cell but, internally, it implements a gated system that controls the neural information processing (see Figures 3 and 4). The key feature of gated networks is their ability to control the gradient flow by acting on the gate values; this allows to tackle the vanishing gradient problem, as an LSTM can maintain its internal memory unaltered for long time intervals.
Notice from the equations below that the inner state of the network results from a linear combination of the old state and the new candidate state (Equation 14): part of the old state is preserved and flows forward, while in the ERNN the state value is completely replaced at each timestep (Equation 8). In detail, the neural computation is:

i[t] = ψ(Wi h[t − 1] + Ui x[t])    (10)
f[t] = ψ(Wf h[t − 1] + Uf x[t])    (11)
o[t] = ψ(Wo h[t − 1] + Uo x[t])    (12)
c̃[t] = φ(Wc h[t − 1] + Uc x[t])    (13)
c[t] = f[t] ⊙ c[t − 1] + i[t] ⊙ c̃[t]    (14)
h[t] = o[t] ⊙ φ(c[t])    (15)

where Wf, Wi, Wo, Wc ∈ IR^{nH×nH} and Uf, Ui, Uo, Uc ∈ IR^{nH×d} are parameters to be learned, ⊙ is the Hadamard product, ψ(·) is generally a sigmoid activation while φ(·) can be any non-linearity (the hyperbolic tangent in the original paper). The cell state c[t] encodes the information learned so far from the input sequence. At timestep t the flow of information within the unit is controlled by three elements called gates: the forget gate f[t] controls the cell state's content and changes it when obsolete, the input gate i[t] controls which state values will be updated, and by how much, through c̃[t], and finally the output gate o[t] produces a filtered version of the cell state and serves it as the network's output h[t] [51].

C. Gated Recurrent Units (GRU)

First introduced in [45], GRUs are a simplified variant of LSTM and, as such, belong to the family of gated RNNs. GRUs distinguish themselves from LSTMs by merging into a single gate the functionalities controlled by the forget gate and the input gate. This kind of cell ends up having just two gates, which results in a more parsimonious architecture compared to the LSTM which, instead, has three gates.

Figure 5. Gated Recurrent Unit memory block with one cell.

The basic components of a GRU cell are outlined in Figure 5, whereas the neural computation is controlled by:

u[t] = ψ(Wu h[t − 1] + Uu x[t])    (16)
r[t] = ψ(Wr h[t − 1] + Ur x[t])    (17)
h̃[t] = φ(Wc (r[t] ⊙ h[t − 1]) + Uc x[t])    (18)
h[t] = u[t] ⊙ h[t − 1] + (1 − u[t]) ⊙ h̃[t]    (19)

where Wu, Wr, Wc ∈ IR^{nH×nH} and Uu, Ur, Uc ∈ IR^{nH×d} are the parameters to be learned, ψ(·) is generally a sigmoid activation while φ(·) can be any kind of non-linearity (in the original work it was the hyperbolic tangent). u[t] and r[t] are the update and the reset gates, respectively. Several works in the natural language processing community show that GRUs perform comparably to LSTMs but generally train faster due to the lighter computation [52], [53].
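To make the gating equations concrete, the sketch below implements a single forward step of an LSTM cell (Equations 10-15) and of a GRU cell (Equations 16-19) with NumPy; the weight shapes follow the notation of the text, while the dictionary layout and the random initialisation are illustrative choices of ours.

```python
import numpy as np

def sigmoid(a):
    return 1.0 / (1.0 + np.exp(-a))

def lstm_step(x, h, c, W, U):
    """One LSTM step (Equations 10-15); W* are recurrent, U* input weights."""
    i = sigmoid(W["i"] @ h + U["i"] @ x)          # input gate
    f = sigmoid(W["f"] @ h + U["f"] @ x)          # forget gate
    o = sigmoid(W["o"] @ h + U["o"] @ x)          # output gate
    c_tilde = np.tanh(W["c"] @ h + U["c"] @ x)    # candidate cell state
    c_new = f * c + i * c_tilde                   # Eq. 14
    h_new = o * np.tanh(c_new)                    # Eq. 15
    return h_new, c_new

def gru_step(x, h, W, U):
    """One GRU step (Equations 16-19)."""
    u = sigmoid(W["u"] @ h + U["u"] @ x)          # update gate
    r = sigmoid(W["r"] @ h + U["r"] @ x)          # reset gate
    h_tilde = np.tanh(W["c"] @ (r * h) + U["c"] @ x)
    return u * h + (1.0 - u) * h_tilde            # Eq. 19

n_H, d = 8, 3
rng = np.random.default_rng(0)
W = {k: rng.standard_normal((n_H, n_H)) for k in "ifoc"}
U = {k: rng.standard_normal((n_H, d)) for k in "ifoc"}
h, c, x = np.zeros(n_H), np.zeros(n_H), rng.standard_normal(d)
h, c = lstm_step(x, h, c, W, U)
```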

D. Deep Recurrent Neural Networks

All the recurrent architectures presented so far are characterized by a single layer. In turn, this implies that the computation is composed of an affine transformation followed by a non-linearity. That said, the concept of depth in RNNs is less straightforward than in feed-forward architectures. Indeed, the latter become deep when the input is processed by a large number of non-linear transformations before generating the output values. However, according to this definition, an unfolded RNN is already a deep model given its multiple non-linear processing layers. That said, a deep multi-level processing can be applied to all the transition functions (input-hidden, hidden-hidden, hidden-output), as there are no intermediate layers involved in these computations [54]. Depth can also be introduced in recurrent neural networks by stacking recurrent layers one on top of the other [55]. As this deep architecture is more intriguing, in this work we refer to it as a Deep RNN. By iterating the RNN computation, the function implemented by the deep architecture can be represented as:

h_ℓ[t] = f(h_ℓ[t − 1], h_{ℓ−1}[t]; Θ),   ℓ = 1, 2, . . . , L    (20)

where h_ℓ[t] is the hidden state at timestep t for layer ℓ. Notice that h_0[t] = x[t]. It has been empirically shown in several works that Deep RNNs are better at capturing the temporal hierarchy exhibited by time-series than their shallow counterparts [54], [56], [57]. Of course, hybrid architectures having different layers -recurrent or not- can be considered as well.

E. Multi-Step Prediction Schemes

There are five different architecture-independent strategies for multi-step ahead forecasting [58]:

Recursive strategy (Rec): a single model is trained to perform a one-step ahead forecast given the input sequence. Subsequently, during the operational phase, the forecasted output is recursively fed back and considered to be the correct one. By iterating this procedure nO times we generate the forecast values up to time t + nO. The procedure is described in Algorithm 1, where x[1:] is the input vector without its first element, while the vectorize(·) procedure concatenates the scalar output to the exogenous input variables.

Algorithm 1 Recursive Strategy (Rec) for Multi-Step Forecasting
1: x ← xt
2: o ← empty list
3: k ← 1
4: while k < nO + 1 do
5:   o ← f(x)
6:   o ← concatenate(o, o)
7:   x ← concatenate(x[1:], vectorize(o))
8:   k ← k + 1
9: end while
10: return o as ŷt

To summarize, the predictor f receives in input a vector x of length nT and outputs a scalar value o.

Direct strategy: design a set of nO independent predictors fk, k = 1, . . . , nO, each of which provides a forecast at time t + k. Similarly to the recursive strategy, each predictor fk outputs a scalar value o, but the input vector is the same for all the predictors. Algorithm 2 details the procedure.

Algorithm 2 Direct Strategy for Multi-Step Forecasting
1: x ← xt
2: o ← empty list
3: k ← 1
4: while k < nO + 1 do
5:   o ← concatenate(o, fk(x))
6:   k ← k + 1
7: end while
8: return o as ŷt

DirRec strategy [59]: a combination of the above two strategies. Similarly to the direct approach, nO models are used, but here each predictor leverages an enlarged input set, obtained by adding the results of the forecasts at the previous timesteps. The procedure is detailed in Algorithm 3.

Algorithm 3 DirRec Strategy for Multi-Step Forecasting
1: x ← xt
2: o ← empty list
3: k ← 1
4: while k < nO + 1 do
5:   o ← fk(x)
6:   o ← concatenate(o, o)
7:   x ← concatenate(x, vectorize(o))
8:   k ← k + 1
9: end while
10: return o as ŷt

MIMO strategy (Multiple Input - Multiple Output) [60]: a single predictor f is trained to forecast a whole output sequence of length nO in one shot, i.e., differently from the previous cases the output of the model is not a scalar but a vector:

ŷt = f(xt)

DIRMO strategy [61]: represents a trade-off between the Direct strategy and the MIMO strategy. It divides the nO-step forecast into smaller forecasting problems, each of length s. It follows that ⌈nO/s⌉ predictors are used to solve the problem.

Given the considerable computational demand required by RNNs during training, we focus on the multi-step forecasting strategies that are computationally cheaper, specifically the Recursive and MIMO strategies [58]. We will call them RNN-Rec and RNN-MIMO.

Given the hidden state h[t] at timestep t, the hidden-output mapping is obtained through a fully connected layer on top of the recurrent neural network. The objective of this dense network is to learn the mapping between the last state of the recurrent network, which represents a kind of lossy summary of the task-relevant aspects of the input sequence, and the output domain. This holds for all the presented recurrent networks and is consistent with Equation 9. In this work RNN-Rec and RNN-MIMO differ in the cardinality of the output domain, which is 1 for the former and nO for the latter, meaning that in Equation 9 either V ∈ IR^{nH×1} or V ∈ IR^{nH×nO}. The objective function is:

L(Θ) = (1/nO) Σ_{t=0}^{nO−1} (y[t] − ŷ[t])^2 + Ω(Θ)    (21)
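The two strategies retained in this work can be sketched as model-agnostic wrappers; in the sketch below, model_mimo and model_one_step stand for any trained predictor (the names are ours), and the recursive loop mirrors Algorithm 1 for the purely autoregressive case with no exogenous inputs.

```python
import numpy as np

def forecast_mimo(model_mimo, x):
    """MIMO strategy: one call returns the whole n_O-step forecast."""
    return model_mimo(x)

def forecast_recursive(model_one_step, x, n_O):
    """Recursive strategy (Algorithm 1, univariate case): feed each
    one-step forecast back into the regressor vector."""
    x = np.asarray(x, dtype=float).copy()
    out = []
    for _ in range(n_O):
        o = float(model_one_step(x))        # scalar one-step-ahead forecast
        out.append(o)
        x = np.concatenate([x[1:], [o]])    # drop oldest value, append forecast
    return np.asarray(out)

# toy predictor: persistence of the last observed value
persistence = lambda x: x[-1]
print(forecast_recursive(persistence, np.arange(10.0), n_O=4))  # [9. 9. 9. 9.]
```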

F. Related Work

In [17] an Elman recurrent neural network is considered to provide hourly load forecasts. The study also compares the performance of the network when additional weather information such as temperature and humidity is fed to the model. The authors conclude that, as expected, the recurrent network benefits from multi-input data and, in particular, from weather data. [28] makes use of an ERNN to forecast the household electric consumption of a suburban area in the neighbourhood of Palermo (Italy). In addition to the historical load measurements, the authors introduce several features to enhance the model's predictive capabilities. Besides the weather and calendar information, a specific ad-hoc index was created to assess the influence of the use of air-conditioning equipment on the electricity demand. In recent years, LSTMs have been adopted in short term load forecasting, proving to be more effective than traditional time-series analysis methods. In [21] LSTM is shown to outperform traditional forecasting methods, being able to exploit the long term dependencies in the time series to forecast the day-ahead load consumption. Several works proved successful in enhancing the recurrent neural network capabilities by employing multivariate input data. In [22] the authors propose a deep, LSTM based architecture that uses past measurements of the whole household consumption along with some measurements from selected appliances to forecast the consumption of the subsequent time interval (i.e., a one-step prediction). In [23] an LSTM-based network is trained using a multivariate input which includes temperature, holiday/working day information, and date and time information. Similarly, in [31] a power demand forecasting model based on LSTM shows an accuracy improvement compared to more traditional machine learning techniques such as Gradient Boosting Trees and Support Vector Regression.

GRUs have not been used much in the literature, as LSTM networks are often preferred. That said, the use of GRU-based networks is reported in [18], while a more recent study [24] uses GRUs for the daily consumption forecast of individual customers. Thus, investigating deep GRU-based architectures is a relevant scientific topic, also thanks to their faster convergence and simpler structure compared to LSTM [52].

Despite all these promising results, an extensive study of recurrent neural networks [18], and in particular of ERNN, LSTM, GRU, ESN [62] and NARX, concludes that none of the investigated recurrent architectures manages to outperform the others in all considered experiments. Moreover, the authors noticed that recurrent cells with gated mechanisms like LSTM and GRU perform comparably to the much simpler ERNN. This may indicate that in short-term load forecasting gating mechanisms may be unnecessary; this issue is further investigated -and evidence found- in the present work.

V. SEQUENCE TO SEQUENCE MODELS

Sequence To Sequence (seq2seq) architectures [63], or encoder-decoder models [45], were initially designed to solve the RNNs' inability to produce output sequences of arbitrary length. The architecture was first used in neural machine translation [45], [64], [65] but has emerged as the gold standard in different fields such as speech recognition [66]–[68] and image captioning [69].

The core idea of this general framework is to employ two networks resulting in an encoder-decoder architecture. The first neural network (possibly deep) f, the encoder, reads the input sequence xt ∈ IR^{nT×d} of length nT one timestep at a time; the computation generates a, generally lossy, fixed-dimensional vector representation of it, c = f(xt, Θf), c ∈ IR^{d'}. This embedded representation is usually called the context in the literature and can be the last hidden state of the encoder or a function of it. Then, a second neural network g - the decoder - learns how to produce the output sequence ŷt ∈ IR^nO given the context vector, i.e., ŷ = g(c, Θg). The schematics of the whole architecture is depicted in Figure 6.

The encoder and the decoder modules are generally two recurrent neural networks trained end-to-end to minimize the objective function:

L(Θ) = Σ_{t=0}^{nO−1} (y[t] − ŷ[t])^2 + Ω(Θ),   Θ = [Θf, Θg]    (22)
ŷ[t] = g(y[t − 1], h[t − 1], c; Θ)    (23)

where ŷ[t] is the decoder's estimate at time t, y[t] is the real measurement, h[t − 1] is the decoder's last state, c is the context vector from the encoder, x is the input sequence and Ω(Θ) the regularization term. The training procedure for this type of architecture is called teacher forcing [70]. As shown in Figure 6 and expressed in Equation 23, during training the decoder's input at time t is the ground-truth value y[t − 1], which is used to generate the next state h[t] and, then, the estimate ŷ[t]. During inference the true values are unavailable and are replaced by the estimates:

ŷ[t] = g(ŷ[t − 1], h[t − 1], c; Θ)    (24)

This discrepancy between training and testing results in errors accumulating over time during inference. In the literature this problem is often referred to as exposure bias [71]. Several solutions have been proposed to address this problem; in [72] the authors present scheduled sampling, a curriculum learning strategy that gradually changes the training process by switching the decoder's inputs from ground-truth values to the model's predictions. The professor forcing algorithm, introduced in [73], uses an adversarial framework to encourage the dynamics of the recurrent network to be the same both at training and operational (test) time. Finally, in recent years, reinforcement learning methods have been adopted to train sequence to sequence models; a comprehensive review is presented in [74].

In this work we investigate two sequence to sequence architectures, one trained via teacher forcing (TF) and one using self-generated (SG) samples. The former is characterized by Equation 23 during training, while Equation 24 is used during prediction. The latter architecture adopts Equation 24 both for training and prediction. The decoder's dynamics are summarized in Figure 7. It is clear that the two training procedures differ in the decoder's input source: ground-truth values in teacher forcing, estimated values in self-generated training.
Figure 6. seq2seq (Encoder-Decoder) architecture with a general Recurrent Neural Network both for the encoder and the decoder. Assuming a Teacher Forcing training process, the solid lines in the decoder represent the training phase while the dotted lines depict the values' path during prediction.

Figure 7. (Left) Decoder with ground-truth inputs (Teacher Forcing). (Right) Decoder with self-generated inputs.
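The two decoding regimes of Figure 7 can be contrasted with a schematic loop such as the one below; decoder_cell and readout are placeholders of ours for the recurrent cell and the output layer, and the context c is assumed to come from the encoder.

```python
import numpy as np

def decode(decoder_cell, readout, c, h0, first_input, y_true, teacher_forcing):
    """Unroll the decoder for len(y_true) steps.
    teacher_forcing=True : feed the ground truth y[t-1] back (Eq. 23, training).
    teacher_forcing=False: feed the previous estimate back (Eq. 24, inference
    or 'self-generated' training)."""
    h, inp = h0, first_input
    y_hat = []
    for t in range(len(y_true)):
        h = decoder_cell(inp, h, c)         # new decoder state
        y_hat.append(readout(h))            # estimate y_hat[t]
        inp = y_true[t] if teacher_forcing else y_hat[-1]
    return np.asarray(y_hat)

# toy scalar components, only to make the sketch executable
decoder_cell = lambda inp, h, c: np.tanh(0.5 * h + 0.3 * inp + 0.2 * c)
readout = lambda h: 0.8 * h
y = np.linspace(0.0, 1.0, 5)
print(decode(decoder_cell, readout, c=0.1, h0=0.0, first_input=0.0,
             y_true=y, teacher_forcing=True))
```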

A. Related Work

Only recently have seq2seq models been adopted in short term load forecasting. In [33] an LSTM based encoder-decoder model is shown to produce superior performance compared to a standard LSTM. In [75] the authors introduce an adaptation of RNN based sequence-to-sequence architectures for time-series forecasting of electrical loads and demonstrate its better performance with respect to a suite of models ranging from standard RNNs to classical time series techniques.

VI. CONVOLUTIONAL NEURAL NETWORKS

Convolutional Neural Networks (CNNs) [76] are a family of neural networks designed to work with data that can be structured in a grid-like topology. CNNs were originally used on two-dimensional and three-dimensional images, but they are also suitable for one-dimensional data such as univariate time-series. Once recognized as a very efficient solution for image recognition and classification [42], [77]–[79], CNNs have experienced wide adoption in many different computer vision tasks [80]–[84]. Moreover, sequence modeling tasks, like short term electric load forecasting, have been mainly addressed with recurrent neural networks, but recent research indicates that convolutional networks can also attain state-of-the-art performance in several applications including audio generation [85], machine translation [86] and time-series prediction [87].

As the name suggests, these kinds of networks are based on a discrete convolution operator that produces an output feature map f by sliding a kernel w over the input x. Each element in the output feature map is obtained by summing up the results of the element-wise multiplication between the input patch (i.e., a slice of the input having the same dimensionality as the kernel) and the kernel. The number of kernels (filters) M used in a convolutional layer determines the depth of the output volume (i.e., the number of output feature maps). To control the other spatial dimensions of the output feature maps two hyper-parameters are used: stride and padding. Stride represents the distance between two consecutive input patches and can be defined for each direction of motion.

Padding refers to the possibility of implicitly enlarging the input by adding (usually) zeros at the borders to control the output size w.r.t. the input one. Indeed, without padding, the dimensionality of the output would be reduced after each convolutional layer.

Considering a 1D time-series x ∈ IR^nT and a one-dimensional kernel w ∈ IR^k, the i-th element of the convolution between x and w is:

f(i) = (x * w)(i) = Σ_{j=0}^{k−1} x(i − j) w(j)    (25)

with f ∈ IR^{nT−k+1} if no zero-padding is used; otherwise padding matches the input dimensionality, i.e., f ∈ IR^nT. Equation 25 refers to the one-dimensional input case but can be easily extended to multi-dimensional inputs (e.g., images, where x ∈ IR^{W×H×D}) [88]. The reason behind the success of these networks can be summarized in the following three points:
• local connectivity: each hidden neuron is connected to a subset of input neurons that are close to each other (according to a specific spatio-temporal metric). This property allows the network to drastically reduce the number of parameters to learn (w.r.t. a fully connected network) and facilitates computations.
• parameter sharing: the weights used to compute the output neurons in a feature map are the same, so that the same kernel is used for each location. This further reduces the number of parameters to learn.
• translation equivariance: the network is robust to an eventual shifting of its input.

In our work we focus on a convolutional architecture inspired by Wavenet [85], a fully probabilistic and autoregressive model used for generating raw audio waveforms and extended to time-series prediction tasks [87]. To the authors' knowledge this architecture has never been proposed to forecast the electric load. A recent empirical comparison between temporal convolutional networks and recurrent networks has been carried out in [89] on tasks such as polyphonic music and character-level sequence modelling. The authors were the first to use the name Temporal Convolutional Networks (TCNs) to indicate convolutional networks which are autoregressive, able to process sequences of arbitrary length and to output a sequence of the same length. To achieve the above, the network has to employ causal (dilated) convolutions, and residual connections should be used to handle a very long history size.

a) Dilated Causal Convolution (DCC): Being TCNs a family of autoregressive models, the estimated value at time t must depend only on past samples and not on future ones (Figure 9). To achieve this behavior in a Convolutional Neural Network the standard convolution operator is replaced by a causal convolution. Moreover, zero-padding of length (filter size − 1) is added to ensure that each layer has the same length as the input layer. To further enhance the network capabilities, dilated causal convolutions are used, allowing to increase the receptive field of the network (i.e., the number of input neurons to which the filter is applied) and its ability to learn long-term dependencies in the time-series. Given a one-dimensional input x ∈ IR^nT and a kernel w ∈ IR^k, a dilated convolution output using a dilation factor d becomes:

f(i) = (x *_d w)(i) = Σ_{j=0}^{k−1} x(i − dj) w(j)    (26)

This is a major advantage w.r.t. simple causal convolutions, as in the latter case the receptive field r grows linearly with the depth of the network, r = k(L − 1), while with dilated convolutions the dependence is exponential, r = 2^{L−1} k, ensuring that a much larger history size is used by the network.

b) Residual Connections: Despite the use of dilated convolutions, the CNN still needs a large number of layers to learn the dynamics of the inputs. Moreover, performance often degrades with the increase of the network depth. The degradation problem was first addressed in [42], where the authors propose a deep residual learning framework. The authors observe that for an L-layer network with a training error ε, inserting k extra layers on top of it should either leave the error unchanged or improve it. Indeed, in the worst case scenario, the new k stacked non-linear layers should learn the identity mapping y = H(x) = x, where x is the output of the network having L layers and y is the output of the network with L + k layers. Although almost trivial, in practice neural networks experience problems in learning this identity mapping. The proposed solution suggests these stacked layers fit a residual mapping F(x) = H(x) − x instead of the desired one, H(x). The original mapping is recast into F(x) + x, which is realized by feed forward neural networks with shortcut connections; in this way the identity mapping is learned by simply driving the weights of the stacked layers to zero.

By means of the two aforementioned principles, the temporal convolutional network is able to exploit a large history size in an efficient manner. Indeed, as observed in [89], these models present several computational advantages compared to RNNs. In fact, they have lower memory requirements during training, and the predictions for later timesteps are not computed sequentially but in parallel, exploiting parameter sharing. Moreover, TCN training is much more stable than that involving RNNs, allowing to avoid the exploding/vanishing gradient problem. For all the above, TCNs have demonstrated to be a promising area of research for time series prediction problems and here we aim to assess their forecasting performance w.r.t. state-of-the-art models in short-term load forecasting. The architecture used in our work is depicted in Figure 10, which is, except for some minor modifications, the network structure detailed in [87].
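The following sketch illustrates Equation 26 and the exponential growth of the receptive field with NumPy; it is a didactic implementation of one dilated causal convolution under our own naming, not the layer implementation actually used in the experiments.

```python
import numpy as np

def dilated_causal_conv1d(x, w, d):
    """f(i) = sum_j x(i - d*j) * w(j), with implicit left zero-padding so the
    output has the same length as the input and never looks ahead."""
    k = len(w)
    pad = d * (k - 1)
    xp = np.concatenate([np.zeros(pad), x])     # causal (left) padding
    return np.array([sum(xp[pad + i - d * j] * w[j] for j in range(k))
                     for i in range(len(x))])

def receptive_field(kernel_size, n_layers):
    """History size seen by the top neuron when the dilation doubles per layer."""
    return 1 + (kernel_size - 1) * sum(2 ** l for l in range(n_layers))

x = np.arange(16, dtype=float)
print(dilated_causal_conv1d(x, w=np.array([0.5, 0.5]), d=2)[:6])
print(receptive_field(kernel_size=2, n_layers=3))   # 8, as in Figure 9
```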

Figure 8. A 3-layer CNN with causal convolutions (no dilation); the receptive field r is 4.

Figure 9. A 3-layer CNN with dilated causal convolutions. The dilation factor d grows on each layer by a factor of two and the kernel size k is 2; thus the output neuron is influenced by 8 input neurons, i.e., the history size is 8.

Figure 10. TCN Architecture. xt is the vector of historical loads along with the exogenous features for the time window indexes from 0 to nT, zt is the vector of exogenous variables related to the last nO indexes of the time window (when available), and ŷt is the output vector. Residual Blocks are composed of a 1D Dilated Causal Convolution, a ReLU activation and Dropout. The square box represents a concatenation between (transformed) exogenous features and (transformed) historical loads.
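A possible Keras rendering of one residual block of Figure 10 (dilated causal Conv1D, ReLU, Dropout, plus a shortcut connection) is sketched below; the filter counts, kernel size and dropout rate are illustrative assumptions rather than the tuned values of the paper.

```python
from tensorflow import keras
from tensorflow.keras import layers

def residual_block(x, filters, kernel_size, dilation_rate, dropout=0.1):
    """One TCN residual block: dilated causal convolution, ReLU, dropout,
    and an identity (or 1x1-projected) shortcut connection."""
    y = layers.Conv1D(filters, kernel_size, padding="causal",
                      dilation_rate=dilation_rate)(x)
    y = layers.Activation("relu")(y)
    y = layers.Dropout(dropout)(y)
    if x.shape[-1] != filters:                      # match the channel dimension
        x = layers.Conv1D(filters, 1, padding="same")(x)
    return layers.Add()([x, y])

n_T, n_features = 96, 1
inp = keras.Input(shape=(n_T, n_features))
h = inp
for l in range(4):                                  # dilation doubles per layer
    h = residual_block(h, filters=32, kernel_size=2, dilation_rate=2 ** l)
out = layers.Conv1D(1, 1)(h)                        # 1x1 convolution output layer
model = keras.Model(inp, out)
model.compile(optimizer="adam", loss="mse")
```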

In the first layer of the network we process separately the load information and, when available, the exogenous information such as temperature readings. The results are later concatenated together and processed by a deep residual network with L layers. Each layer consists of a residual block with a 1D dilated causal convolution, a rectified linear unit (ReLU) activation and, finally, dropout to prevent overfitting. The output layer consists of a 1x1 convolution which allows the network to output a one-dimensional vector y ∈ IR^nT having the same dimensionality as the input vector x. To approach multi-step forecasting, we adopt a MIMO strategy.

A. Related Work

In the short-term load forecasting literature, CNNs have not been studied to a large extent. Indeed, until recently, these models were not considered for any time-series related problem. Still, several works have addressed the topic; in [15] a deep convolutional neural network model named DeepEnergy is presented. The proposed network is inspired by the first architectures used in the ImageNet challenge (e.g., [77]), alternating convolutional and pooling layers and halving the width of the feature map after each step. According to the provided experimental results, DeepEnergy can precisely predict the energy load for the next three days, outperforming five other machine learning algorithms including LSTM and FNN. In [16] a CNN is compared to recurrent and feed forward approaches, showing promising results on a benchmark dataset. In [25] a hybrid approach involving both convolutional and recurrent architectures is presented. The authors integrate different input sources and use convolutional layers to extract meaningful features from the historic load, while the recurrent network's main task is to learn the system's dynamics. The model is evaluated on a large dataset containing hourly loads from a city in North China and is compared with a three-layer feed forward neural network. A different hybrid approach is presented in [26]: the authors process the load information in parallel with a CNN and an LSTM. The features generated by the two networks are then used as input for a final (fully connected) prediction network in charge of forecasting the day-ahead load.

VII. PERFORMANCE ASSESSMENT

In this section we perform the evaluation and assessment of all the presented architectures. The testing is carried out by means of three use cases based on two different datasets used as benchmarks. We first introduce the performance metrics that we considered for both network optimization and testing, then describe the datasets that have been used, and finally we discuss the results.

A. Performance Metrics

The efficiency of the considered architectures has been measured and quantified using widely adopted error metrics.

Specifically, we adopted the Root Mean Squared Error (RMSE) and the Mean Absolute Error (MAE):

RMSE = sqrt( (1/N) Σ_{i=0}^{N−1} (1/nO) Σ_{t=0}^{nO−1} (ŷi[t] − yi[t])^2 )

MAE = (1/N) Σ_{i=0}^{N−1} (1/nO) Σ_{t=0}^{nO−1} | ŷi[t] − yi[t] |

where N is the number of input-output pairs provided to the model in the course of testing, and yi[t] and ŷi[t] are respectively the real and the estimated load values at time t for sample i (i.e., the i-th time window). ⟨·⟩ denotes the mean operator, ‖·‖2 the Euclidean L2 norm, and ‖·‖1 the L1 norm. y ∈ IR^nO and ŷ ∈ IR^nO are the real and the estimated load values for one sample, respectively. Still, a more intuitive and indicative interpretation of the prediction efficiency of the estimators can be expressed by the normalized root mean squared error which, differently from the two above metrics, is independent of the scale of the data:

NRMSE% = RMSE / (ymax − ymin) · 100

where ymax and ymin are the maximum and minimum values of the training dataset, respectively. In order to quantify the proportion of variance in the target that is explained by the forecasting methods we also consider the R2 index:

R2 = (1/N) Σ_{i=0}^{N−1} (1 − RSS_i / TSS_i)

RSS_i = (1/nO) Σ_{t} (ŷi[t] − yi[t])^2    TSS_i = (1/nO) Σ_{t} (yi[t] − ȳi)^2

where ȳi = (1/nO) Σ_{t} yi[t].

All considered models have been implemented in Keras 2.12 [90] with Tensorflow [91] as backend. The experiments are executed on a Linux cluster with an Intel(R) Xeon(R) Silver CPU and an Nvidia Titan XP.
The first use case considers the Individual household electric We apply standard normalization to load and temperature
power consumption data set (IHEPC) which contains 2.07M measurements while for the other variables we simply apply
measurements of electric power consumption for a single one-hot encoding, i.e., a K-dimensional vector in which one of
house located in Sceaux (7km of Paris, France). Measure- the elements equals 1, and all remaining elements equal 0 [92].
ments are collected every minute between December 2006 On the right side of Figure 11 we observe the average load and
and November 2010 (47 months) [29]. In this study we focus the data dispersion on a weekly basis. Compared to IHEPC,
on prediction of the ”Global active power” parameter. Nearly the load profiles look much more regular. This meets intuitive
1.25% of measurements are missing, still, all the available expectations as the load measurements in the first dataset come
ones come with timestamps. We reconstruct the missing values from a single household, thus the randomness introduced by
using the mean power consumption for the corresponding time user behaviour makes more remarkable impact on the results.
slot across the different years of measurements. In order to On the opposite, the load information in GEFCom2014 comes
have a unified approach we have decided to resample the from the aggregation of the data provided by several different
dataset using a sampling rate of 15 minutes which is a widely smart meters; clustered data exhibits a more stable and regular
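One possible way to reproduce this preprocessing with pandas is sketched below; the file layout, column names and the notion of "time slot" (same day of the year and time of day) are our reading of the description above, not code released with the paper:

```python
import pandas as pd

# Raw IHEPC file: semicolon separated, '?' marks missing measurements.
df = pd.read_csv("household_power_consumption.txt", sep=";", na_values="?",
                 parse_dates={"timestamp": ["Date", "Time"]}, dayfirst=True,
                 index_col="timestamp", low_memory=False)
load = df["Global_active_power"].astype(float)

# Fill each gap with the mean consumption observed in the same time slot
# (day of year, hour and minute) across the other years of measurements.
slot = [load.index.dayofyear, load.index.hour, load.index.minute]
load = load.fillna(load.groupby(slot).transform("mean"))

# Resample from the original 1-minute rate to 15-minute resolution.
load_15min = load.resample("15T").mean()
```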
In Table III the sample sizes are outlined for each dataset. The weekly statistics reported on the left side of Figure 11 show that the data are characterized by high variance. The prediction task consists in forecasting the electric load for the next day, i.e., 96 timesteps ahead.

In order to assess the performance of the architectures, we hold out a portion of the data which constitutes our test set and comprises the last year of measurements. The remaining measurements are repeatedly divided in two sets, keeping aside one month of data out of every five. This process allows us to build a training set and a validation set on which different hyper-parameter configurations can be evaluated. Only the best performing configuration is later evaluated on the test set.

C. Use Case II and III

The other two use cases are based on the GEFCom2014 dataset [35], which was made available for an online forecasting competition that lasted between August 2015 and December 2015. The dataset contains 60.6k hourly measurements of (aggregated) electric power consumption collected by ISO New England between January 2005 and December 2011. Differently from the IHEPC dataset, temperature values are also available and are used by the different architectures to enhance their prediction performance. In particular, the input variables used for forecasting the subsequent n_O values at timestep t include: several previous load measurements, the temperature measurements for the previous timesteps registered by 25 different stations, and the hour, day, month and year of the measurements. We apply standard normalization to the load and temperature measurements, while for the other variables we simply apply one-hot encoding, i.e., a K-dimensional vector in which one of the elements equals 1 and all remaining elements equal 0 [92]. On the right side of Figure 11 we observe the average load and the data dispersion on a weekly basis. Compared to IHEPC, the load profiles look much more regular. This meets intuitive expectations, as the load measurements in the first dataset come from a single household, so the randomness introduced by user behaviour has a more remarkable impact on the results. On the opposite, the load information in GEFCom2014 comes from the aggregation of the data provided by several different smart meters; clustered data exhibit a more stable and regular pattern. The main task of these use cases, like the previous one, consists in forecasting the electric load for the next day, i.e., 24 timesteps ahead. The hyper-parameter optimization and the final scoring of the models follow the same guidelines provided for IHEPC; the number of points for each subset is described in Table III.
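As an illustration of the input encoding described above, the sketch below standardizes the continuous variables and one-hot encodes the calendar ones; the column names ('load', 't1'..'t25') are hypothetical and only meant to convey the idea:

```python
import pandas as pd

def build_features(df, train_index):
    """df: hourly frame with a DatetimeIndex, a 'load' column and 25
    temperature columns 't1'..'t25'; train_index selects the training rows
    whose statistics are used for the standard (z-score) normalization."""
    continuous = df[["load"] + [f"t{i}" for i in range(1, 26)]]
    mean, std = continuous.loc[train_index].mean(), continuous.loc[train_index].std()
    continuous = (continuous - mean) / std

    # Calendar variables encoded as K-dimensional one-hot vectors.
    calendar = pd.DataFrame({"hour": df.index.hour, "day": df.index.day,
                             "month": df.index.month, "year": df.index.year},
                            index=df.index)
    calendar = pd.get_dummies(calendar.astype("category"))

    return pd.concat([continuous, calendar], axis=1)
```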

D. Results

The compared architectures are the ones presented in the previous sections, with one exception: we have additionally considered a deeper variant of a feed forward neural network with residual connections, which is named DFNN in the remainder of the work. In accordance with the findings of [93], we have employed a 2-shortcut network, i.e., the input undergoes two affine transformations, each followed by a non-linearity, before being summed to its original values. For regularization purposes we have included Dropout and Batch Normalization [94] in each residual block. We have included this model in the comparison because it represents an evolution of standard feed forward neural networks which is expected to better handle highly complex time-series data.

Table IV summarizes the best configurations found through grid search for each model and use case. For both datasets we experimented with different input sequences of length n_T. In the end, we used a window size of four days, which represents the best trade-off between performance and memory requirements. The output sequence length n_O is fixed to one day. For each model we identified the optimal number of stacked layers L, the number of hidden units per layer n_H, the regularization coefficient λ (L2 regularization) and the dropout rate p_d. Moreover, for the TCN we additionally tuned the width k of the convolutional kernel and the number of filters M applied at each layer (i.e., the depth of each output volume after the convolution operation). The dilation factor is increased exponentially with the depth of the network, i.e., d = 2^ℓ, where ℓ is the index of the ℓ-th layer of the network.
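For reference, a minimal Keras sketch of such a stack of dilated causal convolutions is given below; the hyper-parameter defaults are placeholders, and residual connections, dropout and weight regularization are omitted for brevity:

```python
import tensorflow as tf

def build_tcn(n_T, n_features, n_O, L=3, M=32, k=2):
    """Causal convolutional stack whose dilation grows as d = 2**l at layer l."""
    inputs = tf.keras.Input(shape=(n_T, n_features))
    x = inputs
    for l in range(L):
        x = tf.keras.layers.Conv1D(filters=M, kernel_size=k,
                                   padding="causal", dilation_rate=2 ** l,
                                   activation="relu")(x)
    # Map the representation of the last timestep to the n_O-step forecast.
    last_step = tf.keras.layers.Lambda(lambda t: t[:, -1, :])(x)
    outputs = tf.keras.layers.Dense(n_O)(last_step)
    return tf.keras.Model(inputs, outputs)
```

For example, build_tcn(n_T=384, n_features=1, n_O=96) would cover the four-day input window and the one-day output horizon used for the IHEPC use case.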
Table V summarizes the test scores obtained by the presented architectures on the IHEPC dataset. Certain similarities among networks trained for the different use cases can be spotted already at this stage. In particular, we observe that all models exploit a small number of neurons. This is not usual in deep learning but, at least for recurrent architectures, is consistent with [18]. With some exceptions, recurrent networks benefit from a less strict regularization; dropout is almost always set to zero and λ values are small.

Among Recurrent Neural Networks we observe that, in general, the MIMO strategy outperforms the recursive one in this multi-step prediction task. This is reasonable in such a scenario. Indeed, the recursive strategy, differently from the MIMO one, is highly sensitive to error accumulation which, in a highly volatile time series such as the one addressed here, results in a very inaccurate forecast. Among the MIMO models we observe that gated networks perform significantly better than the simple Elman network. This suggests that gated systems are effectively learning to better exploit the temporal dependencies in the data. In general we notice that all the models, except the RNNs trained with the recursive strategy, achieve comparable performance and none really stands out. It is interesting to note that GRU-MIMO and LSTM-MIMO outperform sequence to sequence architectures, which are supposed to better model complex temporal dynamics like the one exhibited by the residential load curve. Nevertheless, by observing the performance of recurrent networks trained with the recursive strategy, this behaviour is less surprising. In fact, compared with the aggregated load profiles, the load curve belonging to a single smart meter is way more volatile and sensitive to customer behaviour. For this reason, leveraging geographical and socio-economic features that characterize the area where the user lives may allow deep networks to generate better predictions.

For visualization purposes we compare the performance of all the models in a single-day prediction scenario on the left side of Figure 12. On the right side of Figure 12 we quantify the differences between the best predictor (GRU-MIMO) and the actual measurements; the thinner the line, the closer the prediction to the true data. Furthermore, in this figure we concatenate multiple day predictions to cover a wider time span and evaluate the model's predictive capabilities. We observe that the model is able to generate a prediction that correctly models the general trend of the load curve but fails to predict steep peaks. This might come from the design choice of using MSE as the optimization metric, which could discourage deep models from predicting high peaks, as large errors are heavily penalized; therefore, predicting a lower and smoother function results in better performance according to this metric. Alternatively, some of the peaks may simply represent noise due to particular user behaviour and thus be unpredictable by definition.

The load curve of the second dataset (GEFCom2014) results from the aggregation of several different load profiles, producing a smoother load curve when compared with the individual load case. The hyper-parameter optimization and the final configurations of the models can be found in Table IV. Table VI and Table VII show the experimental results obtained by the models in two different scenarios. In the former case, only load values were provided to the models, while in the latter scenario the input vector has been augmented with the exogenous features described before. Compared to the previous dataset, this time series exhibits a much more regular pattern; as such, we expect the prediction task to be easier. Indeed, we can observe a major improvement in terms of performance across all the models. As already noted in [22], [95], the prediction accuracy increases significantly when the forecasting task is carried out on a smooth load curve (resulting from the aggregation of many individual consumers).

We can observe that, in general, all models except plain FNNs benefit from the presence of exogenous variables. When exogenous variables are adopted, we notice a major improvement for RNNs trained with the recursive strategy, which outperform the MIMO ones. This increase in accuracy can be attributed to a better capacity of leveraging the exogenous time series of temperatures to yield a better load forecast. Moreover, RNNs with the MIMO strategy gain negligible improvements compared to their performance when no extra feature is provided.
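To make the difference between the two multi-step strategies compared in this section explicit, the sketch below shows, with our own simplified interfaces, how a one-step (recursive) predictor and a MIMO predictor produce the n_O-step forecast:

```python
import numpy as np

def forecast_recursive(one_step_model, history, n_O):
    """Recursive strategy: a one-step model is applied n_O times and each
    prediction is fed back as the most recent input, so errors accumulate."""
    window = list(history)                                      # last n_T values
    forecast = []
    for _ in range(n_O):
        x = np.asarray(window, dtype="float32")[None, :, None]  # (1, n_T, 1)
        next_val = float(one_step_model.predict(x, verbose=0)[0, 0])
        forecast.append(next_val)
        window = window[1:] + [next_val]                        # slide the window
    return np.asarray(forecast)

def forecast_mimo(mimo_model, history):
    """MIMO strategy: a single model emits the whole n_O-step trajectory at
    once, so no prediction is ever fed back as an input."""
    x = np.asarray(history, dtype="float32")[None, :, None]
    return mimo_model.predict(x, verbose=0)[0]
```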


Figure 11. Weekly statistics for the electric load in the whole IHEPC (Left) and GEFCom2014 datasets (right). The bold line is the mean curve, the dotted
line is the median and the green area covers one standard deviation from the mean.

Hyper-parameter values per model, listed in the order: FNN, DFNN, ERNN (Rec, MIMO), LSTM (Rec, MIMO), GRU (Rec, MIMO), seq2seq (TF, SG), TCN.

L (number of stacked layers)
  IHEPC:      3, 6, 8, 3, 1, 2, 1, 2, 1, 1, 1
  GEFCOM:     1, 6, 6, 4, 4, 4, 2, 4, 1, 2, 1
  GEFCOMexog: 1, 6, 8, 2, 1, 4, 1, 2, 2, 2, 3

nH (hidden units per layer)
  IHEPC:      256-128x2, 50, 10, 30, 20, 20, 10, 50, 30, 50
  GEFCOM:     60, 30, 20, 50, 15, 20, 30, 20, 10, 50
  GEFCOMexog: 60, 30, 10, 30, 30, 50, 10, 15, 20, 20-15-10

λ (L2 regularization coefficient)
  IHEPC:      0.001, 0.0005, 0.005, 0.001, 0.001, 0.001, 0.001, 0.001, 0.0005, 0.01, 0.01
  GEFCOM:     0.01, 0.0005, 0.01, 0.01, 0.0005, 0.001, 0.001, 0.01, 0.0005, 0.01, 0.01
  GEFCOMexog: 0.005, 0.0005, 0.005, 0.0005, 0.001, 0.0005, 0.0005, 0.001, 0.01, 0.001, 0.01

pdrop (dropout rate)
  IHEPC:      0.1, 0.1, 0.1, 0.0, 0.0, 0.0, 0.0, 0.0, 0.0, 0.1, 0.2
  GEFCOM:     0.1, 0.1, 0.1, 0.1, 0.0, 0.1, 0.0, 0.1, 0.0, 0.1, 0.1
  GEFCOMexog: 0.1, 0.1, 0.1, 0.0, 0.0, 0.0, 0.0, 0.1, 0.0, 0.1, 0.0

k, M (TCN only: kernel width, filters per layer)
  IHEPC:      2, 32
  GEFCOM:     2, 16
  GEFCOMexog: 2, 64

Table IV
Best configurations found via grid search for the IHEPC dataset, the GEFCom2014 dataset and GEFCom2014 with exogenous features.

            RMSE         MAE          NRMSE        R2
FNN         0.76±0.01    0.53±0.01    10.02±0.17   0.250±0.026
DFNN        0.75±0.01    0.53±0.01     9.90±0.05   0.269±0.007
TCN         0.76±0.008   0.54±0.00    10.07±0.11   0.245±0.017
ERNN-MIMO   0.79±0.00    0.56±0.00    10.33±0.08   0.201±0.012
ERNN-Rec    0.88±0.02    0.69±0.03    11.61±0.29   0.001±0.039
LSTM-MIMO   0.75±0.00    0.53±0.00     9.85±0.04   0.276±0.006
LSTM-Rec    0.84±0.06    0.60±0.07    11.06±0.74   0.085±0.125
GRU-MIMO    0.75±0.00    0.52±0.00     9.83±0.03   0.279±0.004
GRU-Rec     0.89±0.02    0.70±0.02    11.64±0.23   0.00±0.04
seq2seq-TF  0.78±0.01    0.57±0.02    10.22±0.17   0.221±0.026
seq2seq-SG  0.76±0.01    0.53±0.01    10.00±0.14   0.253±0.03

Table V
Individual household electric power consumption data set results. Each model's mean score (± one standard deviation) comes from 10 repeated training processes.

MIMO recurrent networks use a feed-forward neural network to map their final hidden state to a sequence of n_O values, i.e., the estimates. Exogenous variables are elaborated directly by this FNN which, as observed above, shows to have problems in handling both load data and extra information. Consequently, a better way of injecting exogenous variables into MIMO recurrent networks needs to be found in order to provide a boost in prediction performance comparable to the one achieved by employing the recursive strategy.

For reasons similar to those discussed above, sequence to sequence models trained via teacher forcing (seq2seq-TF) experience an improvement when exogenous features are used. Still, seq2seq trained in free-running mode (seq2seq-SG) proves to be a valid alternative to standard seq2seq-TF, producing high quality predictions in all use cases. The absence of a discrepancy between training and inference in terms of data generating distribution shows to be an advantage, as seq2seq-SG is less sensitive to noise and error propagation.

Finally, we notice that TCNs perform well in all the presented use cases. Considering their lower memory requirements in the training process along with their inherent parallelism, this type of network represents a promising alternative to recurrent neural networks for short-term load forecasting.


Figure 12. (Right) Predictive performance of all the models on a single day for IHEPC dataset. The left portion of the image shows (part of) the measurements
used as input while the right side with multiple lines represents the different predictions. (Left) Difference between the best model’s predictions (GRU-MIMO)
and the actual measurements. The thinner the line the closer the prediction is to the true data.

            RMSE        MAE         NRMSE       R2
FNN         21.1±2.5    15.5±2.1    7.01±0.82   0.833±0.041
DFNN        22.4±6.2    17.1±6.2    7.44±2.01   0.801±0.124
TCN         17.2±0.1    11.5±0.1    5.71±0.14   0.891±0.00
ERNN-MIMO   18.0±0.3    11.9±0.4    5.99±0.11   0.879±0.046
ERNN-Rec    27.0±2.3    20.7±2.5    8.95±0.78   0.732±0.046
LSTM-MIMO   19.5±0.5    13.7±0.6    6.47±0.18   0.861±0.007
LSTM-Rec    25.6±2.2    18.4±1.3    8.52±0.72   0.757±0.041
GRU-MIMO    19.0±0.2    13.1±0.3    6.29±0.07   0.868±0.003
GRU-Rec     26.7±3.3    19.8±3.1    8.85±1.09   0.737±0.064
seq2seq-TF  21.5±2.1    15.4±1.9    7.13±0.69   0.829±0.034
seq2seq-SG  17.1±0.2    11.3±0.2    5.67±0.06   0.893±0.002

Table VI
GEFCom2014 results without any exogenous variable. Each model's mean score (± one standard deviation) comes from 10 repeated training processes.

            RMSE        MAE         NRMSE       R2
FNN         27.9±2.8    20.8±2.4    9.28±0.93   0.709±0.062
DFNN        23.0±1.2    15.6±0.7    7.62±0.41   0.805±0.021
TCN         15.4±1.5     8.6±1.7    5.00±0.22   0.917±0.007
ERNN-MIMO   17.9±0.3    11.7±0.3    5.94±0.01   0.883±0.004
ERNN-Rec    14.7±1.0     8.6±1.0    4.88±0.19   0.925±0.005
LSTM-MIMO   18.1±1.3    12.1±1.3    6.01±0.42   0.877±0.018
LSTM-Rec    13.8±0.6     7.5±0.3    4.59±0.18   0.930±0.006
GRU-MIMO    17.8±0.2    11.7±0.2    5.93±0.07   0.882±0.002
GRU-Rec     16.7±0.5    10.0±0.6    5.54±0.15   0.898±0.006
seq2seq-TF  14.3±1.0     8.5±0.9    4.74±0.32   0.924±0.014
seq2seq-SG  15.9±1.8     9.8±1.8    5.28±0.60   0.907±0.021

Table VII
GEFCom2014 results with exogenous variables. Each model's mean score (± one standard deviation) comes from 10 repeated training processes.

The results of the predictions are presented in the same fashion as for the previous use case in Figure 13. Observe that, in general, all the considered models are able to produce reasonable estimates, as sudden peaks in consumption are smoothed out in the aggregated curve. Therefore, predictors greatly improve their accuracy when predicting day-ahead values for aggregated load curves with respect to the individual household scenario.

VIII. CONCLUSIONS

In this work we have surveyed and experimentally evaluated the most relevant deep learning models applied to the short-term load forecasting problem, paving the way for a standardized assessment and the identification of optimal solutions in this field. The focus has been given to three main families of models, namely Recurrent Neural Networks, Sequence to Sequence architectures and the recently developed Temporal Convolutional Neural Networks. An architectural description, along with a technical discussion on how multi-step ahead forecasting is achieved, has been provided for each considered model. Moreover, different forecasting strategies have been discussed and evaluated, identifying advantages and drawbacks for each of them. The evaluation has been carried out on three real-world use cases that refer to two distinct scenarios for load forecasting. Indeed, one use case deals with a dataset coming from a single household, while the other two tackle the prediction of a load curve that represents several aggregated meters dispersed over a wide area. Our findings concerning the application of recurrent neural networks to short-term load forecasting show that the simple ERNN performs comparably to gated networks such as GRU and LSTM when adopted for aggregated load forecasting. The obtained results demonstrate that the extra complexity added by gated networks does not yield remarkable benefits for any of the presented forecasting tasks. In fact, the less costly alternative provided by the ERNN may represent the most effective solution, as it allows to reduce the training time without a remarkable impact on prediction accuracy. A significant difference exists for single-house electric load forecasting, where gated networks show to be superior to Elman ones, suggesting that the gating mechanism allows to better handle irregular time series. Sequence to Sequence models have demonstrated to be quite efficient in load forecasting tasks, even though they seem to fail in outperforming RNNs. In general, we can claim that seq2seq architectures do not represent a golden standard in load forecasting as they are in other domains like natural language processing. In addition, regarding this family of architectures, we have observed that teacher forcing may not represent the best solution for training seq2seq models on short-term load forecasting tasks.


Figure 13. (Right) Predictive performance of all the models on a single day for GEFCom2014 dataset. The left portion of the image shows (part of) the
measurements used as input while the right side with multiple lines represents the different predictions. (Left) Difference between the best model’s predictions
(LSTM-Rec) and the actual measurements. The thinner the line the closer the prediction to the true data.

Despite being harder in terms of convergence, free-running models learn to handle their own errors, avoiding the discrepancy between training and testing that is a well known issue for teacher forcing. It is therefore worth further investigating the capabilities of seq2seq models trained with intermediate solutions such as professor forcing. Finally, we evaluated the recently developed Temporal Convolutional Neural Networks, which demonstrated convincing performance when applied to load forecasting tasks. Therefore, we strongly believe that the adoption of these networks for sequence modelling in the considered field is very promising and might even introduce a significant advance in this area, which is emerging as one of key importance for future Smart Grid developments.

ACKNOWLEDGMENT

This project is carried out within the frame of the Swiss Centre for Competence in Energy Research on the Future Swiss Electrical Infrastructure (SCCER-FURIES) with the financial support of the Swiss Innovation Agency (Innosuisse - SCCER program).

REFERENCES

[1] X. Fang, S. Misra, G. Xue, and D. Yang, "Smart grid — the new and improved power grid: A survey," IEEE Communications Surveys & Tutorials, vol. 14, no. 4, pp. 944–980, Fourth 2012.

[2] E. Almeshaiei and H. Soltan, "A methodology for electric power load forecasting," Alexandria Engineering Journal, vol. 50, no. 2, pp. 137–144, 2011. [Online]. Available: http://www.sciencedirect.com/science/article/pii/S1110016811000330

[3] H. S. Hippert, C. E. Pedreira, and R. C. Souza, "Neural networks for short-term load forecasting: a review and evaluation," IEEE Transactions on Power Systems, vol. 16, no. 1, pp. 44–55, Feb 2001.

[4] J.-F. Chen, W.-M. Wang, and C.-M. Huang, "Analysis of an adaptive time-series autoregressive moving-average (arma) model for short-term load forecasting," Electric Power Systems Research, vol. 34, no. 3, pp. 187–196, 1995.

[5] S.-J. Huang and K.-R. Shih, "Short-term load forecasting via arma model identification including non-gaussian process considerations," IEEE Transactions on Power Systems, vol. 18, no. 2, pp. 673–679, 2003.

[6] M. T. Hagan and S. M. Behr, "The time series approach to short term load forecasting," IEEE Transactions on Power Systems, vol. 2, no. 3, pp. 785–791, 1987.

[7] C.-M. Huang, C.-J. Huang, and M.-L. Wang, "A particle swarm optimization to identifying the armax model for short-term load forecasting," IEEE Transactions on Power Systems, vol. 20, no. 2, pp. 1126–1133, 2005.

[8] H.-T. Yang, C.-M. Huang, and C.-L. Huang, "Identification of armax model for short term load forecasting: An evolutionary programming approach," in Power Industry Computer Application Conference, 1995. Conference Proceedings., 1995 IEEE. IEEE, 1995, pp. 325–330.

[9] G. R. Newsham and B. J. Birt, "Building-level occupancy data to improve arima-based electricity use forecasts," in Proceedings of the 2nd ACM workshop on embedded sensing systems for energy-efficiency in building. ACM, 2010, pp. 13–18.

[10] K. Y. Lee, Y. T. Cha, and J. H. Park, "Short-term load forecasting using an artificial neural network," IEEE Transactions on Power Systems, vol. 7, no. 1, pp. 124–132, Feb 1992.

[11] D. C. Park, M. A. El-Sharkawi, R. J. Marks, L. E. Atlas, and M. J. Damborg, "Electric load forecasting using an artificial neural network," IEEE Transactions on Power Systems, vol. 6, no. 2, pp. 442–449, May 1991.

[12] D. Srinivasan, A. Liew, and C. Chang, "A neural network short-term load forecaster," Electric Power Systems Research, vol. 28, no. 3, pp. 227–234, 1994. [Online]. Available: http://www.sciencedirect.com/science/article/pii/037877969490037X

[13] I. Drezga and S. Rahman, "Short-term load forecasting with local ann predictors," IEEE Transactions on Power Systems, vol. 14, no. 3, pp. 844–850, Aug 1999.

[14] K. Chen, K. Chen, Q. Wang, Z. He, J. Hu, and J. He, "Short-term load forecasting with deep residual networks," IEEE Transactions on Smart Grid, pp. 1–1, 2018.

[15] P.-H. Kuo and C.-J. Huang, "A high precision artificial neural networks model for short-term energy load forecasting," Energies, vol. 11, no. 1, 2018. [Online]. Available: http://www.mdpi.com/1996-1073/11/1/213

[16] K. Amarasinghe, D. L. Marino, and M. Manic, "Deep neural networks for energy load forecasting," in 2017 IEEE 26th International Symposium on Industrial Electronics (ISIE), June 2017, pp. 1483–1488.

[17] S. Nayaka, A. Yelamali, and K. Byahatti, "Electricity short term load forecasting using elman recurrent neural network," 11 2010, pp. 351–354.

[18] F. M. Bianchi, E. Maiorino, M. C. Kampffmeyer, A. Rizzi, and R. Jenssen, "An overview and comparative analysis of recurrent neural networks for short term load forecasting," CoRR, vol. abs/1705.04378, 2017. [Online]. Available: http://arxiv.org/abs/1705.04378

[19] F. M. Bianchi, E. D. Santis, A. Rizzi, and A. Sadeghian, "Short-term electric load forecasting using echo state networks and pca decomposition," IEEE Access, vol. 3, pp. 1931–1943, 2015.

[20] E. Mocanu, P. H. Nguyen, M. Gibescu, and W. L. Kling, "Deep learning for estimating building energy consumption," Sustainable Energy, Grids and Networks, vol. 6, pp. 91–99, 2016.

[21] J. Zheng, C. Xu, Z. Zhang, and X. Li, "Electric load forecasting in smart grids using long-short-term-memory based recurrent neural network," in Information Sciences and Systems (CISS), 2017 51st Annual Conference on. IEEE, 2017, pp. 1–6.

[22] W. Kong, Z. Y. Dong, Y. Jia, D. J. Hill, Y. Xu, and Y. Zhang, "Short-term residential load forecasting based on lstm recurrent neural network," IEEE Transactions on Smart Grid, 2017.

[23] S. Bouktif, A. Fiaz, A. Ouni, and M. Serhani, “Optimal deep learning in Natural Language Processing, EMNLP 2014, October 25-29,
lstm model for electric load forecasting using feature selection and 2014, Doha, Qatar, A meeting of SIGDAT, a Special Interest
genetic algorithm: Comparison with machine learning approaches,” Group of the ACL, 2014, pp. 1724–1734. [Online]. Available:
Energies, vol. 11, no. 7, p. 1636, 2018. http://aclweb.org/anthology/D/D14/D14-1179.pdf
[24] Y. Wang, M. Liu, Z. Bao, and S. Zhang, “Short-term load forecasting [46] P. J. Werbos, “Backpropagation through time: What it does and how to
with multi-source data using gated recurrent unit neural networks,” do it,” 1990.
Energies, vol. 11, p. 1138, 05 2018. [47] D. E. Rumelhart, G. E. Hinton, and R. J. Williams, “Parallel distributed
[25] W. He, “Load forecasting via deep neural networks,” Procedia processing: Explorations in the microstructure of cognition, vol. 1,”
Computer Science, vol. 122, pp. 308 – 314, 2017, 5th International D. E. Rumelhart, J. L. McClelland, and C. PDP Research Group,
Conference on Information Technology and Quantitative Management, Eds. Cambridge, MA, USA: MIT Press, 1986, ch. Learning Internal
ITQM 2017. [Online]. Available: http://www.sciencedirect.com/science/ Representations by Error Propagation, pp. 318–362. [Online]. Available:
article/pii/S1877050917326170 http://dl.acm.org/citation.cfm?id=104279.104293
[26] C. Tian, J. Ma, C. Zhang, and P. Zhan, “A deep neural network model [48] R. J. Williams and J. Peng, “An efficient gradient-based algorithm for
for short-term load forecast based on long short-term memory network on-line training of recurrent network trajectories,” Neural Computation,
and convolutional neural network,” Energies, vol. 11, p. 3493, 12 2018. vol. 2, 09 1998.
[27] T. Hong, P. Pinson, and S. Fan, “Global energy forecasting competition [49] Y. Bengio, P. Simard, and P. Frasconi, “Learning long-term
2012,” International Journal of Forecasting, vol. 30, no. 2, pp. 357 – dependencies with gradient descent is difficult,” Trans. Neur. Netw.,
363, 2014. [Online]. Available: http://www.sciencedirect.com/science/ vol. 5, no. 2, pp. 157–166, Mar. 1994. [Online]. Available:
article/pii/S0169207013000745 http://dx.doi.org/10.1109/72.279181
[28] A. Marvuglia and A. Messineo, “Using recurrent artificial neural [50] R. Pascanu, T. Mikolov, and Y. Bengio, “On the difficulty of training
networks to forecast household electricity consumption,” Energy recurrent neural networks,” in Proceedings of the 30th International
Procedia, vol. 14, pp. 45 – 55, 2012, 2011 2nd International Conference Conference on International Conference on Machine Learning - Volume
on Advances in Energy Engineering (ICAEE). [Online]. Available: 28, ser. ICML’13. JMLR.org, 2013, pp. III–1310–III–1318. [Online].
http://www.sciencedirect.com/science/article/pii/S1876610211043116 Available: http://dl.acm.org/citation.cfm?id=3042817.3043083
[29] D. Dheeru and E. Karra Taniskidou, “UCI machine learning repository,” [51] K. Greff, R. K. Srivastava, J. Koutnı́k, B. R. Steunebrink, and J. Schmid-
2017. [Online]. Available: http://archive.ics.uci.edu/ml huber, “Lstm: A search space odyssey,” IEEE Transactions on Neural
[30] “Smart grid, smart city, australian govern., australia, can- Networks and Learning Systems, vol. 28, no. 10, pp. 2222–2232, Oct
berray.” [Online]. Available: http://www.industry.gov.au/ENERGY/ 2017.
PROGRAMMES/SMARTGRIDSMARTCITY/Pages/default.aspx [52] J. Chung, Çaglar Gülçehre, K. Cho, and Y. Bengio, “Empirical evalua-
[31] Y. Cheng, C. Xu, D. Mashima, V. L. L. Thing, and Y. Wu, “Powerl- tion of gated recurrent neural networks on sequence modeling,” CoRR,
stm: Power demand forecasting using long short-term memory neural vol. abs/1412.3555, 2014.
network,” in Advanced Data Mining and Applications, G. Cong, W.- [53] W. Yin, K. Kann, M. Yu, and H. Schütze, “Comparative study of cnn and
C. Peng, W. E. Zhang, C. Li, and A. Sun, Eds. Cham: Springer rnn for natural language processing,” arXiv preprint arXiv:1702.01923,
International Publishing, 2017, pp. 727–740. 2017.
[32] “Umass smart dataset,” http://traces.cs.umass.edu/index.php/Smart/ [54] R. Pascanu, Ç. Gülçehre, K. Cho, and Y. Bengio, “How to construct
Smart, 2017. deep recurrent neural networks,” CoRR, vol. abs/1312.6026, 2013.
[33] D. L. Marino, K. Amarasinghe, and M. Manic, “Building energy [Online]. Available: http://arxiv.org/abs/1312.6026
load forecasting using deep neural networks,” in Industrial Electronics [55] J. Schmidhuber, “Learning complex, extended sequences using the
Society, IECON 2016-42nd Annual Conference of the IEEE. IEEE, principle of history compression,” Neural Comput., vol. 4, no. 2,
2016, pp. 7046–7051. pp. 234–242, Mar. 1992. [Online]. Available: http://dx.doi.org/10.1162/
[34] H. Wilms, M. Cupelli, and A. Monti, “Combining auto-regression with neco.1992.4.2.234
exogenous variables in sequence-to-sequence recurrent neural networks [56] A. Graves, A. Mohamed, and G. E. Hinton, “Speech recognition with
for short-term load forecasting,” in 2018 IEEE 16th International deep recurrent neural networks,” CoRR, vol. abs/1303.5778, 2013.
Conference on Industrial Informatics (INDIN). IEEE, 2018, pp. 673– [Online]. Available: http://arxiv.org/abs/1303.5778
679. [57] M. Hermans and B. Schrauwen, “Training and analysing deep
[35] T. Hong, P. Pinson, S. Fan, H. Zareipour, A. Troccoli, and recurrent neural networks,” in Advances in Neural Information
R. J. Hyndman, “Probabilistic energy forecasting: Global energy Processing Systems 26, C. J. C. Burges, L. Bottou, M. Welling,
forecasting competition 2014 and beyond,” International Journal of Z. Ghahramani, and K. Q. Weinberger, Eds. Curran Associates, Inc.,
Forecasting, vol. 32, no. 3, pp. 896 – 913, 2016. [Online]. Available: 2013, pp. 190–198. [Online]. Available: http://papers.nips.cc/paper/
http://www.sciencedirect.com/science/article/pii/S0169207016000133 5166-training-and-analysing-deep-recurrent-neural-networks.pdf
[36] A. Almalaq and G. Edwards, “A review of deep learning methods applied [58] S. B. Taieb, G. Bontempi, A. F. Atiya, and A. Sorjamaa, “A review and
on load forecasting,” in 2017 16th IEEE International Conference on comparison of strategies for multi-step ahead time series forecasting
Machine Learning and Applications (ICMLA), Dec 2017, pp. 511–516. based on the nn5 forecasting competition,” Expert Systems with
[37] B. C. Csáji, “Approximation with artificial neural networks.” Applications, vol. 39, no. 8, pp. 7067 – 7083, 2012. [Online]. Available:
[38] G. Hinton, http://www.cs.toronto.edu/∼tijmen/csc321/slides/lecture http://www.sciencedirect.com/science/article/pii/S0957417412000528
slides lec6.pdf. [59] A. Sorjamaa and A. Lendasse, “Time series prediction using dirrec
[39] M. D. Zeiler, “Adadelta: An adaptive learning rate method,” vol. 1212, strategy,” vol. 6, 01 2006, pp. 143–148.
12 2012. [60] G. Bontempi, “Long term time series prediction with multi-input multi-
[40] D. P. Kingma and J. Ba, “Adam: A method for stochastic optimization,” output local learning,” Proceedings of the 2nd European Symposium on
CoRR, vol. abs/1412.6980, 2014. Time Series Prediction (TSP), ESTSP08, 01 2008.
[41] S. Chen, D. Yu, and A. Moghaddamjo, “Weather sensitive short-term [61] S. B. Taieb, G. Bontempi, A. Sorjamaa, and A. Lendasse, “Long-term
load forecasting using nonfully connected artificial neural network,” prediction of time series by combining direct and mimo strategies,”
IEEE Transactions on Power Systems (Institute of Electrical and Elec- 2009 International Joint Conference on Neural Networks, pp. 3054–
tronics Engineers); (United States), no. 3, 8 1992. 3061, 2009.
[42] K. He, X. Zhang, S. Ren, and J. Sun, “Deep residual learning for image [62] F. M. Bianchi, E. D. Santis, A. Rizzi, and A. Sadeghian, “Short-term
recognition,” in 2016 IEEE Conference on Computer Vision and Pattern electric load forecasting using echo state networks and pca decomposi-
Recognition, CVPR 2016, Las Vegas, NV, USA, June 27-30, 2016, 2016, tion,” IEEE Access, vol. 3, pp. 1931–1943, 2015.
pp. 770–778. [Online]. Available: https://doi.org/10.1109/CVPR.2016.90 [63] I. Sutskever, O. Vinyals, and Q. V. Le, “Sequence to sequence learning
[43] J. L. Elman, “Finding structure in time,” COGNITIVE SCIENCE, vol. 14, with neural networks,” in Advances in Neural Information Processing
no. 2, pp. 179–211, 1990. Systems 27: Annual Conference on Neural Information Processing
[44] S. Hochreiter and J. Schmidhuber, “Long short-term memory,” Neural Systems 2014, December 8-13 2014, Montreal, Quebec, Canada,
Computation, vol. 9, no. 8, pp. 1735–1780, 1997. [Online]. Available: 2014, pp. 3104–3112. [Online]. Available: http://papers.nips.cc/paper/
https://doi.org/10.1162/neco.1997.9.8.1735 5346-sequence-to-sequence-learning-with-neural-networks
[45] K. Cho, B. van Merrienboer, Ç. Gülçehre, D. Bahdanau, F. Bougares, [64] D. Bahdanau, K. Cho, and Y. Bengio, “Neural machine translation by
H. Schwenk, and Y. Bengio, “Learning phrase representations jointly learning to align and translate,” CoRR, vol. abs/1409.0473, 2014.
using RNN encoder-decoder for statistical machine translation,” [65] Y. Wu, M. Schuster, Z. Chen, Q. V. Le, M. Norouzi, W. Macherey,
in Proceedings of the 2014 Conference on Empirical Methods M. Krikun, Y. Cao, Q. Gao, K. Macherey, J. Klingner, A. Shah,

M. Johnson, X. Liu, u. Kaiser, S. Gouws, Y. Kato, T. Kudo, H. Kazawa, [85] A. van den Oord, S. Dieleman, H. Zen, K. Simonyan, O. Vinyals,
and J. Dean, “Google’s neural machine translation system: Bridging the A. Graves, N. Kalchbrenner, A. Senior, and K. Kavukcuoglu, “Wavenet:
gap between human and machine translation,” 09 2016. A generative model for raw audio,” in Arxiv, 2016. [Online]. Available:
[66] A. Graves, A. Mohamed, and G. Hinton, “Speech recognition with deep https://arxiv.org/abs/1609.03499
recurrent neural networks,” in 2013 IEEE International Conference on [86] J. Gehring, M. Auli, D. Grangier, D. Yarats, and Y. N. Dauphin,
Acoustics, Speech and Signal Processing, May 2013, pp. 6645–6649. “Convolutional sequence to sequence learning,” arXiv preprint
[67] J. K. Chorowski, D. Bahdanau, D. Serdyuk, K. Cho, and Y. Bengio, arXiv:1705.03122, 2017.
“Attention-based models for speech recognition,” in Advances in Neural [87] A. Borovykh, S. Bohte, and K. Oosterlee, “Conditional time series
Information Processing Systems 28, C. Cortes, N. D. Lawrence, D. D. forecasting with convolutional neural networks,” in Lecture Notes in
Lee, M. Sugiyama, and R. Garnett, Eds. Curran Associates, Inc., Computer Science/Lecture Notes in Artificial Intelligence, Sep. 2017,
2015, pp. 577–585. [Online]. Available: http://papers.nips.cc/paper/ pp. 729–730.
5847-attention-based-models-for-speech-recognition.pdf [88] V. Dumoulin and F. Visin, “A guide to convolution arithmetic for
[68] D. Bahdanau, J. Chorowski, D. Serdyuk, P. Brakel, and Y. Bengio, deep learning,” CoRR, vol. abs/1603.07285, 2016. [Online]. Available:
“End-to-end attention-based large vocabulary speech recognition,” in http://arxiv.org/abs/1603.07285
2016 IEEE International Conference on Acoustics, Speech and Signal [89] S. Bai, J. Z. Kolter, and V. Koltun, “An empirical evaluation of generic
Processing (ICASSP), March 2016, pp. 4945–4949. convolutional and recurrent networks for sequence modeling,” arXiv
[69] K. Xu, J. Ba, R. Kiros, K. Cho, A. Courville, R. Salakhudinov, preprint arXiv:1803.01271, 2018.
R. Zemel, and Y. Bengio, “Show, attend and tell: Neural image [90] F. Chollet et al., “Keras,” https://github.com/fchollet/keras, 2015.
caption generation with visual attention,” in Proceedings of the 32nd [91] M. Abadi, A. Agarwal, P. Barham, E. Brevdo, Z. Chen, C. Citro, G. S.
International Conference on Machine Learning, ser. Proceedings of Corrado, A. Davis, J. Dean, M. Devin, S. Ghemawat, I. Goodfellow,
Machine Learning Research, F. Bach and D. Blei, Eds., vol. 37. Lille, A. Harp, G. Irving, M. Isard, Y. Jia, R. Jozefowicz, L. Kaiser,
France: PMLR, 07–09 Jul 2015, pp. 2048–2057. [Online]. Available: M. Kudlur, J. Levenberg, D. Mané, R. Monga, S. Moore, D. Murray,
http://proceedings.mlr.press/v37/xuc15.html C. Olah, M. Schuster, J. Shlens, B. Steiner, I. Sutskever, K. Talwar,
[70] R. J. Williams and D. Zipser, “A learning algorithm for continually P. Tucker, V. Vanhoucke, V. Vasudevan, F. Viégas, O. Vinyals,
running fully recurrent neural networks,” Neural Computation, vol. 1, P. Warden, M. Wattenberg, M. Wicke, Y. Yu, and X. Zheng,
pp. 270–280, 1989. “TensorFlow: Large-scale machine learning on heterogeneous systems,”
[71] M. Ranzato, S. Chopra, M. Auli, and W. Zaremba, “Sequence level 2015, software available from tensorflow.org. [Online]. Available:
training with recurrent neural networks,” CoRR, vol. abs/1511.06732, https://www.tensorflow.org/
2015. [92] C. M. Bishop, Pattern Recognition and Machine Learning (Information
[72] S. Bengio, O. Vinyals, N. Jaitly, and N. Shazeer, “Scheduled Science and Statistics). Berlin, Heidelberg: Springer-Verlag, 2006.
sampling for sequence prediction with recurrent neural networks,” [93] S. Li, J. Jiao, Y. Han, and T. Weissman, “Demystifying resnet,” arXiv
in Proceedings of the 28th International Conference on Neural preprint arXiv:1611.01186, 2016.
Information Processing Systems - Volume 1, ser. NIPS’15. Cambridge, [94] S. Ioffe and C. Szegedy, “Batch normalization: Accelerating deep
MA, USA: MIT Press, 2015, pp. 1171–1179. [Online]. Available: network training by reducing internal covariate shift,” 2015, pp.
http://dl.acm.org/citation.cfm?id=2969239.2969370 448–456. [Online]. Available: http://jmlr.org/proceedings/papers/v37/
[73] A. Lamb, A. Goyal, Y. Zhang, S. Zhang, A. C. Courville, and Y. Bengio, ioffe15.pdf
“Professor forcing: A new algorithm for training recurrent networks,” in [95] A. Marinescu, C. Harris, I. Dusparic, S. Clarke, and V. Cahill, “Residen-
NIPS, 2016. tial electrical demand forecasting in very small scale: An evaluation of
forecasting methods,” in 2013 2nd International Workshop on Software
[74] Y. Keneshloo, T. Shi, N. Ramakrishnan, and C. K. Reddy, “Deep
Engineering Challenges for the Smart Grid (SE4SG), May 2013, pp.
reinforcement learning for sequence to sequence models,” CoRR, vol.
25–32.
abs/1805.09461, 2018.
[75] H. Wilms, M. Cupelli, and A. Monti, “Combining auto-regression with
exogenous variables in sequence-to-sequence recurrent neural networks
for short-term load forecasting,” 07 2018, pp. 673–679.
[76] Y. Lecun, L. Bottou, Y. Bengio, and P. Haffner, “Gradient-based learning
applied to document recognition,” in Proceedings of the IEEE, 1998, pp.
2278–2324.
[77] A. Krizhevsky, I. Sutskever, and G. E. Hinton, “Imagenet classification
with deep convolutional neural networks,” in Advances in Neural
Information Processing Systems 25, F. Pereira, C. J. C. Burges,
L. Bottou, and K. Q. Weinberger, Eds. Curran Associates, Inc.,
2012, pp. 1097–1105. [Online]. Available: http://papers.nips.cc/paper/
4824-imagenet-classification-with-deep-convolutional-neural-networks.
pdf
[78] K. Simonyan and A. Zisserman, “Very deep convolutional networks
for large-scale image recognition,” CoRR, vol. abs/1409.1556, 2014.
[Online]. Available: http://arxiv.org/abs/1409.1556
[79] C. Szegedy, W. Liu, Y. Jia, P. Sermanet, S. Reed, D. Anguelov, D. Erhan,
V. Vanhoucke, and A. Rabinovich, “Going deeper with convolutions,”
in The IEEE Conference on Computer Vision and Pattern Recognition
(CVPR), June 2015.
[80] R. Girshick, “Fast r-cnn,” in Proceedings of the IEEE international
conference on computer vision, 2015, pp. 1440–1448.
[81] S. Ren, K. He, R. Girshick, and J. Sun, “Faster r-cnn: Towards real-time
object detection with region proposal networks,” in Advances in neural
information processing systems, 2015, pp. 91–99.
[82] M. Jaderberg, K. Simonyan, A. Zisserman et al., “Spatial transformer
networks,” in Advances in neural information processing systems, 2015,
pp. 2017–2025.
[83] R. Dahl, M. Norouzi, and J. Shlens, “Pixel recursive super resolution,”
arXiv preprint arXiv:1702.00783, 2017.
[84] C. Ledig, L. Theis, F. Huszár, J. Caballero, A. Cunningham, A. Acosta,
A. Aitken, A. Tejani, J. Totz, Z. Wang et al., “Photo-realistic single
image super-resolution using a generative adversarial network,” in 2017
IEEE Conference on Computer Vision and Pattern Recognition (CVPR).
IEEE, 2017, pp. 105–114.
