
FOUNDATION TO DATA SCIENCE

Business Analytics

MODULE II-Unit:3
Neural Networks for Time Series Forecasting
Prof. Dr. George Mathew
B.Sc., B.Tech, PGDCA, PGDM, MBA, PhD
Time Series Forecasting
2.3.1 What is Time-Series analysis
2.3.1.1 Decomposition
2.3.2 Correlation and Causation
2.3.3 Autocorrelation
2.3.4 Forecasting Data
2.3.5 Autoregressive Modeling, Moving averages and ARIMA
2.3.6 Neural networks for Time series data
Neural networks
for
Time series Forecasting
An Introduction to Neural Networks
Artificial neural networks are popular machine learning techniques that simulate
the mechanism of learning in biological organisms. The human nervous system
contains cells, which are referred to as neurons. The neurons are connected to one
another with the use of axons and dendrites, and the connecting regions between
axons and dendrites are referred to as synapses. These connections are illustrated in
Figure 1.1(a). The strengths of synaptic connections often change in response to
external stimuli. This change is how learning takes place in living organisms.
An Introduction to Neural Networks
This biological mechanism is simulated in artificial neural networks, which contain
computation units that are referred to as neurons. Throughout this book, we will use the
term “neural networks” to refer to artificial neural networks rather than biological ones.
The computational units are connected to one another through weights, which serve the
same role as the strengths of synaptic connections in biological organisms. Each input to a
neuron is scaled with a weight, which affects the function computed at that unit. This
architecture is illustrated in Figure 1.1(b). An artificial neural network computes a function
of the inputs by propagating the computed values from the input neurons to the output
neuron(s) and using the weights as intermediate parameters. Learning occurs by changing
the weights connecting the neurons.
An Introduction to Neural Networks
Figure 1.2: An illustrative comparison of the accuracy of a typical
machine learning algorithm with that of a large neural network.
Deep learners become more attractive than
conventional methods primarily when sufficient data/computational
power is available. Recent years have seen an increase in data
availability and computational power, which has led to a “Cambrian
explosion” in deep learning technology.
The Basic Architecture of Neural Networks
In this section, we will introduce single-layer and multi-layer neural networks. In the
single-layer network, a set of inputs is directly mapped to an output by using a generalized
variation of a linear function. This simple instantiation of a neural network is also referred
to as the perceptron. In multi-layer neural networks, the neurons are arranged in layered
fashion, in which the input and output layers are separated by a group of hidden layers.
This layer-wise architecture of the neural network is also referred to as a feed-forward
network. This section will discuss both single-layer and multi-layer networks.
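As a minimal sketch of the single-layer idea (the weights, bias and inputs below are made-up values, not taken from the text), a perceptron-style unit can be written in Python as a thresholded weighted sum of its inputs:

import numpy as np

x = np.array([0.5, -1.2, 3.0])   # input features
w = np.array([0.4, 0.1, -0.2])   # one trainable weight per input
b = 0.1                          # bias term

# The unit scales each input by its weight, sums the results, adds the bias,
# and applies a sign threshold to produce the output.
output = np.sign(np.dot(w, x) + b)
print(output)   # -1.0 for these particular values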
Neural Networks for Time Series Forecasting
Neural networks (NNs) are used in a variety of business applications, including
forecasting and marketing research. They have been successfully used for
forecasting financial data series.
Here are some types of neural networks for time series forecasting:
Recurrent Neural Networks (RNNs)
A type of artificial neural network (ANN) that has shown accurate results for time
series forecasting. RNNs are made up of a series of interconnected neural
networks at different time intervals. They are the most popular deep learning
technique for time series forecasting. However, they suffer from the vanishing
gradient problem when applied to long sequences.
Convolutional Neural Networks (CNNs)
A type of neural network that was designed to efficiently handle image
data. CNNs can learn and automatically extract features from raw input data,
which can be applied to time series forecasting problems.
LSTM models
A type of recurrent neural network that is especially adept at handling time-series
data. RNNs and LSTMs can capture time dependency in data, which can make
them perform better than other models in tasks such as stock price prediction.
Neural Networks for Time Series Forecasting
Deep learning neural networks are capable of automatically learning and
extracting features from raw and imperfect data. Deep learning supports multiple
inputs and outputs. Deep learning methods offer a lot of promise for time
series forecasting, such as the automatic learning of temporal dependence
and the automatic handling of temporal structures like trends and
seasonality.
Recurrent neural networks—specifically long short-term memory (LSTM)
networks—are good at extracting patterns in input data that span over relatively
long sequences.
Recurrent Neural Networks for Time Series Forecasting – In this section I
will introduce a very popular type of artificial neural networks: recurrent
neural networks, also known as RNNs. Recurrent neural networks are a
class of neural networks that allow previous outputs to be used as inputs
while having hidden states.
How to Develop GRUs and LSTMs for Time Series Forecasting – here, you will learn
how to prepare your data for deep learning models and develop GRU models for a
range of time series forecasting problems.
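As a rough illustration of this data-preparation step (the helper function and toy series below are illustrative, not from the source), a univariate series can be reframed into fixed-length input windows and next-step targets, which is the supervised form that GRU and LSTM layers expect:

import numpy as np

def make_windows(series, n_lags):
    # Turn a univariate series into (samples, n_lags) inputs and next-step targets
    X, y = [], []
    for i in range(len(series) - n_lags):
        X.append(series[i:i + n_lags])
        y.append(series[i + n_lags])
    return np.array(X), np.array(y)

series = np.arange(10, dtype=float)      # toy series: 0, 1, ..., 9
X, y = make_windows(series, n_lags=3)    # X[0] = [0, 1, 2] maps to y[0] = 3

Each input window can then be reshaped to (samples, time steps, features), for example X.reshape((X.shape[0], 3, 1)), before being fed to a recurrent layer.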
Reasons to Add Deep Learning to Your Time Series Toolkit
The goal of machine learning is to find features to train a model that transforms input data
(such as pictures, time series, or audio) to a given output (such as captions, price values, or
transcriptions). Deep learning is a subset of machine learning algorithms that learn to
extract these features by representing input data as vectors and transforming them with a
series of linear algebra operations into a given output. In order to clarify further the
difference between deep learning and machine learning, let's start by defining each of
these two fields of study separately:
Machine learning is the practice of using an algorithm to break up data, learn from it, and
then use what it has learned to make predictions about a certain phenomenon. The learning
process is generally based on the following steps:
1. We feed our algorithm with data.
2. We then use this data to teach our model how to learn from previous observations.
3. We run a test to check if our model has learned enough from previous observations and
we evaluate its performance.
4. If the model is performing well (based on our expectations and requirements), we
deploy and push it into production, making it available to other stakeholders in the
organization or outside the business.
5. Finally, we consume our deployed model to perform a certain automated predictive
task.
Reasons to Add Deep Learning to Your Time Series Toolkit
Deep learning is a subset of machine learning. Deep learning algorithms are specific types
of machine learning algorithms that are based on artificial neural networks. In this case,
the learning process is based on the same steps as those for machine learning, but it is
called deep because the structure of the algorithm is based on artificial neural networks
that consist of multiple input, output, and hidden layers, containing units that transform
the input data into information that the next layer can use to perform a certain automated
predictive task once the deep learning model is deployed.
It is important to compare the two techniques and understand the main differences.
The key differences between machine learning and deep learning
Neural Networks for Time Series Forecasting
Data scientists then evaluate whether the output is what they expected, using an
equation called a loss function. The goal of the process is to use the result of the
loss function from each training input to guide the model to extract features that
will result in a lower loss value on the next pass. Deep learning
neural networks have three main intrinsic capabilities:
• Deep learning neural networks are capable of automatically learning and extracting
features from raw and imperfect data.
• Deep learning supports multiple inputs and outputs.
• Recurrent neural networks—specifically long short-term memory (LSTM) networks and
gated recurrent units (GRUs)—are good at extracting patterns in input data that span
over relatively long sequences.
Deep Learning Algorithms for Time Series Forecasting
Key to the use of deep learning algorithms for time series
forecasting is the choice of multiple input data. We can think about
three main sources of data that can be used as input and mapped to
each forecast lead time for a target variable:
➢ Univariate data, such as lag observations from the target variable
that is being forecasted.
➢ Multivariate data, such as lag observations from other variables (for
example, weather and targets in case of air pollution forecasting
problems).
➢ Metadata, such as data about the date or time being forecast.
This type of data can provide additional insight into historical
patterns, helping create richer data sets and more accurate
forecasts.
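A small pandas sketch of combining these input sources (the column names and the toy hourly index are hypothetical) places lag observations of the target next to calendar metadata in one data set:

import pandas as pd

# Toy hourly series; the column name "load" is illustrative
df = pd.DataFrame({"load": range(48)},
                  index=pd.date_range("2021-01-01", periods=48, freq="H"))

# Univariate input: lag observations of the target variable
df["load_lag1"] = df["load"].shift(1)
df["load_lag24"] = df["load"].shift(24)

# Metadata input: information about the date or time being forecast
df["hour"] = df.index.hour
df["dayofweek"] = df.index.dayofweek

df = df.dropna()   # drop rows where the lag features are undefined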
Recurrent Neural Networks for Time Series Forecasting
Recurrent neural network (RNN), also known as autoassociative or feedback network,
belongs to a class of artificial neural networks where connections between units form a
directed cycle. This creates an internal state of the network, which allows it to exhibit
dynamic temporal behavior.
RNNs can leverage their internal memory to handle sequences of inputs: instead of
mapping each input to an output in isolation, they learn a mapping function from the
inputs over time to an output. RNNs have been shown to achieve state-of-the-art results in
many applications with time series or sequential data, including machine translation and
speech recognition.
In particular, LSTM is a type of RNN architecture that performs particularly well on
different temporal processing tasks, and LSTM networks are able to address the issue of
large time lags in input data successfully. LSTM networks have several nice properties such
as strong prediction performance as well as the ability to capture long-term temporal
dependencies and variable-length observations.
Exploiting the power of customized RNN models along with LSTM models is a promising
avenue to effectively model time series data: in the next few sections of this chapter, you
will see how RNNs, LSTMs, and Python can help data scientists build accurate models for
their time series forecasting solutions.
Deep Learning for Time Series
Deep learning for time series is a relatively new endeavor, but
it’s a promising one. Because deep learning is a highly flexible
technique, it can be advantageous for time series analysis.
Deep learning describes a branch of machine learning in
which a “graph” is built that connects input nodes to a
complicated structure of nodes and edges. In passing from
one node to another via an edge, a value is multiplied
by that edge’s weight and then, usually, passed through some
kind of nonlinear activation function. It is this nonlinear
activation function that makes deep learning so interesting: it
enables us to fit highly complex, nonlinear data, something
that had not been very successfully done previously.
A simple feed forward network.
Time Series Components
Long term trend
Long term trend is the overall general direction of the data, obtained by
ignoring any short-term effects such as seasonal variations or noise.
Seasonality
Seasonality refers to periodic fluctuations that are repeated throughout all
the time series period.
Stationarity
Stationarity is an important characteristic of time series. A time series is
said to be stationary if its mean, variance and covariance don’t have
significant changes over time. There are many transformations that can
extract the stationary part of a non-stationary process.
Noise
Every set of data has noise, which refers to random fluctuations or variations
due to uncontrolled factors.
Autocorrelation
Autocorrelation is the correlation between the time series and a lagged
version of itself, and is used to identify seasonality and trend in time series
data.
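A minimal statsmodels sketch for inspecting these components (the toy monthly series, the seasonal period of 12, and the variable names are all illustrative):

import pandas as pd
from statsmodels.tsa.seasonal import seasonal_decompose
from statsmodels.tsa.stattools import acf, adfuller

y = pd.Series(range(48),
              index=pd.date_range("2018-01-01", periods=48, freq="MS"))

# Split the series into long term trend, seasonality and noise (residual)
decomposition = seasonal_decompose(y, model="additive", period=12)

# Augmented Dickey-Fuller test: a small p-value suggests the series is stationary
p_value = adfuller(y)[1]

# Autocorrelation between the series and lagged versions of itself
autocorrelation = acf(y, nlags=12)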
Time Series Forecasting with
traditional Machine Learning
The most classical Machine Learning models used to solve
this problem are ARIMA models and exponential smoothing.
ARIMA stands for Autoregressive Integrated Moving Average: it combines
autoregressive (AR) and moving average (MA) terms, together with differencing
(the "integrated" part), within a composite model of the time series. This model
is quite simple, but it can give good results. It includes parameters to account for
seasonality, long-term trend, autoregressive and moving
average terms, in order to handle the autocorrelation
embedded in the data.
In exponential smoothing, forecasts are also made on the basis of
weighted averages, as in ARIMA models, but in this case
exponentially decreasing weights are assigned to the
observations, so that less importance is given to observations as
we move further back from the present.
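A brief statsmodels sketch of both approaches (the toy values and the model orders are illustrative, not a tuned model):

import pandas as pd
from statsmodels.tsa.arima.model import ARIMA
from statsmodels.tsa.holtwinters import ExponentialSmoothing

y = pd.Series([112, 118, 132, 129, 121, 135, 148, 148, 136, 119, 104, 118],
              index=pd.date_range("2020-01-01", periods=12, freq="MS"))

# ARIMA(p, d, q): autoregressive order, degree of differencing, moving-average order
arima_forecast = ARIMA(y, order=(1, 1, 1)).fit().forecast(steps=3)

# Exponential smoothing: exponentially decreasing weights on older observations
es_forecast = ExponentialSmoothing(y, trend="add").fit().forecast(3)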
Neural networks for Time series data
Deep Learning Neural Networks Are Capable of
Automatically Learning and Extracting Features from Raw
and Imperfect Data.
Time series is a type of data that measures how things change
over time. In time series, time isn’t just a metric, but a primary axis. This
additional dimension represents both an opportunity and a constraint for
time series data because it provides a source of additional information but
makes time series problems more challenging as specialized handling of
the data is required.
Moreover, this temporal structure can carry additional information,
like trends and seasonality, that data scientists need to deal with in order to
make their time series easier to model with any type of classical forecasting
methods. Neural networks can be useful for time series forecasting
problems by eliminating the immediate need for massive feature
engineering processes, data scaling procedures, and making the data
stationary by differencing.
Convolutional neural networks (CNNs).
Convolution means applying a kernel (a matrix) to a larger matrix by sliding
it across the larger matrix, forming a new one. Each element of the new
matrix is the sum of element-wise multiplication of the kernel and a
subsection of the larger matrix. This kernel is applied repeatedly as it slides
across a matrix/image. This is done with a prespecified number of kernels
so that different features can emerge. A schematic of how this works on
many layers is shown in Figure 10-8.
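A one-dimensional analogue of this sliding-kernel operation, sketched in plain numpy (the series and kernel values are made up):

import numpy as np

series = np.array([1.0, 2.0, 4.0, 7.0, 11.0, 16.0])
kernel = np.array([-1.0, 1.0])   # a simple differencing kernel

# Slide the kernel across the series: each output element is the sum of the
# element-wise multiplication of the kernel and one window of the series
output = np.array([np.sum(kernel * series[i:i + len(kernel)])
                   for i in range(len(series) - len(kernel) + 1)])
# output -> [1. 2. 3. 4. 5.], the first differences of the series

A convolutional layer applies many such kernels, with trainable values, to produce its feature maps.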
Convolutional neural networks (CNNs).
Figure 10-8. A convolutional network. Many two-dimensional windows of
specified kernel size slide across the original image, producing many
feature maps out of the trainable weights applied to the image. Often these
are pooled and otherwise postprocessed with an activation function. The
process is repeated several times over several layers to collapse many
features down into a smaller range of values, ultimately leading to, for
example, classification ratings.
A main feature of convolutions is that all spaces are treated equally. This
makes sense for images but not for time series, where some points in time
are necessarily closer than others.
Convolutional networks are usually structured to be scale invariant so that,
say, a horse can be identified in an image whether it’s a larger or smaller
portion of the image. However, in time series, again, we likely want to
preserve scale and scaled features.
A yearly seasonal oscillation should not be interpreted in the same way or
“triggered” by the same feature selector as a daily oscillation, although in
some contexts this might be helpful.
Convolutional neural networks (CNNs).
Some uses of convolutional networks for time series
applications include:
• Establishing a “fingerprint” for an internet user’s browsing
history, which helps detect anomalous browsing activity
• Identifying anomalous heartbeat patterns from EKG data
• Generating traffic predictions based on past recordings from
multiple locations in a large city
Convolutions are not all that interesting per se to apply to a
univariate time series. Multichannel time series can be more
interesting because then we can develop a 2D (or even 3D)
image where time is only one axis.
Deep Learning for Time Series Forecasting
The use of Deep Learning for Time Series Forecasting overcomes the
traditional Machine Learning disadvantages with many different
approaches. Here, 5 different deep learning architectures for Time Series
Forecasting are presented:
Recurrent Neural Networks (RNNs), which are the most classical and widely used
architecture for Time Series Forecasting problems;
Long Short-Term Memory (LSTM), which is an evolution of RNNs
developed in order to overcome the vanishing gradient problem;
Gated Recurrent Unit (GRU), which is another evolution of RNNs, similar
to LSTM;
Encoder-Decoder Model, which is a model for RNNs introduced in order to
address problems where input sequences differ in length from output
sequences;
Attention Mechanism, which is an evolution of the Encoder-Decoder
Model, developed in order to avoid forgetting the earlier parts of the
sequence.
Recurrent Neural Networks
Recurrent Neural Networks are networks of neuron-like nodes organized
into successive layers, with an architecture similar to the one of standard
Neural Networks. In fact, like in standard Neural Networks, neurons are
divided into an input layer, hidden layers and an output layer. Each connection
between neurons has a corresponding trainable weight.
The difference is that in this case every neuron is assigned to a fixed time
step. The neurons in the hidden layer are also forwarded in a time-dependent
direction, which means that every one of them is fully connected
only with the neurons in the hidden layer with the same assigned time step,
and is connected with a one-way connection to every neuron assigned to
the next time step. The input and output neurons are connected only to the
hidden layers with the same assigned time step.
Since the output of the hidden layer of one time step is part of
the input of the next time step, the activation of the neurons is
computed in time order: at any given time step, only the
neurons assigned to that time step compute their activation.
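A minimal numpy sketch of this recurrence (dimensions and random values are illustrative; U denotes the input-to-hidden weights and W the hidden-to-hidden weights, shared across time steps):

import numpy as np

def rnn_step(x_t, h_prev, U, W, b):
    # One time step: the new hidden state mixes the current input with
    # the hidden state handed over from the previous time step
    return np.tanh(U @ x_t + W @ h_prev + b)

rng = np.random.default_rng(0)
U = rng.normal(size=(2, 3))   # input-to-hidden weights (3 features, 2 hidden units)
W = rng.normal(size=(2, 2))   # hidden-to-hidden weights, shared across time
b = np.zeros(2)

h = np.zeros(2)                           # initial hidden state
for x_t in rng.normal(size=(5, 3)):       # a toy sequence of 5 time steps
    h = rnn_step(x_t, h, U, W, b)         # activations are computed in time order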
Long Short-Term Memory (LSTM)
Long Short-Term Memory Networks (LSTM) have been developed to
overcome the vanishing gradient problem in the standard RNN by improving
the gradient flow within the network. This is achieved by using an LSTM unit in
place of the hidden layer. As shown in the Figure below, an LSTM unit is
composed of:
A cell state, that brings information along the entire sequence and
represents the memory of the network;
A forget gate, that decides what is relevant to keep from previous time
steps;
An input gate, that decides what information is relevant to add from the
current time step;
An output gate, that decides the value of the output at current time step.
Similarly to the RNNs, the input vector at time t is connected to the LSTM
cell of time t by a weight matrix U, the LSTM cell is connected to the
LSTM cells of time t-1 and t+1 by a weight matrix W, and the LSTM cell is
connected to the output vector of time t by a weight matrix V. The
matrices W and U are divided into submatrices (Wf, Wi, Wg, Wo;
Uf, Ui, Ug, Uo) that are connected to different elements of the LSTM unit, as
shown in the Figure below. All the weight matrices are shared across time.
Long Short-Term Memory (LSTM)
The cell state transfers the relevant information during processing, so that
also the information from the previous time steps arrives at each time step,
reducing the effects of short-term memory. During training over all the time
steps, the gates learn which information is important to keep or to forget,
and add them to the cell state, or remove them from it.
In this way the LSTM allows the recovery of data transferred in memory, solving
the vanishing gradient problem. LSTMs are useful for classifying, processing,
and predicting time series with time lags of unknown duration.
Forget Gate
The first gate is the forget gate. This gate decides which information should
be deleted or saved. The information from the previous hidden state and the
information from the current input are passed through the sigmoid function.
An output close to 0 means that the information can be forgotten, while
an output close to 1 means that the information must be saved.
Input Gate
The second gate is the input gate. This is used to update the cell state.
Long Short-Term Memory (LSTM)

Cell State
After the activation of the input gate, the cell state can be
calculated. First, the cell state of the previous time step gets
element-wise multiplied by the output of the forget gate. This
makes it possible to ignore values in the cell state when
they are multiplied by values close to 0. Then the output of
the input gate is element-wise added to the cell state.
Output Gate
The third and final gate is the output gate, which decides the
value of the next hidden state; this hidden state contains information
about previous inputs.
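Putting the gates together, one standard formulation of the LSTM update, written with the submatrices Wf, Wi, Wg, Wo and Uf, Ui, Ug, Uo introduced earlier (bias terms are omitted for brevity, sigmoid denotes the logistic function and ⊙ denotes element-wise multiplication), is:

f(t) = sigmoid(Wf h(t-1) + Uf x(t))        (forget gate)
i(t) = sigmoid(Wi h(t-1) + Ui x(t))        (input gate)
g(t) = tanh(Wg h(t-1) + Ug x(t))           (candidate values)
o(t) = sigmoid(Wo h(t-1) + Uo x(t))        (output gate)
c(t) = f(t) ⊙ c(t-1) + i(t) ⊙ g(t)         (cell state update)
h(t) = o(t) ⊙ tanh(c(t))                   (hidden state / output)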
Long Short-Term Memory (LSTM)
LSTM (short for long short-term memory) primarily
solves the vanishing gradient problem in
backpropagation. LSTMs use a gating mechanism
that controls the memorizing process. Information in
LSTMs can be stored, written, or read via gates that
open and close. These gates store the memory in an
analog format, implementing element-wise
multiplication by sigmoid outputs that range between 0 and 1.
Being analog, and therefore differentiable, this mechanism
is suitable for backpropagation.
Let's look at the architecture of an LSTM.
Long Short-Term Memory (LSTM)
Tanh is a non-linear activation function. It regulates the values flowing
through the network, maintaining the values between -1 and 1. To
avoid information fading, a function is needed whose second derivative
can survive for longer before going to zero. Without such squashing, some
values could become enormous, causing other values to seem insignificant.
For example, a value such as 5 is squashed by tanh so that it remains
within these boundaries.
Sigmoid
Sigmoid belongs to the family of non-linear activation functions. It is
used inside the gates. Unlike tanh, sigmoid maintains values between
0 and 1. It helps the network to update or forget data. If the
multiplication results in 0, the information is considered forgotten.
Similarly, the information stays if the value is 1.
This will help the network learn which data can be forgotten and which
data is important to keep.
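A tiny numeric check of this squashing behaviour in Python (illustrative values only):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

print(np.tanh(5.0))    # ~0.9999: a large value is squashed into (-1, 1)
print(sigmoid(5.0))    # ~0.993:  squashed into (0, 1), i.e. "keep"
print(sigmoid(-5.0))   # ~0.007:  close to 0, i.e. "forget"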
Gated Recurrent Unit (GRU)
The GRU is a new generation of Recurrent Neural Networks and is very
similar to an LSTM. To solve the vanishing gradient problem of a standard
RNN, the GRU uses an update gate and a reset gate. These two gates
decide what information should be passed to the output. These two gates
can be trained to keep information from many time steps before the actual
time step, without washing it through time, or to remove information which
is irrelevant for the prediction. If carefully trained, GRU can perform
extremely well even in complex scenarios.
As shown in the Figure below, a GRU unit is composed of:
• a reset gate, that decides how much of the information from the previous
time steps can be forgotten;
• an update gate, that decides how much of the information from the
previous time steps must be saved;
• a memory, that brings information along the entire sequence and
represents the memory of the network.
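As a sketch, one common formulation of the GRU update (gate subscripts are illustrative, bias terms are omitted, ⊙ denotes element-wise multiplication, and some references swap the roles of z(t) and 1 - z(t)):

z(t) = sigmoid(Wz h(t-1) + Uz x(t))              (update gate)
r(t) = sigmoid(Wr h(t-1) + Ur x(t))              (reset gate)
h~(t) = tanh(Wh (r(t) ⊙ h(t-1)) + Uh x(t))       (candidate memory)
h(t) = (1 - z(t)) ⊙ h(t-1) + z(t) ⊙ h~(t)        (memory / new hidden state)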
Implementation of RNN, LSTM, GRU
RNN, LSTM and GRU can be implemented using the Keras
API, which is designed to be easy to use and customize. The
following 3 RNN layers are present in Keras:
• keras.layers.SimpleRNN
• keras.layers.LSTM
• keras.layers.GRU
They allow you to quickly create recurrent models
without having to make difficult configuration choices.
Moreover, it is possible to define a custom RNN cell layer
with the desired behavior, allowing you to quickly test various
prototypes in a flexible way with minimal code. On
the TensorFlow website it is possible to find instructions and
many examples of the use of these layers.
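As a minimal sketch of how these layers are used (the data shapes, layer sizes and training settings below are illustrative, not a tuned model):

import numpy as np
from tensorflow import keras

# Toy data: 100 samples, 10 time steps, 1 feature
X = np.random.rand(100, 10, 1)
y = np.random.rand(100, 1)

model = keras.Sequential([
    keras.layers.LSTM(32, input_shape=(10, 1)),   # swap in SimpleRNN or GRU here
    keras.layers.Dense(1),                        # one-step-ahead forecast
])
model.compile(optimizer="adam", loss="mse")
model.fit(X, y, epochs=5, batch_size=16, verbose=0)

forecast = model.predict(X[:1])   # forecast from a single input window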
Encoder-Decoder Model
In RNN, LSTM, GRU each input corresponds to an output for the same time
step. However in many real cases we want to predict an output sequence
given an input sequence of different length, without a correspondence
between each input and each output. This situation is called sequence-to-sequence
mapping, and it lies behind numerous commonly used
applications such as language translation, voice-enabled devices
and online chatbots.
The Encoder-Decoder model for Recurrent Neural Networks was
introduced in order to address the sequence-to-sequence mapping models.
An Encoder-Decoder takes a sequence as input and generates the most
probable next sequence as output. As the name suggests, the model is
comprised of two sub-models:
the encoder, that is responsible for stepping through the input time steps
and encoding the entire sequence into a fixed length vector called a context
vector;
the decoder, that is responsible for stepping through the output time steps
while reading from the context vector.
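One common way to sketch such a model in Keras (layer sizes and sequence lengths are illustrative, and this is only one of several possible encoder-decoder formulations) uses RepeatVector to hand the context vector to the decoder at every output time step:

from tensorflow import keras

n_steps_in, n_steps_out, n_features = 10, 5, 1   # illustrative lengths

model = keras.Sequential([
    # Encoder: steps through the input sequence and compresses it into a context vector
    keras.layers.LSTM(64, input_shape=(n_steps_in, n_features)),
    # Repeat the context vector once per output time step
    keras.layers.RepeatVector(n_steps_out),
    # Decoder: steps through the output time steps while reading from the context
    keras.layers.LSTM(64, return_sequences=True),
    # One forecast value per output time step
    keras.layers.TimeDistributed(keras.layers.Dense(n_features)),
])
model.compile(optimizer="adam", loss="mse")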
Encoder-Decoder Model
Encoder
The encoder is a stack of several recurrent units, that can be
simple RNNs, LSTM cells or GRU cells. Each unit accepts a
single element of the input sequence, collects information from
that element and propagates it forward.

The hidden state vector h(t) is computed using the function of
the chosen recurrent unit. The function is applied with the
appropriate weights to the previous hidden state h(t-1) and the
input vector x(t):

h(t) = f(W h(t-1) + U x(t))

The final hidden state vector h(t) contains all the encoded
information from the previous hidden representations and
previous inputs, and serves as the context vector that is passed
to the decoder.
Encoder-Decoder Model
Decoder
The decoder consists of a stack of several recurrent units. Each
recurrent unit accepts a hidden state s(t-1) from the previous
unit and produces an output y^(t) as well as its own hidden
state s(t).
The hidden state s(t) is computed according to the function
of the chosen recurrent unit:

s(t) = f(W s(t-1))

The output y^(t) is computed using the softmax function applied to
the hidden state at the current time step s(t) together with the
respective weight matrix, in order to create a probability vector:

y^(t) = softmax(V s(t))