Demystifying Deep Learning2

NNS AND DEEP LEARNING Dr V Vella
AGENDA
Quick refresher on Gradient Descent and Probabilistic Perspectives
Differentiation Methods and Autodiff
Computational Graphs
Deep Networks – “Computational Graph” Architecture
Recurrent Neural Networks
Example of Algo Trading
SIMPLE REGRESSION
Hypothesis Function:
Cost Function:
GRADIENT DESCENT
Model training:
PROBABILISTIC INTERPRETATION
Let us assume that the target variables and the inputs are related via the equation
$$y^{(i)} = \theta^T x^{(i)} + \epsilon^{(i)},$$
where the error term $\epsilon^{(i)}$ captures either unmodeled effects or random noise. Let us further
assume that the error terms are distributed IID according to a Gaussian distribution
with mean zero and some variance $\sigma^2$.
The probability of the data is given by $p(y \mid X; \theta)$.
This quantity is typically viewed as a function of y (and perhaps X), for a fixed value of θ. When
we wish to explicitly view this as a function of θ, we will instead call it the likelihood function:
$$L(\theta) = L(\theta; X, y) = p(y \mid X; \theta)$$
The principle of maximum likelihood says that we should choose θ so as to make the data as
probable as possible, i.e., we should choose θ to maximize L(θ).
The derivation is simpler if we instead maximize the log likelihood ℓ(θ) = log L(θ).
Maximizing ℓ(θ) gives the same answer as minimizing, over the m training examples,
$$\frac{1}{2} \sum_{i=1}^{m} \left( y^{(i)} - \theta^T x^{(i)} \right)^2,$$
so least-squares regression corresponds to finding the maximum likelihood estimate of θ.
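The step from the Gaussian noise assumption to the least-squares cost can be written out explicitly. This is a sketch of the standard derivation, assuming m training examples indexed by a superscript (i) and a linear model $\theta^T x^{(i)}$; that notation is an assumption added here.

```latex
\begin{align*}
L(\theta) &= \prod_{i=1}^{m} \frac{1}{\sqrt{2\pi}\,\sigma}
  \exp\!\left( -\frac{\left( y^{(i)} - \theta^T x^{(i)} \right)^2}{2\sigma^2} \right) \\
\ell(\theta) &= \log L(\theta)
  = m \log \frac{1}{\sqrt{2\pi}\,\sigma}
  - \frac{1}{\sigma^2} \cdot \frac{1}{2} \sum_{i=1}^{m} \left( y^{(i)} - \theta^T x^{(i)} \right)^2
\end{align*}
```

The first term does not depend on θ, and $1/\sigma^2$ is a positive constant, so maximizing ℓ(θ) amounts to minimizing the sum of squared errors. The result does not depend on the value of $\sigma^2$.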
DIFFERENTIATION METHODS - AUTODIFF
In mathematics and computer algebra, automatic differentiation (AD), also called
algorithmic differentiation or computational differentiation, is a set of techniques
to numerically evaluate the derivative of a function specified by a computer
program.
Backpropagation refers to the whole process of training an artificial neural network
using multiple backpropagation steps, each of which computes gradients and uses
them to perform a Gradient Descent step. In contrast, autodiff is simply a
technique for computing gradients efficiently, and it happens to be used by
backpropagation.
TensorFlow uses automatic differentiation, more specifically reverse-mode
automatic differentiation.
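To make reverse-mode autodiff concrete, here is a minimal sketch in plain Python. It is not how TensorFlow implements it: the `Var` class, the `sin` wrapper and the `backward` function are hypothetical helpers written for illustration, and only addition, multiplication and sine are supported.

```python
import math


class Var:
    """A scalar that remembers how it was computed (hypothetical helper)."""

    def __init__(self, value, parents=()):
        self.value = value
        self.parents = parents  # pairs of (parent Var, local derivative)
        self.grad = 0.0

    def __add__(self, other):
        return Var(self.value + other.value, ((self, 1.0), (other, 1.0)))

    def __mul__(self, other):
        return Var(self.value * other.value,
                   ((self, other.value), (other, self.value)))


def sin(x):
    return Var(math.sin(x.value), ((x, math.cos(x.value)),))


def backward(output):
    """Reverse pass: accumulate d(output)/d(node) from the output to the inputs."""
    order, seen = [], set()

    def visit(node):
        # Post-order traversal: every node is listed after its parents.
        if id(node) not in seen:
            seen.add(id(node))
            for parent, _ in node.parents:
                visit(parent)
            order.append(node)

    visit(output)
    output.grad = 1.0
    # Reversed order guarantees a node is final before it feeds its parents.
    for node in reversed(order):
        for parent, local in node.parents:
            parent.grad += node.grad * local


x = Var(2.0)
y = Var(3.0)
f = x * y + sin(x)  # f(x, y) = x*y + sin(x)
backward(f)
print(x.grad, y.grad)  # df/dx = y + cos(x), df/dy = x
```

A single backward pass yields the derivative of the output with respect to every input, which is why reverse mode suits functions with many parameters and one scalar output, such as a loss.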
NUMERICAL DIFFERENTIATION
The simplest solution is to compute an approximation of the derivatives, numerically.
Recall the following derivative equation:
$$f'(x) = \lim_{h \to 0} \frac{f(x + h) - f(x)}{h}$$
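A minimal sketch of numerical differentiation, assuming a small fixed step `eps` and using the central-difference variant of the limit definition. The function names and the test function `f` are illustrative choices.

```python
def numerical_gradient(f, point, eps=1e-6):
    """Approximate the gradient of f at `point` with central differences."""
    grads = []
    for i in range(len(point)):
        up = list(point)
        down = list(point)
        up[i] += eps
        down[i] -= eps
        # Two extra evaluations of f are needed for every input dimension.
        grads.append((f(up) - f(down)) / (2 * eps))
    return grads


def f(p):
    x, y = p
    return x ** 2 * y + y + 2


grad = numerical_gradient(f, [3.0, 4.0])
print(grad)  # close to the exact gradient [2*x*y, x**2 + 1] = [24, 10]
```

The result is only an approximation, and the cost grows with the number of parameters, since each partial derivative needs its own function evaluations. Both points motivate automatic differentiation for large models.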
AUTOGRAD
Autograd is a lightweight automatic differentiation system written by Dougal Maclaurin, David
Duvenaud, Matt Johnson, and Jamie Townsend.
More recently, JAX has brought Autograd and XLA together for high-performance
machine learning research. Available at https://github.com/google/jax
Similar autodiff functionality exists in TensorFlow and PyTorch.
AUTOGRAD EXAMPLE – LOGISTIC REGRESSION
COMPUTATIONAL GRAPHS
A computational graph is a directed graph where the nodes correspond to
operations or variables. Variables can feed their value into operations, and
operations can feed their output into other operations. This way, every node in the
graph defines a function of the variables.
BREAKING OPERATIONS INTO PRIMITIVE OPERATIONS
COMPUTATIONAL GRAPHS AND DERIVATIVES
Consider the following computational graphs:
We can evaluate the expression by setting the input variables to certain values and
computing nodes up through the graph. For example, let’s set a=2 and b=1:
To understand derivatives in a computational graph, the key is to understand the derivatives
on the edges. If a directly affects c, we want to know how it affects c: if a changes a
little, by what factor does c change?
We call this the partial derivative of c with respect to a.
To evaluate the partial derivatives in this graph, we need the sum rule and the product rule:
$$\frac{\partial}{\partial a}(a + b) = \frac{\partial a}{\partial a} + \frac{\partial b}{\partial a} = 1, \qquad \frac{\partial}{\partial u}(uv) = u \frac{\partial v}{\partial u} + v \frac{\partial u}{\partial u} = v$$
Below, the graph has the derivative on each edge labelled.
What if we want to understand how nodes that aren’t directly connected affect each other?
Let’s consider how e is affected by a. If we change a at a speed of 1, c also changes at a
speed of 1. In turn, c changing at a speed of 1 causes e to change at a speed of 2. So e
changes at a rate of 1 × 2 with respect to a.
The general rule is to sum over all possible paths from one node to the other, multiplying
the derivatives on each edge of the path together. For example, to get the derivative
of e with respect to b we get:
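A minimal sketch of the path-sum rule, assuming the graph is c = a + b, d = b + 1, e = c * d. That structure is an assumption: it is consistent with the rates quoted above for a = 2 and b = 1, but the intermediate node d and the `edges` dictionary layout are introduced here for illustration.

```python
# Assumed graph (not spelled out in the text): c = a + b, d = b + 1, e = c * d
a, b = 2.0, 1.0

# Forward pass: evaluate every node from the inputs up.
c = a + b
d = b + 1.0
e = c * d

# Local derivative on each edge, keyed as (from_node, to_node).
edges = {
    ("a", "c"): 1.0,  # sum rule
    ("b", "c"): 1.0,  # sum rule
    ("b", "d"): 1.0,  # sum rule
    ("c", "e"): d,    # product rule: d(c*d)/dc = d
    ("d", "e"): c,    # product rule: d(c*d)/dd = c
}


def path_sum(src, dst):
    """Sum, over all paths src -> dst, of the product of edge derivatives."""
    if src == dst:
        return 1.0
    return sum(w * path_sum(mid, dst)
               for (start, mid), w in edges.items() if start == src)


de_da = path_sum("a", "e")
de_db = path_sum("b", "e")
print(e, de_da, de_db)
```

Under these assumptions, e = 6, the derivative of e with respect to a is 1 × 2 = 2, and the derivative with respect to b sums two paths: 1 × 2 + 1 × 3 = 5.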
COMPUTATIONAL GRAPH
SIGMOID FUNCTION
COMPUTATIONAL GRAPH - VECTORIZED
JACOBIANS AND HESSIANS
COMPUTATIONAL GRAPH
See the example in the Jupyter notebook.
SOFTMAX REGRESSION
SOFTMAX FUNCTION
LOG-LOSS
FINDING WEIGHTS USING GRADIENT DESCENT
NN MATH – THE “TRADITIONAL WAY”
NNS – COMPUTATIONAL GRAPH APPROACH
PROCESSING INPUT
DERIVATIVES AT EACH NODE
STATE AFTER BACKWARD PASS
DERIVATIVE COMPUTATION COMPLETE
GRADIENTS FOR PARAMETER UPDATES
PARAMETER UPDATE
MATRICES ALL THE WAY …
The example uses images from the MNIST dataset, which has 10 classes (the digits
0 to 9). The implemented network has 2 layers after the input:
the first one with 200 hidden units (neurons) and the second
one (also known as the classifier layer) with 10 neurons, one per
class.
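The matrix shapes for this network can be worked out with a short sketch. It assumes each 28×28 MNIST image is flattened into 784 inputs and that both layers are fully connected with a bias per neuron; the helper names are hypothetical.

```python
# 784 inputs (28x28 pixels, flattened), 200 hidden units, 10 output classes.
layer_sizes = [784, 200, 10]


def layer_shapes(sizes):
    """Return (weight_shape, bias_shape) for each fully connected layer."""
    return [((n_in, n_out), (n_out,)) for n_in, n_out in zip(sizes, sizes[1:])]


def count_parameters(sizes):
    """Weights plus biases, summed over all layers."""
    return sum(n_in * n_out + n_out for n_in, n_out in zip(sizes, sizes[1:]))


shapes = layer_shapes(layer_sizes)
total = count_parameters(layer_sizes)
print(shapes)
print(total)
```

Under these assumptions the first weight matrix is 784×200 and the second is 200×10, for 159,010 trainable parameters in total.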
BUILDING YOUR OWN TENSORFLOW!
Follow and work through this public Jupyter Notebook.
At this point you should have all the necessary background to understand every step.
Recommended for anyone who wants a hands-on exercise to better understand
the architecture of deep NN libraries like TensorFlow.
http://www.deepideas.net/deep-learning-from-scratch-i-computational-graphs/
RECURRENT NEURAL NETWORKS
Motivation
Sequence
Sequential Memory
A recurrent neural network looks very much like a feedforward neural network,
except it also has connections pointing backward.
Let’s look at the simplest possible RNN, composed of just one neuron receiving inputs,
producing an output, and sending that output back to itself:
You can easily create a layer of recurrent neurons. At each time step t, every neuron
receives both the input vector x(t) and the output vector from the previous time step
y(t–1), as shown in the figure. Note that both the inputs and outputs are vectors now.
Each recurrent neuron has two sets of weights: one for the inputs x(t) and the other for
the outputs of the previous time step, y(t–1). Let’s call these weight vectors Wx and
Wy. The output of a single recurrent neuron can be computed pretty much as you
might expect, as shown in the equation below (b is the bias term and ϕ(·) is the activation
function, e.g., ReLU).
$$y_{(t)} = \phi\left( x_{(t)}^T W_x + y_{(t-1)}^T W_y + b \right)$$
Just like for feedforward neural networks, we can compute a whole layer’s output in
one shot for a whole mini-batch using a vectorized form of the previous equation, where
each row of X(t) is one instance and Wx, Wy now hold the weights of all neurons in the layer:
$$Y_{(t)} = \phi\left( X_{(t)} W_x + Y_{(t-1)} W_y + b \right)$$
MEMORY CELLS
Since the output of a recurrent neuron at time step t is a function of all the inputs from
previous time steps, you could say it has a form of memory.
A part of a neural network that preserves some state across time steps is called a
memory cell (or simply a cell).
In general a cell’s state at time step t, denoted h(t) (the “h” stands for “hidden”), is a
function of some inputs at that time step and its state at the previous time step:
h(t) = f(h(t–1), x(t)). Its output at time step t, denoted y(t), is also a function of the previous
state and the current inputs.
DIFFERENT STRUCTURES
Sequence to Sequence (top left),
Sequence to Vector (top right),
Vector to Sequence (bottom left),
Delayed Sequence to Sequence (bottom right)
BASIC RNN USING TENSORFLOW (RAW)
We will create an RNN composed of a layer of five recurrent neurons (figure below),
using the tanh activation function. We will assume that the RNN runs over only two
time steps, taking input vectors of size 3 at each time step.
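The setup described above can be sketched as follows. This is plain Python rather than TensorFlow, written as an assumption-laden illustration: the helper functions, the random weight initialisation, the zero initial state and the mini-batch values are all made up for the example.

```python
import math
import random

n_inputs, n_neurons = 3, 5
rng = random.Random(0)  # fixed seed so the sketch is reproducible


def random_matrix(rows, cols):
    return [[rng.gauss(0.0, 1.0) for _ in range(cols)] for _ in range(rows)]


def matmul(A, B):
    """Plain matrix product of two lists of rows."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]


def rnn_step(X, Y_prev, Wx, Wy, b):
    """One time step: Y(t) = tanh(X(t) Wx + Y(t-1) Wy + b) for a mini-batch."""
    XW = matmul(X, Wx)
    YW = matmul(Y_prev, Wy)
    return [[math.tanh(xw + yw + bj) for xw, yw, bj in zip(r1, r2, b)]
            for r1, r2 in zip(XW, YW)]


Wx = random_matrix(n_inputs, n_neurons)   # input weights, shape 3 x 5
Wy = random_matrix(n_neurons, n_neurons)  # recurrent weights, shape 5 x 5
b = [0.0] * n_neurons

# Mini-batch of 4 instances with 3 features each, at t = 0 and t = 1.
X0 = [[0, 1, 2], [3, 4, 5], [6, 7, 8], [9, 0, 1]]
X1 = [[9, 8, 7], [0, 0, 0], [6, 5, 4], [3, 2, 1]]

Y_init = [[0.0] * n_neurons for _ in X0]  # no previous output at t = 0
Y0 = rnn_step(X0, Y_init, Wx, Wy, b)
Y1 = rnn_step(X1, Y0, Wx, Wy, b)
print(len(Y1), len(Y1[0]))  # 4 instances, 5 outputs each
```

The same weights Wx, Wy and b are reused at both time steps. Only the inputs and the previous outputs change from one step to the next.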
MOTIVATION FOR BETTER MODELS
Vanishing Gradient
Imagine a model that takes one word of a sentence as input at each step.
At each step, we feed in the word for that step together with the hidden state from the
previous step.
MOTIVATION FOR BETTER MODELS
Vanishing Gradient
Short-term memory is caused by the infamous vanishing gradient
problem, which is also prevalent in other neural network
architectures.
Short-term memory and the vanishing gradient are due to the
nature of backpropagation.
The gradient is the value used to adjust the network’s internal weights, allowing the network to learn. The bigger
the gradient, the bigger the adjustments, and vice versa. When doing backpropagation, each node in a layer
calculates its gradient with respect to the effects of the gradients in the layer before it. So if the adjustments to
the layers before it are small, then the adjustments to the current layer will be even smaller.
That causes gradients to shrink exponentially as they propagate back. The earlier layers fail to do any learning
because their internal weights are barely adjusted, due to extremely small gradients. That is the vanishing
gradient problem.
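The exponential shrinkage can be illustrated with a small sketch. It assumes a chain of sigmoid units with weight 1 and pre-activation 0, where the local derivative takes its maximum value of 0.25; these values and the function names are hypothetical simplifications, not a model of a real network.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def sigmoid_derivative(z):
    s = sigmoid(z)
    return s * (1.0 - s)  # never larger than 0.25


def backpropagated_factor(n_layers, z=0.0, weight=1.0):
    """Product of local derivatives along a chain of n_layers sigmoid units."""
    factor = 1.0
    for _ in range(n_layers):
        factor *= weight * sigmoid_derivative(z)
    return factor


for n in (1, 5, 10, 20):
    print(n, backpropagated_factor(n))
```

Even in this best case for the sigmoid, the factor scaling the gradient is 0.25 per layer, so after 20 layers (or 20 time steps in an unrolled RNN) it is below 1e-11.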
LSTM’S AND GRU’S
To mitigate short-term memory, two specialized recurrent neural networks were
created: Long Short-Term Memory networks (LSTMs) and Gated
Recurrent Units (GRUs).
LSTMs and GRUs essentially function just like RNNs, but they are capable of learning
long-term dependencies using mechanisms called “gates.”
These gates are different tensor operations that can learn what information to add to or
remove from the hidden state.
LSTM
The core concepts of an LSTM are the cell state and its various gates.
The cell state acts as a transport highway that carries relevant information all the way down
the sequence chain. You can think of it as the “memory” of the network.
The gates are different neural networks that decide which information is allowed on the cell
state. During training, the gates learn what information is relevant to keep or forget.
LSTM
Gates contain sigmoid activations. A sigmoid activation is similar to the tanh activation,
but instead of squishing values between -1 and 1, it squishes values between 0 and 1.
That is helpful for updating or forgetting data, because any number multiplied by 0 is 0,
causing values to disappear or be “forgotten.” Any number multiplied by 1 is unchanged, so
that value stays the same or is “kept.” The network can learn which data is not important
and can be forgotten, and which data is important to keep.
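A minimal sketch of this gating effect, assuming hand-picked gate pre-activations instead of learned ones; `apply_gate` and the example values are hypothetical.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def apply_gate(values, gate_logits):
    """Multiply each value by a sigmoid gate between 0 (forget) and 1 (keep)."""
    gate = [sigmoid(z) for z in gate_logits]
    return [g * v for g, v in zip(gate, values)]


state = [2.0, -1.5, 0.7]
gate_logits = [-20.0, 20.0, 0.0]  # strongly closed, strongly open, half open
gated = apply_gate(state, gate_logits)
print(gated)
```

The first value is driven to almost 0 ("forgotten"), the second passes through almost unchanged ("kept"), and the third is halved because the sigmoid of 0 is 0.5.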
EFFECT OF GATES
To update the cell state, we have the input gate.
First, we pass the previous hidden state and the current input into a sigmoid function. That
decides which values will be updated by transforming them to lie between 0 and 1, where 0
means not important and 1 means important. We also pass the hidden state and the current
input into the tanh function to squish values between -1 and 1, which helps regulate the
network.
Then we multiply the tanh output by the sigmoid output. The sigmoid output decides which
information from the tanh output is important to keep.
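The input-gate computation described above can be sketched as follows. The weight matrices, biases and vector sizes are made-up values for illustration, and `dense` and `input_gate_update` are hypothetical helper names. A real LSTM learns these weights during training.

```python
import math


def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))


def dense(W, v, b):
    """Affine map W v + b, with W given as a list of rows."""
    return [sum(w * x for w, x in zip(row, v)) + bi for row, bi in zip(W, b)]


def input_gate_update(h_prev, x, W_i, b_i, W_c, b_c):
    hx = h_prev + x  # concatenate previous hidden state and current input
    gate = [sigmoid(z) for z in dense(W_i, hx, b_i)]         # values in (0, 1)
    candidate = [math.tanh(z) for z in dense(W_c, hx, b_c)]  # values in (-1, 1)
    # Element-wise product: the gate scales each candidate value.
    return [g * c for g, c in zip(gate, candidate)]


h_prev = [0.5, -0.2]  # previous hidden state (2 units)
x = [1.0]             # current input (1 feature)
W_i = [[0.1, 0.2, 0.3], [0.0, -0.4, 0.5]]
b_i = [0.0, 0.0]
W_c = [[0.6, -0.1, 0.2], [0.3, 0.3, -0.7]]
b_c = [0.0, 0.0]

update = input_gate_update(h_prev, x, W_i, b_i, W_c, b_c)
print(update)
```

In a full LSTM cell this gated candidate is what gets added to the cell state.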
LSTM
THANK YOU …
Many papers applying LSTMs and GRUs to financial time series are available
online.
