Richi's Neural Nets Summary


Neural Nets summary

tune a learning algorithm such that instead of attempting to minimize the risk (4)
(impossible because distribution PU,Y is unknown) it tries to minimize the empirical risk
(5) which is just the mean loss averaged over the training examples

The ultimate goal for supervised learning algorithms is to estimate a model f̂ which has a low risk (4), that is, which on average gives low-loss ("good") outputs on "testing" data drawn from the distribution PU,Y.
• The only source of information that the learning algorithm has is the training sample S = (ui, yi)i=1,...,N.
Thus, it appears that the best one can do is to design a learning algorithm which minimizes the empirical risk ("training error") of a function approximator, that is, upon input S the learning algorithm should return the solution of the minimization problem (6):

H = hypothesis space (set of all NNs with a specific connection structure, differing e.g. in their weights)
Flexibility:
a learning algorithm A is more flexible than another learning algorithm B if
A can fit its computed models fˆ more closely to the training data than B can
do

1.3.1 The nonidentical twin curves

x-axis = flexibility range (further right -> more flexible)
y-axis = (empirical) risk
For each flexibility adjustment of a learning algorithm (x-axis), these two curves give the risk (testing error) and the empirical risk (training error) of the model
1.4.1 Using regularization for tuning modeling flexibility
model regularization: one fixes a single model structure and size with a fixed
number of trainable parameters, that is, one fixes H. Structure and size of
the considered model should be rich and large enough to be able to overfit
(!) the available training data. Thus, one can be sure that the “right” model
is contained in the search space H.

The models in H are typically characterized by a set of trainable parameters.


For a fixed neural network structure, this would be the set of all synaptic weights.
We denote this collection of trainable parameters by θ.
A high-flexibility model type would inevitably lead to overfitting if an "optimal" model were learnt using the basic learning equation.
To dampen the exaggerated flexibility of this baseline learning algorithm, one adds a
regularization term (also known as penalty term, or simply regularizer) to the loss function.

A regularization term is a cost function reg: R^M -> R≥0 which penalizes model parameters θ that code models with a high degree of geometrical "wiggliness".

A good penalty term should, of course, (1) assign high penalty values to parameter vectors
θ which represent “wiggly” models; but (2) furthermore it should be easy to compute and
blend well with the algorithm used for empirical risk minimization.
The degree of “oscillativity” can be measured, for instance, by the integral over the
(square of the) second derivative of the polynomial p,
L2-norm regularizer
the squared sum of all model parameters

This regularizer favors models with small absolute parameters, which often amounts to
“geometrically soft” models.
It is called the L2-norm regularizer because it measures the (squared) L2-norm
of the parameter vector θ.

Computing a solution to the minimization task (10) means to find a set of parameters
which simultaneously minimizes the original risk and the penalty term. The factor
α2 in (10) controls how strongly one wishes the regularization to “soften” the solution.
Increasing α (i.e. moving toward lower flexibility on the x-axis in the figure above) means downregulating the model flexibility. For α2 = 0 one returns to the original un-regularized empirical risk (which would likely mean overfitting). [a search]
For α2 -> infinity, the regularization term entirely dominates the model optimization and
one gets a model which does not care anymore about the training data but instead only is
tuned to have minimal regularization penalty.
• In the case of the L2-norm regularizer: all parameters driven to 0 -> completely wiggle-free model
Regularizers are convenient because you can just start with a big (unregularized) flexible model
instead of checking different sizes
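
A minimal sketch of what such a regularized objective could look like in code (the linear model, quadratic loss and α value are illustrative assumptions, not taken from the reader):

```python
import numpy as np

def ridge_objective(theta, U, y, alpha):
    """Empirical risk (mean squared error) of a linear model u -> theta @ u,
    plus the L2-norm regularizer alpha^2 * ||theta||^2 (illustrative example)."""
    predictions = U @ theta                      # model outputs on training inputs
    empirical_risk = np.mean((predictions - y) ** 2)
    penalty = np.sum(theta ** 2)                 # squared L2 norm of parameter vector
    return empirical_risk + alpha ** 2 * penalty
```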
1.4.2 Tuning model flexibility through adding noise
Another way to tune model flexibility is to add noise.
1. Adding noise to the training data

for each ui, add l variants ui + νi1, . . . , ui + νil of this pattern to the training data,
where the νij are i.i.d. (independently, identically distributed) random vectors (for
instance, uniform or Gaussian noise).
This increases the number of training patterns from N to (l+1)N.
The more such noisy variants are added and the stronger the noise, the more one steers toward the left (underfitting) side.
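
A small sketch of this augmentation step (Gaussian noise and the noise scale are assumptions for illustration; the reader allows any i.i.d. noise):

```python
import numpy as np

def augment_with_noise(U, y, l=3, sigma=0.1, rng=np.random.default_rng(0)):
    """For each training input u_i, add l noisy variants u_i + nu_ij with i.i.d.
    Gaussian noise; labels are copied. N patterns become (l+1)*N patterns."""
    noisy_U = [U]                     # keep the original patterns
    noisy_y = [y]
    for _ in range(l):
        noisy_U.append(U + rng.normal(0.0, sigma, size=U.shape))
        noisy_y.append(y)
    return np.concatenate(noisy_U), np.concatenate(noisy_y)
```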
How to estimate the risk of a model
1. Akaike information criterion
2. Cross-Validation
K-fold cross validation:

r = flexibility (here: regularization strength)
• Start with a low-end r and increase it stepwise. For each regularization strength r_j, assess the generalization quality with cross-validation until the validation risk R_val(r_j) starts to rise!
leave-one-out cross-validation
- Used when the data set is small (danger of poor generalizability if the wrong r was used)
- Split the set into singleton subsets (each containing a single training example)

Other uses
cross-validation can similarly be used for tuning hyperparameters of a
learning procedure. The term “hyperparameter” is generally used for all
kinds of “knobs to play around with” when tuning a complex learning
algorithm.
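
A rough sketch of the cross-validation sweep described above, written around a hypothetical pair of functions train(S, r) and risk(model, S) (these names and the stopping rule are assumptions, not from the reader):

```python
import numpy as np

def kfold_validation_risk(S, r, train, risk, K=5):
    """Estimate the validation risk of regularization strength r by K-fold CV:
    train on K-1 folds, evaluate on the held-out fold, average over folds."""
    folds = np.array_split(np.arange(len(S)), K)
    risks = []
    for k in range(K):
        val_idx = folds[k]
        train_idx = np.concatenate([folds[j] for j in range(K) if j != k])
        model = train([S[i] for i in train_idx], r)
        risks.append(risk(model, [S[i] for i in val_idx]))
    return np.mean(risks)

def sweep_regularization(S, r_grid, train, risk):
    """Increase r stepwise and stop once the validation risk starts to rise."""
    best_r, best_risk = None, np.inf
    for r in r_grid:
        v = kfold_validation_risk(S, r, train, risk)
        if v > best_risk:          # validation risk starts rising -> stop
            break
        best_r, best_risk = r, v
    return best_r
```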

2. Feedforward Networks

1. Perceptron

Figure 7
Perceptron learning rule:
ONLY MINIMIZES TRAINING ERROR! DANGER OF
OVERFITTING
• Can only classify what is linearly separable: cannot learn the XOR function because it is not linearly separable
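
A sketch of the classical perceptron learning rule in numpy (the ±1 label convention, learning rate and epoch limit are assumptions for illustration):

```python
import numpy as np

def train_perceptron(U, y, epochs=100, eta=1.0):
    """Perceptron learning rule: for each misclassified example (sign(w.u + b) != y),
    nudge the weights toward the example: w += eta*y*u, b += eta*y.
    Converges only if the classes are linearly separable (hence no XOR)."""
    w = np.zeros(U.shape[1])
    b = 0.0
    for _ in range(epochs):
        errors = 0
        for u, target in zip(U, y):                 # targets in {-1, +1}
            if np.sign(w @ u + b) != target:
                w += eta * target * u
                b += eta * target
                errors += 1
        if errors == 0:                             # training error driven to zero
            break
    return w, b
```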

2.2 Multi-layer Perceptrons

Differences to the Perceptron

1. MLPs have many summation units coupled together in parallel and serial connectivity
2. The learning rule is replaced by error backpropagation: this allows to iteratively adapt the synaptic weights to minimize a given loss function
3. An MLP neuron receives input from and sends output to many different neurons.
-> each such connection is still characterized by a weight (just like in the perceptron)

An MLP with K input units and M output units thus instantiates a network function N: R^K -> R^M

Combine all trainable parameters of a machine learning model together in one parameter vector, which is standardly denoted by θ. For a neural network, θ is thus the vector of all the trainable synaptic connection weights.

• N_θ: R^K -> R^M
• This notation is used to emphasize the dependence on the weights θ

MLP is a function approximator, used for:


1. Pattern recognition
2. Time series prediction
3. Denoising, restoration and pattern completion
4. Data compression
5. Process control

MLP Structure

1. several layers of neurons labeled 0,…,k
-> 0 = input layer
-> k = output layer
-> 1,…,(k−1) = hidden layers
2. number of units in layer κ = Lκ
3. two successive layers have connections between all neurons (all-to-all)
4. the link from unit j in layer κ−1 to unit i in layer κ has a weight
5. when used for computation: the i-th unit in layer κ has an activation
N(u) of an argument u ∈ R^L0 is computed as follows:
1. The activations x0 of the input layer are set to the component values of the input vector u
2.1. linearly combine the activations of the lower layer with the combination weights
2.2. add the bias -> interpreted as the weight of a synaptic link from a bias unit in layer κ−1 that always has an activation of 1 (see below)
2.3. wrap the obtained sum with the activation function: a nonlinear, s-shaped function
3. The output layer y (dimension Lk) is computed from the activations of the pre-output layer Lk−1 depending on the task setting
(without activation function)
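
A compact sketch of this forward pass for a fully connected MLP, with tanh in the hidden layers and a linear output layer (the concrete layer sizes and random weights are illustrative assumptions):

```python
import numpy as np

def mlp_forward(u, weights, biases):
    """Forward pass: x_0 = u; for each layer kappa, x_kappa = tanh(W_kappa x_{kappa-1} + b_kappa);
    the last (output) layer is left linear, as described above."""
    x = u
    for kappa, (W, b) in enumerate(zip(weights, biases)):
        s = W @ x + b                       # 2.1 linear combination + 2.2 bias
        if kappa < len(weights) - 1:
            x = np.tanh(s)                  # 2.3 wrap with an s-shaped function
        else:
            x = s                           # output layer without activation function
    return x

# toy network: 3 inputs -> 5 hidden -> 2 outputs (random weights for illustration)
rng = np.random.default_rng(0)
weights = [rng.normal(size=(5, 3)), rng.normal(size=(2, 5))]
biases = [rng.normal(size=5), rng.normal(size=2)]
print(mlp_forward(np.array([0.2, -0.5, 1.0]), weights, biases))
```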

Activation Function

[Figure: activation functions tanh, logistic sigmoid, rectifier]

Two reasons for applying them:


1. They are abstractions of biological neural systems. In biological systems, electrical activation is bounded. When applying tanh or the logistic sigmoid, we bound the activation of MLP neurons to [-1,1] and [0,1] respectively.

2. They introduce nonlinearity. Otherwise, N(·) would just be a composition of linear transformations, which is itself linear. Therefore, it could not learn nonlinear functions.
• In deep learning they are often replaced by the rectifier function, which is cheaper to compute
Summary
After an input vector u is written into the input units, a "wave of activation" sweeps forward through the layers of the network. The activation vector xκ in each layer κ is directly triggered by the activations xκ−1 according to (15). The data transformation from xκ−1 to xκ is a relatively "mild" one: just an affine linear map Wκ xκ−1 + bκ followed by a wrapping with the gentle sigmoid σ. But when several such mild transformations are applied in sequence, very complex "foldings" of the input vector u can be effected.

Feedforward Neural Networks (FNN’s)


MLPs are one class of FNNs, which all have
- an input layer
- an output layer
- in between, a directed neural connection network
• No feedback connections
• The connection graph is acyclic (if it is cyclic, the network is a recurrent neural network (RNN); RNNs represent dynamical systems)
• FNNs can approximate arbitrary functions:

• Most tasks can be solved with 1 hidden layer, but more layers reduce the overall size of the MLP (quantified by the total number of weights)

Training MLP with the Backpropagation Algorithm


Iterative Model Optimization: The Backpropagation Algorithm
- Used for gradient descent minimization
Gradient Descent
Performance surface: the graph of a risk function (the function may be the empirical risk, the risk, or some other cost). Typically discussed for model families whose candidates can be identified with a candidate set of parameter vectors θ.

The performance surface then becomes the graph of the function R(θ), where R is a generic risk function.

If θ ∈ R^D: the performance surface is a hypersurface in R^(D+1).

• The idea of model optimization is to iteratively move toward a minimal-risk solution by "sliding downhill" in the direction of steepest descent
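
A generic "slide downhill" loop in numpy, assuming the gradient of the risk is available as a function (risk_grad, the learning rate and the stopping rule below are illustrative assumptions):

```python
import numpy as np

def gradient_descent(theta0, risk_grad, mu=0.01, n_steps=1000, tol=1e-8):
    """Iteratively step against the gradient: theta_{n+1} = theta_n - mu * grad R(theta_n).
    A too-large mu makes the iteration unstable; a tiny mu makes convergence slow."""
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_steps):
        g = risk_grad(theta)
        theta = theta - mu * g
        if np.linalg.norm(g) < tol:      # (near-)zero gradient: a minimum was reached
            break
    return theta

# example: minimize R(theta) = ||theta - 3||^2, whose gradient is 2*(theta - 3)
print(gradient_descent(np.zeros(2), lambda th: 2 * (th - 3.0), mu=0.1))
```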

General Recipe

Weaknesses
- The final model depends on the choice of the initial model
- If the learning rate is too large: R(θ(n+1)) may be > R(θ(n)) (red line below)

Figure 16
➔ The gradient descent algorithm has become unstable
➔ The learning rate must be sufficiently small to avoid instability and divergence (green arrow) (the price is slow convergence)
➔ Becomes very slow when approaching a 'valley' of the graph (shaded dots)
What does it mean for a learning rate to be too large?
➔ The more layers, the more probable it is to get a high ratio of …
How to Escape Instability-Slowness Dilemma


Second order gradient descent
➔ Determines the direction vector for the next weight adaptation based on
gradient AND the curvature of the performance surface at the current
model.
➔ This curvature is a MxM sized Hessian matrix containing second order

partial derivatives .
➔ If Taylor approximation is precise: second-order gradient descent points
exactly towards target minimum (magenta arrow in picture)
➔ Computing the Hessian is expensive & subject to numerical problems &
assumptions may be violated
Backpropagation
Backward pass:
1. Delta rule
2. Error calculation in the 2nd run:
https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/
p. 46 in the reader (slightly different)

Computing the delta for output units with quadratic error
Difficulties
- The step size μ must be sufficiently small to avoid instabilities but big enough to speed up convergence
- Gradient descent may be slow on nonlinear performance landscapes when hitting a plateau (gradient small in all directions)
- Gradient descent on the performance landscape only gives a local minimum of the risk function. BUT models at local minima are often overfitted; we do not want to reach them and instead use a stopping criterion to stop earlier.
3. Dynamical Systems
- Feedforward networks are functions -> static argument-value relationship (no
time involved)
- a brain is an RNN -> has at least one connection feedback cycle (possibly with
itself)
A DS is any real or artificial or formal system that evolves over time
Two fundamental decisions before modelling
1. Selection: what subsystem is modeled
2. Perspective: what aspects of that subsystem are modeled

These decisions mandate the use of very different modeling tools.


Three modeling attitudes

Analytical modeling

• Try to capture the real-world underlying mechanisms

• The physicist's and neuro-medicinal view

• Examples: compartment models, Hodgkin–Huxley model,


Chomsky generative grammar

• Formalisms: ODEs, PDEs, automata models, dynamical Graphical Models

Blackbox modeling

• Try to describe the observable phenomena

• The approach of machine learning, signal processing

• Examples: BCI applications, brain data analyses, artificial intelligence systems


• Formalisms: hidden Markov models, stochastic DEs, artificial neural networks

Epistemic modeling (rare)

• Try to emulate information gain processes

• Sometimes adopted by roboticists, agent modeling

• Examples: concept formation processes, belief state modeling

• Formalisms: predictive state representations, observable operator


models, temporal logics

Deterministic finite-state automata (DFA)


• Deterministic

o Can be very large

• states are "fully observable"

• inferring a DFA from observed trajectories is easy


Moore and Mealy Machines
➔ POMDP with binary (0,1) transition matrix is a Moore machine
Non-deterministic Finite-State Automata (NFA)

- can yield infinitely many different state sequences for a given input sequence

- no probabilities (non-deterministic)
- non-deterministic systems capture what is possible/impossible
- can have no input set or a one-element input set (arrows only)

Finite-dimensional Markov chains

- used to describe trajectories starting at time n=0


- MC specifies a stochastic process with values Q
Markov Property: the next state (the probability of observing qi,n+1) depends only on the current state, not on the history
➔ MCs are memoryless systems
➔ The Markov property is the defining condition for a finite-valued stochastic process to be a Markov chain

Controlled Markov Chains


Like above but with different inputs (a1, a2) that each has a transition kernel
Update mechanism: switch transition kernel according to current input symbol
Hidden Markov models (HMMs)
Extension of Markov models by an observable output (O = {0,1})
For every state (q, p, r) and observable (0, 1), the HMM has either
(1) an emission probability, or
(2) an emission function (example)

- Very applicable because usually you can only observe measurements, not the states themselves (experimental data)
- Explain data by a generative mechanism (based on empirical measurement data, infer the state transition system)
➔ Can approximate every discrete-valued stochastic process arbitrarily well (better approximation ➔ larger state space)
➔ HMM is a special case of a dynamic graphical model: HMM is DGM with
single input set, single state set, single output set (HMMs trained by
expectation maximization algorithm, DGMs by generalization of HMM
learning algorithm)
Controlled Markov Models (POMDPs)/Partially Observable Markov
Decision Processes
- Crossover controlled MCs and HMMs
- Basic tool in theory of autonomous agents because it constitutes the
agent’s world model:
Q: external world states
A: agent’s action in world
O: sensory feedback from world
➔ Action-sensor-feedback timeseries data needed
1-dim deterministic Cellular Automata (CA)
- Finite sets of cells with local (neighbor) transition function (picture thingy
from autonomous systems)
- CA defines global transition function

- Can have many dimensions


- Simplest kind of model for spatial-temporal dynamics
- Local transition rule may be stochastic
Dynamic Bayesian Networks – dynamical graphical models

- Multivariate generalization of HMMs and POMDPs


- Natural model class for interacting stochastic quantities
- Analysis of multivariate empirical timeseries
- In machine learning: 1) inference (prediction), 2) learning

Finite State Models: generalities


- Set-up in discrete time
- All can be used for learning algorithms (model estimation)
- Simpler to describe, analyze, and infer from empirical observation data than continuous models
- Can get very powerful and complex with growing number of states (can
come as close as wanted to continuous model)

3 – Continuous State Models


Iterated function systems / iterated maps
Difference autonomous non-autonomous system

-> discrete time!


Ordinary Differential Equations
➔ Looks like iterated maps but isn't (ODE trajectories cannot cross, while iterated-map trajectories can ➔ exercise sheet 2, exercise 18)
➔ In a trajectory plotted in a phase portrait generated by an ODE, at each point of the trajectory the vector of the ODE's vector field is tangential to the trajectory

➔ The update operator of an ODE specifies a rate of change, not the next state
Delay Differential Equations (DDEs)

Neural Field Equations


Models dynamics in spatially extended neural systems
➔ Takes signal travel time, distance, and field geometry into account
➔ Combines DDEs, PDEs, integral equations
Part 6 – Qualitative Theory of DS: Attractors, Bifurcations, Chaos – ABC
theory of DS
Attractors: the DS exhibits a "stable dynamical pattern" because of them
Bifurcations (German Abzweigung, 'branching'): transitions from one pattern to another

The two coordinates are state variables
➔ the set of all their possible values (R2 in this example) = state/phase space
a [a search 2] is a control parameter: set by the user; each setting yields a different version of the system equation
➔ in geometrical DS theory, time is disregarded. Visible here is the trace line [search phase portrait]
[search limit cycle]
Stable and Unstable Trajectories
Unstable:
Start from (0,0), the trajectory will not move from (0,0). This is called a fixed
point of the dynamics. It is unstable because any small perturbation will drive
trajectory away
➔ origin is a repellor
Stable:
Start from unit circle: trajectory will forever revolve around unit circle
➔ this periodic orbit is stable
➔ it is a limit cycle
➔ it is a cyclical attractor

Structural Similarity and Stability


Structural Similarity:
Two DS are structurally similar if their phase portraits can continuously be transformed into
the other
➔ it is an equivalence relation used to classify dynamical systems
Structural Stability:
A DS is structurally stable if small changes to its vector field lead to structurally similar phase
portraits

These two fields are structurally stable


Point Attractor:

- a point x ∈ R^n such that (i) x is a fixed point of f, and (ii) all points in a neighborhood of x ultimately move closer and closer to x under repeated applications of f.
➔ stable because small perturbations will lead the DS to move the state back to (0,0)
Point Repellor:
- all trajectories neighboring (0,0) move away from (0,0)
➔ unstable because a small perturbation from (0,0) will move the state even further away
➔ if the map drives every neighboring trajectory further away from the origin at every time step, the origin is a point repellor

Periodic Attractor:

(i) f(xi) = xi+1 and f(xk) = x1

(ii) if y is any point sufficiently close to any of the xi, then upon iterated application of f, the sequence f^n(y) will converge to the cyclic sequence x1, x2, ..., xk.

Bifurcation: when a control parameter A passes through a critical value (here A = 0), the phase portrait changes its nature. Only at the critical value is the system structurally unstable; left and right of it, the phase portrait is structurally stable.
Phase Transitions
From thermodynamics, not from DS theory (the point where water becomes gas, liquid, or solid)
➔ also has critical values
but: a phase transition occurs in an infinite-dimensional stochastic system, while a bifurcation is a low-dimensional deterministic phenomenon
➔ bifurcations can be used to model PTs in a low-dimensional mean-field model

Part 8 – Attractors Mathematically Defined


Unifying the variety of state spaces is done by investigating dynamics on topological spaces
Def: X is a set, 𝒯 a collection of subsets of X (called the open sets).
(X, 𝒯) is a topological space if:
1. The empty set and X itself are elements of 𝒯
2. 𝒯 is closed under arbitrary unions
3. 𝒯 is closed under finite intersections
[basin of attraction]
Definition: A ⊆ N ⊆ X. N is a neighborhood of A if A ⊆ U ⊆ N for some open U ∈ 𝒯
➔ When T-iterates are applied to b ∈ X, b will be confined to U, the neighborhood of A
Part 9 – Non-Autonomous Dynamical Systems
- ABC theory developed for working on autonomous (input-free, deterministic, stationary)
systems
BUT: real-life neural dynamics are input-driven, stochastic, non-stationary (aka non-
autonomous)
➔ Ongoing input may make system non-autonomous! Classical attractor notion is not
applicable!

Chapter 4 – RNNs in Deep Learning


1. Supervised training of RNNs in temporal tasks
- always use discrete time
Basic Format

➔ Specifies how the network activation state is updated from one timestep to the next
➔ In the first iteration (n = 1), x(n) is replaced by the initial state x0

x(n) = vector of activations of the neurons inside the RNN (in R^L, L = number of neurons)
W = L×L weight matrix
Win = L×K input weight matrix (K input neurons)
u(n) = input signal
b = bias (in R^L)

σ = sigmoid function (logistic, rectifier function)


➔ Specifies how the output vector is computed at time n

y(n) = output signal at time n (in R^M)

Wout = output weight matrix from the RNN neurons to the output neurons (M×L matrix)
f = wrapper function applied to the linear readout, can be a sigmoid or the identity function
in more detail:

Equation 32 with feedback of the generated output signal y(n):

WFB = L×M matrix for feeding the generated output signal back into the RNN
➔ Not needed when the task is a pure input -> output transformation (the output is fully determined by the input signal)
➔ Needed when the output is actively generated by the RNN, even in the absence of input (robot motion generation)
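
A direct transcription of this state update and readout, with tanh as the sigmoid and the identity as the readout function (these choices, the sizes and the random weights are illustrative assumptions):

```python
import numpy as np

def run_rnn(u_seq, W, W_in, W_out, b, x0):
    """Drive an RNN with an input sequence:
    x(n) = tanh(W x(n-1) + W_in u(n) + b),  y(n) = W_out x(n)  (identity readout)."""
    x = x0
    outputs = []
    for u in u_seq:                       # u_seq: sequence of K-dimensional inputs
        x = np.tanh(W @ x + W_in @ u + b) # network state update
        outputs.append(W_out @ x)         # linear readout to M output neurons
    return np.array(outputs)

# tiny illustration: L=4 RNN neurons, K=1 input, M=1 output
rng = np.random.default_rng(1)
L, K, M = 4, 1, 1
y = run_rnn(rng.normal(size=(10, K)), rng.normal(size=(L, L)) * 0.3,
            rng.normal(size=(L, K)), rng.normal(size=(M, L)),
            rng.normal(size=L), np.zeros(L))
```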

Basic temporal Tasks for RNNs


-> common to all: training data contains time series
1. Stationary dynamical systems
- the statistical character of the signal does not change over time. Whether you observe n = 0 to n = 100 or n = 1000 to n = 1100 does not make a difference (e.g. monitoring a wind turbine)
- training data: a single long sequence S of paired input-output signals
- learning goal: train the RNN to create, from the testing input signal, an output signal which is close to the testing output

2. Non-stationary DS
- the character of the signal changes over time
➔ microphone signal of someone saying 'zero': has a 'z' at the beginning and an 'o' at the end
➔ training data consists of many individual recordings (e.g. of many people saying 'zero'), which is double-indexed:

, where i is the index of the training example

➔ Both (1., 2.) can be stochastic or deterministic, but most real-life systems are stochastic
➔ Stationarity is not a perfect dichotomy. It depends on how long a system is observed (e.g., ignoring climate change, the weather changes daily but follows the same pattern over the years)

For these short- long term distinctions, you can follow two strategies to implement RNNs
1. One complex RNN, capable of changing ‘mode’ in shorter time intervals
2. Several simpler RNNs, one for each typical non-stationary sub pattern. Then, combine
them into a compound multi-RNN system, where individual RNNs are
activated/deactivated to give rise to the overall pattern

Dynamical Pattern Generation Tasks


- Works without input; trained just from the desired output sequence
➔ Since there is no input, the similarity of the generated output ŷ to the target is difficult to determine
➔ This is especially an issue if the pattern is chaotic or stochastic

Pattern Detection Tasks


- Training data has input-output pairs
➔ X(n) = a signal of measurements from the system in which one wants to identify a certain
pattern
➔ Y(n) = {0,1} ➔ 0 when pattern is not present, 1 when it is (can happen at end of signal:
when ‘p’ in ‘stop’ was analyzed, or before target pattern occurs)

Timeseries Prediction Tasks


- Financial timeseries forecasting: stochastic, nonstationary

Typical approach: train the NN to predict one timestep ahead, feed this prediction back, and iterate

Di(n) = the n-th value in the i-th training timeseries

➔ The output is equal to the input timeseries shifted one timestep ahead
➔ Set up with 1 input and 1 output node
➔ First trained to predict all the training series as well as it can, then forecast the future by iterated one-step predictions as follows:
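
A sketch of this feedback loop, with a hypothetical one-step predictor predict_next standing in for the trained network (the function name and list-based interface are assumptions):

```python
def iterated_forecast(history, predict_next, n_future):
    """Forecast n_future steps: repeatedly predict the next value from the
    current history and feed the prediction back in as if it were observed."""
    extended = list(history)
    for _ in range(n_future):
        next_value = predict_next(extended)   # trained one-step-ahead predictor
        extended.append(next_value)           # feedback of the generated value
    return extended[len(history):]
```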
System Modelling Tasks
- Simulate how a physical system responds to external perturbations/human control input
- RNNs are good at modeling/simulating (e.g. an aircraft response to pilot input) if the
target system behavior is nonlinear and has memory effects [RNN good search]
- Training data of

for stationary
And

for non-stationary systems

In aircraft modelling example:


U(n) = three pilot command signals
Y(n) = measured accelerations of the aircraft

4.2 Backpropagation through time


Loss function for stationary tasks:
- It is sufficient to measure the mismatch between the outputs at individual time steps (same as for MLPs)

➔ E.g. quadratic loss:

Loss function for non-stationary tasks:
- The loss function differs across time!

MLPs use backprop for 1., which relies on the network topology being feedforward without connection cycles (which RNNs do have)
➔ (34) is therefore solved with backpropagation through time (BPTT)

BPTT

1. Replicate recurrent net for every time step


2. Network-internal connections (weights) are rewired forward to the next copy
Difference to MLP:

- Corresponding weights in different “time-layers” (u(n+1) compared to u(n)) must be


identical
Problems of BPTT:

1. The stack of temporal copies must have a finite depth h (for 'horizon')

➔ The unfolded RNN will be a finite, cycle-free network mapping input sequences u(n), …, u(n+h−1) to output sequences y(n), …, y(n+h−1)
➔ If nmax > h (the training data covers more timesteps than the horizon), the algorithm averages over gradient increments while shifting the depth-h network through the training sequence (nmax − h − 2 shifts)
➔ This implies that any effects from input u(n) on output y(n) which are delayed by more than h timesteps cannot be learned
BUT: this is not the case if there are stochastic dependencies between the input signals at different times

➔ A significant problem for tasks with input-output effects of arbitrarily long delay (human language)
language)
2. Vanishing/Exploding Gradient Problem
- Surfaces when error gradients are backpropagated through many layers (the deeper the
network the more severe the problem)
- RNNs are much deeper than MLPs due to many time-slice layers
➔ Vanishing gradient problem occurs in RNNs AND MLPs

Example of a pure memory task:

➔ The desired output is the input from h timesteps earlier


➔ For depth h we need h+1 time slice copies (minimal needed depth for this problem)
➔ For this simple embryonic RNN we do not need backprop trick

For quadratic loss, the gradient of the loss with respect to the weight w contains the critical term w^(h−1): if h is large and w < 1, it goes towards 0; if h is large and w > 1, the gradient explodes with depth h (because of the exponent)

➔ Since the unfolded RNN are identical copies, gradient components arising from pathways
that span large temporal horizons (large h) are repeatedly quenched or expanded at
every timestep
➔ If learning task includes the exploitation of long-term delayed input-to-output effects
(long-term memory), the vanishing/exploding gradient problem will make it practically
impossible to encode these long-term effects during learning
➔ Best answer to problem: LSTM
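
The effect of the critical term w^(h−1) can be checked numerically before moving on (the values of w and h below are only illustrative):

```python
# the gradient of the pure-memory task contains the factor w**(h-1):
for w in (0.9, 1.0, 1.1):
    for h in (10, 50, 100):
        print(f"w={w:>4}, h={h:>3}: w**(h-1) = {w ** (h - 1):.3e}")
# w < 1: the factor vanishes with depth h; w > 1: it explodes; only w = 1 is neutral
```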

LSTM networks

Long Short-Term Memory: memory mechanisms where information is preserved over limited, but possibly rather long times, based on neural activation dynamics in RNNs, without synaptic weight adaptation

- Helps with vanishing gradient problem and to cope with multiple-timescale tasks

u(n) ∈ R

s(n) & r(n) ∈ {0,1}

Task of the network: output u(n) at y(n+h), h = 3

s = store: if s(n) = 1, the input value is stored

r = read: if r(n) = 1, the stored value is read

➔ s & r control the memory task

➔ Whenever r(n) = 0, the output should be 0 too

The network has different neurons (now called (memory) cells)

A memory cell is embedded in control circuitry ➔ memory block

Simplified equations:

➔ Only two trainable weights: win and wout

➔ If r(n) = 0, the quadratic loss is zero
➔ If r(n) = 1, the quadratic loss is [LSTM quadratic loss]

The gradient of the error with respect to the two trainable weights is
➔ At the point where wout·win = 1, the gradient is 0
➔ But: the gradient is independent of h (no vanishing/exploding gradient for large h)
Why?

If s(n) = 0, the activation state c(n) stays the same. The weight w which caused the vanishing gradient problem in [35] (whenever it is not equal to 1) is effectively fixed to 1 in [36] (the LSTM)

An LSTM network is an RNN which consists of several memory blocks and possibly other
‘normal’ RNN units.

Memory blocks and normal units may receive external inputs and/or recurrent input from other
blocks/units within the network.

5 specialized neurons with different update equation

1) See above (c(n))


2) An input neuron with state u(n):
- Corresponds to u(n) in [36].
- It may receive external input and/or input from other blocks or normal units within the
network.
- It is a ‘normal’ unit with update function:

F = sigmoid function

Xu (n) = vector of signals composed of external input/outputs


Wb = vector of input weights

Bu = bias vector

3) An input gate neuron with state ginput(n):


- External store input s(n) in our example
- Update equation:

= always logistic sigmoid


xg(input)(n) = vector of signals composed of external input/outputs

4) An output gate neuron with state goutput(n):


- External read input r(n) in example
- Like input gate with equation:

σ = always the logistic sigmoid

xg(output)(n) = vector of signals composed of external inputs/outputs

5) A forget gate neuron with state gforget(n):


- In example played by constant 1 factor in [36]
- Like input gate neuron with equation:

➔ A memory cell does not necessarily preserve its previous value with a factor of 1
➔ It may leak some of it through the multiplication with the forget gate neuron's value
➔ The output may be fed to other memory blocks or normal units in the LSTM network
Why do the three gate units all have the logistic sigmoid as their squashing function?

➔ To make the numbers range between 0 and 1.

- 1 = the gate is fully open
- 0 = the gate is fully closed

➔ Trained with BPTT.

➔ All weights and biases in all memory blocks


of LSTM network are trainable
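
A simplified memory-block update in the spirit of the five specialized neurons above (the gate inputs are collapsed to scalars here; this is a sketch under those assumptions, not the full equations from the reader):

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

def memory_block_step(c_prev, x_u, x_gi, x_go, x_gf, w_u, w_gi, w_go, w_gf, b=0.0):
    """One update of a simplified LSTM memory block: input neuron, input gate,
    output gate and forget gate (all gate values lie in [0,1])."""
    u = np.tanh(w_u * x_u + b)            # input neuron ('normal' unit)
    g_in = sigmoid(w_gi * x_gi)           # input gate: how much new input to store
    g_forget = sigmoid(w_gf * x_gf)       # forget gate: how much of c_prev to keep
    g_out = sigmoid(w_go * x_go)          # output gate: how much of the cell to read out
    c = g_forget * c_prev + g_in * u      # memory cell state
    y = g_out * np.tanh(c)                # block output
    return c, y
```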

Hebbian Cell Assembly

All participating neurons are connected to all others by some cyclic pathways (this is not the case
in an LSTM block)

5. Hopfield Networks

There was a scientific crisis because synaptic weights alone cannot sufficiently explain long-term memory (LTM).

Problem of catastrophic forgetting: an MLP learns Task A, then Task B, but forgets Task A in the process

➔ The active field of continual learning within deep learning tries to solve this problem
➔ This problem is hard to reconcile with the dogma of basing LTM on fixed, stable synaptic weights

Grandmother cell: is your memory of your grandma encoded in a specific grandmother


neuron?

➔ unclear (evidence for and against it)


alternative view: distributed encoding of memories: thinking of grandmother is the effect of a
complex neural activation pattern involving large portions of brain

How is the memory ‘written’?

➔ Backprop algorithm is not biologically feasible (and suffers from catastrophic forgetting)
➔ Human brain might need sleep for that
How do we recall memories?

2 problems:

1) Addressing
➔ How do you point to memories (in computer by pointers)
➔ Brain might use content addressing: To access grandmother, you start with some
fragment
➔ Called associative memory model
2) Decoding
➔ twin problem of encoding problem

These problems also concern AI:

1. Problem of Continual Learning


➔ The achievements of deep learning are still narrow, tied to specific tasks
2. Neuromorphic Computing
➔ Designing computer chips inspired by neural networks
➔ Main goal: enable in-memory computing: in biological (and neuromorphic artificial) neural networks there is no separation between CPU (processor) and RAM (memory)
o In conventional computers this separation creates the von Neumann bottleneck (the read/write channel between CPU and RAM)
➔ The bottleneck costs much time and energy and enforces serial execution of computational operations
➔ In-memory computing would enable a parallel way of computing directly at the hardware level of memory traces, saving time and energy (not yet achieved)
Hopfield networks (HNs) started the family of energy based NN models (➔ Boltzmann
machine, instrumental in getting deep learning off the ground)

➔ Still foundational reference model for associative neural memories in cognitive


neuroscience

Nutshell HN

- Can be trained to store a finite number of patterns that are retrievable by an auto-association process

Pattern: a pattern is a binary vector ∈ {-1,1}^L (it also works with {0,1}, but that is more complex)

➔ In introductions, encoded as a 2-dimensional pixel array (b/w image)

➔ Given a finite (and not too large) number of training patterns (called fundamental memories after storing took place), they can be encoded in a HN

Recall: the HN is presented with a cue pattern (which must have the same dimension as the fundamental memories).

➔ The cue is a corrupted version of one of the stored patterns (i.e. the cue pattern agrees with one stored pattern in some, but not all pixels; the difference can be substantial)

Pattern Restoration:
Pattern Completion:

Main Design Principles for the Associative Pattern Recall Functionality in Hopfield nets:

1) Every state x corresponds to a possible pattern and vice versa


➔ Bijection between possible states and possible patterns
➔ A HN to process L-dimensional patterns has L neurons
➔ Each neuron has binary activation set {-1,1}

2) Each state x of HN has well-defined energy


➔ Negative energy is possible
3) During storage learning process, every pattern from learning set becomes associated
with a locally minimal energy
➔ Gives “energy landscape” over the space {-1,1}L of all binary patterns
➔ Fundamental memories are placed on local minima
4) Recall starts by presenting (corrupted) input pattern u (set to be initial state of a state
trajectory, which evolves according to the recurrent dynamics of HN)
➔ Trajectory leads through a state sequence
➔ At every step, the energy is reduced
➔ Necessarily ends in local minimum (where fundamental memory can be retrieved)
Generally:

- Input pattern u
- The discrete-time state update dynamics of the HN leads to a sequence of patterns (originating from u)
➔ At every update, the new state x(m) has energy ≤ that of x(m−1)

➔ At some x(m) no further energy reduction is possible

➔ This is a local minimum containing a fundamental memory

➔ This is guaranteed to happen because the pattern/state space {-1,1}^L is finite

Fundamental memories are point attractors

➔ The set of all cues u that are attracted by a fundamental memory (point attractor) is the basin of attraction of that fundamental memory

➔ All patterns in the basin of attraction are instances of the category/concept/class represented by the fundamental memory

5.2 HN: formal model

- An HN does not have input/output units

➔ if you want to store L-dimensional patterns, the HN will have L neurons

- These neurons serve as input, internal, and output units

- An HN is fully specified by an L×L, real-valued, symmetric weight matrix W

➔ W has 0's on the diagonal and is mirrored along that diagonal

➔ This means that wij = wji (i.e. the connections are undirected)

Example with L = 4:
State:

- L-dimensional binary vector ∈ {-1,1}

Key idea:

➔ Each state is assigned a real-valued quantity (energy)

Energy:

Defined by:

➔ Multiply the activations of each pair of distinct neurons by their connection weight and sum it all up

➔ "Energy" signals the close connection between NN dynamics and statistical thermodynamics

➔ This NN subfield is called: energy-based models

Rule for update dynamics:

1. Randomly select one neuron xi

2. Compute its activation value at time n+1 by

sign = signum function:

- z = the weighted sum of the other neurons' activations

- If z > 0 ➔ sign(z) = 1

- If z < 0 ➔ sign(z) = -1

- If z = 0 ➔ set xi(n+1) = xi(n) (rare case)

[z search]

3. Update only the activation of the neuron xi, not that of the other neurons

➔ Will always reduce (or keep same) energy state
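
The energy and the asynchronous update rule, sketched in numpy for ±1 patterns (the sign and the 1/2 factor in the energy follow the usual convention; the reader's formula [41] may differ in constants):

```python
import numpy as np

def energy(W, x):
    """Hopfield energy of state x: sum over neuron pairs of -w_ij * x_i * x_j
    (the 1/2 compensates for counting each pair twice; the diagonal of W is 0)."""
    return -0.5 * x @ W @ x

def async_update(W, x, rng=np.random.default_rng(0)):
    """One asynchronous update: pick a random neuron i, recompute its activation
    as sign(sum_j w_ij x_j); ties (z == 0) keep the old value. Energy never increases."""
    i = rng.integers(len(x))
    z = W[i] @ x
    if z != 0:
        x = x.copy()
        x[i] = 1 if z > 0 else -1
    return x
```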

5.3 Training an HN

Learning problem:

Given N L-dimensional training patterns, find a weight matrix W such that an energy landscape is created that has all training patterns located at different local minima.

➔ Not always possible

2 methods to do so:

1) There is an analytical formula that directly computes W from training patterns

2) Iterative-incremental, appears unnecessarily time-consuming but is biologically plausible


(could be used by brains)

Analytical Solution of Learning Problem

If a weight matrix W exists which solves the learning problem, it can be written as

W = (1/L) (Σk pk pkᵀ − N·I)

where I = L×L identity matrix and pk pkᵀ = outer product of the k-th training pattern with itself

➔ the factor 1/L normalizes the energy levels expressed in [41] so that they do not grow with network size

➔ subtracting N·I sets all self-connections wii to 0 (guarantees the 0 diagonal in W)
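
In code, this storage rule reads as follows (assuming ±1 patterns, consistent with the normalization comments above):

```python
import numpy as np

def store_patterns(patterns):
    """Compute W = (1/L) * (sum_k p_k p_k^T - N * I): the outer-product (Hebbian)
    sum, normalized by L, with N*I subtracted to zero the self-connections."""
    P = np.asarray(patterns, dtype=float)        # shape (N, L), entries in {-1, +1}
    N, L = P.shape
    W = (P.T @ P - N * np.eye(L)) / L            # P.T @ P = sum of outer products
    return W
```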

Iterative Solution of the Learning Problem

1) Initialization: create a random initial weight matrix W(0) (symmetric with 0 diagonal)

2) Loop

o At update step k, present 1 training pattern to the network (before step k the weight matrix is W(k-1))

o Update all weights wij(k-1) where i != j by

[learning rate]

➔ If two HNs are trained with the same data but with differing learning rates in [44], the resulting energy landscapes will be the same up to a scaling factor

➔ The order in which the training patterns are presented does not matter! It leads to the same landscape given by [43] (up to a scaling factor)

3) Stop

o When reaching stop criterion condition

➔ When largest weight (absolute value) hits a predefined ceiling

➔ When test set of corrupted input patterns is recalled correctly

➔ (best) when the pairwise ratios of weights in the sequence W(k) appear to converge up to
a predefined, small residual change

[44] can be written in matrix form

➔ The matrix obtained by both solutions is the same

➔ The local formulation in [44] underlines biological plausibility! (backprop is non-local: to change a weight wij, you need information from other, distant neurons, which is biologically impossible)

Why it works – Idea of Hebbian learning

EXERCISE: why does [45] always lower the energy of the training pattern (check from energy
definition [41])?

➔ We incrementally change W by making it yield lower energies for the training patterns

➔ At the same time, it does not lower energies for patterns outside the training set more
than for training patterns
Hebbian learning: learning rule [44] could be effective in biological brains.

➔ “Cells that fire together, wire together”

2 problems that arise:

- Mutually excitatory connections also need inhibitory ones to compensate

- Neurons can only send spikes that last milliseconds, while impressions may last seconds

➔ The first one is solved in HNs: equal rights are given to positive and negative neural activations and weights

➔ The second one does not arise

5.4 Limitations

It is not always possible to store N training patterns in an L-neuron HN. What can go wrong:

- Fundamental memories differ from the training patterns in a few bits

- Storage capacity is limited: N/L is called the load of a HN

o If N/L > 0.138, [43] breaks down entirely: none of the patterns will be stored

o A HN works well if 0.03 < N/L < 0.05 (the training patterns then lie in local minima that correspond to perfect copies)

- Spurious states: besides the stored training patterns, other local minima are created (false memories)

- If x is a fundamental memory, -x is also a fundamental memory

➔ The HN cannot distinguish between patterns and their sign-inverted versions (this always happens)

These limitations render HNs useless for practical applications

Bit Errors in Fundamental Memories

The storing formula [43] leads to a slight misplacement of the local minimum associated with a training pattern.

➔ If the HN is cued with the exact pattern, the state update dynamics will move down the energy valley until it arrives at the nearby local minimum (the slightly misplaced version of the training pattern).

➔ Some bits of the original pattern will thus become flipped; these bits are unstable

How precisely can a HN recall the training patterns?

➔ It depends probabilistically on the load N/L

➔ The maximum is P = 0.5

Φ = error function associated with the Gaussian distribution:

Worst case: the flipping of some bits triggers a bit-flip avalanche, which causes the training pattern to be unstable and not located at a local minimum of the energy landscape

➔ If P = 0.5, all bits become randomly flipped

➔ If N/L ≈ 0.138, every pattern becomes maximally unstable

➔ Avalanches start to occur for N/L > 0.05

➔ This is an example of a phase transition


Spurious States

3 different kinds:

1) Inverted States (sign inversion, occur necessarily)

2) Mixtures of an odd number of fundamental patterns

3) Spurious states (spin glass states) are uncorrelated to stored patterns

Imperfections Related to Load

- For all loads: stable spin glass states exist

- For N/L > 0.138: spin glass states are the only stable ones

- For 0 < N/L < 0.138: stable states close to desired fundamental patterns exist

- For 0 < N/L < 0.05: pattern-related stable states have lower energy than spin glass states
(patterns are the actual local minimum)

- For 0.05 < N/L < 0.138: spin glass states dominate (some of them have lower energy than
pattern-related states)

- For 0 < N/L < 0.03: additional mixture states exist (but energy is not quite as low as
pattern-related states)

- ONLY for 0.03 < N/L < 0.05 a HN works best

Miscellaneous notes

Local stochastic vs deterministic global update

In [42] we randomly consider one bit at a time for flipping or not flipping (local stochastic)

➔ It would also be possible to update all bits of a pattern with [42] at once (deterministic global)

➔ These two are not equivalent!

➔ The local minima would be in different places

➔ This difference is common to all kinds of dynamical systems with states made of discrete bits

Importance of Symmetric Synaptic Connections

➔ If asymmetry would be allowed (e.g. wij ≠ wji), energy functions could no longer be
defined & state updates would not necessarily reduce energy

➔ Other attractors would emerge (for instance cycles), which breaks use case of patterns as
point attractors

Heteroassociative Networks

- HNs are auto-associative networks: the update dynamics associates a fundamental memory with itself

- In heteroassociative networks, sequences of patterns are learned (like when thinking of something makes me think of something else)

➔ Needs a sequence of training patterns (circular in the easiest case)

Learning formula [45] would be replaced with

➔ If convergence is achieved, the weight matrix would be non-symmetric

HN pro’s

Check p.94 of reader

HN con’s

- Small memory capacity


- Nice results only for uncorrelated fundamental patterns

➔ Difficulty storing two similar patterns separately

- Not technically useful (lol)

6 Moving toward Boltzmann Machines

- Gives computational model of memory, concept representation, learning and reasoning in


hierarchical cognitive systems

- In ML used as universal learning machine for almost any probability distribution

➔ Universal statistical inference device (can compute conditional probabilities of any sort)

- Connects NN to statistical physics

Drawback: training is computationally expensive

2 key ideas

1. The Boltzmann Distribution

Statistical physics: derive macroscopic observables and laws (measurable quantities; in the brain analogy, e.g. spoken words) as statistical distributions over the interactions of microscopic particles (in the brain analogy, neurons)

➔ Describe brains with statistical physics ➔ Boltzmann Machine

Example: vessel filled with water

- Macroscopic: temperature, pressure,..

- Microscopic: microstates: specification of all 3D position coordinates and velocity of all


individual H2O molecules in vessel

➔ Boltzmann: What is the probability distribution of these microstates?

➔ From that, infer values and laws of macroscopic observables


- Distribution is not uniform

➔ Some microstates are more likely to occur than others

➔ The energy of microstates is fluctuating all the time

Microstates = s

Energy of microstate = E(s)

➔ The probability of a microstate depends only on E(s) and the temperature T of vessel/heat
bath system (the vessel is surrounded by uniform temperature water)

➔ Position & velocity vector microstates are continuous-valued vectors, therefore the
distribution assigns a probability density value to every microstate

Z = partition function (needs supercomputer)

➔ Z cancels out in many applications

Boltzmann distribution for discrete states:


P(s) = probability mass function

➔ Very dependent on the temperature

If the temperature is large, exp(-E(s)/T) uniformly approaches 1

➔ After normalization by Z ➔ uniform distribution

➔ All microstates are equally probable

If the temperature ≈ 0 Kelvin (the lowest possible temperature)

➔ The lowest-energy state is most likely (the distribution localizes at the lowest-energy state)

➔ The property that slow cooling leads to the global minimum is exploited in simulated annealing

2 facts about Boltzmann

1) If a constant is added to the energy function, the resulting distribution is the same: P' == P

2) Every globally nonzero probability distribution (i.e. P(s) ≠ 0 for all s ∈ S) can, for any temperature T, be written as a Boltzmann distribution

➔ This makes BMs universal approximators of (discrete) probability distributions
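
A tiny numeric illustration of the discrete Boltzmann distribution and its temperature dependence (the energy values are made up):

```python
import numpy as np

def boltzmann_pmf(energies, T):
    """P(s) = exp(-E(s)/T) / Z, with Z summing exp(-E/T) over all microstates."""
    weights = np.exp(-np.asarray(energies) / T)
    return weights / weights.sum()               # divide by the partition function Z

E = [0.0, 1.0, 2.0, 5.0]                          # illustrative microstate energies
print(boltzmann_pmf(E, T=100.0))  # high T: nearly uniform over microstates
print(boltzmann_pmf(E, T=0.05))   # T -> 0: mass concentrates on the lowest-energy state
```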

6.2 Sampling Algorithms

Pmf = probability mass function (finite, discrete state space)

Pdf = probability density function (for continuous state spaces)


A computational procedure which generates ‘random’ examples from a given distribution

Theory behind sampling is called Markov Chain Monte Carlo (MCMC) sampling

Probability distributions are basis of scientific research when it comes to dealing with
uncertainty in data

➔ Pdfs and pmfs are usually high-dimensional and cannot be visualized

➔ What can be computed is the pdf/pmf value at a given point

➔ For any s in the high-dimensional S, you can compute the pdf value p(s) or pmf value P(s)

➔ But this is not always possible (fuck these lecture notes sometimes...) because of numerical underflow (probabilities smaller than the smallest positive number a 64-bit float can represent are treated as 0)

➔ Solution: use log2 P(s)

What is sampling good for?

1) Get estimates of probabilities for events that a researcher is interested in (and estimates of
expectations, variances, partition function, all integrals over functions over the sample
space)

➔ No access to these beside sampling

2) Sampling can be used to generate examples from a statistical model

➔ This is the best way to use a BM

➔ In cognitive terms: confabulation (German Plauderei, 'chit-chat'): the random walk of an artificial brain-state sampler creates a 'stream of thought' which in the long run visits all the places and themes that the brain knows about

6.3 Metropolis Algorithm

➔ sampler used in BM
What it needs:

- Set/space S of possible states of some modeled system

- For each state s ∈ S, a computable pdf or pmf value

➔ Enough to have pdf or pmf up to some unknown normalization factor

➔ Metropolis only needs ratios p(s)/p(s’) or P(s)/P(s’) (ratios remain the same if pdf or pmf
is scaled by constant factor)

➔ Makes it unnecessary to compute partition functions

Metropolis sampler for given pmf

Given:

- Finite state space S

- Non-negative function

➔ Proto-pmf because it needs normalization to become pmf

Task:

- Generate (endless) sequence s1,s2,… such that, in the long run, this sequence of sampling
points would remodel the pmf landscape

What does it do?

- Generate the sample point sequence s1,s2,... by means of a Markov process mechanism

➔ Every newly generated point sn depends only on sn-1 (M process generalizes discrete,
finite-state M chains to continuous state spaces)

- S1 is chosen arbitrarily

Assume sn has been computed. How to compute sn+1:

1) Randomly propose a candidate s* for sn+1:

- Needs a mechanism to sample from a proposal distribution

Proposal distribution: a conditional distribution over S which gives the probability of choosing s* given sn

➔ Should be chosen so that one can sample from it cheaply (e.g. a multidimensional Gaussian centered on sn)

2) Accept or reject the candidate s*:

- If s* is accepted: it becomes the next output of the sampler: sn+1 = s*

- If rejected: discard s*, repeat the previous value in the next output: sn+1 = sn

- Requires a random decision-making subroutine (acceptance function)

➔ Asymptotically, this remodels the landscape of the proto-pmf


2 most common acceptance functions

➔ F = formula [49]

➔ The acceptance probability is then compared with a number a drawn from the uniform distribution on [0,1]. If a ≤ Paccept ➔ accept s* (a weighted random decision)

Conditions for Metropolis:

Necessary condition: the 'random walk' process yielding the sampling points is ergodic

➔ Every point in S can be reached by some sample point sequence

Sufficient condition: Pprop must be symmetric:

➔ The two proposal distributions in step 1 have this property (the heart of the Metropolis algorithm)

➔ If Pprop selects candidates far from sn: danger of landing in low-probability zones of the landscape ➔ s* will rarely be accepted ➔ slow coverage, long sampling times

➔ If Pprop selects candidates close to sn: ➔ slow random walk (slow coverage, long sampling time)

➔ Finding the right proposal distribution needs insight into the probability distribution one wants to sample from

- If one wants a sampler producing an independently, identically distributed (i.i.d.) sequence of sample points (without memory of, or reliance on, the last point) ➔ subsample the sequence

➔ Instead of using s1, s2, … retain only every h-th point, recording only sh, s2h, …
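
Putting the above steps together, a minimal Metropolis sampler for a finite state space could look like this (the state space, the proto-pmf and the uniform "pick any other state" proposal are illustrative assumptions):

```python
import numpy as np

def metropolis_sample(proto_pmf, states, n_samples, rng=np.random.default_rng(0)):
    """Metropolis sampling: propose s* from a symmetric proposal, accept with
    probability min(1, P(s*)/P(s_n)); on rejection repeat the previous sample."""
    s = states[rng.integers(len(states))]            # arbitrary starting point s1
    samples = []
    for _ in range(n_samples):
        s_star = states[rng.integers(len(states))]   # symmetric (uniform) proposal
        ratio = proto_pmf(s_star) / proto_pmf(s)     # normalization factor cancels
        if rng.uniform() <= min(1.0, ratio):
            s = s_star                               # accept the candidate
        samples.append(s)                            # on rejection, s is repeated
    return samples

# illustration: un-normalized pmf over {0, 1, 2, 3}
samples = metropolis_sample(lambda s: [1.0, 2.0, 4.0, 1.0][s], list(range(4)), 5000)
print(np.bincount(samples) / len(samples))           # approx. [0.125, 0.25, 0.5, 0.125]
```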

Gibbs sampling: possible if states s ∈ S are vectors

- Proposal distribution changes only one component of s at a time


➔ if E(s*) < E(sn), the ratio r > 1 because P(s*) > P(s)

6.4 Simulated Annealing

- General purpose optimization algorithm that finds global minima

Optimization task: search space S and a cost function. Goal is to solve the cost minimization

problem:

➔ This is the form of the neural network training objective in supervised learning!

➔ States s are parameter vectors of a NN

➔ Cost function R = empirical risk

To find global minimum one needs to search whole state space S, try low-cost candidates more
often than high-cost ones and make sure to not get trapped in local minimum

2 main families:

1) Evolutionary optimization:
- Cost function: inverse of biological fitness

- Compute sequence of generations G1, G2,… (consisting of individual solution


candidates, in NN = network parameter vectors).

- Next generation derived from previous generation favoring high fitness parents and
random variation in genes

2) Simulated Annealing:

- Cost function: energy of microstates

- Run Metropolis sampling from the Boltzmann distribution starting with high temperature
and slowly cooling down (nudges search for global min)

Outline of process for discrete search space (pmf):

1. Identify points of search space S with microstates s

2. Identify cost function (which global minimum should be found) with energy function
E(s)

3. Start with medium/high temperature T0 and consider Boltzmann distribution P(s,T0)


(close to uniform) and start sampling with Metropolis algorithm

➔ Sequence of samples will cover search space almost uniformly

4. Lower temperature gradually, yielding Boltzmann distributions P(s,T1), P(s,T2),..


concentrating ever more on low cost/energy value microstates

5. Continue until T~0

➔ Samples now close to global min

➔ If cooling too fast/start at low temp ➔ get stuck in local min (suboptimal)
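
A compact simulated-annealing loop combining Metropolis-style acceptance with exponential cooling (the proposal, cooling factor and temperatures below are illustrative assumptions; the cooling schedule is discussed further down):

```python
import numpy as np

def simulated_annealing(cost, s0, propose, T0=10.0, k=0.995, n_steps=5000,
                        rng=np.random.default_rng(0)):
    """Minimize cost(s): treat the cost as an energy, accept uphill moves with
    probability exp(-(E(s*)-E(s))/T), and cool the temperature exponentially."""
    s, T = s0, T0
    best = s
    for _ in range(n_steps):
        s_star = propose(s, rng)
        delta = cost(s_star) - cost(s)
        if delta <= 0 or rng.uniform() < np.exp(-delta / T):
            s = s_star                       # always accept if the energy goes down
        if cost(s) < cost(best):
            best = s
        T *= k                               # exponential cooling: T_{n+1} = k * T_n
    return best

# illustration: minimize a wiggly 1-d function with several local minima
wiggly = lambda x: (x - 2.0) ** 2 + 2.0 * np.sin(5.0 * x)
print(simulated_annealing(wiggly, 10.0, lambda x, rng: x + rng.normal(0, 0.5)))
```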
2 cases:

1) When the Metropolis algorithm accepts with certainty:

- Occurs when P(s*) ≥ P(sn), i.e. when the ratio P(s*)/P(sn) ≥ 1

➔ When the energy of the proposed state is lower than the energy of the previous state, accept with certainty

2)

Rewriting the log of this acceptance probability in terms of energy:

➔ In terms of energy, a newly proposed microstate is accepted with certainty if the corresponding energy jump goes down. If it goes up, it is accepted with probability [55].

➔ The greater the energy increase, the exponentially more unlikely acceptance becomes

➔ But: at higher temperature, we accept larger jumps uphill (this allows us to get out of local minima)
Widely used cooling process:

Exponential cooling: put Tn+1 = k * Tn

➔ k < 1 but close to 1

➔ Update T after every sample point

BUT: sometimes you need quick, sometimes slow cooling!

Slow cooling: associated with phase transitions (freezing water)

Example:
6.4.1 Optimizing Compute Hardware Layout: circuit partitioning subtask

➔ see reader for 2 examples

7 The Boltzmann Machines

Neural Network for a representation of complex distributions

In cognitive science terms: A model of a contents-addressable, associative, generative long-term


memory

Fundamental assumption: memory is a generative model of a probability distribution.

Generative: BM comes complete with a sampling algorithm, which produces sample items from
the memorized distribution

➔ BM run in mode of “dreaming” (technical term: BM can confabulate pattern samples)

7.1 – Architecture

BM is

- Recurrent NN

- All neurons are binary {0,1}

- Neurons can be visible or hidden

➔ Often combined in notation:

➔ Activation vector for L visible and M hidden neurons

➔ First L members of network state will be reserved for visible units

- Neurons have undirected link between any two visible and/or hidden units
➔ Each link has weight wij = wji

➔ Special case: wij = 0 means no link between i and j

➔ Self-connections are not allowed (wii must be 0)

In summary:

BM is fully characterized by

1) Symmetric weight matrix W of size (L+M)*(L+M) with 0 diagonal

2) By specification of how many neurons are visible (number L)

Input: u ∈ {0,1}^Lu

Output: y ∈ {0,1}^Ly (in one-hot encoding)

If a test input utest from pattern class j is presented, the j-th output unit is yj = 1 & all other y = 0

L = Lu + Ly

Difference to MLP

The BM can also be run backwards, in generative confabulation mode:

If the BM is trained to classify the digits 1,2,…,9,0, one can generate digits by using the ten classification neurons (Ly = 10, digits 0-9) as input neurons

➔ It then generates patterns of the digit 4 with the input (applied to the output neurons) being (0, 0, 0, 1, 0, 0, 0, 0, 0, 0)

➔ Generation is driven by Metropolis sampling (random sequence/sample of various


patterns ‘4’ from learned distribution of this pattern class)

Therefore, Input layer can be


1) An input “retina” (in classification mode)

2) Output projection screen (in confabulation mode)

Similar with output

➔ Drop classification as input/output

➔ Instead: visible neurons

7.2 The stochastic dynamics of a BM

The temporal evolution of the BM state is a random sequence s1, s2, … of binary network states, generated by an application of Metropolis sampling

➔ Metropolis sampling needs an energy function

The energy E(s) of a state s = (s1,…,sL+M) is defined by:

➔ Gives low values if pairs of neurons si, sj which both have an activation of 1 are connected by large positive weights

➔ any neuron with 0 activation does not contribute to the energy of the state

If neuron si jumps from activation 0 to 1 & all other neurons sj retain their activation, [56] changes by adding the amount

➔ this gives rise to a Boltzmann distribution over the set S = {0,1}^(L+M), which has pmf
➔ the temperature T is usually fixed (mostly at T = 1), which simplifies [58]

➔ a fixed setting of the weights defines an energy landscape & a probability distribution over microstates (EW = energy function induced by the weights W, PW = probability distribution induced by W)

➔ sampling from PW is done with the Boltzmann acceptance function (the Gibbs version of the Metropolis sampler)

➔ update only the selected component si of s to get the new sample point

Exercise:

7.3 The learning task

- BM is trained to learn probability distribution P0(v) over the visible units

- Training data: sample S = (vi)i=1,…,N of patterns sampled from that target distribution

Objective:
Find weights W such that, if the trained network is run with the Metropolis sampler, the
distribution of the patterns that can be read off the visible units is a good approximation of the
target distribution

➔ Can also be done with input, output pairs

➔ BM then learns distribution PW(v) over all visibles, approximating the joint distribution
of input and output vectors

Example:

1. Handwritten 4 as input ➔ 4th output unit (out of 10) is activated much more often

➔ Relative frequency of activations indicates that BM thinks the input is of class 4

2. Ten classification units (output) clamped into classification vector (0,1,0,0,0,0,0,0,0,0).


Sampling process should lead to a sequence of random variations of handwritten “2”
patterns on retina (aka input neurons)

Loss function BM

Should measure how close a BM distribution PW(v) is to target distribution Ptarget(v)

➔ L = 0 if identical

➔ L > 0 if not identical


Standard choice of measure that compares two discrete probability distributions:

Kullback-Leibler distance:

➔ Not true distance because asymmetric

➔ Always nonnegative

➔ 0 iff P = Ptarget
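
The KL distance between two discrete distributions, as a quick numpy check (the toy distributions are made up):

```python
import numpy as np

def kl_divergence(p_target, p_model):
    """KL(P_target || P_model) = sum_s P_target(s) * log(P_target(s) / P_model(s)).
    Nonnegative, zero iff the distributions are identical, and asymmetric."""
    p, q = np.asarray(p_target, float), np.asarray(p_model, float)
    mask = p > 0                               # terms with P_target(s)=0 contribute 0
    return np.sum(p[mask] * np.log(p[mask] / q[mask]))

print(kl_divergence([0.5, 0.5], [0.5, 0.5]))   # 0.0
print(kl_divergence([0.9, 0.1], [0.5, 0.5]))   # > 0
```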

Learning task: solve problem of minimizing empirical risk associated with loss function:

7.4 The learning algorithm

- Weight matrix must be computed that solves [61]

- KL is differentiable with respect to weights, gradient descent optimization can be used

pij = average (over training samples) probability that units i and j are both active (si = sj = 1) when
the visible units are clamped to the training data point v

qij = probability that these two units are simultaneously active in a “free-running” sampling mode
with no external clamping

yields update rule for weights:


Single weight update involves the following operations:

1. Wake phase (BM sees visible input): pij is estimated as follows: for each training
sample point vk, clamp the visible units to vk. While clamped, run BM Metropolis
sampling [59] until a representative sample of network states under this clamping condition
has been collected. This is used to estimate the co-activity probability pkij of i and j in this
clamping condition. Do this for k = 1,…,N.

➔ set pij to the average of all pkij

2. estimation of qij: Similar, but without clamping the visible units (called sleep phase)

3. weight update: apply [63]
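
As a sketch of the resulting update (variable names are mine; p and q are matrices of estimated co-activation probabilities, eta a learning rate), including the sign-based simplification discussed further below:

```python
import numpy as np

def bm_weight_update(W, p, q, eta=0.1, use_sign=False):
    # wake/sleep update: Delta w_ij proportional to (p_ij - q_ij)
    # p: co-activation probabilities with visible units clamped (wake phase)
    # q: co-activation probabilities in free-running mode (sleep phase)
    diff = np.sign(p - q) if use_sign else (p - q)   # sign variant tolerates less accurate sampling
    return W + eta * diff
```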

problem: for single weight update, one must run as many complete sampling runs as there are
training patterns

simplifications to fix problem:

1. Instead of [63] use

- Sign = signum function

➔ Sign of (pij - qij) needs less accurate sampling

2. For each sampling run (estimating p & q) use a two-phase procedure:

1) Start from a higher temperature than the agreed T and carry out simulated-annealing-like
cooling from the higher temperature down to the agreed T (prevents the randomly generated
starting state from remaining stuck in some narrow, untypical local energy minimum)

2) Sample at the target temperature T


➔ Does not solve computational load problem

➔ Hence, BMs are never used in practical applications

7.5 The restricted BM (RBMs)/ deep belief networks (DBNs)

- DBNs are layered neural networks ➔ each layer corresponds to one BM

➔ The hidden units of one BM become the visible units of the next-higher BM

- Reduction of layer connectivity: no within-layer connections, only connections between adjacent layers (➔ restricted BM)

- Learning is done in divide and conquer fashion, layer by layer

- Sampling estimation of p & q condensed to two state updates per BM layer

➔ Done using approximate algorithm contrastive divergence

8 Reservoir Computing

- Tasks solved are the same as those solved by RNNs and LSTMs: supervised timeseries-
in, timeseries-out learning tasks of all sorts

Repetition section 4:
Difference to training RNNs

In RNNs: train/optimize all parameters W, Win, Wout, b

In RC: only the readout weights Wout are trained; the rest is randomly set and fixed.

Advantages:

1. Lower computational cost than BPTT

2. Algorithm numerically robust and no local minima problem

3. Either same or better quality solutions compared to BPTT-trained RNNs

4. RC is biologically plausible

5. RC can work with equations that are not differentiable!


Demo

- Easy example with 1 input and output neuron

Input: pattern C1 or C2

Output: 1 if C1, 0 if C2

➔ Since in- and output are one-dimensional: K = M = 1

The LxL internal weight matrix W, the Lx1 input weight matrix Win and the L-dimensional bias
vector b from [64] are filled with random values from a uniform distribution centered around 0

➔ Positive and negative values occur roughly equally often

Formal Statement of Learning Objective

➔ Compute 1xL output matrix Wout

➔ For activation function f in [65], identity is used

➔ Network output signal:

Learning objective is to solve:

Solving [67] is done in two steps:

1) State Harvesting

Randomly created network is driven by the teacher input for a duration of nmax steps

➔ We record activations xi(n) of each of the reservoir neurons

➔ This gives L timeseries of length nmax


2) Compute Readouts

- Optimal readout weights Woptout are computed

➔ Based on the L harvested state sequences xi = (xi(1),…,xi(nmax)) (nmax-dimensional vectors)

- If Wout = (w1out,…, wLout), the network output signal is the nmax-dimensional vector

Ŷtrain = Σ_{i=1..L} wiout xi

[67] can be rewritten as:


i.e.: the L vectors xi must be linearly combined such that the combination Σ_{i=1..L} wiout xi best approximates
the teacher vector Ytrain in the least mean square error sense

- After this step, training is finished. The found weights Woptout are inserted into the network
architecture ➔ network is ready for testing and use
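
Putting the two steps together, a minimal end-to-end sketch under assumptions of mine (tanh reservoir update as in [64], identity readout as in [65], toy one-dimensional teacher signals standing in for the real training data):

```python
import numpy as np

rng = np.random.default_rng(0)
L, n_max = 100, 1000                        # reservoir size, number of training steps

# toy one-dimensional teacher data (placeholders for the real training signals)
u_train = rng.uniform(-1.0, 1.0, n_max)
y_train = np.sin(np.cumsum(u_train) * 0.1)

# fixed random weights, roughly balanced around 0
W_in = rng.uniform(-0.5, 0.5, (L, 1))
W = rng.uniform(-0.5, 0.5, (L, L))
W /= np.max(np.abs(np.linalg.eigvals(W)))   # spectral radius 1 (see "Scaling of Initial Weights" below)
b = rng.uniform(-0.5, 0.5, L)

# step 1: state harvesting
X = np.zeros((n_max, L))
x = np.zeros(L)
for n in range(n_max):
    x = np.tanh(W @ x + W_in[:, 0] * u_train[n] + b)
    X[n] = x

# step 2: readout weights by ordinary least squares (ridge regression is preferable, see 8.2)
W_out, *_ = np.linalg.lstsq(X, y_train, rcond=None)
y_hat = X @ W_out                           # training output, to be compared with y_train
```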

Testing

Network is driven with fresh input data, for which a correct output signal is known

➔ Network generated Ŷ can be compared to correct output signal

Computational Cost

Assuming that the number K of input channels is less than the reservoir size L

➔ the random initialization of all the fixed weights costs O(L2)

State harvesting needs nmax network updates

➔ each update costs O(L2)

Linear regression

➔ costs O(L3)

➔ total cost: O(nmax * L2 + L3)


Normally one should have more training data time points than reservoir neurons (if it is the other way
around, one gets a zero-error solution from the linear regression ➔ overfitting)

➔ cost is dominated by harvesting phase and becomes O(nmax * L2)

➔ cost of training is essentially the same as running the network once with the training input
data (cannot be cheaper)

8.2 RC in practice

- A reservoir is a high-dimensional nonlinear dynamical system whose set-up requires difficult design decisions

Discard washouts

- the arbitrarily chosen initial state x(0) leaves traces over the next few network states

➔ “initial state memory” effect

- State sequences started from different initial states converge toward each other; the rate of
convergence is exponential on average

➔ The initial state differences are washed out

➔ Use only states harvested after an initial washout period for the linear regression (step 2)

➔ Length of washout best computed by following plot:


(Plot: top dashed curve = initial state memory; bottom curve = log of the loss function)
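
Continuing the training sketch from above, discarding the washout simply means dropping the first states before the regression (n_washout is an assumed value one would read off such a plot):

```python
n_washout = 100                                      # assumed washout length, taken from a plot like the one above
X_use, y_use = X[n_washout:], y_train[n_washout:]    # keep only states free of initial-state memory
W_out, *_ = np.linalg.lstsq(X_use, y_use, rcond=None)
```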

Scaling of Initial Weights

IMPORTANT

Geometrical and dynamical properties of the harvested state sequence depend on the average
absolute sizes of weights in Win, W, and the bias b

The reference input weights Winref and bias bref can be sampled from a uniform distribution over some range

Wref is obtained by a different method:

1) Sample weights from same uniform distribution

2) Calculate the absolute value of the largest eigenvalue of this preliminary matrix (can be computed for any square matrix)

➔ This is called spectral radius of the matrix (important role in theory and practice)
3) Scale preliminary matrix with 1/maxAbsEigenvalue ➔ Wref gets spectral radius of 1
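
A sketch of this three-step construction (reusing L and rng from the training sketch above; variable names are mine):

```python
W_pre = rng.uniform(-0.5, 0.5, (L, L))            # 1) preliminary random matrix
rho = np.max(np.abs(np.linalg.eigvals(W_pre)))    # 2) spectral radius = largest absolute eigenvalue
W_ref = W_pre / rho                               # 3) reference matrix with spectral radius 1
```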

The reference matrices (Win, W, b) are scaled by a factor (different cases are presented next)

S = small scaling factor, order 0.1

M = medium scaling factor, order 1

L = large scaling factor, order 10

Case A

- Network states like these normally work well

➔ Amplitudes span much of the range [-1,1]


➔ Influenced by the driving input, but the reservoir-internal interactions make them different from
each other

➔ Not always best behavior

Case B

- Activation of reservoir neurons dominated by input term in [64]

➔ Almost all neurons exhibit activation traces that look like scaled versions of the input

➔ Not desirable behavior except when desired output y(n) at time n depends only on the
input u(n) (does not need memory effects)

➔ BUT: a feedforward network would be better suited to this case than an RNN

Case C

- Activations of neurons typically pushed toward [-1,1] limits of tanh sigmoid

- Almost binary “switching” dynamics inside the reservoir results

➔ Appropriate in extremely nonlinear, quasi Boolean input-output learning tasks

Case D

- Reservoir dynamics degrade toward constant values in each neuron

➔ Not beneficial

Case E

- Danger that the network dynamics recurrently self-excite (no longer modulated by the input;
chaotic dynamics emerge)

➔ Useless because “echo state property” is violated


➔ Getting scalings right is crucial for good RC performance

➔ No general rule

➔ Depends on: (1) task and (2) dimensions K and L of input and reservoir

Plots like the case figures above help to see what the reservoir is doing

Output feedback for signal generation tasks

- Some tasks require generating output signal

For instance: no input neurons, single output unit ➔ should yield a sinewave oscillation

➔ Requires that generated output signal is fed back into the reservoir

Using linear output unit, update equations are:

W, Wfb, b fixed at design time

Readouts Wout are trained


For state harvesting, the teacher signal is delayed by one timestep in the output node

➔ In update cycle n, the value written into the output unit is y(n-1)

➔ Accounts for relative time difference of y signal in [69] versus [70]

➔ Wfb assumes the role of input weights in the state harvesting phase

After the output weights have been computed with linear regression (following the washout phase), the network will
settle into the desired oscillation
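
A rough sketch of state harvesting with output feedback (teacher forcing), under assumptions of mine: a linear output unit, no input neurons, a sinewave teacher, reusing rng, W, b, L, n_max from the earlier sketch:

```python
# teacher: the sinewave the network should learn to generate
y_teacher = np.sin(2 * np.pi * np.arange(n_max) / 25)
n_washout = 100
W_fb = rng.uniform(-0.5, 0.5, (L, 1))               # fixed random feedback weights

x = np.zeros(L)
X_fb = np.zeros((n_max, L))
for n in range(n_max):
    y_prev = y_teacher[n - 1] if n > 0 else 0.0      # value written into the output unit is y(n-1)
    x = np.tanh(W @ x + W_fb[:, 0] * y_prev + b)
    X_fb[n] = x

W_out, *_ = np.linalg.lstsq(X_fb[n_washout:], y_teacher[n_washout:], rcond=None)

# at run time the trained network feeds back its own output:
#   y(n) = W_out @ x(n),   x(n+1) = tanh(W x(n) + W_fb y(n) + b)
```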

- Input signals can modulate generated output (e.g. input could change frequency of
generated sine wave)

Echo State Networks (ESNs, reservoir computing by Jaeger) good at signal generation tasks,
especially chaotic signals

Leaky integrator neurons

For good results, RC dynamics and geometrical properties should match properties of task (see
case figure above)

Examples of matching properties:


- Degree of nonlinearity

- Length of memory timespans

- Complexity of task

Most important property:

Speed of system from which the training data comes (gigahertz computer clock vs ocean tides)

➔ No best way to define speed

➔ Since it’s so important to match speed, one needs way to design slow or fast reservoirs

Standard approach:

Neuron model where each neuron’s dynamics are governed by the leaky integration update
equations

➔ Leaky integrator neurons can be slowed down or sped up as one wishes

Explanation starting from continuous-time neuron model:

- Reservoir runs in continuous time

- The L-dimensional activation vector at time t is x(t) (n ∈ Z for discrete time points, t ∈ R for continuous time)

- Activation xi(t) is a continuous-time signal

➔ For leaky integrator neuron, it is governed by following ODE

Wij = synaptic connection weights of incoming connection to neuron i


C = time constant

➔ If integrated with different C, trajectory plot of phase portrait will look the same

o The trajectories follow tangentially the vectors of the vector field given by the
RHS of the ODE

➔ Scaling by C changes only the length of these vectors, not their direction

o growing C slows down the “forward speed” of the point xi(t) along the trajectory

This dynamical law is called the leaky integration model of a neuron because the weighted input
sum integrates the input to the neuron over time, while the -xi term lets the
activation xi(t) diminish (“leak”) at a rate proportional to the current activation

BUT for practical ML applications one needs a discrete-time neuron model with a chosen step size Δ

➔ the continuous and discrete states should be similar at corresponding times nΔ

➔ requires an ODE solver

Euler Method for ODE:

After the point xi(nΔ) is computed, the next point xi((n+1)Δ) is computed by following the direction
of the vector given by the RHS of the ODE for a timespan of Δ:
➔ control speed by setting a ∈ [0,1]

o the larger a, the faster the neuron

o if a = 1 (extreme case) update equation becomes [64] again (first part of equation
disappears)

o if a = 0 (slowest case) nothing happens.

➔ Network state remains frozen in initial state

➔ a is called the leaking rate of the neuron
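
A sketch of the resulting discrete-time leaky-integrator update (one common form; the exact equation in the reader may arrange the terms slightly differently):

```python
import numpy as np

def leaky_update(x, u, W, W_in, b, a):
    # leaky-integrator reservoir update with leaking rate a in [0, 1]:
    # a = 1 recovers the ordinary update [64], a = 0 leaves the state frozen
    return (1.0 - a) * x + a * np.tanh(W @ x + W_in @ u + b)
```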

How to integrate information across many time scales?

- Neither ESNs nor LSTMs trained with BPTT are capable of it

One approach in ESNs: design reservoirs with leaky integrator neurons. Different submodules
have different values for the inverse time constant a

➔ Fast modules/ fast neurons/ fast processing layers are closer to input than slow layers
Regularization and Reservoir Size

- RC is susceptible to overfitting

➔ Flexibility of an RC training scheme must be adapted to the available training data

➔ The degree of flexibility appropriate for the task complexity must be found by experimentation
in a cross-validation scheme

➔ Needs way to tune the degree of modeling flexibility

Recommended approach to do so with ESNs is:

- Use large reservoir (choose size that allows overfitting)

- Use regularized version of linear regression for the readout weight calculation (called
ridge regression)

Ridge Regression/Tykhonov Regularization:

- Should always be used in ML when a linear regression has to be carried out


“Rehearsal of Linear Regression”

Where a ≥ 0 weights the strength of the added regularization (this a is the regularization parameter, not the leaking rate)


➔ Shows how the regularizing term pulls down the resulting weight sizes when a
gets larger

Analytical solution:

➔ The less regularization is applied, the lower the training error

➔ But the test error is high both with very high and very low amounts of regularization (has
minimum for some intermediate degree of regularization)
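
A sketch of the ridge-regression readout computation (the regularization strength is called alpha here to avoid confusion with the leaking rate):

```python
import numpy as np

def ridge_readout(X, y, alpha=1e-4):
    # Tykhonov-regularized least squares:
    # W_out = (X^T X + alpha * I)^(-1) X^T y;  alpha = 0 gives plain linear regression
    d = X.shape[1]
    return np.linalg.solve(X.T @ X + alpha * np.eye(d), X.T @ y)
```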
8.3 Online reservoir training

- [76] is mostly used in practical RC applications

- Sometimes necessary to use iterative stochastic gradient descent to solve [75]

➔ Mandatory in online adaptive learning


Online adaptive learning: properties of the input-to-output signal transformation change with
time and the learning system must continually track the target system

➔ E.g.: valve control in engine: control mechanism changes with temperature, runtime,
load,…

➔ RNN must be continuously re-trained as new measurement data come in

➔ Biology can’t invert a matrix as in [76]

Update equation:

➔ Adjusting wout by doing a small step down the error gradient (adding correction vector
which most strongly reduces the squared error)

➔ Perceptron learning rule is version of this algorithm with learning rate = 1
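
A sketch of one such online update step (LMS-style stochastic gradient descent on the squared output error; names are mine):

```python
import numpy as np

def online_readout_update(w_out, x, y_target, lr=0.01):
    # take a small step down the gradient of the squared error (y_target - w_out . x)^2
    y_hat = w_out @ x
    return w_out + lr * (y_target - y_hat) * x
```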

8.4 The echo state property

- Not every reservoir design works

➔ When the reservoir weight matrix W is scaled too large, the reservoir strongly self-excites

➔ Self-generated dynamics overrule any input

Echo state property: When the network washes out differences of initial states under the
influence of an input signal u(n), the combination of the network and this input has the echo
state property (ESP)
Scaling of W expressed by the spectral radius of W

Spectral radius (SR):

- Important effect on learning accuracy

- Start with Wref with spectral radius 1

➔ Re-run the training with different spectral radii (see the sketch after this list)

- For every input signal and Wref, there is a unique critical spectral radius

➔ For SR < SRcrit, ESP is obtained

➔ For SR > SRcrit ESP is lost

➔ Marks bifurcation in the reservoir dynamics

➔ Can’t be found analytically
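
One way to carry out such a sweep in code, where train_and_test is a hypothetical wrapper around the harvesting and (ridge-)regression steps sketched earlier that returns a validation error:

```python
for sr in (0.1, 0.3, 0.6, 0.9, 1.2, 1.5):
    W_scaled = sr * W_ref                 # rescale the reference matrix to spectral radius sr
    err = train_and_test(W_scaled)        # hypothetical: harvest states, fit readout, evaluate
    print(f"spectral radius {sr}: test error {err:.4f}")
```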

Why is ESP important?

1) RC training only works for input-reservoir combinations that exhibit ESP

2) Pink noise: the human brain works best when it increases its self-excitation level to
just below the point where it bifurcates into chaos
