Richi's Neural Nets Summary
tune a learning algorithm such that, instead of attempting to minimize the risk (4)
(impossible, because the distribution P_U,Y is unknown), it minimizes the empirical risk
(5), which is just the mean loss averaged over the training examples
H = hypothesis space (set of all NNs with specific connection structure, e.g. weights)
Flexibility:
a learning algorithm A is more flexible than another learning algorithm B if
A can fit its computed models fˆ more closely to the training data than B can
A regularization term is a cost function reg: R^M -> R≥0 which penalizes model
parameters θ that encode models with a high degree of geometrical “wiggliness”.
A good penalty term should, of course, (1) assign high penalty values to parameter vectors
θ which represent “wiggly” models; but (2) furthermore it should be easy to compute and
blend well with the algorithm used for empirical risk minimization.
The degree of “oscillativity” can be measured, for instance, by the integral over the
(square of the) second derivative of the polynomial p,
L2-norm regularizer
the sum of squares of all model parameters
This regularizer favors models with small absolute parameters, which often amounts to
“geometrically soft” models.
It is called the L2-norm regularizer because it measures the (squared) L2-norm
of the parameter vector θ.
Computing a solution to the minimization task (10) means to find a set of parameters
which simultaneously minimizes the original risk and the penalty term. The factor
α² in (10) controls how strongly one wishes the regularization to “soften” the solution.
Increasing α (the x-axis in the figure above) means downregulating the model flexibility. For α² =
0 one returns to the original un-regularized empirical risk (which would likely mean
overfitting).
For α² -> infinity, the regularization term entirely dominates the model optimization and
one gets a model which no longer cares about the training data and is instead tuned only
to have minimal regularization penalty.
• In the case of the L2-norm: all-zero parameters -> wiggle-free model
Regularizers are convenient because you can just start with a big (unregularized) flexible model
instead of checking different sizes
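The regularized least-squares minimization above can be sketched in code. This is a hypothetical toy setup (the function `ridge_fit`, the degree-9 polynomial features and the sine data are my own illustration, not from the lecture notes):

```python
import numpy as np

def ridge_fit(X, y, alpha2):
    """Minimize ||X @ theta - y||^2 / N + alpha2 * ||theta||^2.

    Closed-form solution of the L2-regularized least-squares problem.
    """
    N, M = X.shape
    # Normal equations with the L2 penalty added on the diagonal
    return np.linalg.solve(X.T @ X / N + alpha2 * np.eye(M), X.T @ y / N)

# Toy example: fit a degree-9 polynomial to noisy samples of a sine
rng = np.random.default_rng(0)
u = np.linspace(0, 1, 20)
y = np.sin(2 * np.pi * u) + 0.1 * rng.standard_normal(20)
X = np.vander(u, 10)                      # polynomial features

theta_overfit = ridge_fit(X, y, 0.0)      # alpha2 = 0: un-regularized
theta_smooth = ridge_fit(X, y, 1e-3)      # small alpha2: "softened" model
print(np.linalg.norm(theta_smooth) < np.linalg.norm(theta_overfit))
```

The regularized solution has the smaller parameter norm, which is exactly what the L2 penalty favors.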
1.4.1 Tuning model flexibility through adding noise
Another way to tune model flexibility is to add noise.
1. Adding noise to the training data
for each ui, add l variants ui + νi1, . . . , ui + νil of this pattern to the training data,
where the νij are i.i.d. (independently, identically distributed) random vectors (for
instance, uniform or Gaussian noise).
This increases the number of training patterns from N to (l +1)N
The more such noisy variants are added and the stronger the noise,
the more one steers toward the left (underfitting) side
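A minimal sketch of this noise augmentation (the helper `augment_with_noise` and the Gaussian noise choice are my own illustration):

```python
import numpy as np

def augment_with_noise(U, Y, l, sigma):
    """Enlarge the training set from N to (l+1)*N patterns.

    For each input u_i, add l noisy variants u_i + nu_ij, where the
    nu_ij are i.i.d. Gaussian noise vectors; labels are copied over.
    """
    rng = np.random.default_rng(42)
    N, K = U.shape
    noisy = U[None, :, :] + sigma * rng.standard_normal((l, N, K))
    U_aug = np.concatenate([U, noisy.reshape(l * N, K)])
    Y_aug = np.concatenate([Y] + [Y] * l)
    return U_aug, Y_aug

U = np.array([[0.0, 1.0], [2.0, 3.0]])    # N = 2 patterns, K = 2
Y = np.array([0, 1])
U_aug, Y_aug = augment_with_noise(U, Y, l=3, sigma=0.05)
print(U_aug.shape)   # (8, 2): (l+1)*N = 8 patterns
```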
How to estimate the risk of a model
1. Akaike information criterion
2. Cross-Validation
K-fold cross validation:
r = flexibility/regularization strength
• Start with a low-end r and increase it stepwise. For each
flexibility/regularization strength r, assess the generalization quality with cross-
validation until the validation risk R_val(r) starts to rise!
leave-one-out cross-validation
- Used when the data set is small (danger of poor generalizability if the wrong r
was used)
- Split set into singleton subsets (containing a single training example)
Other uses
cross-validation can similarly be used for tuning hyperparameters of a
learning procedure. The term “hyperparameter” is generally used for all
kinds of “knobs to play around with” when tuning a complex learning
algorithm.
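The K-fold scheme can be sketched as follows (the helpers `fit`/`predict` are placeholders for whatever learning algorithm is being tuned; the linear toy data is my own illustration):

```python
import numpy as np

def k_fold_risk(X, y, fit, predict, K=5):
    """Estimate the validation risk by K-fold cross-validation.

    Split the data into K folds; each fold serves once as the
    validation set while the rest is used for training.
    """
    N = len(X)
    folds = np.array_split(np.arange(N), K)
    risks = []
    for k in range(K):
        val = folds[k]
        train = np.concatenate([folds[j] for j in range(K) if j != k])
        model = fit(X[train], y[train])
        err = predict(model, X[val]) - y[val]
        risks.append(np.mean(err ** 2))          # quadratic loss
    return np.mean(risks)

# Toy check: noise-free linear data, least-squares "learning algorithm"
X = np.linspace(0, 1, 50).reshape(-1, 1)
y = 3.0 * X[:, 0] + 1.0
fit = lambda X, y: np.linalg.lstsq(np.c_[X, np.ones(len(X))], y, rcond=None)[0]
predict = lambda w, X: np.c_[X, np.ones(len(X))] @ w
print(k_fold_risk(X, y, fit, predict))  # close to 0 for noise-free data
```

For leave-one-out cross-validation one simply sets K = N, so every fold is a singleton.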
2. Feedforward Networks
1. Perceptron
Figure 7
Perceptron learning rule:
ONLY MINIMIZES TRAINING ERROR! DANGER OF
OVERFITTING
• Can only classify what is linearly separable: cannot learn XOR function
because it’s not separable
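The perceptron learning rule in code — a minimal sketch (labels in {-1, +1}, bias absorbed into the weight vector; the AND/XOR demonstration is my own illustration):

```python
import numpy as np

def train_perceptron(X, y, epochs=100, lr=1.0):
    """Classic perceptron learning rule.

    Updates w only on misclassified examples; converges iff the
    data is linearly separable — so it cannot learn XOR.
    """
    X = np.c_[X, np.ones(len(X))]          # absorb bias into weights
    w = np.zeros(X.shape[1])
    for _ in range(epochs):
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (w @ xi) <= 0:         # misclassified (or on boundary)
                w += lr * yi * xi          # perceptron update
                errors += 1
        if errors == 0:
            break                          # separating hyperplane found
    return w, errors

# AND is linearly separable; XOR is not
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], float)
w_and, err_and = train_perceptron(X, np.array([-1, -1, -1, 1]))
w_xor, err_xor = train_perceptron(X, np.array([-1, 1, 1, -1]))
print(err_and, err_xor)   # AND converges to 0 errors; XOR never does
```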
Differences to Perceptron
1. MLPs have many summation units coupled together in parallel and serial
connectivity
2. Learning rule replaced by error backpropagation: allows the synaptic weights to
be iteratively adapted to minimize a given loss function
3. An MLP neuron receives input from and sends output to many different neurons.
-> each such connection is still characterized by a weight (just like perceptron)
An MLP with K input units and M output units thus instantiates a network
function N: R^K -> R^M
• N_θ: R^K -> R^M
• Notation used to emphasize the dependence on the weights θ
MLP Structure
weights
2.2. add the bias -> interpreted as the weight of a synaptic link from a bias
unit in layer k-1 that always has an activation of 1 (see below)
2.3. wrap the obtained sum in an activation function: a nonlinear, s-shaped function
3. The output layer y (dimension LK) is computed from the activations of the pre-
output layer LK-1 depending on the task setting.
(without activation function)
Activation Function
• Most tasks can be solved with 1 hidden layer, but more layers reduce the overall
size of the MLP (quantified by the total number of weights)
General Recipe
Weakness
- Final model depends on the choice of the initial model
- If the learning rate is too large: the risk R(θ^(n+1)) may be > R(θ^(n)) (red line below)
Figure 16
➔ The gradient descent algorithm has become unstable
➔ The learning rate must be sufficiently small to avoid instability and
divergence (green arrow) (price is slow convergence)
➔ Becomes very slow when approaching ‘valley’ of graph (shaded dots)
What does it mean for a learning rate to be too large?
➔ The more layers, the more probable it is to get a high ratio of
partial derivatives.
➔ If Taylor approximation is precise: second-order gradient descent points
exactly towards target minimum (magenta arrow in picture)
➔ Computing the Hessian is expensive & subject to numerical problems &
assumptions may be violated
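What "too large" means can be seen on a 1-D quadratic risk (my own toy illustration, not from the notes): for R(θ) = θ²/2 the gradient step is θ <- (1 - μ)θ, which converges only for 0 < μ < 2 and diverges beyond that.

```python
# Gradient descent on the 1-D quadratic risk R(theta) = theta**2 / 2,
# whose gradient is simply theta. The update theta <- (1 - mu) * theta
# converges for learning rates 0 < mu < 2; beyond that, every step
# overshoots the valley and the risk grows (instability / divergence).
def descend(mu, steps=20, theta=1.0):
    risks = []
    for _ in range(steps):
        theta = theta - mu * theta       # gradient step
        risks.append(theta ** 2 / 2)     # risk after the step
    return risks

small = descend(mu=0.1)    # slow, monotone convergence
large = descend(mu=2.5)    # |1 - mu| > 1: the risk increases every step
print(small[-1] < small[0], large[-1] > large[0])
```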
Backpropagation
Backwards pass:
1.
Delta rule
2.
Error calculation in 2nd run:
https://mattmazur.com/2015/03/17/a-step-by-step-backpropagation-example/
Analytical modeling
Blackbox modeling
- Very applicable because usually you can only observe measurements, not
states (experimental data)
- Explain data by generative mechanism (based on empirical measurement
data, infer state transition system)
➔ Can approximate every discrete-valued stochastic process arbitrarily
well (better approximation ➔ larger state space)
➔ HMM is a special case of a dynamic graphical model: HMM is DGM with
single input set, single state set, single output set (HMMs trained by
expectation maximization algorithm, DGMs by generalization of HMM
learning algorithm)
Controlled Markov Models (POMDPs)/Partially Observable Markov
Decision Processes
- Crossover of controlled Markov chains and HMMs
- Basic tool in theory of autonomous agents because it constitutes the
agent’s world model:
Q: external world states
A: agent’s action in world
O: sensory feedback from world
➔ Action-sensor-feedback timeseries data needed
1-dim deterministic Cellular Automata (CA)
- Finite sets of cells with local (neighbor) transition function (picture thingy
from autonomous systems)
- CA defines global transition function
Periodic Attractor:
Bifurcation: when a control parameter A passes through a critical value (here A=0), the phase
portrait changes its nature. Also, only at the critical value is the system structurally unstable.
Left and right of it, the phase portrait is structurally stable
Phase Transitions
From thermodynamics, not DS (point where water becomes gas, liquid, solid)
➔ also has critical values
but: a phase transition is an infinite-dimensional stochastic phenomenon, while a bifurcation is low-dimensional and deterministic
➔ bifurcation can be used to model PTs in low-dim mean-field model
➔ Specifies how the network activation state is updated from one timestep to the next
➔ In first iteration (n=1), x(n) is replaced by initial state x0
x(n) = vector of activations of the neurons inside the RNN (in R^L, L = number of neurons)
W = L x L weight matrix
Win = L x K input weight matrix (K input neurons)
u(n) = input signal
b = bias vector (in R^L)
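One state-update step with these ingredients, as a sketch (the tanh nonlinearity, the sizes and the constant drive signal are my own assumptions for illustration):

```python
import numpy as np

# One step of the RNN state update: x(n) = tanh(W x(n-1) + Win u(n) + b),
# with x(0) = x0 used in the first iteration.
L, K = 5, 2                               # L neurons, K input channels
rng = np.random.default_rng(1)
W = rng.standard_normal((L, L)) * 0.3     # L x L recurrent weights
Win = rng.standard_normal((L, K))         # L x K input weights
b = rng.standard_normal(L)                # bias in R^L

def rnn_step(x_prev, u_n):
    return np.tanh(W @ x_prev + Win @ u_n + b)

x = np.zeros(L)                           # initial state x0
for n in range(10):                       # drive with a constant input
    x = rnn_step(x, np.array([1.0, -1.0]))
print(x.shape)   # (5,): one activation per neuron
```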
2. Non-stationary DS
- the signal changes over time
➔ microphone signal of someone saying ‘zero’: has z at beginning and o at end
➔ training data consists of many individual recordings (e.g. of many people saying zero), which
is double indexed:
➔ Both (1., 2.) can be stochastic or deterministic but most real-life systems are stochastic
➔ Stationarity is no perfect dichotomy. It depends on how long a system is observed. (e.g.
(ignoring climate change) weather changes daily but follows the same pattern over the
years)
For these short- long term distinctions, you can follow two strategies to implement RNNs
1. One complex RNN, capable of changing ‘mode’ in shorter time intervals
2. Several simpler RNNs, one for each typical non-stationary sub pattern. Then, combine
them into a compound multi-RNN system, where individual RNNs are
activated/deactivated to give rise to the overall pattern
Typical approach for stationary systems: train the NN to predict one timestep ahead, feed the
prediction back as input, and iterate
MLPs use backprop for 1., which requires a feedforward network topology without
connection cycles (which LSTMs do have)
➔ (34) is therefore solved with backpropagation through time (BPTT)
BPTT
➔ Significant problem where input-output effects with arbitrary long delay (human
language)
2. Vanishing/Exploding Gradient Problem
- Surfaces when error gradients are backpropagated through many layers (the deeper the
network the more severe the problem)
- RNNs are much deeper than MLPs due to many time-slice layers
➔ Vanishing gradient problem occurs in RNNs AND MLPs
For quadratic loss, the gradient of the loss with respect to weight w is
Critical term: w^(h-1). If h is large and |w| < 1, the gradient goes towards 0; if h is large and |w| > 1, the
gradient explodes with depth h (because of the exponent)
➔ Since the unfolded RNN are identical copies, gradient components arising from pathways
that span large temporal horizons (large h) are repeatedly quenched or expanded at
every timestep
➔ If learning task includes the exploitation of long-term delayed input-to-output effects
(long-term memory), the vanishing/exploding gradient problem will make it practically
impossible to encode these long-term effects during learning
➔ Best answer to problem: LSTM
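The critical factor w^(h-1) is easy to see numerically (toy illustration, assuming a chain of h layers that all reuse the same scalar weight w, as in an unfolded RNN):

```python
# Gradient contributions through h layers sharing weight w scale like
# w**(h-1): they vanish for |w| < 1 and explode for |w| > 1.
def gradient_factor(w, h):
    return w ** (h - 1)

for h in (10, 50, 100):
    print(h, gradient_factor(0.9, h), gradient_factor(1.1, h))
# w = 0.9: the factor shrinks towards 0   (vanishing gradient)
# w = 1.1: the factor blows up with depth (exploding gradient)
```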
LSTM networks
- Helps with vanishing gradient problem and to cope with multiple-timescale tasks
U(n) ∈ R
Simplified equations:
Currently, the gradient of the error with respect to the two trainable weights is
➔ At one point wout*win = 1 (therefore, gradient is 0)
➔ But: gradient is independent of h (no vanishing/exploding gradient with large h)
Why?
If s(n) = 0, the activation state c(n) stays the same. The weight w that led to the
vanishing gradient problem in [35] (whenever it is not equal to 1) is in effect fixed to 1 in [36] (LSTM)
An LSTM network is an RNN which consists of several memory blocks and possibly other
‘normal’ RNN units.
Memory blocks and normal units may receive external inputs and/or recurrent input from other
blocks/units within the network.
F = sigmoid function
Bu = bias vector
➔ Memory cells do not necessarily preserve their previous value with a factor of 1
➔ They may leak some of it through the multiplication with the forget gate neuron value
➔ Output may be fed to other memory blocks or normal units in the LSTM network
Why do the three gate units all have logistic sigmoid as their squashing function?
All participating neurons are connected to all others by some cyclic pathways (this is not the case
in an LSTM block)
5. Hopfield Networks
Problem of catastrophic forgetting: MLP learns Task A, then Task B but forgets Task A in the
process
➔ The active field of continual learning within deep learning tries solving this problem
➔ This field is incompatible with the dogma of basing LTM on fixing stable synaptic
weights
➔ Backprop algorithm is not biologically feasible (and suffers from catastrophic forgetting)
➔ Human brain might need sleep for that
How do we recall memories?
2 problems:
1) Addressing
➔ How do you point to memories (in computer by pointers)
➔ Brain might use content addressing: To access grandmother, you start with some
fragment
➔ Called associative memory model
2) Decoding
➔ twin problem of encoding problem
Nutshell HN
- Can be trained to store a finite number of patterns that are retrievable by an auto-
association process
Pattern: A pattern is a binary vector {-1,1}L (also works with {0,1} but it is more complex)
➔ The cue is a corrupted version of one of the stored patterns (i.e. the cue pattern agrees with one
stored pattern in some, but not all, pixels; the difference can be substantial)
Pattern Restoration:
Pattern Completion:
Main Design Principles for the Associative Pattern Recall Functionality in Hopfield nets:
- Input pattern u
- Discrete-time state update dynamics of HN leads to sequence of patterns (originating
from u)
➔ At every update, the new state x(m) has energy <= that of x(m-1)
➔ The set of all cues u that are attracted by a fundamental memory (point attractor) is the basin of
attraction of that fundamental memory
➔ W has 0’s on the diagonal and is symmetric about that diagonal
➔ This means that wij = wji (i.e. the connections are undirected)
Example with L = 4:
State:
Key idea:
Energy:
Defined by:
- Z=
- If z>0 ➔ sign(z) = 1
- If z<0 ➔ sign(z) = -1
3. Update only the activation of the neuron xi, not xi-1/xi+1 and so on
5.3 Training an HN
Learning problem:
Given N L-dimensional training patterns, find a weight matrix W such that an energy landscape is
created that has all training patterns located at different local minima.
2 methods to do so:
If weight Matrix W exists which solves the learning problem, it can be written as
I = L x L identity matrix
➔ 1/L normalizes the energy levels expressed in [41] so that they do not grow with network
size
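A sketch of storing and recalling patterns with the outer-product (Hebbian) formula. Here I zero the diagonal of W, which has the same effect as subtracting the N/L identity term; the particular patterns and the corruption of 3 bits are my own illustration:

```python
import numpy as np

def store(patterns):
    """Hebbian storage: W = (1/L) * sum_k x_k x_k^T, diagonal zeroed
    so that w_ii = 0 and W stays symmetric (w_ij = w_ji)."""
    L = patterns.shape[1]
    W = patterns.T @ patterns / L
    np.fill_diagonal(W, 0.0)
    return W

def energy(W, x):
    return -0.5 * x @ W @ x

def recall(W, x, steps=200, seed=0):
    """Asynchronous updates: pick ONE random neuron, set it to the sign
    of its input sum; the energy never increases along the way."""
    rng = np.random.default_rng(seed)
    x = x.copy()
    for _ in range(steps):
        i = rng.integers(len(x))
        x[i] = 1 if W[i] @ x >= 0 else -1
    return x

L = 16
patterns = np.array([[1, -1] * 8, [1] * 8 + [-1] * 8])   # N = 2, N/L small
W = store(patterns)
cue = patterns[0].copy()
cue[:3] *= -1                              # corrupt 3 of the 16 bits
restored = recall(W, cue)
print(np.array_equal(restored, patterns[0]))  # pattern restoration
```

The corrupted cue has higher energy than the stored pattern, and the update dynamics slide it back down into the pattern's energy valley.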
1) Initialization: create random initial weight matrix W(0) (symmetric with 0 diagonal)
2) Loop
[learning rate]
➔ If two HNs are trained with the same data but with differing learning rates in [44], the resulting
energy landscapes will be the same up to a scaling factor
➔ The order in which training patterns are presented does not matter! It will lead to same
landscape given by [43] (up to a scaling factor)
3) Stop
➔ (best) when the pairwise ratios of weights in the sequence W(k) appear to converge up to
a predefined, small residual change
EXERCISE: why does [45] always lower the energy of the training pattern (check from energy
definition [41])?
➔ We incrementally change W by making it yield lower energies for the training patterns
➔ At the same time, it does not lower energies for patterns outside the training set more
than for training patterns
Hebbian learning: learning rule [44] could be effective in biological brains.
- Neurons can only send spikes that last milliseconds, impressions may last seconds
➔ First one solved in HB: equal rights to positive and negative neural activations and
weights
5.4 Limitations
It is not always possible to store N training patterns in an L-neuron HN. Things can go wrong:
o If N/L > 0.138, [43] breaks down entirely: none of the patterns will be stored
o HN works well if 0.03 < N/L < 0.05 (training patterns are in local minima that
correspond to perfect copies)
- Spurious states: when at location of stored training patterns, other local minima are
created (false memories)
The storing formula [43] leads to a slight misplacement of the local minimum associated with
a training pattern.
➔ If the HN is cued with the exact pattern, the state update dynamics will move down the
energy valley until it arrives at a nearby local minimum (the slightly displaced version of
the training pattern).
➔ Some bits in the original pattern will become flipped and are unstable
➔ Maximum is P=0.5
3 different kinds:
- For N/L > 0.138: spin glass states are the only stable ones
- For 0 < N/L < 0.138: stable states close to desired fundamental patterns exist
- For 0 < N/L < 0.05: pattern-related stable states have lower energy than spin glass states
(patterns are the actual local minimum)
- For 0.05 < N/L < 0.138: spin glass states dominate (some of them have lower energy than
pattern-related states)
- For 0 < N/L < 0.03: additional mixture states exist (but energy is not quite as low as
pattern-related states)
Miscellaneous notes
In [42] we randomly consider one bit for flipping or not flipping (locally stochastic)
➔ This difference is common to all kinds of dynamical systems with states made of
discrete bits
➔ If asymmetry were allowed (e.g. wij ≠ wji), an energy function could no longer be
defined & state updates would not necessarily reduce the energy
➔ Other attractors would emerge (for instance cycles), which breaks use case of patterns as
point attractors
Heteroassociative Networks
HN pro’s
HN con’s
➔ Universal statistical inference device (can compute conditional probabilities of any sort)
2 key ideas
Statistical physics: derive macroscopic (measurable, spoken words) observables and laws as
statistical distributions to explain how microscopic (neurons) particles interact
Microstates = s
➔ The probability of a microstate depends only on E(s) and the temperature T of vessel/heat
bath system (the vessel is surrounded by uniform temperature water)
➔ Position & velocity vector microstates are continuous-valued vectors, therefore the
distribution assigns a probability density value to every microstate
➔ Lowest energy state is most likely (localizes distribution at lowest energy state)
➔ The property that slow cooling leads to global minimum is part of simulated annealing
1) If a constant is added to the energy function, the resulting distribution is the same: P’ = P
2) Every globally nonzero probability distribution (i.e. P(s) ≠ 0 for all s ∈ S), for any
temperature T, can be written as a Boltzmann distribution
Theory behind sampling is called Markov Chain Monte Carlo (MCMC) sampling
Probability distributions are basis of scientific research when it comes to dealing with
uncertainty in data
➔ Pdfs and pmfs are usually high-dimensional and can not be visualized
➔ For any s ∈ high dimensional S, you can compute the pdf value p(s) or pmf value P(s)
➔ But this is not always possible because of numerical underflow (numbers below the
64-bit floating-point range are treated as 0)
1) Get estimates of probabilities for events that a researcher is interested in (and estimates of
expectations, variances, partition function, all integrals over functions over the sample
space)
➔ sampler used in BM
What it needs:
➔ Metropolis only needs the ratios p(s)/p(s’) or P(s)/P(s’) (these ratios remain the same if the pdf or pmf
is scaled by a constant factor)
Given:
- Non-negative function
Task:
- Generate (endless) sequence s1,s2,… such that, in the long run, this sequence of sampling
points would remodel the pmf landscape
- Generate the sample point sequence s1,s2,... by means of a Markov process mechanism
➔ Every newly generated point sn depends only on sn-1 (M process generalizes discrete,
finite-state M chains to continuous state spaces)
- S1 is chosen arbitrarily
Assume sn is computed
How to compute sn+1
Proposal distribution: conditional distribution over S which gives the probability to choose s*
given sn
➔ Should be chosen so that one can sample from it cheaply (e.g. a multidimensional Gaussian
centered on sn)
➔ F = formula [49]
➔ Two proposal distributions in step 1 have this property (heart of Metropolis algorithm)
➔ Finding right one needs insight into the probability distribution one wants to sample from
➔ Instead of using s1, s2, … retain only every h-th point, recording only sh, s2h, …
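A minimal Metropolis sampler for a 1-D unnormalized density (the two-bump mixture target and the Gaussian proposal are my own illustration; note that only the ratio p(s*)/p(s_n) is ever needed, so the normalization constant never appears):

```python
import numpy as np

def p_unnorm(s):
    # Unnormalized target density: mixture of two Gaussian bumps at +-2
    return np.exp(-0.5 * (s - 2.0) ** 2) + np.exp(-0.5 * (s + 2.0) ** 2)

def metropolis(n_samples, step=1.0, seed=0):
    rng = np.random.default_rng(seed)
    s = 0.0                                        # s_1 chosen arbitrarily
    out = []
    for _ in range(n_samples):
        s_star = s + step * rng.standard_normal()  # Gaussian proposal at s_n
        # accept with probability min(1, p(s*)/p(s_n))
        if rng.random() < p_unnorm(s_star) / p_unnorm(s):
            s = s_star
        out.append(s)                              # rejected -> repeat s_n
    return np.array(out)

samples = metropolis(20000)
print(samples.mean(), samples.std())   # mean near 0, std near that of the mixture
```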
Optimization task: search space S and a cost function. Goal is to solve the cost minimization
problem:
➔ This is the form of the neural network training objective in supervised learning!
To find the global minimum one needs to search the whole state space S, trying low-cost candidates more
often than high-cost ones, while making sure not to get trapped in a local minimum
2 main families:
1) Evolutionary optimization:
- Cost function: inverse of biological fitness
- Next generation derived from previous generation favoring high fitness parents and
random variation in genes
2) Simulated Annealing:
- Run Metropolis sampling from the Boltzmann distribution starting with high temperature
and slowly cooling down (nudges search for global min)
2. Identify the cost function (whose global minimum should be found) with the energy function
E(s)
➔ If cooling is too fast / one starts at too low a temperature ➔ get stuck in a local min (suboptimal)
2 cases:
➔ When the energy of the proposed state is lower than the energy of the previous state,
accept with certainty
2)
➔ The greater the energy increase, the exponentially more unlikely acceptance becomes
➔ But: at higher temperature, we take higher jumps uphill (allows to get out of local min)
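The two acceptance cases can be sketched in a small annealing loop (the double-well cost function, the geometric cooling schedule and all parameter values are my own illustration):

```python
import math, random

def cost(s):
    # Double-well cost: local minimum near s ~ +1.3,
    # global minimum near s ~ -1.45 (cost about -5.4)
    return s ** 4 - 4 * s ** 2 + s

def anneal(T=5.0, cooling=0.95, sweeps=200, seed=3):
    rng = random.Random(seed)
    s = 2.0                                    # start in the WRONG well
    best = s
    while T > 1e-3:
        for _ in range(sweeps):
            s_new = s + rng.gauss(0, 0.5)      # proposal
            dE = cost(s_new) - cost(s)
            # downhill: accept with certainty;
            # uphill: accept with probability exp(-dE / T)
            if dE <= 0 or rng.random() < math.exp(-dE / T):
                s = s_new
                if cost(s) < cost(best):
                    best = s
        T *= cooling                           # slow geometric cooling
    return best

s_min = anneal()
print(s_min)   # should end up near the global minimum around -1.45
```

At high T the chain jumps freely between the wells; as T drops, it settles into the deeper (global) one.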
Widely used cooling process:
Example:
6.4.1 Optimizing Compute Hardware Layout: circuit partitioning subtask
Generative: BM comes complete with a sampling algorithm, which produces sample items from
the memorized distribution
7.1 – Architecture
BM is
- Recurrent NN
- Neurons have undirected link between any two visible and/or hidden units
➔ Each link has weight wij = wji
In summary:
BM is fully characterized by
Input: u ∈ {0,1}Lu
If a test pattern utest from pattern class j is presented, the j-th output yj = 1 & all other y = 0
L = Lu + Ly
Difference to MLP
If the BM is trained to classify digits 1,2,…,9,0, one can generate digits by using the ten
classification neurons (Ly = 10, digits 0-9) as input neurons
➔ It then generates patterns of the digit 4 when the input (from the output neurons) is (0, 0, 0, 1,
0, 0, 0, 0, 0, 0)
Temporal evolution of the BM state is in the format of random sequences s1, s2, … of binary
network states which is generated by an application of Metropolis sampling
➔ Gives low values if pairs of neurons si, sj which both have an activation of 1 are
connected by large positive weights
➔ any neuron with 0 activation does not contribute to the energy of the state
if neuron si jumps from activation 0 to 1 & all other neurons sj retain activation, [56] changes by
adding the amount
➔ gives rise to Boltzmann distribution over set S = {0,1}L+M, which has pmf
➔ temperature T usually fixed (mostly at T=1), which simplifies [58]
➔ fixed setting of weights defines an energy landscape & probability distribution over
microstates (Ew = Energy distribution induced by weight W, PW = Prob. Distribution
induced by W)
Exercise:
- Training data: sample S = (vi)i=1,…,N of patterns sampled from that target distribution
Objective:
Find weights W such that, if the trained network is run with the Metropolis sampler, the
distribution of the patterns that can be read off the visible units is a good approximation of the
target distribution
➔ BM then learns distribution PW(v) over all visibles, approximating the joint distribution
of input and output vectors
Example:
1. Handwritten 4 as input ➔ 4th output unit (out of 10) is activated much more often
Loss function BM
➔ L = 0 if identical
Kullback-Leibler distance:
➔ Always nonnegative
➔ 0 iff P = Ptarget
Learning task: solve problem of minimizing empirical risk associated with loss function:
pij = average (over training samples) probability that units i and j are both active (si = sj = 1) when
the visible units are clamped to the training data point v
qij = probability that these two units are simultaneously active in a “free-running” sampling mode
with no external clamping
1. Wake phase (BM sees visible input): pij is estimated as follows: for each training
sample point vk clamp the visible units to vk. While clamped, run BM Metropolis
sampling [59] until a representative sample of network states under clamping condition
has been collected. This is used to estimate p^k_ij, the probability of co-activity of i and j in this clamping
condition. Do this for k = 1,…,N.
2. estimation of qij: Similar, but without clamping the visible units (called sleep phase)
Problem: for a single weight update, one must run as many complete sampling runs as there are
training patterns
1) Start from higher temperature than agreed T and carry out simulated annealing like
cooling from higher temp to agreed T (prevents that the randomly generated starting
state remains stuck in some narrow untypical local energy minimum)
8 Reservoir Computing
- Tasks solved are the same as those solved by RNNs and LSTMs: supervised timeseries-
in, timeseries-out learning tasks of all sorts
Repetition section 4:
Difference to training RNNs
In RC, only the readout weights Wout are trained; the rest are randomly set and fixed.
Advantages:
4. RC is biologically plausible
Input: pattern C1 or C2
Output: 1 if C1, 0 if C2
The L x L internal weight matrix W, the L x K input weight matrix Win and the L-dimensional bias
vector b from [64] are filled with random values from a uniform distribution around 0
1) State Harvesting
Randomly created network is driven by the teacher input for a duration of nmax steps
- After this step, training is finished. The found weights Wout_opt are inserted into the network
architecture ➔ the network is ready for testing and use
Testing
Network is driven with fresh input data, for which a correct output signal is known
Computational Cost
Assuming that the number K of input channels is less than the reservoir size L
Linear regression
➔ costs O(L3)
➔ cost of training is essentially the same as running the network once with the training input
data (cannot be cheaper)
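The whole pipeline — random fixed reservoir, state harvesting, ridge-regression readout — in one sketch. The delay-1 task, the spectral-radius value 0.9 and all sizes are my own illustrative choices:

```python
import numpy as np

# Echo state network sketch. Task: reproduce the input delayed by one
# step (requires the reservoir's short-term memory). Only Wout is trained.
rng = np.random.default_rng(7)
L, K, n_max, washout = 100, 1, 2000, 100

W = rng.uniform(-1, 1, (L, L))
W *= 0.9 / max(abs(np.linalg.eigvals(W)))   # rescale to spectral radius 0.9
Win = rng.uniform(-0.5, 0.5, (L, K))
b = rng.uniform(-0.2, 0.2, L)

u = rng.uniform(-1, 1, (n_max, K))          # teacher input
y_teach = np.roll(u[:, 0], 1)               # target: u delayed by one step

# 1) state harvesting: drive the random network with the teacher input
x = np.zeros(L)
states = np.zeros((n_max, L))
for n in range(n_max):
    x = np.tanh(W @ x + Win @ u[n] + b)
    states[n] = x

# 2) ridge-regression readout (regularizer rho), discarding the washout
X, Y = states[washout:], y_teach[washout:]
rho = 1e-8
Wout = np.linalg.solve(X.T @ X + rho * np.eye(L), X.T @ Y)

train_nrmse = np.sqrt(np.mean((X @ Wout - Y) ** 2)) / Y.std()
print(train_nrmse)   # small: the readout recovers the delayed input
```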
8.2 RC in practice
Discard washouts
- the arbitrarily chosen initial state x(0) leaves traces over the next few network states
- The rate at which network states started from different initial states converge to each
other is exponential on average
IMPORTANT
Geometrical and dynamical properties of the harvested state sequence depend on the average
absolute sizes of weights in Win, W, and the bias b
Wrefin and bref can be sampled from uniform distribution of some range
➔ The largest absolute eigenvalue of a matrix is called its spectral radius (it plays an important role in theory and practice)
3) Scale the preliminary matrix by 1/maxAbsEigenvalue ➔ Wref gets spectral radius 1
The reference matrices (Win, W, b) are scaled by a factor (different cases are presented next)
S = small, order 0.1
M = medium, order 1
L = large, order 10
Case A
Case B
➔ Almost all neurons exhibit activation traces (look like scaled versions of input)
➔ Not desirable behavior except when desired output y(n) at time n depends only on the
input u(n) (does not need memory effects)
➔ BUT: feedforward network would be better for this case than RNN
Case C
Case D
➔ Not beneficial
Case E
- Danger that network dynamics recurrently self-excite (no longer modulated, chaotic
dynamics emerge)
➔ No general rule
➔ Depends on: (1) task and (2) dimensions K and L of input and reservoir
For instance: no input neurons, a single output unit ➔ should yield a sinewave oscillation
➔ Requires that generated output signal is fed back into the reservoir
➔ In update cycle n, the value written into the output unit is y(n-1)
After output weights are computed with linear regression and wash out phase, the network will
settle in the desired oscillation
- Input signals can modulate generated output (e.g. input could change frequency of
generated sine wave)
Echo State Networks (ESNs, the reservoir computing flavor by Jaeger) are good at signal generation
tasks, especially chaotic signals
For good results, RC dynamics and geometrical properties should match properties of task (see
case figure above)
- Complexity of task
Speed of system from which the training data comes (gigahertz computer clock vs ocean tides)
➔ Since it’s so important to match speed, one needs way to design slow or fast reservoirs
Standard approach:
Neuron model where each neuron’s dynamics are governed by the leaky integration update
equations
➔ If integrated with different C, trajectory plot of phase portrait will look the same
o The trajectories follow tangentially the vectors of the vector field given by the
RHS of the ODE
o growing c slows down “forward speed” of the point xi(t) along the trajectory
This dynamical law is called the leaky integration model of a neuron because the input term
integrates the input to the neuron over time, while -xi lets the
activation xi(t) diminish (leak away) at a rate proportional to the current activation
BUT for practical ML applications one needs a discrete-time neuron model with a chosen step size Δ.
After the point xi(nΔ) is computed, the next point xi((n+1)Δ) is computed by following the direction
of the vector given by the RHS of the ODE for a timespan of Δ:
➔ control speed by setting a ∈ [0,1]
o if a = 1 (extreme case) update equation becomes [64] again (first part of equation
disappears)
One approach in ESNs: design reservoirs with leaky integrator neurons. Different submodules
have different values for the inverse time constant a
➔ Fast modules/ fast neurons/ fast processing layers are closer to input than slow layers
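The leaky-integrator update and the effect of a, as a sketch (the update form x <- (1-a)x + a·tanh(Wx + Win u + b), the tiny network and the input pulse are my own illustrative assumptions):

```python
import numpy as np

# Leaky-integrator neurons: small a -> slow neuron with long memory;
# a = 1 -> the ordinary update [64] (the (1-a)x term disappears).
rng = np.random.default_rng(2)
L = 3
W = rng.standard_normal((L, L)) * 0.1
Win = rng.standard_normal((L, 1))
b = np.zeros(L)

def run(a, steps=50):
    x = np.zeros(L)
    traj = []
    for n in range(steps):
        u = np.array([1.0 if n < 5 else 0.0])   # brief input pulse
        x = (1 - a) * x + a * np.tanh(W @ x + Win @ u + b)
        traj.append(x.copy())
    return np.array(traj)

fast = run(a=1.0)    # activation decays quickly after the pulse
slow = run(a=0.1)    # activation persists much longer
print(np.abs(slow[-1]).max() > np.abs(fast[-1]).max())
```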
Regularization and Reservoir Size
- RC is susceptible to overfitting
➔ Model flexibility must be adapted to the task complexity by experimentation in a
cross-validation scheme
- Use regularized version of linear regression for the readout weight calculation (called
ridge regression)
Analytical solution:
➔ But the test error is high both with very high and very low amounts of regularization (it has
a minimum at some intermediate degree of regularization)
8.3 Online reservoir training
➔ E.g.: valve control in engine: control mechanism changes with temperature, runtime,
load,…
Update equation:
➔ Adjust wout by taking a small step down the error gradient (adding the correction vector
which most strongly reduces the squared error)
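The online gradient (LMS-style) readout update, sketched with random vectors standing in for reservoir states (the learning rate, the "true" readout and the noise-free targets are my own illustrative assumptions):

```python
import numpy as np

def lms_step(wout, x, y_target, lr=0.01):
    """One online update: nudge wout down the gradient of the squared
    error (y_target - wout . x)^2, so the readout can track drift."""
    y = wout @ x                             # current readout output
    return wout + lr * (y_target - y) * x    # gradient correction step

rng = np.random.default_rng(5)
L = 20
w_true = rng.standard_normal(L)    # pretend "correct" readout weights
wout = np.zeros(L)
for _ in range(5000):
    x = rng.standard_normal(L)     # stand-in for a reservoir state x(n)
    wout = lms_step(wout, x, w_true @ x)
print(np.allclose(wout, w_true, atol=1e-2))
```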
➔ When reservoir weight matrix W is scaled too large, reservoir strongly self excites
Echo state property: When the network washes out differences of initial states under the
influence of an input signal u(n), the combination of the network and this input has the echo
state property (ESP)
Scaling of W expressed by the spectral radius of W
- For every input signal and Wref, there is a unique critical spectral radius
2) PINK NOIIIIISE YOOO! The human brain works best when it increases its self-excitation
level to just below the point where it bifurcates into chaos