
Hopfield networks

Devised by the physicist John J. Hopfield in 1982.

Neural networks and physical systems with emergent collective computational abilities, PNAS (1982) 79:2554–2558

Key idea: a simple recurrent artificial neural network capable of storing and recovering an
input pattern in memory, and of correcting errors in corrupted inputs.

Ingredients of the basic Hopfield network:

- Network of ‘units’ connected by links. Each unit can have only one of two possible values (+1/-1)
at each time point in the evolution of the network.

- Links between any two units or nodes (i, j) are symmetric (undirected graph) → the matrix of
weights W is symmetric (if units represent neurons or populations of neurons, they are connected by mutual
interactions of the same sign and strength).

- Self-connections are not allowed → the diagonal of W is zero.

- The weights between two units depend on the activity of each unit (Hebbian learning).

Example: a 4-unit network

The activity of the network at each time is given by a state vector V (representing the
state of activity of each unit in the network). Assume you want to store a given state
vector in memory, such as

" +1 %
$ '
−1
V=$ '
$ −1 '
$ +1 '
# &

Storing rule: the weight matrix W is a symmetric N×N matrix with zero diagonal. Each
off-diagonal element is calculated as:

  w_ij = V_i · V_j    (for i ≠ j)

        ( w_aa  w_ab  w_ac  w_ad )   (  0  −1  −1  +1 )
  W  =  ( w_ba  w_bb  w_bc  w_bd ) = ( −1   0  +1  −1 )
        ( w_ca  w_cb  w_cc  w_cd )   ( −1  +1   0  −1 )
        ( w_da  w_db  w_dc  w_dd )   ( +1  −1  −1   0 )

  or, in matrix form,  W = V · V^T − I
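As a quick illustration of this storing rule, here is a minimal NumPy sketch (not part of the original notes; the helper name store_pattern is just illustrative):

```python
import numpy as np

def store_pattern(V):
    """Storing rule: W = V V^T - I, i.e. w_ij = V_i * V_j with zero diagonal."""
    V = np.asarray(V, dtype=int)
    return np.outer(V, V) - np.eye(len(V), dtype=int)

V = np.array([+1, -1, -1, +1])
W = store_pattern(V)
print(W)
# [[ 0 -1 -1  1]
#  [-1  0  1 -1]
#  [-1  1  0 -1]
#  [ 1 -1 -1  0]]
```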
Update rule (dynamics):
  V_i = f(x_i) ,    where   x_i = Σ_j w_ij · V_j

  f(x) = +1 if x > 0 ,   −1 if x < 0      (a step activation function)
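The update rule is just as short. A sketch continuing the same illustration (the convention that x = 0 maps to −1 is an assumption; the notes leave that case undefined):

```python
import numpy as np

def step(x):
    """Step activation: +1 if x > 0, -1 if x < 0 (x = 0 mapped to -1 here by convention)."""
    return np.where(x > 0, 1, -1)

def update(W, V):
    """One synchronous sweep: V_i <- f( sum_j w_ij V_j ) for every unit at once."""
    return step(W @ V)

# e.g. update(W, V) applies one step of the dynamics to a state vector V
```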

Imagine your state vector V represents a pattern (some data) that you have stored in memory.
Now initialize the network with a slightly different state vector V’ (corrupted data)

" +1 % # +1 &
$ ' % (
−1 −1
V=$ ' V! = % (
$ −1 ' % −1 (
$ +1 ' % −1 (
# & $ '
With this new state vector V’ and the weight matrix W ‘learnt’ from the stored pattern,

                                    (  0  −1  −1  +1 )
  V’ = (+1, −1, −1, −1)^T      W =  ( −1   0  +1  −1 )
                                    ( −1  +1   0  −1 )
                                    ( +1  −1  −1   0 )

the new state variables are V’’ = f(W · V’), or component by component:

  V’’_a = f(w_ab·V’_b + w_ac·V’_c + w_ad·V’_d) = f(+1 +1 −1) = +1
  V’’_b = f(w_ba·V’_a + w_bc·V’_c + w_bd·V’_d) = f(−1 −1 +1) = −1
  V’’_c = f(w_ca·V’_a + w_cb·V’_b + w_cd·V’_d) = f(−1 −1 +1) = −1
  V’’_d = f(w_da·V’_a + w_db·V’_b + w_dc·V’_c) = f(+1 +1 +1) = +1

We recover the original (uncorrupted) state vector: V’’ = V.
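Putting the storing and update rules together reproduces this worked example numerically (again only an illustrative sketch):

```python
import numpy as np

V = np.array([+1, -1, -1, +1])              # stored pattern
W = np.outer(V, V) - np.eye(4, dtype=int)   # storing rule W = V V^T - I

V_corrupt = np.array([+1, -1, -1, -1])      # last unit flipped

# one synchronous update: V'' = f(W V')
x = W @ V_corrupt                           # inputs x_i = sum_j w_ij V'_j  -> [1, -1, -1, 3]
V_new = np.where(x > 0, 1, -1)

print(V_new)                                # [ 1 -1 -1  1]  -> the stored pattern
print(np.array_equal(V_new, V))             # True
```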


How does it work? Compare to the iteration of the original stored vector:

  f(W · V) = ( f(+1+1+1), f(−1−1−1), f(−1−1−1), f(+1+1+1) )^T = V ,    since   W · V = (N − 1) · V

so flipping one of the signs in the input will not change the net sign of the argument of the
activation function.
Dynamical intuition:

With the choice of weights w_ij = V_i · V_j to store a state vector V,
the stored V is a stable attractor (the same state is recovered over and over again using
the update rule). The corrupted state, which is a small perturbation of this stable
attractor, will converge to this attractor after one or few iterations of the update rule.

Proof:  V^t = f(W · V^{t−1})

but  W · V = (V · V^T − I) · V = V · (V^T · V) − V = (V^T · V − 1) · V = (N − 1) · V

where N is the number of units of the network.
Note that f((N − 1) · V) = V, and thus V^t = V^{t−1}.
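A quick numerical check of the identity W · V = (N − 1) · V for the 4-unit example (illustrative sketch):

```python
import numpy as np

V = np.array([+1, -1, -1, +1])
N = len(V)
W = np.outer(V, V) - np.eye(N, dtype=int)

print(W @ V)                                  # [ 3 -3 -3  3]  ==  (N-1) * V
print(np.array_equal(W @ V, (N - 1) * V))     # True
```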

This clever choice of the weights is analogous to Hebb’s rule for learning based on
long-term synaptic plasticity: neurons that fire together, wire together (repeated stimulation
of neuron B by neuron A increases the synaptic strength between B and A).

In this Hopfield model, two units that have the same activity have positive
connections, while those with opposite activity have negative connections.
Physical intuition: the ‘energy’ function

Define an ‘energy’ function that depends on the state vector and on the weight matrix of
connections as:

  E = −(1/2) Σ_i Σ_j w_ij · V_i · V_j = −(1/2) V^T · W · V = −(1/2) V^T · x

For a perturbation of the stored state, this function will decrease monotonically as the
update rule is iterated in time.

Proof:
If the update rule flips the sign of one unit i at time t, define ΔV_i = (V_i^t − V_i^{t−1})/2, so that ΔV_i = ±1.
Because w_ii = 0, the input x_i does not depend on V_i, and the change in ‘energy’ is:

  ΔE = −2 ΔV_i · Σ_{j≠i} w_ij · V_j = −2 ΔV_i · x_i    (where x_i is the input of the activation function for unit i)

Now, if V_i changes from negative (−1) to positive (+1), the update rule implies

  x_i > 0 and ΔV_i = +1  ⇒  ΔE < 0

and if V_i changes from positive (+1) to negative (−1),

  x_i < 0 and ΔV_i = −1  ⇒  ΔE < 0

The energy function will not decrease any further (ΔE = 0) when V is in the stored state (stable
attractor), which will be a local minimum of the energy function, a state of least energy.

If we are already on the attractor, V_i^t = V_i^{t−1} for every unit i, and therefore ΔE = −2 ΔV_i · x_i = 0, since ΔV_i = 0.

Moreover, from the definition of the energy function, when V is an attractor stored in the weight matrix W:

  E = −(1/2) V^T · W · V = −(1/2) V^T · (V · V^T − I) · V
    = −(1/2) (V^T · V · V^T · V − V^T · V)
    = −(1/2) ((V^T · V)² − V^T · V) = −(1/2) (N² − N) = −N(N − 1)/2

which shows that the energy is negative and has the lowest possible value
(note that the matrix W has N(N − 1)/2 independent non-zero elements, its lower/upper triangle,
and each contributes −1 to the energy at the attractor).
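A small numerical check of these energy statements, using the same illustrative conventions as the earlier sketches: the corrupted state starts at higher energy, a single-unit flip driven by the update rule never increases E, and the stored pattern sits at E = −N(N−1)/2:

```python
import numpy as np

def energy(W, V):
    """E = -1/2 V^T W V."""
    return -0.5 * V @ W @ V

V = np.array([+1, -1, -1, +1])
N = len(V)
W = np.outer(V, V) - np.eye(N, dtype=int)

V_corrupt = np.array([+1, -1, -1, -1])
print(energy(W, V_corrupt))    # 0.0
print(energy(W, V))            # -6.0  ==  -N*(N-1)/2

# asynchronous dynamics: update one unit at a time; E never increases
state = V_corrupt.copy()
for i in range(N):
    E_before = energy(W, state)
    x_i = W[i] @ state                     # input to unit i
    state[i] = 1 if x_i > 0 else -1        # step activation (x_i = 0 never occurs here)
    assert energy(W, state) <= E_before    # Delta E <= 0
print(state)                               # [ 1 -1 -1  1]  -> back at the attractor
```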

So the initial stored state is a stable configuration of the network activity (a minimum in an
energy landscape).

Hopfield networks can act as ‘content-addressable memories’ (a pattern kept in
memory can be recovered at a later time) → long-term memory.

Multiple patterns can be stored in a Hopfield network with a fixed W.

Assume you want to store P different patterns (state vectors) in a single network:

  V^ν ,  ν = 1, ..., P

Define a set of weight matrices W^ν with entries W^ν_ij = V^ν_i · V^ν_j and sum them,
component by component, to calculate a final weight matrix (keeping the diagonal at zero, as before):

  W_ij = Σ_ν V^ν_i · V^ν_j

The network will converge to the attractor that is ‘closest’ to the initial state (see the sketch below).
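A sketch of this multi-pattern storing rule and of the ‘nearest attractor’ behaviour (the two stored patterns below are made up for the example; the helper recover is not from the notes):

```python
import numpy as np

# two illustrative patterns to store (P = 2, N = 6)
patterns = np.array([[+1, -1, -1, +1, +1, -1],
                     [-1, -1, +1, +1, -1, +1]])
N = patterns.shape[1]

# summed outer products, with the diagonal kept at zero
W = sum(np.outer(p, p) for p in patterns) - len(patterns) * np.eye(N, dtype=int)

def recover(W, V, n_iter=5):
    """Iterate the synchronous update a few times."""
    for _ in range(n_iter):
        V = np.where(W @ V > 0, 1, -1)
    return V

probe = np.array([+1, -1, +1, +1, +1, -1])               # pattern 0 with one unit flipped
print(np.array_equal(recover(W, probe), patterns[0]))    # True: converges to the closest pattern
```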
Are Hopfield networks really useful or just a proof of concept?

40 years after Hopfield: modern Hopfield networks

A new frontier for Hopfield networks
Dmitry Krotov (MIT-IBM Watson AI Lab, IBM Research, Cambridge, MA, USA; e-mail: krotov@ibm.com)
Comment, Nature Reviews Physics, https://doi.org/10.1038/s42254-023-00595-y

Over the past few years there has been a resurgence of interest in Hopfield networks of associative memory. Dmitry Krotov discusses the recent theoretical advances and their broader impact in the context of energy-based neural architectures.

Hopfield networks of associative memory, famously formalized by John Hopfield more than forty years ago [1,2], had an enormous impact on multiple disciplines including statistical physics, neuroscience, and machine learning. The core idea, that the associative memory capability can be described by the energy descent dynamics (Fig. 1), found numerous applications in the study of complex systems with rugged landscapes. This idea also inspired the development of restricted Boltzmann machines, which were instrumental in the early days of deep learning. In cognitive and neurosciences, Hopfield networks established a conceptual formalization of the notion of memory.

Traditional Hopfield networks

Mathematically, the state of the Hopfield network is described by an N_f-dimensional vector of features x, which can be either binary or continuous. The temporal evolution of this state vector is governed by an energy function, which has local minima located at a set of K_mem memory vectors ξ^µ representing the patterns that the network stores in its weights (each pattern is an N_f-dimensional vector). When presented with an initial prompt that resembles one of the memory vectors, the energy descent dynamics finds the most similar memory vector based on […] the energy landscape changes in such a way that it acquires many additional local minima that have nothing to do with the memory patterns that we are trying to store (Fig. 1a). These additional local minima are closely related to spin glass states, commonly studied in statistical physics of disordered systems. Their presence is undesirable for a proper functioning of the Hopfield network as an associative memory system. This linear scaling relationship between the dimension of the feature space and the memory storage capacity presents a problem from the perspective of machine learning applications, in which the dimension of the feature space is given by the specific data science task of interest. At the same time, the amount of patterns that the network needs to operate with is large, possibly much larger than the dimension of the feature space.

Modern Hopfield networks

In 2016, we realized that it is possible to overcome the linear scaling problem between the number of features and memory storage capacity by introducing a rapidly growing activation function [4]. The model, dubbed ‘dense associative memory’, is characterized by an energy function that includes higher than quadratic interactions between the features (Fig. 1b). The key difference between dense associative memory and the traditional Hopfield network is the presence of the rapidly growing activation function F(·). When this function is quadratic, dense associative memory reduces to the traditional Hopfield network. However, when the activation function grows more rapidly, as the state vector approaches one of the memory patterns, the network can achieve a super-linear memory storage capacity. This makes it an attractive tool for machine learning applications.

This idea has been further extended in 2017 [5] by showing that a careful choice of the activation function can even lead to an exponential memory storage capacity. Importantly, the study [5] also demonstrated that dense associative memory, like the traditional Hopfield network, has large basins of attraction of size O(N_f). This means that the new model continues to benefit from strong associative properties despite the dense packing of memories inside the feature space. Following Refs. 4 and 5, it became clear that the Hopfield network is not just one specific model, but rather a family of many models that can be classified into universality classes based on the asymptotic behaviour of the activation function F(·). Most of these new Hopfield networks have super-linear memory storage capacity and exhibit strong associative properties. Additionally, dense associative memories are not restricted to binary variables and can work with continuous variables too. For this reason, they can be smoothly integrated into deep learning architectures and trained with the back-propagation algorithm in an end-to-end fashion.

Hopfield networks meet transformers

In 2020, it was noticed [6] that if one picks the activation function F(·) […] Transformers operate on a set of tokens, where each token represents a word in a sentence or a patch of an image, depending on the data domain. Since words in […] pre-trained transformer (ChatGPT), and others. Despite being one of the most popular deep learning models, transformers are typically designed through trial and error, and the theoretical principles behind their computational strategies remain mysterious. In contrast, Hopfield networks have a well-established record of theoretical methods, but have yet to demonstrate truly impressive empirical results in large-scale machine learning systems. The correspondence between dense associative memories and transformer’s attention is interesting for two reasons. First, it enables investigation of pre-trained transformer models from the perspective of energy, basins of attraction, and other theoretical concepts commonly used in statistical physics. Second, it opens up the possibility of searching for new transformer-like architectures that are fundamentally designed around the idea of associative memory. A study reported in a recent preprint [8] pursued this second idea and proposed replacing a sequence of conventional transformer layers with a single dense associative memory.

The broader landscape

What insights can we gain from these recent theoretical advances? First, the longstanding problem of the coupling between memory capacity and feature space dimension in Hopfield networks has been resolved. In the practically relevant regime, the memory capacity of the new Hopfield networks is restricted by the number of hidden neurons, rather than the feature space dimension [4,9]. Second, these networks can function in both discrete and continuous spaces. The continuous versions of dense associative memories [4,6,9] are particularly appropriate for use as part of other end-to-end deep learning architectures. Third, it is possible to include commonly used inductive biases such as convolutions, attention, and pooling in the general Hopfield-like framework. Lastly, large-memory-capacity Hopfield networks may be viewed as part of a broader class of energy-based models frequently discussed in the AI community [10] and can act as a source of inspiration for new energy-based architectures rooted in associative memory ideas.

Fig. 1 | The comparative energy landscapes of the traditional and modern Hopfield networks (both panels plot the energy E against the feature coordinate x). a, Traditional Hopfield network, E = −Σ_{µ=1..K_mem} (ξ^µ · x)². When the number of stored memories significantly exceeds the number of feature neurons, the traditional Hopfield network acquires rugged spin glass local minima that are uncorrelated with the memory vectors (one such minimum is shown in panel a). b, Dense associative memory, E = −Σ_{µ=1..K_mem} F(ξ^µ · x). In the dense associative memory model this spin glass transition happens at a much larger number of memories. Thus, even in situations when the number of memories is significantly larger than the number of feature neurons, each memory has a large basin of attraction around it, and there are no spin glass local minima. The memory patterns (vectors ξ^µ) are indexed by µ (going from 1 to the number of memory patterns K_mem) and each pattern is an N_f-dimensional vector. For continuous variables, the feature vector x needs to additionally pass through a bounded activation function, such as sigmoid or layer normalization, to ensure that the energy E is bounded from below.

References

1. Hopfield, J. J. Neural networks and physical systems with emergent collective computational abilities. PNAS 79, 2554–2558 (1982).
2. Hopfield, J. J. Neurons with graded response have collective computational properties like those of two-state neurons. PNAS 81, 3088–3092 (1984).

See also: ‘Hopfield Networks is All You Need’, Johannes Brandstetter (Amsterdam Machine Learning Lab / Institute for Machine Learning, Linz), MIT-IBM Watson AI Lab seminar series ‘Deep Learning: Classics and Trends’.
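To tie the excerpt back to the formulas in Fig. 1, here is a minimal sketch (not from the article; the patterns and the polynomial choice F(z) = z⁴ are purely illustrative) of the dense associative memory energy E = −Σ_µ F(ξ^µ · x), which reduces to the traditional quadratic energy for F(z) = z²:

```python
import numpy as np

def dense_energy(memories, x, F=lambda z: z**2):
    """Dense associative memory energy  E = -sum_mu F(xi_mu . x).
    F(z) = z**2 recovers the traditional (quadratic) Hopfield energy;
    a faster-growing F (e.g. z**4) gives a larger memory storage capacity."""
    return -np.sum(F(memories @ x))

# illustrative binary memory patterns xi_mu and a query state x
memories = np.array([[+1, -1, +1, -1],
                     [-1, -1, +1, +1]])
x = np.array([+1, -1, +1, +1])

print(dense_energy(memories, x))                      # quadratic case, F(z) = z^2
print(dense_energy(memories, x, F=lambda z: z**4))    # higher-order interactions
```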