
Deep Learning

Chapter 2

Connectionism

Fernando Perez-Cruz
based on Thomas Hofmann’s Slides

23rd September 2021

Section 1

McCulloch-Pitts (1943)

McCulloch & Pitts

Because of the "all-or-none" character of nervous activity, neural events and the relations among them can be treated by means of propositional logic. – McCulloch & Pitts, 1943
McCulloch-Pitts Neuron

def →  f(x; σ, θ) = 1 if ∑_{i=1}^{n} σ_i x_i ≥ θ, and 0 otherwise    (M-P neuron)

x ∈ {0, 1}^n,  σ ∈ {−1, +1}^n,  θ ∈ Z
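
A minimal Python sketch of the M-P unit; the function name mp_neuron is an illustrative choice, and the AND/NAND parameter settings anticipate the table on the next slide.

```python
def mp_neuron(x, sigma, theta):
    """McCulloch-Pitts unit: fires (returns 1) iff the signed input sum reaches the threshold."""
    return 1 if sum(s * xi for s, xi in zip(sigma, x)) >= theta else 0

# AND and NAND as on the next slide
AND = lambda x1, x2: mp_neuron((x1, x2), (+1, +1), theta=2)
NAND = lambda x1, x2: mp_neuron((x1, x2), (-1, -1), theta=-1)

assert [AND(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]] == [0, 0, 0, 1]
assert [NAND(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]] == [1, 1, 1, 0]
```
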
M-P Neuron: AND and NAND

         AND                                 NAND
         σ1 = σ2 = +1, θ = 2                 σ1 = σ2 = −1, θ = −1

 x1  x2  σ1 x1 + σ2 x2    f                  σ1 x1 + σ2 x2    f
  0   0        0          0                         0         1
  0   1        1          0                        −1         1
  1   0        1          0                        −1         1
  1   1        2          1                        −2         0
M-P Neuron: Sums and DNFs – Example

Assume σ = (1, 1, 1, −1) and θ = 2.

DNF (reduced) of the logical proposition implemented by this M-P neuron:

f(x) = (x1 ∧ x2 ∧ x3) ∨ (x1 ∧ x2 ∧ x̄4) ∨ (x1 ∧ x3 ∧ x̄4) ∨ (x2 ∧ x3 ∧ x̄4)
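
A small sketch that checks the reduced DNF above against the threshold definition by enumerating all 2^4 inputs (function names are illustrative):

```python
from itertools import product

sigma, theta = (1, 1, 1, -1), 2

def mp(x):
    """Threshold form of the M-P neuron."""
    return sum(s * xi for s, xi in zip(sigma, x)) >= theta

def dnf(x):
    """Reduced DNF from this slide."""
    x1, x2, x3, x4 = x
    return bool((x1 and x2 and x3)
                or (not x4 and ((x1 and x2) or (x1 and x3) or (x2 and x3))))

assert all(mp(x) == dnf(x) for x in product((0, 1), repeat=4))
```
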
M-P Neuron: Sums and DNFs – General

 
f(x; σ, θ) = ⋁_{I ∈ 𝓘} ( ⋀_{i ∈ I} x_i ∧ ⋀_{i ∉ I} x̄_i ),   where 𝓘 = { I ⊆ {1, …, n} : ∑_{i ∈ I} σ_i ≥ θ }
Section 2

Hebb (1949)

Hebbian Learning

- One of the very first learning rules...

- Hebb: Neurons that fire together, wire together
- Caricature of a naïve Hebb rule:

  def →  Δθ_ij^t ∝ x_i^t x_j^t

  neurons x_i ∈ {0, 1} interconnected with weights θ_ij ∈ R

- Covariance learning rule

  def →  Δθ_ij^t = (x_i^t − x̄_i)(x_j^t − x̄_j)

  a neural mechanism to extract correlations
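
A minimal numpy sketch of the two update rules; the learning rate eta in the naive rule and the function names are added assumptions (the slide only states proportionality).

```python
import numpy as np

def hebb_step(theta, x, eta=0.1):
    """Naive Hebb rule: co-active units (x_i = x_j = 1) strengthen their connection."""
    return theta + eta * np.outer(x, x)

def covariance_step(theta, x, x_mean):
    """Covariance rule: correlate fluctuations of activity around the mean."""
    d = x - x_mean
    return theta + np.outer(d, d)

# usage: presenting the same binary pattern repeatedly keeps strengthening the same weights
x = np.array([1, 0, 1, 0])
theta = np.zeros((4, 4))
for _ in range(5):
    theta = hebb_step(theta, x)
```
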
Minsky’s Hebbian Learning Machine: SNARC

- Marvin Minsky: first learning machine with 40 neurons

- Maze solving: correlation of activities with success of mission
- (sort of) Hebbian reinforcement learning

It had three hundred tubes and a lot of motors. It needed some automatic electric clutches, which we machined ourselves. The memory of the machine was stored in the positions of its control knobs – forty of them – and when the machine was learning it used the clutches to adjust its own knobs. – Marvin Minsky
Minsky’s Neuron

Section 3

Perceptron (1958+)

Perceptron
Perceptron unit: inputs x ∈ R^n; weights θ ∈ R^n

def →  (x, θ) ↦ sign(x · θ) = +1 if ∑_i x_i θ_i ≥ 0, and −1 else    (sign unit)

Each x is assigned a binary label y = ±1. Perceptron update (on mistake):

def →  Δθ = 0 if y sign(x · θ) ≥ 0, and Δθ = y x otherwise    (perceptron rule)

- The update direction is always x, with the orientation given by the target sign.
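
A minimal sketch of the perceptron rule stated above; the training loop, epoch cap, and function name are added assumptions.

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Cycle through the data; on each mistake add y * x to the weights."""
    theta = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x, label in zip(X, y):
            pred = 1.0 if x @ theta >= 0 else -1.0   # the sign unit defined above
            if label * pred < 0:                     # mistake: update in direction y * x
                theta += label * x
                mistakes += 1
        if mistakes == 0:                            # no mistakes in a full pass: done
            break
    return theta
```
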
Taking Sharp Turns
Imagine you can make unit-length steps Δ_t. How far can you travel in t steps? A distance of t by walking in a straight line. But what if you are required to make each step at an angle ≥ π/2 relative to the displacement so far?

This greatly reduces the distance you can travel: to at most √t.
Perceptron Iterate Trajectory

Perceptron iterates zig-zag a lot:¹

[Figure: perceptron iterates θ0, θ1, θ2, θ3, θ4, θ5 zig-zagging from one update to the next]

¹ Python notebook: https://colab.research.google.com/drive/1fCQa7UGZn5pj5OPoA8IcKODzbohvbgwi?usp=sharing, c/o Antonio Orvieto
Perceptron: Norm Growth

Lemma. Let (x^t, y^t) be a sequence of perceptron mistakes, inducing updates Δθ^t, with θ^s = ∑_{t=1}^{s} Δθ^t. Then

‖θ^s‖² ≤ ∑_{t=1}^{s} ‖x^t‖²

Proof: by induction. ‖θ^0‖ = 0, and

‖θ^{t+1}‖² = ‖θ^t + y^t x^t‖² = ‖θ^t‖² + 2 y^t x^t · θ^t + ‖x^t‖² ≤ ‖θ^t‖² + ‖x^t‖²,

where the inequality holds because 2 y^t x^t · θ^t ≤ 0 on a mistake.

Corollary. If ‖x^t‖ ≤ 1 (∀t), then ‖Δθ^t‖ ≤ 1 and ‖θ^s‖ ≤ √s.
Perceptron: Linear (γ-)Separability

Definition. A training set S is linearly separable with margin γ > 0, or γ-separable for short, if

def →  ∃ θ*, ‖θ*‖ = 1 :  y x · θ* ≥ γ > 0   (∀(x, y) ∈ S) .
Perceptron: Convergence Theorem

Novikoff's Theorem. The perceptron converges in at most ⌊γ⁻²⌋ update steps on any γ-separable sample (with ‖x‖ ≤ 1 for all inputs, so that the corollary above applies).

Proof. Number the mistakes in order and denote by Δθ^t the update on the t-th mistake. Let θ* be unit length and γ-separating. Then:

θ^s · θ*  =(1)  ∑_{t=1}^{s} Δθ^t · θ*  =(2)  ∑_{t=1}^{s} (y^t x^t) · θ*  ≥(3)  γ s

θ^s · θ*  ≤(4)  ‖θ^s‖ ‖θ*‖  =(5)  ‖θ^s‖  ≤(6)  √s

Where (1): θ^0 = 0. (2): Δθ^t = y^t x^t. (3): y^t x^t · θ* ≥ γ. (4): Cauchy-Schwarz inequality. (5): ‖θ*‖ = 1. (6): Corollary above.

The claim follows as γ s ≤ √s forces s ≤ γ⁻², so any s ≥ ⌊γ⁻²⌋ + 1 leads to a contradiction.
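
A small simulation sketch of the bound (dimension, margin, sample size and the epoch cap are illustrative choices): sample points with ‖x‖ = 1 and margin at least γ with respect to a random unit-length θ*, run the perceptron, and compare the mistake count with ⌊γ⁻²⌋.

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 20, 0.25

theta_star = rng.normal(size=n)
theta_star /= np.linalg.norm(theta_star)           # unit-length separating direction

X, y = [], []                                      # gamma-separable sample with ||x|| <= 1
while len(X) < 500:
    x = rng.normal(size=n)
    x /= np.linalg.norm(x)
    if abs(x @ theta_star) >= gamma:               # keep only points with margin >= gamma
        X.append(x)
        y.append(np.sign(x @ theta_star))

theta, mistakes = np.zeros(n), 0
for _ in range(1000):                              # cycle until a mistake-free pass
    errors = 0
    for x, label in zip(X, y):
        if label * (1.0 if x @ theta >= 0 else -1.0) < 0:
            theta += label * x
            mistakes += 1
            errors += 1
    if errors == 0:
        break

print(mistakes, "<=", int(gamma ** -2))            # Novikoff: at most floor(gamma^-2) mistakes
```
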
Perceptron: Rosenblatt’s View

Given an elementary perceptron, a stimulus world W, and any classification C(W) for which a solution exists; let all stimuli in W occur in any sequence, provided that each stimulus must reoccur in finite time; then beginning from an arbitrary initial state, an error correction procedure will always yield a solution to C(W) in finite time. – Rosenblatt, 1962
Perceptron: XOR Depression

           
def →  { ((0, 0), +1), ((1, 1), +1), ((0, 1), −1), ((1, 0), −1) }    (XOR problem)
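
An illustrative brute-force check (not a proof) that no linear threshold unit, even with a bias term added, realizes this labelling; the grid range and resolution are arbitrary choices.

```python
import numpy as np

X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]])
y = np.array([1, 1, -1, -1])

grid = np.linspace(-2, 2, 41)
separable = any(
    all((1 if w1 * x1 + w2 * x2 + b >= 0 else -1) == label
        for (x1, x2), label in zip(X, y))
    for w1 in grid for w2 in grid for b in grid
)
print(separable)   # False: the XOR labelling is not a linear dichotomy
```
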
Linear Dichotomies

Assume s points in R^n in general position are given.

Definition

def →  C(s, n) = min_{|S| = s} |{ y ∈ {−1, 1}^s : ∃ θ : y_i x_i · θ ≥ 0 (∀i) }|

Counts how many dichotomies are possible with linear separators.
Cover’s Theorem

Theorem (Cover 1965).

⇒  C(s, n) = 2 ∑_{i=0}^{n−1} \binom{s−1}{i}

Important recurrence relation:

C(s + 1, n) = C(s, n) + C(s, n − 1)
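
A short sketch that cross-checks the closed form against the recurrence and evaluates C(s, n) / 2^s at s = 2n (function names are illustrative):

```python
from functools import lru_cache
from math import comb

def C_closed(s, n):
    """Cover's closed form: 2 * sum_{i=0}^{n-1} binom(s-1, i)."""
    return 2 * sum(comb(s - 1, i) for i in range(n))

@lru_cache(maxsize=None)
def C_rec(s, n):
    """Same count via C(s+1, n) = C(s, n) + C(s, n-1)."""
    if n == 0:
        return 0
    if s == 1:
        return 2
    return C_rec(s - 1, n) + C_rec(s - 1, n - 1)

assert all(C_closed(s, n) == C_rec(s, n) for s in range(1, 16) for n in range(11))
print(C_closed(20, 10) / 2 ** 20)   # 0.5: exactly half of all labelings are linear at s = 2n
```
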
Cover’s Theorem

With the recurrence established, the claim can be shown by induction using Pascal's identity (writing out the sums without the common factor of 2):

∑_{i=0}^{n−1} \binom{s−1}{i} + ∑_{i=0}^{n−2} \binom{s−1}{i}

= ∑_{i=1}^{n−1} ( \binom{s−1}{i} + \binom{s−1}{i−1} ) + \binom{s−1}{0}

= ∑_{i=1}^{n−1} \binom{s}{i} + 1 = ∑_{i=0}^{n−1} \binom{s}{i}
The (“Messy”) Intermediate Regime




C(s, n) / 2^s  =  1               if s ≤ n
                  1 − O(e^{−n})   if n < s < 2n
                  1/2             if s = 2n
                  O(e^{−n})       otherwise

In plain English:
- All dichotomies are linear, as long as s ≤ n.
- For n < s < 2n a vanishingly small (as n → ∞) fraction of dichotomies is not linearly realizable.
- For s > 2n almost all dichotomies are not linearly realizable.
Section 4

Willshaw Memory (1969)

Making (Associative) Memories

- What are simple learning mechanisms to make memories?

- Associative memory: linked to stimuli and each other
- Auto-associative memory: pattern completion, robustness
- Formally: key-value pairs (x^t, y^t) ∈ S
- Learn to map x^t → y^t, ∀t
Sparse Binary Patterns

- r-sparse Boolean vectors (keep things simple...)

  def →  B_r^n = { x ∈ {0, 1}^n : ∑_{i=1}^{n} x_i = r }

- Binary memory matrix

  θ ∈ {0, 1}^{n×n},   n² bits
Information & Capacity

- (Random) pattern information:

  lg \binom{n}{r} ≈ r lg n

- Memory capacity: n² bits

- Upper bound on s:

  s ≤ n² / (r lg n)
Willshaw Memory: Learning

- Hebbian rule (Willshaw memory) for s pairs

  def →  Θ_ji = min{ 1, ∑_{t=1}^{s} y_j^t x_i^t }.

  Meaning: Θ_ji = 1 ⇐⇒ ∃t : x_i^t = 1 ∧ y_j^t = 1.

- Equivalently:

  def →  Θ^t = y^t (x^t)^⊤,   Θ = min{ 1, ∑_{t=1}^{s} Θ^t }

  1: matrix of all ones; min applied elementwise.
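
A minimal numpy sketch of the clipped-Hebbian storage rule (the helper name willshaw_store is an illustrative choice):

```python
import numpy as np

def willshaw_store(pairs, n):
    """Theta_ji = min(1, sum_t y_j^t x_i^t): OR together the outer products y^t (x^t)^T."""
    theta = np.zeros((n, n), dtype=int)
    for x, y in pairs:
        theta = np.maximum(theta, np.outer(y, x))   # clipping at 1 == elementwise maximum
    return theta
```
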
Willshaw Memory: Retrieval

Retrieving y for a given x proceeds as follows:

def →  x ↦ z ↦ y,   z = Θ x,   y_j = 0 if z_j < r, and 1 otherwise

This rule is motivated by the fact that

Θ^t x^t = (x^t · x^t) y^t = r y^t
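
A self-contained sketch of the retrieval rule on two hand-made 2-sparse patterns (helper names and toy patterns are illustrative):

```python
import numpy as np

def willshaw_retrieve(theta, x, r):
    """y_j = 1 iff (Theta x)_j >= r."""
    return (theta @ x >= r).astype(int)

n, r = 6, 2
x1, y1 = np.array([1, 1, 0, 0, 0, 0]), np.array([0, 0, 1, 1, 0, 0])
x2, y2 = np.array([0, 0, 0, 1, 1, 0]), np.array([1, 0, 0, 0, 0, 1])

theta = np.zeros((n, n), dtype=int)
for x, y in [(x1, y1), (x2, y2)]:
    theta = np.maximum(theta, np.outer(y, x))       # storage rule from the previous slide

print(willshaw_retrieve(theta, x1, r))              # recovers y1; extra 1s are possible, missing 1s are not
```
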
Willshaw Memory: Example

Willshaw Memory: Monotonicity

Possible errors in retrieval can only be attributed to additional 1s in the output. Formally:

⇒  x^t ↦ y ≥ y^t,   ∀(x^t, y^t) ∈ S .

1. Note that for any t: Θ = min{1, ∑_τ Θ^τ} ≥ min{1, Θ^t} = Θ^t
2. Hence Θ x ≥ Θ^t x for any x ∈ B_r^n, as x ≥ 0.
   Specifically: z^t = Θ x^t ≥ Θ^t x^t
3. Note that

   Θ^t x^t = (x^t · x^t) y^t = r y^t

4. So z^t ≥ r y^t, which implies the claim.
Willshaw Memory: Errors?

You may get more than you asked for :)
Pattern-per-Pattern Storage

- How can one store sparse memories incrementally? How would one know that the memory capacity is reached?

- Incremental storage (see the sketch below):

  θ_ji^t = max{ θ_ji^{t−1}, y_j^t x_i^t }

- Guided by the analysis (to follow): monitor the number of non-zero entries ∑_{i,j} θ_ji^t. Stay well below n²/2; otherwise the memory is full.
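
A minimal sketch of pattern-per-pattern storage with a fill-level monitor; the helper names are illustrative and the 1/2 warning threshold is the heuristic stated above.

```python
import numpy as np

def store_one(theta, x, y):
    """Incremental Willshaw update: theta_ji <- max(theta_ji, y_j * x_i), in place."""
    np.maximum(theta, np.outer(y, x), out=theta)
    return theta

def fill_fraction(theta):
    """Fraction of non-zero entries; stop storing well before this reaches 1/2."""
    return theta.mean()
```

Calling store_one for each new pair and checking fill_fraction(theta) < 1/2 implements the stopping heuristic above.
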
Analysis: Step 1
How many bits in Θ are turned on (on average) after storing s random patterns?

⇒  q := P{Θ_ji = 0} = ( 1 − (r/n)² )^s .

1. For x, y ∈ B_r^n: y x^⊤ has exactly r² non-zero entries.
2. Probability that a given matrix entry is not among them: 1 − r²/n²
3. Θ_ji = 0 iff the entry was affected in none of the s memories

Taylor approximation:

⇒  ln q = s ln( 1 − (r/n)² ) ≈ −s (r/n)²
Analysis: Step 2

Storage filled with s random pairs (x, y) ∈ B_r^n × B_r^n.
Probability that, for a random x ∈ B_r^n, x ↦ y_j = 1?

⇒  p := P{y_j = 1} = (1 − q)^r

1. We can assume x = (1, …, 1, 0, …, 0), with the first r entries equal to 1.
2. P{Θ_ji = 1} = (1 − q) for any i (and j)
3. P{ ⋀_{i=1}^{r} (Θ_ji = 1) } = (1 − q)^r because of independence
Analysis: Step 3

Probability that on input x^t at least one additional bit is turned on:

⇒  P{error} := P{ x^t ↦ y > y^t } ≲ n (1 − q)^r

1. As there are (n − r) bits, each of which can be corrupted:

   P{error} = p + (1 − p) p + (1 − p)² p + · · · + (1 − p)^{n−r−1} p

2. For p ≪ 1, r ≪ n this can be upper-bounded close to equality:

   P{error} ≲ (n − r) p ≲ n p = n (1 − q)^r
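
A simulation sketch that compares the empirical fill level and retrieval-error rate with the step-1 and step-3 formulas; n, r, s and the seed are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, s = 512, 9, 1500                              # r on the order of lg n

def sparse_pattern():
    x = np.zeros(n, dtype=int)
    x[rng.choice(n, size=r, replace=False)] = 1
    return x

pairs = [(sparse_pattern(), sparse_pattern()) for _ in range(s)]
theta = np.zeros((n, n), dtype=int)
for x, y in pairs:
    np.maximum(theta, np.outer(y, x), out=theta)

q_emp = 1.0 - theta.mean()                          # empirical P{Theta_ji = 0}
q_th = (1 - (r / n) ** 2) ** s                      # step-1 prediction
err_emp = np.mean([np.any((theta @ x >= r).astype(int) > y) for x, y in pairs])
err_th = n * (1 - q_th) ** r                        # step-3 bound n (1 - q)^r
print(q_emp, q_th, err_emp, err_th)
```
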
Analysis: Step 4

What is the critical choice of sparseness r*?

⇒  r*(n, q) = α* lg n,   α* = 1 / (− lg(1 − q))

Obtained (left as an exercise) by requiring

P{error} ≈ n (1 − q)^{r*} = 1
Capacity

1. Information contained in a random pattern (x or y):

   ⇒  I(pattern) = lg \binom{n}{r*} ≈ r* lg n ≈ lg² n / (− lg(1 − q))

2. Total information in s patterns:

   ⇒  I(patterns) = s · I(pattern) ≈ (ln 2) n² ( lg q · lg(1 − q) )

3. Maximal capacity:

   ⇒  max_q I(patterns) = (ln 2) n² ≈ 0.693 n²

That is to say: even making best use of Θ, the memory capacity is suboptimal by a factor of ln 2 ≈ 0.693 relative to the n² raw bits.
Summary

We should encode patterns with sparseness r = lg n. If we do so, we will be able to store (in the limit) almost 69% of the information-theoretic maximum, with interference probability approaching 0 as n → ∞. At this point, half of the weights θ_ji will be 0 and half will be 1 (i.e. q = 1/2).
Was it worth it? :)

Quite possibly there is no system in the brain which corresponds exactly to the hypothetical neural network; but we do attach importance to the principle on which it works and the quantitative relations which we have shown must hold if such a system is to perform, as it can, with high efficiency. – Willshaw, Buneman, Longuet-Higgins, 1969.

The mathematical analysis strongly suggests that nature takes advantage of it!
