
Deep Learning

Chapter 2

Connectionism

Fernando Perez-Cruz
based on Thomas Hofmann’s Slides

23rd September 2021

Section 1

McCulloch-Pitts (1943)

McCulloch & Pitts

Because of the "all-or-none" character of nervous activity, neural events and the relations among them can be treated by means of propositional logic. – McCulloch & Pitts, 1943
McCulloch-Pitts Neuron

def →  f(x; σ, θ) = 1 if ∑_{i=1}^{n} σ_i x_i ≥ θ, and 0 otherwise    (M-P neuron)

x ∈ {0, 1}^n,  σ ∈ {−1, +1}^n,  θ ∈ Z
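
A minimal Python sketch of the M-P unit; the function name mp_neuron is an illustrative choice, and the AND/NAND parameter settings anticipate the table on the next slide.

```python
def mp_neuron(x, sigma, theta):
    """McCulloch-Pitts unit: fires (returns 1) iff the signed input sum reaches the threshold."""
    return 1 if sum(s * xi for s, xi in zip(sigma, x)) >= theta else 0

# AND and NAND as on the next slide
AND = lambda x1, x2: mp_neuron((x1, x2), (+1, +1), theta=2)
NAND = lambda x1, x2: mp_neuron((x1, x2), (-1, -1), theta=-1)

assert [AND(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]] == [0, 0, 0, 1]
assert [NAND(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]] == [1, 1, 1, 0]
```
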
M-P Neuron: AND and NAND

         AND                                 NAND
         σ1 = σ2 = +1, θ = 2                 σ1 = σ2 = −1, θ = −1

 x1  x2  σ1 x1 + σ2 x2    f                  σ1 x1 + σ2 x2    f
  0   0        0          0                         0         1
  0   1        1          0                        −1         1
  1   0        1          0                        −1         1
  1   1        2          1                        −2         0
M-P Neuron: Sums and DNFs – Example

Assume σ = (1, 1, 1, −1) and θ = 2.

DNF (reduced) of the logical proposition implemented by this M-P neuron:

f(x) = (x1 ∧ x2 ∧ x3) ∨ (x1 ∧ x2 ∧ x̄4) ∨ (x1 ∧ x3 ∧ x̄4) ∨ (x2 ∧ x3 ∧ x̄4)
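
A small sketch that checks the reduced DNF above against the threshold definition by enumerating all 2^4 inputs (function names are illustrative):

```python
from itertools import product

sigma, theta = (1, 1, 1, -1), 2

def mp(x):
    """Threshold form of the M-P neuron."""
    return sum(s * xi for s, xi in zip(sigma, x)) >= theta

def dnf(x):
    """Reduced DNF from this slide."""
    x1, x2, x3, x4 = x
    return bool((x1 and x2 and x3)
                or (not x4 and ((x1 and x2) or (x1 and x3) or (x2 and x3))))

assert all(mp(x) == dnf(x) for x in product((0, 1), repeat=4))
```
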
M-P Neuron: Sums and DNFs – General

 
f(x; σ, θ) = ⋁_{I ∈ 𝓘} ( ⋀_{i ∈ I} x_i ∧ ⋀_{i ∉ I} x̄_i ),   where 𝓘 = { I ⊆ {1, …, n} : ∑_{i ∈ I} σ_i ≥ θ }
Section 2

Hebb (1949)

Hebbian Learning

- One of the very first learning rules...

- Hebb: Neurons that fire together, wire together
- Caricature of a naïve Hebb rule:

  def →  Δθ_ij^t ∝ x_i^t x_j^t

  neurons x_i ∈ {0, 1} interconnected with weights θ_ij ∈ R

- Covariance learning rule

  def →  Δθ_ij^t = (x_i^t − x̄_i)(x_j^t − x̄_j)

  a neural mechanism to extract correlations
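
A minimal numpy sketch of the two update rules; the learning rate eta in the naive rule and the function names are added assumptions (the slide only states proportionality).

```python
import numpy as np

def hebb_step(theta, x, eta=0.1):
    """Naive Hebb rule: co-active units (x_i = x_j = 1) strengthen their connection."""
    return theta + eta * np.outer(x, x)

def covariance_step(theta, x, x_mean):
    """Covariance rule: correlate fluctuations of activity around the mean."""
    d = x - x_mean
    return theta + np.outer(d, d)

# usage: presenting the same binary pattern repeatedly keeps strengthening the same weights
x = np.array([1, 0, 1, 0])
theta = np.zeros((4, 4))
for _ in range(5):
    theta = hebb_step(theta, x)
```
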
Minsky’s Hebbian Learning Machine: SNARC

- Marvin Minsky: first learning machine with 40 neurons

- Maze solving: correlation of activities with success of mission
- (sort of) Hebbian reinforcement learning

It had three hundred tubes and a lot of motors. It needed some automatic electric clutches, which we machined ourselves. The memory of the machine was stored in the positions of its control knobs – forty of them – and when the machine was learning it used the clutches to adjust its own knobs. – Marvin Minsky
Minsky’s Neuron

Section 3

Perceptron (1958+)

Perceptron
Perceptron unit: inputs x ∈ R^n; weights θ ∈ R^n

def →  (x, θ) ↦ sign(x · θ) = +1 if ∑_i x_i θ_i ≥ 0, and −1 else    (sign unit)

Each x is assigned a binary label y = ±1. Perceptron update (on mistake):

def →  Δθ = 0 if y sign(x · θ) ≥ 0, and Δθ = y x otherwise    (perceptron rule)

- The update direction is always x, with the orientation given by the target sign.
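
A minimal sketch of the perceptron rule stated above; the training loop, epoch cap, and function name are added assumptions.

```python
import numpy as np

def perceptron(X, y, max_epochs=100):
    """Cycle through the data; on each mistake add y * x to the weights."""
    theta = np.zeros(X.shape[1])
    for _ in range(max_epochs):
        mistakes = 0
        for x, label in zip(X, y):
            pred = 1.0 if x @ theta >= 0 else -1.0   # the sign unit defined above
            if label * pred < 0:                     # mistake: update in direction y * x
                theta += label * x
                mistakes += 1
        if mistakes == 0:                            # no mistakes in a full pass: done
            break
    return theta
```
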
Taking Sharp Turns
Imagine you can make unit-length steps Δ_t. How far can you travel in t steps? A distance of t by walking in a straight line. But what if you are required to make each step at an angle ≥ π/2 relative to the displacement so far?

This greatly reduces the distance you can travel: to at most √t.
Perceptron Iterate Trajectory

Perceptron iterates zig-zag a lot:¹

[Figure: perceptron iterates θ0, θ1, θ2, θ3, θ4, θ5 zig-zagging from one update to the next]

¹ Python notebook: https://colab.research.google.com/drive/1fCQa7UGZn5pj5OPoA8IcKODzbohvbgwi?usp=sharing, c/o Antonio Orvieto
Perceptron: Norm Growth

Lemma. Let (x^t, y^t) be a sequence of perceptron mistakes, inducing updates Δθ^t, with θ^s = ∑_{t=1}^{s} Δθ^t. Then

‖θ^s‖² ≤ ∑_{t=1}^{s} ‖x^t‖²

Proof: by induction. ‖θ^0‖ = 0, and

‖θ^{t+1}‖² = ‖θ^t + y^t x^t‖² = ‖θ^t‖² + 2 y^t x^t · θ^t + ‖x^t‖² ≤ ‖θ^t‖² + ‖x^t‖²,

where the inequality holds because 2 y^t x^t · θ^t ≤ 0 on a mistake.

Corollary. If ‖x^t‖ ≤ 1 (∀t), then ‖Δθ^t‖ ≤ 1 and ‖θ^s‖ ≤ √s.
Perceptron: Linear (γ-)Separability

Definition. A training set S is linearly separable with margin γ > 0, or γ-separable for short, if

def →  ∃ θ*, ‖θ*‖ = 1 :  y x · θ* ≥ γ > 0   (∀(x, y) ∈ S) .
Perceptron: Convergence Theorem

Novikoff's Theorem. The perceptron converges in at most ⌊γ⁻²⌋ update steps on any γ-separable sample (with ‖x‖ ≤ 1 for all inputs, so that the corollary above applies).

Proof. Number the mistakes in order and denote by Δθ^t the update on the t-th mistake. Let θ* be unit length and γ-separating. Then:

θ^s · θ*  =(1)  ∑_{t=1}^{s} Δθ^t · θ*  =(2)  ∑_{t=1}^{s} (y^t x^t) · θ*  ≥(3)  γ s

θ^s · θ*  ≤(4)  ‖θ^s‖ ‖θ*‖  =(5)  ‖θ^s‖  ≤(6)  √s

Where (1): θ^0 = 0. (2): Δθ^t = y^t x^t. (3): y^t x^t · θ* ≥ γ. (4): Cauchy-Schwarz inequality. (5): ‖θ*‖ = 1. (6): Corollary above.

The claim follows as γ s ≤ √s forces s ≤ γ⁻², so any s ≥ ⌊γ⁻²⌋ + 1 leads to a contradiction.
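
A small simulation sketch of the bound (dimension, margin, sample size and the epoch cap are illustrative choices): sample points with ‖x‖ = 1 and margin at least γ with respect to a random unit-length θ*, run the perceptron, and compare the mistake count with ⌊γ⁻²⌋.

```python
import numpy as np

rng = np.random.default_rng(0)
n, gamma = 20, 0.25

theta_star = rng.normal(size=n)
theta_star /= np.linalg.norm(theta_star)           # unit-length separating direction

X, y = [], []                                      # gamma-separable sample with ||x|| <= 1
while len(X) < 500:
    x = rng.normal(size=n)
    x /= np.linalg.norm(x)
    if abs(x @ theta_star) >= gamma:               # keep only points with margin >= gamma
        X.append(x)
        y.append(np.sign(x @ theta_star))

theta, mistakes = np.zeros(n), 0
for _ in range(1000):                              # cycle until a mistake-free pass
    errors = 0
    for x, label in zip(X, y):
        if label * (1.0 if x @ theta >= 0 else -1.0) < 0:
            theta += label * x
            mistakes += 1
            errors += 1
    if errors == 0:
        break

print(mistakes, "<=", int(gamma ** -2))            # Novikoff: at most floor(gamma^-2) mistakes
```
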
Perceptron: Rosenblatt’s View

Given an elementary perceptron, a stimulus world W, and any classification C(W) for which a solution exists; let all stimuli in W occur in any sequence, provided that each stimulus must reoccur in finite time; then beginning from an arbitrary initial state, an error correction procedure will always yield a solution to C(W) in finite time. – Rosenblatt, 1962
Perceptron: XOR Depression

           
def →  { ((0, 0), +1), ((1, 1), +1), ((0, 1), −1), ((1, 0), −1) }    (XOR problem)
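
An illustrative brute-force check (not a proof) that no linear threshold unit, even with a bias term added, realizes this labelling; the grid range and resolution are arbitrary choices.

```python
import numpy as np

X = np.array([[0, 0], [1, 1], [0, 1], [1, 0]])
y = np.array([1, 1, -1, -1])

grid = np.linspace(-2, 2, 41)
separable = any(
    all((1 if w1 * x1 + w2 * x2 + b >= 0 else -1) == label
        for (x1, x2), label in zip(X, y))
    for w1 in grid for w2 in grid for b in grid
)
print(separable)   # False: the XOR labelling is not a linear dichotomy
```
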
Linear Dichotomies

Assume s points in R^n in general position are given.

Definition

def →  C(s, n) = min_{|S| = s} |{ y ∈ {−1, 1}^s : ∃ θ : y_i x_i · θ ≥ 0 (∀i) }|

Counts how many dichotomies are possible with linear separators.
Cover’s Theorem

Theorem (Cover 1965).

⇒  C(s, n) = 2 ∑_{i=0}^{n−1} \binom{s−1}{i}

Important recurrence relation:

C(s + 1, n) = C(s, n) + C(s, n − 1)
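
A short sketch that cross-checks the closed form against the recurrence and evaluates C(s, n) / 2^s at s = 2n (function names are illustrative):

```python
from functools import lru_cache
from math import comb

def C_closed(s, n):
    """Cover's closed form: 2 * sum_{i=0}^{n-1} binom(s-1, i)."""
    return 2 * sum(comb(s - 1, i) for i in range(n))

@lru_cache(maxsize=None)
def C_rec(s, n):
    """Same count via C(s+1, n) = C(s, n) + C(s, n-1)."""
    if n == 0:
        return 0
    if s == 1:
        return 2
    return C_rec(s - 1, n) + C_rec(s - 1, n - 1)

assert all(C_closed(s, n) == C_rec(s, n) for s in range(1, 16) for n in range(11))
print(C_closed(20, 10) / 2 ** 20)   # 0.5: exactly half of all labelings are linear at s = 2n
```
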
Cover’s Theorem

With the recurrence established, the claim can be shown by induction using Pascal's identity (writing out the sums without the common factor of 2):

∑_{i=0}^{n−1} \binom{s−1}{i} + ∑_{i=0}^{n−2} \binom{s−1}{i}

= ∑_{i=1}^{n−1} ( \binom{s−1}{i} + \binom{s−1}{i−1} ) + \binom{s−1}{0}

= ∑_{i=1}^{n−1} \binom{s}{i} + 1 = ∑_{i=0}^{n−1} \binom{s}{i}
The (“Messy”) Intermediate Regime




C(s, n) / 2^s  =  1               if s ≤ n
                  1 − O(e^{−n})   if n < s < 2n
                  1/2             if s = 2n
                  O(e^{−n})       otherwise

In plain English:
- All dichotomies are linear, as long as s ≤ n.
- For n < s < 2n a vanishingly small (as n → ∞) fraction of dichotomies is not linearly realizable.
- For s > 2n almost all dichotomies are not linearly realizable.
Section 4

Willshaw Memory (1969)

Making (Associative) Memories

- What are simple learning mechanisms to make memories?

- Associative memory: linked to stimuli and each other
- Auto-associative memory: pattern completion, robustness
- Formally: key-value pairs (x^t, y^t) ∈ S
- Learn to map x^t → y^t, ∀t
Sparse Binary Patterns

- r-sparse Boolean vectors (keep things simple...)

  def →  B_r^n = { x ∈ {0, 1}^n : ∑_{i=1}^{n} x_i = r }

- Binary memory matrix

  θ ∈ {0, 1}^{n×n},   n² bits
Information & Capacity

- (Random) pattern information:

  lg \binom{n}{r} ≈ r lg n

- Memory capacity: n² bits

- Upper bound on s:

  s ≤ n² / (r lg n)
Willshaw Memory: Learning

- Hebbian rule (Willshaw memory) for s pairs

  def →  Θ_ji = min{ 1, ∑_{t=1}^{s} y_j^t x_i^t }.

  Meaning: Θ_ji = 1 ⇐⇒ ∃t : x_i^t = 1 ∧ y_j^t = 1.

- Equivalently:

  def →  Θ^t = y^t (x^t)^⊤,   Θ = min{ 1, ∑_{t=1}^{s} Θ^t }

  1: matrix of all ones; min applied elementwise.
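
A minimal numpy sketch of the clipped-Hebbian storage rule (the helper name willshaw_store is an illustrative choice):

```python
import numpy as np

def willshaw_store(pairs, n):
    """Theta_ji = min(1, sum_t y_j^t x_i^t): OR together the outer products y^t (x^t)^T."""
    theta = np.zeros((n, n), dtype=int)
    for x, y in pairs:
        theta = np.maximum(theta, np.outer(y, x))   # clipping at 1 == elementwise maximum
    return theta
```
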
Willshaw Memory: Retrieval

Retrieving y for a given x proceeds as follows:

def →  x ↦ z ↦ y,   z = Θ x,   y_j = 0 if z_j < r, and 1 otherwise

This rule is motivated by the fact that

Θ^t x^t = (x^t · x^t) y^t = r y^t
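
A self-contained sketch of the retrieval rule on two hand-made 2-sparse patterns (helper names and toy patterns are illustrative):

```python
import numpy as np

def willshaw_retrieve(theta, x, r):
    """y_j = 1 iff (Theta x)_j >= r."""
    return (theta @ x >= r).astype(int)

n, r = 6, 2
x1, y1 = np.array([1, 1, 0, 0, 0, 0]), np.array([0, 0, 1, 1, 0, 0])
x2, y2 = np.array([0, 0, 0, 1, 1, 0]), np.array([1, 0, 0, 0, 0, 1])

theta = np.zeros((n, n), dtype=int)
for x, y in [(x1, y1), (x2, y2)]:
    theta = np.maximum(theta, np.outer(y, x))       # storage rule from the previous slide

print(willshaw_retrieve(theta, x1, r))              # recovers y1; extra 1s are possible, missing 1s are not
```
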
Willshaw Memory: Example

Willshaw Memory: Monotonicity

Possible errors in retrieval can only be attributed to additional 1s in the output. Formally:

⇒  x^t ↦ y ≥ y^t,   ∀(x^t, y^t) ∈ S .

1. Note that for any t: Θ = min{1, ∑_τ Θ^τ} ≥ min{1, Θ^t} = Θ^t
2. Hence Θ x ≥ Θ^t x for any x ∈ B_r^n, as x ≥ 0.
   Specifically: z^t = Θ x^t ≥ Θ^t x^t
3. Note that

   Θ^t x^t = (x^t · x^t) y^t = r y^t

4. So z^t ≥ r y^t, which implies the claim.
Willshaw Memory: Errors?

You may get more than you asked for :)
Pattern-per-Pattern Storage

- How can one store sparse memories incrementally? How would one know that the memory capacity is reached?

- Incremental storage (see the sketch below):

  θ_ji^t = max{ θ_ji^{t−1}, y_j^t x_i^t }

- Guided by the analysis (to follow): monitor the number of non-zero entries ∑_{i,j} θ_ji^t. Stay well below n²/2; otherwise the memory is full.
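
A minimal sketch of pattern-per-pattern storage with a fill-level monitor; the helper names are illustrative and the 1/2 warning threshold is the heuristic stated above.

```python
import numpy as np

def store_one(theta, x, y):
    """Incremental Willshaw update: theta_ji <- max(theta_ji, y_j * x_i), in place."""
    np.maximum(theta, np.outer(y, x), out=theta)
    return theta

def fill_fraction(theta):
    """Fraction of non-zero entries; stop storing well before this reaches 1/2."""
    return theta.mean()
```

Calling store_one for each new pair and checking fill_fraction(theta) < 1/2 implements the stopping heuristic above.
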
Analysis: Step 1
How many bits in Θ are turned on (on average) after storing s random patterns?

⇒  q := P{Θ_ji = 0} = ( 1 − (r/n)² )^s .

1. For x, y ∈ B_r^n: y x^⊤ has exactly r² non-zero entries.
2. Probability that a given matrix entry is not among them: 1 − r²/n²
3. Θ_ji = 0 iff the entry was affected in none of the s memories

Taylor approximation:

⇒  ln q = s ln( 1 − (r/n)² ) ≈ −s (r/n)²
Analysis: Step 2

Storage filled with s random pairs (x, y) ∈ B_r^n × B_r^n.
Probability that, for a random x ∈ B_r^n, x ↦ y_j = 1?

⇒  p := P{y_j = 1} = (1 − q)^r

1. We can assume x = (1, …, 1, 0, …, 0), with the first r entries equal to 1.
2. P{Θ_ji = 1} = (1 − q) for any i (and j)
3. P{ ⋀_{i=1}^{r} (Θ_ji = 1) } = (1 − q)^r because of independence
Analysis: Step 3

Probability that on input x^t at least one additional bit is turned on:

⇒  P{error} := P{ x^t ↦ y > y^t } ≲ n (1 − q)^r

1. As there are (n − r) bits, each of which can be corrupted:

   P{error} = p + (1 − p) p + (1 − p)² p + · · · + (1 − p)^{n−r−1} p

2. For p ≪ 1, r ≪ n this can be upper-bounded close to equality:

   P{error} ≲ (n − r) p ≲ n p = n (1 − q)^r
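
A simulation sketch that compares the empirical fill level and retrieval-error rate with the step-1 and step-3 formulas; n, r, s and the seed are arbitrary illustrative choices.

```python
import numpy as np

rng = np.random.default_rng(0)
n, r, s = 512, 9, 1500                              # r on the order of lg n

def sparse_pattern():
    x = np.zeros(n, dtype=int)
    x[rng.choice(n, size=r, replace=False)] = 1
    return x

pairs = [(sparse_pattern(), sparse_pattern()) for _ in range(s)]
theta = np.zeros((n, n), dtype=int)
for x, y in pairs:
    np.maximum(theta, np.outer(y, x), out=theta)

q_emp = 1.0 - theta.mean()                          # empirical P{Theta_ji = 0}
q_th = (1 - (r / n) ** 2) ** s                      # step-1 prediction
err_emp = np.mean([np.any((theta @ x >= r).astype(int) > y) for x, y in pairs])
err_th = n * (1 - q_th) ** r                        # step-3 bound n (1 - q)^r
print(q_emp, q_th, err_emp, err_th)
```
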
Analysis: Step 4

What is the critical choice of sparseness r*?

⇒  r*(n, q) = α* lg n,   α* = 1 / (− lg(1 − q))

Obtained (left as an exercise) by requiring

P{error} ≈ n (1 − q)^{r*} = 1
Capacity

1. Information contained in a random pattern (x or y):

   ⇒  I(pattern) = lg \binom{n}{r*} ≈ r* lg n ≈ lg² n / (− lg(1 − q))

2. Total information in s patterns:

   ⇒  I(patterns) = s · I(pattern) ≈ (ln 2) n² ( lg q · lg(1 − q) )

3. Maximal capacity:

   ⇒  max_q I(patterns) = (ln 2) n² ≈ 0.693 n²

That is to say: even making best use of Θ, the memory capacity is suboptimal by a factor of ln 2 ≈ 0.693 relative to the n² raw bits.
Summary

We should encode patterns with sparseness r = lg n. If we do so, we will be able to store (in the limit) almost 69% of the information-theoretic maximum, with interference probability approaching 0 as n → ∞. At this point, half of the weights θ_ji will be 0 and half will be 1 (i.e. q = 1/2).
Was it worth it? :)

Quite possibly there is no system in the brain which corresponds exactly to the hypothetical neural network; but we do attach importance to the principle on which it works and the quantitative relations which we have shown must hold if such a system is to perform, as it can, with high efficiency. – Willshaw, Buneman, Longuet-Higgins, 1969.

The mathematical analysis strongly suggests that nature takes advantage of it!
