
Introduction to Deep Learning

6. Multilayer Perceptron

STAT 157, Spring 2019, UC Berkeley

Mu Li and Alex Smola


courses.d2l.ai/berkeley-stat-157
© 2018, Amazon Web Services, Inc. or its Affiliates. All rights reserved.
Outline

• Single Layer Perceptron
  • Decision Boundary
  • XOR is hard
• Multilayer Perceptron
  • Layers
  • Nonlinearities
  • Computational Cost

Perceptron
Mark I Perceptron, 1960
(wikipedia.org)

Perceptron

• Given input x, weight w, and bias b, the perceptron outputs:

o = σ(⟨w, x⟩ + b),  where σ(x) = 1 if x > 0 and 0 otherwise

Perceptron

• Given input x, weight w, and bias b, the perceptron outputs:

o = σ(⟨w, x⟩ + b),  where σ(x) = 1 if x > 0 and 0 otherwise

• Binary classification (output 0 or 1)
• vs. a real-valued scalar for regression
• vs. class probabilities for logistic regression
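
A minimal NumPy sketch of this output rule (the function name and the example values are illustrative, not from the slides):

```python
import numpy as np

def perceptron_output(w, x, b):
    """Return 1 if <w, x> + b > 0, else 0 (the sigma from the slide)."""
    return 1 if np.dot(w, x) + b > 0 else 0

# Example: a 2D input classified by a hand-picked weight vector.
w = np.array([1.0, -1.0])
x = np.array([2.0, 0.5])
b = -0.5
print(perceptron_output(w, x, b))  # 1, since 2.0 - 0.5 - 0.5 = 1.0 > 0
```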

Perceptron

(Figure: a perceptron decision boundary separating Spam from Ham.)

Training the Perceptron
initialize w = 0 and b = 0
repeat
  if yᵢ (⟨w, xᵢ⟩ + b) ≤ 0 then
    w ← w + yᵢ xᵢ and b ← b + yᵢ
  end if
until all points are classified correctly

This is equivalent to SGD with batch size 1 and the loss
ℓ(y, x, w) = max(0, −y ⟨w, x⟩)
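
A NumPy sketch of this training loop, assuming labels in {+1, −1}; an epoch cap is added so it terminates even if the data is not separable, and names such as train_perceptron are illustrative:

```python
import numpy as np

def train_perceptron(X, y, max_epochs=100):
    """Perceptron updates: on a mistake (y_i (<w, x_i> + b) <= 0),
    set w <- w + y_i x_i and b <- b + y_i. Labels must be +1/-1."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):           # cap the passes in case data is not separable
        mistakes = 0
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:
                w += yi * xi
                b += yi
                mistakes += 1
        if mistakes == 0:                  # every point classified correctly
            break
    return w, b

# Toy separable data: class +1 on the right, class -1 on the left.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -0.5], [-2.0, 1.0]])
y = np.array([1, 1, -1, -1])
w, b = train_perceptron(X, y)
print(w, b)
```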
(Four figure slides follow; images from Wikipedia.)
Convergence Theorem

• Radius r enclosing the data
• Margin ρ separating the classes:
  y(x⊤w + b) ≥ ρ  for  ∥w∥² + b² ≤ 1
• Guaranteed that the perceptron will converge
  after at most (r² + 1) / ρ² steps
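
To make the bound concrete, a small sketch (not from the slides) that computes r, ρ, and (r² + 1)/ρ² for a toy dataset under a hand-picked separator satisfying ∥w∥² + b² ≤ 1; all data and parameter values are illustrative assumptions:

```python
import numpy as np

# Toy separable data with labels in {+1, -1}.
X = np.array([[2.0, 1.0], [1.5, 2.0], [-1.0, -0.5], [-2.0, 1.0]])
y = np.array([1, 1, -1, -1])

# A separator normalized so that ||w||^2 + b^2 <= 1 (assumed, for illustration).
w = np.array([1.0, 0.0])
b = 0.0

r = np.max(np.linalg.norm(X, axis=1))   # radius enclosing the data
rho = np.min(y * (X @ w + b))           # margin of this separator
bound = (r**2 + 1) / rho**2             # max number of perceptron updates
print(r, rho, bound)
```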

Consequences

• Only need to store the errors.
  This gives a compression bound for the perceptron.
• Fails with noisy data
  (do NOT train your avatar with perceptrons)

(Image: the video game Black & White.)

XOR Problem (Minsky & Papert, 1969)

The perceptron cannot learn the XOR function
(neurons can only generate linear separators)

Multilayer Perceptron

Learning XOR

(Figure: the four quadrants of the input space, numbered 1, 2, 3, 4.)
Learning XOR
Quadrant            1   2   3   4
First classifier    +   −   +   −
Second classifier   +   +   −   −
Product             +   −   −   +

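A short NumPy sketch (illustrative, not from the slides) of the idea in the table: take the sign of each coordinate with two separate linear "classifiers" and multiply the signs; the product reproduces the XOR pattern + − − + across the four quadrants.

```python
import numpy as np

# One representative point from each quadrant (1: top-left, 2: top-right,
# 3: bottom-left, 4: bottom-right), matching the layout on the slide.
points = np.array([[-1.0,  1.0],   # quadrant 1
                   [ 1.0,  1.0],   # quadrant 2
                   [-1.0, -1.0],   # quadrant 3
                   [ 1.0, -1.0]])  # quadrant 4

first = np.sign(-points[:, 0])    # first linear separator: + on the left half
second = np.sign(points[:, 1])    # second linear separator: + on the top half
product = first * second          # XOR pattern across quadrants: + - - +
print(first, second, product)
```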
Single Hidden Layer

Single Hidden Layer

Hyperparameter: the size m of the hidden layer

Single Hidden Layer

• Input x ∈ ℝn
• Hidden W1 ∈ ℝm×n, b1 ∈ ℝm
• Output w2 ∈ ℝm, b2 ∈ ℝ

h = σ(W1x + b1)
o = w2⊤h + b2

σ is an element-wise
activation function
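
A minimal NumPy sketch of this forward pass; ReLU is used as the element-wise σ, and the sizes, random initialization, and names are illustrative assumptions rather than anything prescribed by the slide.

```python
import numpy as np

n, m = 4, 8                        # input dimension and hidden size (hyperparameter)
rng = np.random.default_rng(0)

W1 = rng.normal(scale=0.1, size=(m, n))   # hidden-layer weights, shape (m, n)
b1 = np.zeros(m)
w2 = rng.normal(scale=0.1, size=m)        # output weights, shape (m,)
b2 = 0.0

def relu(z):
    return np.maximum(z, 0)

def mlp(x):
    h = relu(W1 @ x + b1)    # h = sigma(W1 x + b1)
    return w2 @ h + b2       # o = w2^T h + b2

x = rng.normal(size=n)
print(mlp(x))
```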

Single Hidden Layer

Why do we need a
nonlinear activation?

h = σ(W1x + b1)
o = w2⊤h + b2

σ is an element-wise
activation function

Single Hidden Layer

Why do we need a
nonlinear activation?

h = W1x + b1
o = w2⊤h + b2
hence o = w2⊤W1x + b′

Linear …
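
A quick numerical check of the collapse above: without σ, the two layers equal one affine map with w′ = W1⊤w2 and b′ = w2⊤b1 + b2. The values below are random and purely illustrative.

```python
import numpy as np

rng = np.random.default_rng(1)
n, m = 3, 5
W1, b1 = rng.normal(size=(m, n)), rng.normal(size=m)
w2, b2 = rng.normal(size=m), rng.normal()

x = rng.normal(size=n)

# Two "layers" without a nonlinearity ...
o_two_layer = w2 @ (W1 @ x + b1) + b2
# ... are just one affine map with w' = W1^T w2 and b' = w2^T b1 + b2.
o_one_layer = (W1.T @ w2) @ x + (w2 @ b1 + b2)

print(np.allclose(o_two_layer, o_one_layer))  # True
```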
Sigmoid Activation

Map input into (0, 1); a soft version of σ(x) = 1 if x > 0, 0 otherwise

sigmoid(x) = 1 / (1 + exp(−x))

Tanh Activation

Map inputs into (−1, 1)

tanh(x) = (1 − exp(−2x)) / (1 + exp(−2x))

ReLU Activation

ReLU: rectified linear unit


ReLU(x) = max(x,0)
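
The three activations on these slides, written as small NumPy functions. This is a direct transcription of the formulas; more numerically stable variants exist for large |x|, which the slides do not discuss.

```python
import numpy as np

def sigmoid(x):
    # Maps into (0, 1): 1 / (1 + exp(-x))
    return 1.0 / (1.0 + np.exp(-x))

def tanh(x):
    # Maps into (-1, 1): (1 - exp(-2x)) / (1 + exp(-2x))
    return (1.0 - np.exp(-2.0 * x)) / (1.0 + np.exp(-2.0 * x))

def relu(x):
    # Rectified linear unit: max(x, 0)
    return np.maximum(x, 0)

z = np.array([-2.0, 0.0, 2.0])
print(sigmoid(z), tanh(z), relu(z))
```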

Multiclass Classification

Multiclass Classification
y1, y2, …, yk = softmax(o1, o2, …, ok)

Multiclass Classification

• Input x ∈ ℝn
• Hidden W1 ∈ ℝm×n and b1 ∈ ℝm
• Output W2 ∈ ℝm×d and b2 ∈ ℝd

h = σ(W1x + b1)
o = W2⊤h + b2
y = softmax(o)
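
A hedged sketch of this multiclass forward pass: the max-subtraction inside softmax is a standard numerical-stability trick not mentioned on the slide, ReLU stands in for σ, and all sizes and values are illustrative.

```python
import numpy as np

n, m, d = 4, 8, 3                  # input size, hidden size, number of classes
rng = np.random.default_rng(2)
W1, b1 = rng.normal(scale=0.1, size=(m, n)), np.zeros(m)
W2, b2 = rng.normal(scale=0.1, size=(m, d)), np.zeros(d)   # W2 in R^{m x d}, as on the slide

def softmax(o):
    o = o - o.max()                          # stability shift; softmax is unchanged
    e = np.exp(o)
    return e / e.sum()

def mlp_multiclass(x):
    h = np.maximum(W1 @ x + b1, 0)           # h = sigma(W1 x + b1), here ReLU
    o = W2.T @ h + b2                        # o = W2^T h + b2
    return softmax(o)                        # y = softmax(o)

x = rng.normal(size=n)
y = mlp_multiclass(x)
print(y, y.sum())                            # class probabilities summing to 1
```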

Multiple Hidden Layers

h1 = σ(W1x + b1)
h2 = σ(W2h1 + b2)
h3 = σ(W3h2 + b3)
o = W4h3 + b4

Hyper-parameters
• # of hidden layers
• Hidden size for each layer
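
A sketch of stacking several hidden layers in a loop; the list of layer sizes encodes both hyperparameters (the number of hidden layers and each hidden size), and all concrete values are assumptions for illustration.

```python
import numpy as np

rng = np.random.default_rng(3)
sizes = [4, 8, 8, 4, 1]            # input, three hidden layers, scalar output

# One (W, b) pair per layer; W_k has shape (sizes[k+1], sizes[k]).
params = [(rng.normal(scale=0.1, size=(m, n)), np.zeros(m))
          for n, m in zip(sizes[:-1], sizes[1:])]

def forward(x):
    h = x
    for i, (W, b) in enumerate(params):
        h = W @ h + b
        if i < len(params) - 1:    # sigma on hidden layers, none on the output
            h = np.maximum(h, 0)
    return h

print(forward(rng.normal(size=sizes[0])))
```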

Summary

• Perceptron
  • Simple updates
  • Limited function complexity
• Multilayer Perceptron
  • Multiple layers add more complexity
  • Nonlinearity is needed
  • Simple composition (but architecture search needed)

