6. Multilayer Perceptron
courses.d2l.ai/berkeley-stat-157
Perceptron
[Photo: Mark I Perceptron, 1960 (wikipedia.org)]
Perceptron
o = σ(⟨w, x⟩ + b), where σ(x) = 1 if x > 0 and 0 otherwise
• Binary classification (0 or 1)
• Vs. a scalar real value for regression
• Vs. probabilities for logistic regression
Perceptron
[Figure: a linear decision boundary separating Spam from Ham examples]
Training the Perceptron
initialize w = 0 and b = 0
repeat
  if yᵢ(⟨w, xᵢ⟩ + b) ≤ 0 then
    w ← w + yᵢxᵢ and b ← b + yᵢ
  end if
until all classified correctly
This is equivalent to SGD with batch size 1 and the loss
ℓ(y, x, w) = max(0, −y⟨w, x⟩)
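A minimal runnable sketch of this loop in NumPy, with labels yᵢ ∈ {−1, +1}; the toy dataset and the epoch cap are assumptions for illustration, not from the slides:

```python
import numpy as np

def train_perceptron(X, y, max_epochs=100):
    """Perceptron updates: SGD with batch size 1 on max(0, -y * (<w, x> + b))."""
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(max_epochs):                 # cap epochs in case data is not separable
        errors = 0
        for xi, yi in zip(X, y):
            if yi * (np.dot(w, xi) + b) <= 0:   # misclassified (or on the boundary)
                w = w + yi * xi                 # w <- w + y_i x_i
                b = b + yi                      # b <- b + y_i
                errors += 1
        if errors == 0:                         # until all classified correctly
            break
    return w, b

# Toy linearly separable data (an assumption for the demo)
X = np.array([[1.0, 1.0], [2.0, 0.5], [-1.0, -1.0], [-0.5, -2.0]])
y = np.array([1.0, 1.0, -1.0, -1.0])
w, b = train_perceptron(X, y)
```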
[Figure sequence: perceptron updates on example data, from Wikipedia]
Convergence Theorem
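A standard statement of this result, given as a hedged sketch: assume every input satisfies ‖xᵢ‖ ≤ r and some (w∗, b∗) with ‖w∗‖² + (b∗)² ≤ 1 separates the data with margin ρ; the exact form on the original slide may differ.

```latex
\[
y_i \bigl( \langle w^*, x_i \rangle + b^* \bigr) \ge \rho \;\; \forall i
\quad\Longrightarrow\quad
\#\text{updates} \;\le\; \frac{r^2 + 1}{\rho^2}
\]
```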
Consequences
The perceptron converges after finitely many updates whenever the data is linearly separable; if it is not separable, the update loop never terminates.
XOR Problem (Minsky & Papert, 1969)
XOR is not linearly separable, so no single perceptron can represent it.
Multilayer Perceptron
Learning XOR
[Figure: the plane split into four regions, numbered 1 and 2 on top, 3 and 4 below]
Learn XOR with two layers: one linear classifier splits the plane left/right, another splits it top/bottom, and the product of the two outputs gives the XOR pattern.

Region             1   2   3   4
First classifier   +   -   +   -
Second classifier  +   +   -   -
Product            +   -   -   +
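A quick numeric check of the table above; reading the figure as a 2×2 grid with regions 1, 2 on top and 3, 4 below, the first classifier is the sign of −x and the second is the sign of y (this reading is an assumption):

```python
import numpy as np

# One representative point per region: 1 = top-left, 2 = top-right,
# 3 = bottom-left, 4 = bottom-right (assumed layout of the slide's figure).
points = np.array([[-1.0, 1.0], [1.0, 1.0], [-1.0, -1.0], [1.0, -1.0]])

f1 = np.sign(-points[:, 0])    # first linear classifier: + on the left half
f2 = np.sign(points[:, 1])     # second linear classifier: + on the top half
print(f1)       # [ 1. -1.  1. -1.]
print(f2)       # [ 1.  1. -1. -1.]
print(f1 * f2)  # [ 1. -1. -1.  1.]  <- the XOR pattern from the table
```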
Single Hidden Layer
[Figure: a network with one hidden layer between inputs and output]
Single Hidden Layer
• Input x ∈ ℝn
• Hidden W1 ∈ ℝm×n, b1 ∈ ℝm
• Output w2 ∈ ℝm, b2 ∈ ℝ
h = σ(W1x + b1)
o = wT2 h + b2
σ is an element-wise
activation function
courses.d2l.ai/berkeley-stat-157
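A NumPy sketch of this forward pass; the shapes follow the bullets above, while the sigmoid choice for σ and the random parameters are assumptions:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

n, m = 4, 8                                      # input and hidden sizes (arbitrary)
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(m, n)), np.zeros(m)    # hidden layer: W1 in R^{m x n}
w2, b2 = rng.normal(size=m), 0.0                 # output layer: w2 in R^m

x = rng.normal(size=n)        # input x in R^n
h = sigmoid(W1 @ x + b1)      # h = sigma(W1 x + b1), applied element-wise
o = w2 @ h + b2               # o = w2^T h + b2, a scalar output
```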
Single Hidden Layer
Why do we need a nonlinear activation? Drop σ and the two layers collapse into a single linear model:
h = W1x + b1
o = w2⊤h + b2
hence o = w2⊤W1x + b′, which is still linear in x
Sigmoid Activation
sigmoid(x) = 1 / (1 + exp(−x)), mapping inputs to (0, 1)
Tanh Activation
tanh(x) = (1 − exp(−2x)) / (1 + exp(−2x)), mapping inputs to (−1, 1)
ReLU Activation
ReLU(x) = max(0, x)
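The three activations above as NumPy one-liners; the definitions are standard and the sample grid is arbitrary:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))   # output in (0, 1)

def tanh(x):
    return np.tanh(x)                 # output in (-1, 1)

def relu(x):
    return np.maximum(0.0, x)         # max(0, x): zero for negative inputs

x = np.linspace(-4.0, 4.0, 9)         # sample grid for comparison
for f in (sigmoid, tanh, relu):
    print(f.__name__, f(x))
```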
Multiclass Classification
y1, y2, …, yd = softmax(o1, o2, …, od)
• Input x ∈ ℝn
• Hidden W1 ∈ ℝm×n and b1 ∈ ℝm
• Output W2 ∈ ℝm×d and b2 ∈ ℝd
h = σ(W1x + b1)
o = W2⊤h + b2
y = softmax(o)
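A NumPy sketch of this multiclass forward pass; the max-subtraction inside softmax is a standard numerical-stability trick added here (not from the slide), and ReLU for σ is an assumption:

```python
import numpy as np

def softmax(o):
    e = np.exp(o - o.max())   # subtract max for numerical stability (assumption)
    return e / e.sum()

n, m, d = 4, 8, 3                                  # input, hidden, class counts
rng = np.random.default_rng(0)
W1, b1 = rng.normal(size=(m, n)), np.zeros(m)      # hidden: W1 in R^{m x n}
W2, b2 = rng.normal(size=(m, d)), np.zeros(d)      # output: W2 in R^{m x d}

x = rng.normal(size=n)
h = np.maximum(0.0, W1 @ x + b1)   # h = sigma(W1 x + b1) with sigma = ReLU
o = W2.T @ h + b2                  # o = W2^T h + b2, one score per class
y = softmax(o)                     # class probabilities, summing to 1
```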
Multiple Hidden Layers
h1 = σ(W1x + b1)
h2 = σ(W2h1 + b2)
h3 = σ(W3h2 + b3)
o = W4h3 + b4
Hyperparameters (see the sketch below):
• # of hidden layers
• Hidden size for each layer
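A sketch of stacking hidden layers in a loop, with the two hyperparameters above exposed as a size list; the particular sizes and the ReLU choice are arbitrary assumptions:

```python
import numpy as np

def mlp_forward(x, weights, biases):
    """h_k = sigma(W_k h_{k-1} + b_k) for each hidden layer, then a linear output."""
    h = x
    for W, b in zip(weights[:-1], biases[:-1]):
        h = np.maximum(0.0, W @ h + b)    # ReLU activation (an assumption)
    return weights[-1] @ h + biases[-1]   # o = W4 h3 + b4: no output activation

rng = np.random.default_rng(0)
sizes = [4, 16, 16, 16, 3]   # input, three hidden layers, output (arbitrary choices)
weights = [rng.normal(size=(m, n)) for n, m in zip(sizes[:-1], sizes[1:])]
biases = [np.zeros(m) for m in sizes[1:]]

o = mlp_forward(rng.normal(size=sizes[0]), weights, biases)
```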
Summary
• Perceptron
  • Simple updates
  • Limited function complexity
• Multilayer Perceptron
  • Multiple layers add more complexity
  • Nonlinearity is needed
  • Simple composition (but architecture search needed)