IE643 Lecture7 2020sep4 Moodle
IE 643
Lecture 7
September 4, 2020.
1 Recap
MLP for XOR
Description of MLP
Perceptron - Caveat
x1 x2 y = x1 ⊕ x2 ŷ = sign(w1 x1 + w2 x2 − θ)
0 0 -1 sign(−θ)
0 1 1 sign(w2 − θ)
1 0 1 sign(w1 − θ)
1 1 -1 sign(w1 + w2 − θ)
The perceptron is not suitable when the linear separability assumption fails.
Example: the classical XOR problem. Fitting the table above requires:
sign(−θ) = −1 =⇒ θ > 0
sign(w2 − θ) = 1 =⇒ w2 − θ ≥ 0
sign(w1 − θ) = 1 =⇒ w1 − θ ≥ 0
sign(w1 + w2 − θ) = −1 =⇒ −w1 − w2 + θ > 0
Adding the second and third conditions gives w1 + w2 ≥ 2θ > θ (using θ > 0), which contradicts the fourth condition w1 + w2 < θ. Hence no choice of (w1, w2, θ) realizes XOR.
With hidden activations a11 = max{p x1 + q x2 + b1, 0} and a21 = max{r x1 + s x2 + b2, 0}:

x1  x2  a11                   a21                   ŷ                          y
0   0   max{b1, 0}            max{b2, 0}            sign(t a11 + u a21 + b3)   -1
0   1   max{q + b1, 0}        max{s + b2, 0}        sign(t a11 + u a21 + b3)   +1
1   0   max{p + b1, 0}        max{r + b2, 0}        sign(t a11 + u a21 + b3)   +1
1   1   max{p + q + b1, 0}    max{r + s + b2, 0}    sign(t a11 + u a21 + b3)   -1
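One concrete choice of the parameters p, q, r, s, t, u, b1, b2, b3 that realizes XOR can be checked in a few lines. These particular numeric values are illustrative only; many other choices work.

```python
# One concrete (illustrative) choice of parameters.
p, q, b1 = 1.0, 1.0, -0.5    # first hidden neuron:  a11 = max{p*x1 + q*x2 + b1, 0}
r, s, b2 = 1.0, 1.0, -1.5    # second hidden neuron: a21 = max{r*x1 + s*x2 + b2, 0}
t, u, b3 = 1.0, -4.0, -0.25  # output neuron: yhat = sign(t*a11 + u*a21 + b3)

def mlp_xor(x1, x2):
    a11 = max(p * x1 + q * x2 + b1, 0.0)  # ReLU activations
    a21 = max(r * x1 + s * x2 + b2, 0.0)
    return 1 if t * a11 + u * a21 + b3 > 0 else -1

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print((x1, x2), mlp_xor(x1, x2))  # -1, +1, +1, -1: XOR is realized
```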
Notable features:
- Multiple layers stacked together.
- The zero-th layer is usually called the input layer.
- The final layer is usually called the output layer.
- Intermediate layers are called hidden layers.
- Each neuron in the hidden and output layers is like a perceptron.
- However, unlike the perceptron, different activation functions are used.
Recall:
n_k^ℓ denotes the k-th neuron at the ℓ-th layer.
a_k^ℓ denotes the activation of neuron n_k^ℓ.
We have at layer L1:
a^1 = [a_1^1 ; a_2^1] = φ([z_1^1 ; z_2^1]) = φ([w_11^1 x_1 + w_12^1 x_2 ; w_21^1 x_1 + w_22^1 x_2]) = φ(W^1 x)
We have at layer L2:
a^2 = [a_1^2 ; a_2^2] = φ([z_1^2 ; z_2^2]) = φ([w_11^2 a_1^1 + w_12^2 a_2^1 ; w_21^2 a_1^1 + w_22^2 a_2^1]) = φ(W^2 a^1)
At layer L3:
a^3 = [a_1^3] = φ([z_1^3]) = φ([w_11^3 a_1^2 + w_12^3 a_2^2]) = φ(W^3 a^2)
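The layer-by-layer computation above can be sketched directly. The weight values below are placeholders chosen only to illustrate the shapes (2x2, 2x2, 1x2); they are not values from the lecture.

```python
import numpy as np

def phi(z):
    """Sigmoid activation, applied elementwise."""
    return 1.0 / (1.0 + np.exp(-z))

# Placeholder weight matrices; only the shapes matter here.
W1 = np.array([[0.5, -0.3],
               [0.8,  0.2]])
W2 = np.array([[0.1,  0.4],
               [-0.6, 0.7]])
W3 = np.array([[0.9, -0.2]])

def forward(x):
    a1 = phi(W1 @ x)   # a^1 = phi(W^1 x)
    a2 = phi(W2 @ a1)  # a^2 = phi(W^2 a^1)
    a3 = phi(W3 @ a2)  # a^3 = phi(W^3 a^2)
    return a3

x = np.array([1.0, 0.0])
print(forward(x))  # a single activation in (0, 1)
```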
Optimization perspective
Given training data D = {(x^i, y^i)}_{i=1}^S, solve

  min Σ_{i=1}^S e^i = Σ_{i=1}^S Err(y^i, ŷ^i) = Σ_{i=1}^S Err(y^i, MLP(x^i)).

For a single-neuron MLP with sigmoid activation,

  ŷ = σ(w_11^1 x_1 + w_12^1 x_2) = 1 / (1 + exp(−[w_11^1 x_1 + w_12^1 x_2])).
Let

x1   x2   y   ŷ = σ(w_11^1 x_1 + w_12^1 x_2)
-3   -3   1   σ(−3w_11^1 − 3w_12^1)
-2   -2   1   σ(−2w_11^1 − 2w_12^1)
 4    4   0   σ(4w_11^1 + 4w_12^1)
 2   -5   0   σ(2w_11^1 − 5w_12^1)

Assume: Err(y, ŷ) = (y − ŷ)^2.
Popularly called the squared error.
P. Balamurugan Deep Learning - Theory and Practice September 4, 2020. 30 / 67
MLP - Data Perspective
For the table above, the total error over the four samples is:

  E = Σ_{i=1}^4 ( y^i − 1 / (1 + exp(−[w_11^1 x_1^i + w_12^1 x_2^i])) )^2
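A quick sketch of evaluating E at candidate weights; the weight values passed in below are arbitrary, chosen only to show that the loss is computable.

```python
import math

# The four training samples from the table above.
data = [(-3, -3, 1), (-2, -2, 1), (4, 4, 0), (2, -5, 0)]

def sigma(z):
    return 1.0 / (1.0 + math.exp(-z))

def E(w11, w12):
    """Total squared error over the four samples."""
    return sum((y - sigma(w11 * x1 + w12 * x2)) ** 2 for x1, x2, y in data)

print(E(0.0, 0.0))   # every prediction is sigma(0) = 0.5, so E = 4 * 0.25 = 1.0
print(E(-1.0, -1.0)) # a different (still arbitrary) weight choice
```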
Optimization Concepts

  min_{x ∈ C} f(x)

General Assumption: C ⊆ R^d.
Neighborhood of z ∈ C: the set N_ε(z) = {x ∈ R^d : ‖x − z‖_2 < ε} for some ε > 0.
ℓ2 norm of a vector x:
  ‖x‖_2 = √(x^⊤ x) = (|x_1|^2 + |x_2|^2 + . . . + |x_d|^2)^{1/2}.
ℓ1 norm of a vector x:
  ‖x‖_1 = |x_1| + |x_2| + . . . + |x_d|.
Note: |x_1| denotes the absolute value of x_1.
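Both norms are easy to check numerically on a small example:

```python
import numpy as np

x = np.array([3.0, -4.0])
l2 = (x @ x) ** 0.5     # (|3|^2 + |-4|^2)^(1/2)
l1 = np.abs(x).sum()    # |3| + |-4|
print(l2, l1)           # 5.0 7.0
```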
  min_{x ∈ C} f(x), where C ⊆ R^d and f : C −→ R.
Directional derivative

  f′(x; d) = lim_{α↓0} [f(x + αd) − f(x)] / α

Interior of a set C
Let C ⊆ R^d. Then int(C) is defined by:
  int(C) = {x ∈ C : N_ε(x) ⊆ C for some ε > 0}.
Note: If all partial derivatives of f exist at x, then f′(x; d) = ⟨∇f(x), d⟩, where ∇f(x) = [∂f(x)/∂x_1 . . . ∂f(x)/∂x_d]^⊤.
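The note above can be verified numerically. The function f below is a made-up smooth example, not from the lecture; the difference quotient at a small α should agree with the inner-product formula.

```python
import numpy as np

def f(x):
    return x[0] ** 2 + 3.0 * x[1] ** 2        # a smooth test function

def grad_f(x):
    return np.array([2.0 * x[0], 6.0 * x[1]])

x = np.array([1.0, 2.0])
d = np.array([1.0, -1.0])

alpha = 1e-6
numeric = (f(x + alpha * d) - f(x)) / alpha   # difference quotient at small alpha
analytic = grad_f(x) @ d                      # <grad f(x), d>
print(numeric, analytic)                      # both close to 2 - 12 = -10
```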
Descent Direction
A vector 0 ≠ d ∈ R^d is a descent direction of f at x if f′(x; d) < 0.

Proposition
Let f : R^d −→ R be a continuously differentiable function over R^d. Let 0 ≠ d ∈ R^d be a descent direction of f at x. Then there exists ε > 0 such that ∀α ∈ (0, ε] we have f(x + αd) < f(x).

Proof idea: since f′(x; d) = lim_{α↓0} [f(x + αd) − f(x)]/α < 0, the difference quotient is negative for all sufficiently small α > 0, which gives f(x + αd) < f(x).
(GEN-OPT)  min_{x ∈ R^d} f(x), where f : R^d −→ R.
Algorithm to solve (GEN-OPT):
- Start with x^0 ∈ R^d.
- For k = 0, 1, 2, . . .
  - Find a descent direction d^k of f at x^k and α_k > 0 such that f(x^k + α_k d^k) < f(x^k).
  - x^{k+1} = x^k + α_k d^k.
  - Check for some stopping criterion and break from loop.
Proposition
Let f : C −→ R be a function over the set C ⊆ R^d. Let x* ∈ int(C) be a local optimum point of f. Let all partial derivatives of f exist at x*. Then ∇f(x*) = 0.

Proof idea: along each coordinate direction, the one-variable function t ↦ f(x* + t e_j) has a local optimum at t = 0, so each partial derivative ∂f(x*)/∂x_j vanishes.

(GEN-OPT)  min_{x ∈ R^d} f(x), where f : R^d −→ R.
Algorithm to solve (GEN-OPT):
- Start with x^0 ∈ R^d.
- For k = 0, 1, 2, . . .
  - Find a descent direction d^k of f at x^k and α_k > 0 such that f(x^k + α_k d^k) < f(x^k).
  - x^{k+1} = x^k + α_k d^k.
  - If ‖∇f(x^{k+1})‖_2 = 0, set x* = x^{k+1}, break from loop.
- Output x*.
Gradient Descent Algorithm to solve (GEN-OPT), where f : R^d −→ R:
- Start with x^0 ∈ R^d.
- For k = 0, 1, 2, . . .
  - d^k = −∇f(x^k).
  - α_k = argmin_{α>0} f(x^k + α d^k).
  - x^{k+1} = x^k + α_k d^k.
  - If ‖∇f(x^{k+1})‖_2 = 0, set x* = x^{k+1}, break from loop.
- Output x*.
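A runnable sketch of the algorithm. As a practical assumption, the exact line search argmin_α is replaced by backtracking (halving α until descent), and the exact stopping test ‖∇f‖_2 = 0 by a small tolerance; the test function at the bottom is a made-up quadratic.

```python
import numpy as np

def gradient_descent(f, grad, x0, tol=1e-8, max_iter=1000):
    x = x0.astype(float)
    for _ in range(max_iter):
        g = grad(x)
        if np.linalg.norm(g) <= tol:              # approximate stopping test
            break
        d = -g                                    # steepest-descent direction
        alpha = 1.0                               # backtracking line search:
        while f(x + alpha * d) >= f(x) and alpha > 1e-12:
            alpha *= 0.5                          # halve until descent
        if f(x + alpha * d) >= f(x):              # no further progress possible
            break
        x = x + alpha * d
    return x

# Minimize f(x) = (x1 - 1)^2 + 2*x2^2, whose minimizer is (1, 0).
f = lambda x: (x[0] - 1.0) ** 2 + 2.0 * x[1] ** 2
grad = lambda x: np.array([2.0 * (x[0] - 1.0), 4.0 * x[1]])
print(gradient_descent(f, grad, np.array([5.0, -3.0])))  # -> [1. 0.]
```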
Gradient Descent Algorithm to solve the MLP Loss Minimization Problem, where E : R^2 −→ R:
- Start with w^0 ∈ R^2.
- For k = 0, 1, 2, . . .
  - d^k = −∇E(w^k).
  - α_k = argmin_{α>0} E(w^k + α d^k).
  - w^{k+1} = w^k + α_k d^k.
  - If ‖∇E(w^{k+1})‖_2 = 0, set w* = w^{k+1}, break from loop.
- Output w*.
  min_{w = (w_11^1, w_12^1)} E(w) = Σ_{i=1}^4 e^i(w) = Σ_{i=1}^4 ( y^i − 1 / (1 + exp(−[w_11^1 x_1^i + w_12^1 x_2^i])) )^2
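A minimal sketch of applying gradient descent to this two-parameter loss, assuming a small fixed step size (0.01) in place of the exact line search; the step size, iteration count, and starting point are illustrative choices, not values from the lecture.

```python
import math

data = [(-3, -3, 1), (-2, -2, 1), (4, 4, 0), (2, -5, 0)]

def sigma(z):
    return 1.0 / (1.0 + math.exp(-z))

def E(w):
    return sum((y - sigma(w[0] * x1 + w[1] * x2)) ** 2 for x1, x2, y in data)

def grad_E(w):
    """Gradient of E, using sigma'(z) = sigma(z) * (1 - sigma(z))."""
    g = [0.0, 0.0]
    for x1, x2, y in data:
        s = sigma(w[0] * x1 + w[1] * x2)
        c = -2.0 * (y - s) * s * (1.0 - s)   # d e^i / d z
        g[0] += c * x1
        g[1] += c * x2
    return g

w = [0.0, 0.0]
for _ in range(5000):
    g = grad_E(w)
    w = [w[0] - 0.01 * g[0], w[1] - 0.01 * g[1]]  # fixed step instead of argmin
print(w, E(w))  # the loss drops well below its starting value E([0, 0]) = 1.0
```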
Acknowledgments:
CalcPlot3D website for plotting.