ECS171: Machine Learning

Lecture 10: Neural Networks

Cho-Jui Hsieh
UC Davis

Feb 12, 2018


Neural Networks
Another way to introduce nonlinearity

How to generate this nonlinear hypothesis?

Combining multiple perceptrons to construct a nonlinear hypothesis!

Combining perceptrons

Example: h = (h1 OR h2):

h(x) = sign(1.5 + h1(x) + h2(x)),  where h1(x) = sign(w1^T x), h2(x) = sign(w2^T x)
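As a concrete check of this OR construction, here is a minimal NumPy sketch; the weight vectors w1, w2 and the test point are made up for illustration and are not from the lecture.

import numpy as np

def perceptron(w):
    """Return a perceptron hypothesis h(x) = sign(w^T x); x includes the bias term x0 = 1."""
    return lambda x: np.sign(w @ x)

# Two hypothetical linear separators in 2D, with a bias coordinate in position 0.
w1 = np.array([-0.5, 1.0, 0.0])   # h1(x) = sign(-0.5 + x1)
w2 = np.array([-0.5, 0.0, 1.0])   # h2(x) = sign(-0.5 + x2)
h1, h2 = perceptron(w1), perceptron(w2)

def h(x):
    # OR of two perceptrons: +1 whenever at least one of h1, h2 outputs +1.
    return np.sign(1.5 + h1(x) + h2(x))

x = np.array([1.0, 0.2, 0.9])     # [bias, x1, x2]
print(h1(x), h2(x), h(x))         # -1.0 1.0 1.0  -> the OR fires because h2 does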
Creating more layers

Creating more layers of perceptrons gives a feedforward network
Activation Function

Perceptron: the activation function is a "hard threshold"

h(x) = θ(w^T x),  θ(x) = sign(x)

θ: activation function
Non-differentiable, hard to optimize
Replace θ with a smoother, differentiable function (e.g., tanh)
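A quick numerical look at the hard threshold versus tanh, the smooth activation used in the rest of this lecture; the sample points are arbitrary.

import numpy as np

s = np.linspace(-3, 3, 7)

hard = np.sign(s)            # perceptron's hard threshold: non-differentiable at 0
soft = np.tanh(s)            # smooth replacement
dsoft = 1.0 - np.tanh(s)**2  # θ'(s) = 1 − θ(s)², which backpropagation relies on below

print(np.round(soft, 3))
print(np.round(dsoft, 3))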
Formal Definitions

1 ≤ l ≤ L : layers
w_{ij}^{(l)}, 0 ≤ i ≤ d^{(l−1)} : inputs
1 ≤ j ≤ d^{(l)} : outputs

j-th neuron in the l-th layer:

x_j^{(l)} = θ(s_j^{(l)}) = θ( Σ_{i=0}^{d^{(l−1)}} w_{ij}^{(l)} x_i^{(l−1)} )

Output:

h(x) = x_1^{(L)}
Forward propagation
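A minimal sketch of this forward pass with θ = tanh, handling the bias by prepending x_0 = 1 at each layer; the layer sizes, weight values, and input are made up for illustration.

import numpy as np

def forward(x, weights):
    """Forward propagation with θ = tanh.

    weights[l] has shape (d^{(l-1)} + 1, d^{(l)}); row 0 holds the bias weights,
    matching s_j^{(l)} = Σ_{i=0}^{d^{(l-1)}} w_{ij}^{(l)} x_i^{(l-1)} with x_0 = 1.
    Returns all layer outputs x^{(0)}, ..., x^{(L)} (kept for the backward pass later).
    """
    xs = [x]
    for W in weights:
        x_with_bias = np.concatenate(([1.0], xs[-1]))   # prepend x_0^{(l-1)} = 1
        s = x_with_bias @ W                              # s^{(l)}
        xs.append(np.tanh(s))                            # x^{(l)} = θ(s^{(l)})
    return xs

# Hypothetical 2-3-1 network with small random weights.
rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.5, size=(3, 3)),   # layer 1: 2 inputs + bias -> 3 units
           rng.normal(scale=0.5, size=(4, 1))]   # layer 2: 3 units + bias -> 1 output
xs = forward(np.array([0.3, -1.2]), weights)
print(xs[-1])    # h(x) = x_1^{(L)}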
Stochastic Gradient Descent

All the weights W = {W_1, · · · , W_L} determine h(x)

Error on example (x_n, y_n) is

e(h(x_n), y_n) = e(W)

To implement SGD, we need the gradient

∇e(W) : { ∂e(W)/∂w_{ij}^{(l)} } for all i, j, l
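One (inefficient) way to obtain these partial derivatives is numerically: perturb each weight and difference the error. The sketch below does this for a made-up two-layer tanh network with square loss; the helper names, layer sizes, and values are illustrative. The next slides show how backpropagation gets the same gradient far more cheaply.

import numpy as np

def h(x, weights):
    # tanh feedforward network; bias weights stored as row 0 of each weight matrix.
    for W in weights:
        x = np.tanh(np.concatenate(([1.0], x)) @ W)
    return x

def e(weights, xn, yn):
    # squared error on a single example (x_n, y_n)
    return float((h(xn, weights)[0] - yn) ** 2)

# Estimate ∂e(W)/∂w_{ij}^{(l)} by central finite differences, one weight at a time.
rng = np.random.default_rng(1)
weights = [rng.normal(scale=0.5, size=(3, 2)), rng.normal(scale=0.5, size=(3, 1))]
xn, yn, eps = np.array([0.5, -0.4]), 1.0, 1e-6

grads = []
for W in weights:
    G = np.zeros_like(W)
    for i in range(W.shape[0]):
        for j in range(W.shape[1]):
            W[i, j] += eps
            e_plus = e(weights, xn, yn)
            W[i, j] -= 2 * eps
            e_minus = e(weights, xn, yn)
            W[i, j] += eps
            G[i, j] = (e_plus - e_minus) / (2 * eps)
    grads.append(G)

print(grads[0])   # ∂e(W)/∂w_{ij}^{(1)} for all i, j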
∂e(W )
Computing Gradient (l)
∂wij

Use chain rule:


(l)
∂e(W ) ∂e(W ) ∂sj
(l)
= (l)
× (l)
∂wij ∂sj ∂wij

(l) Pd (l−1) (l)


sj = i=1 xi wij
(l)
∂sj (l−1)
We have (l) = xi
∂w ij
∂e(W )
Computing Gradient (l)
∂wij

(l) ∂e(W )
Define δj := (l)
∂sj
Compute by layer-by-layer:

(l−1) ∂e(W )
δi = (l−1)
∂si
d (l) (l−1)
X ∂e(W ) ∂sj ∂xi
= × ×
j=1 ∂sj
(l) (l−1)
∂xi ∂sil−1
d
(l) (l) (l−1)
X
= δj × wij × θ0 (si ),
j=1

where θ0 (s) = 1 − θ2 (s) for tanh


(l−1) (l−1) 2 Pd (l) (l)
δi = (1 − (xi ) ) j=1 wij δj
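The same recursion, written as one vectorized backward step for θ = tanh; the shapes and numbers below are illustrative, with the bias weights stored as row 0 of the weight matrix as in the earlier sketches.

import numpy as np

def backward_step(delta_l, W, x_prev):
    # δ_i^{(l-1)} = (1 − (x_i^{(l-1)})²) Σ_j w_{ij}^{(l)} δ_j^{(l)}
    # W has shape (d^{(l-1)} + 1, d^{(l)}); W[0] is the bias row, which has no δ.
    return (1.0 - x_prev ** 2) * (W[1:] @ delta_l)

delta_l = np.array([0.2, -0.1])           # δ^{(l)}
W = np.array([[0.1, 0.3],
              [0.5, -0.2],
              [0.4, 0.6]])                # w^{(l)}
x_prev = np.array([0.7, -0.3])            # x^{(l-1)} (without the bias entry)
print(backward_step(delta_l, W, x_prev))  # δ^{(l-1)}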
Final layer

(Assume square loss)

e(W) = (x_1^{(L)} − y_n)²
x_1^{(L)} = θ(s_1^{(L)})

So,

δ_1^{(L)} = ∂e(W)/∂s_1^{(L)}
          = ∂e(W)/∂x_1^{(L)} × ∂x_1^{(L)}/∂s_1^{(L)}
          = 2(x_1^{(L)} − y_n) × θ'(s_1^{(L)})
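As a one-line sketch, this is the seed of the backward pass for the square loss with a tanh output unit (so θ'(s_1^{(L)}) = 1 − (x_1^{(L)})²); the sample values are arbitrary.

import numpy as np

def final_delta(x_L, y_n):
    # δ_1^{(L)} = 2 (x_1^{(L)} − y_n) (1 − (x_1^{(L)})²)
    return 2.0 * (x_L - y_n) * (1.0 - x_L ** 2)

print(final_delta(np.array([0.4]), 1.0))   # -> [-1.008]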
Backward propagation
Backpropagation

SGD for neural networks:

Initialize all weights w_{ij}^{(l)} at random
For iter = 0, 1, 2, · · ·
  Forward: compute all x_j^{(l)} from input to output
  Backward: compute all δ_j^{(l)} from output to input
  Update all the weights: w_{ij}^{(l)} ← w_{ij}^{(l)} − η x_i^{(l−1)} δ_j^{(l)}
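Putting the pieces together, a minimal NumPy sketch of this SGD loop with tanh activations, square loss, and the bias stored as row 0 of each weight matrix. The toy XOR-style data, layer sizes, step size, and iteration count are arbitrary choices, so the quality of the final fit will vary with the random seed.

import numpy as np

def forward(x, weights):
    """Forward pass (θ = tanh); returns all layer outputs x^{(0)}, ..., x^{(L)}."""
    xs = [x]
    for W in weights:
        xs.append(np.tanh(np.concatenate(([1.0], xs[-1])) @ W))
    return xs

def backward(xs, weights, y):
    """Backward pass: δ^{(L)} from the square loss, then the layer-by-layer recursion."""
    deltas = [2.0 * (xs[-1] - y) * (1.0 - xs[-1] ** 2)]                # δ^{(L)}
    for W, x_prev in zip(reversed(weights[1:]), reversed(xs[1:-1])):
        deltas.append((1.0 - x_prev ** 2) * (W[1:] @ deltas[-1]))      # δ^{(l-1)}
    return list(reversed(deltas))                                      # δ^{(1)}, ..., δ^{(L)}

def sgd(data, weights, eta=0.1, iters=1000, seed=0):
    rng = np.random.default_rng(seed)
    for _ in range(iters):
        xn, yn = data[rng.integers(len(data))]                         # pick one example
        xs = forward(xn, weights)
        deltas = backward(xs, weights, yn)
        for l, W in enumerate(weights):                                # w_ij -= η x_i^{(l-1)} δ_j^{(l)}
            x_with_bias = np.concatenate(([1.0], xs[l]))
            W -= eta * np.outer(x_with_bias, deltas[l])
    return weights

# Toy usage: fit XOR-style targets in {-1, +1} with a hypothetical 2-3-1 tanh network.
rng = np.random.default_rng(0)
weights = [rng.normal(scale=0.5, size=(3, 3)), rng.normal(scale=0.5, size=(4, 1))]
data = [(np.array([a, b]), np.array([1.0 if a != b else -1.0]))
        for a in (0.0, 1.0) for b in (0.0, 1.0)]
weights = sgd(data, weights, eta=0.5, iters=5000)
print([forward(x, weights)[-1].round(2) for x, _ in data])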
Backpropagation

Just an automatic way to apply the chain rule to compute gradients

Auto-differentiation (AD): as long as we define the derivative of each basic function, we can use AD to differentiate any of their compositions
Implemented in most deep learning packages (e.g., PyTorch, TensorFlow)
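For comparison, a minimal PyTorch sketch (assuming PyTorch is installed): the same kind of two-layer forward computation is written with differentiable primitives, and loss.backward() applies the chain rule automatically, filling each weight tensor's .grad field. The network shape, input, and target are made up for illustration.

import torch

W1 = torch.randn(3, 3, requires_grad=True)   # layer 1 (bias row included)
W2 = torch.randn(4, 1, requires_grad=True)   # layer 2

xn = torch.tensor([0.3, -1.2])
yn = torch.tensor([1.0])

x1 = torch.tanh(torch.cat([torch.ones(1), xn]) @ W1)    # hidden layer
out = torch.tanh(torch.cat([torch.ones(1), x1]) @ W2)   # h(x)
loss = (out - yn).pow(2).sum()                          # square loss

loss.backward()                       # backpropagation, done automatically
print(W1.grad.shape, W2.grad.shape)   # ∂e/∂W1 and ∂e/∂W2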
Conclusions

Next class: LFD 4.1

Questions?
