Lecture: Introduction to Neural Networks
Brian Thompson
slides by Philipp Koehn
27 September 2018
[Figure: 2×2 grid of classifications: good/bad in one row, bad/good in the other]
• Popular choices
  – tanh(x)
  – sigmoid(x) = 1 / (1 + e^(−x))
  – relu(x) = max(0, x)
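As a minimal sketch (not from the slides), the three activations in Python; the function names are my own:

```python
import math

def sigmoid(x):
    # logistic function: squashes any real value into (0, 1)
    return 1.0 / (1.0 + math.exp(-x))

def tanh(x):
    # hyperbolic tangent: squashes into (-1, 1)
    return math.tanh(x)

def relu(x):
    # rectified linear unit: keeps positive values, clips negatives to 0
    return max(0.0, x)

print(sigmoid(0.0), tanh(1.0), relu(-3.0))   # 0.5 0.7615... 0.0
```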
Example

[Figure: a simple feed-forward network with two input nodes and a bias node, two hidden nodes and a bias node, and one output node; weights into the first hidden node: 3.7, 3.7, −1.5; weights into the second hidden node: 2.9, 2.9, −4.5; weights into the output node: 4.5, −5.2, −2.0]

[Figure: the same network with the sample input values 1.0 and 0.0 (bias nodes fixed at 1)]
• Try out two input values

  sigmoid(1.0 × 3.7 + 0.0 × 3.7 + 1 × −1.5) = sigmoid(2.2) = 1 / (1 + e^(−2.2)) = 0.90

  sigmoid(1.0 × 2.9 + 0.0 × 2.9 + 1 × −4.5) = sigmoid(−1.6) = 1 / (1 + e^(1.6)) = 0.17
[Figure: the network with the computed hidden node values .90 and .17]
• Compute the output value

  sigmoid(.90 × 4.5 + .17 × −5.2 + 1 × −2.0) = sigmoid(1.17) = 1 / (1 + e^(−1.17)) = 0.76
[Figure: the network with the computed output value .76]
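The whole forward pass can be checked with a few lines of Python; the weight layout below (two input weights plus a bias weight per node) is my own bookkeeping, and the printed values match the .90, .17 and .76 above up to rounding:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

# inputs and bias
x = [1.0, 0.0, 1.0]             # [first input, second input, bias]

# weights into the two hidden nodes: [from input 1, from input 2, from bias]
w_D = [3.7, 3.7, -1.5]
w_E = [2.9, 2.9, -4.5]
# weights into the output node: [from hidden 1, from hidden 2, from bias]
w_G = [4.5, -5.2, -2.0]

h_D = sigmoid(sum(wi * xi for wi, xi in zip(w_D, x)))   # sigmoid(2.2)  ≈ 0.90
h_E = sigmoid(sum(wi * xi for wi, xi in zip(w_E, x)))   # sigmoid(-1.6) ≈ 0.17
y   = sigmoid(sum(wi * hi for wi, hi in zip(w_G, [h_D, h_E, 1.0])))  # ≈ 0.76

print(round(h_D, 2), round(h_E, 2), round(y, 2))        # 0.9 0.17 0.76
```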
[Figure: a biological neuron, with labeled dendrites, soma, nucleus, axon, and axon terminals]
• Neurons receive electric signals at the dendrites and send them to the axon
• The axon of the neuron is connected to the dendrites of many other neurons
[Figure: a synapse, with labeled axon terminal, synaptic vesicles, neurotransmitter, neurotransmitter transporter, voltage-gated Ca++ channel, synaptic cleft, receptors, postsynaptic density, and dendrite]
• Similarities
– Neurons, connections between neurons
– Learning = change of connections, not change of neurons
– Massive parallel processing
Back-Propagation Training
[Figure: the example network with computed values: inputs 1.0 and 0.0, hidden values .90 and .17, output .76]
• Gradient descent
– error is a function of the weights
– we want to reduce the error
– gradient descent: move towards the error minimum
– compute gradient → get direction to the error minimum
– adjust weights towards direction of lower error
• Back-propagation
– first adjust last set of weights
– propagate error back to each previous layer
– adjust their weights
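As an illustration of the gradient descent loop (a toy one-dimensional error function of my own choosing, not the network's error):

```python
# toy example: error(w) = (w - 3)^2 has its minimum at w = 3
def error(w):
    return (w - 3.0) ** 2

def gradient(w):
    # derivative of (w - 3)^2
    return 2.0 * (w - 3.0)

w = 0.0            # initial weight
mu = 0.1           # learning rate
for _ in range(50):
    w -= mu * gradient(w)   # move against the gradient, towards lower error
print(round(w, 3))           # ≈ 3.0, the error minimum
```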
[Figure: error(λ) curve, with the gradient at the current λ pointing towards the optimal λ]

[Figure: gradients for w1 and w2 combined into a single gradient direction]
• Sigmoid: sigmoid(x) = 1 / (1 + e^(−x))

• Derivative:

  d sigmoid(x) / dx = d/dx ( 1 / (1 + e^(−x)) )

  = ( 0 × (1 + e^(−x)) − (−e^(−x)) ) / (1 + e^(−x))^2

  = ( 1 / (1 + e^(−x)) ) × ( e^(−x) / (1 + e^(−x)) )

  = ( 1 / (1 + e^(−x)) ) × ( 1 − 1 / (1 + e^(−x)) )

  = sigmoid(x) (1 − sigmoid(x))
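A quick numerical sanity check of this identity, comparing a finite-difference estimate of the derivative with sigmoid(x)(1 − sigmoid(x)); the code is illustrative only:

```python
import math

def sigmoid(x):
    return 1.0 / (1.0 + math.exp(-x))

x, eps = 0.7, 1e-6
finite_diff = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
analytic    = sigmoid(x) * (1.0 - sigmoid(x))
print(finite_diff, analytic)   # both ≈ 0.2217
```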
• Linear combination of weights: s = Σk wk hk

• Activation function: y = sigmoid(s)

• Error: E = ½ (t − y)^2

• Chain rule:

  dE/dwk = (dE/dy) (dy/ds) (ds/dwk)

• Error term:

  dE/dy = d/dy ( ½ (t − y)^2 ) = −(t − y)

• Sigmoid term:

  dy/ds = d sigmoid(s)/ds = sigmoid(s) (1 − sigmoid(s)) = y (1 − y)

• Weight term:

  ds/dwk = d/dwk ( Σk wk hk ) = hk

• Putting it together:

  dE/dwk = (dE/dy) (dy/ds) (ds/dwk) = −(t − y) y (1 − y) hk
• Weight update, with learning rate µ:

  ∆wk = µ (t − y) y′ hk

  – error: (t − y)
  – derivative of sigmoid: y′ = y (1 − y)
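Applied to the example network, the final-layer update looks roughly like the sketch below (learning rate µ = 10 as in the worked example further below; the target t = 1 is assumed here so that the numbers match the δ ≈ .043 used there; variable names are mine):

```python
mu, t = 10.0, 1.0                  # learning rate and (assumed) target output
h = [0.90, 0.17, 1.0]              # hidden node values and the bias
w_G = [4.5, -5.2, -2.0]            # current weights into the output node
y = 0.76                           # output computed in the forward pass

delta_G = (t - y) * y * (1.0 - y)  # error times derivative of sigmoid, ≈ 0.0438
w_new = [w + mu * delta_G * h_k for w, h_k in zip(w_G, h)]
print([round(w, 3) for w in w_new])  # ≈ [4.894, -5.126, -1.562]
```

Up to rounding of the intermediate values, these are the updated output-layer weights shown in the network figure below.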
• Error is computed over all output nodes j:

  E = Σj ½ (tj − yj)^2

• But we can compute how much each node contributed to downstream error:

  δj = (tj − yj) y′j
[Figure: the example network with labeled nodes: inputs A = 1.0 and B = 0.0, bias C = 1, hidden nodes D = .90 and E = .17 with bias F = 1, output node G = .76]
[Figure: the example network with the updated output-layer weights D→G = 4.891, E→G = −5.126, and F→G = −1.566]
• Hidden node D
  – δD = ( Σj wj←D δj ) y′D = wG←D δG y′D = 4.5 × .0434 × .0898 = .0175
  – ∆wDA = µ δD hA = 10 × .0175 × 1.0 = .175
  – ∆wDB = µ δD hB = 10 × .0175 × 0.0 = 0
  – ∆wDC = µ δD hC = 10 × .0175 × 1 = .175
• Hidden node E
  – δE = ( Σj wj←E δj ) y′E = wG←E δG y′E = −5.2 × .0434 × .1411 = −.0318
  – ∆wEA = µ δE hA = 10 × −.0318 × 1.0 = −.318
  – etc.
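The same hidden-layer computation as a short Python sketch (variable names are mine); it reproduces the δ and ∆w values above up to rounding:

```python
mu = 10.0                          # learning rate
x = [1.0, 0.0, 1.0]                # node values h_A, h_B and the bias h_C
h_D, h_E = 0.90, 0.17              # hidden node values from the forward pass
delta_G = 0.0434                   # output node error term
w_GD, w_GE = 4.5, -5.2             # weights from D and E into the output node G

# error terms of the hidden nodes: downstream error term scaled by the weight,
# times the derivative of the sigmoid at the hidden node
delta_D = w_GD * delta_G * h_D * (1.0 - h_D)   # ≈ 0.0176
delta_E = w_GE * delta_G * h_E * (1.0 - h_E)   # ≈ -0.0318

# weight updates for the three connections into hidden node D
dw_D = [mu * delta_D * x_k for x_k in x]       # ≈ [0.176, 0.0, 0.176]
print(round(delta_D, 4), round(delta_E, 4), [round(d, 3) for d in dw_D])
```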
[Figure: error(λ) curves illustrating the effect of bad initialization and a local vs. global optimum]
• Remember the weight updates from the previous iteration: ∆wj←k (n − 1)
• ... and add these to any new updates (with decay factor ρ)
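A minimal sketch of the momentum idea (decay factor ρ; the variable names and the specific values are my own):

```python
rho, mu = 0.9, 0.1        # decay factor for the previous update, learning rate

def momentum_step(w, grad, prev_update):
    # combine the decayed previous update with the new (negative) gradient step
    update = rho * prev_update - mu * grad
    return w + update, update

w, prev = 0.5, 0.0
w, prev = momentum_step(w, grad=0.2, prev_update=prev)   # first step: -mu*grad
w, prev = momentum_step(w, grad=0.2, prev_update=prev)   # second step is larger
```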
• Adagrad update, based on the gradient of the error E with respect to the weight w at time step t:

  gt = dE/dw

  ∆wt = ( µ / √( Στ=1..t gτ^2 ) ) gt
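A small sketch of this Adagrad update: accumulate the squared gradients per weight and scale the step by their square root (the gradient values are made up for illustration):

```python
import math

mu = 0.1                 # global learning rate
sum_sq = 0.0             # running sum of squared gradients for this weight
w = 0.5

for g in [0.4, -0.2, 0.1]:              # gradients dE/dw at successive steps
    sum_sq += g * g
    w -= mu / math.sqrt(sum_sq) * g     # adaptively scaled step against the gradient
```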
Computational Aspects
• Forward computation: s = W h (weight matrix W applied to the vector of node values h)
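With numpy, the hidden layer of the running example is one matrix-vector product followed by an elementwise sigmoid (an illustrative sketch, not code from the slides):

```python
import numpy as np

W = np.array([[3.7, 3.7, -1.5],     # weights into hidden node D (incl. bias)
              [2.9, 2.9, -4.5]])    # weights into hidden node E (incl. bias)
h = np.array([1.0, 0.0, 1.0])       # input vector with the bias appended

s = W @ h                           # vector of weighted sums: [ 2.2, -1.6]
y = 1.0 / (1.0 + np.exp(-s))        # elementwise sigmoid:     [0.90, 0.17]
print(s, y.round(2))
```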
Toolkits

• Theano
• Tensorflow (Google)
• PyTorch (Facebook)
• MXNet (Amazon)
• DyNet