Neural Networks
Philipp Koehn
12 November 2015
● Bayes rule: p(C|A) = 1/Z × p(A|C) × p(C)

● Independence assumption, with feature weights λ_i: p(A|C) = ∏_i p(a_i|C)^{λ_i}
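To make the weighted model concrete, here is a minimal Python sketch; the class priors, feature likelihoods, and weights are made-up illustrative values, not from the slides:

```python
# Hypothetical class priors and per-feature likelihoods p(a_i | C)
prior = {"good": 0.6, "bad": 0.4}
likelihood = {"good": [0.8, 0.3], "bad": [0.2, 0.9]}
weights = [1.0, 0.5]                 # feature weights lambda_i

def score(c):
    # unnormalized p(C|A) = p(C) * prod_i p(a_i|C)^lambda_i
    s = prior[c]
    for p, lam in zip(likelihood[c], weights):
        s *= p ** lam
    return s

scores = {c: score(c) for c in prior}
Z = sum(scores.values())             # normalization constant
print({c: s / Z for c, s in scores.items()})
```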
[Figure: data points whose classes are arranged good/bad and bad/good in opposite corners (an XOR pattern)]
● Popular choices: tanh(x) and sigmoid(x) = 1/(1 + e^{−x})

[Figure: plots of tanh(x) and sigmoid(x)]
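A quick numerical sketch of the two activation functions in plain NumPy (the sample points are arbitrary):

```python
import numpy as np

def sigmoid(x):
    # squashes any real value into (0, 1)
    return 1.0 / (1.0 + np.exp(-x))

xs = np.array([-2.0, 0.0, 2.0])
print(np.tanh(xs))    # [-0.96  0.    0.96], range (-1, 1)
print(sigmoid(xs))    # [ 0.12  0.5   0.88], range (0, 1)
```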
Example

[Figure: example feed-forward network: two input nodes (set to 1.0 and 0.0), two hidden nodes, one output node; weights into the hidden nodes are 3.7, 3.7 (bias weight −1.5) and 2.9, 2.9 (bias weight −4.5); weights into the output node are 4.5 and −5.2 (bias weight −2.0); bias inputs are fixed at 1]
● Try out two input values
sigmoid(1.0 × 3.7 + 0.0 × 3.7 + 1 × −1.5) = sigmoid(2.2) = 1/(1 + e^{−2.2}) = 0.90

sigmoid(1.0 × 2.9 + 0.0 × 2.9 + 1 × −4.5) = sigmoid(−1.6) = 1/(1 + e^{1.6}) = 0.17
sigmoid(.90 × 4.5 + .17 × −5.2 + 1 × −2.0) = sigmoid(1.17) = 1/(1 + e^{−1.17}) = 0.76
[Figure: the example network with all computed values: hidden nodes .90 and .17, output node .76]
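To make the worked example concrete, here is a minimal NumPy sketch of the same forward pass; the weights and biases are the values from the figure, while the variable names are my own:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

# Weights from the example figure: rows = hidden nodes, columns = input nodes
W_hidden = np.array([[3.7, 3.7],
                     [2.9, 2.9]])
b_hidden = np.array([-1.5, -4.5])     # bias weights (bias input fixed at 1)
W_output = np.array([4.5, -5.2])
b_output = -2.0

x = np.array([1.0, 0.0])              # the two input values tried above

h = sigmoid(W_hidden @ x + b_hidden)  # -> [0.90, 0.17]
y = sigmoid(W_output @ h + b_output)  # -> 0.76
print(h, y)
```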
[Figure: biological neuron: dendrites, soma, nucleus, axon, axon terminals]
● Neurons receive electric signals at the dendrites and send them to the axon
● The axon of the neuron is connected to the dendrites of many other neurons
[Figure: synapse: axon terminal with synaptic vesicles, neurotransmitter, neurotransmitter transporter, voltage-gated Ca++ channel, synaptic cleft, receptor, and postsynaptic density on the dendrite]
● Similarities
– Neurons, connections between neurons
– Learning = change of connections, not change of neurons
– Massively parallel processing
Back-Propagation Training
[Figure: the example network again, with computed values .90, .17, and output .76]
● Gradient descent
– error is a function of the weights
– we want to reduce the error
– gradient descent: move towards the error minimum
– compute gradient → get direction to the error minimum
– adjust weights towards direction of lower error (see the sketch after this list)
● Back-propagation
– first adjust last set of weights
– propagate error back to each previous layer
– adjust their weights
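A minimal sketch of gradient descent on a one-dimensional error function; the function and learning rate are illustrative assumptions, not from the slides:

```python
def error(w):
    # illustrative error function with its minimum at w = 3
    return (w - 3.0) ** 2

def gradient(w):
    # derivative of the error with respect to the weight
    return 2.0 * (w - 3.0)

w, mu = 0.0, 0.1           # initial weight, learning rate
for _ in range(50):
    w -= mu * gradient(w)  # move against the gradient, towards the minimum
print(w, error(w))         # w approaches 3, error approaches 0
```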
● Sigmoid: sigmoid(x) = 1/(1 + e^{−x})

● Derivative:

d sigmoid(x)/dx = d/dx (1/(1 + e^{−x}))

= (0 × (1 + e^{−x}) − (−e^{−x})) / (1 + e^{−x})²

= 1/(1 + e^{−x}) × e^{−x}/(1 + e^{−x})

= 1/(1 + e^{−x}) × (1 − 1/(1 + e^{−x}))

= sigmoid(x) (1 − sigmoid(x))
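A quick numerical check of this identity, comparing a finite-difference estimate of the derivative against sigmoid(x)(1 − sigmoid(x)); the evaluation point and step size are arbitrary:

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

x, eps = 0.7, 1e-6
numeric = (sigmoid(x + eps) - sigmoid(x - eps)) / (2 * eps)
analytic = sigmoid(x) * (1 - sigmoid(x))
print(numeric, analytic)   # both ~ 0.2217
```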
● Error gradient by the chain rule:

dE/dw_k = dE/dy × dy/ds × ds/dw_k

● First factor: dE/dy = d/dy 1/2 (t − y)² = −(t − y)

● Second factor: dy/ds = d sigmoid(s)/ds = sigmoid(s)(1 − sigmoid(s)) = y(1 − y)

● Third factor: ds/dw_k = d/dw_k ∑_k w_k h_k = h_k

● Putting it together: dE/dw_k = −(t − y) × y(1 − y) × h_k
● Weight update, scaled by a fixed learning rate µ:

∆w_k = µ (t − y) y′ h_k

– (t − y): the error

– y′: the derivative of the sigmoid, y(1 − y)
● Error is computed over all output nodes j: E = ∑_j 1/2 (t_j − y_j)²

● But we can compute how much each node contributed to downstream error

– error term of an output node: δ_j = (t_j − y_j) y′_j

– error term of a hidden node: δ_i = (∑_j w_{j←i} δ_j) y′_i
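In code form, these two error terms might look like the following minimal sketch (the helper names are my own):

```python
def delta_output(t, y):
    # error term for an output node: (t - y) * y', with y' = y(1 - y) for sigmoid
    return (t - y) * y * (1.0 - y)

def delta_hidden(downstream, y):
    # error term for a hidden node: weighted sum of downstream deltas, times y'
    # `downstream` is a list of (weight to downstream node, delta of that node)
    return sum(w * d for w, d in downstream) * y * (1.0 - y)

print(delta_output(1.0, 0.76))              # ~ .0438
print(delta_hidden([(4.5, 0.0438)], 0.90))  # ~ .0177
```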
[Figure: the example network with nodes labeled A, B (inputs), C (bias), D, E (hidden), F (bias), G (output), and the values computed above]
[Figure: the same network after updating the output-layer weights: w_GD 4.5 → 4.891, w_GE −5.2 → −5.126, w_GF −2.0 → −1.566]
● Hidden node D

– δ_D = (∑_j w_{j←D} δ_j) y′_D = w_GD δ_G y′_D = 4.5 × .0434 × .0898 = .0175

– ∆w_DA = µ δ_D h_A = 10 × .0175 × 1.0 = .175

– ∆w_DB = µ δ_D h_B = 10 × .0175 × 0.0 = 0

– ∆w_DC = µ δ_D h_C = 10 × .0175 × 1 = .175

● Hidden node E

– δ_E = (∑_j w_{j←E} δ_j) y′_E = w_GE δ_G y′_E = −5.2 × .0434 × .2055 = −.0464

– ∆w_EA = µ δ_E h_A = 10 × −.0464 × 1.0 = −.464

– etc. (one full update step is sketched in code below)
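Putting the pieces together, here is a minimal sketch of one full back-propagation step for the example network, with µ = 10 as in the slides (computed values may differ slightly from the slide numbers, which use rounded intermediate results):

```python
import numpy as np

def sigmoid(x):
    return 1.0 / (1.0 + np.exp(-x))

mu = 10.0                                # learning rate used in the slides
x = np.array([1.0, 0.0])                 # inputs A, B (bias inputs fixed at 1)
W_h = np.array([[3.7, 3.7], [2.9, 2.9]])
b_h = np.array([-1.5, -4.5])
W_o = np.array([4.5, -5.2])
b_o = -2.0
t = 1.0                                  # target output

# Forward pass (as computed earlier): h ~ [.90, .17], y ~ .76
h = sigmoid(W_h @ x + b_h)
y = sigmoid(W_o @ h + b_o)

# Error terms: output node G, then hidden nodes D and E
delta_G = (t - y) * y * (1 - y)          # ~ .042 (slides round to .0434)
delta_h = W_o * delta_G * h * (1 - h)    # hidden-node error terms delta_D, delta_E

# Weight updates
W_o += mu * delta_G * h                  # w_GD: 4.5 -> ~4.88 (slides: 4.891)
b_o += mu * delta_G                      # bias weight of the output node
W_h += mu * np.outer(delta_h, x)
b_h += mu * delta_h
print(delta_G, delta_h)
```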
● Remember the weight updates from the previous iteration: ∆w_{j←k}(n − 1)

● ... and add these to any new updates (with decay factor ρ):

∆w_{j←k}(n) = µ δ_j h_k + ρ ∆w_{j←k}(n − 1)
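A minimal sketch of this momentum-style update; all numeric values are illustrative:

```python
mu, rho = 10.0, 0.9        # learning rate and decay factor (illustrative)
prev_update = 0.175        # weight update from the previous iteration
grad_term = 0.0175 * 1.0   # current delta_j * h_k

update = mu * grad_term + rho * prev_update  # new update blends in the old one
print(update)
```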
Computational Aspects
● Forward computation: s⃗ = W h⃗

● Weight updates: ∆W = µ δ⃗ h⃗^T

● Computations such as W h⃗ require 200 × 200 = 40,000 multiplications (for layers with 200 nodes each)
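In NumPy, both operations are single matrix expressions; the sizes and values below are arbitrary:

```python
import numpy as np

rng = np.random.default_rng(0)
W = rng.normal(size=(200, 200))   # weights between two layers of 200 nodes
h = rng.normal(size=200)
delta = rng.normal(size=200)
mu = 0.01

s = W @ h                         # forward: 200 x 200 = 40,000 multiplications
W += mu * np.outer(delta, h)      # update: mu times the outer product delta h^T
```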
● Theano homepage: http://deeplearning.net/software/theano/
● Used to implement
– neural network language models
– neural machine translation (Bahdanau et al., 2015)