
Artificial Neural Networks

What are they?
Inspired by the human brain.
The human brain has about 86 billion neurons and uses about 20% of your body's energy.
These neurons are connected by somewhere between 100 trillion and 1 quadrillion synapses!
What are they?
[Diagram: a human and a machine, each turning input into output.]
What are they?
1. Originally developed by Warren McCulloch and Walter Pitts (1943)
2. First trainable network, the perceptron, proposed by Frank Rosenblatt (1957)
3. Started off as an unsupervised learning tool
   1. Had problems with computing time and could not compute XOR
   2. Was abandoned in favor of other algorithms
4. Werbos's (1975) backpropagation algorithm
   1. Incorporated supervision and solved XOR
   2. But was still too slow compared with other algorithms, e.g., Support Vector Machines
5. Backpropagation was accelerated by GPUs in 2010 and shown to be more efficient and cost-effective
GPUs
GPUs handle parallel operations much better (thousands of concurrent threads) but run at lower clock speeds than CPUs. However, the matrix multiplication steps in ANNs can be run in parallel, resulting in considerable time and cost savings. The best CPUs handle about 50 GB/s of memory bandwidth, while the best GPUs handle about 750 GB/s.

                    CPU (Core i9 X-series)    GPU (GeForce GTX 1080)
Cores               18 (36 threads)           2560
Clock speed (GHz)   4.4                       1.6
Memory              Shared                    8 GB
Price ($)           1799                      549
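The kind of operation involved is easy to see in code. Below is a minimal NumPy sketch (the array sizes are illustrative assumptions, not taken from the slides): applying one fully connected layer to a whole batch of inputs is a single matrix multiplication, which is exactly the kind of work a GPU parallelizes well.

```python
import numpy as np

# Illustrative sizes (assumptions): 64 samples, 1000 inputs, 500 hidden units.
X = np.random.rand(64, 1000)   # batch of inputs, one row per sample
W = np.random.rand(1000, 500)  # weights of one fully connected layer
b = np.random.rand(500)        # biases

# One layer applied to the whole batch is a single matrix multiplication.
Z = X @ W + b                  # shape (64, 500)
H = 1.0 / (1.0 + np.exp(-Z))   # sigmoid activation, applied elementwise
print(H.shape)                 # (64, 500)
```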


Neuron
Inputs: x1, x2, and a bias input x0
Weights: w0 = −10, w1 = 20, w2 = 20
Weighted sum: z = w0·x0 + w1·x1 + w2·x2 = Σ_{i=0}^{2} wi·xi
Output: y = g(z) = g(w0·x0 + w1·x1 + w2·x2), where g is the activation function
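As a quick sketch (Python is an assumption; the slides give no code), the neuron above follows directly from y = g(w0·x0 + w1·x1 + w2·x2). With the bias input x0 fixed at 1 and these particular weights, it outputs ≈1 whenever x1 or x2 is 1, so it behaves like a logical OR.

```python
import math

def neuron(x1, x2, w0=-10.0, w1=20.0, w2=20.0):
    """Single neuron: weighted sum of the inputs (x0 = 1 is the bias input) passed through a sigmoid."""
    x0 = 1.0                            # bias input
    z = w0 * x0 + w1 * x1 + w2 * x2     # z = sum over i of w_i * x_i
    return 1.0 / (1.0 + math.exp(-z))   # y = g(z), sigmoid activation

for x1, x2 in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    print(x1, x2, round(neuron(x1, x2), 3))   # ~0, ~1, ~1, ~1  -> acts like OR
```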
Activation function
For the same neuron, y = g(z) with z = w0·x0 + w1·x1 + w2·x2 = Σ_{i=0}^{2} wi·xi.
Here g is the sigmoid: g(z) = 1 / (1 + e^(−z))
Activation Functions
■ Sigmoid ■ Hyperbolic tangent

2
1 ta n h(z) =
1 + e −2z
−1
g(z) =
1 + e −z

■ ReLU ■ Leaky ReLU

{z for z ≥ 0
r(z) = ma x(0,z) αz for z < 0
lr(z) =
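A minimal Python sketch of these four activations (the language choice and the default α are assumptions):

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def tanh(z):
    # Same function as math.tanh(z), written as on the slide
    return 2.0 / (1.0 + math.exp(-2.0 * z)) - 1.0

def relu(z):
    return max(0.0, z)

def leaky_relu(z, alpha=0.01):   # alpha is an assumed small slope for z < 0
    return z if z >= 0 else alpha * z

print(sigmoid(0.0), tanh(1.0), relu(-2.0), leaky_relu(-2.0))
```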
Neural Network
[Diagram: inputs x1, x2, x3 feed the input layer; one or more hidden layers (with biases b0, b1) feed the output layer.]
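As a rough sketch of this layered structure (the layer sizes and random weights are purely illustrative assumptions), a forward pass is just the single-neuron computation repeated layer by layer:

```python
import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
x = np.array([0.5, 0.2, 0.9])                           # inputs x1, x2, x3

W1, b1 = rng.normal(size=(4, 3)), rng.normal(size=4)    # input layer -> first hidden layer
W2, b2 = rng.normal(size=(4, 4)), rng.normal(size=4)    # first hidden layer -> second hidden layer
W3, b3 = rng.normal(size=(1, 4)), rng.normal(size=1)    # second hidden layer -> output layer

h1 = sigmoid(W1 @ x + b1)    # first hidden layer activations
h2 = sigmoid(W2 @ h1 + b2)   # second hidden layer activations
y  = sigmoid(W3 @ h2 + b3)   # network output
print(y)
```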


XOR Gate
Truth table (inputs x, y; target za):
x   y   za
0   0   0
0   1   1
1   0   1
1   1   0

Network: x and y feed two hidden neurons, z0 → h0 (weight w0 from x, w2 from y) and z1 → h1 (weight w1 from x, w3 from y); h0 and h1 feed the output neuron z2 → h2 = ẑ (weights w4, w5).
Initial weights: w0 = 0.4, w1 = 0.6, w2 = 0.3, w3 = 0.8, w4 = 0.7, w5 = 0.1
output = h2 = ẑ
Feedforward propagation
Training example: x = 1, y = 1, target za = 0.  Activation function: g(z) = 1 / (1 + e^(−z))

z0 = w0·x + w2·y = 0.4 × 1 + 0.3 × 1 = 0.7
h0 = g(z0) = 1 / (1 + e^(−0.7)) = 0.67

z1 = w1·x + w3·y = 0.6 × 1 + 0.8 × 1 = 1.4
h1 = g(z1) = 1 / (1 + e^(−1.4)) = 0.8

z2 = w4·h0 + w5·h1 = 0.7 × 0.67 + 0.1 × 0.8 = 0.55
h2 = g(z2) = 1 / (1 + e^(−0.55)) = 0.63
output ẑ = h2 = 0.63

Mean Squared Error, E = ½(ẑ − za)² = ½(0.63 − 0)² = 0.20
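The same forward pass in a short Python sketch (weights and inputs as on the slide; the printed values should match the hand calculation up to rounding):

```python
import math

def g(z):                        # sigmoid activation
    return 1.0 / (1.0 + math.exp(-z))

# Initial weights from the slides
w0, w1, w2, w3, w4, w5 = 0.4, 0.6, 0.3, 0.8, 0.7, 0.1

x, y, za = 1.0, 1.0, 0.0         # training example: XOR(1, 1) = 0

z0 = w0 * x + w2 * y             # 0.7
h0 = g(z0)                       # ~0.67
z1 = w1 * x + w3 * y             # 1.4
h1 = g(z1)                       # ~0.80
z2 = w4 * h0 + w5 * h1           # ~0.55
h2 = g(z2)                       # ~0.63, the prediction z-hat

E = 0.5 * (h2 - za) ** 2         # mean squared error, ~0.20
print(h0, h1, h2, E)
```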
Backpropagation
With E = ½(ẑ − za)² and the sigmoid derivative
g′(z) = ∂/∂z [1 / (1 + e^(−z))] = g(z)[1 − g(z)],
the error on this example is E = ½(h2 − za)² = ½(0.63 − 0)² = 0.20.

Gradient for the output-layer weight w4 (chain rule):
∂E/∂w4 = ∂E/∂h2 · ∂h2/∂z2 · ∂z2/∂w4 = 0.63 × 0.233 × 0.67 = 0.098

∂E/∂h2 = ∂/∂h2 [½(h2 − za)²] = h2 − za = 0.63
∂h2/∂z2 = ∂/∂z2 g(z2) = g(z2)[1 − g(z2)] = h2(1 − h2) = 0.63 × 0.37 = 0.233
∂z2/∂w4 = ∂/∂w4 [w4·h0 + w5·h1] = h0 = 0.67
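The same chain-rule product in a short, self-contained Python sketch (values as in the worked example):

```python
import math

def g(z):
    return 1.0 / (1.0 + math.exp(-z))

# Forward-pass values for x = y = 1, target za = 0 (weights as on the slides)
w4, w5 = 0.7, 0.1
h0, h1 = g(0.7), g(1.4)              # ~0.67, ~0.80
h2 = g(w4 * h0 + w5 * h1)            # ~0.63
za = 0.0

# Chain rule: dE/dw4 = dE/dh2 * dh2/dz2 * dz2/dw4
dE_dh2  = h2 - za                    # ~0.63
dh2_dz2 = h2 * (1.0 - h2)            # g'(z2) = g(z2)(1 - g(z2)), ~0.23
dz2_dw4 = h0                         # ~0.67
print(dE_dh2 * dh2_dz2 * dz2_dw4)    # dE/dw4 ~0.098
```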
Backpropagation
Gradient for the output-layer weight w5 (∂E/∂h2 and ∂h2/∂z2 as before):
∂E/∂w5 = ∂E/∂h2 · ∂h2/∂z2 · ∂z2/∂w5 = 0.63 × 0.233 × 0.8 = 0.117
∂z2/∂w5 = ∂/∂w5 [w4·h0 + w5·h1] = h1 = 0.8
Backpropagation
Gradient for the hidden-layer weight w0:
∂E/∂w0 = ∂E/∂h0 · ∂h0/∂z0 · ∂z0/∂w0 = 0.103 × 0.221 × 1 = 0.023

∂E/∂h0 = ∂E/∂h2 · ∂h2/∂z2 · ∂z2/∂h0 = (h2 − za) × h2(1 − h2) × w4 = 0.63 × 0.233 × 0.7 = 0.103
∂h0/∂z0 = g(z0)[1 − g(z0)] = h0(1 − h0) = 0.67 × 0.33 = 0.221
∂z0/∂w0 = ∂/∂w0 [w0·x + w2·y] = x = 1
Backpropagation
Gradient for the hidden-layer weight w2 (∂E/∂h0 and ∂h0/∂z0 as before):
∂E/∂w2 = ∂E/∂h0 · ∂h0/∂z0 · ∂z0/∂w2 = 0.103 × 0.221 × 1 = 0.023
∂z0/∂w2 = ∂/∂w2 [w0·x + w2·y] = y = 1
Backpropagation
Gradient for the hidden-layer weight w1:
∂E/∂w1 = ∂E/∂h1 · ∂h1/∂z1 · ∂z1/∂w1 = 0.015 × 0.16 × 1 = 0.0024

∂E/∂h1 = ∂E/∂h2 · ∂h2/∂z2 · ∂z2/∂h1 = (h2 − za) × h2(1 − h2) × w5 = 0.63 × 0.233 × 0.1 = 0.015
∂h1/∂z1 = g(z1)[1 − g(z1)] = h1(1 − h1) = 0.8 × 0.2 = 0.16
∂z1/∂w1 = ∂/∂w1 [w1·x + w3·y] = x = 1
Backpropagation
Gradient for the hidden-layer weight w3 (∂E/∂h1 and ∂h1/∂z1 as before):
∂E/∂w3 = ∂E/∂h1 · ∂h1/∂z1 · ∂z1/∂w3 = 0.015 × 0.16 × 1 = 0.0024
∂z1/∂w3 = ∂/∂w3 [w1·x + w3·y] = y = 1
Gradient descent weight updates
Summary of one training step on the example x = 1, y = 1, za = 0:

Feedforward:
z0 = w0·x + w2·y = 0.4 × 1 + 0.3 × 1 = 0.7,    h0 = g(z0) = 0.67
z1 = w1·x + w3·y = 0.6 × 1 + 0.8 × 1 = 1.4,    h1 = g(z1) = 0.8
z2 = w4·h0 + w5·h1 = 0.7 × 0.67 + 0.1 × 0.8 = 0.55,    h2 = g(z2) = ẑ = 0.63

Backpropagation:
∂E/∂w4 = ∂E/∂h2 · ∂h2/∂z2 · ∂z2/∂w4 = 0.098      ∂E/∂w5 = ∂E/∂h2 · ∂h2/∂z2 · ∂z2/∂w5 = 0.117
∂E/∂w0 = ∂E/∂h0 · ∂h0/∂z0 · ∂z0/∂w0 = 0.023      ∂E/∂w2 = ∂E/∂h0 · ∂h0/∂z0 · ∂z0/∂w2 = 0.023
∂E/∂w1 = ∂E/∂h1 · ∂h1/∂z1 · ∂z1/∂w1 = 0.0024     ∂E/∂w3 = ∂E/∂h1 · ∂h1/∂z1 · ∂z1/∂w3 = 0.0024

Gradient descent weight update (learning rate α = 0.01):
w4 = w4 − α·∂E/∂w4;  w5 = w5 − α·∂E/∂w5;  w0 = w0 − α·∂E/∂w0;
w1 = w1 − α·∂E/∂w1;  w2 = w2 − α·∂E/∂w2;  w3 = w3 − α·∂E/∂w3
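Putting the three stages together, here is a hedged Python sketch of one complete training step on this example: feedforward, backpropagation of all six gradients, and the gradient descent update with α = 0.01. It follows the slide formulas, and the printed values should match the hand calculation up to rounding.

```python
import math

def g(z):                                   # sigmoid activation
    return 1.0 / (1.0 + math.exp(-z))

# Initial weights and the training example XOR(1, 1) = 0
w0, w1, w2, w3, w4, w5 = 0.4, 0.6, 0.3, 0.8, 0.7, 0.1
x, y, za = 1.0, 1.0, 0.0
alpha = 0.01                                # learning rate

# Feedforward
z0, z1 = w0 * x + w2 * y, w1 * x + w3 * y
h0, h1 = g(z0), g(z1)
z2 = w4 * h0 + w5 * h1
h2 = g(z2)                                  # prediction z-hat
E = 0.5 * (h2 - za) ** 2

# Backpropagation (chain rule, as on the previous slides)
dE_dh2 = h2 - za
dh2_dz2 = h2 * (1.0 - h2)
dE_dw4 = dE_dh2 * dh2_dz2 * h0
dE_dw5 = dE_dh2 * dh2_dz2 * h1
dE_dh0 = dE_dh2 * dh2_dz2 * w4
dE_dh1 = dE_dh2 * dh2_dz2 * w5
dE_dw0 = dE_dh0 * h0 * (1.0 - h0) * x
dE_dw2 = dE_dh0 * h0 * (1.0 - h0) * y
dE_dw1 = dE_dh1 * h1 * (1.0 - h1) * x
dE_dw3 = dE_dh1 * h1 * (1.0 - h1) * y

# Gradient descent weight update
w0 -= alpha * dE_dw0; w1 -= alpha * dE_dw1; w2 -= alpha * dE_dw2
w3 -= alpha * dE_dw3; w4 -= alpha * dE_dw4; w5 -= alpha * dE_dw5

print(E, dE_dw4, dE_dw5, dE_dw0, dE_dw1)
```

Repeating this step over all four rows of the XOR truth table, for many iterations, is what training the network amounts to.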
