
Module 3

Artificial Neural Networks


Neural Network
■ Networks of processing units (neurons) with connections (synapses) between them
Artificial Neural Network
■ Machine Learning model inspired by the networks of biological neurons found in our
brains
■ Powerful and scalable: well suited to large and highly complex Machine Learning tasks, such as
– classifying billions of images (e.g., Google Images),
– powering speech recognition services (e.g., Apple’s Siri),
– recommending the best videos to watch to hundreds of millions of users every
day (e.g., YouTube)
Logical Computations with Neurons

neuron C is activated → only when both neurons A and B are activated (C = A AND B)
Biological Neurons
■ dendrites: nerve fibres carrying electrical signals to the cell
■ cell body: containing the nucleus and most of the cell’s complex components,
computes a non-linear function of its inputs
■ axon: a single long fiber that carries the electrical signal (action potentials) from the cell
body to other neurons
– axon splits off into many branches called telodendria, and at the tip of these
branches are minuscule structures called synaptic terminals
■ synapse: the point of contact between the axon of one cell and the dendrite of
another, regulating a chemical connection (neurotransmitters) whose strength affects
the input to the cell.
Logical Computations with Neurons

neuron C is activated → if either neuron A or neuron B is activated, or both (C = A OR B)
Logical Computations with Neurons

neuron C is activated → if neuron A is active and neuron B is off (C = A AND NOT B)
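These logical computations can be illustrated with a short sketch. The following Python snippet is a minimal illustration (not from the slides; the specific weights and thresholds are assumed choices) of threshold neurons computing AND, OR, and A AND NOT B:

```python
# Minimal sketch of threshold neurons computing the logic functions above.
# Each neuron fires (outputs 1) when its weighted input sum reaches the
# threshold; the weights/thresholds are illustrative choices, not from the slides.

def neuron(inputs, weights, threshold):
    """Simple threshold unit: 1 if the weighted sum >= threshold, else 0."""
    total = sum(w * x for w, x in zip(weights, inputs))
    return 1 if total >= threshold else 0

def AND(a, b):          # C fires only when both A and B fire
    return neuron([a, b], [1, 1], threshold=2)

def OR(a, b):           # C fires when A or B (or both) fire
    return neuron([a, b], [1, 1], threshold=1)

def A_AND_NOT_B(a, b):  # C fires when A fires and B is off
    return neuron([a, b], [1, -1], threshold=1)

for a in (0, 1):
    for b in (0, 1):
        print(a, b, AND(a, b), OR(a, b), A_AND_NOT_B(a, b))
```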
Perceptron
■ Perceptrons are artificial neurons
– a threshold logic unit (TLU), or linear threshold unit (LTU)
■ A Perceptron is simply composed of a single layer of TLUs
■ Main Parameters
– input values, weights and bias, weighted sum, and step function
■ Inputs: x1, x2, x3
■ Each input connection is associated with a weight w
Perceptron
Perceptron
■ A single TLU can be used for simple linear binary classification
– computes a linear combination of the inputs; if the result exceeds a
threshold, it outputs the positive class, otherwise the negative class
– A bias feature x0 = 1 is added
– Training a TLU means finding the right values for w
■ A Perceptron is simply composed of a single layer of TLUs, with each TLU
connected to all the inputs.
■ Fully connected layer, or a dense layer:
– All the neurons in a layer are connected to every neuron in the previous layer
(i.e., its input neurons)
■ The inputs of the Perceptron are fed to special passthrough neurons called input
neurons: they output whatever input they are fed, and together they form the input layer.
■ A bias feature x0 = 1 is generally added; it is represented using a special type of
neuron called a bias neuron, which outputs 1 all the time.
Perceptron

Architecture of a Perceptron with two input neurons, one bias neuron, and three output neurons
Perceptron
■ Computing the outputs of a fully connected layer
hW,b (X) = Φ (XW + b)
■ X - the matrix of input features. It has one row per instance and one
column per feature
■ W - all the connection weights except from the bias neuron. It has one row
per input neuron and one column per artificial neuron in the layer
■ b - the bias vector: contains all the connection weights between the bias neuron
and the artificial neurons
■ Φ - activation function: when the artificial neurons are TLUs, it is a step
function
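As a concrete illustration, here is a small sketch (assuming NumPy; the weight and input values are made up for the example) of computing hW,b(X) = Φ(XW + b) for one fully connected layer of TLUs:

```python
# Minimal sketch of one fully connected layer of TLUs: h_{W,b}(X) = phi(XW + b).
import numpy as np

def step(z):
    # Heaviside step function used as the TLU activation phi
    return (z >= 0).astype(int)

def dense_layer(X, W, b, phi=step):
    # X: (n_instances, n_features); W: (n_features, n_neurons); b: (n_neurons,)
    return phi(X @ W + b)

X = np.array([[1.0, 2.0],
              [3.0, 4.0]])        # 2 instances, 2 input features
W = np.array([[0.5, -1.0, 0.2],
              [0.1, 0.3, -0.4]])  # 2 inputs -> 3 neurons (illustrative values)
b = np.array([0.0, 0.1, -0.2])
print(dense_layer(X, W, b))       # shape (2, 3): one output per neuron
```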
Perceptron
■ Perceptron learning rule reinforces connections that help reduce the error.
■ The Perceptron is fed one training instance at a time, and for each
instance it makes its predictions.
■ For every output neuron that produced a wrong prediction, it reinforces the
connection weights from the inputs that would have contributed to the
correct prediction
Perceptron training rule
■ Iterative rule
■ One way to learn an acceptable weight vector is to begin with random
weights
■ Iteratively apply the perceptron to each training example, modifying
the perceptron weights whenever it misclassifies an example
wi := wi + Δwi
– where Δwi = η (t - o) xi
– t is the target output for the current training instance
– o is the perceptron output for x
– η is a small positive constant, called the learning rate
■ The process will converge if
– the training data is linearly separable, and
– η is sufficiently small
■ But if the training data is not linearly separable, it may not converge
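The rule above can be sketched in a few lines of Python. This is a minimal illustration (not code from the slides), assuming outputs in {-1, +1}, a step-function perceptron, and the bias handled as an extra input x0 = 1:

```python
# Minimal sketch of the perceptron training rule: w_i := w_i + eta*(t - o)*x_i,
# applied only when an example is misclassified.
import random

def predict(w, x):
    # Threshold the weighted sum to -1/+1
    return 1 if sum(wi * xi for wi, xi in zip(w, x)) >= 0 else -1

def train_perceptron(examples, eta=0.1, epochs=100):
    # examples: list of (x, t) pairs, each x including the bias input x0 = 1
    n = len(examples[0][0])
    w = [random.uniform(-0.1, 0.1) for _ in range(n)]  # begin with random weights
    for _ in range(epochs):
        for x, t in examples:
            o = predict(w, x)
            if o != t:  # weights change only on a misclassification
                w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
    return w

# Linearly separable task (OR, with -1/+1 labels), so the rule converges
data = [([1, 0, 0], -1), ([1, 0, 1], 1), ([1, 1, 0], 1), ([1, 1, 1], 1)]
w = train_perceptron(data)
print([predict(w, x) for x, _ in data])  # expect [-1, 1, 1, 1]
```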
Perceptron training rule
■ Consider the following cases
■ The training example is correctly classified already by the perceptron.
– In this case, (t-o) is zero, making Δwi zero, so that no weights are updated.
■ The perceptron outputs a -1, when the target output is +1.
■ To make the perceptron output a +1 instead of -1
– The weights must be altered to increase the value of wi xi
– If xi > 0, then increasing wi will bring the perceptron closer to correctly
classifying this example
– The training rule will increase wi, because (t - o), η, and xi are all positive
■ For example
– t = 1, o = -1, xi = 0.8, η = 0.1
– then Δwi = η (t - o) xi = 0.1 × 2 × 0.8 = 0.16, and wi xi gets larger
■ If t = -1 and o = 1, then weights associated with positive xi will be decreased
Gradient descent
■ If the training data is not linearly separable, the perceptron rule may not converge
■ To overcome this problem, use an alternative rule: the delta rule
– More general
– Basis for networks of units
– Works in non-linearly separable cases
– Uses gradient descent to search the hypothesis space of possible weight vectors
to find the weights that best fit the training examples
■ Let the output of a linear unit be o(x) = w0 + w1x1 + … + wnxn
– Simple example of a linear unit (will generalize)
– Omit the thresholding initially
Gradient descent
■ Training error of a hypothesis (weight vector)

E(w) ≡ ½ Σd∈D (td - od)²

■ D is the set of training examples, each a pair ⟨x, td⟩
■ td is the target output for training example d, and od is the output of the linear
unit for training example d
■ E(w) is simply half the squared difference between the target output td and the
linear unit output od, summed over all training examples
■ E is a function of w, because the linear unit output o depends on this weight
vector
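As a small illustration of this definition, here is a minimal Python sketch (an assumed setup, not from the slides) that computes E(w) for a linear unit over a toy training set D:

```python
# Minimal sketch: training error E(w) of a linear unit over a dataset D.
def linear_output(w, x):
    # o(x) = w0 + w1*x1 + ... + wn*xn, with the bias input x0 = 1 prepended
    return sum(wi * xi for wi, xi in zip(w, [1.0] + list(x)))

def training_error(w, D):
    # E(w) = 1/2 * sum over d in D of (t_d - o_d)^2
    return 0.5 * sum((t - linear_output(w, x)) ** 2 for x, t in D)

D = [((1.0, 2.0), 1.0), ((2.0, 0.5), -1.0)]   # made-up examples (x, t)
print(training_error([0.0, 0.1, -0.2], D))    # the error depends on w
```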
Error minimization
■ Look at error E as a function of weights {wi}
■ Slide down gradient of E in weight space
■ Reach values of {wi} that correspond to minimum error
– Look for global minimum

Visualizing the Hypothesis Space
■ For a linear unit with two weights, the hypothesis
space H is the w0, w1 plane.
■ The vertical axis indicates the error of the
corresponding weight vector hypothesis, relative
to a fixed set of training examples.
■ The arrow shows the negated gradient at one
particular point, indicating the direction in the
w0, w1 plane producing steepest descent along
the error surface.
■ Gradient descent search determines a weight
vector that minimizes E by starting with an
arbitrary initial weight vector, then repeatedly
modifying it in small steps.
■ At each step, the weight vector is altered in the
direction that produces the steepest descent along
the error surface, and this continues until the
global minimum error is reached.
■ For a linear unit, the error surface is a parabola with a
single global minimum.
Incremental (Stochastic) Gradient Descent
Difficulties in applying gradient descent are:
(1) converging to a local minimum can sometimes be quite slow
(2) if there are multiple local minima in the error surface, then there is no guarantee that the
procedure will find the global minimum
Incremental mode Gradient Descent:

Ed(w) = ½ (td - od)²

■ Repeat
– For each training example d in D
1. Compute the gradient ∇Ed(w)
2. w ← w - η ∇Ed(w)
■ Incremental gradient descent can approximate batch gradient descent if η is small enough
Incremental Gradient Descent Algorithm
Incremental-Gradient-Descent(training examples, η)
Each training example is a pair ⟨x, t⟩: x is the vector of input values,
and t is the target output value. η is the learning rate (e.g., 0.05).
■ Initialize each wi to some small random value
■ Repeat until the termination condition is met
– For each ⟨x, t⟩ in the training examples
■ Input x to the unit and compute the output o
■ For each linear unit weight wi
wi ← wi + η (t - o) xi
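A minimal Python sketch of this algorithm (an assumed setup, not from the slides), with the bias folded in as an extra input x0 = 1 and a made-up dataset for illustration:

```python
# Minimal sketch of incremental (stochastic) gradient descent for a single
# linear unit; weights are updated after *each* training example.
import random

def linear_output(w, x):
    # o(x) = w0*x0 + w1*x1 + ... + wn*xn
    return sum(wi * xi for wi, xi in zip(w, x))

def incremental_gradient_descent(examples, eta=0.05, epochs=1000):
    # examples: list of (x, t) pairs, each x including the bias input x0 = 1
    n = len(examples[0][0])
    w = [random.uniform(-0.05, 0.05) for _ in range(n)]  # small random start
    for _ in range(epochs):
        for x, t in examples:
            o = linear_output(w, x)
            w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
    return w

# Fit the noiseless linear target t = 1 + 2*x to check that the rule converges
data = [([1.0, x], 1.0 + 2.0 * x) for x in [0.0, 0.5, 1.0, 1.5, 2.0]]
print(incremental_gradient_descent(data))  # approximately [1.0, 2.0]
```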
Standard gradient descent and stochastic
gradient descent - Differences
■ Standard gradient descent: the error is summed over all examples before updating weights.
Stochastic gradient descent: weights are updated upon examining each training example.
■ Standard: summing over multiple examples requires more computation per weight-update step.
Stochastic: requires less computation per update.
■ Standard: can be used with a larger step size per weight update than stochastic gradient descent.
Stochastic: in cases where there are multiple local minima with respect to E(w), it can
sometimes avoid falling into these local minima, because it uses the various ∇Ed(w)
rather than ∇E(w).
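The difference in update schedule can be sketched side by side. This is a minimal illustration for a single linear unit (an assumed setup, not from the slides):

```python
# Batch sums gradient contributions over all of D before touching the
# weights; stochastic updates the weights after every single example.
def batch_epoch(w, D, eta):
    delta = [0.0] * len(w)
    for x, t in D:                       # accumulate over the whole set
        o = sum(wi * xi for wi, xi in zip(w, x))
        delta = [dw + eta * (t - o) * xi for dw, xi in zip(delta, x)]
    return [wi + dw for wi, dw in zip(w, delta)]   # one update per epoch

def stochastic_epoch(w, D, eta):
    for x, t in D:                       # one update per example
        o = sum(wi * xi for wi, xi in zip(w, x))
        w = [wi + eta * (t - o) * xi for wi, xi in zip(w, x)]
    return w

D = [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0)]  # made-up (x, t) pairs
print(batch_epoch([0.0, 0.0], D, 0.1))       # the two schedules differ
print(stochastic_epoch([0.0, 0.0], D, 0.1))
```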
Perceptron vs. Delta rule training
■ Perceptron training rule guaranteed to succeed if
– Training examples are linearly separable
– Sufficiently small learning rate
■ Delta training rule uses gradient descent
– Guaranteed to converge to hypothesis with minimum squared error
■ Given sufficiently small learning rate
■ Even when training data contains noise
■ Even when training data not linearly separable
■ Can generalize linear units to units with threshold
– Just threshold the results

Perceptron vs. Delta rule training
■ Delta and perceptron training rules appear similar, but
– Perceptron rule trains discontinuous units
■ Guaranteed to converge under limited conditions
■ May not converge in general
– Gradient rule trains over continuous response (unthresholded
outputs)
■ Gradient rule converges (asymptotically) to the minimum-error hypothesis
– Even with noisy training data
– Even with non-separable training data
– Gradient descent generalizes to other continuous responses
– Can train perceptron with LMS rule
■ get prediction by thresholding outputs

Incremental Version
■ Batch gradient descent for a single sigmoid unit

ED = ½ Σd∈D (td - od)²
∂ED/∂wi = - Σd∈D (td - od) od (1 - od) xi,d

■ Stochastic approximation

Ed = ½ (td - od)²
∂Ed/∂wi = - (td - od) od (1 - od) xi,d
Backpropagation procedure
■ Create FF net with
– nin inputs
– nhidden hidden units
– nout output units
■ Define error by considering all output units:

E(w) ≡ ½ Σd∈D Σk∈outputs (tkd - okd)²

■ Train the net by propagating errors backwards from the output units
– First output units
– Then hidden units
■ xji is the input from unit i to unit j
wji is the corresponding weight
■ Various termination conditions
– error
– # iterations, …
Backpropagation (stochastic case)
Backpropagation(training examples, η, nin, nout, nhidden)
■ Create a feed-forward network with nin inputs, nhidden hidden units, and nout output units
■ Initialize all weights to small random numbers
■ Until the termination condition is met, do
For each training example
Propagate the input forward through the network:
1. Input the training example to the network and compute the network outputs
Propagate the errors backward through the network:
2. For each output unit k, calculate its error δk
δk ← ok (1 - ok) (tk - ok)
3. For each hidden unit h
δh ← oh (1 - oh) Σk∈outputs wk,h δk
4. Update each network weight wj,i
wj,i ← wj,i + Δwj,i
where Δwj,i = η δj xj,i
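The procedure above can be sketched for a single hidden layer of sigmoid units. This is a minimal illustration of the stated update rules (an assumed setup, not the slides' code); whether a given run converges depends on the random initialization:

```python
# Minimal sketch of stochastic backpropagation for one hidden layer of
# sigmoid units, following steps 1-4 above. Bias inputs are folded in as
# x0 = 1 on both layers.
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(x, W_hid, W_out):
    # Compute hidden activations, then output activations
    h = [sigmoid(sum(w * v for w, v in zip(ws, [1.0] + x))) for ws in W_hid]
    o = [sigmoid(sum(w * v for w, v in zip(ws, [1.0] + h))) for ws in W_out]
    return h, o

def backprop(examples, n_in, n_hidden, n_out, eta=0.3, epochs=10000):
    rnd = lambda: random.uniform(-0.1, 0.1)  # small random initial weights
    W_hid = [[rnd() for _ in range(n_in + 1)] for _ in range(n_hidden)]
    W_out = [[rnd() for _ in range(n_hidden + 1)] for _ in range(n_out)]
    for _ in range(epochs):
        for x, t in examples:
            h, o = forward(x, W_hid, W_out)
            # Step 2: delta_k = o_k (1 - o_k)(t_k - o_k) for output units
            d_out = [ok * (1 - ok) * (tk - ok) for ok, tk in zip(o, t)]
            # Step 3: delta_h = o_h (1 - o_h) sum_k w_k,h delta_k for hidden units
            d_hid = [h[j] * (1 - h[j]) *
                     sum(W_out[k][j + 1] * d_out[k] for k in range(n_out))
                     for j in range(n_hidden)]
            # Step 4: w_j,i <- w_j,i + eta * delta_j * x_j,i
            for k in range(n_out):
                for i, v in enumerate([1.0] + h):
                    W_out[k][i] += eta * d_out[k] * v
            for j in range(n_hidden):
                for i, v in enumerate([1.0] + x):
                    W_hid[j][i] += eta * d_hid[j] * v
    return W_hid, W_out

# XOR: not linearly separable, so a single perceptron cannot learn it
data = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
        ([1.0, 0.0], [1.0]), ([1.0, 1.0], [0.0])]
W_hid, W_out = backprop(data, n_in=2, n_hidden=3, n_out=1)
# Outputs should approach 0, 1, 1, 0 (exact values depend on initialization)
print([round(forward(x, W_hid, W_out)[1][0], 2) for x, _ in data])
```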
Errors propagate backwards
δ3 = o3 (1 - o3)(t3 - o3)

[Figure: a layer of units 1-4 above a layer of units 5-9, with weights
such as w1,5, w2,5, w1,7, w2,7, w3,7, w4,7, and w4,9 labeling the connections]

δ7 = o7 (1 - o7) Σi=1..4 wi,7 δi

w1,7 is updated based on δ1 and x1,7

δk ← ok (1 - ok)(tk - ok),  δh ← oh (1 - oh) Σk∈outputs wk,h δk,  wj,i ← wj,i + Δwj,i,  Δwj,i = η δj xj,i
■ The same process repeats if we have more layers
Properties of Backpropagation
■ Easily generalized to arbitrary directed (acyclic) graphs
– Backpropagate errors through the different layers
■ Training is slow, but applying the network after training is fast

Convergence of Backpropagation
■ Convergence
– Training can take thousands of iterations → slow
■ Gradient descent over entire network weight vector
■ Speed up using small initial values of weights:
– Linear response initially
– Generally will find local minimum
■ Typically can find good approximation to global minimum
– Solutions to local minimum trap problem
■ Stochastic gradient descent
■ Train multiple networks using the same data, but initializing each network with different random
weights
– best performance selected
■ Can modify to find better approximation to global minimum
– include a weight momentum term α:
Δwji(n) = η δj xji + α Δwji(n - 1)
■ Momentum helps the search move past local minima and plateaus
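A minimal sketch of the momentum update (the numbers are made up for illustration): each weight change blends the current gradient step with the previous change, so the step keeps its size on a flat plateau:

```python
# Minimal sketch of the momentum rule: delta_w(n) = eta*grad + alpha*delta_w(n-1)
def momentum_step(grad_term, prev_delta, eta=0.3, alpha=0.9):
    return eta * grad_term + alpha * prev_delta

delta = 0.0
for g in [1.0, 1.0, 1.0, 0.0, 0.0]:   # gradient terms, then a flat plateau
    delta = momentum_step(g, delta)
    print(round(delta, 3))             # the step keeps moving on the plateau
```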
Example of learning a simple function
■ Learn to recognize 8 simple inputs
– how to interpret hidden units
– System learns binary representation
■ Trained with
– initial weights between -0.1 and +0.1
– η = 0.3
■ 5000 iterations (most change in the first 50%)
■ Target output values:
– .1 for 0
– .9 for 1
Hidden layer representations
Input → Hidden Values → Output
10000000 → ? ? ? → 10000000
01000000 → ? ? ? → 01000000
00100000 → ? ? ? → 00100000
00010000 → ? ? ? → 00010000
00001000 → ? ? ? → 00001000
00000100 → ? ? ? → 00000100
00000010 → ? ? ? → 00000010
00000001 → ? ? ? → 00000001
Hidden layer representations
Input → Hidden Values → Output
10000000 → .89 .04 .08 → 10000000
01000000 → .01 .11 .88 → 01000000
00100000 → .01 .97 .27 → 00100000
00010000 → .99 .97 .71 → 00010000
00001000 → .03 .05 .02 → 00001000
00000100 → .22 .99 .99 → 00000100
00000010 → .80 .01 .98 → 00000010
00000001 → .60 .94 .01 → 00000001
Some issues with ANNs
■ Interpretation of hidden units
■ Hidden units “discover” new patterns/regularities
■ Often difficult to interpret
■ Overfitting
■ Expressiveness
– Generalization to different classes of functions

Dealing with overfitting
■ Complex decision surface
■ Divide sample into
– Training set
– Validation set
■ Solutions
– Return to weight set occurring near minimum over validation set
– Prevent weights from becoming too large
■ Reduce weights by (small) proportionate amount at each iteration

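The second solution, shrinking the weights by a small proportionate amount at each iteration (weight decay), can be sketched as follows (an assumed setup with illustrative numbers):

```python
# Minimal sketch of weight decay: shrink every weight slightly before
# applying the usual gradient-based change, keeping weights from growing large.
def decayed_update(w, grad_step, eta=0.3, decay=0.001):
    return [(1.0 - decay) * wi + eta * g for wi, g in zip(w, grad_step)]

w = [5.0, -3.0]
for _ in range(3):
    w = decayed_update(w, [0.0, 0.0])  # with zero gradient, weights only shrink
print(w)
```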
Effect of hidden units

Expressiveness

■ Every Boolean function can be represented by a network with a single hidden layer
– Create 1 hidden unit for each possible input
– but might require exponential (in number of inputs) hidden units

Expressiveness
■ Every bounded continuous function can be approximated with arbitrarily small
error, by network with one hidden layer
– Hidden layer of sigmoid functions
– Output layer of linear functions
■ Any function can be approximated to arbitrary accuracy by a network with two
hidden layers
– Sigmoid units in both hidden layers
– Output layer of linear functions

Regression MLPs
■ To predict a single value (e.g., the price of a house, given many of its features), you
just need a single output neuron: its output is the predicted value
■ For multivariate regression (i.e., to predict multiple values at once), need one output
neuron per output dimension.
■ For example, to locate the center of an object in an image, you need to predict 2D
coordinates, so you need two output neurons.
■ When building an MLP for regression, there is no need to use any activation function for the
output neurons, so they are free to output any range of values.
■ To guarantee that the output will always be positive, use the ReLU
activation function in the output layer.
– Rectified Linear Unit function: ReLU(z) = max(0, z), which is 0 for z < 0 and z otherwise
– The softplus activation function is a smooth variant of ReLU: softplus(z) = log(1 +
exp(z)). It is close to 0 when z is negative, and close to z when z is positive.
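A small sketch checking the stated ReLU and softplus behavior:

```python
# Minimal check of ReLU(z) = max(0, z) and softplus(z) = log(1 + exp(z)).
import math

def relu(z):
    return max(0.0, z)

def softplus(z):
    return math.log(1.0 + math.exp(z))

for z in [-5.0, -1.0, 0.0, 1.0, 5.0]:
    # softplus is close to 0 for very negative z and close to z for large z
    print(z, relu(z), round(softplus(z), 4))
```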
Regression MLPs

Classification MLPs
■ For a binary classification problem
– Need a single output neuron using the logistic activation function: the output will be a
number between 0 and 1, which you can interpret as the estimated probability of the positive
class. The estimated probability of the negative class is equal to one minus that
number.
■ Multiple binary classification tasks
– For example, in an email classification system that predicts whether each incoming
email is ham or spam, and also predicts whether it is an urgent or nonurgent email
– need two output neurons, each using the logistic activation function: the first would output
the probability that the email is spam, and the second would output the probability
that it is urgent
– in general, one output neuron for each positive class
■ If each instance can belong only to a single class, out of three or more possible
classes
– Need one output neuron per class, and use the softmax activation function for
the whole output layer
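A minimal sketch of the softmax activation for a layer of three output scores (the scores are made up for illustration):

```python
# Softmax turns the output layer's raw scores into class probabilities
# that are all between 0 and 1 and sum to 1.
import math

def softmax(scores):
    # Subtract the max score for numerical stability before exponentiating
    m = max(scores)
    exps = [math.exp(s - m) for s in scores]
    total = sum(exps)
    return [e / total for e in exps]

probs = softmax([2.0, 1.0, 0.1])   # three mutually exclusive classes
print([round(p, 3) for p in probs], sum(probs))  # sums to 1 (up to rounding)
```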
Classification MLPs

Classification MLPs
Extension of ANNs
■ Many possible variations
■ Alternative error functions
– Penalize large weights
■ Add weighted sum of squares of weights to error term
■ Structure of network
– Start with small network, and grow
– Start with large network and diminish
■ Use other learning algorithms to learn weights

Extensions of ANNs
■ Recurrent networks
– Example of time series
■ Would like to have representation of behavior at t+1
from arbitrary past intervals (no set number)
■ Idea of simple recurrent network
– hidden units that have feedback to inputs
■ Dynamically growing and shrinking networks

Example
■ Consider a single neuron unit with three inputs as shown below
The weights corresponding to the inputs are:
w1 = 2
w2 = -4
w3 = 1
The activation of the unit is given by the step function:
φ(v) = 1 if v ≥ 0, 0 otherwise

Calculate what will be the output value y of the unit for each of the following input patterns:
Pattern P1 P2 P3 P4
x1 1 0 1 1
x2 0 1 0 1
x3 0 1 1 1
Example
1. Calculate the weighted sum v
2. Apply the activation function to v
■ For each input pattern
■ Weighted sum for P1:
v = w1x1 + w2x2 + w3x3
= 2×1 + (-4)×0 + 1×0 = 2
Applying the activation function for v = 2:
2 ≥ 0, so y = φ(2) = 1
Similarly for P2:
v = -3, -3 < 0, so y = φ(-3) = 0
P3:
v = 3, 3 ≥ 0, so y = φ(3) = 1
P4:
v = -1, -1 < 0, so y = φ(-1) = 0
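The worked example can be checked with a short sketch (the pattern values are taken from the table above):

```python
# Verify the worked example: one threshold unit with w = (2, -4, 1)
# evaluated on the four input patterns P1..P4.
def phi(v):
    return 1 if v >= 0 else 0

w = [2, -4, 1]
patterns = {"P1": [1, 0, 0], "P2": [0, 1, 1],
            "P3": [1, 0, 1], "P4": [1, 1, 1]}
for name, x in patterns.items():
    v = sum(wi * xi for wi, xi in zip(w, x))
    print(name, "v =", v, "y =", phi(v))   # expect y = 1, 0, 1, 0
```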
Exercise
■ Assume that you are at the output stage of the network. The objective is for the
unit to learn a single input pattern, namely

i = (i1, i2) = (1, 4)

■ The desired output is o = 1. Initially assume w1 = w2 = 0

Use a learning rate η = 1.0

Show all the calculations for two iterations. Show the weight values at the end of
the first and second iterations. In what direction is the weight vector moving from
iteration to iteration?
