Naive Bayes


UNIT-II
• Estimating Probabilities from Data, Bayes Rule, MLE, MAP
• Naive Bayes: Conditional Independence, Naive Bayes: Why
and How, Bag of Words
• Logistic Regression: Maximizing Conditional Likelihood,
Gradient Descent
• Kernels: Kernelization Algorithm, Kernelizing the Perceptron
• Discriminants: The Perceptron, Linear Separability, Linear
Regression
• Multilayer Perceptron (MLP): Going Forwards, Backwards,
MLP in Practice, Deriving Back-Propagation
BAYESIAN LEARNING
• Bayesian reasoning provides a probabilistic approach to inference. It
is based on the assumption that the quantities of interest are
governed by probability distributions and that optimal decisions can
be made by reasoning about these probabilities together with
observed data.
• It is important to machine learning because it provides a
quantitative approach to weighing the evidence supporting
alternative hypotheses.
• Bayesian reasoning provides the basis for learning algorithms that
directly manipulate probabilities, as well as a framework for
analyzing the operation of other algorithms that do not explicitly
manipulate probabilities.
• Bayesian learning methods are relevant to our study of machine
learning for two different reasons.
• Bayesian learning algorithms calculate explicit probabilities for
hypotheses, such as the naive Bayes classifier.
• Naive Bayes classifier is competitive with other learning algorithms
(Decision Tree and Neural Network Algorithms).
• A second reason Bayesian methods are important to the study of
machine learning is that they provide a useful perspective for
understanding many learning algorithms that do not explicitly
manipulate probabilities.
Ex.: Analyze the FIND-S and CANDIDATE-ELIMINATION algorithms to
determine conditions under which they output the most probable
hypothesis given the training data.
Use a Bayesian analysis to justify a key design choice in neural network
learning algorithms: choosing to minimize the sum of squared errors
when searching the space of possible neural networks.
• Derive an alternative error function, cross entropy, that is more
appropriate than the sum of squared errors when learning target
functions that predict probabilities.
Features of Bayesian learning methods include:
• Each observed training example can incrementally decrease or
increase the estimated probability that a hypothesis is correct. This
provides a more flexible approach to learning than algorithms that
completely eliminate a hypothesis if it is found to be inconsistent with
any single example.
• Prior knowledge can be combined with observed data to determine
the final probability of a hypothesis. In Bayesian learning, prior
knowledge is provided by asserting (1) a prior probability for each
candidate hypothesis, and (2) a probability distribution over observed
data for each possible hypothesis.
• Bayesian methods can accommodate hypotheses that make
probabilistic predictions (e.g., hypotheses such as "this pneumonia
patient has a 93% chance of complete recovery").
• New instances can be classified by combining the predictions of multiple
hypotheses, weighted by their probabilities.
• Even in cases where Bayesian methods prove computationally intractable, they can
provide a standard of optimal decision making against which other practical
methods can be measured.
• One practical difficulty in applying Bayesian methods is that they typically require
initial knowledge of many probabilities.
• A second practical difficulty is the significant computational cost required to
determine the Bayes optimal hypothesis in the general case.
BAYES THEOREM
• In machine learning, we are interested in determining the best hypothesis
from some space H, given the observed training data D.
• One way to specify what we mean by the best hypothesis is to say that we demand
the most probable hypothesis, given the data D plus any initial knowledge about the
prior probabilities of the various hypotheses in H.
• Bayes theorem provides a direct method for calculating such probabilities
• P(h) to denote the initial probability that hypothesis h holds, before we have
observed the training data.
• P(h) is often called the prior probability of h and may reflect any background
knowledge we have about the chance that h is a correct hypothesis.
• If we have no such prior knowledge, then we might simply assign the same
prior probability to each candidate hypothesis.
• Similarly, we will write P(D) to denote the prior probability that training data
D will be observed.
• We write P(D/h) to denote the probability of observing data D given some
world in which hypothesis h holds.
• In general, P(x/y) denotes the probability of x given y. In machine learning
problems we are interested in P(h/D), the probability that h holds given the
observed training data D.
• P (h / D) is called the posterior probability of h, because it reflects our
confidence that h holds after we have seen the training data D. Notice the
posterior probability P(h/D) reflects the influence of the training data D, in
contrast to the prior probability P(h) , which is independent of D.
• Bayes theorem is the cornerstone of Bayesian learning methods
because it provides a way to calculate the posterior probability
P(h/D) from the prior probability P(h), together with P(D) and
P(D/h).
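Written out, Bayes theorem is:

P(h/D) = P(D/h) P(h) / P(D)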

• P(h / D) increases with P(h) and with P(D/h) according to Bayes


theorem. It is also reasonable to see that P(h/ D) decreases as
P(D) increases, because the more probable it is that D will be
observed independent of h, the less evidence D provides in
support of h.
• In many learning scenarios, the learner considers some set of
candidate hypotheses H and is interested in finding the most
probable hypothesis h ∈ H given the observed data D (or at least
one of the maximally probable, if there are several).
• Any such maximally probable hypothesis is called a maximum a
posteriori (MAP) hypothesis.
• We can determine the MAP hypotheses by using Bayes theorem to
calculate the posterior probability of each candidate hypothesis.
More precisely, hMAP is a MAP hypothesis provided

hMAP = argmax_{h∈H} P(h/D)
     = argmax_{h∈H} P(D/h) P(h) / P(D)
     = argmax_{h∈H} P(D/h) P(h)    (6.2)

In the final step we dropped the term P(D) because it is a constant
independent of h.
• If we assume that every hypothesis in H is equally probable a priori
(P(hi) = P(hj) for all hi and hj in H), we can simplify Equation (6.2)
further and need only consider the term P(D/h) to find the most
probable hypothesis. P(D/h) is often called the likelihood of the data
D given h, and any hypothesis that maximizes P(D/h) is called a
maximum likelihood (ML) hypothesis:

hML = argmax_{h∈H} P(D/h)
• An Example: To illustrate Bayes rule, consider a medical diagnosis problem in
which there are two alternative hypotheses: (1) that the patient has a
particular form of cancer, and (2) that the patient does not. The available
data is from a particular laboratory test with two possible outcomes:
positive and negative.
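A minimal sketch of the computation in Python, using illustrative figures (the prior and test accuracies below are assumed for the example, not given in the text):

```python
# Hypothetical figures for the cancer-diagnosis example (assumed):
# prior P(cancer) = 0.008, a test that is positive in 98% of cancer
# cases and falsely positive in 3% of non-cancer cases.
p_cancer = 0.008
p_pos_given_cancer = 0.98
p_pos_given_no_cancer = 0.03

# P(positive) by the theorem of total probability.
p_pos = (p_pos_given_cancer * p_cancer
         + p_pos_given_no_cancer * (1 - p_cancer))

# Bayes rule: posterior probability of cancer given a positive test.
p_cancer_given_pos = p_pos_given_cancer * p_cancer / p_pos
print(f"P(cancer / positive) = {p_cancer_given_pos:.3f}")  # ~0.21
```

Note how the small prior dominates: even with an accurate test, the posterior probability of cancer after one positive result is only about 0.21.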
BAYES THEOREM AND CONCEPT LEARNING
• What is the relationship between Bayes theorem and the problem of
concept learning? Since Bayes theorem provides a principled way to
calculate the posterior probability of each hypothesis given the
training data, we can use it as the basis for a straightforward
learning algorithm that calculates the probability for each possible
hypothesis, then outputs the most probable.
ESTIMATING PROBABILITIES
• So far we have estimated probabilities by the fraction nc/n of times an event is observed to
occur out of the total number of opportunities.
• Ex.: We estimated P(Wind = strong / PlayTennis = no) by the fraction nc/n, where n = 5 is the
total number of training examples for which PlayTennis = no, and nc = 3 is the number of these
for which Wind = strong.
• While this observed fraction provides a good estimate of the probability in many cases, it
provides poor estimates when nc is very small.
• Suppose the true P(Wind = strong / PlayTennis = no) = 0.08 and that we have a sample
containing only 5 examples for which PlayTennis = no.
• Then the most probable value for nc is 0. This raises two difficulties.
• First, nc/n produces a biased underestimate of the probability.
• Second, when this probability estimate is zero, the term will dominate the Bayes classifier if the
future query contains Wind = strong. The reason is that the quantity calculated in Equation (6.20)
requires multiplying all the other probability terms by this zero value.
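The zero-estimate problem can be seen directly in a few lines of Python (a sketch with made-up data, using the raw nc/n estimate described above):

```python
from collections import Counter

# Wind values for the training examples with PlayTennis = no.
# Illustrative data: 5 examples, none of which has Wind = strong.
wind_given_no = ["weak", "weak", "weak", "weak", "weak"]

counts = Counter(wind_given_no)
n = len(wind_given_no)

# Raw frequency estimate nc/n, as in the text.
p_strong_given_no = counts["strong"] / n  # = 0.0

# Any naive Bayes product containing this term is forced to zero,
# no matter how strong the remaining evidence is:
other_evidence = 0.9 * 0.8 * 0.7
posterior_score = other_evidence * p_strong_given_no
print(posterior_score)  # 0.0 -- the single zero estimate dominates
```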
MAXIMUM LIKELIHOOD AND LEAST-SQUARED ERROR
HYPOTHESES
• Consider the problem of learning a
continuous-valued target function-a problem
faced by many learning approaches such as
neural network learning, linear regression,
and polynomial curve fitting.
• A straightforward Bayesian analysis will show
that under certain assumptions any learning
algorithm that minimizes the squared error
between the output hypothesis predictions
and the training data will output a maximum
likelihood hypothesis.
• The significance of this result is that it
provides a Bayesian justification for many
neural network and other curve fitting
methods that attempt to minimize the sum of
squared errors over the training data.
A simple example of such a problem is learning a linear function, though our analysis applies to learning
arbitrary real-valued functions. Figure 6.2 illustrates a linear target function f depicted by the solid line,
and a set of noisy training examples of this target function. The dashed line corresponds to the
hypothesis hML with least-squared training error, hence the maximum likelihood hypothesis. Notice that
the maximum likelihood hypothesis is not necessarily identical to the correct hypothesis, f, because it is
inferred from only a limited sample of noisy training data.
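The core of the argument can be written compactly (a sketch, under the standard assumption that each training value is the target value corrupted by zero-mean Gaussian noise with variance σ², i.e. di = f(xi) + ei):

```latex
h_{ML} = \arg\max_{h \in H} \prod_{i=1}^{m} p(d_i \mid h)
       = \arg\max_{h \in H} \prod_{i=1}^{m}
         \frac{1}{\sqrt{2\pi\sigma^2}}\,
         e^{-\frac{(d_i - h(x_i))^2}{2\sigma^2}}
       = \arg\min_{h \in H} \sum_{i=1}^{m} \left(d_i - h(x_i)\right)^2
```

Taking logarithms turns the product into a sum, and dropping the terms that do not depend on h leaves exactly the sum of squared errors, which is why minimizing it yields the maximum likelihood hypothesis.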

MAXIMUM LIKELIHOOD HYPOTHESES FOR PREDICTING PROBABILITIES
In the problem setting of the previous section we determined that the
maximum likelihood hypothesis is the one that minimizes the sum of
squared errors over the training examples. In this section we derive an
analogous criterion for a second setting that is common in neural
network learning: learning to predict probabilities.
• Consider the setting in which we wish to learn a nondeterministic
(probabilistic) function f : X -> {0, 1}, which has two discrete output
values.
• Ex.: The instance space X might represent medical patients in terms of
their symptoms, and the target function f(x) might be 1 if the patient
survives the disease and 0 if not. Or X might represent loan applicants in
terms of their past credit history, and f(x) might be 1 if the applicant
successfully repays their next loan and 0 if not. In both of these cases we
expect f to be probabilistic.
• Ex.: Among a collection of patients exhibiting the same set of observable
symptoms, we might find that 92% survive and 8% do not.
• In this setting we wish to learn a neural network whose output is the
probability that f(x) = 1; that is, to learn the target function
f' : X -> [0, 1] such that f'(x) = P(f(x) = 1).
• In the medical-patient example above, if x is one of those
indistinguishable patients of which 92% survive, then f'(x) = 0.92,
whereas the probabilistic function f(x) will be equal to 1 in 92% of cases
and equal to 0 in the remaining 8%.
• First obtain an expression for P(D/h). Assume the training data D is of
the form D = {(x1, d1), ..., (xm, dm)}, where di is the observed 0 or 1
value for f(xi).
• We could make the simplifying assumption that the instances (x1 ... xm)
were fixed. Instead, we treat both xi and di as random variables and,
assuming each training example is drawn independently (an assumption
that fits our current setting, just as for independent coin flips), write

P(D/h) = ∏_{i=1}^{m} P(xi, di / h)

• Assume, furthermore, that the probability of encountering any particular
instance xi is independent of the hypothesis h. For example, the
probability that our training set contains a particular patient xi is
independent of our hypothesis about survival rates (though of course the
survival di of the patient does depend strongly on h).
• When xi is independent of h we can rewrite the above expression
(applying the product rule from Table 6.1) as

P(D/h) = ∏_{i=1}^{m} P(di / h, xi) P(xi)
       = ∏_{i=1}^{m} h(xi)^di (1 − h(xi))^(1−di) P(xi)

• It is easier to work with the log of the likelihood, yielding

ln P(D/h) = Σ_{i=1}^{m} [ di ln h(xi) + (1 − di) ln(1 − h(xi)) ] + Σ_{i=1}^{m} ln P(xi)

The first sum describes the quantity that must be maximized in order to
obtain the maximum likelihood hypothesis in our current problem setting
(a cross-entropy). This result is analogous to our earlier result showing
that minimizing the sum of squared errors produces the maximum
likelihood hypothesis in the earlier problem setting.
Gradient Search to Maximize Likelihood in a Neural Net
The key idea behind the delta rule is to use gradient descent to search the
hypothesis space of possible weight vectors to find the weights that best fit
the training examples. This rule is important because gradient descent
provides the basis for the BACKPROPAGATION algorithm, which can learn
networks with many interconnected units. It is also important because
gradient descent can serve as the basis for learning algorithms that must
search through hypothesis spaces containing many different types of
continuously parameterized hypotheses.

VISUALIZING THE HYPOTHESIS SPACE
To understand the gradient descent algorithm, it is helpful to visualize the
entire hypothesis space of possible weight vectors and their associated E
values, as illustrated in Figure 4.4. Here the axes w0 and w1 represent
possible values for the two weights of a simple linear unit, so the (w0, w1)
plane represents the entire hypothesis space. The vertical axis indicates the
error E relative to some fixed set of training examples, and the error surface
shown in the figure thus summarizes the desirability of every weight vector
in the hypothesis space. Given the way in which we chose to define E, for
linear units this error surface must always be parabolic with a single global
minimum. The specific parabola will depend, of course, on the particular
set of training examples.
Gradient descent search determines a weight vector that minimizes E by
starting with an arbitrary initial weight vector, then repeatedly modifying it
in small steps. At each step, the weight vector is altered in the direction
that produces the steepest descent along the error surface depicted in
Figure 4.4. This process continues until the global minimum error is
reached.
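A minimal sketch of gradient descent for a linear unit with squared error (the training data and learning rate below are illustrative assumptions):

```python
import numpy as np

# Illustrative training data for a linear unit: each row is (bias, x),
# with the bias input fixed at -1, and t holds the target outputs.
X = np.array([[-1.0, 0.0], [-1.0, 1.0], [-1.0, 2.0]])
t = np.array([1.0, 2.0, 3.0])

w = np.zeros(2)   # arbitrary initial weight vector
eta = 0.05        # learning rate (assumed)

for _ in range(500):
    y = X @ w                 # outputs for all training examples
    grad = X.T @ (y - t)      # gradient of E = 0.5 * sum((y - t)^2)
    w -= eta * grad           # step in the steepest-descent direction

print(w)  # converges to about [-1, 1] for this data
```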
Kernels: Kernelization Algorithm, Kernelizing the Perceptron
• A kernel function is a function of distance that is used to
determine the weight of each training example.
• That is, the kernel function is the function K such that
wi = K(d(xi, xq)) for a query point xq.
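As an illustration, one common choice of K (assumed here; the text does not fix a particular kernel) is a Gaussian of the distance, which gives nearby training examples exponentially more weight:

```python
import math

def gaussian_kernel(distance, width=1.0):
    """K(d) = exp(-d^2 / (2 * width^2)): weight decays with distance."""
    return math.exp(-distance**2 / (2 * width**2))

def weight(x_i, x_q, width=1.0):
    """Weight w_i = K(d(x_i, x_q)) for training example x_i, query x_q."""
    d = abs(x_i - x_q)  # one-dimensional distance, for simplicity
    return gaussian_kernel(d, width)

print(weight(1.0, 1.5))  # nearby point: large weight
print(weight(1.0, 4.0))  # distant point: small weight
```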
Discriminants: The Perceptron, Linear Separability, Linear Regression
The Perceptron: The Perceptron is nothing more than a
collection of McCulloch and Pitts neurons together with
a set of inputs and some weights to fasten the inputs to
the neurons.
The network is shown in Figure 3.2.
On the left of the figure, shaded in light grey, are the
input nodes.
These are not neurons, they are just a nice schematic
way of showing how values are fed into the network.
They are almost always drawn as circles, just like
neurons.
The neurons are shown on the right, and you can see
both the additive part (shown as a circle) and the
thresholder.
In practice nobody bothers to draw the thresholder
separately.
Neurons in the Perceptron are completely independent of each other: it
doesn't matter to any neuron what the others are doing; it works out
whether or not to fire by multiplying together its own weights and the
inputs, adding them together, and comparing the result to its own
threshold, regardless of what the other neurons are doing. Even the
weights that go into each neuron are separate for each one, so the only
thing they share is the inputs, since every neuron sees all of the inputs
to the network.

In Fig. 3.2 the number of inputs is the same as the number of neurons;
in general there will be m inputs and n neurons. The number of inputs
is determined for us by the data, and so is the number of outputs, since
we are doing supervised learning: we want the Perceptron to learn to
reproduce a particular target, a pattern of firing and non-firing neurons
for the given input.

When we looked at the McCulloch and Pitts neuron, the weights were
labelled as wi, with the i index running over the number of inputs. Here
we need a second index, writing wij, where the j index runs over the
number of neurons. So w32 is the weight that connects input node 3 to
neuron 2. In an implementation of the neural network, we use a
two-dimensional array to hold these weights.

Working out whether or not a neuron should fire is easy: we set the
values of the input nodes to match the elements of an input vector and
then use Equations (3.1) and (3.2) for each neuron. Doing this for all of
the neurons gives a pattern of firing and non-firing neurons, which looks
like a vector of 0s and 1s. If there are 5 neurons, as in Fig. 3.2, then a
typical output pattern could be (0, 1, 0, 0, 1), which means that the
second and fifth neurons fired and the others did not.
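A minimal sketch of this recall computation (the weight values below are made up for illustration):

```python
import numpy as np

def recall(weights, inputs, threshold=0.0):
    """Forward (recall) phase of a Perceptron.

    weights: (m+1) x n array, one column per neuron; row 0 holds the
             bias weights.
    inputs:  length-m vector of input-node values.
    Returns a 0/1 firing pattern, one entry per neuron.
    """
    x = np.concatenate(([-1.0], inputs))   # bias input fixed at -1
    activations = x @ weights              # weighted sum per neuron
    return (activations > threshold).astype(int)

# Illustrative weights (assumed) for 2 inputs and 2 neurons.
W = np.array([[0.1, -0.2],    # bias weights
              [0.5,  0.3],    # weights from input 1
              [-0.4, 0.6]])   # weights from input 2
print(recall(W, np.array([1.0, 0.0])))  # -> [1 1] for these weights
```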
There are m weights connected to that neuron, one for each of the input
nodes. If we label the neuron that is wrong as k, then the weights that we
are interested in are wik, where i runs from 1 to m. So we know which
weights to change, but we still need to work out how to change the values
of those weights.

The first thing we need to know is whether each weight is too big or too
small. This seems obvious at first: some of the weights will be too big if
the neuron fired when it shouldn't have, and too small if it didn't fire
when it should. We compute yk − tk (the difference between the output yk
and the target for that neuron, tk; this is a possible error function). If it is
negative then the neuron should have fired and didn't, so we make the
weights bigger, and vice versa if it is positive, which we can do by
subtracting the error value. The inputs matter too: if an input is negative,
then to make the neuron fire we need to make the value of the weight
negative as well. Multiplying those two things together tells us how we
should change the weight: ∆wik = −(yk − tk) × xi, and the new value of the
weight is the old value plus this value.

We also need to decide how much to change the weight by. This is done by
multiplying the value above by a parameter called the learning rate,
labelled as η. The value of the learning rate decides how fast the network
learns. It's quite important, so it gets a little subsection of its own (next),
but first let's write down the final rule for updating a weight wij:

wij ← wij − η(yj − tj) × xi    (3.3)

The first time through, the network might get some of the answers correct
and some wrong; the next time it will hopefully improve, and eventually its
performance will stop improving. We predefine a maximum number of
iterations, T. If the network gets all of the inputs correct, then that is also
a good time to stop.
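A sketch of one training update implementing Equation (3.3) (the weights and η are chosen to match the worked OR example later in these notes):

```python
import numpy as np

def train_step(W, x, targets, eta=0.25):
    """One Perceptron weight update, following Equation (3.3):
    w_ij <- w_ij - eta * (y_j - t_j) * x_i."""
    x = np.concatenate(([-1.0], x))        # prepend the bias input
    y = (x @ W > 0).astype(float)          # recall: current outputs
    W -= eta * np.outer(x, y - targets)    # update every weight
    return W

# Illustrative use: 2 inputs, 1 neuron, input (0, 0) with target 0.
W = np.array([[-0.05], [-0.02], [0.02]])
W = train_step(W, np.array([0.0, 0.0]), np.array([0.0]))
print(W)  # w0 becomes 0.2, matching Equation (3.7) below
```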
The Bias Input
When we discussed the McCulloch and Pitts neuron, we gave each neuron a
firing threshold θ that determined what value it needed before it should fire.
This threshold should be adjustable, to change the value that the neuron
fires at. Suppose that all of the inputs to a neuron are zero. Then it doesn't
matter what the weights are (since zero times anything equals zero): the
only way that we can control whether the neuron fires or not is through the
threshold. This is why we replace the threshold with an extra bias input that
is permanently set to −1, with its weight adjusted along with the others
during training.

The Learning Rate η
Equation (3.3) above tells us how to change the weights, with the parameter
η controlling how much to change the weights by. The cost of having a small
learning rate is that the weights need to see the inputs more often before
they change significantly, so the network takes longer to learn. However, it
will be more stable and resistant to noise (errors) and inaccuracies in the
data. We therefore use a moderate learning rate, typically 0.1 < η < 0.4,
depending upon how much error we expect in the inputs. The learning rate
is a crucial parameter.

The Perceptron Learning Algorithm
The algorithm is separated into two parts: a training phase and a recall
phase. The recall phase is used after training, and it is the one that should
be fast to use, since it will be used far more often than the training phase.
The training phase uses the recall equation, since it has to work out the
activations of the neurons before the error can be calculated and the
weights trained.
The recall phase loops over the neurons, and within that loops over the
inputs, so its complexity is O(mn). The training part does the same thing,
but for T iterations, so it costs O(Tmn).
An Example of Perceptron Learning: Logic Functions
The example we are going to use is something very simple that you
already know about: the logical OR. There are two input nodes (plus the
bias input) and there will be one output. The inputs and the target are
given in the table on the left of Figure 3.4; the right of the figure shows
a plot of the function with the circles as the true outputs, and a cross as
the false one. The corresponding neural network is shown in Figure 3.5.
In Figure 3.5 there are three weights. The algorithm tells us to initialise
the weights to small random numbers, so we'll pick w0 = −0.05,
w1 = −0.02, w2 = 0.02. Now we feed in the first input, where both inputs
are 0: (0, 0). Remember that the input to the bias weight is always −1, so
the value that reaches the neuron is
−0.05 × −1 + −0.02 × 0 + 0.02 × 0 = 0.05.
This value is above 0, so the neuron fires and the output is 1, which is
incorrect according to the target. The update rule tells us that we need to
apply Equation (3.3) to each of the weights separately (we'll pick a value
of η = 0.25 for the example):

w0 : −0.05 − 0.25 × (1 − 0) × −1 = 0.2, (3.7)
w1 : −0.02 − 0.25 × (1 − 0) × 0 = −0.02, (3.8)
w2 : 0.02 − 0.25 × (1 − 0) × 0 = 0.02. (3.9)

Now we feed in the next input (0, 1) and compute the output (check that
you agree that the neuron does not fire, but that it should) and then
apply the learning rule again:

w0 : 0.2 − 0.25 × (0 − 1) × −1 = −0.05, (3.10)
w1 : −0.02 − 0.25 × (0 − 1) × 0 = −0.02, (3.11)
w2 : 0.02 − 0.25 × (0 − 1) × 1 = 0.27. (3.12)

For the (1, 0) input the answer is already correct (you should check that
you agree with this), so we don't have to update the weights at all, and
the same is true for the (1, 1) input.
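The whole pass can be reproduced in a few lines of Python (a direct transcription of the worked example above):

```python
import numpy as np

# Weights and learning rate from the worked OR example.
w = np.array([-0.05, -0.02, 0.02])   # [w0 (bias), w1, w2]
eta = 0.25

inputs = [(0, 0), (0, 1), (1, 0), (1, 1)]
targets = [0, 1, 1, 1]               # logical OR

for (x1, x2), t in zip(inputs, targets):
    x = np.array([-1.0, x1, x2])     # bias input is -1
    y = 1 if x @ w > 0 else 0        # recall
    w = w - eta * (y - t) * x        # Equation (3.3)
    print(f"input ({x1},{x2}): output {y}, target {t}, weights {w}")
```

Running this reproduces Equations (3.7)-(3.12): the first two inputs trigger updates, and the weights are left unchanged for (1, 0) and (1, 1).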
LINEAR SEPARABILITY
With one output neuron, as in the OR example, the Perceptron tries to
separate out the cases where the neuron should fire from those where it
shouldn't. Looking at the graph on the right side of Figure 3.4, you should
be able to draw a straight line that separates out the crosses from the
circles without difficulty (it is done in Figure 3.6). In fact, that is exactly
what the Perceptron does: it tries to find a straight line (in 2D; a plane in
3D, and a hyperplane in higher dimensions) where the neuron fires on one
side of the line and doesn't on the other. This line is called the decision
boundary or discriminant function, and an example of one is given in
Figure 3.7.

Consider one input vector x. The neuron fires if x · wT ≥ 0, where w is the
row of W that connects the inputs to one particular neuron (they are the
same for the OR example, since there is only one neuron), and wT denotes
the transpose of w, used to make both of the vectors into column vectors.
The a · b notation describes the inner or scalar product between two
vectors: it is computed by multiplying each element of the first vector by
the matching element of the second and adding them all together. It also
satisfies a · b = ||a|| ||b|| cos θ, where θ is the angle between a and b.

The decision boundary itself is where x · wT = 0. Suppose we find an input
vector x1 that has x1 · wT = 0, and another input vector x2 that satisfies
x2 · wT = 0. Putting these two equations together we get:

x1 · wT = x2 · wT  =>  (x1 − x2) · wT = 0

For the inner product to be 0, either ||a|| or ||b|| or cos θ needs to be
zero. There is no reason to believe that ||a|| or ||b|| should be 0, so
cos θ = 0. This means that θ = π/2 (or −π/2), so the two vectors are at
right angles to each other. Now x1 − x2 is a straight line between two
points that lie on the decision boundary, and the weight vector wT must
be perpendicular to that, as in Figure 3.7.

The Perceptron simply tries to find a straight line that divides the examples
where each neuron fires from those where it does not. This is great if that
straight line exists, but is a bit of a problem otherwise. The cases where
there is a straight line are called linearly separable cases.

With more than one output neuron, the weights for each neuron separately
describe a straight line, so by putting together several neurons we get
several straight lines that each try to separate different parts of the space.
Figure 3.8 shows an example of decision boundaries computed by a
Perceptron with four neurons; by putting them together we can get good
separation of the classes.
LINEAR REGRESSION
In regression we fit a line to data, as distinct from classification problems,
where we find a line that separates out the classes so that they can be
distinguished. However, it is common to turn classification problems into
regression problems. This can be done in two ways: first by introducing an
indicator variable, which says which class each data point belongs to; the
problem is then to use the data to predict the indicator variable, which is a
regression problem. The second approach is to do repeated regression, once
for each class, with the indicator value being 1 for examples in that class
and 0 for the others.

For regression we make a prediction about an unknown value y (such as
the indicator variable for classes or a future value of some data) by
computing some function of known values xi. Thinking about straight lines,
the output y is going to be a sum of the xi values, each multiplied by a
constant parameter:

y = Σ_{i=0}^{m} βi xi    (3.21)

The βi define a straight line (plane in 3D, hyperplane in higher dimensions)
that goes through (or at least near) the data points. Figure 3.13 shows this
in two and three dimensions.

The question is how we define the line (plane or hyperplane in higher
dimensions) that best fits the data. The most common solution is to try to
minimise the distance between each data point and the line that we fit. We
can measure the distance between a point and a line by defining another
line that goes through the point and hits the line, and we then try to
minimise an error function that measures the sum of all these distances. If
we ignore the square roots and just minimise the sum-of-squares of the
errors, then we get the most common minimisation, which is known as
least-squares optimisation. What we are doing is choosing the parameters
in order to minimise the squared difference between the prediction and the
actual data value, summed over all of the data points. This can be written
in matrix form as:

(t − Xβ)T (t − Xβ),    (3.22)

where t is a column vector containing the targets and X is the matrix of
input values (even including the bias inputs), just as for the Perceptron.
Computing the smallest value of this means differentiating it with respect
to the (column) parameter vector β and setting the derivative to 0, which
gives XT(t − Xβ) = 0 (to see this, expand out the brackets, remembering
that (AB)T = BT AT, and note that the terms βT XT t and tT Xβ are both
scalars and equal). This has the solution

β = (XT X)−1 XT t

(assuming that the matrix XT X can be inverted). Given an input vector z,
the prediction is zβ. The inverse of a matrix X is the matrix that satisfies
XX−1 = I, where I is the identity matrix, the matrix that has 1s on the
leading diagonal and 0s everywhere else. The inverse of a matrix only
exists if the matrix is square and its determinant is non-zero.
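A minimal sketch of this solution in Python (the data points below are made up for illustration, scattered around y = 2x + 1):

```python
import numpy as np

# Least-squares fit: beta = (X^T X)^{-1} X^T t, as in the text.
x = np.array([0.0, 1.0, 2.0, 3.0])
t = np.array([1.1, 2.9, 5.2, 6.8])

# Input matrix including a bias column (a column of 1s here; the
# sign convention does not matter, since beta adapts to it).
X = np.column_stack([np.ones_like(x), x])

beta = np.linalg.inv(X.T @ X) @ X.T @ t
print(beta)               # roughly [1, 2]: intercept and slope

z = np.array([1.0, 4.0])  # new input (bias, x = 4)
print(z @ beta)           # prediction z @ beta, roughly 9
```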
The Multi-layer Perceptron
• Linear models are easy to understand and use, but they come with
the inherent cost implied by the word 'linear'.
• They can only identify straight lines, planes, or hyperplanes.
• The majority of interesting problems are not linearly separable.
• Problems can be made linearly separable if we can work out how to
transform the features suitably. However, the learning in the
neural network happens in the weights.
• To perform more computation it seems sensible to add more
weights.
• There are two things that we can do: add some backwards
connections, so that the output neurons connect to the inputs
again, or add more neurons.
• The first approach leads into recurrent networks.
• These have been studied, but are not that commonly used.
• Instead consider the second approach, add neurons between the input
nodes and the outputs, and this will make more complex neural
networks.
• First we consider why adding extra layers of nodes makes a neural
network more powerful.
• A prepared network can solve the two-dimensional XOR problem,
something that we have seen is not possible for a linear model like
the Perceptron.

FIGURE 4.1 The Multi-layer Perceptron network, consisting of multiple
layers of connected neurons.
• A suitable network is shown in Figure 4.2.
• To check that it gives the correct answers, all that is required is to put in
each input and work through the network, treating it as two different
Perceptrons, first computing the activations of the neurons in the middle
layer (labelled as C and D in Figure 4.2) and then using those activations
as the inputs to the single neuron at the output.
• As an example, I’ll work out what happens when you put in (1, 0) as an
input; the job of checking the rest is up to you.

FIGURE 4.2 A Multi-layer Perceptron


network showing a set of weights
that solve the XOR problem.
• Input (1, 0) corresponds to node A being 1 and B being 0.
• The input to neuron C is therefore
−1 × 0.5 + 1 × 1 + 0 × 1 = −0.5 + 1 = 0.5
• This is above the threshold of 0, and so neuron C fires, giving output 1. For
neuron D the input is
−1 × 1 + 1 × 1 + 0 × 1 = −1 + 1 = 0
and so it does not fire, giving output 0.
• Therefore the input to neuron E is
−1×0.5+1×1+0×−1 = 0.5 so neuron E fires.
• Checking the result of the inputs should persuade you that neuron E fires
when inputs A and B are different to each other, but does not fire when they
are the same, which is exactly the XOR function (it doesn’t matter that the fire
and not fire have been reversed).
• Since this network can solve a problem that the Perceptron cannot, it
seems worth looking into further.
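All four cases can be checked with a short script, using the weights read off from the worked arithmetic above:

```python
import numpy as np

def fires(inputs, weights):
    """A McCulloch-Pitts neuron with a bias input of -1, threshold 0."""
    x = np.concatenate(([-1.0], inputs))
    return 1 if x @ weights > 0 else 0

# Weights as used in the worked Figure 4.2 example:
# each vector is [bias weight, weight from first input, from second].
w_C = np.array([0.5, 1.0, 1.0])
w_D = np.array([1.0, 1.0, 1.0])
w_E = np.array([0.5, 1.0, -1.0])

for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]:
    c = fires(np.array([a, b]), w_C)
    d = fires(np.array([a, b]), w_D)
    e = fires(np.array([c, d]), w_E)
    print(f"({a},{b}) -> {e}")  # fires exactly when A and B differ (XOR)
```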
• The MLP is one of the most common neural networks in use.
• It is often treated as a ‘black box’, in that people use it
without understanding how it works, which often results in
fairly poor results.
GOING FORWARDS
• Training the MLP consists of two parts: working out what the
outputs are for the given inputs and the current weights,
and then updating the weights according to the error, which
is a function of the difference between the outputs and the
targets.
• These are generally known as going forwards and backwards
through the network.
• How to go forwards for the MLP is exactly what the XOR example showed.
• It is the same as for the Perceptron, except that we have to do it twice,
once for each set of neurons, and we need to do it layer by layer, because
otherwise the input values to the second layer don't exist.
• Having made an MLP with two layers of nodes, there is no reason why we
can't make one with 3, or 4, or 20 layers of nodes.
• Work forwards through the network computing the activations of one
layer of neurons and using those as the inputs for the next layer.
• Start at the left by filling in the values for the inputs.
• Use these inputs and the first level of weights to calculate the
activations of the hidden layer, and then use those activations and the
next set of weights to calculate the activations of the output layer.
• Now got the outputs of the network, and compare them to the targets
and compute the error.
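A sketch of this forward pass for a two-layer MLP (layer sizes and weights are illustrative; the sigmoid activation used here is only introduced formally in the back-propagation section below):

```python
import numpy as np

def sigmoid(h, beta=1.0):
    return 1.0 / (1.0 + np.exp(-beta * h))

def forward(x, V, W):
    """One forward pass through a two-layer MLP.
    V: first-layer weights, (L+1) x M.  W: second-layer weights, (M+1) x N."""
    x = np.concatenate(([-1.0], x))            # bias input
    hidden = sigmoid(x @ V)                    # hidden-layer activations
    hidden = np.concatenate(([-1.0], hidden))  # bias for the next layer
    return sigmoid(hidden @ W)                 # output-layer activations

# Illustrative shapes (assumed): 2 inputs, 3 hidden nodes, 1 output.
rng = np.random.default_rng(0)
V = rng.uniform(-0.5, 0.5, size=(3, 3))   # (2+1) x 3
W = rng.uniform(-0.5, 0.5, size=(4, 1))   # (3+1) x 1
print(forward(np.array([1.0, 0.0]), V, W))
```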
BIASES
• Include a bias input to each neuron.
• Do this in the same way as for the Perceptron by having an extra input that
is permanently set to -1, and adjusting the weights to each neuron as part
of the training.
• Each neuron in the network (whether it is in a hidden layer or the output)
has one extra input, with fixed value −1.
GOING BACKWARDS: BACK-PROPAGATION OF ERROR
• Computing the errors at the output is no more difficult than it was for the
Perceptron, but working out what to do with those errors is more difficult.
• The method that we are going to look at is called back-propagation of error,
which makes it clear that the errors are sent backwards through the network.
• It is a form of gradient descent.
• The best way to describe back-propagation properly is mathematically,
but this can be intimidating and difficult to get a handle on at first.
• There are actually just three things that we need to know, all of which are
from differential calculus:
– The derivative of ½x² (which is x),
– The fact that if you differentiate a function of x with respect to some
other variable t, then the answer is 0,
– The chain rule, which tells us how to differentiate composite functions.
• When we talked about the Perceptron, we changed the weights so that the
neurons fired when the targets said they should, and didn't fire when the
targets said they shouldn't.
• What we did was to choose an error function for each neuron k:
Ek = yk − tk, and tried to make it as small as possible.
• Since there was only one set of weights in the network, this was
sufficient to train the network.
• We still want to minimise the error, so that neurons fire only when they
should, but with the addition of extra layers of weights this is harder to
arrange.
• The problem is that when we try to adapt the weights of the Multi-layer
Perceptron, we have to work out which weights caused the error.
• This could be the weights connecting the inputs to the hidden layer, or
the weights connecting the hidden layer to the output layer.
• The error function that we used for the Perceptron was

E = Σ_{k=1}^{N} (yk − tk)

where N is the number of output nodes. However, suppose that we make
two errors.
• In the first, the target is bigger than the output, while in the second the
output is bigger than the target.
• If these two errors are the same size, then if we add them up we could get
0, which means that the error value suggests that no error was made.
• To get around this we need to make all errors have the same sign.
• The best way is the sum-of-squares error function, which calculates the
difference between y and t for each node, squares them, and adds them
all together:

E(t, y) = ½ Σ_{k=1}^{N} (yk − tk)²
• If we differentiate a function, then it gives us the gradient of that
function, which is the direction along which it increases and decreases
the most.
• So if we differentiate an error function, we get the gradient of the error.
• Since the purpose of learning is to minimise the error, we follow the
error function downhill.
• Imagine a ball rolling around on a surface that looks like the line in
Figure 4.3. Gravity will make the ball roll downhill (follow the downhill
gradient) until it ends up in the bottom of one of the hollows.
• These are places where the error is small, so that is exactly what we
want.
• This is why the algorithm is called gradient descent.
• There are only three things in the network that change: the inputs, the
activation function that decides whether or not the node fires, and the
weights.
• The first and second are out of our control when the algorithm is running, so
only the weights matter, and therefore they are what we differentiate with
respect to.
• The problem is that we need that jump between firing and not firing to make
it act like a neuron.
• We can solve the problem if we can find an activation function that looks like
a threshold function, but is differentiable so that we can compute the
gradient.
• If you squint at a graph of the threshold function (for example, Figure 4.4)
then it looks kind of S-shaped.
• There is a mathematical form of S-shaped functions, called sigmoid
functions (see Figure 4.5).
• The most commonly used form of this function (where β is some positive
parameter) is:

g(h) = 1 / (1 + exp(−βh))    (4.2)

• In some texts you will see the activation function given a different form, as:

g(h) = tanh(h)

which is the hyperbolic tangent function.
• This is a different but similar function; it is still a sigmoid function, but it
saturates (reaches its constant values) at ±1 instead of 0 and 1, which is
sometimes useful. It also has a relatively simple derivative:

d/dh tanh(h) = 1 − tanh²(h)

• We can convert between the two easily, because if the saturation points
are (±1), then we can convert to (0, 1) by using 0.5 × (x + 1).
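A small sketch of these functions in Python (the βg(1 − g) form of the sigmoid derivative is standard, though it is not written out in the text above):

```python
import numpy as np

def sigmoid(h, beta=1.0):
    """Equation (4.2): g(h) = 1 / (1 + exp(-beta * h))."""
    return 1.0 / (1.0 + np.exp(-beta * h))

def sigmoid_deriv(h, beta=1.0):
    """The derivative has the convenient form beta * g(h) * (1 - g(h))."""
    g = sigmoid(h, beta)
    return beta * g * (1.0 - g)

def tanh_to_01(x):
    """Map tanh's (-1, 1) range onto (0, 1) via 0.5 * (x + 1)."""
    return 0.5 * (np.tanh(x) + 1.0)

print(sigmoid(0.0), sigmoid_deriv(0.0))  # 0.5, 0.25
print(tanh_to_01(0.0))                   # 0.5
```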
• We now have a new form of error computation and a new activation
function that decides whether or not a neuron should fire.
• We can compute the gradient of these errors and use it to decide how
much to update each weight in the network.
• We will do that first for the nodes connected to the output layer, and
after we have updated those, we will work backwards through the
network until we get back to the inputs again.
• There are just two problems:
• for the output neurons, we don’t know the inputs.
• for the hidden neurons, we don’t know the targets; for extra hidden
layers, we know neither the inputs nor the targets, but even this won’t
matter for the algorithm we derive.
• The chain rule of differentiation tells us that if we want to know how the
error changes as we vary the weights, we can think about how the error
changes as we vary the inputs to the weights, and multiply this by how
those input values change as we vary the weights.
• This is useful because it lets us calculate all of the derivatives that we
want to:
• we can write the activations of the output nodes in terms of the
activations of the hidden nodes and the output weights, and then we can
send the error calculations back through the network to the hidden layer
to decide what the target outputs were for those neurons.
• Do exactly the same computations if the network has extra hidden layers
between the inputs and the outputs.
• It gets harder to keep track of which functions we should be
differentiating, but there are no new tricks needed.
• We compute the gradients of the errors with respect to the weights and
change the weights so that we go downhill, which makes the errors get
smaller.
• We do this by differentiating the error function with respect to the
weights, but we can’t do this directly, so we have to apply the chain rule
and differentiate with respect to things that we know.
• This leads to two different update functions, one for each of the sets of
weights, and we just apply these backwards through the network,
starting at the outputs and ending up back at the inputs.

The Multi-layer Perceptron Algorithm


• Look at some practical issues, such as how much training data is needed,
how much training time is needed, and how to choose the correct size of
network.
• We will assume that there are L input nodes, plus the bias, M hidden
nodes, also plus a bias, and N output nodes, so that there are (L+1)×M
weights between the input and the hidden layer and (M+1)×N between
the hidden layer and the output.
• The sums will start from 0 if they include the bias nodes and 1 otherwise,
and run up to L,M, or N, so that x0 = −1 is the bias input, and a0 = −1 is
the bias hidden node.
• The algorithm that is described could have any number of hidden layers,
in which case there might be several values for M, and extra sets of
weights between the hidden layers.
• We will also use i, j, k to index the nodes in each layer in the sums, and
the corresponding Greek letters for fixed indices.

• Quick summary of how the algorithm works, and then the full MLP
training algorithm using back-propagation of error is described.
• 1. an input vector is put into the input nodes
• 2. the inputs are fed forward through the network (Figure 4.6)
• the inputs and the first-layer weights (here labelled as v) are used to
decide whether the hidden nodes fire or not. The activation function g(·)
is the sigmoid function given in Equation (4.2) above
• the outputs of these neurons and the second-layer weights (labelled as
w) are used to decide if the output neurons fire or not
• 3. The error is computed as the sum-of-squares difference between the
network outputs and the targets
• 4. this error is fed backwards through the network in order to
• first update the second-layer weights
• and then afterwards, the first-layer weights

FIGURE 4.6 The forward direction in a Multi-layer Perceptron.


• Sometimes even 5000 iterations are not enough for the XOR function, and
more are needed.
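Putting the summary above together, here is a compact sketch of two-layer MLP training with back-propagation (the layer size, learning rate, and initialisation are illustrative assumptions, not the book's algorithm listing):

```python
import numpy as np

def sigmoid(h, beta=1.0):
    return 1.0 / (1.0 + np.exp(-beta * h))

def train_mlp(X, T, M=3, eta=0.25, iterations=5001, seed=1):
    """Sketch of two-layer MLP training with back-propagation.
    X: inputs (no bias column), T: targets in {0, 1}."""
    rng = np.random.default_rng(seed)
    X = np.hstack([-np.ones((X.shape[0], 1)), X])      # bias inputs (-1)
    L, N = X.shape[1], T.shape[1]
    V = rng.uniform(-0.5, 0.5, (L, M))                 # first-layer weights
    W = rng.uniform(-0.5, 0.5, (M + 1, N))             # second-layer weights

    for _ in range(iterations):
        # Going forwards
        A = sigmoid(X @ V)                             # hidden activations
        A = np.hstack([-np.ones((A.shape[0], 1)), A])  # bias hidden node
        Y = sigmoid(A @ W)                             # network outputs

        # Going backwards: deltas from the sum-of-squares error
        delta_o = (Y - T) * Y * (1 - Y)                # output layer
        delta_h = A * (1 - A) * (delta_o @ W.T)        # hidden layer

        W -= eta * A.T @ delta_o                       # second layer first
        V -= eta * X.T @ delta_h[:, 1:]                # then first layer
    return V, W

# XOR, as in the text's example.
X = np.array([[0, 0], [0, 1], [1, 0], [1, 1]], dtype=float)
T = np.array([[0], [1], [1], [0]], dtype=float)
V, W = train_mlp(X, T)
A = np.hstack([-np.ones((4, 1)), sigmoid(X @ V)])
# With luck, close to [[0], [1], [1], [0]]; as noted above, 5000
# iterations are sometimes not enough for XOR.
print(np.round(sigmoid(A @ W), 2))
```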
