
Weights and biases

Weights in an ANN are the most important factor in converting an input into an output. They are similar to the slope in linear regression, where each input is multiplied by a weight and the results are added up to form the output. Weights are numerical parameters that determine how strongly each neuron affects the others.

For a typical neuron, if the inputs are x1, x2, and x3, then the synaptic weights applied to them are denoted as w1, w2, and w3.

The output is

output = Σ (xi ⋅ wi)

where i runs from 1 to the number of inputs.

Simply put, this is a dot product (a matrix multiplication) that yields the weighted sum.

Bias is like the intercept added in a linear equation. It is an additional parameter which is used to adjust the output along with the weighted sum of the inputs to the neuron.

The processing done by a neuron is thus denoted as:

output = Σ (xi ⋅ wi) + bias

A function, called an activation function, is then applied to this output. The input of the next layer is the output of the neurons in the previous layer.
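To make this concrete, here is a minimal sketch of a single neuron's computation in Python; the input values, weights, bias, and the choice of a sigmoid activation are all arbitrary assumptions for illustration:

import numpy as np

def neuron(x, w, b):
    # Weighted sum of inputs plus bias: sum(x_i * w_i) + b
    z = np.dot(x, w) + b
    # Apply an activation function to the result (sigmoid, chosen arbitrarily)
    return 1.0 / (1.0 + np.exp(-z))

output = neuron(x=np.array([0.5, -1.2, 0.3]),
                w=np.array([0.4, 0.1, -0.6]),
                b=0.05)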
Loss Functions

A loss function is used to optimize the parameter values in a neural network model. Loss functions map a set of parameter values for the network onto a scalar value that indicates how well those parameters accomplish the task the network is intended to do.

There are several common loss functions provided by theanets. These losses often measure
the squared or absolute error between a network’s output and some target or desired output. Other
loss functions are designed specifically for classification models; the cross-entropy is a common loss
designed to minimize the distance between the network’s distribution over class labels and the
distribution that the dataset defines.

Models in theanets have at least one loss to optimize during training. There are default losses for
each of the built-in model types, but you can often override these defaults just by providing a non-
default value for the loss keyword argument when creating your model. For example, to create a
regression model with a mean absolute error loss:

net = theanets.Regressor([10, 20, 3], loss='mae')

This will create the regression model with the specified loss.

Predefined Losses

These loss functions are available for neural network models.

Loss(target[, weight, weighted, output_name]) A loss function base class.

CrossEntropy(target[, weight, weighted, ...]) Cross-entropy (XE) loss function for classifiers.

GaussianLogLikelihood([mean_name, ...]) Gaussian Log Likelihood (GLL) loss function.

Hinge(target[, weight, weighted, output_name]) Hinge loss function for classifiers.

KullbackLeiblerDivergence(target[, weight, ...]) The KL divergence loss is computed over probability distributions.
MaximumMeanDiscrepancy([kernel]) Maximum Mean Discrepancy (MMD) loss function.

MeanAbsoluteError(target[, weight, ...]) Mean-absolute-error (MAE) loss function.

MeanSquaredError(target[, weight, weighted, ...]) Mean-squared-error (MSE) loss function.
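For example, assuming each of these losses is registered under its lowercase short name (as with 'mae' earlier), a classifier could be given the hinge loss instead of its default cross-entropy (the layer sizes here are arbitrary):

net = theanets.Classifier([784, 100, 10], loss='hinge')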

Multiple Losses

A theanets model can actually have more than one loss that it attempts to optimize simultaneously,
and these losses can change between successive calls to train(). In fact, a model has
a losses attribute that’s just a list of theanets.Loss instances; these losses are weighted by
a weight attribute, then summed and combined with any applicable regularizers during each call
to train().

Let’s say that you want to optimize a model using both the mean absolute and the mean squared
error. You could first create a regular regression model:

net = theanets.Regressor([10, 20, 3])

and then add a new loss to the model:

net.add_loss('mae')

Then, when you call:

net.train(...)

the model will attempt to minimize the sum of the two losses.

You can specify the relative weight of the two losses by manipulating the weight attribute of each
loss instance. For instance, if you want the MAE loss to be twice as strong as the MSE loss:

net.losses[1].weight = 2
net.train(...)

Finally, if you want to reset the loss to the standard MSE:

net.set_loss('mse', weight=1)

(Here we’ve also shown how to specify the weight of the loss when adding or setting it to the model.)
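Putting these pieces together, a minimal sketch of the whole workflow might look like this; the random arrays are placeholder data, and we assume that train() accepts an [inputs, targets] pair of arrays and that add_loss() accepts the same weight keyword shown for set_loss():

import numpy as np
import theanets

# Placeholder data: 100 samples with 10 inputs and 3 regression targets.
inputs = np.random.randn(100, 10).astype('f')
targets = np.random.randn(100, 3).astype('f')

net = theanets.Regressor([10, 20, 3])  # default loss: MSE
net.add_loss('mae', weight=2)          # add MAE, twice as strong as the MSE
net.train([inputs, targets])           # minimizes mse + 2 * mae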

Using Weighted Targets

By default, the network models available in theanets treat all inputs as equal when computing the
loss for the model. For example, a regression model treats an error of 0.1 in component 2 of the
output just the same as an error of 0.1 in component 3, and each example of a minibatch is treated
with equal importance when training a classifier.

However, there are times when all inputs to a neural network model are not to be treated equally. This
is especially evident in recurrent models: sometimes, the inputs to a recurrent network might not
contain the same number of time steps, but because the inputs are presented to the model using a
rectangular minibatch array, all inputs must somehow be made to have the same size. One way to
address this would be to cut off all inputs at the length of the shortest input, but then the network is
not exposed to all input/output pairs during training.

Weighted targets can be used for any model in theanets. For example, an autoencoder could use
an array of weights containing zeros and ones to solve a matrix completion task, where the input
array contains some “unknown” values. In such a case, the network is required to reproduce the
known values exactly (so these could be presented to the model with weight 1), while filling in the
unknowns with statistically reasonable values (which could be presented to the model during training
with weight 0).

As another example, suppose a classifier model is being trained in a binary classification task where
one of the classes—say, class A—is only present 0.1% of the time. In such a case, the network can
achieve 99.9% accuracy by always predicting class B, so during training it might be important to
ensure that errors in predicting A are “amplified” when computing the loss. You could provide a large
weight for training examples in class A to encourage the model not to miss these examples.

All of these cases are possible to model in theanets; just include weighted=True when you create
your model:

net = theanets.recurrent.Autoencoder([3, (10, 'rnn'), 3], weighted=True)

When training a weighted model, the training and validation datasets require an additional
component: an array of floating-point values with the same shape as the expected output of the
model. For example, a non-recurrent Classifier model would require a weight vector with each
minibatch, of the same shape as the labels array, so that the training and validation datasets would
each have three pieces: sample, label, and weight. Each value in the weight array is used as the
weight for the corresponding error when computing the loss.
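As a sketch of what this looks like in practice, the following assembles a weighted dataset for a non-recurrent Classifier; the shapes and the 100x up-weighting of the rare class are assumptions for illustration:

import numpy as np
import theanets

net = theanets.Classifier([10, 20, 2], weighted=True)

# Placeholder data: 100 samples with 10 features and binary labels.
samples = np.random.randn(100, 10).astype('f')
labels = np.random.randint(0, 2, size=100).astype('i')

# One weight per label; the rare class (label 0) is up-weighted 100x.
weights = np.where(labels == 0, 100.0, 1.0).astype('f')

net.train([samples, labels, weights])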

Custom Losses

It's pretty straightforward to create models in theanets that use losses different from those of the predefined theanets.Classifier, theanets.Autoencoder, and theanets.Regressor models. (The classifier uses categorical cross-entropy (XE) as its default loss, and the other two both use mean squared error, MSE.)

To define a model with a new loss, just create a new theanets.Loss subclass and specify its name when you create your model. For example, to create a regression model that uses a step function averaged over all of the model outputs:

class Step(theanets.Loss):
    def __call__(self, outputs):
        return (outputs[self.output_name] > 0).mean()

net = theanets.Regressor([5, 6, 7], loss='step')

Your loss function implementation must return a Theano expression that reflects the loss for your
model. If you wish to make your loss work with weighted outputs, you will also need to include a case
for having weights:

class Step(theanets.Loss):
    def __call__(self, outputs):
        step = outputs[self.output_name] > 0
        if self._weights:
            return (self._weights * step).sum() / self._weights.sum()
        else:
            return step.mean()
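Assuming the same lowercase name registration as before, a weighted model using this custom loss could then be created like so:

net = theanets.Regressor([5, 6, 7], loss='step', weighted=True)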
Multi-layered networks
Multi-layered networks consist of several layers of networks, where each node appears in at least one of these layers. The networks are connected both by intra-layer links (links within one layer) and by inter-layer links (links between layers). This can be seen in social networks, where multiple types of social ties exist at the same time (private or professional). In real communication networks, such as a peer-to-peer network, one can draw a logical network (the connectedness of peers with each other) as well as a physical network (the way peers are connected through cables, hubs and data centres).

Research on single-layer networks is mostly a simplification of the real world: a social network is multi-layered, since we have different networks based on the type of relation with other individuals. Although the different layers of a network are usually only partly separated, it is interesting to estimate how diffusion and failures propagate between nodes, based on the properties of intra-layer and inter-layer links.

Backpropagation

The goals of backpropagation are straightforward: adjust each weight in the network in proportion to how much it contributes to overall error. If we iteratively reduce each weight's error, eventually we'll have a series of weights that produce good predictions.

Chain rule refresher

As seen above, forward propagation can be viewed as a long series of nested equations. If you think of feed-forward this way, then backpropagation is merely an application of the chain rule to find the derivatives of cost with respect to any variable in the nested equation. Given a forward propagation function:

f(x) = A(B(C(x)))

A, B, and C are activation functions at different layers. Using the chain rule we easily calculate the derivative of f(x) with respect to x:

f′(x) = f′(A) ⋅ A′(B) ⋅ B′(C) ⋅ C′(x)

How about the derivative with respect to B? To find the derivative with respect to B, you can pretend B(C(x)) is a constant, replace it with a placeholder variable B, and proceed to find the derivative normally with respect to B.

f′(B) = f′(A) ⋅ A′(B)

This simple technique extends to any variable within a function and allows us to precisely pinpoint the
exact impact each variable has on the total output.
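As a quick numerical sanity check of this technique, the following sketch picks arbitrary functions for A, B, and C and compares the chain-rule derivative against a finite-difference estimate:

import math

# Arbitrary example functions and their derivatives
A = lambda u: u ** 2
A_prime = lambda u: 2 * u
B = math.sin
B_prime = math.cos
C = lambda u: 3 * u
C_prime = lambda u: 3

def f(x):
    return A(B(C(x)))

def f_prime(x):
    # Chain rule: evaluate each derivative at the value its layer receives
    return A_prime(B(C(x))) * B_prime(C(x)) * C_prime(x)

x = 0.7
numeric = (f(x + 1e-6) - f(x - 1e-6)) / 2e-6  # finite-difference estimate
print(f_prime(x), numeric)                    # the two should agree closely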

Applying the chain rule

Let's use the chain rule to calculate the derivative of cost with respect to any weight in the network. The chain rule will help us identify how much each weight contributes to our overall error and the direction to update each weight to reduce our error. Here are the equations we need to make a prediction and calculate total error, or cost:

Zh = X ⋅ Wh        H = R(Zh)
Zo = H ⋅ Wo        ŷ = R(Zo)
Cost = C(ŷ) = 0.5 (ŷ − y)²

Given a network consisting of a single neuron, total cost could be calculated as:

Cost = C(R(Z(XW)))

Using the chain rule we can easily find the derivative of Cost with respect to weight W.

C′(W) = C′(R) ⋅ R′(Z) ⋅ Z′(W) = (ŷ − y) ⋅ R′(Z) ⋅ X
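To see that this formula holds, here is a small check with arbitrary scalar values, using ReLU for R and the squared-error cost from later in this section:

def relu(z):
    return max(z, 0.0)

def relu_prime(z):
    return 1.0 if z > 0 else 0.0

def cost(W, X=2.0, y=1.5):
    yHat = relu(X * W)           # single neuron: Z = X*W, yHat = R(Z)
    return 0.5 * (yHat - y) ** 2

W, X, y = 0.8, 2.0, 1.5
yHat = relu(X * W)
analytic = (yHat - y) * relu_prime(X * W) * X       # (yHat - y) * R'(Z) * X
numeric = (cost(W + 1e-6) - cost(W - 1e-6)) / 2e-6  # finite-difference estimate
print(analytic, numeric)                            # should match closely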

Now that we have an equation to calculate the derivative of cost with respect to any weight, let's go back to our toy neural network example above.

What is the derivative of cost with respect to Wo?


C′(Wo) = C′(ŷ) ⋅ ŷ′(Zo) ⋅ Zo′(Wo) = (ŷ − y) ⋅ R′(Zo) ⋅ H
And how about with respect to Wh? To find out we just keep going further back in our function, applying the chain rule recursively until we get to the function that has the Wh term.

C′(Wh) = C′(ŷ) ⋅ ŷ′(Zo) ⋅ Zo′(H) ⋅ H′(Zh) ⋅ Zh′(Wh) = (ŷ − y) ⋅ R′(Zo) ⋅ Wo ⋅ R′(Zh) ⋅ X

And just for fun, what if our network had 10 hidden layers? What is the derivative of cost for the first weight w1?

C′(w1) = (dC/dŷ) ⋅ (dŷ/dZ11) ⋅ (dZ11/dH10) ⋅ (dH10/dZ10) ⋅ (dZ10/dH9) ⋅ (dH9/dZ9) ⋅ (dZ9/dH8) ⋅ (dH8/dZ8) ⋅ (dZ8/dH7) ⋅ (dH7/dZ7) ⋅ (dZ7/dH6) ⋅ (dH6/dZ6) ⋅ (dZ6/dH5) ⋅ (dH5/dZ5) ⋅ (dZ5/dH4) ⋅ (dH4/dZ4) ⋅ (dZ4/dH3) ⋅ (dH3/dZ3) ⋅ (dZ3/dH2) ⋅ (dH2/dZ2) ⋅ (dZ2/dH1) ⋅ (dH1/dZ1) ⋅ (dZ1/dW1)

See the pattern? The number of calculations required to compute cost derivatives increases as our
network grows deeper. Notice also the redundancy in our derivative calculations. Each layer’s cost
derivative appends two new terms to the terms that have already been calculated by the layers above
it. What if there was a way to save our work somehow and avoid these duplicate calculations?

Saving work with memoization

Memoization is a computer science term which simply means: don't recompute the same thing over and over. In memoization we store previously computed results to avoid recalculating the same function. It's handy for speeding up recursive functions, of which backpropagation is one. Notice the pattern in the derivative equations above.

Each of these layers is recomputing the same derivatives! Instead of writing out long derivative equations for every weight, we can use memoization to save our work as we backprop error through the network. To do this, we define 3 equations (below), which together encapsulate all the calculations needed for backpropagation. The math is the same, but the equations provide a nice shorthand we can use to track which calculations we've already performed and save our work as we move backwards through the network.

We first calculate the output layer error and pass the result to the hidden layer before it. After calculating the hidden layer error, we pass its error value back to the previous hidden layer before it. And so on and so forth. As we move back through the network we apply the 3rd formula at every layer to calculate the derivative of cost with respect to that layer's weights. This resulting derivative tells us in which direction to adjust our weights to reduce overall cost.

Note

The term layer error refers to the derivative of cost with respect to a layer’s input. It
answers the question: how does the cost function output change when the input to
that layer changes?

Output layer error

To calculate output layer error we need to find the derivative of cost with respect to the output layer input, Zo. It answers the question: how are the final layer's weights impacting overall error in the network? The derivative is then:

C′(Zo) = (ŷ − y) ⋅ R′(Zo)

To simplify notation, ML practitioners typically replace the (ŷ − y) ⋅ R′(Zo) sequence with the term Eo. So our formula for output layer error equals:

Eo = (ŷ − y) ⋅ R′(Zo)

Hidden layer error

To calculate hidden layer error we need to find the derivative of cost with respect to the hidden layer input, Zh.

C′(Zh) = (ŷ − y) ⋅ R′(Zo) ⋅ Wo ⋅ R′(Zh)

Next we can swap in the Eo term above to avoid duplication and create a new simplified equation for hidden layer error:

Eh = Eo ⋅ Wo ⋅ R′(Zh)

This formula is at the core of backpropagation. We calculate the current layer’s error,
and pass the weighted error back to the previous layer, continuing the process until
we arrive at our first hidden layer. Along the way we update the weights using the
derivative of cost with respect to each weight.

Derivative of cost with respect to any weight

Let's return to our formula for the derivative of cost with respect to the output layer weight Wo.

C′(Wo) = (ŷ − y) ⋅ R′(Zo) ⋅ H

We know we can replace the first part with our equation for output layer error Eo. H represents the hidden layer activation.

C′(Wo) = Eo ⋅ H

So to find the derivative of cost with respect to any weight in our network, we simply multiply the corresponding layer's error times its input (the previous layer's output).

C′(w) = CurrentLayerError ⋅ CurrentLayerInput
Note

Input refers to the activation from the previous layer, not the weighted input, Z.

Summary

Here are the final 3 equations that together form the foundation of backpropagation:

Eo = (ŷ − y) ⋅ R′(Zo)
Eh = Eo ⋅ Wo ⋅ R′(Zh)
C′(w) = CurrentLayerError ⋅ CurrentLayerInput

The code example below walks through this process using our toy neural network from above.
Code example
def relu(z):
    return max(z, 0)

def relu_prime(z):
    if z > 0:
        return 1
    return 0

def cost(yHat, y):
    return 0.5 * (yHat - y)**2

def cost_prime(yHat, y):
    return yHat - y

def feed_forward(x, Wh, Wo):
    # Weighted inputs and activations for the hidden and output layers
    Zh = x * Wh
    H = relu(Zh)
    Zo = H * Wo
    yHat = relu(Zo)
    return Zh, H, Zo, yHat

def backprop(x, y, Wh, Wo, lr):
    Zh, H, Zo, yHat = feed_forward(x, Wh, Wo)

    # Layer error
    Eo = (yHat - y) * relu_prime(Zo)
    Eh = Eo * Wo * relu_prime(Zh)

    # Cost derivative for weights
    dWo = Eo * H
    dWh = Eh * x

    # Update weights
    Wh -= lr * dWh
    Wo -= lr * dWo
    return Wh, Wo
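For illustration, a toy scalar training loop built on these functions might look like the following; the data point, initial weights, and learning rate are arbitrary:

Wh, Wo = 0.5, 0.4  # arbitrary initial weights
for step in range(100):
    Wh, Wo = backprop(x=1.0, y=2.0, Wh=Wh, Wo=Wo, lr=0.05)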
