
Artificial Neural Network (1)
ANN

Biological Neurons
• The human brain is composed of a huge number of
neurons (on the order of 10 billion), each with
connections to tens of thousands of other neurons.
• Each neuron receives electrochemical inputs from
other neurons, and if the sum of these electrical
inputs exceeds a certain level, the neuron then
triggers an output transmission of another
electrochemical signal to its attached neurons.
• If the input does not exceed this level, the neuron
does not trigger any output.
• Each neuron is a very simple processing unit capable
of only very limited functionality, but in combination
with a large number of other neurons connected in
various patterns and layers, the brain is capable of
performing extremely complex tasks ranging from
telling apart a cat from a dog to grasping deep
philosophical concepts.

https://en.wikipedia.org/wiki/Biological_neuron_model
Artificial Neuron: Threshold Logic Unit (TLU)
• An artificial neuron is a very simple model of the
biological neuron.
• It has one or more binary (on/off) inputs and one
binary output.
• It is the simplest element, or building block, of a
neural network.
• The artificial neuron simply activates its output when
more than a certain number of its inputs are active.
• Training a neuron in this case (shown in this figure)
means finding the right values for $w_1$, $w_2$, and $w_3$.

• A single neuron can be used for simple linear binary
classification. It computes a linear combination of the
inputs and, if the result exceeds a threshold, it
outputs the positive class; otherwise it outputs the
negative class.


• It is common to draw special pass-through neurons
called input neurons: they just output whatever input
they are fed. All the input neurons form the input layer
• The size of the input layer is equal to the number of
features of the input data.
• Moreover, an extra bias feature is generally added (x0 =
1): it is typically represented using a special type of
neuron called a bias neuron, which just outputs 1 all
the time.
• The weights control the level of importance of each
input.
• Biases determine how easily a neuron fires or activates.
A bias value allows you to shift the activation function
to the left or right.
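
A minimal sketch of a threshold logic unit in plain Python may help make the roles of the weights and the bias concrete. The function name `tlu` and the example weights are illustrative assumptions, not values from the slides; the bias plays the role of the weight on the constant input x0 = 1.

```python
def tlu(inputs, weights, bias):
    """Threshold logic unit: output 1 if the weighted sum plus bias is positive, else 0."""
    z = sum(w * x for w, x in zip(weights, inputs)) + bias
    return 1 if z > 0 else 0

# Hypothetical weights: the neuron fires only when at least two of three inputs are on.
weights = [1.0, 1.0, 1.0]
bias = -1.5  # shifts the threshold: the sum of the inputs must exceed 1.5

print(tlu([1, 0, 0], weights, bias))  # 0
print(tlu([1, 1, 0], weights, bias))  # 1
```

Making the bias less negative (for example -0.5) would let the neuron fire with a single active input, which is the "how easily a neuron fires" effect described above.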

The Perceptron
• Perceptron is one of the simplest ANN
architectures
• A Perceptron is simply composed of a
single layer of neurons with each
neuron connected to all the inputs.
• When all the neurons in a layer are
connected to every neuron in the
previous layer (i.e., its input neurons), it
is called a fully connected layer or a
dense layer.
• A Perceptron with two inputs and three outputs is
represented in this figure.
• This Perceptron can classify instances
simultaneously into three different binary classes,
which makes it a multioutput classifier.

• It is possible to compute the outputs of a Perceptron for a single
instance, by using this equation: $h_{W,b}(X) = \phi(WX + b)$
• X represents the input instance with N features (N x 1).
• The weight matrix W contains all the connection weights except for the ones
from the bias neuron. It has one row per artificial neuron in the layer (M) and
one column per input neuron (N) ➔ M x N
• The bias vector b contains all the connection weights between the bias
neuron and the artificial neurons. It has one bias term per artificial neuron ➔
M x 1.
• The function ϕ is called the activation function.
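
As a rough illustration of the equation above, the following sketch computes the output of one layer for a single instance using plain Python lists. The step activation, the helper name `layer_output`, and all numeric values are assumptions chosen for the example.

```python
def step(z):
    """Heaviside step activation: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0

def layer_output(W, b, x, phi=step):
    """Compute phi(W x + b) for one instance x.

    W is M x N (one row per neuron), b has M entries, x has N entries.
    """
    return [phi(sum(w_ij * x_j for w_ij, x_j in zip(row, x)) + b_i)
            for row, b_i in zip(W, b)]

# Hypothetical Perceptron with N = 2 inputs and M = 3 output neurons.
W = [[1.0, 1.0],    # weights of neuron 1
     [1.0, -1.0],   # weights of neuron 2
     [-1.0, 0.0]]   # weights of neuron 3
b = [-1.5, 0.5, 0.5]
x = [1.0, 1.0]

print(layer_output(W, b, x))  # [1, 1, 0]
```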


Another Example


• It is possible to compute the outputs of a layer of artificial neurons
for several instances at once, by using the same equation:
$h_{W,b}(X) = \phi(WX + b)$
• Here X represents the matrix of input features. It has one column per
instance, one row per feature.
• The weight matrix W contains all the connection weights except for the ones
from the bias neuron. As before, it has one row per artificial neuron in the
layer and one column per input neuron, so that the product WX is defined.
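
The batch form can be sketched the same way, with one column of X per instance. This is only an illustration in plain Python; `matmul` and `layer_output_batch` are hypothetical helper names and the numbers are made up.

```python
def matmul(A, B):
    """Multiply an M x N matrix by an N x K matrix (both stored as lists of rows)."""
    return [[sum(a * b for a, b in zip(row, col)) for col in zip(*B)]
            for row in A]

def step(z):
    """Heaviside step activation: 1 if z >= 0, else 0."""
    return 1 if z >= 0 else 0

def layer_output_batch(W, b, X, phi):
    """Compute phi(W X + b), adding the bias b_i to every entry of row i."""
    Z = matmul(W, X)
    return [[phi(z + b_i) for z in row] for row, b_i in zip(Z, b)]

W = [[1.0, 1.0], [1.0, -1.0], [-1.0, 0.0]]   # M = 3 neurons, N = 2 inputs
b = [-1.5, 0.5, 0.5]
X = [[1.0, 0.0],    # feature 1 of instances 1 and 2
     [1.0, 0.0]]    # feature 2 of instances 1 and 2

H = layer_output_batch(W, b, X, step)
print(H)  # [[1, 0], [1, 1], [0, 1]]
```

Each column of `H` holds the three neuron outputs for one instance, mirroring the column layout of X.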

Multi-Layer Perceptron (MLP)
• MLP: stacking multiple Perceptrons.
• Within a layer neurons are not connected, but they
are connected to neurons of the next and previous
layers.
• An MLP is composed of one (pass-through) input
layer, one or more layers called hidden layers, and
one final layer called the output layer.
• The layers close to the input layer are usually called
the lower layers, and the ones close to the outputs
are usually called the upper layers.
• Every layer except the output layer includes a bias
neuron and is fully connected to the next layer.
• When an ANN contains a deep stack of
hidden layers, it is called a deep neural
network (DNN).


• The number of hidden layers should be specified by the programmer.
It is a hyper-parameter.
• Each neuron in a layer receives input from the previous layer and, if
activated, emits output to one or more neurons in the next layer.
• We perform the following calculations in each neuron of the network.
• FW1: $a^1 = x$
• FW2: $z^l = w^l a^{l-1} + b^l$
• FW3: $a^l = \sigma(z^l)$
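
The three feed-forward equations can be sketched as a short loop. This is an illustrative version in plain Python that assumes the logistic function for σ; the 2-3-1 network and its parameter values are hypothetical.

```python
import math

def sigmoid(z):
    """Logistic activation function."""
    return 1.0 / (1.0 + math.exp(-z))

def feedforward(weights, biases, x):
    """Apply FW1-FW3 and return the lists of z^l and a^l for every layer.

    weights[k] and biases[k] belong to layer l = k + 2 (layer 1 is the input).
    """
    a = list(x)                      # FW1: a^1 = x
    zs, activations = [], [a]
    for W, b in zip(weights, biases):
        z = [sum(w * a_j for w, a_j in zip(row, a)) + b_i   # FW2
             for row, b_i in zip(W, b)]
        a = [sigmoid(z_i) for z_i in z]                      # FW3
        zs.append(z)
        activations.append(a)
    return zs, activations

# Hypothetical 2-3-1 network with hand-picked parameters.
weights = [[[0.1, 0.2], [0.3, 0.4], [0.5, 0.6]],  # layer 2: 3 x 2
           [[0.7, 0.8, 0.9]]]                     # layer 3: 1 x 3
biases = [[0.0, 0.0, 0.0], [0.0]]

zs, activations = feedforward(weights, biases, [1.0, 0.5])
print(activations[-1])  # output activation a^L: one value between 0 and 1
```

Storing every `z` and `a` is a design choice made here because the backward pass needs them later.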

Training (find weights and biases)
• What we'd like is an algorithm which lets us find weights and biases so that
the output from the network approximates y(x) for all training inputs x.
• The main steps:
• Randomly initialize the weights for all the nodes.
• Forward Pass: by using the current weights, calculate the output of each node going from left to
right
• Cost Calculation: Compare the final output with the actual target in the training data, and
measure the error using a loss function.
• Backward pass and parameter update: adjust the weights and biases using gradient
descent, which slightly tweaks them to reduce the error.
• The learning (training) process of a neural network is an iterative process in
which the calculations are carried out forward and backward through each
layer in the network until the loss function is minimized.

Overview of a Neural Network’s Learning Process

If the prediction is correct, reward the connections that produced this result by increasing their weights
in proportion to the confidence of the prediction. If the prediction is wrong, penalize the connections
that contributed to this wrong result.

One example of a cost function is the quadratic cost function C (also known as the mean squared error, MSE).
• C for a single sample x ➔ $C_x = \frac{1}{2} \| y_x - a^L \|^2$; dividing by 2 just simplifies
the derivative.
• $\| (x, y, \dots, z) \|$ is the length of a vector ➔ $\sqrt{x^2 + y^2 + \dots + z^2}$
• If $C_x \approx 0$ ⇒ $y_x \approx a^L$
• The cost function over all n samples in the training set is ➔
$C(w, b) = \frac{1}{n} \sum_x C_x$
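
A small sketch of the quadratic cost, written in plain Python for illustration. The function names and the toy targets and outputs are assumptions.

```python
def sample_cost(y, a):
    """C_x = 1/2 * ||y - a||^2 for one sample."""
    return 0.5 * sum((y_j - a_j) ** 2 for y_j, a_j in zip(y, a))

def total_cost(targets, outputs):
    """C = (1/n) * sum of C_x over all n samples."""
    n = len(targets)
    return sum(sample_cost(y, a) for y, a in zip(targets, outputs)) / n

targets = [[1.0, 0.0], [0.0, 1.0]]
outputs = [[1.0, 0.0], [0.0, 0.0]]   # first prediction exact, second one wrong

print(sample_cost(targets[0], outputs[0]))  # 0.0
print(sample_cost(targets[1], outputs[1]))  # 0.5
print(total_cost(targets, outputs))         # 0.25
```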



• We do not usually use all training samples in one iteration during the neural network
training. Instead, we specify the batch size which determines the number of training
samples to be propagated (forward and backward) during training.
• The algorithm handles one mini-batch at a time, and it goes through the full training set
multiple times. Each pass is called an epoch.
• An epoch is one iteration over the entire training dataset.
➔ Mini-Batch GD


• For example, let’s say we have a dataset of 1000 training samples and we choose a batch
size of 10 and epochs of 20. In this case, our dataset will be divided into 100 (1000/10)
batches each with 10 training samples.
• According to this setting, the algorithm takes the first 10 training samples from the dataset
and trains the model. Next, it takes the second 10 training samples and trains the model and
so on.
• Since there is a total of 100 batches, the model parameters will be updated 100 times in each
epoch of optimization.
• This means that one epoch involves 100 batches, that is, 100 parameter updates.
• Since the number of epochs is 20, the optimizer passes through the entire training dataset 20
times giving a total of 2000 (100x20) iterations!
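
The arithmetic of this example can be checked with a few lines of Python. The helper name `training_iterations` is hypothetical, and rounding up for a final partial batch is an assumption that does not matter here because 1000 is divisible by 10.

```python
import math

def training_iterations(n_samples, batch_size, epochs):
    """Return (batches per epoch, total number of parameter updates)."""
    batches_per_epoch = math.ceil(n_samples / batch_size)
    return batches_per_epoch, batches_per_epoch * epochs

print(training_iterations(1000, 10, 20))  # (100, 2000)
```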


Backward Propagation
• The process of updating network parameters is called optimization. It is done using
an optimization algorithm (optimizer) that implements backpropagation.
• The objective is to find the global minimum, where the loss function has its minimum value.
• However, it is a real challenge for an optimization algorithm to find the global minimum of a
complex loss function by avoiding all the local minima.
• If the algorithm is stopped at a local minimum, we’ll not get the minimum value for the loss
function. Therefore, our model will not perform well.
• In the backward propagation, calculations are made from the output layer to the input
layer (right to left) through the network.
• In the backward propagation, the partial derivatives (gradients) of the loss function with
respect to the model parameters (W and b) in each layer are calculated.
• The derivative of the loss function is its slope, which gives us the direction in which
to change the values of the model parameters.

BP: Main Steps
• BP1: Computes how much each output connection (last layer) contributed to the
error.
• BP2: The algorithm then measures how much of these error contributions came from
each connection in the layer below and so on until the algorithm reaches the input
layer.
• BP3 and BP4: Measures the error gradient (for each sample) across all the connection
weights in the network by propagating the error gradient backward through the
network.
• Wu1: Calculate the average gradient of all samples in this batch
• Wu2: Finally, the algorithm performs a Gradient Descent step to tweak all the
connection weights in the network, using the error gradients it just computed.


2- Feed forward for each sample x
3- Backward for each sample x: compute $\delta^L$ for the output layer L = 5 (BP1), then $\delta^4$, $\delta^3$, $\delta^2$ (BP2)
4- Calculate the gradients for each sample: $\frac{\partial C}{\partial b^l} = \delta^l$ (BP3) and $\frac{\partial C}{\partial w^l}$ (BP4) for l = 2, ..., 5
5- Calculate the average gradient of all samples: $\nabla b^l = \frac{1}{m} \sum_x \frac{\partial C}{\partial b^l}$, $\nabla w^l = \frac{1}{m} \sum_x \frac{\partial C}{\partial w^l}$
6- Update weights and biases: $b^l = b^l - \eta \nabla b^l$, $w^l = w^l - \eta \nabla w^l$

The Equations for Backpropagation
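• BP1: $\delta^L = \nabla_{a^L} C \odot \sigma'(z^L) = \frac{\partial C}{\partial a^L} \odot \sigma'(z^L)$, where $\odot$ is the element-wise product
• BP2: $\delta^l = ((w^{l+1})^T \delta^{l+1}) \odot \sigma'(z^l)$ ➔ move the error at layer l+1
backward through the network
• BP3: $\frac{\partial C}{\partial b^l} = \delta^l$
• BP4: $\frac{\partial C}{\partial w^l} = \delta^l (a^{l-1})^T$
• $\frac{\partial C}{\partial a^L}$: measures how fast the cost is changing as a function of each neuron in
the last layer. The exact form of this term depends on the cost function
used. Since we use the quadratic cost function, $\frac{\partial C}{\partial a^L} = a^L - y$
• $\delta_j^L$: if C does not depend on a particular neuron j, then $\delta_j^L$ will be small
• $\sigma'(z^L)$: measures how fast the activation function is changing at each
neuron ⇒ $\sigma'(x) = \sigma(x)(1 - \sigma(x))$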
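
To make BP1 to BP4 concrete, here is a sketch of backpropagation for a single sample in plain Python. It assumes the logistic activation and the quadratic cost, as on the slide; the function name `backprop`, the list-of-rows storage, and the example parameters are illustrative choices.

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def sigmoid_prime(z):
    """sigma'(z) = sigma(z) * (1 - sigma(z))."""
    s = sigmoid(z)
    return s * (1.0 - s)

def backprop(weights, biases, x, y):
    """Return (grad_b, grad_w) of the quadratic cost for one sample (x, y)."""
    # Forward pass, storing every z^l and a^l.
    a, zs, activations = list(x), [], [list(x)]
    for W, b in zip(weights, biases):
        z = [sum(w * a_j for w, a_j in zip(row, a)) + b_i for row, b_i in zip(W, b)]
        a = [sigmoid(z_i) for z_i in z]
        zs.append(z)
        activations.append(a)

    # BP1: delta^L = (a^L - y) * sigma'(z^L), element-wise.
    delta = [(a_j - y_j) * sigmoid_prime(z_j)
             for a_j, y_j, z_j in zip(activations[-1], y, zs[-1])]
    grad_b = [None] * len(weights)
    grad_w = [None] * len(weights)
    for k in range(len(weights) - 1, -1, -1):
        grad_b[k] = delta                                                  # BP3
        grad_w[k] = [[d * a_j for a_j in activations[k]] for d in delta]   # BP4
        if k > 0:
            # BP2: delta^l = ((w^{l+1})^T delta^{l+1}) * sigma'(z^l)
            delta = [sum(weights[k][i][j] * delta[i] for i in range(len(delta)))
                     * sigmoid_prime(zs[k - 1][j])
                     for j in range(len(zs[k - 1]))]
    return grad_b, grad_w

# Hypothetical 2-3-1 network.
weights = [[[0.1, -0.2], [0.3, 0.4], [-0.5, 0.6]], [[0.7, -0.8, 0.9]]]
biases = [[0.1, 0.2, -0.1], [0.05]]
grad_b, grad_w = backprop(weights, biases, x=[1.0, 0.5], y=[1.0])
print(len(grad_w[0]), len(grad_w[0][0]))  # 3 2
```

Each gradient has the same shape as the parameter it belongs to, which is what makes the later update step a simple element-wise subtraction.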


Weights And Bias Update Functions


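• Wu1: $\nabla b^l = \frac{1}{m} \sum_x \frac{\partial C}{\partial b^l}$, $\nabla w^l = \frac{1}{m} \sum_x \frac{\partial C}{\partial w^l}$
• Wu2: $b^l = b^l - \eta \nabla b^l$, $w^l = w^l - \eta \nabla w^l$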
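
A minimal sketch of the two update functions applied to the bias vector of one layer. The helper names `average` and `gd_step`, the per-sample gradients, and the learning rate are all made-up values for illustration.

```python
def average(grads):
    """Wu1: element-wise mean of a list of per-sample gradient vectors."""
    m = len(grads)
    return [sum(g[j] for g in grads) / m for j in range(len(grads[0]))]

def gd_step(params, grad, eta):
    """Wu2: move each parameter against its averaged gradient."""
    return [p - eta * g for p, g in zip(params, grad)]

# Hypothetical bias vector of one layer and gradients from a batch of m = 2 samples.
b = [0.5, -0.5]
per_sample_grads = [[0.2, 0.0], [0.4, -0.2]]

nabla_b = average(per_sample_grads)   # approximately [0.3, -0.1]
b = gd_step(b, nabla_b, eta=0.1)      # approximately [0.47, -0.49]
print(b)
```

The weights are updated in exactly the same way, entry by entry, with the averaged weight gradient in place of `nabla_b`.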

Training Algorithm
1- Load the training set
2- Build the neural network: choose the number of neurons in each layer (l1, l2, ….., L)
3- Initialize the weights and biases for each layer randomly
4- Determine the number of epochs and the size of each batch
5- For each epoch:
6-   shuffle the training set
7-   split the training set into batches
8-   for each batch:
9-     for each sample (x, y) in the current batch:
10-      apply feed forward (calculate $z^l$ and $a^l$ at each layer using eq. FW1, FW2 and FW3)
11-      apply backward propagation:
12-        calculate $\delta^L$ using eq. BP1
13-        calculate $\delta^l$ for each l = L-1, …, 2 using eq. BP2
14-        calculate the gradients $\frac{\partial C}{\partial w}$, $\frac{\partial C}{\partial b}$ using eq. BP3 and BP4
15-    calculate the average gradient of all samples in this batch ➔ $\nabla b^l = \frac{1}{m} \sum_x \frac{\partial C}{\partial b^l}$, $\nabla w^l = \frac{1}{m} \sum_x \frac{\partial C}{\partial w^l}$
16-    update weights and biases: $b^l = b^l - \eta \nabla b^l$, $w^l = w^l - \eta \nabla w^l$
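
The steps above can be sketched end to end in plain Python. This is one possible reading of the algorithm, not a reference implementation: it assumes the logistic activation, the quadratic cost, Gaussian random initialization, and a toy OR data set, and all function names are illustrative.

```python
import math
import random

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def forward(weights, biases, x):
    """FW1-FW3: return all z^l and a^l."""
    a, zs, acts = list(x), [], [list(x)]
    for W, b in zip(weights, biases):
        z = [sum(w * aj for w, aj in zip(row, a)) + bi for row, bi in zip(W, b)]
        a = [sigmoid(zi) for zi in z]
        zs.append(z)
        acts.append(a)
    return zs, acts

def backprop(weights, biases, x, y):
    """BP1-BP4 for one sample; uses sigma'(z) = a * (1 - a) for the logistic function."""
    zs, acts = forward(weights, biases, x)
    delta = [(a - t) * a * (1 - a) for a, t in zip(acts[-1], y)]           # BP1
    grad_b, grad_w = [None] * len(weights), [None] * len(weights)
    for k in range(len(weights) - 1, -1, -1):
        grad_b[k] = delta                                                  # BP3
        grad_w[k] = [[d * aj for aj in acts[k]] for d in delta]            # BP4
        if k > 0:                                                          # BP2
            delta = [sum(weights[k][i][j] * delta[i] for i in range(len(delta)))
                     * acts[k][j] * (1 - acts[k][j])
                     for j in range(len(acts[k]))]
    return grad_b, grad_w

def cost(weights, biases, data):
    """Mean quadratic cost over a data set."""
    total = 0.0
    for x, y in data:
        out = forward(weights, biases, x)[1][-1]
        total += 0.5 * sum((t - o) ** 2 for t, o in zip(y, out))
    return total / len(data)

def train(data, sizes, epochs, batch_size, eta, seed=0):
    rng = random.Random(seed)
    # Step 3: random initialization (standard Gaussian, an assumption).
    weights = [[[rng.gauss(0, 1) for _ in range(n_in)] for _ in range(n_out)]
               for n_in, n_out in zip(sizes[:-1], sizes[1:])]
    biases = [[rng.gauss(0, 1) for _ in range(n_out)] for n_out in sizes[1:]]
    data = list(data)
    for _ in range(epochs):                                    # step 5
        rng.shuffle(data)                                      # step 6
        for start in range(0, len(data), batch_size):          # steps 7-8
            batch = data[start:start + batch_size]
            m = len(batch)
            sum_b = [[0.0] * len(b) for b in biases]
            sum_w = [[[0.0] * len(row) for row in W] for W in weights]
            for x, y in batch:                                 # steps 9-14
                gb, gw = backprop(weights, biases, x, y)
                for k in range(len(weights)):
                    for i in range(len(gb[k])):
                        sum_b[k][i] += gb[k][i]
                        for j in range(len(gw[k][i])):
                            sum_w[k][i][j] += gw[k][i][j]
            for k in range(len(weights)):                      # steps 15-16
                for i in range(len(biases[k])):
                    biases[k][i] -= eta * sum_b[k][i] / m
                    for j in range(len(weights[k][i])):
                        weights[k][i][j] -= eta * sum_w[k][i][j] / m
    return weights, biases

# Toy data (an assumption for illustration): the logical OR function.
data = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
        ([1.0, 0.0], [1.0]), ([1.0, 1.0], [1.0])]
weights, biases = train(data, sizes=[2, 3, 1], epochs=2000, batch_size=2, eta=1.0)
print(cost(weights, biases, data))  # mean quadratic cost after training
```

Nested lists keep the sketch dependency-free; in practice a numerical array library would replace the inner loops.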


Testing Algorithm (For Classification)


0- Execute the training algorithm
1- Load the testing set
2- for each sample (x, y):
3-   apply feed forward ($a^l$ at each layer using eq. FW1, FW2 and FW3)
4-   if the index of the maximum activation in the last layer = y, then increment the number of
correct classifications
5- print the number of correct classifications
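
The testing loop can be sketched as follows. The `predict` function here is a hypothetical stand-in for the feed-forward pass of a trained network, and the three test samples are invented.

```python
def argmax(values):
    """Index of the largest entry."""
    return max(range(len(values)), key=lambda i: values[i])

def count_correct(predict, test_set):
    """Count samples whose most-activated output neuron matches the label y."""
    return sum(1 for x, y in test_set if argmax(predict(x)) == y)

def predict(x):
    """Hypothetical stand-in for a trained two-class network's last-layer activations."""
    return [1.0 - x[0], x[0]]

test_set = [([0.9], 1), ([0.2], 0), ([0.6], 0)]
correct = count_correct(predict, test_set)
print(correct, "of", len(test_set), "correct")  # 2 of 3 correct
```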

Other Models of GD
• Batch Gradient Descent: In Batch Gradient Descent, all the training
data is taken into consideration to take a single step. We take the
average of the gradients of all the training examples and then use that
mean gradient to update our parameters. So each epoch consists of
one step of gradient descent. In this case, we move somewhat
directly towards an optimum solution.
• Suppose our dataset has 5 million examples; then just to take one
step the model has to calculate the gradients of all 5 million
examples. This is inefficient. To tackle this problem
we have Stochastic Gradient Descent.



• Stochastic (random) Gradient Descent (SGD): SGD just picks a random
instance in the training set at every step and computes the gradients based
only on that single instance. We do the following steps in one epoch for
SGD:
1. Take an example
2. Feed it to the neural network
3. Calculate its gradient
4. Use the gradient we calculated in step 3 to update the weights
5. Repeat steps 1–4 for all the examples in the training dataset
• Since we are considering just one example at a time, the cost will fluctuate
over the training examples and it will not necessarily decrease at every step. But in the
long run, you will see the cost decreasing with fluctuations.
• Also, because the cost fluctuates so much, it will never settle at the minimum but
will keep dancing around it.

• Mini-batch Gradient Descent: at each step, instead of computing the
gradients based on the full training set (as in Batch GD) or based on
just one instance (as in Stochastic GD), Mini-batch GD computes the
gradients on small random sets of instances called mini-batches. We
do the following steps in one epoch:
1. Pick a mini-batch
2. Feed it to Neural Network
3. Calculate the mean gradient of the mini-batch
4. Use the mean gradient we calculated in step 3 to update the weights
5. Repeat steps 1–4 for the mini-batches we created
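
The three variants differ only in how many samples feed each update, which a single function with a `batch_size` parameter can illustrate. The sketch below fits a one-parameter model y = w x to synthetic, noise-free data; the model, the data, and the hyperparameters are assumptions chosen to keep the example small.

```python
import random

def gradient_descent(data, batch_size, epochs, eta, seed=0):
    """Fit y = w * x by minimizing the mean of 1/2 * (w x - y)^2.

    batch_size = len(data) gives Batch GD, 1 gives SGD,
    and anything in between gives Mini-batch GD.
    """
    rng = random.Random(seed)
    data = list(data)
    w, updates = 0.0, 0
    for _ in range(epochs):
        rng.shuffle(data)
        for start in range(0, len(data), batch_size):
            batch = data[start:start + batch_size]
            # Mean gradient of the cost over the current batch.
            grad = sum((w * x - y) * x for x, y in batch) / len(batch)
            w -= eta * grad
            updates += 1
    return w, updates

# Synthetic data (an assumption): ten noise-free points on the line y = 2x.
data = [(x / 10, 2 * x / 10) for x in range(1, 11)]

for name, size in [("batch", 10), ("mini-batch", 5), ("stochastic", 1)]:
    w, updates = gradient_descent(data, size, epochs=200, eta=0.5)
    print(name, round(w, 3), updates)
```

With 10 samples and 200 epochs, the three settings perform 200, 400, and 2000 parameter updates respectively. Because this toy data has no noise, all three settle on the same w; with real data the stochastic variant would keep fluctuating around the minimum as described above.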


Activation Functions ()


• The Rectified Linear Unit function: ReLU(z) = max(0, z)
• It is continuous but unfortunately not differentiable at z = 0 (the slope
changes abruptly, which can make Gradient Descent bounce around), and its
derivative is 0 for z < 0. However, in practice it works very well and has the
advantage of being fast to compute. Most importantly, the fact that it does
not have a maximum output value also helps reduce some issues during
Gradient Descent
• Softplus is the activation function f(z) = ln(1 + exp(z)). It can be viewed
as a smooth version of ReLU.
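
For illustration, ReLU, its slope, and softplus can be written in a few lines of Python. The value chosen for the ReLU slope at exactly z = 0 is a convention assumed here, since the derivative is undefined at that point.

```python
import math

def relu(z):
    """ReLU(z) = max(0, z)."""
    return max(0.0, z)

def relu_prime(z):
    """Slope of ReLU; 0 is used at z = 0 by convention (an assumption)."""
    return 1.0 if z > 0 else 0.0

def softplus(z):
    """f(z) = ln(1 + exp(z)); math.exp overflows for very large z in this simple form."""
    return math.log1p(math.exp(z))

for z in (-2.0, 0.0, 3.0):
    print(z, relu(z), relu_prime(z), round(softplus(z), 4))
```

Softplus stays slightly above ReLU everywhere and approaches it as |z| grows, which is the sense in which it is a smooth version.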



• The hyperbolic tangent function (Tanh)
• Just like the logistic function it is S-shaped, continuous,
and differentiable, but its output value ranges from –1 to 1
(instead of 0 to 1 in the case of the logistic function), which
tends to make each layer’s output more or less centered
around 0 at the beginning of training. This often helps
speed up convergence.
• The softmax function turns
a vector of K real values into a vector of K real values
that sum to 1 (probabilities). This activation function
is used in multiclass classification where K is the
number of classes.
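
A short sketch of softmax in plain Python, with `math.tanh` shown alongside for comparison. Subtracting the maximum score before exponentiating is a common numerical safeguard added here as an assumption; it does not change the result. The input scores are made up.

```python
import math

def softmax(z):
    """Turn K real scores into K probabilities that sum to 1."""
    shift = max(z)                               # subtracting the max avoids overflow
    exps = [math.exp(z_k - shift) for z_k in z]
    total = sum(exps)
    return [e / total for e in exps]

print(math.tanh(0.0))                            # 0.0
probs = softmax([2.0, 1.0, 0.1])
print([round(p, 3) for p in probs])              # [0.659, 0.242, 0.099]
```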


• In order for GD to work properly, the activation function must be
differentiable, like the logistic function.
• Replacing the step function with the logistic function was essential because the step
function contains only flat segments, so there is no gradient to work with (Gradient
Descent cannot move on a flat surface), while the logistic function has a well-defined
nonzero derivative everywhere, allowing Gradient Descent to make some progress at every step.
• In fact, the backpropagation algorithm works well with many other
activation functions, not just the logistic function. Two other popular
activation functions are: tanh, and ReLU


Regression MLPs
• Single-value regression: to predict a single value (e.g., the price of a house
given many of its features), you just need a single output neuron: its output is the
predicted value.
• Multivariate regression: to predict multiple values at once, you
need one output neuron per output dimension. For example, to
locate the center of an object in an image, you need to predict 2D
coordinates, so you need two output neurons.
• The loss function to use during training is typically the mean squared
error, but if you have a lot of outliers in the training set, you may
prefer to use the mean absolute error instead.


• The activation function for hidden layers: ReLU
• The activation function for the output layer:
• None: in general, you do not want to use any activation function for the output
neurons, so they are free to output any range of values.
• ReLU or softplus: if you want to guarantee that the output will always be
positive
• Logistic function or Tanh: if you want to guarantee that the predictions
will fall within a given range of values.
• Note: scale the labels to the appropriate range: 0 to 1 for the logistic function, or –1 to 1
for the hyperbolic tangent.


Classification MLP
• Binary classification problem:
• you just need a single output neuron using the logistic activation function: the
output will be a number between 0 and 1, which you can interpret as the estimated
probability of the positive class.
• Obviously, the estimated probability of the negative class is equal to one minus that
number.
• Multi-label binary classification tasks:
• For example, you could have an email classification system that predicts whether
each incoming email is ham or spam, and simultaneously predicts whether it is an
urgent or non-urgent email.
• In this case, you would need one output neuron per label using the logistic
activation.
• For the above example, two output neurons: the first would output the probability
that the email is spam and the second would output the probability that it is urgent.

• Multiclass Classification:
• If each instance can belong only to a single class,
out of 3 or more possible classes (e.g., classes 0
through 9 for digit image classification), then you
need to have one output neuron per class, and
you should use the softmax activation function for
the whole output layer.
• The softmax function will ensure that all the
estimated probabilities are between 0 and 1 and
that they add up to one (which is required if the
classes are exclusive). This is called multiclass
classification.
• For all three cases (binary, multi-label, and multiclass), the cross-entropy
(also called the log loss) is generally a good choice
as a loss function, since we are predicting
probability distributions.
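
As an illustration of the log loss for a single instance, the sketch below covers the logistic output (binary or per-label) and the softmax output (multiclass). The function names and the example probabilities are assumptions; predicted probabilities must lie strictly between 0 and 1 for the logarithms to be defined.

```python
import math

def binary_cross_entropy(y, p):
    """Log loss for one label y in {0, 1} and predicted probability p of the positive class."""
    return -(y * math.log(p) + (1 - y) * math.log(1 - p))

def categorical_cross_entropy(y_index, probs):
    """Log loss for one instance whose true class has index y_index."""
    return -math.log(probs[y_index])

print(round(binary_cross_entropy(1, 0.9), 4))                    # 0.1054
print(round(categorical_cross_entropy(0, [0.7, 0.2, 0.1]), 4))   # 0.3567
```

For the multi-label case, the binary loss would be applied to each output neuron separately and the results summed or averaged.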


Notes
• Even though there are a lot of hardware and software optimizations
available for the training of neural networks, they are still significantly
more computationally expensive than training a decision tree, for
instance.
• The number of model hyperparameters to tune in the training of
ANNs can be enormous.
• For instance, you must choose the configuration of network architecture, the
type of activation function, whether to fully connect neurons or leave layers
sparsely connected, and so on.
• ANNs are very complex and difficult to reason about.

Other Artificial Neural Nets
• There are hundreds of variations of neural network architectures,
and we can use them for both supervised and unsupervised learning.
• Some are shown here
