Intelligent Systems Notes: Federico Rossi A.A 2017/2018
Federico Rossi
A.A 2017/2018
1 Introduction
We are interested in reproducing the inner structure of the human brain. The central computation unit is the
"biological neuron", composed of:
• Dendrites (inputs)
• Body
• Axon (outputs)
A neuron receives inputs from other neurons, and approximately sums those inputs. If the value received goes
over a certain threshold, the neuron activates and produces a spike (electrical impulse) that travels to nearby
neurons.
Axons almost touch the dendrites of the next neurons, so the signal is transmitted by neurotransmitters, chemical
particles released from the first neuron which bind to receptors in the second. The link between two neurons
is called a synapse.
The travelling signal has the given value but the receiving neuron receives the value modulated by the corre-
sponding synapse that can:
• Enhance
• Inhibit
The learning process consists in altering the strength of synaptic connections. So, by modifying the strength of
the travelling signal, I can modify the input value to the second neuron.
We have about 10 billion neurons with 10,000 synapses each. Computers run much faster than brains (typically
100 MHz), but the brain has important characteristics:
• Can learn from experience
• Performance degrades gracefully under partial damage
• It performs massively parallel computations extremely efficiently.
2 Neural networks
How can we mimic those characteristics of the brain? We have to model the low-level structure of the brain and
the synaptic learning. An artificial neuron has many inputs, both from other neurons and from the external environment,
but only one output. Each input connection has an associated adjustable weight, which can be modified according
to synaptic learning. A neuron then computes some function, called the activation (or transfer) function, on
the weighted sum of its inputs:
y_j = f(a) = f\left( \sum_{i=1}^{N} w_{ij} * x_i \right)    (1)
With a threshold activation function the output is:
y_j = \begin{cases} 0 & a < \theta \\ 1 & a \ge \theta \end{cases}    (2)
This can also be written as below, where θ (called bias) is interpreted as another input equal to x_0 = 1 or x_0 = −1 with
associated weight w_{0j} = −θ (or θ):
y_j = f(a - \theta) = f\left( \sum_{i=0}^{N} w_{ij} * x_i \right)    (3)
Theta can be considered as one more degree of freedom upon which we can work.
Piece-wise-linear function
Sigmoid function A smooth threshold function with a varying slope parameter (the smoothness can be modified).
The sigmoid function is differentiable, so we can train the network more effectively (max/min search):
f(a) = \frac{1}{1 + e^{-a}}    (4)
It is also called the logistic function (it is the differentiable version of the threshold function). The value a can be
scaled by a constant to obtain different slopes. Depending on the specific application, we may need to express the
output outside [0,1].
If we need a differentiable version of the SIGNUM function we have to switch to the hyperbolic tangent function:
f(a) = \tanh(a) = \frac{e^{a} - e^{-a}}{e^{a} + e^{-a}}    (5)
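The activation functions of this section can be sketched in a few lines of Python. This is only an illustration: the names step, sigmoid, neuron_output and the slope argument are assumptions made for the example, not part of the notes.

```python
import math

def step(a, theta=0.0):
    """Threshold unit: 1 if a >= theta, 0 otherwise (eq. 2)."""
    return 1 if a >= theta else 0

def sigmoid(a, slope=1.0):
    """Logistic function (eq. 4); `slope` is an assumed name for the constant scaling a."""
    return 1.0 / (1.0 + math.exp(-slope * a))

def neuron_output(weights, inputs, activation):
    """Weighted sum of the inputs passed through an activation function (eq. 1)."""
    a = sum(w * x for w, x in zip(weights, inputs))
    return activation(a)

# Bias handled as an extra input x_0 = 1 with weight w_0 = -theta (eq. 3), here theta = 0.5
y = neuron_output([-0.5, 1.0], [1.0, 1.0], step)
# math.tanh plays the role of the differentiable SIGNUM, with outputs in (-1, 1)
z = neuron_output([-0.5, 1.0], [1.0, 1.0], math.tanh)
```

Treating the bias as one more weight keeps the training code uniform, since θ is then updated like any other weight.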
3 Connecting neurons
We have many different types of neural networks depending on how neurons are connected together. For example
each unit connects to any other unit, or they are organised in layered strata connected between successive levels.
Example Two layers (in most papers the input layer is not counted, because
no computation occurs in it).
Example More layers can be added to the network (more on this later on).
Input neurons do not perform any computation: they only receive input from the environment and
propagate it to the next layer. A hidden layer is not connected, on either its input or its output side, to the external
environment. The hidden layer is important in the learning process, encapsulating the power of the neural network.
Completely connected layered networks impose that each neuron in a layer is connected to all the neurons in
the adjacent layer (this is not always the case).
Note There is no connection between the input layer and the layers that follow its immediately adjacent layer.
The number of neurons in each layer is application-dependent. You have to decide the representation of the inputs,
and there are many possible choices in doing that. You also have to decide how to represent the output and, as
a consequence, the number of output neurons. Choosing the number of nodes in a hidden layer almost always
involves trial and error.
Partially connected networks We can also have partially connected networks employing unidirectional
connections between neurons.
Providing inputs The network is fed with a sample randomly chosen from the training set, and the synaptic
weights are modified to minimise the difference between the expected outputs and the actual outputs. The training set
is applied to the network many times, in a different random order on each training session.
The network learns from the training examples by constructing an input/output mapping for the problem under
consideration.
4 Delta rule
It is a form of supervised learning, where for every input pattern there is a desired output pattern. In practice
(least mean squares):
• We prepare a training set composed of examples (pairs of input, desired output)
• Difference between outputs and expected outputs is the error
The rule The delta rule states how we have to change the weights by considering the error between the expected
response t and the actually obtained response y:
\Delta w_i = \eta * \delta * x_i = \eta * (t - y) * x_i    (6)
Where η is the training rate and δ = t − y is the error. So, on each pass of the rule we get:
w_i \leftarrow w_i + \Delta w_i    (7)
Example: linear fit We can use a single neuron with a linear transfer function, one input and a bias:
y = w_1 * x + w_0. The output will be exactly the y-axis value for the given x-axis value. We can start from random
values for w_1 and w_0, and then let the network learn by feeding it the points at our disposal and applying the delta
rule.
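A minimal sketch of the linear fit example, assuming noise-free points on the line y = 2x + 1. The function name train_linear_neuron, the learning rate, the number of epochs and the data are illustrative choices, not values from the notes.

```python
import random

def train_linear_neuron(points, eta=0.05, epochs=500, seed=0):
    """Fit y = w1 * x + w0 with the delta rule, one sample at a time."""
    rng = random.Random(seed)
    w0, w1 = rng.uniform(-1, 1), rng.uniform(-1, 1)  # random initial weights
    data = list(points)
    for _ in range(epochs):
        rng.shuffle(data)              # different random order on each pass
        for x, t in data:
            y = w1 * x + w0            # linear transfer function
            delta = t - y              # error between target and output
            w1 += eta * delta * x      # delta rule for the input weight
            w0 += eta * delta * 1.0    # the bias is a weight on a constant input 1
    return w0, w1

# Hypothetical training set: five points lying exactly on y = 2x + 1
points = [(x, 2.0 * x + 1.0) for x in (-1.0, -0.5, 0.0, 0.5, 1.0)]
w0, w1 = train_linear_neuron(points)
```

With noisy points the weights would only settle around the least-squares line, and a smaller η may be needed.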
Example: perceptron classification We can equip a neuron with two inputs, a bias and a step transfer
function. Binary classification means finding a boundary between two classes. The output will be binary (0 or 1). If
the weighted sum is greater than zero the sample is classified as belonging to the first class, otherwise
as belonging to the second. The boundary is the set of points where the weighted sum equals zero. So we
simply apply the delta rule to find the weights that define this boundary.
Training errors I want to prove that the error on the output decreases on each iteration:
E_k = \frac{1}{2} \sum_j (t_j - o_j)^2    (8)
E = \sum_k E_k    (9)
What does this error represent? Let’s take for example a linear neuron with two inputs x_1, x_2, weights w_1, w_2 and
output o. The error for a given pattern is
E = \frac{1}{2} (t - o)^2 = \frac{1}{2} (t^2 - 2to + o^2)    (10)
E is a paraboloid in the z, t , o plane where the minimum error is the vertex of the paraboloid. The weights are
always modified to decrease the error and go in the vertex direction.
That means we are going in the steepest decreasing direction (being the gradient G the steepest increasing di-
rection). Practically we’ll always move downhill. With a positive slope we are decreasing the weight, while with a
negative slope we are increasing the weights.
\frac{dE}{dw_{ij}} = \frac{dE}{do_j} * \frac{do_j}{dw_{ij}}    (13)
\frac{dE}{do_j} = -\delta_j    (14)
\frac{do_j}{dw_{ij}} = x_i    (15)
The learning rate η The first difficulty in the algorithm is to decide how large the steps should be along the
line of steepest descent from the current point. Large steps may converge quickly but may also overstep the
solution or, if the error surface is very eccentric, go off in the wrong direction (bouncing from one side to the other
of a valley). On the other hand, very small steps may go in the correct direction but require too many iterations. The
step size can also be lowered over time as the algorithm proceeds.
Local minima problem Error surfaces may have several local minima that are not global minima.
Batch vs On-line learning In batch learning we accumulate the gradient contributions of all samples in the
training set before updating the weights. In online (pattern) learning the weights are updated immediately after
each training sample. Although it does not strictly comply with the theoretical algorithm, online learning has some
advantages:
• It is often much faster with redundant training sets
• It can be used with no fixed training set (with new data coming in continuously)
• It is better at tracking non-stationary environments (where the best model gradually changes over time)
• The noise in the gradient can help to escape from local minima as long as they’re not too deep.
Momentum The method most frequently used to escape from local minima is to add a momentum term m, whose value is
found by trial and error.
\Delta w_{ij}(n + 1) = \eta * \delta_j * x_i + m * \Delta w_{ij}(n)    (17)
Where \Delta w_{ij}(n) is the change made for the previous training sample.
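The momentum update of eq. (17) for a single weight can be sketched as follows. The function name momentum_step and the default values of eta and m are assumptions for the example.

```python
def momentum_step(w, prev_dw, delta, x, eta=0.1, m=0.9):
    """One weight update with momentum (eq. 17).

    prev_dw is the change applied for the previous training sample.
    Returns the new weight and the change just applied.
    """
    dw = eta * delta * x + m * prev_dw
    return w + dw, dw

# Two consecutive updates with the same error and input: the second step is larger
w, dw = momentum_step(0.0, 0.0, delta=1.0, x=1.0)
w, dw = momentum_step(w, dw, delta=1.0, x=1.0)
```

When successive changes point in the same direction the steps grow, which is what lets the weight roll through shallow local minima.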
5 Perceptron
A perceptron is a single neuron with a non-linear activation function (usually the sign function or the unit step
function).
y_j = \begin{cases} 0 & a < \theta \\ 1 & a \ge \theta \end{cases}    (18)
It is used for binary classification: fed with examples of classes C1 and C2, the perceptron can be trained so
that it correctly classifies the training examples.
This is the core of the perceptron, since I will change the weights during training, changing as a consequence
the decision boundary.
Termination condition The condition to stop the algorithm is that the weights don’t change for a certain number
of consecutive iterations. If the problem is linearly separable, the algorithm will always converge in a finite
number of steps.
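A sketch of perceptron training with this termination condition, on the linearly separable AND problem. It assumes the simplest form of the condition (stop after one full pass over the training set with no weight change); the names train_perceptron and predict, the data set and eta = 1 are illustrative.

```python
def predict(w, x1, x2):
    """Step activation on the weighted sum; w[0] is the bias weight (x_0 = 1)."""
    return 1 if w[0] + w[1] * x1 + w[2] * x2 >= 0 else 0

def train_perceptron(samples, eta=1.0, max_epochs=100):
    """Apply the delta rule until a whole pass leaves the weights unchanged."""
    w = [0.0, 0.0, 0.0]
    for epoch in range(max_epochs):
        changed = False
        for (x1, x2), t in samples:
            delta = t - predict(w, x1, x2)
            if delta != 0:
                w[0] += eta * delta
                w[1] += eta * delta * x1
                w[2] += eta * delta * x2
                changed = True
        if not changed:          # termination condition reached
            return w, epoch
    return w, None               # no convergence within max_epochs

AND = [((0, 0), 0), ((0, 1), 0), ((1, 0), 0), ((1, 1), 1)]
weights, stopped_at = train_perceptron(AND)
```

The max_epochs guard is needed because on a non linearly-separable problem the weights would keep changing forever.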
Example of non linearly-separable problem We want to solve the XOR problem, dividing the two inputs
of the network in the two classes x 1 ⊕ x 2 = 0 and x 1 ⊕ x 2 = 1. This problem is clearly not linearly separable, but
if I had a two-step approach to the problem I could create an internal representation of the problem inside the
network that is instead linearly separable.
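The two-step idea can be illustrated with a tiny network whose weights are set by hand rather than learned. The choice of an OR unit and an AND unit as the internal representation is one possible assumption among many.

```python
def step(a):
    """Unit step activation with the threshold folded into the bias."""
    return 1 if a >= 0 else 0

def xor_net(x1, x2):
    """Two hidden threshold units followed by one output threshold unit."""
    h_or = step(x1 + x2 - 0.5)    # fires if at least one input is 1
    h_and = step(x1 + x2 - 1.5)   # fires only if both inputs are 1
    # In the (h_or, h_and) space the two classes are linearly separable
    return step(h_or - h_and - 0.5)

outputs = [xor_net(a, b) for a, b in [(0, 0), (0, 1), (1, 0), (1, 1)]]
```

The hidden units map the four input points onto three points of the internal space, and a single line separates them there.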
Function fitting with hidden-layers This consideration can be applied also to function fitting. If we con-
sider universal basis functions as sigmoidal s-shaped or radial functions, combined in multiple-neuron hidden-
layered networks, we are able to fit any kind of function.
Limitations Although they are a powerful tool, multi-layered networks without non-linear activation functions
can only solve problems that include only linear regions. Therefore, a multi-layered network with only linear
activation functions can only solve the problems that a single-layered network can solve.
6 Back-propagation algorithm
When we add more layers to a network we must reconsider our learning algorithm, so the delta rule must be
generalised:
\frac{dE}{dw_{ij}} = \frac{dE}{do_j} * \frac{do_j}{dnet_j} * \frac{dnet_j}{dw_{ij}}    (19)
Where net_j is the net input (the weighted sum of the inputs) of unit j:
net_j = \sum_i x_i * w_{ij}    (20)
In this case we define:
\frac{dE}{do_j} * \frac{do_j}{dnet_j} = -\delta_j    (21)
NOTE: in the case of a single-layered network with linear units, this definition is equal to the one already given,
since net_j = o_j.
The first term is:
\frac{dE}{do_j} = -(t_j - o_j)    (22)
Given the activation function f(net_j) = o_j, the second term is:
\frac{do_j}{dnet_j} = f'(net_j)    (23)
Thus we obtain:
\delta_j = f'(net_j) * (t_j - o_j)    (24)
At the end we obtain:
\frac{dE}{dw_{ij}} = -\delta_j * x_i    (25)
since \frac{dnet_j}{dw_{ij}} = x_i from the previous definition of net_j.
Hidden unit weights For a generic hidden unit j, the error term is computed by also taking into account (in
proportion to the weights) the errors back-propagated from the successive layer:
\delta_j = f'(net_j) * \sum_s \delta_s * w_{js}    (27)
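The output and hidden deltas can be sketched for a small network with sigmoid units. The layout (any number of inputs and hidden units, one output, no biases) and all names are assumptions made to keep the example short; for the sigmoid, f'(net) = o * (1 - o).

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def forward(x, W_h, W_o):
    """W_h[i][j]: weight from input i to hidden unit j; W_o[j]: hidden unit j to the output."""
    h = [sigmoid(sum(x[i] * W_h[i][j] for i in range(len(x)))) for j in range(len(W_o))]
    o = sigmoid(sum(h[j] * W_o[j] for j in range(len(W_o))))
    return h, o

def error(x, t, W_h, W_o):
    """Squared error for one pattern, E = 1/2 (t - o)^2."""
    _, o = forward(x, W_h, W_o)
    return 0.5 * (t - o) ** 2

def backprop_gradients(x, t, W_h, W_o):
    """Return dE/dW_h and dE/dW_o computed through the deltas."""
    h, o = forward(x, W_h, W_o)
    delta_o = o * (1 - o) * (t - o)                                   # eq. 24
    delta_h = [h[j] * (1 - h[j]) * delta_o * W_o[j]                   # eq. 27
               for j in range(len(W_o))]
    grad_o = [-delta_o * h[j] for j in range(len(W_o))]               # eq. 25
    grad_h = [[-delta_h[j] * x[i] for j in range(len(W_o))] for i in range(len(x))]
    return grad_h, grad_o
```

A weight update would then be W minus η times the gradient, that is Δw_ij = η * δ_j * x_i, with the hidden unit output playing the role of x_i for the output layer.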
We have to choose the most important variables, that have a strong correlation between input and output. Also
we have to check, for each variable, the possibility of presenting outliers.
Usually data are pre-processed and transformed into a numeric form suitable for the neural network. In par-
ticular, both inputs and targets may be normalised:
• Centring: by subtracting its average
• Scaling: divide by its standard deviation
There may be missing data or non-numeric data:
• Missing values of a given variable can be substituted by its mean value, or other statistic of that variable
across the other available training cases
• Non numeric data are either converted to numeric, or discarded.
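The pre-processing steps above can be sketched with the standard library. Marking a missing value with None, and the function names impute_mean and standardise, are assumptions for the example; the sketch also assumes the variable is not constant (non-zero standard deviation).

```python
import statistics

def impute_mean(values):
    """Replace missing values (None) with the mean of the available ones."""
    known = [v for v in values if v is not None]
    mean = statistics.fmean(known)
    return [mean if v is None else v for v in values]

def standardise(values):
    """Centre by subtracting the average, then scale by the standard deviation."""
    mean = statistics.fmean(values)
    std = statistics.pstdev(values)   # population standard deviation, assumed non-zero
    return [(v - mean) / std for v in values]

column = impute_mean([1.0, None, 3.0])
z = standardise(column)
```

The mean and standard deviation should be computed on the training set only and then reused unchanged on validation and test data.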
How many training examples do we need? We must provide a representative sample of the data the network
will process in the finished application. Larger training sets reduce the risk of under-sampling: if the set is too
small, noisy or skewed, the network can learn it perfectly but fail in the final application. So it depends on several
factors:
• Network size
• Testing needs
• Input and target distribution
The network size matters the most: a big network usually needs more training data than a small one (on average 5-10
training patterns for each weight).
7.3 Over-training
We must pay attention to so-called over-training/over-learning, to prevent inappropriate memorisation of the input
data. Thus we need to interleave testing and training. Testing during training shows when to end training to
prevent over-training; this is called validation.
If I repeat the same training samples in a different order in different epochs, the network will answer perfectly on
the training set but will not generalise what it learns. So, the testing phase interleaved with training is different from
the one applied at the end of the training.
Validation set We add another set called validation set, used during training in order to check that the network
is not over-learning.
Graphical interpretation If we plot the error on the validation set and compare it to the error on
the training set, we can recognise over-training at time t, where the validation error starts to grow after a steady
decrease. At time t we must stop the training process to avoid over-training (t corresponds to the minimum error
on the validation set).
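Finding the stopping time t from a recorded validation curve can be sketched as below. The patience parameter (how many epochs without improvement we tolerate before stopping) is an assumption added for the example, as real validation curves are noisy; the error values are made up.

```python
def early_stopping_epoch(val_errors, patience=3):
    """Return the epoch with the lowest validation error seen before stopping.

    Training is considered over once `patience` epochs have passed
    without improving on the best validation error.
    """
    best_epoch, best_err = 0, float("inf")
    for epoch, err in enumerate(val_errors):
        if err < best_err:
            best_epoch, best_err = epoch, err
        elif epoch - best_epoch >= patience:
            break
    return best_epoch

# Hypothetical validation errors, one per epoch
curve = [0.9, 0.7, 0.5, 0.4, 0.45, 0.5, 0.6, 0.3]
t_stop = early_stopping_epoch(curve, patience=3)
```

In practice the weights saved at the returned epoch are the ones to keep.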
8 Radial functions
Radial functions are a special class of functions whose response typically decreases monotonically w.r.t. a central point.
The output strength depends on the distance from the central point.
Where c is the central point of the function. We must pay attention to the value we assign to this parameter.
Typical radial functions are:
1. Gaussian: h(x) = e^{-\frac{(x - c)^2}{2\sigma}}
2. Inverse multiquadric: resembles a Gaussian function but approaches zero polynomially instead of expo-
nentially like Gaussian one.
3. Multiquadric
In Gaussians σ is called "spread" and it represents the selectivity of our activation function.
Structure In RBF networks, the weights connected to hidden layers represent the centres of radial functions in
the hidden units, so for a generic hidden unit j the output is:
The output of the output layer is instead a classical weighted sum, representing a linear combination of radial
functions.
Recall "We can always approximate an arbitrary function with a sum of radial basis functions, whatever its complexity."
Selectivity meaning The concept of selectivity (by means of σ for example) is linked with the notion of re-
ceptive fields. The new space generated by the hidden units in the mapping operation with the distance norm is
thus called feature space
\sigma = \frac{\max_{i \ne j} \| c_i - c_j \|}{\sqrt{N}}    (30)
Pseudo-inverse method This method is based on the notion of pseudo-inverse matrix: "For any matrix, including
square matrices that are singular, there exists a pseudo-inverse matrix that has the same properties as the inverse
matrix of a non-singular matrix".
y(x_k) = \sum_{j=1}^{M} w_j * \phi_j(\| x_k - c_j \|) = \Phi * [w_1 ... w_M]^T = \Phi * w = [d_1 ... d_M]^T = d    (32)
8.2.2 Algorithm 2
It’s a slight variation of the previous one. Instead of choosing the centres randomly, we locate them in
regions of the input space where significant data is present (discovered by a clustering algorithm).
Spreads are again chosen by normalisation, while the weights to the output layer are computed using the Least
Mean Square supervised learning algorithm. Here, the output of the hidden units is provided as input to the LMS
algorithm.
8.2.3 Algorithm 3
Here we supervise also the selection of the centres, using gradient descent to train all the three parameters. We
obtain at the end three update rules.
Centres update rule We use a different training rate for each centre
\Delta c_j = -\eta_{c_j} * \frac{dE}{dc_j}    (35)
Spreads update rule We can use only one spread thanks to normalisation of the spreads
\Delta \sigma_j = \Delta \sigma = -\eta_\sigma * \frac{dE}{d\sigma}    (36)
Weights update rule Weights are updated in the traditional delta rule way:
\Delta w_{ij} = -\eta_{ij} * \frac{dE}{dw_{ij}}    (37)
Neuron model The hidden layer of an RBF network is nonlinear, whereas the output layer is typically linear,
while the hidden and output layers of an MLP used as a classifier are usually all nonlinear (although for some
nonlinear regression problems a linear layer is placed at the output).
Activation function The argument of the activation function of each hidden unit in an RBF network is the
Euclidean norm between the input and the centre of the unit. Instead the argument of the activation function of
each hidden unit in an MLP is the weighted sum of the input vector with the unit’s weights.
Approximation MLP networks construct global approximations to a nonlinear input-output mapping. As a consequence,
they are capable of generalising in regions of the input space where few or no training data are available.
RBF networks instead use exponentially decaying localised non-linearities (e.g., Gaussian functions) and construct
local approximations of a nonlinear input-output mapping. As a result these networks are capable of faster
learning and are less sensitive to the order of presentation of the training data.
In many cases, however, we find that, in order to represent a mapping to some desired degree of smoothness,
the number of radial-basis functions required to span the input space adequately may be very large.
Clustering With unsupervised learning we base our computations on clustering of input data. No a-priori
knowledge is available regarding an input’s membership to a particular class. Network classifies input vectors
into one of a given number of categories, according to the clusters detected in the training set.
The output of each neuron is the scalar product of its weight vector and the input vector. We normalise all the
weight vectors to have the same length, hence lying on a hyper-sphere of radius 1.
The winning neuron is the one whose weight vector is most similar to the input one. As always we start from
a situation in which all the weights are chosen randomly. For each training sample, we adjust the weights of the
winning neuron to be more similar to the training sample, in order to increase the probability of victory when a
similar input is presented to the network later on:
When learning is completed each neuron will have its position in the input space depending on its final weights.
The final positions are called centroids of the data clusters. These centroids can be considered as the prototypes
of the clusters since they represent the key features of a cluster.
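One winner-takes-all learning step can be sketched as follows. The update used here, moving the winner towards the sample by a fraction eta and re-normalising it, is a common form of the rule and is an assumption of this example, as are all the names.

```python
import math

def normalise(v):
    """Scale a vector to unit length, so that it lies on the unit hyper-sphere."""
    n = math.sqrt(sum(c * c for c in v))
    return [c / n for c in v]

def winner(weights, x):
    """Index of the neuron whose weight vector has the largest scalar product with x."""
    scores = [sum(wi * xi for wi, xi in zip(w, x)) for w in weights]
    return max(range(len(weights)), key=scores.__getitem__)

def wta_update(weights, x, eta=0.5):
    """Move only the winning weight vector towards x, then re-normalise it."""
    k = winner(weights, x)
    moved = [wi + eta * (xi - wi) for wi, xi in zip(weights[k], x)]
    weights[k] = normalise(moved)
    return k

weights = [[1.0, 0.0], [0.0, 1.0]]
sample = normalise([2.0, 1.0])
k = wta_update(weights, sample)
```

After the update the winner responds more strongly to the same sample, while the losing neurons are left untouched.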
Limitations A simple WTA network can only perform single-layer clustering analysis, hence the clusters must be
linearly separable by hyper-planes passing through the origin of the space. Also, we need to specify the number of
clusters in advance.
Also, there’s no spatial relationship between the clusters identified by the neurons.
9.2.1 Structure
SOMs usually consist of a one- or two-dimensional array of identical neurons. The input vector is broadcast in
parallel to all these neurons and, for each input vector, the most responsive neuron fires, as in a WTA network.
The weights of the firing neuron and of the neighbouring ones are updated accordingly.
The neighbourhood radius is reduced over time and weights are changed in proportion to the closeness
to the neighbourhood centre.
10 Clustering algorithms
10.1 Introduction
Clustering is the most important unsupervised learning problem; its aim is to find a structure in a collection
of unlabelled data.
Cluster A cluster is a set of objects that are "similar" to each other and "dissimilar" from the objects
belonging to other clusters.
10.1.1 Requirements
A clustering algorithm must fulfil these requirements:
Scalability Performance must not degrade when increasing volume of input data.
Order Independence The order of input records must not influence the algorithm behaviour.
Interpretability and usability The output of the algorithm must be usable by other software and inter-
pretable by both human/machine.
10.1.2 Problems
The main problems of most currently developed algorithms are:
Requirement fulfilment Typically an algorithm fails in fulfilling all requirements expressed above.
Poor scalability Large dimension and volume of input data typically cause very high time complexity.
Distance dependent The output of the algorithm is strictly dependent on the definition of distance used to
cluster samples. Moreover, there’s no obvious definition of distance, and it may also depend on application
requirements.
Ambiguity Often, the output of the algorithm can be interpreted in different ways.
The objective function The aim of this algorithm is to minimise the following quantity:
J = \sum_{j=1}^{k} \sum_{i=1}^{n} \| x_i^{(j)} - c_j \|^2    (39)
10.3.1 Algorithm
More formally the algorithm is the following:
1. Place K points into the space of the objects to be clustered. These points are the initial centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids by means of the barycenter
operator.
4. Repeat steps 2 and 3 until the centroids do not move, or their movement is below a certain ε (according to the
distance metric chosen for the algorithm).
Since this algorithm is a greedy one, it does not necessarily produce the optimal configuration, corresponding to the
global minimum of the objective function shown before.
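The four steps above can be sketched in plain Python, assuming the Euclidean distance and initial centroids supplied by the caller. The function name kmeans, the eps and max_iter defaults and the sample points are illustrative; an empty cluster simply keeps its old centroid, which is one possible choice.

```python
import math

def kmeans(points, centroids, eps=1e-9, max_iter=100):
    """Alternate assignment and centroid update until the centroids stop moving."""
    centroids = [tuple(c) for c in centroids]
    for _ in range(max_iter):
        # Step 2: assign each object to the closest centroid
        clusters = [[] for _ in centroids]
        for p in points:
            j = min(range(len(centroids)), key=lambda c: math.dist(p, centroids[c]))
            clusters[j].append(p)
        # Step 3: recompute each centroid as the barycenter of its cluster
        new_centroids = []
        for j, members in enumerate(clusters):
            if members:
                new_centroids.append(tuple(sum(col) / len(members) for col in zip(*members)))
            else:
                new_centroids.append(centroids[j])  # empty cluster: keep the old centroid
        # Step 4: stop when the largest centroid movement is below eps
        shift = max(math.dist(a, b) for a, b in zip(centroids, new_centroids))
        centroids = new_centroids
        if shift < eps:
            break
    return centroids, clusters

points = [(0.0, 0.0), (0.0, 1.0), (10.0, 10.0), (10.0, 11.0)]
centres, groups = kmeans(points, [(0.0, 0.0), (10.0, 10.0)])
```

Running it from several different initial centroids and keeping the result with the lowest J is a common way to soften the dependence on initialisation.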
10.3.2 Remarks
This algorithm has the following main characteristics:
• The way to initialise the means is not specified in any way (typically we choose them randomly but well
spaced)
• The final result highly depends on the initial centroids.
• It may happen that the set of samples closest to a certain point m i is empty, so m i cannot be updated.
• As already said, the result depends on the chosen distance metric
• Finally, since k is fixed, the result depends on it.
Where m ∈ R : m ≥ 1
3. Set m = m + 1 and merge clusters (r) and (s) into a single one, and call it m. Set the level L(m) = d[(r), (s)]
4. Delete the rows corresponding to the merged clusters from D and add the row for the new cluster, with the
distances updated as follows (recall that this is single-linkage):
5. If all objects are in one cluster STOP, otherwise start again from 2.
Of course you may not want only one cluster at the end. To obtain k clusters you have to cut the k − 1
longest links from the tree.
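A naive single-linkage sketch, assuming the Euclidean distance. Instead of building the whole tree and cutting the longest links, this variation simply stops merging when k clusters are left; it recomputes the cluster distances at every step rather than maintaining the matrix D, so it is meant for illustration only.

```python
import math
from itertools import combinations

def single_linkage(points, k):
    """Merge the two closest clusters until only k clusters remain."""
    clusters = [[p] for p in points]          # start with one cluster per object

    def d(a, b):
        # Single linkage: distance between the two closest members
        return min(math.dist(p, q) for p in a for q in b)

    while len(clusters) > k:
        r, s = min(combinations(range(len(clusters)), 2),
                   key=lambda rs: d(clusters[rs[0]], clusters[rs[1]]))
        clusters[r] = clusters[r] + clusters[s]   # merge (s) into (r); r < s always holds
        del clusters[s]
    return clusters

points = [(0, 0), (0, 1), (5, 5), (5, 6), (20, 20)]
result = single_linkage(points, 3)
```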
10.5.3 Problems
The two main weaknesses of agglomerative approaches like the one shown before are:
• Poor scaling: time complexity is at least O(n^2) (quadratic in the number n of objects)
• Operations done at iteration k destroy information produced by operations done at iteration k − 1 - that is,
no undo is possible.
10.6 Performance
If I want to assess the performance of a clustering algorithm I can exploit the definitions of compactness and
separation.
C = \frac{1}{N} \sum_{i=1}^{K} \sum_{j=1}^{N} u_{ij}^{m} \| x_j - c_i \|^2    (45)
The closer the points are to their cluster centres the smaller the compactness.
The more the centroids are distant between each other, the higher is the separation.
Xie-Beni index This index is built from the two previous definitions: the lower the index, the better the
clustering - that is, a high level of cluster separation and a high level (small number) of compactness inside the
clusters.
XB = \frac{C}{S}    (47)
11 Classification
Classification is the process of assigning objects to classes. A class contains similar objects. Every object can
be described by a set of attributes or characteristics called features. These features may be categorical - that is,
qualitative - or numerical - that is, quantitative. In the latter case we can arrange the features into an n-dimensional
vector:
x = [x_1, \ldots, x_n]^T \in R^n    (48)
The space to which the vector belongs is called feature space.
11.1 Classifier
A classifier is any function:
D : R^n \to \Omega    (49)
Where Ω is the set of class labels. Hence, a classifier is built out of a labelled set of data, and can assign labels to
objects, thus predicting the class to which a new unlabelled object belongs. A classifier can be:
• Discrete: if it outputs a class label
• Continuous: if it outputs a real value, that is a certain degree of membership of that object to classes in Ω.
We’ll need to apply a certain threshold to precisely classify that object.
Its 1-complement is called error rate. Accuracy can be used only when class a-priori probabilities are constant
and quite uniformly distributed. It also assumes equal penalties for misclassification, which is not always the case.
Real world problems In real world problems there is often a large number of cases that are either normal or
not interesting and a much smaller number of cases that are unusual but interesting. Thus, the class distribution is
very skewed.
Working scenario From now on, to assess other performance metrics, we’ll focus on a two-class problem, or
binary classification. So we will have:
• p class of positives
• n class of negatives
• Y predicted class is positive
• N predicted class is negative
That said, we can have 4 outcomes:
• True Positive (TP): positive object classified as positive (hit).
• True Negative (TN): negative object classified as negative (correct rejection).
• False Positive (FP): negative object classified as positive (false alarm).
• False Negative (FN): positive object classified as negative (miss).
FPs and FNs are caused by the overlap of the p and n distributions.
         p     n
Y        TP    FP
N        FN    TN
Total    P     N
The rows represent the predicted classes, while the columns represent the true classes. From the matrix we
can compute some values:
accuracy = \frac{TP + TN}{P + N}    (51)
True positive rate, also called sensitivity or hit rate:
TPR = \frac{TP}{TP + FN} = \frac{TP}{P}    (52)
False positive rate, also called false alarm rate, which is also the complement to 1 of the specificity:
FPR = \frac{FP}{FP + TN} = \frac{FP}{N}    (53)
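These quantities can be computed directly from lists of true and predicted labels. In this sketch 1 stands for the positive class and 0 for the negative one; the function names and the example labels are assumptions.

```python
def confusion_counts(true_labels, predicted_labels):
    """Count TP, FP, FN, TN for a binary problem (1 = positive, 0 = negative)."""
    tp = fp = fn = tn = 0
    for t, y in zip(true_labels, predicted_labels):
        if t == 1 and y == 1:
            tp += 1
        elif t == 0 and y == 1:
            fp += 1
        elif t == 1 and y == 0:
            fn += 1
        else:
            tn += 1
    return tp, fp, fn, tn

def metrics(tp, fp, fn, tn):
    """Accuracy (eq. 51), true positive rate (eq. 52) and false positive rate (eq. 53)."""
    p, n = tp + fn, fp + tn
    return {"accuracy": (tp + tn) / (p + n), "tpr": tp / p, "fpr": fp / n}

true = [1, 1, 1, 0, 0, 0, 0, 0]
pred = [1, 1, 0, 1, 0, 0, 0, 0]
m = metrics(*confusion_counts(true, pred))
```

On this made-up data the classifier finds two of the three positives and raises one false alarm over five negatives.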
A ROC graph shows the trade-off between benefits (TPR) and costs (FPR) of a classifier.
Some important points There are interesting points in the ROC space:
• (0, 0) The "never alarming" strategy: never issue a positive classification, so that no
false positive can occur (G.A.C.).
• (1, 1) The opposite of the previous one: unconditionally issue positive classifications.
• (0, 1) Perfect classification.
• y = x represents a random classifier: every classifier that falls below this line performs worse than
random classification.
A discrete classifier (i.e. an instance of a confusion matrix) will produce a single point in the ROC space. When
comparing two discrete classifiers, the one that is most "top-left" - that is, higher TPR and lower FPR - has better
performance.
ROC for continuous classifier A continuous classifier will instead produce a ROC curve, based on the value
we associate with the classification threshold. In this case, comparing two classifiers is not as straightforward as
before, since curves typically do not lie entirely above or below other curves.
Area under the curve In this case it is better to use the area under the curve, or AUC. An AUC value of 0.5
indicates a random classifier, while an AUC value of 1 denotes perfect classification.
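A sketch of how a ROC curve and its AUC can be obtained from the scores of a continuous classifier: every distinct score is used as a threshold, and the area is computed with the trapezoid rule. The function names and the scores are illustrative assumptions.

```python
def roc_points(scores, labels):
    """(FPR, TPR) points obtained by sweeping the threshold from high to low."""
    P = sum(labels)
    N = len(labels) - P
    points = [(0.0, 0.0)]                      # threshold above every score
    for thr in sorted(set(scores), reverse=True):
        tp = sum(1 for s, l in zip(scores, labels) if s >= thr and l == 1)
        fp = sum(1 for s, l in zip(scores, labels) if s >= thr and l == 0)
        points.append((fp / N, tp / P))
    return points

def auc(points):
    """Area under the ROC curve by the trapezoid rule."""
    return sum((x2 - x1) * (y1 + y2) / 2
               for (x1, y1), (x2, y2) in zip(points, points[1:]))

perfect = auc(roc_points([0.9, 0.8, 0.2, 0.1], [1, 1, 0, 0]))
mixed = auc(roc_points([0.9, 0.8, 0.7, 0.6], [1, 0, 1, 0]))
```

In the second example one negative is scored above one positive, so the area drops below 1.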
From now on we’ll refer to an example that takes as input images of [32, 32, 3] dimension, to classify them as
digits, having thus n = 10 possible classes.
The last layer, instead, is a fully connected layer that outputs a single vector of class scores, with dimension
[1, 1, n], where n is the number of classes we want to classify our input into.
Parameters Some of the previous layers contain parameters (like CONV/FC - weights and biases) while others
do not have any parameters, implementing fixed functions (like RELU/POOL). Hence the parameters involved in
training are those in the CONV/FC layers.
During the forward pass we convolve - or slide - each filter across the entire dimensions of the input volume,
computing dot products between entries and filter parameters (the weights
Local connectivity When dealing with high-dimensional inputs such as images, it is very impractical to use
fully connected layers from the very beginning. Instead it is more appropriate (also from a brain perspective) to
connect each neuron to a local region of the input volume. The spatial extent of this connectivity is a hyper-parameter
called the receptive field of the neuron, which is equal to the filter size. Recall that this extent is always and
only along the width and the height of the image, while the depth is convolved as a whole.
Spatial arrangement We have defined the parameters to be trained but not the number of neurons in a CONV
layer. There are three hyper-parameters that define this quantity:
Depth The depth of the output volume corresponds to the number of filters we use for it: each filter will look for
different features of the input (like edges and colours). A set of neurons that look at the same region is
called a depth column or fibre.
Stride The stride is how much we slide the filter along the input volume. A higher stride will reduce the output
volume.
Zero-Padding It is sometimes necessary/useful to pad the border of the input volume with zeros. This allows us to control
the spatial size of the output volumes (typically to preserve it).
Output volume calculation So, given W the input volume size, F the receptive field size, S the stride and P
the zero-padding amount used on the border, the output size O will be:
O = \frac{W - F + 2P}{S} + 1    (54)
NOTE: When selecting a stride, one must ensure that O is an integer, otherwise we say that the neurons do
not fit and the set of parameters is not valid.
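Eq. (54) and the validity check of the note can be sketched as a small helper. The function name conf_output_size is hypothetical, and raising an error for an invalid parameter set is one possible choice; the 5-by-5 filter in the usage line is an illustrative value.

```python
def conv_output_size(W, F, S, P):
    """Spatial output size of a CONV layer, O = (W - F + 2P) / S + 1 (eq. 54)."""
    span = W - F + 2 * P
    if span % S != 0:
        # O would not be an integer: the neurons do not fit
        raise ValueError("invalid hyper-parameters: neurons do not fit")
    return span // S + 1

# A 32-wide input with a 5-wide filter, stride 1 and padding 2 keeps its spatial size
size = conv_output_size(32, 5, 1, 2)
```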
Parameter sharing A parameter sharing scheme is used in CONV layers to control the number of parameters.
The basic idea is: if one feature is useful to compute at (x, y), it is likely to be useful also at (x_2, y_2). So, if we call a
single 2D slice along the depth a depth slice, we force the neurons in each depth slice to use the same weights and biases.
NOTE: if all neurons in a single depth slice use the same weight vector, the forward pass can be computed,
in each slice, as a convolution of the neurons’ weights with the input volume. This is why we commonly refer to
sets of weights as a filter or kernel.
Operations The POOL layer operates independently on every depth slice of the input and resizes it spatially, using
a MAX operator.
A very common form of POOL is a layer with 2-by-2 filters applied with a stride of 2, which applies the MAX operator
to each group of 4 input numbers. This lets us discard 75% of the activations.
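The 2-by-2, stride-2 MAX pooling on one depth slice can be sketched as below. The slice is represented as a list of rows with even width and height, which is an assumption of the example, as is the function name.

```python
def max_pool_2x2(depth_slice):
    """2-by-2 MAX pooling with stride 2 on a single depth slice (list of rows)."""
    rows, cols = len(depth_slice), len(depth_slice[0])
    return [[max(depth_slice[r][c], depth_slice[r][c + 1],
                 depth_slice[r + 1][c], depth_slice[r + 1][c + 1])
             for c in range(0, cols - 1, 2)]
            for r in range(0, rows - 1, 2)]

pooled = max_pool_2x2([[1, 2, 3, 4],
                       [5, 6, 7, 8],
                       [9, 10, 11, 12],
                       [13, 14, 15, 16]])
```

Sixteen activations become four, so three out of four values are discarded, while the depth of the volume is left unchanged.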
General pooling There exist other types of pooling layers, such as average pooling or L2-norm pooling.
Dropout Since a fully connected layer holds most of the parameters, it is very prone to over-fitting. To
reduce this risk, at each training stage individual nodes are dropped from the network with probability p, to
obtain a reduced network. So, in that stage, only the reduced network is trained on the data and, at the end of the
stage, the removed nodes are inserted back into the network with their original weights.
During training typically p = 0.5 for hidden nodes, while for input nodes it must be considerably lower, since
dropping a node in that layer would mean losing information.
Each fuzzy set is defined on a set of elements that defines the universe of discourse.
Formal definitions More formally a fuzzy set A is characterised by its membership function:
\mu_A : X \to [0, 1]    (55)
Membership functions Typical membership functions are bell-shaped, triangular or trapezoidal functions.
Height
hei g ht (A) = sup µ A (x) (57)
x∈X
Core Note that there exist fuzzy sets s.t. their core is empty (subnormal fuzzy set )
Support
suppor t (A) = {x ∈ X |µ A (x) > 0} (59)
Fuzzy singleton A fuzzy singleton is a set with only one value whose degree of membership is equal to 1 while
all the other values have a degree of membership = 0. [IMG]
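The quantities above can be computed directly on a small discrete universe. In this sketch the triangular membership function, its parameters and the universe are hypothetical, and the core is taken in its usual sense, i.e. the elements with membership degree 1.

```python
def triangular(a, b, c):
    """Membership function of a triangular fuzzy set with feet a, c and peak b."""
    def mu(x):
        if x <= a or x >= c:
            return 0.0
        if x <= b:
            return (x - a) / (b - a)
        return (c - x) / (c - b)
    return mu


universe = [0, 1, 2, 3, 4, 5, 6]  # hypothetical discrete universe of discourse X
mu_a = triangular(1, 3, 5)

height = max(mu_a(x) for x in universe)  # on a finite X the sup is a max
support = [x for x in universe if mu_a(x) > 0]
core = [x for x in universe if mu_a(x) == 1.0]
print(height, support, core)  # 1.0 [2, 3, 4] [3]
```

A fuzzy singleton would be the special case in which support and core coincide and contain a single element.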
13.2.3 Complement
For the complement operation we have several different definitions:
• Standard: $\bar{A}(x) = 1 - A(x)$
• Round complement: $\bar{A}(x) = \sqrt{1 - [A(x)]^2}$
• Yager
• Sugeno
NOTE: The intersection between A and its complement is not, in general, the empty set, and the union between A
and its complement is not, in general, the universe.
Composition (max-min composition) Given two relations R(X, Y) and S(Y, Z), we can define a relation
U(X, Z) = R(X, Y) ∘ S(Y, Z) as:
$U(X, Z)(x, z) = \max_{y \in Y} \min\{R(x, y), S(y, z)\}$ (61)
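For finite universes, relations can be stored as matrices of membership degrees and the max-min composition becomes a matrix product in which multiplication is replaced by min and addition by max. The function name and the sample relations are assumptions made for illustration.

```python
def max_min_composition(r, s):
    """Compose fuzzy relations R (|X| x |Y|) and S (|Y| x |Z|), given as nested lists."""
    n_y = len(s)
    n_z = len(s[0])
    return [
        [max(min(r_row[y], s[y][z]) for y in range(n_y)) for z in range(n_z)]
        for r_row in r
    ]


# Hypothetical relations: |X| = 2, |Y| = 2, |Z| = 3.
R = [[0.7, 0.5],
     [0.8, 0.4]]
S = [[0.9, 0.6, 0.2],
     [0.1, 0.7, 0.5]]
U = max_min_composition(R, S)
print(U)  # [[0.7, 0.6, 0.5], [0.8, 0.6, 0.4]]
```

For example, U(x_1, z_3) = max(min(0.7, 0.2), min(0.5, 0.5)) = 0.5.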
Sup-T composition But, since the minimum expressed there is a t-norm we can also write:
Modus ponens The most used inference rule in classical logic is modus ponens, which has the form:
$(P \land (P \to Q)) \to Q$ (63)
In other words, if P is known to be true and P implies Q, then Q must be true. However, we need a fuzzy logic
extension of modus ponens, that is:
$(P' \land (P \to Q)) \to Q'$ (64)
where P' is slightly different from P, and so Q' from Q. The terms of the previous rule are called fuzzy propositions.
A fuzzy implication I defines, for any possible truth values x, y of p, q, the truth value I(x, y) of the proposition "if p then q".
This is an extension of the classical implication function, that is:
$I(a, b) = \lnot a \lor b$
However, if we look at the generalized modus ponens $B' = A' \circ (A \to B)$, we can write it as a sup-min composition:
De-fuzzification The final fuzzy set is often defuzzified in order to produce a numeric value. Typical
defuzzifiers are:
• Center of Area (COA), also called center of gravity or centroid method
• Mean of Maxima (MOM)
We need these tools because we want to formalize vague and imprecise concepts while, on the other hand,
obtaining precise results from vague and imprecise information.
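Both defuzzifiers can be sketched on a fuzzy set sampled at a few points of its universe. The discrete formulas used here (a membership-weighted mean for COA, the plain mean of the maximising points for MOM) are the usual sampled versions, and the function names and data are hypothetical.

```python
def center_of_area(xs, mus):
    """Centroid of a sampled fuzzy set: sum(x * mu(x)) / sum(mu(x))."""
    total = sum(mus)
    if total == 0:
        raise ValueError("empty fuzzy set: the centroid is undefined")
    return sum(x * m for x, m in zip(xs, mus)) / total


def mean_of_maxima(xs, mus):
    """Average of the points where the membership degree is maximal."""
    peak = max(mus)
    maxima = [x for x, m in zip(xs, mus) if m == peak]
    return sum(maxima) / len(maxima)


xs = [0, 1, 2, 3, 4]             # sample points of the output universe
mus = [0.0, 0.5, 1.0, 1.0, 0.0]  # membership degrees of the final fuzzy set
coa = center_of_area(xs, mus)
mom = mean_of_maxima(xs, mus)
print(coa, mom)
```

COA takes the whole shape of the set into account, so here it is pulled towards the left shoulder, while MOM only looks at the plateau between 2 and 3.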
Mamdani
The rule consequent is a fuzzy set. The main advantage of this kind of fuzzy rule is the high
interpretability of the result; the drawbacks are low accuracy and high computational cost.
Takagi-Sugeno-Kang (TSK)
The consequent is a linear model of the input variables. Sometimes second-order functions are used instead.
The system's output is the average of the rule outputs, weighted by each rule's firing strength.
The main advantage comes from greater accuracy w.r.t Mamdani systems. As a drawback we obtain low
interpretability.
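A first-order TSK output can be sketched as follows, assuming the weighted average is taken over the rule outputs with the firing strengths as weights. The function name, the two rules and the input values are hypothetical.

```python
def tsk_output(firing_strengths, consequents, inputs):
    """First-order TSK output: firing-strength-weighted average of the rule outputs.

    Each consequent is a pair (coefficients, bias): y_k = sum(c_i * x_i) + bias.
    """
    rule_outputs = [
        sum(c * x for c, x in zip(coeffs, inputs)) + bias
        for coeffs, bias in consequents
    ]
    total = sum(firing_strengths)
    if total == 0:
        raise ValueError("no rule fired")
    return sum(w * y for w, y in zip(firing_strengths, rule_outputs)) / total


inputs = [2.0, 1.0]
consequents = [([1.0, 0.0], 0.0),   # rule 1: y = x1
               ([0.0, 2.0], 1.0)]   # rule 2: y = 2 * x2 + 1
strengths = [0.25, 0.75]            # assumed firing strengths of the two rules
y = tsk_output(strengths, consequents, inputs)
print(y)  # 2.75
```

No defuzzifier is needed here, since every rule already produces a crisp number.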
Singleton fuzzy rule The rule consequent is a constant value. It is a good tradeoff between Mamdani
interpretability and TSK accuracy. Moreover, the defuzzification process is less demanding.
Learning algorithm We can use the well-known gradient descent or use a hybrid learning algorithm which
combines LSE and gradient descent:
• In the forward pass, premise parameters are fixed and consequent ones are updated by means of LSE.
• In the backward pass, premise parameters are updated by means of gradient descent while consequent ones
are fixed.
Structure oriented approach We provide dataset along with fuzzy sets for each attribute. Then we
superpose hyperboxes according to these fuzzy sets over our data. Then we generate fuzzy rules to identify each
hyperbox that contains data.
Cluster oriented approach We perform clustering on our data - that is, unsupervised learning. Then we
project clusters into data, creating fuzzy rules for each cluster. In this way we manage to create both
membership functions and fuzzy rules.
Hyperbox oriented approach We define some hyperboxes (thus the algorithm is supervised in the
parameters we provide) to cover data. Then we project hyperboxes onto data, thus generating both fuzzy rules
and sets.
Acknowledgments Using the structure oriented approach we obtain more interpretable fuzzy rules. The other
two are less useful because:
• Each fuzzy rule uses individual fuzzy sets.
• Fuzzy sets obtained by projection are hard to interpret linguistically.
14 Genetic Algorithms
14.1 Introduction
Genetic algorithms, from now on referred to as GAs, are search and optimization methods that aim to mimic
natural evolution. This type of evolution progresses through three fundamental processes:
• Selection: picks the individuals that will produce offspring.
• Recombination (crossover): combines two different individuals to produce offspring.
• Mutation: random variation of existing genetic material.
To reproduce those aspects, GAs perform systematic random searches to improve the probability of finding globally
optimal solutions to the problem.
14.2 Characteristics
A GA considers a population of chromosomes, or individuals, each of which encodes a potential solution to the
optimization problem.
Each chromosome is a set of genes, each representing a specific feature of the individual; the chromosome as a
whole is a point in the search space.
The quality of a chromosome is measured by its fitness function, that is, a scalar objective function. The higher
the fitness, the higher the likelihood of selecting that individual to produce offspring.
14.7 Canonical GA
In this version of the algorithm, the number of chromosomes is constant in each generation and the entire
population is replaced by the following generation (although other replacement schemes are possible). Chromosomes
are binary strings of constant length, where each gene is 0 or 1.
The algorithm is the following:
1. Set generation t = 0
2. Initialize a random population P(t)
3. Evaluate it using the fitness function
4. Until DONE:
(a) Select from P(t) a population P1 for recombination. Then, according to the crossover probability, choose
some individuals i ∈ P1 to enter the mating pool MP.
(b) Recombine the individuals i ∈ MP to form population P2, and mutate the individuals of P2 with the
mutation probability to obtain a new population P3.
(c) Select for replacement from P3 and P(t), and form the new generation P(t + 1).
(d) t := t + 1
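The loop above can be sketched in Python. Several choices here are assumptions rather than part of the notes: DONE is a fixed number of generations, P1 is drawn by roulette wheel, recombination is one-point crossover, individuals left out of the mating pool are copied to P2 unchanged, the fitness is non-negative, and the example fitness is the number of 1s in the chromosome (OneMax). All names and default values are hypothetical.

```python
import random


def canonical_ga(fitness, n_genes=20, pop_size=30, generations=40,
                 p_crossover=0.8, p_mutation=0.02, seed=0):
    """Sketch of a canonical GA on binary chromosomes; returns the last generation."""
    rng = random.Random(seed)
    # Steps 1-3: t = 0 and a random initial population P(t).
    population = [[rng.randint(0, 1) for _ in range(n_genes)]
                  for _ in range(pop_size)]
    for t in range(generations):  # step 4: DONE = fixed generation budget (assumption)
        # (a) Evaluate P(t) and select P1 by roulette wheel; the small constant
        #     keeps the weights valid if every fitness is zero.
        weights = [fitness(c) + 1e-9 for c in population]
        p1 = rng.choices(population, weights=weights, k=pop_size)
        # Each selected individual enters the mating pool with probability p_crossover.
        mating_pool, p2 = [], []
        for chromosome in p1:
            target = mating_pool if rng.random() < p_crossover else p2
            target.append(chromosome[:])
        if len(mating_pool) % 2 == 1:  # an unpaired individual is copied unchanged
            p2.append(mating_pool.pop())
        # (b) One-point crossover on consecutive pairs forms the rest of P2.
        for i in range(0, len(mating_pool), 2):
            a, b = mating_pool[i], mating_pool[i + 1]
            cut = rng.randint(1, n_genes - 1)
            p2.append(a[:cut] + b[cut:])
            p2.append(b[:cut] + a[cut:])
        # Bit-flip mutation of every gene with probability p_mutation gives P3.
        p3 = [[1 - g if rng.random() < p_mutation else g for g in c] for c in p2]
        # (c)-(d) Generational replacement: P(t + 1) is exactly P3.
        population = p3
    return population


final_generation = canonical_ga(fitness=sum)  # OneMax: fitness = number of 1s
best = max(final_generation, key=sum)
print(sum(best), best)
```

The population size stays constant because every pair in the mating pool produces exactly two children.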
Solution The solution is represented by the last generation, but this does not guarantee that better individuals
have not been obtained previously. We may keep the best individuals from each generation by means of
elitism: copy the fittest individuals (or a percentage of them) from one generation to the next.
14.8 Selection
We can perform several types of selection:
• Proportional selection (or roulette wheel selection): selection probability is proportional to individual
fitness:
$p(i) = \frac{f(i)}{\sum_j f(j)}$
We then "spin the wheel" n times to select n individuals (the higher the probability the higher the roulette
slice assigned to that individual).
• Tournament selection: randomly select k individuals from the population and then choose the fittest of them
(the tournament winner).
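Both selection schemes fit in a few lines using Python's random module. This is a sketch: the function names are hypothetical, fitness values are assumed non-negative, and the tournament draws its k contenders without replacement.

```python
import random


def roulette_probabilities(fitnesses):
    """Selection probability of each individual: f(i) / sum_j f(j)."""
    total = sum(fitnesses)
    return [f / total for f in fitnesses]


def roulette_select(population, fitnesses, n, rng):
    """Spin the wheel n times; each spin picks one individual, with replacement."""
    return rng.choices(population, weights=fitnesses, k=n)


def tournament_select(population, fitnesses, k, rng):
    """Draw k distinct individuals at random and return the fittest of them."""
    contenders = rng.sample(range(len(population)), k)
    winner = max(contenders, key=lambda i: fitnesses[i])
    return population[winner]


rng = random.Random(0)
population = ["a", "b", "c"]
fitnesses = [1, 3, 4]
print(roulette_probabilities(fitnesses))  # [0.125, 0.375, 0.5]
print(roulette_select(population, fitnesses, n=3, rng=rng))
print(tournament_select(population, fitnesses, k=2, rng=rng))
```

In the tournament, a larger k favours the fittest individuals more strongly; with k equal to the population size the best individual always wins.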
Selection pressure Selection pressure is the degree to which better individuals are favoured by the selection
operator. The higher the SP the more the exploitation.
A strong SP may cause premature convergence to sub-optimal solutions, while a weak SP may slow down the
algorithm.
$z_i = \alpha x_i + (1 - \alpha) y_i, \quad \alpha \in [0, 1]$
• All-positions mutation: all the genes are perturbed in a similar way. For example we can apply an
additive normal mutation:
$x_i' = x_i + \alpha_i \cdot N(0, \sigma_i)$
• Non uniform mutation: the magnitude of the modifications decreases gradually over the generations.
This means that in early steps modifications are very significant, leading to big improvements from
generation to generation, while in late ones we refine the work done so far.
Typically two parameters are randomly generated at each generation: the nature of change (increase or
decrease values) and the amplitude of change (a ∼ U([0, 1])).
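For real-valued chromosomes, the weighted-average recombination z_i = αx_i + (1 − α)y_i and the additive normal mutation can be sketched as follows. The function names and the sample values are hypothetical, and σ_i is read as the standard deviation of the normal noise.

```python
import random


def arithmetic_crossover(x, y, alpha):
    """Child gene z_i = alpha * x_i + (1 - alpha) * y_i, with alpha in [0, 1]."""
    return [alpha * xi + (1 - alpha) * yi for xi, yi in zip(x, y)]


def additive_normal_mutation(x, alphas, sigmas, rng):
    """Perturb every gene: x_i' = x_i + alpha_i * N(0, sigma_i)."""
    return [xi + a * rng.gauss(0.0, s) for xi, a, s in zip(x, alphas, sigmas)]


rng = random.Random(0)
parent_x = [0.0, 2.0]
parent_y = [4.0, 6.0]
child = arithmetic_crossover(parent_x, parent_y, alpha=0.5)
mutant = additive_normal_mutation(child, alphas=[1.0, 1.0], sigmas=[0.1, 0.1], rng=rng)
print(child, mutant)
```

With α = 0.5 the child is the midpoint of the two parents; a non uniform mutation could be obtained by shrinking the α_i or σ_i values as the generations go by.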
Challenges So, when solving an optimization problem, the following issues must be addressed:
• Derive a genetic representation of candidate solutions
• Create an initial population (random?) of solutions.
• Define the fitness function
• Choose selection and genetic operators
• Choose the parameters of the GA, like population size, max number of generations and probabilities of genetic
operators.