Intelligent Systems Notes: Federico Rossi A.A 2017/2018

Intelligent systems notes
Federico Rossi
A.A 2017/2018
Last updated on June 16, 2018

Contacts: Federico Rossi
1 Introduction
We are interested in reproducing the inner structure of the human brain. The central computation unit is the
"biological neuron", composed of:
• Dendrites (inputs)
• Body
• Axon (outputs)
A neuron receives inputs from other neurons and approximately sums them. If the summed value exceeds a certain threshold, the neuron activates and produces a spike (an electrical impulse) that travels to nearby neurons.
Axons almost touch the dendrites of the next neurons, so the signal is transmitted by neurotransmitters: chemical particles released by the first neuron that bind to receptors in the second. The link between two neurons is called a synapse.
The travelling signal has a given value, but the receiving neuron receives that value modulated by the corresponding synapse, which can:
• Enhance
• Inhibit
The learning process consists in altering the strength of the synaptic connections: by modifying how strongly a synapse modulates the travelling signal, we modify the input value seen by the second neuron.
We have about 10 billion neurons with 10,000 synapses each. Computers run much faster than brains (typically 100 MHz), but the brain has important characteristics:
• Can learn from experience
• Performance degrades gracefully under partial damage
• It performs massively parallel computations extremely efficiently.
2 Neural networks
How can we mimic those characteristics of the brain? We have to model the low-level structure of the brain and its synaptic learning. An artificial neuron has many inputs, coming both from other neurons and from the external environment, but only one output. Each input connection has an associated adjustable weight, which can be modified according to synaptic learning. A neuron then computes a function, called the activation (or transfer) function, on the weighted sum of its inputs:
$$ y_j = f(a) = f\left( \sum_{i=1}^{N} w_{ij} \, x_i \right) \tag{1} $$
2.1 Non-linear activation functions

Binary threshold function Output is 1 if the input is greater than or equal to a given threshold $\theta$, and 0 otherwise:
$$ y_j = \begin{cases} 0 & a < \theta \\ 1 & a \ge \theta \end{cases} \tag{2} $$
It can also be written as below, where $\theta$ (called bias) is interpreted as one more input, equal to $x_0 = 1$ or $x_0 = -1$, with associated weight $w_{0j} = -\theta$ (or $\theta$):
$$ y_j = f(a - \theta) = f\left( \sum_{i=0}^{N} w_{ij} \, x_i \right) \tag{3} $$
Theta can be considered as one more degree of freedom upon which we can work.
Piece-wise-linear function
Sigmoid function With varying slope parameter (a smooth threshold function whose smoothness can be modified). The sigmoid function is differentiable, which makes the network easier to train (max/min search):
$$ f(a) = \frac{1}{1 + e^{-a}} \tag{4} $$
It is also called the logistic function (it is the differentiable version of the threshold function). The value $a$ can be multiplied by a constant to obtain different slopes. Depending on the specific application, we may need outputs outside the range [0, 1].
If we need a differentiable version of the signum function we have to switch to the hyperbolic tangent function:
$$ f(a) = \tanh a \tag{5} $$
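The activation functions above can be sketched in a few lines of Python. This is only an illustration: the function names and the `slope` parameter are assumptions made for this example, not part of the notes.

```python
import math

def binary_threshold(a, theta=0.0):
    """Equation (2): 1 when a >= theta, otherwise 0."""
    return 1 if a >= theta else 0

def sigmoid(a, slope=1.0):
    """Equation (4); `slope` is the constant that multiplies a."""
    return 1.0 / (1.0 + math.exp(-slope * a))

def neuron_output(weights, inputs, f):
    """Equation (1): activation function applied to the weighted sum."""
    a = sum(w * x for w, x in zip(weights, inputs))
    return f(a)

# Weighted sum is 0.5*2.0 - 1.0*1.0 = 0, so the logistic output is 0.5
print(neuron_output([0.5, -1.0], [2.0, 1.0], sigmoid))
# Hyperbolic tangent, equation (5), is available in the standard library
print(math.tanh(0.0))
```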
Federico Rossi, University of Pisa, 2017/2018

3 Connecting neurons
We have many different types of neural networks depending on how neurons are connected together. For example, each unit may connect to every other unit, or units may be organised in layers with connections between successive layers.
3.1 Layered Feed-forward Network

Neurons are organised in layers: a signal propagates through the network from the input layer to the output layer.
Example Two layers (in most papers the input layer is not counted, because no computation occurs in it).
Example More layers can be added to the network (more on this later on).
Input neurons do not perform any computation; they only receive input from the environment and propagate it to the next layer. A hidden layer is not connected to the external environment, neither on its input nor on its output. The hidden layer is important in the learning process, since it encapsulates the power of the neural network.
Completely connected layered networks require each neuron in a layer to be connected to all the neurons in the adjacent layer (this is not always the case).
Note There is no connection between the input layer and the layers beyond its immediately adjacent one.
The number of neurons in each layer is application-dependent. You have to decide how to represent the inputs, and there are many possible choices. You also have to decide how to represent the output and, as a consequence, the number of output neurons. Choosing the number of nodes in a hidden layer almost always involves trial and error.
3.2 Recurrent networks

They have at least one feedback connection in the network: a neuron's output feeds a neuron's input. [IMG]
3.3 Other typologies

Strongly connected digraph Neurons are connected in a bidirectional way. No subset of neurons can be considered as the input (there are no layers). The network evolves isolated from the environment, eventually reaching a steady state. These networks are used as memory networks.
Partially connected networks We can also have partially connected networks employing unidirectional connections between neurons.
3.4 Input-Output mapping

Training A network can be trained to reproduce an existing but unknown relation between inputs and outputs. Supervised learning (or learning with a teacher) modifies the synaptic weights by applying a set of labelled training samples. Each sample consists of an input signal and the corresponding desired response.
Providing inputs The network is fed a sample randomly chosen from the training set, and the synaptic weights are modified to minimise the difference between the expected outputs and the actual outputs. The training set is presented to the network many times, in a different random order on each training session. The network learns from the training examples by constructing an input/output mapping for the problem under consideration.

4 Delta rule
It is a form of supervised learning, where for every input pattern there is a desired output pattern. In practice (least mean squares):
• We prepare a training set composed of examples (pairs of input and desired output)
• The difference between the actual outputs and the expected outputs is the error
The rule The delta rule states how to change the weights, given the error between the expected response $t$ and the response $y$ actually obtained:
$$ \Delta w_i = \eta \, \delta \, x_i = \eta \, (t - y) \, x_i \tag{6} $$
where $\eta$ is the training (learning) rate and $\delta = t - y$. So, on each pass of the rule we get:
$$ w_i(n + 1) = w_i(n) + \Delta w_i(n) \tag{7} $$
Example: linear fit You can use a single neuron with a linear transfer function, one input and a bias: $y = w_1 x + w_0$. The output is exactly the y-axis value for the given x-axis value. We can start from random values for $w_1$ and $w_0$, and then let the network learn by feeding it the points at our disposal and applying the delta rule.
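The linear-fit example can be sketched as follows. This is a minimal illustration under stated assumptions: the function name, the learning rate, the number of epochs and the sample points are all invented for the example, and the weights start at zero instead of random values so that the run is repeatable.

```python
def train_linear_neuron(points, eta=0.05, epochs=500):
    """Fit y = w1*x + w0 by applying the delta rule (equations 6 and 7)."""
    w1, w0 = 0.0, 0.0  # assumption: fixed start instead of random values
    for _ in range(epochs):
        for x, t in points:
            y = w1 * x + w0          # linear transfer function
            delta = t - y            # error on this sample
            w1 += eta * delta * x    # weight of the input x
            w0 += eta * delta * 1.0  # weight of the bias input x0 = 1
    return w1, w0

# Hypothetical noise-free samples taken from the line y = 2x + 1
points = [(0.0, 1.0), (1.0, 3.0), (2.0, 5.0), (3.0, 7.0)]
w1, w0 = train_linear_neuron(points)
print(w1, w0)  # should approach the slope and intercept of the line
```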
Example: perceptron classification We can equip a neuron with two inputs, a bias and a step transfer function. Binary classification means finding a boundary between two classes. The output is binary (0 or 1): if the weighted sum is greater than zero the sample is classified as belonging to the first class, otherwise to the second. The boundary is the set of points where the weighted sum equals zero, so we apply the delta rule to adjust the weights that define this boundary.
Training errors We want to prove that the error on the output decreases on each iteration. For a pattern k and for the whole training set the error is:
$$ E_k = \frac{1}{2} \sum_j (t_j - o_j)^2 \tag{8} $$
$$ E = \sum_k E_k \tag{9} $$
What does this error represent? Take for example a linear neuron with two inputs $x_1, x_2$, weights $w_1, w_2$ and output $o = w_1 x_1 + w_2 x_2$. The error for a given pattern is
$$ E = \frac{1}{2} (t - o)^2 = \frac{1}{2} (t^2 - 2 t o + o^2) \tag{10} $$
Plotted against the weights, E is a paraboloid whose vertex is the minimum error. The weights are always modified so as to decrease the error, moving towards the vertex.
Proof What we want to obtain is the delta rule expressed as:
$$ \Delta w_{ij} = -G = -\frac{dE}{dw_{ij}} \tag{11} $$
That means we move in the direction of steepest decrease (the gradient G being the direction of steepest increase). In practice we always move downhill: with a positive slope we decrease the weight, while with a negative slope we increase it.

Let's take a linear output neuron
$$ o_j = \sum_i x_i \, w_{ij} \tag{12} $$
Applying the chain rule to Equation (11) we get:
$$ \frac{dE}{dw_{ij}} = \frac{dE}{do_j} \cdot \frac{do_j}{dw_{ij}} \tag{13} $$
The first term is (from the definition of the error E, with $\delta_j = t_j - o_j$):
$$ \frac{dE}{do_j} = -\delta_j \tag{14} $$
The second term is (from the definition of the output $o_j$):
$$ \frac{do_j}{dw_{ij}} = x_i \tag{15} $$
Thus, putting everything together we get:
$$ \frac{dE}{dw_{ij}} = -\delta_j \, x_i \tag{16} $$
This is what we wanted at the beginning, except for a positive factor $\eta$ that multiplies the expression on the right. We have thus shown that the delta rule follows the direction opposite to the error gradient G.

4.1 Gradient descent algorithm

The algorithm exploits the delta rule to train a network:
1. Initialise all weights to small random values
2. Repeat until DONE:
   (a) For each weight set $\Delta w_{ij} = 0$
   (b) For each training pair < x, t >:
      i. Produce the output $y_j$
      ii. For each weight $w_{ij}$ compute $\Delta w_{ij} \leftarrow \Delta w_{ij} + (t_j - y_j) \, y_i$ (NOTE: $y_i = x_i$, the i-th input of neuron j)
   (c) For each weight $w_{ij}$ set $w_{ij} \leftarrow w_{ij} + \eta \, \Delta w_{ij}$
   (d) Termination condition: error gradient $G = 0$ or $|G| < \varepsilon$
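The steps above can be sketched for a single linear neuron as follows. The function name, the default values of `eta`, `eps` and `max_epochs`, and the sample data are assumptions made for this illustration; the weights start at zero rather than at small random values so that the run is repeatable.

```python
def batch_gradient_descent(samples, eta=0.05, eps=1e-6, max_epochs=10000):
    """Batch delta rule for one linear neuron. Returns (weights, epochs used)."""
    n_inputs = len(samples[0][0])
    w = [0.0] * n_inputs                      # assumption: fixed start
    for epoch in range(max_epochs):
        dw = [0.0] * n_inputs                 # step (a)
        for x, t in samples:                  # step (b)
            y = sum(wi * xi for wi, xi in zip(w, x))
            for i in range(n_inputs):
                dw[i] += (t - y) * x[i]
        w = [wi + eta * dwi for wi, dwi in zip(w, dw)]   # step (c)
        # step (d): dw is the negative gradient, so its norm is |G|
        if sum(d * d for d in dw) ** 0.5 < eps:
            break
    return w, epoch + 1

# Hypothetical data: the first input is a constant bias input equal to 1,
# the targets follow t = 1 + 2*x
samples = [([1.0, 0.0], 1.0), ([1.0, 1.0], 3.0),
           ([1.0, 2.0], 5.0), ([1.0, 3.0], 7.0)]
weights, epochs_used = batch_gradient_descent(samples)
print(weights, epochs_used)
```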
The learning rate η The first difficulty in the algorithm is deciding how large the steps along the line of steepest descent from the current point should be. Large steps may converge quickly, but may also overstep the solution or, if the error surface is very eccentric, go off in the wrong direction (bouncing from one side of a valley to the other). On the other hand, very small steps go in the correct direction but require many iterations. The step size can also be lowered over time as the algorithm proceeds.
Local minima problem Error surfaces may have several local minima that are not global minima.
Batch vs On-line learning In batch learning we accumulate the gradient contributions of all the samples in the training set before updating the weights. In online (pattern) learning the weights are updated immediately after each training sample. Although it does not comply with the theoretical algorithm, online learning has some advantages:
• It is often much faster with redundant training sets
• It can be used with no fixed training set (with new data coming in continuously)
• It is better at tracking non-stationary environments (where the best model gradually changes over time)
• The noise in the gradient can help to escape from local minima, as long as they are not too deep.
Momentum The method most frequently used to escape from local minima is to add a momentum term, weighted by a coefficient m found by trial and error:
$$ \Delta w_{ij}(n + 1) = \eta \, \delta_j \, x_i + m \, \Delta w_{ij}(n) \tag{17} $$
where $\Delta w_{ij}(n)$ is the change made for the previous training sample.
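A single momentum update can be sketched as below. The function name and the default values of `eta` and `m` are assumptions chosen for the illustration; the notes only say that m is found by trial and error.

```python
def momentum_step(w, prev_dw, delta, x, eta=0.1, m=0.9):
    """Apply equation (17) once to the weight vector of one neuron.

    w       current weights
    prev_dw weight changes made for the previous training sample
    delta   error term of the neuron for the current sample
    x       inputs of the neuron for the current sample
    """
    dw = [eta * delta * xi + m * pdw for xi, pdw in zip(x, prev_dw)]
    new_w = [wi + dwi for wi, dwi in zip(w, dw)]
    return new_w, dw

w, dw = [0.0, 0.0], [0.0, 0.0]
for _ in range(2):  # the second step is larger because of the momentum term
    w, dw = momentum_step(w, dw, delta=1.0, x=[1.0, 2.0], eta=0.1, m=0.5)
    print(w, dw)
```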

5 Perceptron
A perceptron is a single neuron with a non-linear activation function (usually the sign function or the unit step function).
$$ y_j = \begin{cases} 0 & a < \theta \\ 1 & a \ge \theta \end{cases} \tag{18} $$
It is used for binary classification: fed with examples of classes C1 and C2, the perceptron can be trained so that it classifies the training examples correctly.
5.1 Training a perceptron

The training is done simply by applying the delta rule. We try to find the hyper-plane that separates the examples of the two classes. From a geometrical point of view, with two inputs this is the line on which the weighted sum computed by our neuron is equal to zero. If the weighted sum is above zero we classify the sample as C1, otherwise as C2.
5.2 Geometrical properties

All the points on the decision boundary have the same inner product with the weight vector (hence the same projection onto the weight vector), so they lie on a line orthogonal to the weight vector.
Quick recall Scalar product $x \times y = \sum_i x_i \, y_i$, or also $x \times y = |x| \, |y| \cos(\theta)$, where $\theta$ is the angle between x and y. If the scalar product is equal to zero, the two vectors are orthogonal. The scalar product also represents the projection of one vector onto the other: if two vectors have the same scalar product with a third vector, their projections onto it are the same. For all the points of the decision boundary this can happen only if the boundary is orthogonal to the weight vector.
This is the core of the perceptron: changing the weights during training changes, as a consequence, the decision boundary.
5.3 Learning algorithm

The problem is to adjust the weights to obtain $\mathrm{sign}(X \times W) = t$ when provided with the training pair < X, t >. The algorithm is the following:
1. Initialise the weights to small random values
2. Choose a learning rate $\eta$
3. Until DONE:
   (a) For each training pattern < X, t >:
   (b) Compute $y = \mathrm{sign}(X \times W)$
   (c) IF y == t don't change the weights
   (d) ELSE change the weights using the delta rule
Termination condition The algorithm stops when the weights do not change for a certain number of consecutive iterations. If the problem is linearly separable, the algorithm always converges in a finite number of steps.
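The learning algorithm can be sketched as follows. Several details are assumptions made for this illustration: the targets are coded as -1 and +1, the bias is handled as a constant first input equal to 1, the weights start at zero so that the run is repeatable, and training stops after one full pass without weight changes. The logical AND data set is just a convenient linearly separable example.

```python
def sign(a):
    return 1 if a >= 0 else -1

def train_perceptron(patterns, eta=0.5, max_epochs=100):
    """Perceptron learning with the delta rule on +1/-1 targets."""
    w = [0.0] * len(patterns[0][0])   # assumption: zeros, not random values
    for _ in range(max_epochs):
        changed = False
        for x, t in patterns:
            y = sign(sum(wi * xi for wi, xi in zip(w, x)))
            if y != t:                # weights change only on a mistake
                w = [wi + eta * (t - y) * xi for wi, xi in zip(w, x)]
                changed = True
        if not changed:               # termination: a full pass with no change
            break
    return w

# Logical AND; each pattern is ([bias input, x1, x2], target)
patterns = [([1, 0, 0], -1), ([1, 0, 1], -1), ([1, 1, 0], -1), ([1, 1, 1], 1)]
w = train_perceptron(patterns)
for x, t in patterns:
    print(x[1:], t, sign(sum(wi * xi for wi, xi in zip(w, x))))
```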
5.4 Multi-neuron perceptron

A network with a single perceptron is limited to binary classification. If we use n neurons we are able to separate a problem into $2^n$ classes (always under the linear separability assumption).

Example of non linearly-separable problem We want to solve the XOR problem, dividing the inputs of the network into the two classes $x_1 \oplus x_2 = 0$ and $x_1 \oplus x_2 = 1$. This problem is clearly not linearly separable, but with a two-step approach we can create an internal representation of the problem inside the network that is linearly separable.
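One possible two-step solution is sketched below. The weights and thresholds are hand-picked for the illustration (they are not learned, and they are not the only valid choice): two hidden step units compute OR and AND of the inputs, and in the space of these two hidden outputs the XOR classes become linearly separable.

```python
def step(a):
    return 1 if a >= 0 else 0

def xor_two_step(x1, x2):
    # Step 1: internal representation built by two hidden threshold units
    h_or = step(x1 + x2 - 0.5)    # fires when at least one input is 1
    h_and = step(x1 + x2 - 1.5)   # fires only when both inputs are 1
    # Step 2: a single threshold unit separates the hidden representation
    return step(h_or - h_and - 0.5)

for x1 in (0, 1):
    for x2 in (0, 1):
        print(x1, x2, xor_two_step(x1, x2))
```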
5.5 Hidden layer multiperceptron

A multi-layered perceptron that includes 1 hidden layer with n neurons can model regions with a number of sides $l \le n$. If we add more layers, we can combine multiple regions, which makes it possible to solve any kind of classification problem. In practice, single hidden layer perceptrons can solve the majority of problems.
Function fitting with hidden-layers This consideration also applies to function fitting. If we take sigmoidal (s-shaped) or radial functions as universal basis functions and combine them in multiple-neuron hidden-layered networks, we are able to fit any kind of function.
Limitations Although they are a powerful tool, multi-layered networks without non-linear activation functions can only solve problems that involve linear regions. A multi-layered network with only linear activation functions can therefore solve only the problems that a single-layered network can solve.

6 Back-propagation algorithm
When we add more layers to a network we must reconsider our learning algorithm, so the delta rule must be generalised:
$$ \frac{dE}{dw_{ij}} = \frac{dE}{do_j} \cdot \frac{do_j}{dnet_j} \cdot \frac{dnet_j}{dw_{ij}} \tag{19} $$
where $net_j$ is the weighted sum of the inputs of unit j, that is its output before the activation function is applied:
$$ net_j = \sum_i x_i \, w_{ij} \tag{20} $$
In this case we define:
$$ \frac{dE}{do_j} \cdot \frac{do_j}{dnet_j} = -\delta_j \tag{21} $$
NOTE: in the case of the single-layered network with linear output neurons considered before, this definition is equal to the one already given, since $net_j = o_j$. The first term is:
$$ \frac{dE}{do_j} = -(t_j - o_j) \tag{22} $$
Given the activation function $f(net_j) = o_j$, the second term is:
$$ \frac{do_j}{dnet_j} = f'(net_j) \tag{23} $$
Thus we obtain:
$$ \delta_j = f'(net_j) \, (t_j - o_j) \tag{24} $$
At the end we obtain:
$$ \frac{dE}{dw_{ij}} = -\delta_j \, x_i \tag{25} $$
since $\frac{dnet_j}{dw_{ij}} = x_i$ from the previous definition of $net_j$.
Again we have obtained our delta rule:
$$ \Delta w_{ij} = \eta \, \delta_j \, x_i \tag{26} $$
Hidden unit weights For a generic hidden unit j, the error is computed from the errors back-propagated by the successive layer, each in proportion to the connecting weight:
$$ \delta_j = f'(net_j) \sum_s \delta_s \, w_{js} \tag{27} $$
6.1 The learning algorithm

So the overall learning algorithm is:
1. Initialise the weights to small random values
2. For each training sample < X, t >:
   (a) Calculate the outputs of all neurons $o_j = f(net_j)$
   (b) For each output neuron calculate $\delta_j = (t_j - o_j) \, f'(net_j)$
   (c) For each hidden neuron calculate $\delta_j = f'(net_j) \sum_s \delta_s \, w_{js}$
   (d) Then, update all weights using the delta rule
   (e) Compute the error E; if $E = 0$ or $|E| \le \varepsilon$ terminate.
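The algorithm can be sketched for a network with one hidden layer of logistic units. This is an illustration under several assumptions: the function names, the number of hidden units, the learning rate, the weight range and the seed are invented for the example, each neuron gets a bias weight, and training runs for a fixed number of epochs instead of testing the error as in step (e). Back-propagation is not guaranteed to reach a good solution from every starting point.

```python
import math
import random

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def train_mlp(samples, n_hidden=3, eta=0.5, epochs=2000, seed=0):
    """Online back-propagation; returns a predict(x) function."""
    rng = random.Random(seed)
    n_in, n_out = len(samples[0][0]), len(samples[0][1])
    # Each weight row ends with the bias weight of that neuron
    w_h = [[rng.uniform(-0.5, 0.5) for _ in range(n_in + 1)] for _ in range(n_hidden)]
    w_o = [[rng.uniform(-0.5, 0.5) for _ in range(n_hidden + 1)] for _ in range(n_out)]

    def forward(x):
        xb = list(x) + [1.0]                       # inputs plus bias input
        h = [sigmoid(sum(w * v for w, v in zip(row, xb))) for row in w_h]
        hb = h + [1.0]                             # hidden outputs plus bias
        o = [sigmoid(sum(w * v for w, v in zip(row, hb))) for row in w_o]
        return xb, hb, o

    for _ in range(epochs):
        for x, t in samples:
            xb, hb, o = forward(x)                 # step (a)
            # step (b): for the logistic function f'(net) = o * (1 - o)
            d_o = [(tj - oj) * oj * (1.0 - oj) for tj, oj in zip(t, o)]
            # step (c): errors back-propagated through the output weights
            d_h = [hb[j] * (1.0 - hb[j]) * sum(d_o[s] * w_o[s][j] for s in range(n_out))
                   for j in range(n_hidden)]
            # step (d): delta rule on every weight
            for s in range(n_out):
                for j in range(n_hidden + 1):
                    w_o[s][j] += eta * d_o[s] * hb[j]
            for j in range(n_hidden):
                for i in range(n_in + 1):
                    w_h[j][i] += eta * d_h[j] * xb[i]

    return lambda x: forward(x)[2]

def total_error(predict, samples):
    """Sum over the patterns of equation (8)."""
    return sum(0.5 * (tj - oj) ** 2
               for x, t in samples for tj, oj in zip(t, predict(x)))

xor = [([0.0, 0.0], [0.0]), ([0.0, 1.0], [1.0]),
       ([1.0, 0.0], [1.0]), ([1.0, 1.0], [0.0])]
predict = train_mlp(xor, epochs=5000)
print(total_error(predict, xor))
```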

7 Build a training set

The crucial step in training is the creation of the training set. This includes gathering raw data, analysing them, selecting variables, and pre-processing the data so that the network can learn efficiently.
We have to choose the most important variables, those showing a strong correlation between input and output. We also have to check each variable for possible outliers.
Usually data are pre-processed and transformed into a numeric form suitable for the neural network. In particular, both inputs and targets may be normalised:
• Centring: subtract the average of the variable
• Scaling: divide by the standard deviation of the variable
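Centring and scaling of one variable can be sketched as below. The function name is an assumption, and the population standard deviation is used; the sample standard deviation would be an equally reasonable choice.

```python
import statistics

def standardise(values):
    """Centre on the mean and scale by the standard deviation."""
    mean = statistics.mean(values)
    std = statistics.pstdev(values)  # population standard deviation
    # Note: a constant variable has std == 0 and cannot be scaled this way
    return [(v - mean) / std for v in values]

z = standardise([2.0, 4.0, 6.0, 8.0])
print(z)  # the result has zero mean and unit standard deviation
```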
There may be missing data or non-numeric data:
• Missing values of a given variable can be substituted by its mean value, or other statistic of that variable
across the other available training cases
• Non-numeric data are either converted to numeric form or discarded.
How many training examples do we need? We must provide a representative sample of the data the network will process in the finished application. Larger training sets reduce the risk of under-sampling: if the set is too small, noisy or skewed, the network can learn it perfectly but fail in the final application. The required size depends on several factors:
• Network size
• Testing needs
• Input and target distribution
The first factor matters the most: a big network usually needs more training data than a small one (on average 5-10 training patterns for each weight).
7.1 When to stop?

Typically the training process stops when:
• A given number of epochs elapses
• The error reaches an acceptable level
• The error stops improving
7.2 N-fold cross-validation

It is a training technique in which the data set is divided into N folds (typically 10) of approximately the same size. Then a certain portion of the folds is randomly sub-sampled for training and the remaining portion is used for testing. At the end we obtain a certain number of weight sets, and we choose the one that best satisfies our requirements.
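The splitting step can be sketched as follows. The sketch assumes the common variant in which each fold is held out for testing exactly once while the others are used for training; the function name and the seed are invented for the example.

```python
import random

def n_fold_splits(data, n_folds=10, seed=0):
    """Yield (train, test) pairs, holding out one fold at a time."""
    items = list(data)
    random.Random(seed).shuffle(items)
    # Fold sizes differ by at most one element
    folds = [items[i::n_folds] for i in range(n_folds)]
    for k in range(n_folds):
        test = folds[k]
        train = [item for j, fold in enumerate(folds) if j != k for item in fold]
        yield train, test

for train, test in n_fold_splits(range(25), n_folds=5):
    print(len(train), len(test))
```

A network would be trained on each `train` part and evaluated on the matching `test` part, giving one weight set per split.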
7.3 Over-training
We must pay attention to so-called over-training (over-learning) to prevent inappropriate memorisation of the input data. Thus we need to interleave testing and training. Testing during training shows when to end training to prevent over-training; this is called validation.
If the same training samples are repeated in a different order in different epochs, the network will answer the training set perfectly but will not generalise what it learns. For this reason, the testing phase interleaved with training is different from the one applied at the end of the training.

Validation set We add another set, called the validation set, used during training to check that the network is not over-learning.
Graphical interpretation If we plot the error on the validation set and compare it with the error on the training set, we can recognise over-training at time t, where the validation error starts to grow after a steady decrease. At time t we must stop the training process to avoid over-training (t corresponds to the minimum error on the validation set). [IMG]
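The stopping time t can be located with a few lines of code. In this sketch the function name and the `patience` parameter (how many epochs without improvement we tolerate before stopping) are assumptions, and the error values are invented for the illustration.

```python
def stop_epoch(validation_errors, patience=3):
    """Return the epoch with the lowest validation error seen before the
    error failed to improve for `patience` consecutive epochs."""
    best_epoch, best_error = 0, float("inf")
    for epoch, err in enumerate(validation_errors):
        if err < best_error:
            best_epoch, best_error = epoch, err
        elif epoch - best_epoch >= patience:
            break   # the validation error has started to grow: stop training
    return best_epoch

# Hypothetical validation errors measured after each training epoch
errors = [0.9, 0.7, 0.5, 0.4, 0.45, 0.5, 0.6, 0.3]
print(stop_epoch(errors))  # → 3
```

In a real training loop the weights saved at the returned epoch would be the ones kept.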

8 Radial functions
Radial functions are a special class of functions whose response typically decreases monotonically with the distance from a central point. The output strength depends on the distance from the central point.
$$ h : \mathbb{R}^n \to \mathbb{R}, \qquad h(x) = f(\|x - c\|) \tag{28} $$
where c is the central point of the function. We must pay attention to the value we assign to this parameter. Typical radial functions are:
1. Gaussian: $h(x) = e^{-\frac{(x - c)^2}{2\sigma}}$
2. Inverse multiquadric: resembles a Gaussian function, but approaches zero polynomially instead of exponentially.
3. Multiquadric
In Gaussians $\sigma$ is called "spread" and it represents the selectivity of our activation function.
8.1 RBF networks

Cover’s theorem Given a set of training data that is not linearly separable, one can, with high probability,
transform it into a training set that is linearly separable by projecting it into a higher-dimensional space via some
non-linear transformations.
Structure In RBF networks, the weights connected to the hidden layer represent the centres of the radial functions in the hidden units, so for a generic hidden unit j the output is:
$$ net_j = \|x - w_j\| \tag{29} $$
where $\|\cdot\|$ is a particular distance norm.
The output of the output layer is instead a classical weighted sum, representing a linear combination of radial functions.
Recall "We can always approximate an arbitrary function, whatever its complexity, with a sum of radial basis functions."
Selectivity meaning The concept of selectivity (by means of $\sigma$ for example) is linked with the notion of receptive fields. The new space generated by the hidden units in the mapping operation with the distance norm is thus called feature space.
8.2 RBF learning algorithms

8.2.1 Algorithm 1
The first algorithm to perform learning in RBF networks is the following:
1. Randomly choose the N centres $c_i$ of the RBFs
2. Choose the spread by normalising the maximum distance between the centres chosen before:
$$ \sigma = \frac{\max_{i \ne j} \|c_i - c_j\|}{\sqrt{N}} \tag{30} $$
3. Find the weights using the pseudo-inverse method

Pseudo-inverse method This method is based on the notion of pseudo-inverse matrix: "For any matrix, including square matrices that are singular, there exists a pseudo-inverse matrix that has the same properties as the inverse matrix of a non-singular matrix".
Let's define the activation function (with $t_i$ the centre of the i-th radial function):
$$ \varphi_i(\|x - t_i\|) = e^{-\frac{\|x - t_i\|^2}{2\sigma}} \tag{31} $$
The output, given the training pair < x_k, d_k >, is:
$$ y(x_k) = \sum_{j=1}^{M} w_j \, \varphi_j(\|x_k - c_j\|) = d_k, \qquad \Phi \, [w_1 \dots w_M]^T = \Phi \, w = [d_1 \dots d_N]^T = d \tag{32} $$
where $\Phi$ is the following N × M matrix:
$$ \Phi = \begin{bmatrix} \varphi_{1,1} & \cdots & \varphi_{1,M} \\ \vdots & \ddots & \vdots \\ \varphi_{N,1} & \cdots & \varphi_{N,M} \end{bmatrix} \tag{33} $$
Now, since we are able to compute the pseudo-inverse matrix $\Phi^+$, which is an M × N matrix, we can compute the new weights from the expected output:
$$ w = \Phi^+ \times d \tag{34} $$
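The three steps of Algorithm 1 can be sketched in pure Python. Several choices here are assumptions made for the illustration: the centres are drawn at random among the training inputs, the Gaussian uses the exponent written in the notes (the more common convention divides by 2σ² instead of 2σ), and the pseudo-inverse is computed as (ΦᵀΦ)⁻¹Φᵀ, which is valid only when Φ has full column rank. All function names are hypothetical.

```python
import math
import random

def solve(a, b):
    """Solve the square system a x = b (Gaussian elimination, partial pivoting)."""
    n = len(b)
    m = [row[:] + [b[i]] for i, row in enumerate(a)]
    for col in range(n):
        pivot = max(range(col, n), key=lambda r: abs(m[r][col]))
        m[col], m[pivot] = m[pivot], m[col]
        for r in range(col + 1, n):
            factor = m[r][col] / m[col][col]
            for c in range(col, n + 1):
                m[r][c] -= factor * m[col][c]
    x = [0.0] * n
    for i in reversed(range(n)):
        x[i] = (m[i][n] - sum(m[i][j] * x[j] for j in range(i + 1, n))) / m[i][i]
    return x

def dist(u, v):
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(u, v)))

def train_rbf(xs, ds, n_centres, seed=0):
    """Return a predict(x) function for a Gaussian RBF network."""
    centres = random.Random(seed).sample(xs, n_centres)        # step 1
    d_max = max(dist(a, b) for a in centres for b in centres)
    sigma = d_max / math.sqrt(n_centres)                       # step 2, eq. (30)

    def phi(x, c):                                             # eq. (31)
        return math.exp(-dist(x, c) ** 2 / (2.0 * sigma))

    big_phi = [[phi(x, c) for c in centres] for x in xs]       # N x M matrix
    # Step 3: w = Phi^+ d, obtained by solving (Phi^T Phi) w = Phi^T d
    m = n_centres
    a = [[sum(row[i] * row[j] for row in big_phi) for j in range(m)] for i in range(m)]
    b = [sum(row[i] * d for row, d in zip(big_phi, ds)) for i in range(m)]
    w = solve(a, b)
    return lambda x: sum(wj * phi(x, c) for wj, c in zip(w, centres))

# XOR with one centre per training point
xs = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0), (1.0, 1.0)]
ds = [0.0, 1.0, 1.0, 0.0]
predict = train_rbf(xs, ds, n_centres=4)
print([round(predict(x), 6) for x in xs])
```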
8.2.2 Algorithm 2
It is a slight variation of the previous one. Instead of choosing the centres randomly, we place them in regions of the input space where significant data is present (discovered by a clustering algorithm).
Spreads are again chosen by normalisation, while the weights to the output layer are computed using the Least Mean Square supervised learning algorithm. Here, the output of the hidden units is provided as input to the LMS algorithm.
8.2.3 Algorithm 3
Here we also supervise the selection of the centres, using gradient descent to train all three parameters. At the end we obtain three update rules.
Centres update rule We use a different training rate for each centre
$$ \Delta c_j = -\eta_{c_j} \, \frac{dE}{dc_j} \tag{35} $$
Spreads update rule We can use only one spread thanks to normalisation of the spreads
$$ \Delta \sigma_j = \Delta \sigma = -\eta_\sigma \, \frac{dE}{d\sigma} \tag{36} $$
Weights update rule Weights are updated in the traditional delta rule way:
$$ \Delta w_{ij} = -\eta_{ij} \, \frac{dE}{dw_{ij}} \tag{37} $$
8.3 Multi-layer perceptron (MLP) vs RBF networks

Architecture An RBF network (in its most basic form) has a single hidden layer, while an MLP may have one or more hidden layers.

Neuron model The hidden layer of an RBF network is nonlinear, whereas the output layer is typically linear. The hidden and output layers of an MLP used as a classifier are instead usually all nonlinear (although for some nonlinear regression problems a linear layer is placed at the output).
Activation function The argument of the activation function of each hidden unit in an RBF network is the Euclidean distance between the input and the centre of the unit. The argument of the activation function of each hidden unit in an MLP is instead the weighted sum of the input vector with the unit's weights.
Approximation MLP networks construct global approximations to a nonlinear input-output mapping. As a consequence, they are capable of generalising in regions of the input space where few or no training data are available.
RBF networks instead use exponentially decaying localised non-linearities (e.g., Gaussian functions) and construct local approximations of a nonlinear input-output mapping. As a result, these networks learn faster and are less sensitive to the order of presentation of the training data.
In many cases, however, we find that, in order to represent a mapping to some desired degree of smoothness, the number of radial-basis functions required to span the input space adequately may be very large.

9 Unsupervised and Competitive learning

So far, with supervised learning, we provided a set of training samples of proper network behaviour. Now we introduce a new type of learning, called unsupervised learning.
Clustering With unsupervised learning we base our computations on the clustering of input data. No a-priori knowledge is available regarding an input's membership of a particular class. The network classifies input vectors into one of a given number of categories, according to the clusters detected in the training set.
9.1 Winner takes all network

[IMG] A WTA network has N input neurons, P output neurons (one for each cluster we want to classify our data in), and typically only one layer (the output one).
The output of each neuron is the scalar product of its weight vector and the input vector. We normalise all the weight vectors to have the same length, so that they lie on a hyper-sphere of radius 1.
The winning neuron is the one whose weight vector is most similar to the input vector. As always, we start from a situation in which all the weights are chosen randomly. For each training sample, we adjust the weights of the winning neuron to make them more similar to the training sample, in order to increase its probability of winning when a similar input is presented to the network later on:
$$ w(n + 1) = w(n) + \alpha \, [x - w(n)], \qquad \alpha \in \, ]0, 1[ \tag{38} $$
When learning is completed, each neuron has a position in the input space determined by its final weights. The final positions are called centroids of the data clusters. These centroids can be considered as the prototypes of the clusters, since they represent the key features of a cluster.
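One training step of a WTA network can be sketched as below. The function names and the value of `alpha` are assumptions; re-normalising the winner after every update is also an assumption, made here to keep the weight vectors on the unit hyper-sphere.

```python
import math

def normalise(v):
    length = math.sqrt(sum(c * c for c in v))
    return [c / length for c in v]

def wta_step(weights, x, alpha=0.1):
    """Update the winning neuron in place and return its index."""
    # Output of each neuron: scalar product of its weights and the input
    outputs = [sum(wi * xi for wi, xi in zip(w, x)) for w in weights]
    winner = outputs.index(max(outputs))
    w = weights[winner]
    moved = [wi + alpha * (xi - wi) for wi, xi in zip(w, x)]   # equation (38)
    weights[winner] = normalise(moved)
    return winner

weights = [[1.0, 0.0], [0.0, 1.0]]     # two neurons, already of unit length
x = normalise([1.0, 0.2])
print(wta_step(weights, x, alpha=0.5), weights)
```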
Limitations A simple WTA network can perform single-layer clustering analysis, hence the clusters must be linearly separable by hyper-planes passing through the origin of the space. We also need to specify the number of clusters in advance.
Also, there is no spatial relationship between the clusters identified by the neurons.
9.2 Self-organising maps (SOM)

We can modify the winner-takes-all network by including lateral feedback connections between neurons. These connections produce excitatory or inhibitory feedback depending on the distance from the excited neuron. The function describing this feedback is called Mexican Hat. [IMG]
9.2.1 Structure
SOMs usually consist of a one- or two-dimensional array of identical neurons. The input vector is broadcast in parallel to all these neurons, and for each input vector the most responsive neuron fires, as in a WTA network. The weights of the firing neuron and of its neighbours are updated accordingly.
The neighbourhood radius is reduced over time, and the weights are changed in proportion to the closeness to the neighbourhood centre.
9.2.2 Differences from WTAs

While competitive WTAs only distribute neurons to recognise frequently presented inputs, SOMs learn to clas-
sify input vectors according to how they are grouped in the input space, hence learning both distribution and
topology of input vectors (since they introduce the concept of neighbourhood).

10 Clustering algorithms
10.1 Introduction
Clustering is the most important unsupervised learning problem: its aim is to find a structure in a collection of unlabelled data.
Cluster A cluster is a set of objects that are "similar" to each other and "dissimilar" from the objects belonging to other clusters.
10.1.1 Requirements
A clustering algorithm must fulfil these requirements:
Scalability Performance must not degrade when the volume of input data increases.
Flexibility It must deal with different types of attributes.
Reduced knowledge The application domain knowledge required must be as small as possible.
Noise tolerance The algorithm must be robust to noise and outliers.
Order Independence The order of input records must not influence the algorithm behaviour.
Interpretability and usability The output of the algorithm must be usable by other software and interpretable by both humans and machines.
10.1.2 Problems
The main problems of most currently developed algorithms are:
Requirement fulfilment Typically an algorithm fails to fulfil all the requirements expressed above.
Poor scalability Large dimension and volume of input data typically cause very high time complexity.
Distance dependent The output of the algorithm is strictly dependent on the definition of distance used to cluster samples. Moreover, there is no obvious distance definition, and it may also depend on application requirements.
Ambiguity Often, the output of the algorithm can be interpreted in different ways.
10.2 Types of clustering algorithms

We can define three main clustering algorithm families:
• Exclusive clustering
• Fuzzy (overlapping) clustering
• Hierarchical clustering
10.2.1 Exclusive clustering

Data are clustered in an exclusive way, so that if a certain datum belongs to a particular cluster, it cannot belong to any other one. For this family we consider the K-Means algorithm.

10.2.2 Fuzzy (overlapping) clustering

Here we use fuzzy sets to measure the membership degree of a datum to several different clusters. For this family
we consider the Fuzzy C-Means.
10.2.3 Hierarchical clustering

This is based on the union of the two nearest clusters, in a recursive manner. At the beginning every datum is a cluster, and on each step the nearest clusters are merged into a new one.
10.3 K-Means clustering algorithm

This algorithm solves the problem of dividing a given data set into k (fixed) clusters. The main idea is to initially define k random (or cunningly chosen) centroids and associate each data point to the nearest centroid (according to some distance metric). Once this early grouping is done, new centroids are computed by simply finding the barycenter of each cluster found at the previous step.
The objective function The aim of this algorithm is to minimise the following quantity:
$$ J = \sum_{j=1}^{k} \sum_{i=1}^{n} \| x_i^{(j)} - c_j \|^2 \tag{39} $$
where $c_j$ is the centroid of cluster j and n is the number of input points.
10.3.1 Algorithm
More formally the algorithm is the following:
1. Place K points into the space of the objects to be clustered. These points are the initial centroids.
2. Assign each object to the group that has the closest centroid.
3. When all objects have been assigned, recalculate the positions of the K centroids by means of the barycenter operator.
4. Repeat steps 2 and 3 until the centroids do not move, or their movement is under a certain $\varepsilon$ (according to the distance metric chosen for the algorithm).
Since this algorithm is a greedy one, it does not necessarily produce the optimal configuration, corresponding to the global minimum of the objective function shown before.
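The four steps can be sketched as follows. The function names, the use of the squared Euclidean distance, the `eps` and `max_iter` defaults and the sample points are assumptions made for this illustration. The initial centroids (step 1) are passed in by the caller, and an empty cluster simply keeps its old centroid.

```python
def dist2(u, v):
    """Squared Euclidean distance."""
    return sum((a - b) ** 2 for a, b in zip(u, v))

def k_means(points, centroids, eps=1e-9, max_iter=100):
    """Return (final centroids, clusters as lists of points)."""
    centroids = [list(c) for c in centroids]                  # step 1
    for _ in range(max_iter):
        clusters = [[] for _ in centroids]
        for p in points:                                      # step 2
            nearest = min(range(len(centroids)),
                          key=lambda j: dist2(p, centroids[j]))
            clusters[nearest].append(p)
        new = []
        for c, members in zip(centroids, clusters):           # step 3
            if members:
                new.append([sum(col) / len(members) for col in zip(*members)])
            else:
                new.append(c)      # empty cluster: centroid cannot be updated
        moved = max(dist2(a, b) for a, b in zip(centroids, new))
        centroids = new
        if moved <= eps:                                      # step 4
            break
    return centroids, clusters

points = [(0.0, 0.0), (0.0, 1.0), (1.0, 0.0),
          (10.0, 10.0), (10.0, 11.0), (11.0, 10.0)]
centroids, clusters = k_means(points, [(0.0, 0.0), (10.0, 10.0)])
print(centroids)
```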
10.3.2 Remarks
This algorithm has the following main characteristics:
• The way to initialise the means is not specified in any way (typically we choose them randomly but well
spaced)
• The final result highly depends on the initial centroids.
• It may happen that the set of samples closest to a certain centroid $m_i$ is empty, so $m_i$ cannot be updated.
• As already said, the result depends on the chosen distance metric
• Finally, since k is fixed, the result depends on it.
10.4 Fuzzy C-means algorithm

In this algorithm a datum $d_i$ can belong to a cluster j with a certain degree of membership $u_{ij} \in [0, 1]$. The objective function to minimise in this algorithm is:
$$ J_m = \sum_{i=1}^{N} \sum_{j=1}^{C} u_{ij}^m \, \| x_i - c_j \|^2 \tag{40} $$
where $m \in \mathbb{R}$, $m \ge 1$.

10.4.1 The algorithm

The fuzzy partitioning algorithm is the following:
1. Initialise U^{(0)} ∈ R^{N×C}
2. Until STOP, for each step k:
(a) Compute the centroids c_j ∈ C^{(k)} as:
c_j = \frac{\sum_{i=1}^{N} u_{ij}^m x_i}{\sum_{i=1}^{N} u_{ij}^m}    (41)
(b) Compute the new memberships u_{ij} ∈ U^{(k+1)} as:
u_{ij} = \left[ \sum_{k=1}^{C} \left( \frac{\| x_i - c_j \|}{\| x_i - c_k \|} \right)^{\frac{2}{m-1}} \right]^{-1}    (42)
(c) If \| U^{(k+1)} - U^{(k)} \| ≤ ε then STOP
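A minimal sketch of the iteration above, in plain Python. The name `fuzzy_c_means`, the max-absolute-difference norm used for the stopping test, the handling of a point that coincides with a centroid and the one-dimensional toy data are assumptions made for illustration.

```python
import math


def fuzzy_c_means(points, U, m=2.0, eps=1e-6, max_iter=200):
    """Iterate updates (41) and (42) from an initial N x C membership matrix U."""
    n, c = len(points), len(U[0])
    dim = len(points[0])
    centroids = []
    for _ in range(max_iter):
        # (a) centroids as membership-weighted means, equation (41)
        centroids = []
        for j in range(c):
            w = [U[i][j] ** m for i in range(n)]
            total = sum(w)
            centroids.append(tuple(
                sum(w[i] * points[i][d] for i in range(n)) / total
                for d in range(dim)))
        # (b) new memberships, equation (42)
        new_U = []
        for i in range(n):
            dists = [math.dist(points[i], cj) for cj in centroids]
            if min(dists) == 0.0:
                # Assumed convention: a point lying on a centroid belongs to it fully.
                hit = dists.index(0.0)
                row = [1.0 if j == hit else 0.0 for j in range(c)]
            else:
                row = [1.0 / sum((dists[j] / dk) ** (2.0 / (m - 1.0)) for dk in dists)
                       for j in range(c)]
            new_U.append(row)
        # (c) stop when the largest membership change is at most eps
        change = max(abs(new_U[i][j] - U[i][j]) for i in range(n) for j in range(c))
        U = new_U
        if change <= eps:
            break
    return centroids, U


data = [(0.0,), (1.0,), (9.0,), (10.0,)]
U0 = [[0.9, 0.1], [0.8, 0.2], [0.2, 0.8], [0.1, 0.9]]
centroids, U = fuzzy_c_means(data, U0)
print(centroids)
```

Each row of the returned matrix sums to 1, which follows directly from the form of (42).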
10.5 Hierarchical clustering

This algorithm works with a set of N elements and a distance matrix D ∈ R^{N×N}. The basic idea is the following:
1. Start by assigning each item to its own cluster, thus obtaining N initial clusters.
2. Find the closest (according to some metric - more on this later on) pair of clusters and merge them into
a single one.
3. Compute the distance matrix D again with the new cluster.
4. Repeat steps 2 and 3 until you get only one cluster of size N.
10.5.1 Types of distance metrics

We can define three different distance metrics between clusters:
• Single-linkage clustering (a.k.a. connectedness or minimum method): we consider the distance between
two clusters equal to the shortest distance (or greatest similarity) between any member of one cluster and any member of the other.
• Complete-linkage clustering (a.k.a. diameter or maximum method): we consider the distance between two
clusters equal to the greatest distance (or smallest similarity) between any member of one cluster and any member of the other.
• Average-linkage clustering: we consider the distance between two clusters equal to the
average distance between any member of one cluster and any member of the other.
This approach is generally called agglomerative clustering (there also exists a divisive one, which starts with all the elements in the same cluster).
10.5.2 Single-linkage algorithm

Suppose we have distances d(i, j) collected in D ∈ R^{N×N}. Clustering steps are numbered m = 0, 1, ..., N − 1 and L(k) is the level of the
k-th clustering step. The distance between two clusters (a) and (b) is written d[(a), (b)]. The algorithm is the following:
1. Start with the disjoint clustering, with L(0) = 0 and m = 0.
2. Find the least dissimilar (or nearest) pair of clusters (r), (s) in the current clustering, according to:
d[(r), (s)] = \min_{(i) \neq (j)} d[(i), (j)]    (43)
3. Set m = m + 1 and merge clusters (r) and (s) into a single one, which forms clustering m. Set the level L(m) = d[(r), (s)].
4. Delete the rows and columns corresponding to the merged clusters from D and add a row and a column for the new cluster, with the
distances updated as follows (recall that this is single-linkage):
d[(k), (r, s)] = \min \{ d[(k), (r)], d[(k), (s)] \}    (44)
5. If all objects are in one cluster STOP, otherwise start again from 2.
Of course you may not want only one cluster at the end. To obtain k clusters you have to cut the k − 1
longest links from the tree.
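The procedure can be sketched as follows. The function name `single_linkage`, the dictionary-based bookkeeping (instead of literally deleting rows of D) and the four points on a line are assumptions chosen to keep the sketch short.

```python
def single_linkage(D):
    """Agglomerative single-linkage clustering on a symmetric distance matrix D.

    Returns the list of merges as (members_r, members_s, level) tuples.
    """
    # Each active cluster is stored as a tuple of original item indices.
    clusters = {i: (i,) for i in range(len(D))}
    # dist[(a, b)] with a < b holds the current distance between clusters a and b.
    dist = {(i, j): D[i][j] for i in range(len(D)) for j in range(i + 1, len(D))}
    merges = []
    next_id = len(D)
    while len(clusters) > 1:
        # Step 2: nearest pair of clusters, equation (43).
        (r, s), level = min(dist.items(), key=lambda kv: kv[1])
        # Step 3: merge (r) and (s) and record the level.
        merges.append((clusters[r], clusters[s], level))
        merged = clusters.pop(r) + clusters.pop(s)
        # Step 4: distances from every other cluster to the new one, equation (44).
        new_dist = {}
        for k in clusters:
            d_kr = dist[(min(k, r), max(k, r))]
            d_ks = dist[(min(k, s), max(k, s))]
            new_dist[(k, next_id)] = min(d_kr, d_ks)
        dist = {pair: d for pair, d in dist.items()
                if r not in pair and s not in pair}
        dist.update(new_dist)
        clusters[next_id] = merged
        next_id += 1
    return merges


positions = [0, 1, 3, 7]
D = [[abs(a - b) for b in positions] for a in positions]
merges = single_linkage(D)
for members_r, members_s, level in merges:
    print(members_r, members_s, level)
```

The recorded levels are the link lengths of the tree; dropping the last (longest) merge leaves two clusters, in line with the cutting rule stated above.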
10.5.3 Problems
The two main weaknesses of agglomerative approaches like the one shown before are:
• Poor scaling: time complexity is at least O(n^2) (quadratic in the number n of objects)
• Operations done at iteration k destroy information produced by the operations done at iteration k − 1 - that is,
no undo is possible.
10.6 Performance
If I want to assess the performance of a clustering algorithm I can exploit the definitions of compactness and
separation.
Compactness The compactness is defined as:
C = \frac{1}{N} \sum_{i=1}^{K} \sum_{j=1}^{N} u_{ij}^m \| x_j - c_i \|^2    (45)
The closer the points are to their cluster centres, the smaller the compactness.
Separation The separation is defined as:
S = \min_{i \neq j} \| c_i - c_j \|^2    (46)
The more distant the centroids are from each other, the higher the separation.
Xie-Beni index This index is built from the two previous definitions: the lower the index, the better the
clustering - that is, well separated clusters (high S) whose points lie close to their centres (small C).
i_{XB} = \frac{C}{S}    (47)
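As a small worked example, the index can be computed directly from definitions (45)-(47). The function name `xie_beni`, the crisp (0/1) memberships and the toy data are assumptions; with fuzzy memberships the same code applies unchanged.

```python
import math


def xie_beni(points, centroids, U, m=2.0):
    """Xie-Beni index C / S; U[i][j] is the membership of point j in cluster i."""
    n = len(points)
    # Compactness, equation (45)
    compactness = sum(
        U[i][j] ** m * math.dist(points[j], centroids[i]) ** 2
        for i in range(len(centroids)) for j in range(n)) / n
    # Separation, equation (46): smallest squared distance between two centroids
    separation = min(
        math.dist(ci, cj) ** 2
        for a, ci in enumerate(centroids)
        for b, cj in enumerate(centroids) if a != b)
    return compactness / separation


points = [(0.0,), (2.0,), (10.0,), (12.0,)]
centroids = [(1.0,), (11.0,)]
U = [[1.0, 1.0, 0.0, 0.0],   # memberships of the four points in cluster 0
     [0.0, 0.0, 1.0, 1.0]]   # memberships of the four points in cluster 1
index = xie_beni(points, centroids, U)
print(index)  # C = 1, S = 100, so the index is 0.01
```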

11. CLASSIFICATION
11 Classification
Classification is the process of assigning objects to classes. A class contains similar objects. Every object can
be described by a set of attributes or characteristics called features. These features may be categorical - that is,
qualitative - or numerical - that is, quantitative. In the latter case we can arrange the features into an n-dimensional
vector:
x = [x_1, \ldots, x_n]^T \in \mathbb{R}^n    (48)
The space where the vector belongs is called feature space.
11.1 Classifier
A classifier is any function:
D : \mathbb{R}^n \to \Omega    (49)
Where Ω is the set of class labels. A classifier is built from a labelled set of data and can assign labels to
objects, thus predicting the class to which a new unlabelled object belongs. A classifier can be:
• Discrete: if it outputs a class label
• Continuous: if it outputs a real value, that is a certain degree of membership of that object to classes in Ω.
We’ll need to apply a certain threshold to precisely classify that object.
11.1.1 Classifier performance

Comparison between two classifiers is typically based on classification accuracy:
accuracy = \frac{\text{correctly classified samples}}{\text{testing samples}}    (50)
Its complement to 1 is called error rate. Accuracy is meaningful only when class a-priori probabilities are constant
and fairly uniformly distributed. It also assumes equal penalties for misclassification, which is not always the case.
Real world problems In real world problems there is often a large number of cases that are either normal or
not interesting and a much smaller number of cases that are unusual but interesting. Thus, the class distribution is
very skewed.
Working scenario From now on, to assess other performance metrics, we’ll focus on a two-class problem, or
binary classification. So we will have:
• p class of positives
• n class of negatives
• Y predicted class is positive
• N predicted class is negative
Given this, we can have 4 outcomes:
• True Positive (TP): positive object classified as positive (hit).
• True Negative (TN): negative object classified as negative (correct rejection).
• False Positive (FP): negative object classified as positive (false alarm).
• False Negative (FN): positive object classified as negative (miss).
FPs and FNs are caused by the overlap of the p and n distributions.

11.1.2 Confusion matrix

Given the previous example we can build a matrix:
        p     n
Y       TP    FP
N       FN    TN
Total   P     N
Table 1: Confusion matrix
The rows represent the predicted classes, while the columns represent the true classes. From the matrix we
can compute some values:
accuracy = \frac{TP + TN}{P + N}    (51)
True positive rate, also called sensitivity or hit rate:
TPR = \frac{TP}{TP + FN} = \frac{TP}{P}    (52)
False positive rate, also called false alarm rate, which is the complement to 1 of the specificity:
FPR = \frac{FP}{FP + TN} = \frac{FP}{N}    (53)
Cost of errors We can define two classes of error cost:

• c(Y, n) = cost of a false positive error
• c(N, p) = cost of a false negative error
If we evaluate using accuracy we must assume that the two costs are equal. In real world problems this is rarely
true, and the costs depend on the characteristics of the problem.
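The quantities above can be checked on a deliberately skewed toy problem. The helper names (`confusion_counts`, `rates`, `average_cost`), the 0/1 label encoding and the cost values are assumptions used only for illustration.

```python
def confusion_counts(y_true, y_pred):
    """Count TP, FP, FN, TN for binary labels, where 1 = positive and 0 = negative."""
    tp = sum(1 for t, y in zip(y_true, y_pred) if t == 1 and y == 1)
    fp = sum(1 for t, y in zip(y_true, y_pred) if t == 0 and y == 1)
    fn = sum(1 for t, y in zip(y_true, y_pred) if t == 1 and y == 0)
    tn = sum(1 for t, y in zip(y_true, y_pred) if t == 0 and y == 0)
    return tp, fp, fn, tn


def rates(tp, fp, fn, tn):
    """Accuracy, TPR and FPR as in equations (51)-(53)."""
    p, n = tp + fn, fp + tn
    return {"accuracy": (tp + tn) / (p + n), "tpr": tp / p, "fpr": fp / n}


def average_cost(tp, fp, fn, tn, cost_fp, cost_fn):
    """Average misclassification cost per sample with unequal error costs."""
    return (fp * cost_fp + fn * cost_fn) / (tp + fp + fn + tn)


# 5 interesting (positive) cases out of 100, and a classifier that always says "negative".
y_true = [1] * 5 + [0] * 95
y_pred = [0] * 100
counts = confusion_counts(y_true, y_pred)
result = rates(*counts)
cost = average_cost(*counts, cost_fp=1.0, cost_fn=10.0)
print(counts, result, cost)
```

The useless classifier reaches 95% accuracy while its TPR is 0, which is the reason accuracy alone is misleading on skewed data and with unequal costs.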
11.2 Receiver operating characteristics: ROC

ROC graphs are a more general comparison method, coming from signal detection theory. They’re useful to visualize and select classifiers
based on their performance. Also, they’re very useful in the presence of very skewed class distributions and unequal classification error costs.
11.2.1 ROC space

The ROC space is a two-dimensional space where ROC graphs are visualised. On the x-axis we put FPR, while on
the y-axis we put TPR.
A ROC graph shows the trade-off between benefits (TPR) and costs (FPR) of a classifier.
Some important points There are interesting points in the ROC space:
• (0, 0) is the "never alarming" strategy: it never issues a positive classification, so it trivially cannot produce a
false positive (nor a true positive).
• (1, 1) is the opposite strategy: it unconditionally issues positive classifications.
• (0, 1) is perfect classification.
• The line y = x represents a random classifier: every classifier that falls below this line performs worse than
random classification.
A discrete classifier (i.e. an instance of a confusion matrix) will produce a single point in the ROC space. When
comparing two discrete classifiers, the one that is most "top-left" - that is, higher TPR and lower FPR - has better
performance.

ROC for continuous classifier A continuous classifier will instead produce a ROC curve, obtained by varying the value
of the classification threshold. In this case, comparing two classifiers is not as straightforward as
before, since curves typically do not lie entirely above or below other curves.
Area under the curve In this case it is better to use the area under the curve, or AUC. An AUC value of 0.5
indicates a random classifier, while an AUC value of 1 denotes perfect classification.
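A sketch of how a ROC curve and its AUC can be obtained from the scores of a continuous classifier. The function names `roc_points` and `auc`, the trapezoidal rule for the area and the six example scores are assumptions.

```python
def roc_points(scores, labels):
    """ROC points (FPR, TPR) obtained by sweeping the threshold over the scores."""
    p = sum(labels)
    n = len(labels) - p
    # Sort by decreasing score: lowering the threshold admits one more sample at a time.
    ranked = sorted(zip(scores, labels), key=lambda sl: -sl[0])
    points = [(0.0, 0.0)]
    tp = fp = 0
    for idx, (score, label) in enumerate(ranked):
        if label == 1:
            tp += 1
        else:
            fp += 1
        # Emit a point only after the last sample of a group of tied scores.
        if idx + 1 == len(ranked) or ranked[idx + 1][0] != score:
            points.append((fp / n, tp / p))
    return points


def auc(points):
    """Area under the ROC curve by the trapezoidal rule."""
    return sum((x1 - x0) * (y0 + y1) / 2.0
               for (x0, y0), (x1, y1) in zip(points, points[1:]))


scores = [0.9, 0.8, 0.7, 0.6, 0.55, 0.4]
labels = [1, 1, 0, 1, 0, 0]
curve = roc_points(scores, labels)
area = auc(curve)
print(curve, area)
```

Here 8 of the 9 positive-negative pairs are ranked correctly, so the area is 8/9.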

12. CONVOLUTIONAL NEURAL NETWORKS
12 Convolutional Neural Networks

Convolutional neural networks are structured very similarly to ordinary ones, employing neurons with adjustable
weights and biases. However, in this architecture we explicitly assume that the network inputs are images, which
allows us to better exploit the input properties and make the forward functions more efficient to
implement, reducing the number of parameters in the network.
From now on we’ll refer to an example that takes as input images of dimension [32, 32, 3] and classifies them as
digits, thus having n = 10 possible classes.
12.1 Architecture overview

The layers of a convolutional neural network have neurons arranged in a 3D volume described by [width, height, depth]
- which are, of course, the parameters that characterise an image.
The last layer, instead, is a fully connected layer that outputs a single vector of class scores, with dimension
[1, 1, n], where n is the number of classes we want to classify our input into.
12.1.1 Type of layers

The building blocks of a CNN are five:
• Input layer [32, 32, 3]: holds the raw pixel values of the image.
• Convolution layer (CONV): computes the output of neurons connected to local receptive regions of the input (or of the previous layer). The computation involves the dot product between the neurons’ weights (filter
matrix) and small regions of the input (if we use 12 filters we may get a [32, 32, 12] output).
• Rectified linear unit (RELU): applies an element-wise activation function (typically max(0, x)).
• Pooling layer (POOL): performs a down-sampling along width and height, reducing the volume to [16, 16, 12].
• Fully connected layer (FC): computes the class scores, outputting the vector [1, 1, 10]. Note
that, as in ordinary neural networks, each neuron of this layer is connected to all the activations of the previous volume.
Parameters Some of the previous layers contain parameters (like CONV/FC - weights and biases) while others
do not have any parameter, implementing fixed functions (like RELU/POOL). Hence the parameters involved in
training are those of the CONV/FC layers.
12.1.2 Convolutional layer CONV

CONV layers’ parameters consist of a set of learnable filters. A filter is small along width and height but
extends through the full depth of the input volume (typically [5, 5, 3]).
During the forward pass we convolve - or slide - each filter across the width and height of the input volume,
computing dot products between the input entries and the filter parameters (the weights).
Local connectivity When dealing with high-dimensional inputs such as images, it is very impractical to use
fully connected layers from the very beginning. Instead it is more appropriate (also from a brain perspective) to
connect each neuron to a local region of the input volume. The spatial extent of this connectivity is a hyper-parameter called receptive field of the neuron, which is equal to the filter size. Recall that the connectivity is local only
along the width and the height of the image, while along the depth it is always full.

Spatial arrangement We have defined the parameters to be trained but not the number of neurons in a CONV
layer. There are three hyper-parameters that define this quantity:
Depth The depth of the output volume corresponds to the number of filters we use: each filter will look for
a different feature of the input (like edges and colours). A set of neurons that look at the same region of the input is
called depth column or fibre.
Stride The stride is the step by which we slide the filter along the input volume. A higher stride will reduce the output
volume.
Zero-Padding Sometimes it is necessary or useful to pad the border of the input volume with zeros. This allows us to control
the spatial size of the output volume (typically to preserve it).
Output volume calculation So, given W the input volume size, F the receptive field size, S the stride and P
the amount of zero-padding used on the border, the output size O will be:
O = \frac{W - F + 2P}{S} + 1    (54)
NOTE: When selecting a stride, one must ensure that O is an integer, otherwise we say that the neurons do
not fit and the set of hyper-parameters is not valid.
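Equation (54) and the integrality check from the note can be tried on the running [32, 32, 3] example. The function name `conv_output_size` and the specific filter, stride and padding values are assumptions.

```python
def conv_output_size(w, f, s, p):
    """Output size O = (W - F + 2P) / S + 1; raises if the neurons do not fit."""
    span = w - f + 2 * p
    if span < 0 or span % s != 0:
        raise ValueError(f"neurons do not fit: W={w}, F={f}, S={s}, P={p}")
    return span // s + 1


print(conv_output_size(32, 5, 1, 2))  # 32: padding of 2 preserves the spatial size
print(conv_output_size(32, 5, 1, 0))  # 28: no padding shrinks the output
print(conv_output_size(32, 2, 2, 0))  # 16: same formula for a 2-by-2 window with stride 2
```

An invalid combination such as W = 10, F = 3, S = 2, P = 0 gives (10 - 3) / 2 + 1 = 4.5, so the function raises instead of returning a size.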
Parameter sharing A parameter sharing scheme is used in CONV layers to control the number of parameters.
The basic idea is: if one feature is useful to compute at (x, y), it is likely to be useful also at (x_2, y_2). So, if we call a
single 2D slice of the volume along depth a depth slice, we force the neurons in each depth slice to use the same weights and bias.
NOTE: if all neurons in a single depth slice use the same weight vector, the forward pass can be computed, in each slice, as a convolution of the neurons’ weights with the input volume. This is why we commonly refer to
the sets of weights as filters or kernels.
12.1.3 Pooling layer

The aim of periodically inserting a POOL layer just after a CONV layer is to progressively reduce the spatial size of the representation, which reduces the amount of parameters and computations in the network and also helps to control over-fitting
(which typically happens in fully connected networks).
Operations The POOL layer operates independently on every depth slice of the input and re-sizes it spatially, using
a MAX operator.
A very common form of POOL is a layer with 2-by-2 filters applied with a stride of 2, applying the MAX operator on the 4 input numbers. This enables us to discard 75% of the activations.
General pooling There exist other types of pooling layers, such as average pooling or L2-norm pooling.
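The common 2-by-2, stride-2 MAX pooling can be illustrated on a single depth slice. The function name `max_pool_2x2` and the 4-by-4 input are assumptions; a full POOL layer would apply the same operation to every depth slice independently.

```python
def max_pool_2x2(slice_2d):
    """2-by-2 max pooling with stride 2 on one depth slice (a list of equal-length rows)."""
    h, w = len(slice_2d), len(slice_2d[0])
    return [[max(slice_2d[r][c], slice_2d[r][c + 1],
                 slice_2d[r + 1][c], slice_2d[r + 1][c + 1])
             for c in range(0, w - 1, 2)]
            for r in range(0, h - 1, 2)]


activations = [[1, 3, 2, 1],
               [4, 6, 5, 0],
               [7, 2, 9, 8],
               [1, 0, 3, 4]]
pooled = max_pool_2x2(activations)
print(pooled)  # [[6, 5], [7, 9]]: 4 of the 16 activations survive
```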
12.1.4 Fully-connected layer

Neurons in a fully connected layer have full connections to all activations in the previous layer, as in regular NN.
Their activation can be easily computed by matrix multiplication followed by a bias offset.
Dropout Since a fully connected layer holds most of the parameters, it is very prone to over-fitting. To
reduce this risk, at each training stage, individual nodes are dropped from the network with probability p, to
obtain a reduced network. So, in that stage, only the reduced network is trained on the data and, at the end of the
stage, the removed nodes are inserted back into the network with their original weights.
During training typically p = 0.5 for hidden nodes, while for input nodes it must be considerably lower, since a
node drop on that layer means losing input information.

13. FUZZY LOGIC AND SYSTEMS
13 Fuzzy logic and systems

13.1 Fuzzy sets
Fuzzy sets Fuzzy sets are an extension of classical (called crisp) sets. As an example, suppose we have a set
of tomatoes.
• With classical sets the membership relationship between a tomato and the set is binary: u ∈ {0, 1}
• With fuzzy sets, instead, the relationship between a tomato and the set is defined as u ∈ [0, 1], defining a
degree of membership.
Each fuzzy set is defined on a set of elements that defines the universe of discourse.
Formal definitions More formally a fuzzy set A is characterised by its membership function:
\mu_A : X \to [0, 1]    (55)
So, to completely define a fuzzy set A we write:
A = \{ (x, \mu_A(x)) \mid x \in X \}    (56)
Membership functions Typical membership functions are bell-shaped, triangular or trapezoidal functions.
13.1.1 Properties of fuzzy sets

We define some properties on fuzzy sets.
Height
height(A) = \sup_{x \in X} \mu_A(x)    (57)
Core
core(A) = \{ x \in X \mid \mu_A(x) = 1 \}    (58)
Note that there exist fuzzy sets whose core is empty (subnormal fuzzy sets).
Support
support(A) = \{ x \in X \mid \mu_A(x) > 0 \}    (59)
Fuzzy singleton A fuzzy singleton is a fuzzy set with only one value whose degree of membership is equal to 1, while
all the other values have a degree of membership equal to 0.
13.1.2 Fuzzy numbers

When the universe of discourse is R^n the fuzzy set is called fuzzy number (e.g. ’around 12’).
13.2 Operation on fuzzy sets

Classical operations on sets like intersection, union and complement can be extended to fuzzy sets as well. Functions that qualify as fuzzy intersections or fuzzy unions are called triangular norms (t-norms) and triangular
co-norms (t-conorms, s-norms), respectively.

13.2.1 Fundamental t-norms

These are the fundamental t-norms derived from common operations on sets:
• Minimum: T(a, b) = min(a, b)
• Product: T(a, b) = a × b
• Bounded product: T(a, b) = max(0, a + b − 1)
13.2.2 Fundamental t-conorms

These are the fundamental t-conorms derived from common operations on sets:
• Maximum: S(a, b) = max(a, b)
• Probabilistic sum: S(a, b) = a + b − a × b
• Bounded sum: S(a, b) = min(1, a + b)
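The three t-norms and the three t-conorms listed above pair up through the standard complement: each t-conorm equals 1 − T(1 − a, 1 − b) for the corresponding t-norm. This duality is a well-known property rather than something stated in these notes, and the dictionary and function names below are assumptions.

```python
T_NORMS = {
    "minimum": lambda a, b: min(a, b),
    "product": lambda a, b: a * b,
    "bounded product": lambda a, b: max(0.0, a + b - 1.0),
}

T_CONORMS = {
    "maximum": lambda a, b: max(a, b),
    "probabilistic sum": lambda a, b: a + b - a * b,
    "bounded sum": lambda a, b: min(1.0, a + b),
}

# Each t-norm paired with its dual t-conorm.
DUAL_PAIRS = [("minimum", "maximum"),
              ("product", "probabilistic sum"),
              ("bounded product", "bounded sum")]


def dual(t_norm):
    """t-conorm induced by a t-norm through the standard complement 1 - a."""
    return lambda a, b: 1.0 - t_norm(1.0 - a, 1.0 - b)


grid = [i / 10.0 for i in range(11)]
for t_name, s_name in DUAL_PAIRS:
    worst = max(abs(dual(T_NORMS[t_name])(a, b) - T_CONORMS[s_name](a, b))
                for a in grid for b in grid)
    print(t_name, "<->", s_name, "max difference:", worst)
```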
13.2.3 Complement
For the complement operation we have several different definitions:
• Standard: \bar{A}(x) = 1 − A(x)
• Round complement: \bar{A}(x) = \sqrt{1 − [A(x)]^2}
• Yager
• Sugeno
13.2.4 Standard fuzzy operations

• Intersection: (A ∩ B)(x) = min{A(x), B(x)}
• Union: (A ∪ B)(x) = max{A(x), B(x)}
• Complement: standard complement
NOTE: Intersection between A and its complement is not the empty set. Union between A and its complement is
not the universe.
13.2.5 Linguistic variables

A linguistic variable is a variable whose values are fuzzy sets.
13.3 Fuzzy relations

A fuzzy relation is a fuzzy set defined on the Cartesian product:
R(X, Y) = \{ ((x, y), \mu_R(x, y)) \mid (x, y) \in X \times Y \}    (60)
Composition (max-min composition) Given two relations R(X, Y) and S(Y, Z) we can define a relation
U(X, Z) = R(X, Y) ∘ S(Y, Z) as:
U(x, z) = \max_{y \in Y} \min \{ R(x, y), S(y, z) \}    (61)
Sup-T composition Since the minimum used there is a t-norm, we can generalise to any t-norm T:
U(x, z) = \sup_{y \in Y} T(R(x, y), S(y, z))    (62)
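On finite universes a fuzzy relation is just a matrix of membership degrees, and the composition looks like a matrix product with (max, T) in place of (+, ×). The function name `sup_t_composition` and the two example relations are assumptions.

```python
def sup_t_composition(R, S, t_norm=min):
    """U[x][z] = max over y of T(R[x][y], S[y][z]); the default T = min gives (61)."""
    return [[max(t_norm(R[x][y], S[y][z]) for y in range(len(S)))
             for z in range(len(S[0]))]
            for x in range(len(R))]


R = [[0.2, 0.8],
     [1.0, 0.4]]
S = [[0.5, 0.9],
     [0.3, 0.6]]
U = sup_t_composition(R, S)
print(U)  # [[0.3, 0.6], [0.5, 0.9]]
```

Passing another t-norm, for example `lambda a, b: a * b`, gives the corresponding sup-T composition (62).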

13.4 Fuzzy logic

With fuzzy logic we refer to the extension of classical logic that uses fuzzy sets to compute with words.
The logic is embedded in a set of principles of reasoning and rules of inference.
Modus ponens The most used inference rule in classical logic is modus ponens, which has this form:
(P \wedge (P \to Q)) \to Q    (63)
In other words, if P is known to be true and P implies Q, then Q must be true. However, we need a fuzzy logic
extension of modus ponens (the generalized modus ponens), that is:
(P' \wedge (P \to Q)) \to Q'    (64)
Where P' is slightly different from P, and so is Q' from Q. The terms of the previous rule are called fuzzy propositions.
Fuzzy proposition A fuzzy proposition can be:
• "X is A" where X is a linguistic variable and A is a fuzzy set. The degree of truth of a proposition p for a given
value x ∈ X is represented by T(p) = A(x)
• "If X is A then Y is B" where X, Y are linguistic variables and A, B are fuzzy sets. This form is also called
conditional fuzzy proposition or fuzzy rule.
So we end up having two fuzzy relations, the fact A' and the rule A → B, which can be combined to produce an approximation B'. This is called approximate reasoning.
Fuzzy implication A fuzzy implication I is a function
I : [0, 1] × [0, 1] → [0, 1]
which, for any truth values x, y of the propositions p, q, defines the truth value I(x, y) of the proposition "if p then q".
This is an extension of the classical implication function, that is:
I(a, b) = ¬a ∨ b
Several fuzzy implication functions are commonly used:
• Kleene-Dienes: I(a, b) = max(1 − a, b)
• Lukasiewicz: I(a, b) = min(1, 1 − a + b)
• Zadeh: I(a, b) = max(min(a, b), 1 − a)
• Mamdani: I(a, b) = min(a, b). This is by far the most used implication, thanks to its simplicity.
• Larsen: I(a, b) = a × b
The generalized modus ponens B' = A' ∘ (A → B) can then be written as a sup-min composition:
B'(y) = \sup_{x \in U} \min [ A'(x), I(A(x), B(y)) ]    (65)
Or, in general, using a t-norm T:
B'(y) = \sup_{x \in U} T [ A'(x), I(A(x), B(y)) ]
We can go further: substituting the implication I with one of the above definitions gives the explicit form of the
result.
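As an illustration of that substitution, the sketch below evaluates the composition on finite universes with the Mamdani implication. With a crisp observation the result is the consequent B clipped at the degree to which the observation matches A. The function name, the sampled fuzzy sets and the choice of a crisp A' are assumptions.

```python
def generalized_modus_ponens(A_prime, A, B, implication=min, t_norm=min):
    """B'(y) = max over x of T(A'(x), I(A(x), B(y))) on finite universes."""
    return [max(t_norm(A_prime[x], implication(A[x], B[y]))
                for x in range(len(A)))
            for y in range(len(B))]


A = [0.0, 0.5, 1.0, 0.5]        # antecedent fuzzy set sampled on 4 values of x
B = [0.2, 1.0, 0.6]             # consequent fuzzy set sampled on 3 values of y
A_prime = [0.0, 1.0, 0.0, 0.0]  # crisp observation: x is exactly the second value

# Mamdani implication I(a, b) = min(a, b) is the default.
B_prime = generalized_modus_ponens(A_prime, A, B)
print(B_prime)  # [0.2, 0.5, 0.5]: B clipped at A(x0) = 0.5
```

Other implications from the list can be passed through the `implication` argument, for example `lambda a, b: max(1 - a, b)` for Kleene-Dienes.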
13.5 Linguistic hedges

Linguistic hedges are special linguistic terms that modify other linguistic terms (e.g. very, more, less, fairly, etc.).
Any linguistic hedge can be interpreted as a unary operation on the interval [0, 1]. For example, the hedge very
is often interpreted as h(a) = a^2, while the hedge more or less is interpreted as h(a) = \sqrt{a}.

13.6 Fuzzy systems

A fuzzy system or fuzzy inference system consists of a certain number of rules, each of which is a conditional statement where the antecedent (left term) and the consequent (right term) contain linguistic variables and/or operators
(e.g. and, or, etc.).
When considering fuzzy systems we have two approaches:

• FITA (first infer then aggregate), where each rule is evaluated individually and then the conclusions are aggregated (typically by means of the union of the fuzzy sets produced by the rules)
• FATI (first aggregate then infer), where a single fuzzy relation is generated from the set of rules and combined with the fact (input).
De-fuzzification The final fuzzy set is often defuzzified in order to produce a numeric value. Typical defuzzifiers are:
• Center of Area (COA), also called center of gravity or centroid method
• Mean of Maxima (MOM)
The need for these tools comes from the fact that we want to formalize vague and imprecise concepts while,
on the other hand, we want to obtain precise results from vague and imprecise information.
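The two defuzzifiers can be sketched on an output fuzzy set sampled at a few points. The discrete formulas (a weighted mean for COA, the average of the maximising points for MOM), the function names and the sample values are assumptions; on a continuous universe the sums become integrals.

```python
def center_of_area(ys, mu):
    """Discrete centroid: sum of y * mu(y) divided by the sum of mu(y)."""
    return sum(y * m for y, m in zip(ys, mu)) / sum(mu)


def mean_of_maxima(ys, mu):
    """Average of the points where the membership reaches its maximum."""
    peak = max(mu)
    maxima = [y for y, m in zip(ys, mu) if m == peak]
    return sum(maxima) / len(maxima)


ys = [0.0, 1.0, 2.0, 3.0, 4.0]
mu = [0.0, 0.5, 1.0, 1.0, 0.0]
coa = center_of_area(ys, mu)
mom = mean_of_maxima(ys, mu)
print(coa, mom)  # COA is pulled towards the left shoulder, MOM ignores it
```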
13.6.1 MISO (multi-input single-output)

A MISO rule takes the following form:
if X_1 is A_1 and ... and X_m is A_m then Y is B.
The antecedent is a fuzzy relation, typically computed with the min t-norm: T(A_1(x_1), ..., A_m(x_m))

13.7 Types of fuzzy rules

We can have different types of fuzzy rules formulations:
Mamdani
if X_1 is A_1 and ... and X_m is A_m then Y is B.
The rule consequent is a fuzzy set. The main advantage of this kind of rule is the high
interpretability of the result, at the price of lower accuracy and higher computational cost.
Takagi-Sugeno-Kang (TSK)
if X_1 is A_{i1} and ... and X_m is A_{im} then Y = a_{i0} + a_{i1} ∗ X_1 + ... + a_{im} ∗ X_m
The consequent is a linear model of the input variables. Sometimes second-order functions can be used.
The system’s output is the average of the rule outputs, weighted by the firing strength of each rule.
The main advantage is greater accuracy w.r.t. Mamdani systems. As a drawback we obtain low
interpretability.
Singleton fuzzy rule The rule consequent is a constant value. It is a good tradeoff between Mamdani
interpretability and TSK accuracy. Moreover, the process of defuzzification is less demanding.
13.8 Adaptive fuzzy networks

So far, we’ve only considered fuzzy membership functions that are pre-determined in some way and fixed by us.
Suppose now we have a set of data we want to model without specifying the fuzzy model by hand as before. We may
train our membership functions to adapt to the data, while still providing a good degree of generalization.
An adaptive network is a multi-layered feed-forward network in which each node performs a node function on its
incoming signals. We can have two types of nodes:
• Adaptive nodes: nodes that have tunable parameters
• Fixed nodes: nodes that have no parameters
No weight is associated with the links between nodes.
13.9 ANFIS - Adaptive-network-based fuzzy inference system

ANFIS is a family of adaptive networks that are functionally equivalent to fuzzy inference systems. Suppose we have these two
TSK rules:
• if X is A_1 and Y is B_1 then f_1 = p_1 ∗ x + q_1 ∗ y + r_1
• if X is A_2 and Y is B_2 then f_2 = p_2 ∗ x + q_2 ∗ y + r_2
We can implement them without initially specifying all the parameters by using ANFIS:
• Layer 1: each node is an adaptive one and performs the fuzzification of the input. The parameters of the
membership functions are called premise parameters.
• Layer 2: each node is a fixed one and computes the firing strength of a rule.
• Layer 3: each node is a fixed one and normalizes the output of the previous layer.
• Layer 4: each node is an adaptive one and implements the consequent of one of the two TSK rules, with parameters p, q, r (the consequent parameters), weighted by the normalized firing strength.
• Layer 5: the last node is a fixed one and sums the outputs of the previous layer.
Once it is trained, this network is functionally equivalent to the TSK fuzzy system described before.
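The forward pass through the five layers can be sketched for the two rules above. Several choices here are assumptions rather than part of the notes: Gaussian membership functions for layer 1, the product t-norm for the firing strength in layer 2, the function names and all parameter values. Training (the update of premise and consequent parameters) is not shown.

```python
import math


def gaussian(x, center, sigma):
    """Assumed membership function shape for the premise fuzzy sets."""
    return math.exp(-((x - center) ** 2) / (2.0 * sigma ** 2))


def anfis_forward(x, y, premise, consequent):
    """premise: one ((cA, sA), (cB, sB)) per rule; consequent: one (p, q, r) per rule."""
    # Layer 1 (adaptive): fuzzification of the inputs.
    memberships = [(gaussian(x, *a), gaussian(y, *b)) for a, b in premise]
    # Layer 2 (fixed): firing strength of each rule (product t-norm assumed).
    w = [mu_a * mu_b for mu_a, mu_b in memberships]
    # Layer 3 (fixed): normalisation of the firing strengths.
    total = sum(w)
    w_norm = [wi / total for wi in w]
    # Layer 4 (adaptive): normalised strength times the linear consequent f_i.
    rule_out = [wn * (p * x + q * y + r)
                for wn, (p, q, r) in zip(w_norm, consequent)]
    # Layer 5 (fixed): overall output.
    return sum(rule_out)


premise = [((0.0, 1.0), (0.0, 1.0)),   # A_1, B_1 centred at 0
           ((2.0, 1.0), (2.0, 1.0))]   # A_2, B_2 centred at 2
consequent = [(1.0, 1.0, 0.0),         # f_1 = x + y
              (2.0, 0.0, 1.0)]         # f_2 = 2x + 1
out_mid = anfis_forward(1.0, 1.0, premise, consequent)
out_low = anfis_forward(0.0, 0.0, premise, consequent)
print(out_mid, out_low)
```

At (1, 1) both rules fire equally, so the output is the plain average of f_1 = 2 and f_2 = 3; at (0, 0) the first rule dominates.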

Learning algorithm We can use the well-known gradient descent, or a hybrid learning algorithm which
combines least-squares estimation (LSE) and gradient descent:
• In the forward pass the premise parameters are fixed and the consequent ones are updated by means of LSE.
• In the backward pass the premise parameters are updated by means of gradient descent while the consequent ones
are fixed.
13.10 How to create rules?

Expert knowledge from which to derive our rules is not always available, so we may have to learn
the rules as well, using an appropriate dataset.
Structure oriented approach We provide dataset along with fuzzy sets for each attribute. Then we
superpose hyperboxes according to these fuzzy sets over our data. Then we generate fuzzy rules to identify each
hyperbox that contains data.
Cluster oriented approach We perform clustering on our data - that is, unsupervised learning. Then we
project clusters into data, creating fuzzy rules for each cluster. In this way we manage to create both
membership functions and fuzzy rules.
Hyperbox oriented approach We define some hyperboxes (thus the algorithm is supervised in the
parameters we provide) to cover data. Then we project hyperboxes onto data, thus generating both fuzzy rules
and sets.
Remarks Using the structure oriented approach we obtain more interpretable fuzzy rules. The other
two are less useful because:
• Each fuzzy rule uses individual fuzzy sets.
• Fuzzy sets obtained by projection are hard to interpret linguistically.
13.11 Wang and Mendel algorithm

Assume we are given a set of <input, desired output> pairs. The Wang and Mendel algorithm proceeds as follows:
1. Divide the input and output spaces into fuzzy regions: usually we use equispaced triangular or trapezoidal
membership functions.
2. Determine the membership degree of each variable to each fuzzy set and assign each variable to
the region with the highest membership degree. In this way we obtain one fuzzy rule from each <input, output>
pair.
3. We may now have rules with the same "if" part but a different "then" part. We assign to each rule a
degree and, among the conflicting ones, choose the one with the highest degree. For example, the degree
of a rule can be the product of the membership degrees of its components. We can also multiply it by an expert-assigned weight if we
consider the rule more important than others.
4. A combined fuzzy rule is built from rules generated as before. If we have two inputs, we can construct a
matrix with all the possible fuzzy values assumed by the two inputs. An "and" rule will occupy only one
cell, while an "or" rule will occupy more cells. If a cell contains more than one rule it will be assigned to
the rule with highest degree.
5. Determine a mapping from input to output space based on the combined fuzzy rule produced at previous
step.
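Steps 1 to 3 can be sketched as follows; steps 4 and 5 (the combined rule base and the final mapping) are left out. The function names, the equispaced triangular regions with fully overlapping neighbours, the absence of expert weights and the four training pairs are all assumptions.

```python
def triangular_partition(lo, hi, n):
    """Step 1: centres of n equispaced triangular fuzzy regions covering [lo, hi]."""
    step = (hi - lo) / (n - 1)
    return [lo + i * step for i in range(n)], step


def memberships(value, centers, step):
    """Degree of membership of value in each triangular region."""
    return [max(0.0, 1.0 - abs(value - c) / step) for c in centers]


def wang_mendel(samples, partitions):
    """samples: tuples (x_1, ..., x_m, y); partitions: one (centers, step) per variable.

    Returns {antecedent region indices: (consequent region index, rule degree)}.
    """
    rules = {}
    for sample in samples:
        regions, degree = [], 1.0
        for value, (centers, step) in zip(sample, partitions):
            # Step 2: assign the variable to the region of highest membership.
            mu = memberships(value, centers, step)
            best = max(range(len(mu)), key=mu.__getitem__)
            regions.append(best)
            degree *= mu[best]  # rule degree = product of the membership degrees
        antecedent, consequent = tuple(regions[:-1]), regions[-1]
        # Step 3: among conflicting rules keep the one with the highest degree.
        if antecedent not in rules or degree > rules[antecedent][1]:
            rules[antecedent] = (consequent, degree)
    return rules


partitions = [triangular_partition(0.0, 10.0, 3),   # input x: regions 0, 1, 2
              triangular_partition(0.0, 1.0, 3)]    # output y: regions 0, 1, 2
samples = [(1.0, 0.1), (4.0, 0.6), (5.5, 0.9), (9.0, 1.0)]
rules = wang_mendel(samples, partitions)
for antecedent, (consequent, degree) in sorted(rules.items()):
    print("if x is region", antecedent[0], "then y is region", consequent,
          "degree", round(degree, 2))
```

The second and third pairs produce conflicting rules for the middle input region; the third one wins because its degree (about 0.72) is higher than that of the second (about 0.64).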

14. GENETIC ALGORITHMS
14 Genetic Algorithms
14.1 Introduction
Genetic algorithms, from now on referred to as GAs, are search and optimization methods that mimic
natural evolution. This type of evolution progresses through three fundamental processes:
• Selection: picks the individuals that will produce offspring.
• Recombination (crossover): combines two different individuals to produce offspring.
• Mutation: random variation of existing genetic material.
To reproduce those aspects, GAs perform systematic random searches to improve the probability of finding globally
optimal solutions to the problem.
14.2 Characteristics
A GA considers a population of chromosomes or individuals, each encoding a potential solution to the optimization
problem.
Each chromosome is a set of genes, each representing a specific feature of the individual; the chromosome as a whole is a point in the search
space.
The quality of a chromosome is measured by a fitness function - that is, a scalar objective function. The higher
the fitness, the higher the likelihood that the individual is selected to produce offspring.
14.3 Selection, crossover and mutation

The search process modifies the chromosome population generation after generation.
A selection operator is applied to exploit the best characteristics of good candidate solutions, improving them
throughout the generations.
Genetic operators like crossover and mutation are applied to generate new solutions:
• With crossover we combine (or mate) two chromosomes (the parents) to produce new chromosomes (the
offspring).
• With mutation we induce variability in the population, restoring lost or unexplored genetic material and
helping to prevent convergence to sub-optimal solutions.
14.4 Exploration and exploitation

During the search, we apply exploration to explore the search space and exploitation to refine the best
available solution.
The former corresponds to a global search while the latter corresponds to a local search.
If we exploit a solution too much, we’re likely to fall into premature convergence to a sub-optimal, possibly
unacceptable, solution.
On the other hand, if we focus too much on exploration, the process becomes too slow.
GAs handle this trade-off by exploring the space using mutation and crossover and exploiting solutions by means
of selection.
14.5 The fitness function

Typically the fitness function can be:
• An objective function, for numerical or combinatorial optimization problems
• A cost or loss function, for other types of problems
• A comparison against example test cases, to infer the quality of the solution.
The important thing to observe is that there must be enough fitness difference between generations to allow a real
and effective evolutionary search process. With fitness functions that cannot discriminate enough between
successive generations, the search may degenerate into a blind search.

14.6 Search progress

During the search process, generations evolve by means of chromosome combination and
modification.
On each iteration t we create a population that represents the generation at time t. Typically all generations
have the same size - that is, the same number of individuals - and each new generation contains better individuals w.r.t. the
fitness function, ideally leading towards the global optimum.
The overall process goes as follows:
1. Evaluation: the GA evaluates the fitness of each individual in the current population.
2. Selection for recombination: individuals of population P(t) are selected for recombination according to their
fitness, forming an intermediate population P_1. Given a crossover probability, individuals from P_1
will enter a mating pool for crossover.
3. A new intermediate population P_2 is generated by crossover and mutated, according to a mutation probability, into a
new population P_3.
4. The new generation P(t + 1) contains individuals from P_3 and may or may not overlap with previous
populations. Several selection-for-replacement procedures based on the fitness function can be applied here.
14.7 Canonical GA
In this version of the algorithm, the number of chromosomes is constant in each generation and the entire
population is replaced by the following generation (although other replacement schemes are possible).
Chromosomes are binary strings of constant length, where each gene is either 0 or 1.
The algorithm is the following:
1. Set generation t = 0
2. Initialize a random population P(t)
3. Evaluate it using the fitness function
4. Until DONE:
(a) Select a population P1 ⊆ P(t) for recombination. Then, according to the crossover probability, choose
some individuals i ∈ P1 to enter the mating pool MP.
(b) Recombine the individuals i ∈ MP to form population P2, and mutate the individuals of P2 with the
mutation probability into a new population P3.
(c) Select for replacement from P3 and P(t) to form the new generation P(t + 1).
(d) t := t + 1
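The steps above can be sketched in Python as follows. This is a minimal illustration, not the reference implementation: the `one_max` fitness, the parameter defaults, the use of roulette wheel selection in step (a), one-point crossover in step (b) and a fixed generation budget as the DONE condition are all assumptions chosen for the example. Fitness values are assumed to be non-negative.

```python
import random

def one_max(chromosome):
    """Illustrative fitness (an assumption, not from the notes): count of 1-genes."""
    return sum(chromosome)

def canonical_ga(fitness, length=20, pop_size=30, p_cross=0.7, p_mut=0.01,
                 max_generations=100, seed=0):
    rng = random.Random(seed)
    # Steps 1-3: t = 0, random initial population P(0).
    population = [[rng.randint(0, 1) for _ in range(length)]
                  for _ in range(pop_size)]
    best = max(population, key=fitness)
    for t in range(max_generations):  # step 4: DONE = generation budget spent
        fitnesses = [fitness(c) for c in population]
        # (a) P1: fitness-proportional selection (uniform if all fitnesses are 0).
        weights = fitnesses if sum(fitnesses) > 0 else None
        p1 = rng.choices(population, weights=weights, k=pop_size)
        # (b) P2: one-point crossover on consecutive pairs with probability p_cross.
        p2 = []
        for i in range(0, pop_size, 2):
            a, b = list(p1[i]), list(p1[(i + 1) % pop_size])
            if rng.random() < p_cross:
                cut = rng.randint(1, length - 1)
                a, b = a[:cut] + b[cut:], b[:cut] + a[cut:]
            p2.extend([a, b])
        # P3: flip each gene independently with probability p_mut.
        p3 = [[1 - g if rng.random() < p_mut else g for g in c]
              for c in p2[:pop_size]]
        # (c) Full generational replacement: P(t + 1) = P3.
        population = p3
        # Remember the best individual seen so far, since it may be lost later.
        best = max(population + [best], key=fitness)
    return best
```

Calling `canonical_ga(one_max)` returns the best binary chromosome seen during the run. Tracking `best` outside the population is a convenience of this sketch; the population itself is still fully replaced at every generation.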
Termination We can terminate the algorithm according to one of several criteria:

• Maximum number of generations reached
• Desired fitness value reached
• No population change over a given number of consecutive generations.
Solution The solution is taken from the last generation, but this does not guarantee that better individuals
were not obtained in earlier generations. We can keep the best individuals of each generation by means of
elitism: copy the fittest individual (or a percentage of the fittest) from one generation to the next.
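Elitism can be sketched as a small replacement step. The function name `apply_elitism` and the choice to overwrite the worst individuals of the new generation (so that the population size stays constant) are assumptions of this example; other schemes are possible.

```python
def apply_elitism(old_population, new_population, fitness, n_elite=1):
    """Carry the n_elite fittest old individuals into the new generation.

    The elites take the place of the n_elite worst new individuals,
    so the population size is unchanged.
    """
    elites = sorted(old_population, key=fitness, reverse=True)[:n_elite]
    ranked_new = sorted(new_population, key=fitness, reverse=True)
    survivors = ranked_new[:len(new_population) - n_elite]
    # Copy the elites so later in-place mutation cannot alter the originals.
    return [list(e) for e in elites] + survivors
```

For example, with `sum` as fitness, `apply_elitism([[1, 1, 1], [0, 0, 0]], [[0, 0, 1], [0, 0, 0]], sum)` keeps `[1, 1, 1]` from the old generation and drops the worst new individual.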

14.8 Selection
We can perform several types of selection:
• Proportional selection (or roulette wheel selection): the selection probability is proportional to the
individual's fitness:
$$p(i) = \frac{f(i)}{\sum_{j} f(j)}$$
We then "spin the wheel" n times to select n individuals (the higher the probability, the larger the roulette
slice assigned to that individual).
• Tournament selection: randomly select k individuals from the population and then choose the fittest of them
(the tournament winner).
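Both selection schemes can be sketched in a few lines of Python. The function names are illustrative, and the roulette wheel version assumes non-negative fitness values with a strictly positive sum; the tournament version draws its k contestants without replacement, which is one common choice among several.

```python
import random

def selection_probabilities(fitnesses):
    """p(i) = f(i) / sum_j f(j); assumes non-negative fitness, positive total."""
    total = sum(fitnesses)
    return [f / total for f in fitnesses]

def roulette_wheel(population, fitnesses, n, rng=random):
    """Spin the wheel n times; each slice is as wide as p(i)."""
    probs = selection_probabilities(fitnesses)
    chosen = []
    for _ in range(n):
        r = rng.random()
        cumulative = 0.0
        for individual, p in zip(population, probs):
            cumulative += p
            if r < cumulative:
                chosen.append(individual)
                break
        else:
            # Guard against rounding error in the cumulative sum.
            chosen.append(population[-1])
    return chosen

def tournament(population, fitnesses, k, rng=random):
    """Draw k distinct contestants at random and return the fittest."""
    contestants = rng.sample(range(len(population)), k)
    winner = max(contestants, key=lambda i: fitnesses[i])
    return population[winner]
```

To fill a mating pool of n individuals with tournament selection, `tournament` is simply called n times.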
Selection pressure Selection pressure (SP) is the degree to which better individuals are favoured by the
selection operator. The higher the SP, the more the search favours exploitation over exploration.
A strong SP may cause premature convergence to sub-optimal solutions, while a weak SP may slow down the
algorithm.
14.9 Binary encoding crossover

We can now apply different types of crossover:
• One-point crossover: we define a split point in the chromosome; one offspring receives the first piece of
one parent and the second piece of the other, and the second offspring receives the complementary pieces.
• Two-point crossover: the same as before but with two split points.
• Uniform crossover: for each gene of the first descendant we choose, with some probability, the parent that
provides that gene's value. The second descendant either takes the other parent's value or is generated
independently in the same way.
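The three binary crossover operators can be sketched as follows. Function names and the `p_swap` parameter are illustrative; chromosomes are assumed to be Python lists of equal length (at least 2 genes for one-point, at least 3 for two-point), and the uniform variant shown is the one where the second descendant takes the other parent's value.

```python
import random

def one_point(p1, p2, rng=random):
    """Split both parents at one random point and swap the tails."""
    cut = rng.randint(1, len(p1) - 1)
    return p1[:cut] + p2[cut:], p2[:cut] + p1[cut:]

def two_point(p1, p2, rng=random):
    """Swap the segment between two distinct random split points."""
    a, b = sorted(rng.sample(range(1, len(p1)), 2))
    return p1[:a] + p2[a:b] + p1[b:], p2[:a] + p1[a:b] + p2[b:]

def uniform(p1, p2, p_swap=0.5, rng=random):
    """For each gene, exchange the parents' values with probability p_swap."""
    c1, c2 = [], []
    for g1, g2 in zip(p1, p2):
        if rng.random() < p_swap:
            g1, g2 = g2, g1
        c1.append(g1)
        c2.append(g2)
    return c1, c2
```

With parents `[0, 0, 0, 0, 0, 0]` and `[1, 1, 1, 1, 1, 1]`, every operator returns two children that are bitwise complements of each other, which makes the exchanged segments easy to see.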
14.10 Real-valued encoded crossover

In this type of encoding each gene is a real number, so we have new choices for crossover:
• Discrete crossover: the same as uniform crossover in binary encoding.
• Average crossover: some genes are randomly chosen, and the corresponding descendant genes are the
arithmetic average of the parents' genes.
• Convex crossover: descendant genes are expressed as a convex combination of their parents' genes:
$$z_i = \alpha x_i + (1 - \alpha) y_i, \quad \alpha \in [0, 1]$$
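A short sketch of the average and convex operators, under the assumption that chromosomes are lists of floats of equal length. The function names are illustrative, and in `average_crossover` the random choice of positions is left to the caller so that the example stays deterministic.

```python
def convex_crossover(x, y, alpha):
    """z_i = alpha * x_i + (1 - alpha) * y_i for every gene, alpha in [0, 1]."""
    if not 0.0 <= alpha <= 1.0:
        raise ValueError("alpha must lie in [0, 1]")
    return [alpha * xi + (1.0 - alpha) * yi for xi, yi in zip(x, y)]

def average_crossover(x, y, positions):
    """Copy parent x, replacing the chosen positions with the parents' average."""
    child = list(x)
    for i in positions:
        child[i] = (x[i] + y[i]) / 2.0
    return child
```

For instance, `convex_crossover([0.0, 2.0], [4.0, 6.0], 0.25)` gives `[3.0, 5.0]`. With `alpha = 0.5` the convex operator averages every gene, and with `alpha = 1` it returns the first parent unchanged.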
14.11 Binary encoding mutation

Mutation changes a single gene of a chromosome, ensuring that the full range of gene values remains available
during the search process.
In strong mutation, the position selected for mutation is always changed, while in weak mutation the selected
position changes only with a certain probability.
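The difference between the two variants can be sketched as below. The interpretation is an assumption of this example: each position is selected with probability `p_mut`; strong mutation flips the selected bit, while weak mutation redraws it at random, so the bit actually changes only half of the time.

```python
import random

def mutate_binary(chromosome, p_mut, strong=True, rng=random):
    """Return a mutated copy of a binary chromosome (list of 0/1 genes)."""
    mutated = []
    for gene in chromosome:
        if rng.random() < p_mut:  # this position is selected for mutation
            if strong:
                gene = 1 - gene            # strong: always flip
            else:
                gene = rng.randint(0, 1)   # weak: redraw, may keep the old value
        mutated.append(gene)
    return mutated
```

With `p_mut = 1.0` and `strong=True` every bit is flipped, which is useful only as a sanity check; realistic values of `p_mut` are much smaller.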
14.12 Real-valued encoding mutation

Again we have new choices for the mutation operator:
• One-position mutation: a single gene is randomly chosen and replaced with a random real number inside
the parameter domain.

• All-positions mutation: all the genes are perturbed in a similar way. For example we can apply an
additive normal mutation:
$$x_i' = x_i + \alpha_i \, N(0, \sigma_i)$$
• Non-uniform mutation: the magnitude of the modifications decreases gradually through the generations.
This means that in early steps modifications are very significant, leading to big improvements from
generation to generation, while in late ones we refine the work done so far.
Typically two parameters are randomly generated at each generation: the direction of change (increase or
decrease of values) and the amplitude of change, a ∼ U([0, 1]).
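The three real-valued mutation operators can be sketched as follows. Names and signatures are illustrative. In particular, the linear decay factor `1 - t / max_t` used in `non_uniform` is an assumption of this example, since the notes only state that the magnitude of change decreases over the generations; all genes are assumed to share the same domain `[low, high]`.

```python
import random

def one_position(x, low, high, rng=random):
    """Replace one randomly chosen gene with a uniform value in [low, high]."""
    child = list(x)
    child[rng.randrange(len(x))] = rng.uniform(low, high)
    return child

def additive_normal(x, alphas, sigmas, rng=random):
    """x_i' = x_i + alpha_i * N(0, sigma_i) applied to every gene."""
    return [xi + a * rng.gauss(0.0, s) for xi, a, s in zip(x, alphas, sigmas)]

def non_uniform(x, t, max_t, low, high, rng=random):
    """Perturb all genes with a step that shrinks as generation t approaches max_t."""
    shrink = 1.0 - t / max_t          # assumed linear decay from 1 to 0
    increase = rng.random() < 0.5     # direction of change
    a = rng.random()                  # amplitude, a ~ U([0, 1])
    if increase:
        return [xi + a * shrink * (high - xi) for xi in x]
    return [xi - a * shrink * (xi - low) for xi in x]
```

At `t = 0` a gene can move all the way to a domain bound, while at `t = max_t` the chromosome is returned unchanged.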
14.13 Crossover and mutation probability

Typically the crossover probability assumes values of the order of $10^{-1}$, while the mutation probability
assumes values in [0.001, 0.01]. These two parameters also affect the balance between exploitation and
exploration.
Challenges When solving an optimization problem, the following issues must be addressed:
• Derive a genetic representation of candidate solutions
• Create an initial population (random?) of solutions.
• Define the fitness function
• Choose selection and genetic operators
• Choose the parameters of the GA, like population size, maximum number of generations and probabilities of
genetic operators.
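The parameters listed above can be grouped in a small configuration object. This is only a sketch: the class name `GAConfig` is hypothetical, the probability defaults follow the typical ranges given in this section, and the population size and generation budget are arbitrary placeholder values.

```python
from dataclasses import dataclass

@dataclass
class GAConfig:
    population_size: int = 50            # placeholder, not given in the notes
    max_generations: int = 200           # placeholder, not given in the notes
    crossover_probability: float = 0.7   # order of 10^-1
    mutation_probability: float = 0.01   # within [0.001, 0.01]

    def validate(self):
        """Raise ValueError on obviously inconsistent settings, else return self."""
        if self.population_size < 2:
            raise ValueError("population_size must be at least 2")
        if self.max_generations < 1:
            raise ValueError("max_generations must be at least 1")
        for name in ("crossover_probability", "mutation_probability"):
            if not 0.0 <= getattr(self, name) <= 1.0:
                raise ValueError(name + " must lie in [0, 1]")
        return self
```

A run would then start from something like `GAConfig(mutation_probability=0.005).validate()`, keeping all tunable choices in one place.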
