Salvatore Ingrassia
s.ingrassia@unict.it
http://www.dei.unict.it/docenti/salvatore.ingrassia
http://www.datasciencegroup.unict.it
7. Feedforward Neural Networks
Agenda
- The first Learning Machines
- The Idea of Neural Networks
- Kernel Methods and Decision Trees
- Deep Learning
Threshold function
The first iterative algorithm for learning linear classifiers is the procedure proposed by Frank Rosenblatt in 1958 for the Perceptron.
It is an "on-line", "mistake-driven" procedure which starts from an initial weight vector w0 (usually w0 = 0) and updates it each time a training point is misclassified by the current weights.
The procedure is guaranteed to converge provided that there exists a hyperplane that correctly classifies the training data. In this case, we say that the data are linearly separable.
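As an illustration, here is a minimal sketch of the perceptron procedure (not the original pseudocode from the course; the names, the learning rate eta and the cap max_epochs are ours), for labels yi in {−1, +1}:

    import numpy as np

    def perceptron(X, y, eta=1.0, max_epochs=100):
        """Rosenblatt's perceptron: X has shape (n, p), labels y are in {-1, +1}."""
        n, p = X.shape
        w = np.zeros(p)                    # initial weight vector w0 = 0
        b = 0.0                            # bias term
        for _ in range(max_epochs):
            mistakes = 0
            for xi, yi in zip(X, y):
                if yi * (np.dot(w, xi) + b) <= 0:   # point misclassified by current weights
                    w += eta * yi * xi              # mistake-driven update
                    b += eta * yi
                    mistakes += 1
            if mistakes == 0:              # a separating hyperplane has been found
                break
        return w, b

If the data are linearly separable the inner loop eventually makes no mistakes and the procedure stops; otherwise it halts at max_epochs.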
f(x) = sign(w′x + b)
[Figure: the threshold function f(x) = sign(w′x + b), plotted on (−3, 3); it takes only the values −1 and +1.]
[Figure: a sigmoidal activation function taking values in (−1, 1), plotted on (−3, 3).]
Alternatively, sigmoidal functions can also take values in (0, 1); for example, we can consider the logistic sigmoid
τ(z) = 1 / (1 + e^(−z)).
Therefore, the composition of the new neuron is a continuous function that, for any fixed x, has a gradient with respect to all coefficients of all neurons.
In 1986, a method for evaluating this gradient was found (back-propagation). Using the evaluated gradient, one can apply any gradient-based technique to construct a function that approximates the desired one.
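For instance, a minimal sketch (names are ours, not from the course) of the gradient of a single logistic neuron τ(w′x + b) with respect to its coefficients:

    import numpy as np

    def logistic(z):
        return 1.0 / (1.0 + np.exp(-z))

    def neuron_gradient(w, b, x):
        """Gradient of tau(w'x + b) with respect to (w, b), for a fixed input x."""
        t = logistic(np.dot(w, x) + b)
        dt = t * (1.0 - t)     # tau'(z) = tau(z) * (1 - tau(z)) for the logistic sigmoid
        return dt * x, dt      # d/dw = tau'(z) * x,  d/db = tau'(z)

The smoothness of τ is what makes this gradient exist everywhere, unlike the threshold function.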
Kernel methods
Historical Perspective
In summary, the history of deep learning dates back to the early 1940s with the first mathematical model of an artificial neuron by McCulloch & Pitts. Since then, numerous architectures have been proposed in the scientific literature, from the single-layer perceptron of Frank Rosenblatt (1958) to the recent neural ordinary differential equations (2018), to tackle various tasks such as time-series prediction, image classification, pattern extraction, etc.
Agenda
- Feedforward Neural Networks
- Single Hidden Layer Network
- Feedforward network for classification
- Fitting Neural Networks
- Some Issues in Training Neural Networks
- Application
- Hidden Nodes. Hidden nodes have no direct connection with the outside world (hence the name hidden). They perform computations and transfer information from the input nodes to the output nodes. A collection of hidden nodes forms a Hidden Layer.
  While a feedforward network has exactly one input layer and one output layer, it can have zero or more hidden layers.
There are no cycles or loops in the network (this property of feedforward networks distinguishes them from the so-called Recurrent Neural Networks, in which the connections between the nodes form a cycle).
Feedforward neural networks are called networks because they are typically represented by composing together many different functions. The model is associated with a directed acyclic graph describing, for example, three functions f(1), f(2) and f(3) connected in a chain to form
f(x) = f(3)(f(2)(f(1)(x))).
These chain structures are the most commonly used structures of neural networks. In this case, f(1) is called the first layer of the network, f(2) is called the second layer of the network, and so on.
Remark
As for the concept of Borel measurability, it suffices to say that any continuous function on a closed and bounded subset of R^p is Borel measurable and therefore may be approximated by a neural network.
[Figure: a single hidden layer network with inputs x1, …, xp, hidden units z1, …, zq, and output f(x).]
f(x) = g(T(x))
where the function g(T) allows a final transformation of the output T = T(x).
and therefore
zj = τ(aj1 x1 + · · · + ajp xp + aj0) = τ(a′j x + aj0),   j = 1, …, q

Tk(x) = Σ_{j=1}^{q} ckj τ(a′j x + aj0) + ck0 = Σ_{j=1}^{q} ckj zj + ck0 = c′k z + ck0,   k = 1, …, K    (1)

fk(x) = gk(T(x))

with z = (z1, …, zq)′ and T = (T1, T2, …, TK)′.
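A minimal sketch of this forward pass (assuming tanh as the sigmoidal τ and the identity as the final transformation g; shapes and names are illustrative):

    import numpy as np

    def forward(x, A, a0, C, c0):
        """Single-hidden-layer network following Eq. (1).
        A is (q, p) with rows a_j', a0 is (q,); C is (K, q) with rows c_k', c0 is (K,)."""
        z = np.tanh(A @ x + a0)   # z_j = tau(a_j' x + a_j0)
        T = C @ z + c0            # T_k(x) = c_k' z + c_k0
        return T                  # f_k(x) = g_k(T); here g is the identity

    # illustrative dimensions: p = 3 inputs, q = 4 hidden units, K = 2 outputs
    rng = np.random.default_rng(0)
    T = forward(rng.normal(size=3), rng.normal(size=(4, 3)), rng.normal(size=4),
                rng.normal(size=(2, 4)), rng.normal(size=2))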
[Figure: a single neuron with inputs x0 = 1, x1, x2, x3, weights w0, w1, w2, w3, and output f(x; w).]
[Figure: the same neuron decomposed into a pre-activation γ(x; w) followed by an activation ψ, giving the output f(γ(x; w)).]
[Figure: a finite mixture model drawn as a network, with inputs x1, …, xp, component densities f(x; ϕ1), …, f(x; ϕk), mixing weights π1, …, πk, and output p(x; θ).]
p(x; θ) = Σ_{i=1}^{k} πi ϕ(x; ψi).
Then the class is chosen according to the maximum-probability criterion, i.e. via the classifier G(x) = arg maxk fk(x).
fk(x) = gk(T)
with T = (T1, …, TK).
In K-class classification we consider the softmax function
gk(T) = exp(Tk) / Σ_{j=1}^{K} exp(Tj).
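A minimal sketch of the softmax transformation (subtracting the maximum is a standard numerical-stability trick, not something from the slides):

    import numpy as np

    def softmax(T):
        """g_k(T) = exp(T_k) / sum_j exp(T_j); the outputs are positive and sum to 1."""
        e = np.exp(T - np.max(T))   # subtracting max(T) leaves the result unchanged
        return e / e.sum()

    f = softmax(np.array([2.0, 1.0, 0.1]))
    G = int(np.argmax(f))           # predicted class: G(x) = argmax_k f_k(x)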
Each sample has two inputs and one output that represents the class label.
- The first question to answer is whether hidden layers are required or not.
- A rule for deciding: in artificial neural networks, hidden layers are required if and only if the data must be separated non-linearly.
- Looking at the next figure, it seems that the classes must be non-linearly separated: a single line will not work.
- As a result, we must use hidden layers in order to get the best decision boundary.
The first step is to draw the decision boundary that splits the two classes. There is more than one possible decision boundary that splits the data correctly, as shown in the figure.
- We are to build a single classifier with one output representing the class label, not two classifiers.
- As a result, the outputs of the two hidden neurons are to be merged into a single output.
- In other words, the two hyperplanes are to be connected by another neuron.
Once the number of hidden layers and their neurons is known, the network architecture is complete.
The next step is to split the decision boundary into a set of lines, where each line will be modeled as a perceptron in the NN.
Before drawing the lines, the points at which the boundary changes direction should be marked, as shown in the right-most part of the next figure (b).
- Because the first hidden layer has as many neurons as there are hyperplanes, the first hidden layer will have 4 neurons. In other words, there are 4 classifiers, each created by a single-layer perceptron.
- The network will generate 4 outputs, one from each classifier.
- Next, these classifiers are connected together so that the network generates just a single output. In other words, the lines are to be connected together by other hidden layers to generate just a single curve.
The first hidden neuron will connect the first two lines and the last hidden
neuron will connect the last two lines.
Then, if we feed the neural network an array representing a handwritten digit zero, the network should correctly trigger the top 4 hidden neurons in the hidden layer while the other hidden neurons stay silent, and in turn trigger the first output neuron while the rest stay silent.
Each component indicates the direction of steepest ascent with respect to the corresponding variable of the function. Put differently, the gradient points in the direction where the function increases the most.
In summary:
- ∇f points in the direction of greatest increase of the function.
- −∇f points in the direction of greatest decrease of the function.
Starting from some initial guess x(0) (usually selected randomly), a gradient descent algorithm generates a sequence {x(r)}r of points based on (2) until the condition
|f(x(r+1)) − f(x(r))| < ε
is satisfied for some ε > 0.
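Equation (2) is not reproduced above; assuming it is the standard update x(r+1) = x(r) − γ ∇f(x(r)), a minimal sketch of the procedure reads:

    import numpy as np

    def gradient_descent(f, grad_f, x0, gamma=0.1, eps=1e-8, max_iter=10_000):
        """Iterate x <- x - gamma * grad f(x) until |f(x_new) - f(x)| < eps."""
        x = np.asarray(x0, dtype=float)
        for _ in range(max_iter):
            x_new = x - gamma * grad_f(x)    # step along -grad f, steepest descent
            if abs(f(x_new) - f(x)) < eps:   # the stopping condition above
                return x_new
            x = x_new
        return x

    # example: f(x) = x'x has gradient 2x and its minimum at the origin
    x_star = gradient_descent(lambda x: x @ x, lambda x: 2 * x, x0=[3.0, -2.0])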
Thus
D = {(x1, y1), …, (xn, yn)}
With the softmax activation function and the cross-entropy error function, the neural network model is exactly a linear logistic regression model in the hidden units, and all the parameters are estimated by maximum likelihood.
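The cross-entropy error is not written out in these slides; in its standard form, with yik the 0-1 class indicators and fk(xi) the softmax outputs, a sketch is:

    import numpy as np

    def cross_entropy(Y, P):
        """R = - sum_i sum_k y_ik * log f_k(x_i); Y holds one-hot targets, P the softmax outputs."""
        return -np.sum(Y * np.log(P))

    # two samples, three classes
    Y = np.array([[1, 0, 0], [0, 0, 1]])
    P = np.array([[0.7, 0.2, 0.1], [0.1, 0.3, 0.6]])
    R = cross_entropy(Y, P)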
zi = (z1i, …, zqi)′

Tk(xi) = Σ_{j=1}^{q} ckj τ(a′j xi + aj0) + ck0 = c′k zi + ck0,   k = 1, …, K
Back-propagation
The two-pass procedure is what is known as back-propagation. It has also
been called the delta-rule (Widrow and Hoff, 1960).
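A minimal sketch of one back-propagation update for the single-hidden-layer model of Eq. (1), assuming tanh hidden units, identity output and squared error (these choices, and the names, are illustrative):

    import numpy as np

    def backprop_step(x, y, A, a0, C, c0, gamma=0.01):
        """One forward/backward pass followed by a gradient-descent update."""
        # forward pass
        z = np.tanh(A @ x + a0)            # hidden outputs z_j
        f = C @ z + c0                     # network outputs
        # backward pass: propagate the errors back through the network
        delta = f - y                      # output-layer errors (squared-error loss)
        s = (C.T @ delta) * (1.0 - z**2)   # hidden-layer errors; tanh'(u) = 1 - tanh(u)^2
        # gradient-descent updates with learning rate gamma
        C -= gamma * np.outer(delta, z)
        c0 -= gamma * delta
        A -= gamma * np.outer(s, x)
        a0 -= gamma * s
        return A, a0, C, c0

The forward pass computes the fitted values; the backward pass propagates the errors delta and s, which is exactly the two-pass structure named above.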
Training epoch
A training epoch refers to one sweep through the entire dataset.
The learning rate γr for batch learning is usually taken to be a constant, and
can also be optimized by a line search that minimizes the error function at
each update.
Warning
Back-propagation can be very slow, and for this reason it is usually not the
method of choice. Better approaches include conjugate gradients and variable
metric methods.
Note that if the weights are near zero, then the operative part of the sigmoid is roughly linear, and hence the neural network collapses into an approximately linear model.
[Figure: the sigmoid plotted on (−3, 3); it is approximately linear near the origin.]
Usually, starting values for the weights are chosen to be random values near zero. Hence the model starts out nearly linear, and becomes nonlinear as the weights increase.
Individual units localize to directions and introduce nonlinearities where needed.
Using exact zero weights leads to zero derivatives and perfect symmetry, and the algorithm never moves. Starting instead with large weights often leads to poor solutions.
Overfitting - 1/2
Often neural networks have too many weights and will overfit the data at the global minimum of R.
In the early development of neural networks, an early stopping rule was adopted to avoid overfitting: the model is trained only for a while, stopping well before the "global" minimum is approached. A validation set is useful for determining when to stop, since we expect the validation error to start increasing.
[Figure: an overfitted vs. a well-fitted decision boundary.]
Overfitting - 2/2
A more explicit approach to regularization is to penalize the error function, minimizing
R(w) + λJ(w)
where
J(w) = ∥a∥² + ∥c∥²
and λ ≥ 0 is a tuning parameter (this penalty is known as weight decay).
Larger values of λ will tend to shrink the weights toward zero; typically cross-validation is used to estimate λ.
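As a small illustration (the function name is ours), the penalized criterion is simply:

    import numpy as np

    def penalized_error(R, a, c, lam):
        """R(w) + lambda * J(w), with J(w) = ||a||^2 + ||c||^2 (weight decay)."""
        return R + lam * (np.sum(a**2) + np.sum(c**2))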
Feature scaling is one of the most critical steps in pre-processing data before creating a machine learning model. Scaling can make the difference between a weak machine learning model and a strong one.
The most common techniques of feature scaling are Normalization and Standardization.
Standardization
Standardization transforms the data to have zero mean and unit variance:
xnew = (x − μ) / σ.
Normalization
- Normalization transforms features by scaling each feature to a given range.
- An approach to normalization is the Min-Max Scaler:
  xnew = (x − xmin) / (xmax − xmin)
- Similar approaches scale the data to [−1, 1].
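A minimal sketch of both scalers (names are ours; applied per feature to a vector):

    import numpy as np

    def standardize(x):
        """Zero mean, unit variance: (x - mean) / std."""
        return (x - x.mean()) / x.std()

    def min_max(x, lo=0.0, hi=1.0):
        """Min-Max Scaler: maps x into the range [lo, hi]."""
        x01 = (x - x.min()) / (x.max() - x.min())
        return lo + (hi - lo) * x01

    x = np.array([2.0, 4.0, 6.0, 10.0])
    x_std = standardize(x)      # mean 0, variance 1
    x_01 = min_max(x)           # values in [0, 1]
    x_pm1 = min_max(x, -1, 1)   # values in [-1, 1], the variant mentioned above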
In general, it is better to have too many hidden units than too few:
- With too few hidden units, the model might not have enough flexibility to capture the nonlinearities in the data;
- with too many hidden units, the extra weights can be shrunk toward zero if appropriate regularization is used.
- Typically the number of hidden units is somewhere in the range of 5 to 100, increasing with the number of inputs and the number of training cases.
It is most common to put down a reasonably large number of units and train them with regularization.
Multiple Minima