
Master in Data Science for Management


Statistical Learning

Prof. Salvatore Ingrassia

s.ingrassia@unict.it
http://www.dei.unict.it/docenti/salvatore.ingrassia
http://www.datasciencegroup.unict.it


7. Feedforward
Neural Networks

"Experience without theory is blind, but theory without experience is mere intellectual play."
Immanuel Kant (1724-1804)


Outline

7.1. A (very) brief history of research in machine learning


7.1.1. The first learning machines
7.1.2. The Idea of neural networks
7.1.3. Kernel methods and decision trees
7.1.4. Deep learning

7.2. Multilayer Perceptrons


7.2.1. Feedforward neural networks
7.2.2. Single Hidden Layer Networks
7.2.3. Feedforward network for classification
7.2.4. Fitting Neural Networks
7.2.5. Some Issues in Training Neural Networks
7.2.6. Lab activity with R


7.1 A (very) brief history of research in machine learning

Agenda
! The first Learning Machines
! The Idea of Neural Networks
! Kernel Methods and Decision Trees
! Deep Learning


Motivation: Biological Neuron


The inspiration for the creation of an artificial neuron comes from the
biological neuron.
In a simplistic view, neurons receive signals and produce a response.
! Dendrites are the transmission channels that bring inputs from another neuron or another organ.
! The synapse governs the strength of the interaction between neurons; think of it as the weights we use in neural networks.
! The soma is the processing unit of the neuron.
At a higher level, a neuron takes a signal input through the dendrites, processes it in the soma
and passes the output through the axon (the cable-like structure in the figure).


The McCulloch-Pitts Neuron - 1/2

The fundamental building block of deep learning is the artificial neuron, i.e. a unit that takes a
weighted aggregate of inputs, applies a function and gives an output.
The very first step towards the artificial neuron was taken by Warren
McCulloch and Walter Pitts.
In 1943, inspired by neurobiology, they created a model known as the
McCulloch-Pitts Neuron.
The McCulloch-Pitts model was an extremely simple artificial neuron. The
inputs could be either a zero or a one, the output was a zero or a one,
and each input could be either excitatory or inhibitory.


The McCulloch-Pitts Neuron - 2/2

Given x ∈ {0, 1}^n and a threshold T > 0,

f(x) = 1 if Σ_{i=1}^{n} x_i > T, and f(x) = 0 otherwise.

This is the threshold function.
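As an illustration, here is a minimal sketch of the McCulloch-Pitts unit in R (the function name mcp_neuron and the AND-gate example are ours):

# McCulloch-Pitts neuron: binary inputs, a threshold T,
# output 1 when the sum of the inputs exceeds T, 0 otherwise.
mcp_neuron <- function(x, T) {
  as.integer(sum(x) > T)
}

# Example: an AND gate over two binary inputs (fires only when both are 1)
mcp_neuron(c(1, 1), T = 1)   # 1
mcp_neuron(c(1, 0), T = 1)   # 0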


Rosenblatt’s Perceptron 1/6

The first iterative algorithm for learning linear classifications is the procedure
proposed by Frank Rosenblatt in 1958 for the Perceptron.
It is an "on-line" and "mistake-driven" procedure which starts with an initial
weight vector w0 (usually w0 = 0) and adapts it each time a training point is
misclassified by the current weights.
The procedure is guaranteed to converge provided that there exists a
hyperplane that correctly classifies the training data.
In this case, we say that the data are linearly separable.


Rosenblatt’s Perceptron 2/6


To construct such a rule, the Perceptron uses adaptive properties of the
simplest neuron.
Each neuron is described by the McCulloch-Pitts model, according to which
the neuron has p inputs x= (x1 , x2 , . . . , xp ) ∈ X ⊆ Rp and one output
y ∈ {−1, +1}.
The output is connected with the inputs by the functional dependence

f(x) = sign(w′x + b)

where b is a threshold value.


Rosenblatt’s Perceptron 3/6

The bias can be included in the model by setting w0 = b and x0 = 1.


Rosenblatt’s Perceptron 4/6


Geometrically speaking, the neuron divides the space X into two regions C1
and C2 where the output f(x) takes the values ±1. The regions are separated
by the hyperplane w′x + b = 0.


Rosenblatt’s Perceptron 5/6

Given a linearly separable training set D and a learning rate η ∈ R+

w0 ← 0; b0 ← 0; k ← 0
R ← max_{1≤i≤n} ∥xi∥
repeat
  for i = 1 to n   (cycle through the points in the training set)
    if yi(wk′xi + bk) ≤ 0 then
      wk+1 ← wk + η yi xi
      bk+1 ← bk + η yi R²
      k ← k + 1
    end if
  end for
until no mistakes are made within the for loop
return k, (wk, bk), where k is the number of mistakes.
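The procedure above can be transcribed almost line by line in R; the sketch below is ours (the function and argument names are illustrative, and max_epochs is added only as a safeguard in case the data are not linearly separable):

# Rosenblatt's perceptron, following the pseudocode above.
# X: n x p matrix of inputs; y: labels in {-1, +1}; eta: learning rate.
perceptron <- function(X, y, eta = 1, max_epochs = 1000) {
  n <- nrow(X); p <- ncol(X)
  w <- rep(0, p); b <- 0; k <- 0
  R <- max(sqrt(rowSums(X^2)))                 # radius of the training data
  for (epoch in 1:max_epochs) {
    mistakes <- 0
    for (i in 1:n) {
      if (y[i] * (sum(w * X[i, ]) + b) <= 0) { # point i is misclassified
        w <- w + eta * y[i] * X[i, ]
        b <- b + eta * y[i] * R^2
        k <- k + 1
        mistakes <- mistakes + 1
      }
    }
    if (mistakes == 0) break                   # converged: no mistakes in a full pass
  }
  list(w = w, b = b, mistakes = k)
}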


Rosenblatt’s Perceptron 6/6

Rosenblatt considered a model that is a composition of several neurons:


several levels of neurons, where the outputs of neurons of the previous level
are inputs for neurons of the next level.
The last level contains only one neuron.


The Idea of Neural Networks - 1/3


In 1986 several authors independently proposed a method for simultaneously
constructing the vector of coefficients for all neurons of the Perceptron.
The idea is extremely simple. Instead of the McCulloch-Pitts model of the
neuron based on a threshold function

f(x) = sign(w′x + b)

[Figure: plot of the threshold function on (−3, 3), taking the values −1 and +1]

Idea of Neural Networks - 2/3


one considers a slightly modified model, where the discontinuous function is
replaced by a continuous, so-called sigmoidal, function τ(z):

f(x) = τ(w′x + b)

[Figure: plot of a sigmoidal function on (−3, 3), taking values in (−1, 1)]

Here τ(z) is a monotonic and differentiable function with the properties

τ(−∞) = −1,  τ(+∞) = 1,

for example

τ(z) = tanh(z) = (e^z − e^{−z}) / (e^z + e^{−z}).

Idea of Neural Networks - 3/3

Alternatively, sigmoidal functions can also assume values in (0, 1); for
example we can consider the logistic sigmoid

τ(z) = 1 / (1 + e^{−z}).
Therefore, the composition of the new neurons is a continuous function that,
for any fixed x, has a gradient with respect to all coefficients of all neurons.
In 1986, a method for evaluating this gradient was found. Using the
evaluated gradient one can apply any gradient-based technique to construct
a function that approximates the desired one.

Of course, gradient-based techniques only guarantee finding local minima.

Nevertheless, it looked as if the main idea of the applied analysis of learning
processes had been found and the problem was only in its implementation.
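As a quick illustration, both activations mentioned above can be written and plotted in one line of R each (the plotting calls are optional):

# The two sigmoidal activations used above.
tanh_act <- function(z) tanh(z)              # values in (-1, 1)
logistic <- function(z) 1 / (1 + exp(-z))    # values in (0, 1)

# Both are smooth and monotone, so gradients with respect to the weights exist.
curve(tanh_act, from = -3, to = 3)
curve(logistic, from = -3, to = 3, add = TRUE, lty = 2)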


Kernel methods

As neural networks started to gain some respect among researchers in the


1990s, a new approach to machine learning rose to fame and quickly sent
neural nets to oblivion: kernel methods.
Kernel methods are a group of classification algorithms, the best known of
which is the Support Vector Machine (SVM). (The modern formulation of SVMs
was developed by Vladimir Vapnik and published in 1995.)


Decision trees, random forests, boosting

Decision Trees learned from data began to receive significant research


interest in the 2000s, and by 2010 they were often preferred to kernel
methods.
The Random Forests algorithm introduced a robust, practical take on decision-tree
learning that involves building a large number of specialized decision trees
and then ensembling their outputs.
A boosting machine, much like a random forest, is a machine learning
technique based on ensembling weak prediction models, generally decision
trees. Applied to decision trees, the use of the boosting technique results in
models that strictly outperform random forests most of the time.


Back to Neural Networks - 1/2


Around 2010, although neural networks were almost completely shunned by
the scientific community at large, a number of people still working on neural
networks started to make important breakthroughs: the groups of Geoffrey
Hinton (University of Toronto, Canada), Yoshua Bengio (University of
Montreal, Canada), Yann LeCun (New York University, USA), Swiss AI Lab
IDSIA (Switzerland).
In 2011, Dan Ciresan from IDSIA won academic image-classification
competitions with GPU-trained neural networks.
In 2012, a team led by Alex Krizhevsky (Geoffrey Hinton's group) was able to
achieve a significant improvement (83.6% vs. earlier 74.3%) in the yearly
large-scale image-classification challenge ImageNet (consisting of classifying
high-resolution color images into 1,000 different categories after training on
1.4 million images).
Since 2012, deep convolutional neural networks have become the go-to
algorithm for all computer vision tasks; more generally, they work on all
perceptual tasks.


Back to Neural Networks - 2/2


The primary reason deep learning took off so quickly is that it offered better
performance on many problems.
Deep learning also makes problem solving much easier, because it
completely automates what used to be the most crucial step in a
machine-learning workflow: feature engineering.
Previous machine-learning techniques only involved transforming the input
data into one or two successive representation spaces via simple
transformations such as high-dimensional non-linear projections (SVMs) or
decision trees.
Refined representations required by complex problems generally can’t be
attained by such techniques.
As such, humans had to go to great lengths to make the initial input data
more amenable to processing by these methods: they had to manually
engineer good layers of representations for these data. This is called feature
engineering.


Historical Perspective
In summary, the history of deep learning dates back to the early 1940s with
the first mathematical model of an artificial neuron by McCulloch & Pitts.
Since then, numerous architectures have been proposed in the scientific
literature, from the single layer perceptron of Frank Rosenblatt (1958) to the
recent neural ordinary differential equations (2018), in order to tackle various
tasks like time-series prediction, image classification, pattern extraction, etc.


7.2 Multilayer Perceptrons

Agenda
! Feedforward Neural Networks
! Single Hidden Layer Network
! Feedforward network for classification
! Fitting Neural Networks
! Some Issues in Training Neural Networks
! Application


Feedforward Neural Networks - 1/7

Deep feedforward networks, also called feedforward neural networks or


Multilayer perceptrons (MLP) are the quintessential deep learning models.
The goal of a feedforward neural network is to approximate some function f∗.
For example, for a classifier, y = f∗(x) maps an input x to a category
y ∈ {1, 2, . . . , K}. A feedforward network defines a mapping y = f(x; w) and
learns the value of the parameters w, taking values in some Euclidean space W,
that results in the best function approximation.
These models are called feedforward because information flows through the
function being evaluated at x, through the intermediate computations used to
define f, and finally to the output y.
When feedforward neural networks are extended to include feedback
connections, they are called recurrent neural networks.


Feedforward Neural Networks - 2/7


The feedforward neural network was the first and simplest type of artificial
neural network devised. It contains multiple neurons (nodes) arranged in
layers.
Nodes from adjacent layers have connections or edges between them. All
these connections have weights associated with them.


Feedforward Neural Networks - 3/7

A feedforward neural network can consist of three types of nodes:


! Input Nodes. The Input nodes provide information from the outside
world to the network and are together referred to as the Input Layer.
No computation is performed in any of the Input nodes - they just
pass on the information to the hidden nodes.

! Hidden Nodes. The Hidden nodes have no direct connection with the
outside world (hence the name hidden). They perform computations
and transfer information from the input nodes to the output nodes. A
collection of hidden nodes forms a Hidden Layer.
While a feedforward network will only have a single input layer
and a single output layer, it can have zero or multiple Hidden
Layers.


Feedforward Neural Networks - 4/7

! Output Nodes. The Output nodes are collectively referred to as the


Output Layer and are responsible for computations and transferring
information from the network to the outside world.

In a feedforward network, the information moves in only one direction -


forward - from the input nodes, through the hidden nodes (if any) and to the
output nodes.

There are no cycles or loops in the network (this property of feedforward
networks is different from the so-called Recurrent Neural Networks, in which
the connections between the nodes form a cycle).


Feedforward Neural Networks - 5/7

Two examples of feedforward networks are:


1. Single Layer Perceptron. This is the simplest feedforward neural
network and does not contain any hidden layer.

The best example to illustrate the single layer perceptron is the network
representation of logistic regression.


Feedforward Neural Networks - 5/7

2. Multilayer Perceptron. A Multilayer Perceptron has one or more hidden
layers. Multilayer Perceptrons are more useful than Single Layer
Perceptrons for practical applications.


Feedforward Neural Networks - 6/7

Feedforward neural networks are also called networks because they are
typically represented by composing together many different functions.
The model is associated with a directed acyclic graph describing how three
functions f^(1), f^(2) and f^(3) are connected in a chain to form

f(x) = f^(3)(f^(2)(f^(1)(x))).

These chain structures are the most commonly used structures of neural
networks. In this case f^(1) is called the first layer of the network, f^(2) is
called the second layer of the network, and so on.

Depth of the network


The overall length of the chain gives the depth of the network. The name deep
learning arose from this terminology.


Feedforward Neural Networks - 7/7

Why feedforward neural network works


The Universal Approximation Theorem (Barron (1993), Universal Approximation
Bounds for Superpositions of a Sigmoidal Function) states that a feedforward
network with a linear output layer and at least one hidden layer with a
sigmoidal function (such as the logistic sigmoid or the hyperbolic tangent)
can approximate any Borel measurable function from one finite-dimensional
space to another with any desired nonzero amount of error, provided that the
network is given enough hidden units.

Remark
As for the concept of Borel measurability, it suffices to say that any
continuous function on a closed and bounded subset of Rp is Borel measurable
and therefore may be approximated by a neural network.


Single Hidden Layer Networks - 1/4


Consider a single hidden layer network

[Figure: a single hidden layer network with inputs x1, . . . , xp, hidden units z1, . . . , zq, and output f(x)]

Single Hidden Layer Networks - 2/4


Assume that there are q neurons in the hidden layer. The derived features
z1, . . . , zq are computed from the inputs x1, . . . , xp as

zj = τ(aj1 x1 + · · · + ajp xp + aj0) = τ(aj′x + aj0),   j = 1, . . . , q

f(x) = c1 z1 + c2 z2 + · · · + cq zq + c0 = Σ_{j=1}^{q} cj zj + c0

where aj = (aj1, . . . , ajp)′ and c0, c1, . . . , cq are parameters (the weights of the
network), and τ(u) is a sigmoidal function, for example τ(u) = tanh(u).
Thus

f(x) = Σ_{j=1}^{q} cj τ(aj′x + aj0) + c0.

In other words, a feedforward network implements the mapping

f(x) = Σ_{j=1}^{q} cj τ(aj′x + aj0) + c0

and the graph above is the graphical representation of this function.
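As a sketch, this mapping can be coded directly in R (the function and argument names below are ours, with the weights aj collected in a matrix A, biases in a0 and c0, and output weights in cc):

# Single-hidden-layer mapping f(x) = sum_j c_j * tau(a_j' x + a_j0) + c_0.
# A: q x p matrix whose j-th row is a_j; a0: length-q vector of biases a_j0;
# cc: length-q vector of weights c_j; c0: scalar bias; tau: activation function.
single_layer_net <- function(x, A, a0, cc, c0, tau = tanh) {
  z <- tau(A %*% x + a0)      # derived features z_1, ..., z_q
  sum(cc * z) + c0            # linear combination in the output layer
}

# Example with p = 2 inputs and q = 3 hidden units (random weights)
set.seed(1)
A <- matrix(rnorm(6), nrow = 3); a0 <- rnorm(3)
cc <- rnorm(3); c0 <- rnorm(1)
single_layer_net(c(0.5, -1), A, a0, cc, c0)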



Single Hidden Layer Networks - 3/4

More in general, we can consider a further transformation in the output layer
and write

T(x) = Σ_{j=1}^{q} cj τ(aj′x + aj0) + c0

f(x) = g(T(x))

where the function g(T) allows a final transformation of the output T = T(x).

In particular, g (T) is the identity function for regression problems.


Single Hidden Layer Networks - 4/4


More in general, we have a hidden layer with q nodes and an output layer
with K units, and therefore

zj = τ(aj1 x1 + · · · + ajp xp + aj0) = τ(aj′x + aj0),   j = 1, . . . , q

Tk(x) = Σ_{j=1}^{q} ckj τ(aj′x + aj0) + ck0 = Σ_{j=1}^{q} ckj zj + ck0 = ck′z + ck0,   k = 1, . . . , K     (1)

fk(x) = gk(T(x))

with z = (z1, . . . , zq)′ and T = (T1, T2, . . . , TK)′.

Graphical representation of statistical model - 1/3


Linear regression

[Figure: network diagram of linear regression, with inputs x0 = 1, x1, x2, x3, weights w0, w1, w2, w3, and a single summation node producing the output f(x; w)]

Graphical representation of statistical model - 2/3


Logistic regression

[Figure: network diagram of logistic regression, with inputs x0 = 1, x1, x2, x3, weights w0, w1, w2, w3, a summation node computing γ(x; w), and a node applying ψ to give the output f(γ(x; w))]

Graphical representation of statistical model - 3/3


Mixture of distributions


[Figure: network diagram of a mixture of distributions, with inputs x1, . . . , xp, component densities f(x; ϕ1), . . . , f(x; ϕk), mixing weights π1, . . . , πk, and output p(x; θ)]

p(x; θ) = Σ_{i=1}^{k} πi ϕ(x; ψi).


Feedforward networks for classification


For a K-class classification, there are K units at the top, with the kth unit
modeling the probability of class k.

The class is then chosen according to the maximum probability criterion,
i.e. according to the classifier G(x) = arg maxk fk(x).

The Softmax function - 1/3

Recall that logistic regression produces a number between 0 and 1. For


example, a logistic regression output of 0.8 from an email classifier suggests
an 80% chance of an email being spam and a 20% chance of it being not
spam. Clearly, the sum of the probabilities of an email being either spam or
not spam is 1.0.
The Softmax function extends this idea into K > 2 class problems. That is,
Softmax assigns probabilities to each class in a multi-class problem. Those
probabilities must add up to 1.


The Softmax function - 2/3

Consider the mapping implemented by a feedforward network with K nodes
in the output layer

Tk(x) = Σ_{j=1}^{q} ckj τ(aj′x + aj0) + ck0,   k = 1, . . . , K

fk(x) = gk(T)

with T = (T1, . . . , TK).
In K-class classification we consider the softmax function

gk(T) = exp(Tk) / Σ_{j=1}^{K} exp(Tj).
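In R, the softmax of a vector of scores can be computed as follows (a small sketch of ours; subtracting the maximum before exponentiating is a standard numerical-stability trick that leaves the result unchanged):

# Softmax over the output-layer scores (T_1, ..., T_K).
softmax <- function(scores) {
  e <- exp(scores - max(scores))   # shift by the max for numerical stability
  e / sum(e)
}

softmax(c(2.0, 1.0, 0.1))          # returns K probabilities summing to 1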


The Softmax function - 3/3


Softmax is implemented through a neural network layer just before the output
layer.
The Softmax layer must have the same number of nodes as the output layer.


How feedforward NNs for classification work - 1/15

To understand how feedforward neural networks for classification work and


the role of the hidden layers, consider the following steps
1. Based on the data, draw an expected decision boundary to separate
the classes.
2. Express the decision boundary as a set of hyperplanes. Note that the
combination of such hyperplanes must yield the decision boundary.
3. The number of selected hyperplanes represents the number of hidden
neurons in the first hidden layer.
4. To connect the hyperplanes created by the previous layer, a new hidden
layer is added. Note that a new hidden layer is added each time you
need to create connections among the hyperplanes in the previous
hidden layer.
5. The number of hidden neurons in each new hidden layer equals the
number of connections to be made.


How feedforward NNs for classification work - 2/15


Consider a classification problem with two classes, similar to the so-called
XOR problem.

Each sample has two inputs and one output that represents the class label.


How feedforward NNs for classification work - 3/15

! The first question to answer is whether hidden layers are required or not.
! A rule to follow in order to determine whether hidden layers are required or
not is as follows: in artificial neural networks, hidden layers are required if
and only if the data must be separated non-linearly.
! Looking at next figure, it seems that the classes must be non-linearly
separated. A single line will not work.
! As a result, we must use hidden layers in order to get the best decision
boundary.


How feedforward NNs for classification work - 4/15


Knowing that we need hidden layers, we must answer two
important questions.
1. What is the required number of hidden layers?
2. What is the number of the hidden neurons across each hidden layer?

The first step is to draw the decision boundary that splits the two classes.
There is more than one possible decision boundary that splits the data
correctly as shown in the figure.


How feedforward NNs for classification work - 5/15


Next step is to express the decision boundary by a set of lines.

The idea of representing the decision boundary using a set of hyperplanes


comes from the fact that any NN is built using the single layer perceptron as a
building block.
The single layer perceptron is a linear classifier which separates the classes
using a hyperplane.


How feedforward NNs for classification work - 6/15


Knowing that there are just two lines required to represent the decision boundary tells us
that the first hidden layer will have two hidden neurons.
Each hidden neuron could be regarded as a linear classifier that is represented as a
hyperplane. There will be two outputs, one from each classifier (i.e. hidden neuron).

! We are to build a single classifier with one output representing the class label, not
two classifiers.
! As a result, the outputs of the two hidden neurons are to be merged into a single
output.
! In other words, the two hyperplanes are to be connected by another neuron.

How feedforward NNs for classification work - 7/15

After knowing the number of hidden layers and their neurons, the network
architecture is now complete.


How feedforward NNs for classification work - 8/15


Another classification example is shown in this figure.

! It is similar to the previous example in which there are two classes


where each sample has two inputs and one output.
! The difference is in the decision boundary.
! The boundary of this example is more complex than the previous
example.


How feedforward NNs for classification work - 9/15


According to the guidelines, the first step is to draw the decision boundary. The
decision boundary to be used in our discussion is shown in left-most part of the next
figure (a).

The next step is to split the decision boundary into a set of lines, where each line will
be modeled as a perceptron in the NN.
Before drawing lines, the points at which the boundary changes direction should be
marked, as shown in the right-most part of the next figure (b).


How feedforward NNs for classification work - 10/15


How many hyperplanes are required? Each of the top and bottom points will have two
hyperplanes associated with it, for a total of 4 hyperplanes.

! Because the first hidden layer will have hidden layer neurons equal to the
number of hyperplanes, the first hidden layer will have 4 neurons. In other
words, there are 4 classifiers each created by a single layer perceptron.
! The network will generate 4 outputs, one from each classifier.
! Next is to connect these classifiers together in order to make the network
generate just a single output. In other words, the lines are to be connected
together by other hidden layers to generate just a single curve.


How feedforward NNs for classification work - 11/15

It is up to the model designer to choose the layout of the network. One


feasible network architecture is to build a second hidden layer with two
hidden neurons.

The first hidden neuron will connect the first two lines and the last hidden
neuron will connect the last two lines.


How feedforward NNs for classification work - 12/15

The result of the second hidden layer.


How feedforward NNs for classification work - 13/15

After network design is complete, the complete network architecture is shown


in the figure.


How feedforward NNs for classification work - 14/15


Assume the first 4 hidden neurons learned to recognize the patterns shown on
the left side of the figure.

Then, if we feed the neural network an array of a handwritten digit zero, the
network should correctly trigger the top 4 hidden neurons in the hidden layer
while the other hidden neurons are silent, and then again trigger the first
output neuron while the rest are silent.


How feedforward NNs for classification work - 15/15


Preliminary note: Gradient descent - 1/7

Optimization is a big part of machine learning.


Gradient descent is an optimization algorithm used to find the values of the
parameters (coefficients) of some function f that minimize a cost function.
Gradient descent is best used when the parameters cannot be calculated
analytically (e.g. using linear algebra) and must be searched for by an
optimization algorithm.


Preliminary note: Gradient descent - 2/7


Consider the function

f (x) = x2 with derivative f ′ (x) = 2x.

The derivative points in the direction of steepest ascent.


Preliminary note: Gradient descent - 3/7


Take the function

f(x, y) = 2x² + y²

as another example.
Here, f(x, y) is a two-variable function. Its gradient is a vector containing the
partial derivatives of f(x, y):

∂f/∂x = 4x,   ∂f/∂y = 2y,

so the gradient is given by

∇f(x, y) = (∂f/∂x, ∂f/∂y)′ = (4x, 2y)′.

Each component indicates the direction of steepest ascent for each variable
of the function. Put differently, the gradient points in the direction where the
function increases the most.


Preliminary note: Gradient descent - 4/7

In general, if we have a function f(x), with x = (x1, . . . , xp)′ ∈ Rp, the gradient
of f is given by

∇f(x) = (∂f/∂x1, . . . , ∂f/∂xp)′.

For gradient descent, however, we do not want to maximize f as fast as we
can, we want to minimize it. Thus the direction of steepest descent is −∇f.

In summary
! ∇f points in the direction of greatest increase of the function.
! −∇f points in the direction of greatest decrease of the function.


Preliminary note: Gradient descent - 5/7


The nature of the gradient can be exploited in iterative optimization
algorithms: given the current value x(r) at step r, the value x(r+1) at step r + 1
is computed as

x(r+1) = x(r) − γ ∇f(x)|x=x(r),   r = 0, 1, 2, . . .     (2)

where γ is a quantity called the learning rate.

Starting from some initial guess x(0) (usually selected randomly), a gradient
descent algorithm generates a sequence {x(r)}r of points based on (2) until
the condition

|f(x(r+1)) − f(x(r))| < ε

is satisfied for some ε > 0.
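A minimal sketch of this iteration in R, applied to the earlier example f(x, y) = 2x² + y² (the function grad_descent and its arguments are ours):

# Gradient descent following the update rule (2) and the stopping rule above.
grad_descent <- function(f, grad, x0, gamma = 0.1, eps = 1e-8, max_iter = 10000) {
  x <- x0
  for (r in 1:max_iter) {
    x_new <- x - gamma * grad(x)            # update rule (2)
    if (abs(f(x_new) - f(x)) < eps) {       # stopping rule |f(x_new) - f(x)| < eps
      x <- x_new
      break
    }
    x <- x_new
  }
  x
}

f_ex    <- function(v) 2 * v[1]^2 + v[2]^2
grad_ex <- function(v) c(4 * v[1], 2 * v[2])
grad_descent(f_ex, grad_ex, x0 = c(3, -2))   # approaches the minimum at (0, 0)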

Preliminary note: Gradient descent - 6/7

The value of the learning rate should be chosen carefully.


Preliminary note: Gradient descent - 7/7


Starting from some initial guess x(0) (usually selected randomly), a gradient
descent algorithm generates a sequence {x(r) }r of points that converges to
some local minimum x∗ (there is no guarantee of convergence towards a global
minimum).

Therefore, strategies based on multiple initial starts could be adopted.


Fitting Neural Networks - 1/9


The neural network model has unknown parameters, often called weights,
and we have to estimate these values by fitting the training data.
Consider a network with p inputs and q nodes in the hidden layer. Let us
denote the complete set of weights by w, which consists of

{aj0, aj1, . . . , ajp, j = 1, . . . , q}   q(p + 1) weights,
{ck0, ck1, . . . , ckq, k = 1, . . . , K}   K(q + 1) weights.

Thus

w = {aj0, aj1, . . . , ajp, j = 1, . . . , q} ∪ {ck0, ck1, . . . , ckq, k = 1, . . . , K}

and the number of parameters in w to be estimated is q(p + 1) + K(q + 1),
on the basis of a given training set

D = {(x1, y1), . . . , (xn, yn)}

where xi = (xi1, . . . , xip)′ ∈ Rp and yi = (yi1, . . . , yiK)′ ∈ Y^K.


Fitting Neural Networks - 2/9

! For regression, we use the sum-of-squares error as measure of fit (error
function)

R(w) = Σ_{k=1}^{K} Σ_{i=1}^{n} (yik − fk(xi))².

! For classification, we use either the squared error or the cross-entropy as
measure of fit (error function)

R(w) = − Σ_{k=1}^{K} Σ_{i=1}^{n} yik log fk(xi)

and the corresponding classifier is G(x) = arg maxk fk(x).

With the softmax activation function and the cross-entropy error function, the
neural network model is exactly a linear logistic regression model in the hidden
units, and all the parameters are estimated by maximum likelihood.
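For concreteness, both error functions are one-liners in R, for an n × K matrix of fitted values and an n × K matrix of targets (one-hot class indicators in the classification case); the names below are ours:

# Y: n x K matrix of targets; Fhat: n x K matrix of fitted values f_k(x_i).
sum_of_squares <- function(Y, Fhat) sum((Y - Fhat)^2)    # regression error
cross_entropy  <- function(Y, Fhat) -sum(Y * log(Fhat))  # classification error

# Corresponding classifier G(x) = argmax_k f_k(x), applied row by row
classify <- function(Fhat) apply(Fhat, 1, which.max)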


Fitting Neural Networks - 3/9

The generic approach to minimizing R(w) is by gradient descent, called


back-propagation.
Because of the compositional form of the model, the gradient can be easily
derived using the chain rule for differentiation.
We present back-propagation in detail for the squared error loss. A similar
approach holds for the computational components of the cross-entropy error
function.


Fitting Neural Networks - 4/9


Let

zji = τ(aj′xi + aj0) = τ( Σ_{h=1}^{p} ajh xih + aj0 ),   j = 1, . . . , q

zi = (z1i, . . . , zqi)′

Tk(xi) = Σ_{j=1}^{q} ckj τ(aj′xi + aj0) + ck0 = ck′zi + ck0,   k = 1, . . . , K

fk(xi) = gk(T(xi)),   k = 1, . . . , K

R(w) = Σ_{k=1}^{K} Σ_{i=1}^{n} (yik − fk(xi))² = Σ_{i=1}^{n} Ri

Ri = Σ_{k=1}^{K} (yik − fk(xi))²

where aj = (aj1, . . . , ajp)′ and ck = (ck1, . . . , ckq)′.


Fitting Neural Networks - 5/9


Consider the derivatives. Applying the chain rule of calculus,

∂Ri/∂ckj = −2(yik − fk(xi)) g′k(ck′zi + ck0) zji

∂Ri/∂ajh = − Σ_{k=1}^{K} 2(yik − fk(xi)) g′k(ck′zi + ck0) ckj τ′(aj′xi + aj0) xih.

Given these derivatives, a gradient descent update at the (r + 1)st iteration
has the form

ckj^(r+1) = ckj^(r) − γr Σ_{i=1}^{n} ∂Ri/∂ckj^(r)

ajh^(r+1) = ajh^(r) − γr Σ_{i=1}^{n} ∂Ri/∂ajh^(r)     (3)

where γr is the learning rate.


Fitting Neural Networks - 6/9

Now let us set

∂Ri/∂ckj = δki zji
∂Ri/∂ajh = sji xih.     (4)

The quantities δki and sji are the "errors" of the current model at the output
and hidden layer units, respectively.
From these definitions, the errors satisfy

sji = τ′(aj′xi + aj0) Σ_{k=1}^{K} ckj δki,

known as the back-propagation equations.


Fitting Neural Networks - 7/9

Using this, the updates in (3) can be implemented in a two-pass algorithm:


! In the forward pass, the current weights are fixed and the predicted
values f̂k(xi) are computed through formula (1).
! In the backward pass, the errors δki are computed and then
back-propagation yields the errors sji = τ′(aj′xi + aj0) Σ_{k=1}^{K} ckj δki.
Both sets of errors are then used to compute the gradients for the updates (3)
via (4).
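As a minimal sketch, here is one batch back-propagation step in R for a single-hidden-layer regression network with tanh hidden units, identity output g (so g′ = 1) and squared-error loss; all names are ours and the code follows (3) and (4) directly:

# X: n x p inputs, Y: n x K targets, A: q x p, a0: length q, C: K x q, c0: length K.
backprop_step <- function(X, Y, A, a0, C, c0, gamma = 0.01) {
  n <- nrow(X)
  # Forward pass: hidden units and fitted values
  U <- X %*% t(A) + matrix(a0, n, length(a0), byrow = TRUE)     # n x q pre-activations
  Z <- tanh(U)                                                  # z_ji
  Fhat <- Z %*% t(C) + matrix(c0, n, length(c0), byrow = TRUE)  # n x K outputs

  # Backward pass: output-layer and hidden-layer errors
  Delta <- -2 * (Y - Fhat)            # delta_ki (identity output, so g' = 1)
  S <- (Delta %*% C) * (1 - Z^2)      # s_ji = tau'(u_ji) * sum_k c_kj delta_ki

  # Gradients summed over the training cases, then the update (3)
  grad_C  <- t(Delta) %*% Z           # dR/dc_kj
  grad_c0 <- colSums(Delta)
  grad_A  <- t(S) %*% X               # dR/da_jh
  grad_a0 <- colSums(S)

  list(A = A - gamma * grad_A, a0 = a0 - gamma * grad_a0,
       C = C - gamma * grad_C, c0 = c0 - gamma * grad_c0)
}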


Fitting Neural Networks - 8/9

Back-propagation
The two-pass procedure is what is known as back-propagation. It has also
been called the delta-rule (Widrow and Hoff, 1960).

! The advantages of back-propagation are its simple and local nature.

! In the back-propagation algorithm, each hidden unit passes and
receives information only to and from units that share a connection.
Hence it can be implemented efficiently on a parallel computer.
! The updates in (3) are a kind of batch learning, with the parameter
updates being a sum over all of the training cases.


Fitting Neural Networks - 9/9

Training epoch
A training epoch refers to one sweep through the entire dataset.

The learning rate γr for batch learning is usually taken to be a constant, and
can also be optimized by a line search that minimizes the error function at
each update.
Warning
Back-propagation can be very slow, and for this reason it is usually not the
method of choice. Better approaches include conjugate gradients and variable
metric methods.


Some Issues in Training Neural Networks

There is quite an art in training neural networks:


! The model is generally overparametrized.
! The optimization problem is nonconvex.


Starting Values - 1/2

Note that if the weights are near zero, then the operative part of the sigmoid
is roughly linear, and hence the neural network collapses into an
approximately linear model.
[Figure: sigmoid activation on (−3, 3); near zero the curve is approximately linear]


Starting Values - 2/2

Usually starting values for weights are chosen to be random values near
zero. Hence the model starts out nearly linear, and becomes non-linear as
the weights increase.
Individual units localize to directions and introduce nonlinearities where
needed.
Use of exact zero weights leads to zero derivatives and perfect symmetry,
and the algorithm never moves.
Starting instead with large weights often leads to poor solutions.


Overfitting - 1/2
Often neural networks have too many weights and will overfit the data at the
global minimum of R.
In early developments of neural networks an early stopping rule was adopted
to avoid overfitting: here we train the model only for a while, and stop well
before we approach the "global" minimum. A validation set is useful for
determining when to stop, since we expect the validation error to start
increasing.

[Figure: an overfitted vs. a well-fitted decision boundary]


Overfitting - 2/2

Regularization: Weight Decay


A more explicit method for regularization is weight decay: we add a penalty to
the error function, i.e. we consider

R(w) + λJ(w)

where
J(w) = ∥a∥2 + ∥c∥2
and λ ≥ 0 is a tuning parameter.

Larger values of λ will tend to shrink the weights toward zero: typically
cross-validation is used to estimate λ.
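In R, the nnet package exposes this penalty through its decay argument; a sketch under illustrative choices (the dataset, the network size and the value of λ are only examples, and in practice λ would be chosen by cross-validation):

# Weight decay with nnet: 'decay' is the tuning parameter lambda.
library(nnet)
set.seed(1)
fit <- nnet(medv ~ ., data = MASS::Boston,
            size = 10,          # q hidden units
            decay = 0.1,        # lambda: larger values shrink the weights more
            linout = TRUE,      # identity output for regression
            maxit = 500, trace = FALSE)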


Scaling the Inputs - 1/5

Feature scaling in machine learning is one of the most critical steps during
the pre-processing of data before creating a machine learning model.
Scaling can make a difference between a weak machine learning model and
a better one.
The most common techniques of feature scaling are Normalization and
Standardization.


Scaling the Inputs - 2/5


Normalization
Normalization is used when we want to bound our values between two
numbers, typically in [0, 1] or [−1, 1].

Standardization
Standardization transforms the data to have zero mean and a unit variance.

Both normalization and standardization make the data unitless.


Scaling the Inputs - 3/5


Another reason why feature scaling is applied is that some algorithms, like
gradient descent for neural networks, converge much faster with feature
scaling than without it.


Scaling the Inputs - 4/5

Normalization
! Normalization transforms features by scaling each feature to a given
range.
! An approach to normalization is the Min-Max Scaler:

xnew = (x − xmin) / (xmax − xmin)

! Similar approaches scale the data to [−1, 1].


Scaling the Inputs - 5/5


Standardization
! Standardization transforms each feature to have zero mean and unit
variance.
! An approach to standardization is the Standard Scaler:

xnew = (x − µ) / σ

where µ and σ are the mean and the standard deviation of X,
respectively.
! The Standard Scaler assumes the data are normally distributed within
each feature and scales them so that the distribution is centered around 0,
with a standard deviation of 1.
! Centering and scaling happen independently on each feature by
computing the relevant statistics on the samples in the training set. If
the data are not normally distributed, this is not the best scaler to use.
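Both scalers are one-liners in base R; a sketch (applied to a single numeric feature x, or column by column for a data frame, with the statistics computed on the training set only):

min_max_scale <- function(x) (x - min(x)) / (max(x) - min(x))   # values in [0, 1]
standardize   <- function(x) (x - mean(x)) / sd(x)              # zero mean, unit sd

# Base R also provides scale(), which standardizes the columns of a matrix:
X_std <- scale(as.matrix(mtcars))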


Number of Hidden Units and Layers- 1/2

In general, it is better to have too many hidden units than too few:
! With too few hidden units, the model might not have enough flexibility to
capture the nonlinearities in the data;
! with too many hidden units, the extra weights can be shrunk toward
zero if appropriate regularization is used.
! Typically the number of hidden units is somewhere in the range of 5 to
100, with the number increasing with the number of inputs and number
of training cases.
It is most common to put down a reasonably large number of units and train
them with regularization.


Number of Hidden Units and Layers- 2/2

Some researchers use cross-validation to estimate the optimal number, but


this seems unnecessary if cross-validation is used to estimate the
regularization parameter.
Choice of the number of hidden layers is guided by background knowledge
and experimentation.
Each layer extracts features of the input for regression or classification.
Use of multiple hidden layers allows construction of hierarchical features at
different levels of resolution.


Multiple Minima

The error function R(w) is nonconvex, possessing many local minima. As a


result, the final solution obtained is quite dependent on the choice of starting
weights.
One must at least try a number of random starting configurations, and
choose the solution giving lowest (penalized) error.
Probably a better approach is to use the average predictions over the
collection of networks as the final prediction. This is preferable to averaging
the weights, since the nonlinearity of the model implies that this averaged
solution could be quite poor.
Another approach is via bagging, which averages the predictions of networks
trained on randomly perturbed versions of the training data.


Lab activity with R

Lab activity 7.R
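A possible starting point for the lab is a single-hidden-layer network fitted with the nnet package on the iris data; the exact exercise may differ, so the code below is only a sketch:

library(nnet)
set.seed(123)
idx  <- sample(nrow(iris), 100)                      # simple train/test split
fit  <- nnet(Species ~ ., data = iris[idx, ],
             size = 5, decay = 0.01, maxit = 300, trace = FALSE)
pred <- predict(fit, iris[-idx, ], type = "class")   # softmax outputs -> class labels
table(predicted = pred, actual = iris$Species[-idx]) # confusion matrix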
