Course: Intelligent Systems

Unit 2: Neural Networks

2.2 Training. Linear Regression

Daniel Manrique

Intelligent Systems
Neural Networks

1. Representing neural networks.

2. Training neural networks.
1. Linear Regression.
2. Softmax Regression.
3. Multilayer perceptrons.

Training artificial Neural Networks
● Artificial neural networks learn from a dataset.
● Learning is the process of adjusting the weights and biases
(parameters) of the connections between neurons so that
the neural network adapts its responses to exhibit
the expected target behavior: to fit the dataset.
● This set of weights and biases is the solution to the problem.
● The goal is generalization: to provide adequate answers to
unseen data, not present in the training dataset.

[Figure: input MNIST examples are fed to the neural architecture (weights W[1], W[2]); a comparison step drives the weights adjustment.]

Supervised learning:

● The training data fed to the algorithm include the desired or
target outputs, called labels.
● Typical examples are classification, where labels correspond
to classes, or regression, where labels correspond to the
target values.
● An attribute is an input variable.
machine-learning-the-new-epm-black/ Laura Edell, 2015.

Attributes, x (first nine columns); target output, t (last column)

Long.     Lat.    Age     Rooms   Beds    Pop.    HouH.   Inc.    Ocean proximity   Median house value
-0.5623   0.763   0.17    -0.87   -0.74   -0.85   -0.91   0.93    0.33              0.87
0.4297    0.477   -0.13   0.44    0.32    0.44    0.52    0.23    -1                -0.45
0.0212    -0.75   0.21    0.67    0.87    0.94    0.9     0.64    -0.33             0.34
-0.6171   -0.11   -0.37   -0.31   -0.23   -0.34   -0.27   -0.34   1                 -0.82

Linear regression

9 input variables (attributes):
-0.5623 0.763 0.17 0.87 -0.74 -0.85 -0.91 0.93 0.33

Wx + b = y
Target output: 0.87 (328,500 $)

[Figure: linear unit with bias; training process.]

Linear activation function
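● The linear activation function is the identity function:

$$y = f(Wx + b) = Wx + b$$

For each unit:

$$y_i = f\left(\sum_{j=1}^{n_x} x_j w_{ij} + b_i\right) = \sum_{j=1}^{n_x} x_j w_{ij} + b_i$$

$$net_i = \sum_{j=1}^{n_x} x_j w_{ij} + b_i$$

[Figure: plot of f(x) against x for the linear activation; source: networks-1cbd9f8d91d6]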
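
As a rough illustration of the linear unit described above, the following NumPy sketch computes the net input and applies the identity activation. The function name `linear_layer` and the values of `W`, `b` and `x` are hypothetical, chosen only for the example.

```python
import numpy as np

def linear_layer(W, x, b):
    """Layer of linear units: the activation is the identity, so y = Wx + b."""
    net = W @ x + b   # net_i = sum_j x_j * w_ij + b_i
    return net        # f(net) = net for the linear activation

# Hypothetical values: 2 output units, 3 input variables.
W = np.array([[0.5, -1.0, 2.0],
              [1.0,  0.0, 0.5]])
b = np.array([0.1, -0.2])
x = np.array([1.0, 2.0, 3.0])

y = linear_layer(W, x, b)
print(y)
```

Each row of `W` holds the weights of one output unit, so the matrix product evaluates all the per-unit sums at once.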
LMS learning algorithm or Delta rule
● The LMS (least mean squares) algorithm was designed to
adjust a linear unit, or ADALINE.
● The bias b is usually included as column 0 of W:
    ● $W_i = (w_{i,0}, w_{i,1}, \ldots, w_{i,n_x})$, where $w_{i,0}$ is $b_i$.
● The input vector x is also extended with a row 0 holding the
constant 1:
    ● $x^T = (1, x_1, x_2, \ldots, x_{n_x})$.
● Therefore, the output vector is $y = Wx$, with $y_i = \sum_{j=0}^{n_x} x_j w_{ij}$.
● Given a linear regression problem, the loss (error) function for
a single pth training sample with $n_x$ attributes and $n_y$ output units is:

$$\mathcal{L}^{(p)}(W) = E^{(p)}(W) = \frac{1}{2} \sum_{i=1}^{n_y} \left(t_i^{(p)} - y_i^{(p)}\right)^2$$

W is the matrix of weights.
$t_i^{(p)}$ is the desired target output for the ith unit: the label corresponding to the pth training sample.
$y_i^{(p)}$ is the calculated output for the ith unit and pth input vector.
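
The bias-as-column-0 convention and the per-sample loss can be checked numerically. This is a minimal sketch with made-up sizes and random values; `W_aug`, `x_aug` and `sample_loss` are illustrative names, and the loss assumes the 1/2 factor used in the gradient-descent derivation.

```python
import numpy as np

rng = np.random.default_rng(0)
n_x, n_y = 4, 2                      # illustrative sizes, not from the course dataset
W = rng.normal(size=(n_y, n_x))      # weights without the bias
b = rng.normal(size=n_y)             # one bias per output unit
x = rng.normal(size=n_x)             # one input vector
t = rng.normal(size=n_y)             # its target output

# Fold the bias into the weights: column 0 of W_aug is b, row 0 of x_aug is 1.
W_aug = np.hstack([b[:, None], W])
x_aug = np.concatenate([[1.0], x])

y_explicit = W @ x + b               # y = Wx + b
y_aug = W_aug @ x_aug                # y = Wx with the augmented W and x

def sample_loss(W_aug, x_aug, t):
    """Loss for one sample: half the sum of squared errors over the output units."""
    y = W_aug @ x_aug
    return 0.5 * np.sum((t - y) ** 2)

print(np.allclose(y_explicit, y_aug), sample_loss(W_aug, x_aug, t))
```

Both forms give the same output, which is why the rest of the derivation can treat the bias as just another weight.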
LMS learning algorithm or Delta rule
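● Gradient descent applied to the error function
adjusts the weights as follows:

$$E^{(p)}(W) = \frac{1}{2} \sum_{i=1}^{n_y} \left(t_i^{(p)} - y_i^{(p)}\right)^2; \qquad \Delta^{(p)} w_{ij} = -\alpha \frac{\partial E^{(p)}(W)}{\partial w_{ij}}$$

● $w_{ij}$ is the weight between the jth input unit and the ith
output unit.
● The chain rule allows the derivative of the error function
to be calculated:

$$\frac{\partial E^{(p)}(W)}{\partial w_{ij}} = \frac{\partial E^{(p)}(W)}{\partial y_i^{(p)}} \, \frac{\partial y_i^{(p)}}{\partial w_{ij}}; \qquad \frac{\partial E^{(p)}(W)}{\partial y_i^{(p)}} = -\left(t_i^{(p)} - y_i^{(p)}\right)$$
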

LMS learning algorithm or Delta rule
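Since the activation function is linear:

$$y_i^{(p)} = net_i^{(p)} = \sum_{j=0}^{n_x} w_{ij} x_j^{(p)}$$

● where $x_j^{(p)}$ is the value of the jth position of the pth
input vector.

$$\Delta^{(p)} w_{ij} = -\alpha \frac{\partial E^{(p)}(W)}{\partial w_{ij}}; \qquad \frac{\partial E^{(p)}(W)}{\partial w_{ij}} = \frac{\partial E^{(p)}(W)}{\partial y_i^{(p)}} \, \frac{\partial y_i^{(p)}}{\partial w_{ij}}; \qquad \frac{\partial E^{(p)}(W)}{\partial y_i^{(p)}} = -\left(t_i^{(p)} - y_i^{(p)}\right); \qquad \frac{\partial y_i^{(p)}}{\partial w_{ij}} = x_j^{(p)}$$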

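The final weight adjustment for each training example is:

$$\Delta^{(p)} w_{ij} = \alpha \left(t_i^{(p)} - y_i^{(p)}\right) x_j^{(p)}$$

Matrix form:

$$\Delta^{(p)} W = \alpha \left(t^{(p)} - y^{(p)}\right) \left(x^{(p)}\right)^T$$

where $\alpha$ is the learning rate or step.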
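
One way to sanity-check the weight adjustment is to compare it with a numerical gradient of the per-sample error. The sketch below assumes the error has the 1/2 factor and that the bias is already folded into `W`; the function names and all numeric values are hypothetical.

```python
import numpy as np

def loss(W, x, t):
    """Per-sample error: half the sum of squared errors."""
    y = W @ x
    return 0.5 * np.sum((t - y) ** 2)

def delta_rule_step(W, x, t, alpha):
    """Weight adjustment in matrix form: alpha * (t - y) x^T."""
    y = W @ x
    return alpha * np.outer(t - y, x)

# Hypothetical sample: x[0] = 1 is the constant input that carries the bias.
W = np.array([[0.1,  0.2, -0.3, 0.4],
              [0.0, -0.5,  0.5, 0.1]])
x = np.array([1.0, 0.5, -1.0, 2.0])
t = np.array([1.0, -1.0])
alpha = 0.1

delta_W = delta_rule_step(W, x, t, alpha)

# Central finite differences approximate dE/dw_ij entry by entry.
eps = 1e-6
grad = np.zeros_like(W)
for i in range(W.shape[0]):
    for j in range(W.shape[1]):
        W_plus, W_minus = W.copy(), W.copy()
        W_plus[i, j] += eps
        W_minus[i, j] -= eps
        grad[i, j] = (loss(W_plus, x, t) - loss(W_minus, x, t)) / (2 * eps)

print(np.allclose(delta_W, -alpha * grad, atol=1e-6))
print(loss(W, x, t), loss(W + delta_W, x, t))
```

The adjustment matches minus the learning rate times the numerical gradient, and applying it lowers the error for this sample.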

Cost and loss functions
● We adopt the following convention:
    ● The loss function, $\mathcal{L}(W)$, compares, for each input
    example $x^{(p)}$, the computed output $y^{(p)}$ with the
    target $t^{(p)}$. It is a local error, e.g., SSE:

$$\mathcal{L}^{(p)}(W) = E^{(p)}(W) = \frac{1}{2} \sum_{i=1}^{n_y} \left(t_i^{(p)} - y_i^{(p)}\right)^2$$

    ● The cost function, J(W), involves the whole
    dataset of size m. It is a global error, e.g., MSE:

$$J(W) = MSE(W) = \frac{1}{m} \sum_{p=1}^{m} E^{(p)}(W)$$

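
The distinction between the local loss and the global cost can be sketched in a few lines. The code assumes the 1/2 factor in the per-sample loss and a plain average over the m samples; the tiny dataset and the names `sample_loss`, `cost` and `cost_vectorized` are made up for illustration.

```python
import numpy as np

def sample_loss(W, x, t):
    """Local error for one sample: half the sum of squared errors."""
    y = W @ x
    return 0.5 * np.sum((t - y) ** 2)

def cost(W, X, T):
    """Global error: mean of the per-sample losses over the m samples."""
    m = X.shape[0]
    return sum(sample_loss(W, X[p], T[p]) for p in range(m)) / m

def cost_vectorized(W, X, T):
    """Same cost computed for the whole dataset at once."""
    Y = X @ W.T                      # one row of outputs per sample
    return 0.5 * np.sum((T - Y) ** 2) / X.shape[0]

# Hypothetical dataset: m = 3 samples, column 0 of X is the constant 1.
X = np.array([[1.0, 0.0],
              [1.0, 1.0],
              [1.0, 2.0]])
T = np.array([[1.0], [2.0], [4.0]])
W = np.array([[1.0, 1.0]])           # predicts 1, 2, 3

J = cost(W, X, T)
print(J, cost_vectorized(W, X, T))
```

Under these assumptions, with a single output unit the cost equals half of the usual mean of squared errors.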
Linear regression: results
● Stop condition: 1,000 epochs
● 1634 samples for training; 204 for development
● Learning rate α = 0.1
● Test samples are not considered
● Dataset: MedianHouseValueContinuousOutput.csv
● MSE in training: 0.22
● MSE in dev: 0.238

[Figure: training MSE evolution.]

y(p)     t(p)
-0.17    -0.54
-0.21    0.17
-0.18    -0.54
-0.22    -0.41
-0.20    -0.43
-0.21    -0.26
-0.20    0.06

The MSE curve is already flat by 100 epochs.
The computed values are all around -0.2, the mean value of the target outputs.
Linear regression is being used to try to solve a non-linear problem.
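
The collapse of the outputs toward the mean target can be reproduced on a toy problem. The sketch below uses synthetic data (t = x² on a symmetric grid), not the course dataset, and swaps the per-sample updates for full-batch gradient descent, i.e. the delta rule averaged over all samples, so that the run is deterministic. The learning rate and epoch count mirror the values on the slide.

```python
import numpy as np

# Synthetic non-linear problem (an assumption for illustration only).
x = np.linspace(-1.0, 1.0, 201)
t = x ** 2

X = np.column_stack([np.ones_like(x), x])   # column 0 is the constant 1 (bias)
T = t[:, None]                              # one output unit

W = np.zeros((1, 2))
alpha, epochs = 0.1, 1000
m = X.shape[0]

for _ in range(epochs):
    Y = X @ W.T                             # outputs for every sample
    W += alpha * (T - Y).T @ X / m          # delta rule averaged over the dataset

Y = X @ W.T
mse = np.mean((T - Y) ** 2)
print(W, t.mean(), mse)
```

Because the targets have no linear trend in x, the fitted slope stays at zero and every output sits at the mean target, so the remaining error is the variance of the targets.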
Linear models cannot be deep
● Sandwiching in intermediate layers does not increase the
model's computational power or its accuracy.
● The general case for the ℓth layer is: y[ℓ] = W[ℓ]·y[ℓ-1]
    ● y[1] = W[1]·x
    ● y[2] = W[2]·y[1] = W[2]·W[1]·x
    ● …
    ● y[ℓ] = W[ℓ]·y[ℓ-1] = W[ℓ]·W[ℓ-1]·...·W[2]·W[1]·x

● An ℓ-layer neural network is computationally equivalent to a
one-layer network whose connection matrix (kernel) W is:
    ● W = W[ℓ]·W[ℓ-1]·...·W[2]·W[1]. Therefore, y = Wx.

● Therefore, linear models cannot be deep.

● The linear activation function may still be used in the output
layer, though, combined with non-linear activation functions in the other layers.
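
The equivalence between a stack of linear layers and a single matrix is easy to verify numerically. In this sketch the layer widths and random weights are hypothetical, and biases are omitted, as in the derivation above.

```python
import numpy as np

rng = np.random.default_rng(0)
sizes = [5, 4, 3, 2]   # hypothetical widths: 5 inputs, then layers of 4, 3 and 2 units

# One weight matrix W[l] per layer, shaped (units in layer l, units in layer l-1).
weights = [rng.normal(size=(n_out, n_in))
           for n_in, n_out in zip(sizes[:-1], sizes[1:])]
x = rng.normal(size=sizes[0])

# Forward pass through the "deep" linear network: y[l] = W[l] y[l-1].
y_deep = x
for W_l in weights:
    y_deep = W_l @ y_deep

# Collapse all the layers into a single matrix W = W[L] ... W[2] W[1].
W = np.eye(sizes[0])
for W_l in weights:
    W = W_l @ W
y_single = W @ x

print(W.shape, np.allclose(y_deep, y_single))
```

The collapsed matrix has one row per final output and one column per input, so the extra layers add no representational power.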
● The computed output y(p) tends to the mean
value of the discretized target outputs to
achieve the least MSE.
● A linear approach cannot solve a non-linear problem.
● We'll try a logistic regression
approach in the next lesson
to check whether or not
more accurate results are
achieved.
