2.2 Training. Linear Regression
Daniel Manrique
2021
Training Artificial Neural Networks
• Artificial neural networks learn from a dataset.
• Learning is the process of adjusting the weights and biases (parameters) of the connections between neurons so that the neural network adapts its responses to exhibit the expected target behavior: to fit the dataset.
• This set of weights and biases is the solution to the problem.
• The goal is to provide adequate answers to unseen data not present in the training dataset: generalization.
[Figure: a multilayer network with weight matrices W[1], W[2], W[3], taking MNIST digit images as input examples.]
Weight adjustment
Supervised learning:
[Figure: the input x feeds the neural architecture, which produces the output y; a comparison of y with the desired target output t drives the weight adjustment.]
[Figure: a single unit with 9 input variables (attributes) from MedianHouseValuePreparedCleanAttributes.csv, a constant +1 bias input, and example weight values.]
Training process
Linear activation function
• The linear activation function is the identity function.
$y = f(Wx + b) = Wx + b$

For each unit:

$y_i = f\left(\sum_{j=1}^{n_x} x_j w_{ij} + b_i\right) = \sum_{j=1}^{n_x} x_j w_{ij} + b_i$

$\text{net}_i = \sum_{j=1}^{n_x} x_j w_{ij} + b_i$

[Plot: f(x) = x, the identity function.]
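A minimal NumPy sketch of this forward pass (illustrative, not from the slides; the function name `linear_forward` and the shapes are assumptions):

```python
import numpy as np

def linear_forward(W, b, x):
    """Forward pass of a linear layer: with the identity
    activation, the output is simply net = W x + b."""
    net = W @ x + b   # net_i = sum_j x_j * w_ij + b_i
    return net        # f(net) = net for the linear activation

# Illustrative shapes: 2 units, 3 inputs.
W = np.array([[0.1, -0.2, 0.4],
              [0.3,  0.0, -0.1]])
b = np.array([0.5, -0.5])
x = np.array([1.0, 2.0, 3.0])
print(linear_forward(W, b, x))   # [ 1.4 -0.5]
```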
https://towardsdatascience.com/activation-functions-neural-networks-1cbd9f8d91d6
LMS learning algorithm or Delta rule
• The LMS (least mean squares) algorithm was designed to adjust a linear unit, or ADALINE.
• The bias $b$ is usually included as column 0 of $W$:
• $W_i = (w_{i,0}, w_{i,1}, \dots, w_{i,n_x})$, where $w_{i,0}$ is $b_i$.
• The input vector $x$ is likewise extended with a component at position 0 equal to the constant 1:
• $x^T = (1, x_1, x_2, \dots, x_{n_x})$.
• Therefore, the output vector is $y = Wx$; $y_i = \sum_{j=0}^{n_x} x_j w_{ij}$.
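The same bias-absorption trick in NumPy, a sketch with small illustrative shapes:

```python
import numpy as np

# Original parameters: weights W (ny x nx) and bias b (ny,).
W = np.array([[0.1, -0.2, 0.4],
              [0.3,  0.0, -0.1]])
b = np.array([0.5, -0.5])
x = np.array([1.0, 2.0, 3.0])

# Absorb the bias as column 0 of W and prepend a constant 1 to x.
W_aug = np.hstack([b[:, None], W])   # W_i = (w_i0, w_i1, ..., w_i,nx), w_i0 = b_i
x_aug = np.concatenate([[1.0], x])   # x^T = (1, x1, ..., x_nx)

# The augmented product reproduces W x + b exactly.
assert np.allclose(W_aug @ x_aug, W @ x + b)
```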
• Given a linear regression problem, the loss (error) function for a single training sample $p$, with $n_x$ attributes and $n_y$ outputs, is:

$\mathcal{L}^{(p)}(W) = E^{(p)}(W) = \frac{1}{2} \sum_{i=1}^{n_y} \left(t_i^{(p)} - y_i^{(p)}\right)^2$

$W$ is the matrix of weights.
$t_i^{(p)}$ is the desired target output for the $i$th unit: the label corresponding to the $p$th training sample.
$y_i^{(p)}$ is the calculated output for the $i$th unit and $p$th input vector.
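A sketch of this per-sample loss in NumPy. The ½ factor follows the reconstruction above (it cancels the 2 produced by the derivative), and `sample_loss` is a hypothetical helper name:

```python
import numpy as np

def sample_loss(W_aug, x_aug, t):
    """E^(p)(W) = 1/2 * sum_i (t_i - y_i)^2 for one training
    sample, with y = W x on the bias-augmented vectors."""
    y = W_aug @ x_aug
    return 0.5 * np.sum((t - y) ** 2)
```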
LMS learning algorithm or Delta rule
• Gradient descent applied to the error function adjusts the weights as follows:

$E^{(p)}(W) = \frac{1}{2} \sum_{i=1}^{n_y} \left(t_i^{(p)} - y_i^{(p)}\right)^2; \qquad \Delta^{(p)} w_{ij} = -\alpha \, \frac{\partial E^{(p)}(W)}{\partial w_{ij}}$
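Before the analytic derivative is worked out, the generic update rule can be sketched with finite differences; `numeric_grad` and `gradient_descent_step` are illustrative names, not part of the course material:

```python
import numpy as np

def numeric_grad(E, W, eps=1e-6):
    """Central finite-difference estimate of dE/dw_ij for every weight."""
    g = np.zeros_like(W)
    for idx in np.ndindex(*W.shape):
        Wp, Wm = W.copy(), W.copy()
        Wp[idx] += eps
        Wm[idx] -= eps
        g[idx] = (E(Wp) - E(Wm)) / (2 * eps)
    return g

def gradient_descent_step(E, W, alpha=0.1):
    """One update: w_ij <- w_ij - alpha * dE/dw_ij."""
    return W - alpha * numeric_grad(E, W)
```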
• $w_{ij}$ is the weight between the $j$th input unit and the $i$th output unit.
• The chain rule makes it possible to compute the derivative of the error function:
$\frac{\partial E^{(p)}(W)}{\partial w_{ij}} = \frac{\partial E^{(p)}(W)}{\partial y_i^{(p)}} \cdot \frac{\partial y_i^{(p)}}{\partial w_{ij}}; \qquad \frac{\partial E^{(p)}(W)}{\partial y_i^{(p)}} = -\left(t_i^{(p)} - y_i^{(p)}\right)$
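A quick numerical check of that second factor, using hypothetical values; `E_of_y` is an illustrative helper, not from the slides:

```python
import numpy as np

def E_of_y(y, t):
    """Loss as a function of the outputs only."""
    return 0.5 * np.sum((t - y) ** 2)

t = np.array([1.0, -0.5])
y = np.array([0.3,  0.2])

analytic = -(t - y)                      # dE/dy_i = -(t_i - y_i)

eps = 1e-6
numeric = np.array([(E_of_y(y + eps * e, t) - E_of_y(y - eps * e, t)) / (2 * eps)
                    for e in np.eye(len(y))])

assert np.allclose(analytic, numeric)    # both give [-0.7, 0.7]
```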
LMS learning algorithm or Delta rule
Since the activation function is linear:
$y_i^{(p)} = \text{net}_i^{(p)} = \sum_{j=0}^{n_x} w_{ij} \, x_j^{(p)}$

• where $x_j^{(p)}$ is the value at the $j$th position of the $p$th input vector.
$\Delta^{(p)} w_{ij} = -\alpha \, \frac{\partial E^{(p)}(W)}{\partial w_{ij}}; \quad \frac{\partial E^{(p)}(W)}{\partial w_{ij}} = \frac{\partial E^{(p)}(W)}{\partial y_i^{(p)}} \cdot \frac{\partial y_i^{(p)}}{\partial w_{ij}}; \quad \frac{\partial E^{(p)}(W)}{\partial y_i^{(p)}} = -\left(t_i^{(p)} - y_i^{(p)}\right); \quad \frac{\partial y_i^{(p)}}{\partial w_{ij}} = x_j^{(p)}$

Combining the factors gives the delta rule: $\Delta^{(p)} w_{ij} = \alpha \left(t_i^{(p)} - y_i^{(p)}\right) x_j^{(p)}$.
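The combined rule as a minimal NumPy sketch; the function name `delta_rule_step` is hypothetical:

```python
import numpy as np

def delta_rule_step(W, x, t, alpha):
    """One LMS / delta-rule update on a single sample:
    w_ij <- w_ij + alpha * (t_i - y_i) * x_j,
    with x already augmented by the constant 1 for the bias."""
    y = W @ x
    return W + alpha * np.outer(t - y, x)
```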
Results on MedianHouseValuePreparedCleanAttributes.csv with targets from MedianHouseValueContinuousOutput.csv:

MSE in training: 0.22
MSE in dev: 0.238

y(p)     t(p)
-0.17   -0.54
-0.21    0.17
-0.18   -0.54
-0.22   -0.41
-0.20   -0.43
-0.21   -0.26
-0.20    0.06
The MSE curve flattens out after only 100 epochs.
The computed outputs all sit around -0.2, the mean value of the target outputs.
We are trying to solve a non-linear problem with linear regression.
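A hedged sketch of how such an experiment might be reproduced. The CSV layouts (delimiter, header row, column order), the learning rate, and the zero initialization are all assumptions, not taken from the course files:

```python
import numpy as np

# Assumed layout: comma-separated files with one header row; adjust to the real files.
X = np.loadtxt("MedianHouseValuePreparedCleanAttributes.csv", delimiter=",", skiprows=1)
t = np.loadtxt("MedianHouseValueContinuousOutput.csv", delimiter=",", skiprows=1)

X = np.hstack([np.ones((X.shape[0], 1)), X])  # constant 1 for the bias weight
w = np.zeros(X.shape[1])
alpha, epochs = 0.01, 100                     # learning rate chosen arbitrarily

for _ in range(epochs):
    for x_p, t_p in zip(X, t):                # one delta-rule update per sample
        w += alpha * (t_p - (w @ x_p)) * x_p

print("MSE:", np.mean((X @ w - t) ** 2))
```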
Linear models cannot be deep
• Inserting intermediate linear layers does not increase the model's computational power or its accuracy.
• The general case for the $\ell$th layer is: $y^{[\ell]} = W^{[\ell]} \, y^{[\ell-1]}$
• $y^{[1]} = W^{[1]} x$
• $y^{[2]} = W^{[2]} y^{[1]} = W^{[2]} W^{[1]} x$
• …
• $y^{[\ell]} = W^{[\ell]} y^{[\ell-1]} = W^{[\ell]} W^{[\ell-1]} \cdots W^{[2]} W^{[1]} x = W' x$: the product of the weight matrices is itself a single matrix $W'$, so the whole stack is equivalent to one linear layer (see the sketch below).
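A tiny NumPy demonstration of this collapse, with illustrative shapes and random values:

```python
import numpy as np

rng = np.random.default_rng(0)
x  = rng.normal(size=4)
W1 = rng.normal(size=(5, 4))      # layer 1 weights
W2 = rng.normal(size=(3, 5))      # layer 2 weights

deep    = W2 @ (W1 @ x)           # two stacked linear layers, no non-linearity
shallow = (W2 @ W1) @ x           # the equivalent single layer W' = W2 W1
assert np.allclose(deep, shallow)
```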