
How is generalization possible?

Necessary conditions for good generalization:

1. The function you are trying to learn must be, in some sense, smooth. In other words, a small change in the inputs should, most of the time, produce a small change in the outputs.
2. The training cases must be a sufficiently large and representative subset of the set of all cases that you want to generalize to.

11 data points obtained by sampling h(x) at equal intervals of x and adding random noise. The solid curve shows the output of a linear network.

Here we use a network with more free parameters than the earlier one. This network is more flexible, and the approximation improves.

Here we use a network with many more free parameters than the earlier one. This complex network gives a perfect fit to the training data, but a poor representation of the underlying function.

We want neither too simple nor too complex a model. Model complexity can be controlled by controlling the number of free parameters.
Regularization

Adding a penalty term to the error function to control the model complexity. Assume a model with many free parameters. The total error is then

Ẽ = E + ν Ω

where Ω is called a regularization term (regularizer). The parameter ν controls the extent to which Ω influences the form of the solution.

In the figure the function (a function with a lot of flexibility) has large oscillations, and hence regions of large curvature. We might therefore choose a regularization function which is large for functions with large values of the second derivative, such as

Ω = (1/2) ∫ (d²y/dx²)² dx
Weight Decay
• Weight decay adds a penalty term to the error function. The usual
penalty is the sum of squared weights times a decay constant. Weight
decay is a subset of regularization methods. The penalty term in
weight decay, by definition, penalizes large weights.
• The weight decay penalty term causes the weights to converge to
smaller absolute values than they otherwise would. Large weights can
hurt generalization in two different ways. Excessively large weights
leading to hidden units can cause the output function to be too rough,
possibly with near discontinuities. Excessively large weights leading
to output units can cause wild outputs far beyond the range of the data
if the output activation function is not bounded to the same range as
the data.

Ẽ = E + (ν/2) Σᵢ wᵢ²

where the sum runs over all weights and biases.
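
As a rough sketch (not part of the original slides), one gradient-descent step with weight decay simply adds ν·w to the gradient of the data error. The weights, gradient, learning rate and decay constant below are made-up illustrative values:

% One gradient-descent step with weight decay (illustrative values only).
w     = [0.8; -1.5; 2.0];      % current weights
gradE = [0.1; -0.3; 0.5];      % gradient of the data error E with respect to w
alpha = 0.1;                   % learning rate
nu    = 0.01;                  % decay constant nu
% Total error Etilde = E + (nu/2)*sum(w.^2), so its gradient is gradE + nu*w;
% the extra nu*w term pulls each weight toward zero.
w = w - alpha*(gradE + nu*w);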


Generalization Error
• Components of generalization error
– Bias: how much does the average model over all training sets differ from the true model?
• Error due to inaccurate assumptions/simplifications made by the model
– Variance: how much do models estimated from different training sets differ from each other?
• Underfitting: model is too “simple” to represent all the
relevant class characteristics
– High bias and low variance
– High training error and high test error
• Overfitting: model is too “complex” and fits irrelevant
characteristics (noise) in the data
– Low bias and high variance
– Low training error and high test error
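
A minimal sketch of under- and overfitting (not from the slides): polynomials of increasing degree are fit to a few noisy samples of a smooth function, and training and test errors are compared. The function h, the noise level and the degrees are arbitrary illustrative choices:

% Under- vs overfitting with polynomial fits (illustrative setup).
xtr = linspace(0, 1, 11)';               % 11 training inputs
xte = linspace(0, 1, 101)';              % test inputs
h   = @(x) sin(2*pi*x);                  % assumed "true" smooth function
ytr = h(xtr) + 0.2*randn(size(xtr));     % noisy training targets
yte = h(xte);                            % noise-free test targets
for d = [1 3 10]                         % too simple, moderate, too complex
    p   = polyfit(xtr, ytr, d);          % fit a degree-d polynomial
    etr = mean((polyval(p, xtr) - ytr).^2);   % training error
    ete = mean((polyval(p, xte) - yte).^2);   % test error
    fprintf('degree %2d: train MSE %.3f, test MSE %.3f\n', d, etr, ete);
end

In a typical run the degree-1 fit shows high training and test error (underfitting), while the degree-10 fit drives the training error toward zero but increases the test error (overfitting); MATLAB may also warn that the degree-10 fit is badly conditioned.
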
Bias-Variance Trade-off

• Models with too few parameters are inaccurate because of a large bias (not enough flexibility).

• Models with too many parameters are inaccurate because of a large variance (too much sensitivity to the sample).
Bias-variance tradeoff

[Figure: training and test error versus model complexity. Low complexity: high bias, low variance (underfitting, high training and test error). High complexity: low bias, high variance (overfitting, low training error but high test error).]
Bias-variance tradeoff

[Figure: test error versus model complexity (high bias/low variance on the left, low bias/high variance on the right), for few training examples and for many training examples.]
Momentum
Performance Optimization
Taylor Series Expansion

F(x) = F(x*) + F'(x*)(x − x*) + (1/2) F''(x*)(x − x*)^2 + … + (1/n!) F^(n)(x*)(x − x*)^n + …

where all derivatives F', F'', …, F^(n) are evaluated at x = x*.
Example

F(x) = e^(−x)

Taylor series of F(x) about x* = 0:

F(x) = e^(−x) = e^(−0) − e^(−0)(x − 0) + (1/2) e^(−0)(x − 0)^2 − (1/6) e^(−0)(x − 0)^3 + …

F(x) = 1 − x + (1/2) x^2 − (1/6) x^3 + …

Taylor series approximations:

F(x) ≈ F0(x) = 1
F(x) ≈ F1(x) = 1 − x
F(x) ≈ F2(x) = 1 − x + (1/2) x^2
Plot of Approximations

[Figure: F(x) = e^(−x) together with F0(x), F1(x) and F2(x) on the interval [−2, 2]; each higher-order approximation matches F(x) over a wider region around x* = 0.]
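
As a small sketch (not in the original slides), the three approximations can be evaluated and plotted directly:

% Compare exp(-x) with its zeroth-, first- and second-order Taylor approximations about x* = 0.
x  = linspace(-2, 2, 200);
F  = exp(-x);
F0 = ones(size(x));          % F0(x) = 1
F1 = 1 - x;                  % F1(x) = 1 - x
F2 = 1 - x + x.^2/2;         % F2(x) = 1 - x + (1/2)x^2
plot(x, F, x, F0, x, F1, x, F2);
legend('e^{-x}', 'F_0', 'F_1', 'F_2');
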
Vector Case

F(x) = F(x1, x2, …, xn)

F(x) = F(x*) + ∂F/∂x1|x=x* (x1 − x1*) + ∂F/∂x2|x=x* (x2 − x2*) + … + ∂F/∂xn|x=x* (xn − xn*)
       + (1/2) ∂²F/∂x1²|x=x* (x1 − x1*)^2 + (1/2) ∂²F/∂x1∂x2|x=x* (x1 − x1*)(x2 − x2*) + …
Matrix Form

F(x) = F(x*) + ∇F(x*)^T (x − x*) + (1/2) (x − x*)^T ∇²F(x*) (x − x*) + …

Gradient:

∇F(x) = [∂F/∂x1; ∂F/∂x2; …; ∂F/∂xn]

Hessian:

∇²F(x) = [∂²F/∂x1²     ∂²F/∂x1∂x2   …  ∂²F/∂x1∂xn;
          ∂²F/∂x2∂x1   ∂²F/∂x2²     …  ∂²F/∂x2∂xn;
          …
          ∂²F/∂xn∂x1   ∂²F/∂xn∂x2   …  ∂²F/∂xn²]
Directional Derivatives
First derivative (slope) of F(x) along the xi axis: ∂F(x)/∂xi (the i-th element of the gradient)

Second derivative (curvature) of F(x) along the xi axis: ∂²F(x)/∂xi² (the (i,i) element of the Hessian)

First derivative (slope) of F(x) along a vector p: p^T ∇F(x) / ||p||

Second derivative (curvature) of F(x) along a vector p: p^T ∇²F(x) p / ||p||^2
Example

F(x) = x1^2 + 2 x1 x2 + 2 x2^2,   x* = [0.5; 0],   p = [1; −1]

∇F(x) = [∂F/∂x1; ∂F/∂x2] = [2x1 + 2x2; 2x1 + 4x2],   ∇F(x*) = [1; 1]

Slope along p: p^T ∇F(x*) / ||p|| = ([1  −1][1; 1]) / √2 = 0/√2 = 0
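
A small numerical check of the same calculation (not from the slides); the Hessian A below follows from the second partial derivatives of F:

% Gradient and directional derivatives of F(x) = x1^2 + 2*x1*x2 + 2*x2^2
% at x* = [0.5; 0] along p = [1; -1].
xs = [0.5; 0];
p  = [1; -1];
g  = [2*xs(1) + 2*xs(2);            % dF/dx1
      2*xs(1) + 4*xs(2)];           % dF/dx2
slope = (p'*g) / norm(p)            % slope along p -> 0
A = [2 2; 2 4];                     % Hessian of F (constant for a quadratic)
curv  = (p'*A*p) / norm(p)^2        % curvature along p -> 1
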
Plots

[Figure: contour plot of F(x) showing directional derivatives (ranging from 0.0 to 1.4) along several directions from x*, and a surface plot of F(x) over [−2, 2] × [−2, 2].]
Minima

Strong Minimum

The point x* is a strong minimum of F(x) if a scalar δ > 0 exists such that F(x*) < F(x* + Δx) for all Δx such that δ > ||Δx|| > 0.

Global Minimum

The point x* is a unique global minimum of F(x) if F(x*) < F(x* + Δx) for all Δx ≠ 0.

Weak Minimum

The point x* is a weak minimum of F(x) if it is not a strong minimum, and a scalar δ > 0 exists such that F(x*) ≤ F(x* + Δx) for all Δx such that δ > ||Δx|| > 0.
Scalar Example

F(x) = 3x^4 − 7x^2 − (1/2)x + 6

[Figure: plot of F(x) on [−2, 2], showing a strong maximum near x = 0, a strong (local) minimum near x = −1, and the global minimum near x = 1.]
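
As a quick sketch (not part of the slides), the stationary points of this polynomial can be located numerically by finding the roots of its derivative:

% Stationary points of F(x) = 3x^4 - 7x^2 - (1/2)x + 6.
dF = [12 0 -14 -0.5];               % coefficients of F'(x) = 12x^3 - 14x - 1/2
xs = roots(dF)                      % three stationary points
Fs = polyval([3 0 -7 -0.5 6], xs)   % F evaluated at each stationary point
% The sign of F''(x) = 36x^2 - 14 at each root tells whether it is a minimum or a maximum.
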
Vector Example

F(x) = (x2 − x1)^4 + 8 x1 x2 − x1 + x2 + 3          F(x) = (x1^2 − 1.5 x1 x2 + 2 x2^2) x1^2

[Figure: contour and surface plots of the two functions over [−2, 2] × [−2, 2].]
F(x) = (x2 − x1)^4 + 8 x1 x2 − x1 + x2 + 3

[x,y] = meshgrid(-2:.01:2,-2:.01:2);
Fa = (y-x).^4 + 8*x.*y - x + y + 3;
contour(x,y,Fa,[1,2,3,4,5,6,7,8]);

[Figure: the resulting contour plot over [−2, 2] × [−2, 2].]
Quadratic Functions

F(x) = (1/2) x^T A x + d^T x + c   (A symmetric)

Gradient of a quadratic function:

∇F(x) = A x + d

Hessian of a quadratic function:

∇²F(x) = A
• If the eigenvalues of the Hessian matrix are all positive,
the function will have a single strong minimum.
• If the eigenvalues are all negative, the function will have
a single strong maximum.
• If some eigenvalues are positive and other eigenvalues
are negative, the function will have a single saddle point.
• If the eigenvalues are all nonnegative, but some
eigenvalues are zero, then the function will either have a
weak minimum or will have no stationary point.
• If the eigenvalues are all nonpositive, but some
eigenvalues are zero, then the function will either have a
weak maximum or will have no stationary point.
Stationary point nature summary

x^T A x        Eigenvalues λᵢ           Definiteness of H        Nature of x*
> 0            all λᵢ > 0               positive definite        minimum
≥ 0            all λᵢ ≥ 0, some zero    positive semidefinite    valley (weak minimum)
both signs     mixed signs              indefinite               saddle point
≤ 0            all λᵢ ≤ 0, some zero    negative semidefinite    ridge (weak maximum)
< 0            all λᵢ < 0               negative definite        maximum
Steepest Descent

F(x) = x1^2 + 2 x1 x2 + 2 x2^2 + x1

x0 = [0.5; 0.5],   α = 0.1

∇F(x) = [∂F/∂x1; ∂F/∂x2] = [2x1 + 2x2 + 1; 2x1 + 4x2]

g0 = ∇F(x0) = [3; 3]

x1 = x0 − α g0 = [0.5; 0.5] − 0.1 [3; 3] = [0.2; 0.2]

x2 = x1 − α g1 = [0.2; 0.2] − 0.1 [1.8; 1.2] = [0.02; 0.08]

[Figure: contour plot of F(x) with the first steepest-descent steps over [−2, 2] × [−2, 2].]
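
A minimal sketch (not from the slides) of the same iteration in code, using the fact that the gradient of this quadratic is Ax + d with A = [2 2; 2 4] and d = [1; 0]:

% Steepest descent on F(x) = x1^2 + 2*x1*x2 + 2*x2^2 + x1.
A = [2 2; 2 4];  d = [1; 0];    % grad F(x) = A*x + d
x = [0.5; 0.5];                 % initial guess x0
alpha = 0.1;                    % learning rate
for k = 1:2
    g = A*x + d;                % gradient at the current point
    x = x - alpha*g;            % steepest-descent step
end
x                               % after two steps: [0.02; 0.08]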

If A is a symmetric matrix with eigenvalues λᵢ, then the eigenvalues of [I − αA] are 1 − αλᵢ.
Stable Learning Rates
(Quadratic)

F(x) = (1/2) x^T A x + d^T x + c

∇F(x) = A x + d

xₖ₊₁ = xₖ − α gₖ = xₖ − α (A xₖ + d)   →   xₖ₊₁ = [I − αA] xₖ − α d

Stability is determined by the eigenvalues of the matrix [I − αA]:

[I − αA] zᵢ = zᵢ − α A zᵢ = zᵢ − α λᵢ zᵢ = (1 − α λᵢ) zᵢ

where λᵢ is an eigenvalue of A and zᵢ the corresponding eigenvector, so the eigenvalues of [I − αA] are 1 − α λᵢ.

Stability requirement:

|1 − α λᵢ| < 1   ⇒   α < 2/λᵢ for all i   ⇒   α < 2/λmax
Example

A = [2 2; 2 4]

λ1 = 0.764,  z1 = [0.851; −0.526]
λ2 = 5.24,   z2 = [0.526; 0.851]

α < 2/λmax = 2/5.24 = 0.38

[Figure: steepest-descent trajectories on the contour plot of F(x) for α = 0.37 (stable, converges to the minimum) and α = 0.39 (unstable, diverges).]
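
As a closing sketch (not in the slides), the stability limit can be computed from the largest eigenvalue of A, and running steepest descent just below and just above it shows convergence versus divergence; the 100 iterations are an arbitrary illustrative choice:

% Maximum stable learning rate and behaviour on either side of it.
A = [2 2; 2 4];  d = [1; 0];
alpha_max = 2 / max(eig(A))             % about 0.38
for alpha = [0.37 0.39]
    x = [0.5; 0.5];
    for k = 1:100
        x = x - alpha*(A*x + d);        % steepest-descent step
    end
    fprintf('alpha = %.2f: x after 100 steps = [%g, %g]\n', alpha, x(1), x(2));
end
% With alpha = 0.37 the iterates settle near the minimum [-1; 0.5];
% with alpha = 0.39 they grow without bound.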
