ML-Basics: Optimization
Learning goals
Understand how the risk function R_emp depends on the parameters of a model
Understand the idea of gradient descent as a basic risk optimizer
LEARNING AS PARAMETER OPTIMIZATION
[Figure: empirical risk surface R_emp(θ) over the parameters θ1 and θ2]
LEARNING AS PARAMETER OPTIMIZATION
The ERM optimization problem is:
$$\hat{\theta} = \arg\min_{\theta} R_{emp}(\theta)$$
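For concreteness, a minimal Python sketch of this problem (not from the slides; the toy data, the linear model and the use of scipy.optimize.minimize are assumptions made for illustration):

import numpy as np
from scipy.optimize import minimize

# Toy data (assumed): n = 50 observations, p = 2 features, known true parameters.
rng = np.random.default_rng(0)
X = rng.normal(size=(50, 2))
y = X @ np.array([2.0, -1.0]) + rng.normal(scale=0.1, size=50)

# Empirical risk of a linear model f(x) = theta^T x under squared error loss.
def R_emp(theta):
    return np.sum((X @ theta - y) ** 2)

# ERM: theta_hat = argmin_theta R_emp(theta), here solved by a generic optimizer.
result = minimize(R_emp, x0=np.zeros(2))
print("theta_hat:", result.x, "  R_emp(theta_hat):", result.fun)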
LOCAL MINIMA
If $R_{emp}$ is continuous in $\theta$, we can define a local minimum $\hat{\theta}$:
$$\exists\, \epsilon > 0 \;\; \forall \theta \text{ with } \|\hat{\theta} - \theta\| < \epsilon: \quad R_{emp}(\hat{\theta}) \leq R_{emp}(\theta).$$
[Figure: curve of R_emp(θ) with a global and a local minimum marked]
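A small numeric illustration (the toy risk below is an assumption, not from the slides): this one-dimensional function has a global minimum near θ ≈ −1.30 and a second, only local, minimum near θ ≈ 1.13.

import numpy as np

# Assumed toy risk with two minima: global near theta = -1.30, local near theta = 1.13.
def R_emp(theta):
    return theta**4 - 3 * theta**2 + theta

grid = np.linspace(-3, 3, 10001)

# Global minimum: smallest value over the whole grid.
print("global minimum near theta =", grid[np.argmin(R_emp(grid))])

# Local minimum: smallest value when the search is restricted to theta > 0;
# R_emp is larger here than at the global minimum.
right = grid[grid > 0]
print("local minimum near theta =", right[np.argmin(R_emp(right))])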
LOCAL MINIMA AND STATIONARY POINTS
If $R_{emp}$ is continuously differentiable in $\theta$, then a sufficient condition for a local minimum is that $\hat{\theta}$ is a stationary point with zero gradient, so no local improvement is possible:
$$\frac{\partial}{\partial \theta} R_{emp}(\hat{\theta}) = 0,$$
and that the Hessian $\frac{\partial^2}{\partial \theta^2} R_{emp}(\hat{\theta})$ is positive definite. While the negative gradient points in the direction of fastest local decrease, the Hessian measures the local curvature of $R_{emp}$.
[Figure: three surfaces of R_emp over (θ1, θ2), panel titles: ∂/∂θ R_emp(θ) const., const. pos. def. Hessian, const. neg. def. Hessian]
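A sketch of how these conditions can be checked numerically (the toy risk and the use of central finite differences are assumptions for illustration):

import numpy as np

# Assumed toy risk over theta = (theta1, theta2): a simple convex bowl
# with its minimum at theta_hat = (1, -0.5).
def R_emp(theta):
    return (theta[0] - 1.0) ** 2 + 2.0 * (theta[1] + 0.5) ** 2

def num_gradient(f, theta, h=1e-5):
    # Central finite-difference approximation of the gradient.
    g = np.zeros_like(theta)
    for i in range(len(theta)):
        e = np.zeros_like(theta)
        e[i] = h
        g[i] = (f(theta + e) - f(theta - e)) / (2 * h)
    return g

def num_hessian(f, theta, h=1e-4):
    # Each Hessian column is a finite difference of the gradient.
    p = len(theta)
    H = np.zeros((p, p))
    for i in range(p):
        e = np.zeros_like(theta)
        e[i] = h
        H[:, i] = (num_gradient(f, theta + e) - num_gradient(f, theta - e)) / (2 * h)
    return H

theta_hat = np.array([1.0, -0.5])                      # candidate point
print("gradient:", num_gradient(R_emp, theta_hat))     # approximately (0, 0)
H = num_hessian(R_emp, theta_hat)
print("Hessian positive definite:", bool(np.all(np.linalg.eigvalsh(H) > 0)))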
LEAST SQUARES ESTIMATOR
Now, for given features $X \in \mathbb{R}^{n \times p}$ and target $y \in \mathbb{R}^n$, we want to find the best linear model with respect to the squared error loss, i.e.,
$$R_{emp}(\theta) = \|X\theta - y\|_2^2 = \sum_{i=1}^{n} \left(\theta^\top x^{(i)} - y^{(i)}\right)^2.$$
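This risk can be minimized directly; a NumPy sketch (the random X and y are assumptions, and the closed-form normal equations are a standard result not shown on this slide):

import numpy as np

# Assumed toy data: n = 100 observations, p = 3 features.
rng = np.random.default_rng(1)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 0.5, -2.0]) + rng.normal(scale=0.1, size=100)

def R_emp(theta):
    # Squared error risk ||X theta - y||_2^2
    return np.sum((X @ theta - y) ** 2)

# Least squares solution via a numerically stable solver ...
theta_lstsq, *_ = np.linalg.lstsq(X, y, rcond=None)

# ... and via the normal equations (X^T X) theta = X^T y.
theta_normal = np.linalg.solve(X.T @ X, X.T @ y)

print("theta_hat:", theta_lstsq, "  R_emp:", R_emp(theta_lstsq))
print("both approaches agree:", np.allclose(theta_lstsq, theta_normal))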
GRADIENT DESCENT
The simple idea of GD is to iteratively move from the current candidate $\theta^{[t]}$ in the direction of the negative gradient, i.e., the direction of steepest descent, scaled by a learning rate $\alpha$, to the next candidate $\theta^{[t+1]}$:
$$\theta^{[t+1]} = \theta^{[t]} - \alpha \frac{\partial}{\partial \theta} R_{emp}(\theta^{[t]}).$$
[Figure: R_emp surface over θ1 with the starting point θ^[0] marked]
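A minimal NumPy sketch of this update rule, applied to the squared error risk from the previous slide (the data, the starting point θ^[0] = 0, the learning rate and the number of steps are assumptions for illustration):

import numpy as np

# Assumed toy data for the least squares risk R_emp(theta) = ||X theta - y||^2.
rng = np.random.default_rng(2)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 0.5, -2.0]) + rng.normal(scale=0.1, size=100)

def R_emp(theta):
    return np.sum((X @ theta - y) ** 2)

def gradient(theta):
    # d/dtheta ||X theta - y||^2 = 2 X^T (X theta - y)
    return 2 * X.T @ (X @ theta - y)

alpha = 0.001              # learning rate (assumed)
theta = np.zeros(3)        # theta^[0]
for t in range(100):
    # theta^[t+1] = theta^[t] - alpha * gradient of R_emp at theta^[t]
    theta = theta - alpha * gradient(theta)

print("theta after 100 steps:", theta, "  R_emp:", R_emp(theta))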
GRADIENT DESCENT - EXAMPLE
We take a step $-\alpha \frac{\partial}{\partial \theta} R_{emp}(\theta^{[0]})$ and arrive at $\theta^{[1]}$ with $R_{emp}(\theta^{[1]}) \approx 42.73$.
We improved: $R_{emp}(\theta^{[1]}) < R_{emp}(\theta^{[0]})$.
[Figure: the step from θ^[0] to θ^[1] shown on the R_emp surface over (θ1, θ2)]
GRADIENT DESCENT - EXAMPLE
Again we follow in the direction of the negative gradient, taking the step $-\alpha \frac{\partial}{\partial \theta} R_{emp}(\theta^{[1]})$.
[Figure: the second gradient descent step shown on the R_emp surface over θ1]
GRADIENT DESCENT - EXAMPLE
We iterate this until convergence or termination.
[Figure: the sequence of gradient descent steps approaching the minimum of R_emp over (θ1, θ2)]
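Putting the example together, a sketch of the full loop (tolerance, learning rate and iteration limit are assumed settings): we repeat the update until the improvement in R_emp falls below a tolerance or the iteration limit is reached.

import numpy as np

# Assumed toy data for the least squares risk, as before.
rng = np.random.default_rng(3)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 0.5, -2.0]) + rng.normal(scale=0.1, size=100)

R_emp = lambda theta: np.sum((X @ theta - y) ** 2)
gradient = lambda theta: 2 * X.T @ (X @ theta - y)

alpha, tol, max_iter = 0.001, 1e-10, 10_000     # assumed settings
theta = np.zeros(3)
risk = R_emp(theta)
for t in range(max_iter):
    theta = theta - alpha * gradient(theta)
    new_risk = R_emp(theta)
    if t < 3:
        print(f"R_emp(theta^[{t + 1}]) = {new_risk:.2f}")   # the risk keeps decreasing
    if abs(risk - new_risk) < tol:                           # termination criterion
        risk = new_risk
        break
    risk = new_risk

print(f"stopped after {t + 1} iterations, R_emp = {risk:.4f}")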
GRADIENT DESCENT - LEARNING RATE
The negative gradient is a direction that looks locally promising for reducing $R_{emp}$. Hence, it weights those components higher in which $R_{emp}$ decreases more.
However, the length of $-\frac{\partial}{\partial \theta} R_{emp}$ measures only the local rate of decrease, i.e., there is no guarantee that we will not go "too far".
We use a learning rate $\alpha$ to scale the step length in each iteration. Too large a rate can lead to overstepping and no convergence; too small a rate leads to slow convergence.
Usually, a simple constant rate or a rate-decrease mechanism is used to enforce local convergence.
[Figure: gradient descent paths over (θ1, θ2) for three learning rates: good convergence for α1, poor convergence for α2 (< α1), no convergence for α3 (> α1)]
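A sketch comparing three assumed learning rates on the same least squares risk; as in the figure, a well-chosen rate converges quickly, a much smaller one converges slowly, and one that is too large diverges.

import numpy as np

# Assumed toy data for the least squares risk.
rng = np.random.default_rng(4)
X = rng.normal(size=(100, 3))
y = X @ np.array([1.0, 0.5, -2.0]) + rng.normal(scale=0.1, size=100)

R_emp = lambda theta: np.sum((X @ theta - y) ** 2)
gradient = lambda theta: 2 * X.T @ (X @ theta - y)

def run_gd(alpha, steps=50):
    # Run a fixed number of gradient descent steps from theta^[0] = 0.
    theta = np.zeros(3)
    for _ in range(steps):
        theta = theta - alpha * gradient(theta)
    return R_emp(theta)

# alpha_1: good convergence; alpha_2 (< alpha_1): slow; alpha_3 (> alpha_1): divergence.
for name, alpha in [("alpha_1", 2e-3), ("alpha_2", 1e-4), ("alpha_3", 2e-2)]:
    print(name, "->", run_gd(alpha))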
FURTHER TOPICS