CSC2515 Lecture 6: Optimization
Maximum Likelihood
Basic ML question: for which setting of the parameters is the data we saw the most likely?

Assume the training data are i.i.d., compute the log likelihood, and form a function $\ell(w)$ that depends on the fixed training set we saw and on the argument $w$:

\ell(w) = \log p(y_1, x_1, \ldots, y_n, x_n \mid w)
        = \log \prod_n p(y_n, x_n \mid w)        (since the data are i.i.d.)
        = \sum_n \log p(y_n, x_n \mid w)         (since the log of a product is a sum of logs)
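As a concrete illustration (not from the slides): a minimal Python sketch that evaluates this sum-of-logs likelihood for a 1-D Gaussian with known variance and finds the maximum-likelihood mean on a grid; the data, grid, and noise level are made up for the example, and NumPy is assumed to be available.

import numpy as np

def gaussian_log_likelihood(mu, sigma, x):
    # ell(mu) = sum_n log p(x_n | mu, sigma) for i.i.d. Gaussian data
    return np.sum(-0.5 * np.log(2 * np.pi * sigma**2)
                  - (x - mu)**2 / (2 * sigma**2))

rng = np.random.default_rng(0)
x = rng.normal(loc=2.0, scale=1.0, size=100)   # toy i.i.d. training set

# Evaluate ell(mu) on a grid; the maximizer should sit near the sample mean.
grid = np.linspace(0.0, 4.0, 401)
lls = [gaussian_log_likelihood(mu, 1.0, x) for mu in grid]
print(grid[int(np.argmax(lls))], x.mean())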
Going Bayesian
The posterior distribution of the parameters given the data is
p(w \mid y_1, x_1, y_2, x_2, \ldots, y_n, x_n).
A true Bayesian would integrate over the posterior to make predictions.
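Spelled out (standard identities, not reconstructed from the slide): writing $D = \{(x_1, y_1), \ldots, (x_n, y_n)\}$, the posterior comes from Bayes' rule and the prediction for a new input $x^*$ averages the model over it:

p(w \mid D) = \frac{p(D \mid w)\, p(w)}{p(D)}, \qquad
p(y^* \mid x^*, D) = \int p(y^* \mid x^*, w)\, p(w \mid D)\, dw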
For a quadratic error function
E(w) = a + b^\top w + w^\top C w
the minimum can be found in closed form. In one dimension, $E(w) = a + bw + cw^2$ is minimized at $w^* = -b/(2c)$; in general, setting the gradient $b + 2Cw$ to zero gives
w^* = -\tfrac{1}{2}\, C^{-1} b.
For linear regression with Gaussian noise (columns of $X$ are the inputs $x_n$, $y$ is the row vector of targets): $C = X X^\top$ and $b = -2 X y^\top$.
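A minimal Python sketch of this closed form under the convention above; the data are synthetic, and the answer is checked against NumPy's least-squares solver.

import numpy as np

rng = np.random.default_rng(0)
d, n = 3, 50
X = rng.normal(size=(d, n))                  # each column is one input x_n
w_true = np.array([1.0, -2.0, 0.5])
y = w_true @ X + 0.1 * rng.normal(size=n)    # targets with Gaussian noise

C = X @ X.T                                  # quadratic term of E(w)
b = -2 * X @ y                               # linear term of E(w)
w_star = -0.5 * np.linalg.solve(C, b)        # w* = -1/2 C^{-1} b

print(w_star)
print(np.linalg.lstsq(X.T, y, rcond=None)[0])  # should agree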
Steepest Descent
w_{t+1} = w_t - \eta \, \nabla_w E(w_t)
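A bare-bones Python sketch of this update on a small quadratic error of the form used above; the stepsize and iteration count are arbitrary choices for the example.

import numpy as np

def steepest_descent(grad, w0, eta=0.1, n_steps=100):
    # Repeatedly step in the direction of the negative gradient.
    w = w0.copy()
    for _ in range(n_steps):
        w = w - eta * grad(w)
    return w

# E(w) = w^T C w + b^T w, so grad E(w) = 2 C w + b.
C = np.array([[2.0, 0.0], [0.0, 1.0]])
b = np.array([-4.0, -2.0])
grad_E = lambda w: 2 * C @ w + b

print(steepest_descent(grad_E, np.zeros(2)))  # approaches -1/2 C^{-1} b = [1, 1]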
Rate of convergence depends on $(1 - 2\lambda_{\min}/\lambda_{\max})$, where $\lambda_{\min}$ and $\lambda_{\max}$ are the smallest and largest eigenvalues of the curvature (Hessian) matrix.
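A sketch of where this factor comes from (assuming a quadratic error written in the eigenbasis of the curvature matrix): if

E = \sum_i \lambda_i v_i^2,

then a gradient step with stepsize $\eta$ multiplies each coefficient by

v_i \leftarrow (1 - 2\eta\lambda_i)\, v_i,

so stability requires $\eta < 1/\lambda_{\max}$, and with $\eta$ near that limit the slowest direction contracts by roughly $(1 - 2\lambda_{\min}/\lambda_{\max})$ per step.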
Adaptive Stepsize
After each step, check whether the error went down or up. If it went down, grow the stepsize: $\eta \leftarrow \rho\,\eta$. If it went up, undo the step and shrink the stepsize: $\eta \leftarrow \sigma\,\eta$; $w_t = w_{t-1}$.

Sensible choices for the parameters: $\rho = 1.1$, $\sigma = 0.5$.
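A rough Python sketch of this adaptive-stepsize loop (the test function, starting point, and iteration count are illustrative; the grow/shrink factors follow the values above).

import numpy as np

def adaptive_gradient_descent(E, grad, w0, eta=0.1, rho=1.1, sigma=0.5, n_steps=200):
    w = w0.copy()
    err = E(w)
    for _ in range(n_steps):
        w_new = w - eta * grad(w)
        err_new = E(w_new)
        if err_new < err:
            # Error went down: accept the step and grow the stepsize.
            w, err, eta = w_new, err_new, rho * eta
        else:
            # Error went up: undo the step and shrink the stepsize.
            eta = sigma * eta
    return w

# Badly scaled quadratic E(w) = w^T C w + b^T w with minimum at [1, 1].
C = np.diag([5.0, 0.5])
b = np.array([-10.0, -1.0])
E = lambda w: w @ C @ w + b @ w
grad_E = lambda w: 2 * C @ w + b

print(adaptive_gradient_descent(E, grad_E, np.zeros(2)))  # converges towards [1, 1]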
Momentum
If the error surface is a long and narrow valley, gradient descent goes quickly down the valley walls, but very slowly along the valley floor.
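The usual fix is the standard momentum ("heavy ball") update, which accumulates a running average of recent gradients (the momentum coefficient $\alpha$ is the conventional symbol, not taken from the slide):

\Delta w_t = -\eta \, \nabla_w E(w_t) + \alpha \, \Delta w_{t-1}, \qquad w_{t+1} = w_t + \Delta w_t, \qquad 0 \le \alpha < 1.

Along the valley floor the gradients keep pointing the same way, so the steps build up; across the valley they alternate in sign and largely cancel.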
Some very nice error functions (e.g. linear least squares, logistic regression, lasso) are convex, and thus every local minimum is a global minimum.
Convexity means that the second derivative (in several dimensions, the Hessian) is non-negative everywhere.
Equivalently, no convex combination of weight vectors can have greater error than the same convex combination of their errors.
But most settings do not lead to convex optimization problems.
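Written out (the standard definition, matching the statement above): $E$ is convex if for any two weight vectors $w_1, w_2$ and any $\lambda \in [0, 1]$

E\big(\lambda w_1 + (1 - \lambda) w_2\big) \le \lambda\, E(w_1) + (1 - \lambda)\, E(w_2).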
Line Search
Conjugate Gradients
To first order, all three expressions below satisfy our constraint that, along the new search direction, $g^\top d^{(t)} = 0$:

Hestenes-Stiefel
Polak-Ribière
Fletcher-Reeves
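For reference, the standard forms of these three coefficients (not reconstructed from the slide itself), with $g_t$ the gradient and $d_t$ the search direction at step $t$:

\beta_{HS} = \frac{g_{t+1}^\top (g_{t+1} - g_t)}{d_t^\top (g_{t+1} - g_t)}, \qquad
\beta_{PR} = \frac{g_{t+1}^\top (g_{t+1} - g_t)}{g_t^\top g_t}, \qquad
\beta_{FR} = \frac{g_{t+1}^\top g_{t+1}}{g_t^\top g_t}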
CSC2515: Lecture 6
Optimization
22
CSC2515: Lecture 6
Optimization
BFGS
Constrained Optimization
Lagrange Multipliers
Imagine that our parameters have to live on the constraint surface c(w) = 0.
To optimize our function E(w) we want to look at the component of the gradient that lies within the surface, i.e., that has zero dot product with the normal of the constraint.
At the constrained optimum, this gradient component is zero. In other words, the gradient of the function is parallel to the gradient of the constraint (the normal of the constraint surface).
L(w, \lambda) = E(w) + \lambda^\top c(w)

\partial L / \partial \lambda = c(w)

\partial L / \partial w = \partial E / \partial w + \lambda^\top\, \partial c / \partial w
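A toy illustration (my own example, not from the slides): minimize $E(w) = w^\top w$ subject to the single linear constraint $c(w) = a^\top w - 1 = 0$. Setting $\partial L/\partial w = 0$ gives $w = -\lambda a / 2$, and $\partial L/\partial \lambda = 0$ recovers the constraint, so $\lambda = -2/(a^\top a)$ and $w^* = a/(a^\top a)$. A short NumPy check:

import numpy as np

a = np.array([3.0, 4.0])        # constraint: a^T w = 1

# Closed-form solution from the Lagrangian conditions.
lam = -2.0 / (a @ a)
w_star = -lam * a / 2.0         # equals a / (a^T a)

print(w_star)                   # [0.12, 0.16]
print(a @ w_star)               # constraint satisfied: 1.0
print(2 * w_star / a)           # grad E = 2 w is parallel to grad c = a (constant ratio)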
Now set $\partial L / \partial x = 0$ and $\partial L / \partial \lambda = 0$:

x = A b + A C \lambda

\lambda = -(C^\top A C)^{-1} C^\top A b