Unit 1
Input variables (X) → f → Output variables (Y)
Y = f(X)
• The machine learns the target mapping function from the training data, but the form of the function is unknown
• An algorithm learns this target mapping function from the training data
• Based on what we know about the input data, we evaluate different machine learning algorithms and see which is better at approximating the underlying function
• Different algorithms make different assumptions or biases about the
form of the function and how it can be learned.
Nonparametric Machine Learning Algorithms
• Consider the k-nearest neighbors algorithm, which makes predictions for a new data instance based on the k most similar training patterns (a minimal code sketch appears after the list below).
• The method does not assume anything about the form of the mapping function
other than patterns that are close are likely to have a similar output variable.
• Some more examples of popular nonparametric machine learning algorithms are:
– k-Nearest Neighbors
– Decision Trees like CART and C4.5
– Support Vector Machines
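As an illustration of the k-nearest neighbors idea described above, here is a minimal sketch, assuming NumPy arrays and plain Euclidean distance; the function name and default k are illustrative, not taken from the slides:

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    # Distances from the new instance to every stored training pattern
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k most similar (closest) training patterns
    nearest = np.argsort(dists)[:k]
    # Majority vote over their labels gives the prediction
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```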
Regularization
• Assume that the regularization effect is so high that some of the weights become nearly equal to zero
• This will result in a much simpler linear network and slight underfitting of the training data
• Such a large value of the regularization coefficient is not useful
• Optimize the value of the regularization coefficient in order to obtain a well-fitted model
• There are three very popular and efficient regularization techniques called L1, L2, and
dropout
• L1 and L2 are the most common types of regularization
• These update the cost function, J(θ), by adding another term known as the regularization term:
Cost function = Loss + Regularization term
• Due to the addition of this regularization term, the values of the weight matrices decrease; the underlying assumption is that a neural network with smaller weight matrices leads to simpler models
L2 Regularization
• The most common of all regularization techniques; also commonly known as weight decay or Ridge Regression
• The regularization term is the squared Euclidean norm (L2 norm) of the weight matrices, i.e. the sum over all squared weight values of a weight matrix
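Written out, the L2-regularized cost has the following standard form; λ denotes the regularization coefficient (the symbol is an assumption, since the slide's own formula is not reproduced here):

```latex
J_{\mathrm{reg}}(\theta) \;=\; J(\theta) \;+\; \frac{\lambda}{2}\sum_{i} w_i^{2}
```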
L1 Regularization
• Also known as Lasso regression
• The regularization term is the sum of the absolute values of the weight parameters in a weight matrix:
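In the same notation, the L1-regularized cost (standard Lasso form, with λ again the regularization coefficient) is:

```latex
J_{\mathrm{reg}}(\theta) \;=\; J(\theta) \;+\; \lambda \sum_{i} \lvert w_i \rvert
```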
Gradient-based Optimization
What is Gradient Descent?
• Gradient descent iteratively updates the model parameters θ in the direction of the negative gradient of the cost function J(θ):
θ = θ − α⋅∇J(θ)
where α is the learning rate and ∇J(θ) is the gradient of the cost function with respect to the parameters
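A minimal sketch of this update loop in Python; the helper name, the toy quadratic cost, and the hyperparameter values are illustrative assumptions:

```python
import numpy as np

def gradient_descent(grad_J, theta0, alpha=0.1, n_iters=100):
    # Plain (batch) gradient descent: repeat theta = theta - alpha * grad J(theta)
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iters):
        theta = theta - alpha * grad_J(theta)
    return theta

# Usage on a toy cost J(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3)
theta_star = gradient_descent(lambda t: 2.0 * (t - 3.0), theta0=[0.0])
# theta_star ends up close to [3.0]
```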
Stochastic Gradient Descent
• Even after increasing the number of iterations, the computation cost is still less
than that of the gradient descent optimizer
• If the data is enormous and computational time is an essential factor,
stochastic gradient descent should be preferred over gradient descent
algorithm.
• Example: if the dataset contains 1000 rows, SGD updates the model parameters 1000 times (with batch size = 1) in one pass over the dataset, instead of once as in Gradient Descent
θ = θ − α⋅∇J (the update is computed for each training sample)
• Because the model parameters are updated so frequently, they have high variance, and the loss function fluctuates with varying intensity
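A minimal per-sample SGD sketch for linear regression with squared loss; the function name, the choice of loss, and the hyperparameters are illustrative assumptions, not from the slides:

```python
import numpy as np

def sgd_linear_regression(X, y, alpha=0.01, n_epochs=10, seed=0):
    # Per-sample SGD (batch size = 1): one epoch makes len(X) parameter updates
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_epochs):
        for i in rng.permutation(len(X)):      # visit samples in a new random order each epoch
            err = X[i] @ w + b - y[i]          # prediction error on a single sample
            w -= alpha * err * X[i]            # theta = theta - alpha * grad of the per-sample loss
            b -= alpha * err
    return w, b
```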
• Advantages:
• Frequent updates of the model parameters, hence convergence in less time
• Requires less memory, as there is no need to store the loss values over the whole dataset
• May find new (possibly better) minima
• Disadvantages:
• High variance in the model parameters
• May overshoot even after reaching the global minimum
• To get the same convergence as gradient descent, the learning rate needs to be reduced slowly
(Figure: comparison of the optimization paths taken by Stochastic Gradient Descent and Gradient Descent)
Dynamic Learning Rate
• Replacing the fixed learning rate with a time-dependent learning rate η(t) adds to the complexity of controlling the convergence of an optimization algorithm
• We need to figure out how rapidly η(t) should decay:
• If it decays too quickly, we stop optimizing prematurely
• If it decays too slowly, we waste too much time on optimization
• There are a few basic strategies that are used for adjusting η over time
Types of Dynamic Learning Rate
• Exponential decay
• Polynomial decay
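Commonly used forms of these schedules, given here as assumed standard forms since the slide's formulas are not reproduced; η₀ is the initial learning rate, t the iteration, and λ, α, β are decay hyperparameters:

```latex
\eta(t) = \eta_0 \, e^{-\lambda t}            % exponential decay
\eta(t) = \eta_0 \, (\beta t + 1)^{-\alpha}   % polynomial decay
```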
Mini-batch Stochastic Gradient Descent
• There are two extremes in the approach to gradient-based learning:
– Use the full dataset to compute gradients and to update parameters, one pass at a time
– Process one observation at a time to make progress
– Each of them has its own drawbacks
• Gradient Descent is not particularly data efficient whenever the dataset is very large
• Stochastic Gradient Descent is not particularly computationally efficient, since CPUs and GPUs cannot exploit the full power of vectorization
• The Mini-batch Stochastic Gradient Descent optimizer is useful in such cases (a minimal sketch follows below)
• The cost function in mini-batch gradient descent is noisier than in the batch gradient descent algorithm but smoother than in the stochastic gradient descent algorithm
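A minimal mini-batch SGD sketch, again for linear regression with squared loss; the names, batch size, and other hyperparameters are illustrative assumptions:

```python
import numpy as np

def minibatch_sgd(X, y, alpha=0.01, batch_size=32, n_epochs=10, seed=0):
    # Gradients are averaged over small random batches: cheaper than full-batch GD,
    # better vectorized and less noisy than per-sample SGD
    rng = np.random.default_rng(seed)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_epochs):
        order = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            err = X[idx] @ w + b - y[idx]              # errors for the whole mini-batch
            w -= alpha * (X[idx].T @ err) / len(idx)   # averaged gradient step
            b -= alpha * err.mean()
    return w, b
```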
All types of Gradient Descent have some challenges
• Choosing an optimum value of the learning rate: if the learning rate is too small, then gradient descent may take ages to converge
• They use a constant learning rate for all the parameters, although some parameters may not need to be changed at the same rate
• May get trapped at local minima
Momentum
• Intuition:
• If a person is repeatedly asked to move in the same direction, they probably gain confidence and start taking bigger steps in that direction
• Just as a ball gains momentum while rolling down a slope
• Stochastic gradient descent takes a much more noisy path than the gradient
descent algorithm
• Therefore, it requires a larger number of iterations to reach the optimal minimum, and hence computation is very slow
• Momentum is used for reducing high variance in SGD
• Momentum helps reach convergence in less time
SGD with Momentum
• Update the weights and bias in SGD using an additional velocity term that accumulates past gradients (standard form; γ is the momentum coefficient, typically about 0.9):
v = γ⋅v + α⋅∇J(θ)
θ = θ − v
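The same update as a one-step Python sketch; the function name and default values are illustrative:

```python
def sgd_momentum_step(theta, v, grad, alpha=0.01, gamma=0.9):
    # One SGD-with-momentum update: the velocity v accumulates past gradients,
    # which damps the oscillations of plain SGD
    v = gamma * v + alpha * grad     # v = gamma * v + alpha * grad J(theta)
    theta = theta - v                # theta = theta - v
    return theta, v
```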
Adaptive Gradient Algorithm (Adagrad)
• Value of learning rate can change the pace of training
• Learning rate is constant in GD, SGD and SGD with momentum
• Adagrad does not use momentum
• For sparse features (where most of the values are zero), a high learning rate is required to boost the small gradient values
• For dense features, the learning rate can be lower
• The solution is to have an adaptive learning rate that can change according to the input provided
• The Adagrad optimizer decays the learning rate in proportion to the update history of the gradients
• This means that when there are larger updates, more history is accumulated, which in turn reduces the learning rate, and vice versa
• One disadvantage of this approach is that the learning rate decays aggressively
and after some time it approaches zero
• The learning rate is modified for a given weight at a given time, based on previous gradients
• Store the sum of the squares of the gradients up to time step t: αt = αt−1 + gt²
• The update becomes θ = θ − (η / √(αt + ϵ))⋅gt, where ϵ is a smoothing term that avoids division by zero
• The learning rate for each parameter thus decays in proportion to its update history
• The history of the gradients is accumulated in αt
• The smaller the accumulated gradient, the smaller the αt value will be, leading to a bigger effective learning rate (because αt divides η)
• It therefore makes big updates for less frequent parameters and small steps for frequent parameters
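A one-step Adagrad sketch in Python corresponding to the update above; the names and default values are illustrative:

```python
import numpy as np

def adagrad_step(theta, hist, grad, eta=0.01, eps=1e-8):
    # hist plays the role of alpha_t: a running sum of squared gradients, so
    # frequently updated parameters get a smaller effective learning rate
    hist = hist + grad ** 2
    theta = theta - eta * grad / (np.sqrt(hist) + eps)
    return theta, hist
```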
Advantages:
• Learning rate changes for each training parameter.
• Don’t need to manually tune the learning rate.
• Able to train on sparse data.
• Reaches convergence at a higher speed
Disadvantages:
• Computationally more expensive, as the squared-gradient history has to be computed and stored for every parameter.
• It decreases the learning rate aggressively and monotonically
• There might be a point when the learning rate becomes extremely small, because the squared gradients in the denominator keep accumulating and the denominator therefore keeps on increasing
• Due to small learning rates, the model eventually becomes unable to acquire
more knowledge, and hence the accuracy of the model is compromised
Adadelta
• An extension of Adagrad that attempts to solve its radically diminishing learning rates
• Instead of summing up all the past squared gradients from time step 1 to t, it uses an exponentially weighted average over the gradients
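A common way to write the exponentially weighted average of squared gradients used here (ρ is the decay rate, e.g. 0.9):

```latex
E[g^{2}]_t = \rho\, E[g^{2}]_{t-1} + (1-\rho)\, g_t^{2}
```

The step is then scaled by the square root of this running average rather than by Adagrad's ever-growing sum, so the effective learning rate no longer decays monotonically towards zero; the full Adadelta rule additionally replaces the global learning rate η by a running average of past parameter updates.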
Adaptive Moment Estimation (Adam)
• Intuition: we don't want to roll so fast that we jump over the minimum; we want to decrease the velocity a little bit for a careful search
• Adam is widely regarded as the best general-purpose optimizer
• Requires less time and is efficient
• For sparse data, use optimizers with a dynamic learning rate
• Combines the power of Adadelta and momentum-based SGD
• The power of momentum SGD to hold a history of updates, together with the adaptive learning rate provided by Adadelta, makes it a powerful method
• Adam keeps an exponentially weighted average of the past gradients (first moment) and an exponentially weighted average of the past squared gradients (second moment)
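For reference, the standard Adam update built from these two moving averages; β₁ ≈ 0.9, β₂ ≈ 0.999 and ϵ are the usual defaults, and the slide's own formulas are not reproduced here:

```latex
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t            % moving average of past gradients
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^{2}        % moving average of past squared gradients
\hat{m}_t = \frac{m_t}{1-\beta_1^{t}}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^{t}}   % bias correction
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t
```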
• Advantages:
• The method is very fast and converges rapidly.
• Rectifies the vanishing learning rate and high variance problems.
• Disadvantages:
• Computationally costly.
RMSProp
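The slide's formula is not reproduced here; the standard RMSProp update, which also shows the role of the decay rate β mentioned below, is:

```latex
E[g^{2}]_t = \beta\, E[g^{2}]_{t-1} + (1-\beta)\, g_t^{2}
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^{2}]_t} + \epsilon}\, g_t
```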
• Advantage: the algorithm converges quickly and requires less tuning than gradient descent algorithms and their variants
• Disadvantage: the learning rate has to be defined manually, and the suggested value of β does not work for every application
Challenges in Neural Network Optimization
• Gradient-based methods face several challenges in neural network optimization, such as getting trapped in the jagged local minima of the loss surface; gradient-free methods such as surrogate optimization and simulated annealing are alternatives
Gradient-free Optimization
Surrogate Optimization
• Surrogate optimization fits a simpler model, the surrogate function (for example a polynomial), to sampled points of the true loss function and then optimizes that model instead.
• Because finding the minima of polynomials is a very well-studied subject, and a host of very efficient derivative-based methods exist for finding the global minimum of a polynomial, we can assume that the global minimum of the surrogate function is close to that of the loss function.
• Surrogate optimization is technically a non-iterative method, although training the surrogate function is often iterative; it is also technically a gradient-free method, although effective mathematical methods for finding the global minimum of the modelling function are often based on derivatives.
• However, because both the iterative and gradient-based properties are 'secondary' to surrogate optimization, it can handle large data and non-differentiable optimization problems.
• Optimization using a surrogate function is quite clever in a few ways:
• It is essentially smoothing out the surface of the true loss function, which
reduces the jagged local minima that cause so much of the extra training time
in neural networks.
• It projects a difficult problem onto a much easier one: whether the surrogate is a polynomial, an RBF (radial basis function) model, a GP (Gaussian process), MARS (multivariate adaptive regression splines), or another surrogate model, the task of finding the global minimum can draw on a large body of mathematical knowledge.
• Overfitting the surrogate model is not really much of an issue, because even with a fair bit of overfitting, the surrogate function is still smoother and less jagged than the true loss function. Together with the other ways in which these more mathematically inclined models are simplified, this makes training surrogate models much easier.
• Surrogate optimization is not limited to a view of where it currently is: it sees the 'entire function', as opposed to gradient descent, which must continually make risky choices about whether it thinks there will be a deeper minimum over the next hill.
Surrogate Annealing
• Surrogate optimization is almost always faster than gradient
descent methods, but often at the cost of accuracy.
• Using surrogate optimization may only be able to pinpoint the rough location of the global minimum, but this can still be tremendously beneficial.
• An alternative is a hybrid model: surrogate optimization is used to bring the neural network parameters to the rough location, from which gradient descent can be used to find the exact global minimum.
• Another is to use the surrogate model to guide the optimizer's decisions, since the surrogate function
– a) can 'see ahead' and
– b) is less sensitive to the specific ups and downs of the loss function.
Simulated Annealing
• Simulated Annealing is a concept based on annealing in metallurgy, in
which a material can be heated above its recrystallization
temperature to reduce its hardness and alter other physical and
occasionally chemical properties, and is then allowed to cool gradually and become rigid again.
• Using the notion of slow cooling, simulated annealing slowly
decreases the probability of accepting worse solutions as the solution
space is explored.
• Because accepting worse solutions allows for a broader search for the global minimum (think of crossing the hill to reach a deeper valley), simulated annealing assumes that the possibilities are properly represented and explored in the first iterations.
• As time progresses, the algorithm moves away from exploration and towards exploitation.
• The following is a rough outline of how simulated annealing algorithms work (a minimal code sketch follows the list):
– The temperature is set at some initial positive value
and progressively approaches zero.
– At each time step, the algorithm randomly chooses a solution close to the current one, measures its quality, and moves to it with a probability that depends on that quality and on the current temperature (better solutions are always accepted, worse ones only with a temperature-dependent probability).
– Ideally, by the time the temperature reaches zero, the algorithm has converged on a global-minimum solution.
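A minimal simulated-annealing sketch following this outline; the cost and neighbour functions, the cooling schedule, and the constants are illustrative assumptions:

```python
import math
import random

def simulated_annealing(cost, neighbour, x0, T0=1.0, cooling=0.995, n_steps=10000):
    # cost(x) scores a candidate solution; neighbour(x) proposes a nearby one
    x, fx, T = x0, cost(x0), T0
    best, fbest = x, fx
    for _ in range(n_steps):
        cand = neighbour(x)
        delta = cost(cand) - fx
        # Always accept improvements; accept worse moves with probability exp(-delta / T)
        if delta < 0 or random.random() < math.exp(-delta / T):
            x, fx = cand, fx + delta
        if fx < fbest:
            best, fbest = x, fx
        T *= cooling   # slow "cooling": exploration gradually gives way to exploitation
    return best, fbest
```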
• The simulation can be performed with
kinetic equations or with stochastic
sampling methods.
• Simulated Annealing has been used to solve the travelling salesman problem, which tries to find the shortest route through hundreds of locations, represented by data points.
• Obviously, the number of possible routes is enormous, but simulated annealing, with its reminiscence of reinforcement learning, performs very well.
• Simulated annealing performs especially well in
scenarios where an approximate solution is
required in a short period of time, outperforming
the slow pace of gradient descent.
• Like surrogate optimization, it can be used in a hybrid with gradient descent to get the benefits of both: the speed of simulated annealing and the accuracy of gradient descent.