
Unit 1

Review of machine learning and deep learning concepts
Topics to be covered:

• Parametric vs non-parametric models


• Feedforward neural networks
• Gradient-based Optimization
• Regularization
• Challenges in Neural Network Optimization
Introduction
• Machine learning can be summarized as:
– learning a function (f) that maps input variables (X) to output variables (Y)
– Y = f(X)

[Diagram: input variables (X) → f → output variables (Y)]

• The machine learns from the training data to map the target function,
but the configuration of the function is unknown.
• An algorithm learns this target mapping function from training data.
• Based on the knowledge of the input data we try to evaluate different
machine learning algorithms and see which is better at approximating
the underlying function.
• Different algorithms make different assumptions or biases about the
form of the function and how it can be learned.
Parametric Machine Learning Algorithms

• Assumptions can greatly simplify the learning process, but can also limit what can be learned.
• A learning model that summarizes data with a set of
parameters of fixed size (independent of the number of
training examples) is called a parametric model.
• No matter how much data you throw at a parametric
model, it won’t change its mind about how many
parameters it needs.
• Thus machine learning models are parameterized so that
their behavior can be tuned for a given problem. These
models can have many parameters and finding the best
combination of parameters can be treated as a search
problem.
Parametric Machine Learning Algorithms

• A model parameter is a configuration variable that is internal to the model and whose value can be estimated from the given data.
– They are required by the model when making predictions.
– Their values define the skill of the model on your problem.
– They are estimated or learned from historical training data.
– They are often not set manually by the practitioner.
– They are often saved as part of the learned model.
• The examples of model parameters include:
– The weights in an artificial neural network.
– The support vectors in a support vector machine.
– The coefficients in linear regression or logistic regression.
• Machine learning algorithms are classified into two distinct
groups: parametric and nonparametric models.
Parametric Machine Learning Algorithms

• The algorithms involve two steps:


– Select a form for the function.
– Learn the coefficients for the
function from the training data.
• An easy to understand functional form for the mapping function is a line, as is used in linear regression:
– b0 + b1*x1 + b2*x2 = 0
– b0, b1, b2 → the coefficients of the line that control the intercept and slope
– x1, x2 → input variables
Parametric Machine Learning Algorithms
• Assuming the functional form of a line greatly simplifies the
learning process.
• Now, all we need to do is estimate the coefficients of the
line equation and we have a predictive model for the
problem.
• Often the assumed functional form is a linear combination of the input variables, and as such parametric machine learning algorithms are often also called "linear machine learning algorithms".
• The problem is, the actual unknown underlying function
may not be a linear function like a line.
• It could be almost a line and require some minor
transformation of the input data to work right.
• Or it could be nothing like a line in which case the
assumption is wrong and the approach will produce poor
results.
Parametric Machine Learning Algorithms

• Some more examples of parametric machine learning algorithms include:
– Logistic Regression
– Linear Discriminant Analysis
– Perceptron
– Naive Bayes
– Simple Neural Networks
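As a minimal illustration of the two parametric steps above (a hedged sketch assuming numpy and toy data; the fixed functional form is a line, and only its coefficients are learned):

import numpy as np

# Toy training data generated from an underlying linear function plus noise
rng = np.random.default_rng(0)
X = rng.uniform(-1, 1, size=(100, 2))                 # input variables x1, x2
y = 3.0 + 2.0 * X[:, 0] - 1.5 * X[:, 1] + 0.1 * rng.standard_normal(100)

# Step 1: select a form for the function -> y = b0 + b1*x1 + b2*x2
# Step 2: learn the coefficients from the training data (least squares)
A = np.column_stack([np.ones(len(X)), X])             # column of 1s for the intercept b0
(b0, b1, b2), *_ = np.linalg.lstsq(A, y, rcond=None)  # fixed parameter count,
print(b0, b1, b2)                                     # independent of dataset size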
Parametric Machine Learning Algorithms

• Benefits of Parametric Machine Learning Algorithms:


– Simpler: these methods make it easier to understand and interpret results.
– Speed: Parametric models are very fast to learn from data.
– Less Data: They do not require as much training data and
can work well even if the fit to the data is not perfect.
• Limitations of Parametric Machine Learning Algorithms:
– Constrained: By choosing a functional form these methods are
highly constrained to the specified form.
– Limited Complexity: The methods are more suited to simpler
problems.
– Poor Fit: In practice the methods are unlikely to match the
underlying mapping function.
Nonparametric Machine Learning Algorithms

• Algorithms that do not make strong assumptions about the form of the mapping function are called nonparametric machine learning algorithms.
• By not making assumptions, they are free to learn
any functional form from the training data.
• Nonparametric methods are good when you have
a lot of data and no prior knowledge, and when
you don’t want to worry too much about
choosing just the right features.
Nonparametric Machine Learning Algorithms

• Nonparametric methods seek to best fit the training data in constructing the mapping function, whilst maintaining some ability to generalize to unseen data.
• As such, they are able to fit a large number of
functional forms.
Nonparametric Machine Learning Algorithms

• Consider the k-nearest neighbors algorithm that makes predictions based on the k
most similar training patterns for a new data instance.
• The method does not assume anything about the form of the mapping function
other than patterns that are close are likely to have a similar output variable.
• Some more examples of popular nonparametric machine learning algorithms are:
– k-Nearest Neighbors
– Decision Trees like CART and C4.5
– Support Vector Machines
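A minimal nonparametric sketch (assuming numpy and non-negative integer class labels; names are illustrative): the "model" is the stored training data itself, so its effective complexity grows with the number of examples rather than being fixed in advance.

import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    # No assumed functional form: compare the new point to every stored pattern
    dists = np.linalg.norm(X_train - x_new, axis=1)
    nearest = np.argsort(dists)[:k]          # indices of the k most similar patterns
    votes = np.bincount(y_train[nearest])    # majority vote among their labels
    return np.argmax(votes)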
Nonparametric Machine Learning Algorithms

• Benefits of Nonparametric Machine Learning Algorithms:


– Flexibility: Capable of fitting a large number of functional
forms.
– Power: No assumptions (or weak assumptions) about the
underlying function.
– Performance: Can result in higher performance models for
prediction.
• Limitations of Nonparametric Machine Learning
Algorithms:
– More data: Require a lot more training data to estimate the
mapping function.
– Slower: A lot slower to train as they often have far more
parameters to train.
– Overfitting: More of a risk to overfit the training data and it
is harder to explain why specific predictions are made.
Parametric and Nonparametric Machine Learning Algorithms

• When to use a parametric algorithm?


– Parametric algorithms are best suited for problems where the input data is well-defined and
predictable.
– This makes them ideal for tasks such as predictive modelling, where the goal is to predict the value
of a target variable based on a set of input variables.
– Example: Let’s say you want to predict the sales volume of a product. You would use a parametric
algorithm, such as linear regression, to build a model that defines the relationship between
historical sales data and the predicted sales volume.
• When to use a nonparametric algorithm?
– Nonparametric algorithms are best suited for problems where the input data is not well-defined or
too complex to be modelled using a parametric algorithm.
– This makes them ideal for tasks such as data classification, where the goal is to separate data into
distinct classes or groups.
– Additionally, nonparametric algorithms are often more accurate than parametric algorithms for
complex problems.
– Example: If you want to build an algorithm to classify images of animals, you would use a
nonparametric algorithm, such as a support vector machine. This is because the input data (images
of animals) is not well-defined and too complex to be modelled using a parametric algorithm.
Feedforward neural networks
• Linear machine learning models are limited to only linear functions.
• When the data is not linearly separable, linear models struggle to approximate it, whereas this is fairly easy for neural networks.
• The hidden layers are used to increase the non-linearity and change the
representation of the data for better generalization over the function.
• Deep feedforward networks, also known as multilayer perceptrons, are the foundation of most deep learning models.
• Networks like CNNs and RNNs are just special cases of feedforward networks.
• These networks are mostly used for supervised machine learning tasks where we already know the target function, i.e. the result we want our network to achieve.
• They are extremely important for practicing machine learning and form the basis of many commercial applications; areas such as computer vision and NLP were greatly affected by these networks.
Feedforward neural networks
• The purpose of feedforward neural networks is to approximate certain functions.
• The input to the network is a vector of values, x, which is passed
through the network, layer by layer, and transformed into
an output, y.
• The network's final output predicts the target function for the given
input.
• The network makes this prediction using a set of parameters, θ
(theta), adjusted during training to minimize the error between the
network's predictions and the target function.
Feedforward neural networks

• Training involves adjusting the θ (theta) values to minimize errors.
• This is done by presenting the network with a set of input-output
pairs (also called training data) and computing the error between
the network's prediction and the true output for each pair.
• This error is then used to compute the gradient of the error
concerning the parameters, which tells us how to adjust the
parameters to reduce the error.
• This is done using optimization techniques like gradient descent.
• Once the training process is completed, the network has "learned" the function and can be used to make predictions for new inputs.
• Finally, the network stores this optimal value of θ (theta) in its memory, so it can use it to predict outputs for new inputs.
Feedforward neural networks

• The main goal of a feedforward network is to approximate some function f*.
• For example, a regression function y = f *(x) maps an input x to a
value y. A feedforward network defines a mapping y = f (x; θ) and
learns the value of the parameters θ that result in the best function
approximation.
• These networks are represented by a composition of many different functions. Each model is associated with an acyclic graph describing how the functions are composed together.
• For example, we might have three functions f^(1), f^(2), and f^(3) connected in a chain to form f(x) = f^(3)(f^(2)(f^(1)(x))). Here f^(1) is the first layer, f^(2) is the second layer, and f^(3) is the output layer.

Feedforward neural networks
[Figure: the training phase of a neural network]
Feedforward neural networks
Steps involved in the implementation of a neural network
A neural network executes in 2 steps:
1. Feedforward:
• Initially, the specific weights required for the inputs are not known
• We have a set of input features and some random weights
• The weight decides how vital that feature is for prediction
• The higher the weight, the greater the importance
2. Backpropagation:
• Calculate the error between predicted output and target
output
• Use an algorithm (gradient descent) to update the
weight values
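A minimal sketch of these two steps for a one-hidden-layer network (assuming numpy, sigmoid activations, squared error, and an illustrative learning rate):

import numpy as np

def sigmoid(z):
    return 1.0 / (1.0 + np.exp(-z))

rng = np.random.default_rng(0)
W1, W2 = rng.standard_normal((2, 4)), rng.standard_normal((4, 1))  # random initial weights
x = np.array([[0.5, -0.2]])
target = np.array([[1.0]])

for epoch in range(100):
    # 1. Feedforward: pass the input through the layers
    h = sigmoid(x @ W1)
    y = sigmoid(h @ W2)
    # 2. Backpropagation: compute the error and push gradients back
    err = y - target                                  # derivative of squared error w.r.t. y
    dW2 = h.T @ (err * y * (1 - y))
    dW1 = x.T @ ((err * y * (1 - y)) @ W2.T * h * (1 - h))
    W1 -= 0.5 * dW1                                   # gradient descent updates
    W2 -= 0.5 * dW2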
Regularization

• A model with higher complexity picks up and learns patterns (noise) in the data that are caused by some random fluctuation or error
• The network would be able to model each data sample of the distribution one by one, while not recognizing the true function that describes the distribution
• Regularization is a set of techniques that can
prevent overfitting in neural networks
• Improve the accuracy of a Deep Learning
model for new data

Regularization

• If the complexity of the model is increased beyond a certain level, the training error reduces but the testing error doesn't
Regularization

• Regularization is a technique which makes slight modifications to the learning algorithm such that the model generalizes better
• This in turn improves the model’s
performance on the unseen data as well
• Regularization refers to a set of different
techniques that lower the complexity of a
neural network model during training
and thus prevent the overfitting
Regularization

Consider a neural network which is overfitting on the training data
• Regularization penalizes the coefficients
• In deep learning, it actually penalizes the weight matrices of the nodes
Regularization

• Assume that the regularization effect is so high that some of the weights are nearly equal to zero
• This will result in a much simpler linear network and slight underfitting of the training data
• Such a large value of the regularization effect is not useful
• Optimize the value of the regularization coefficient in order to obtain a well-fitted model

Regularization

• There are three very popular and efficient regularization techniques called L1, L2, and
dropout
• L1 and L2 are the most common types of regularization
• These update the cost function, J(θ), by adding another term known as the regularization term
Cost function = Loss + Regularization term
• Due to the addition of this regularization term, the values of weight matrices decrease
because it assumes that a neural network with smaller weight matrices leads to simpler
models

Regularization
L2 Regularization
• Most common type of all regularization techniques and is also commonly known
as weight decay or Ridge Regression.

• The regularization term is the Euclidean Norm (or L2 norm) of the weight
matrices, which is the sum over all squared weight values of a weight matrix
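In one standard formulation (assuming a loss J(θ), m training examples, and regularization parameter λ):

Cost function = J(θ) + (λ / 2m) · Σ w²   (sum over all weights w)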

• Lambda is the regularization parameter


• It is the hyperparameter whose value is optimized for better results
• L2 regularization is also known as weight decay as it forces the weights to decay
towards zero (but not exactly zero)

Regularization
L1 Regularization
• Also known as Lasso regression
• It is the sum of the absolute values of the weight parameters in a weight matrix:
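In a standard form (assuming regularization parameter λ):

Regularization term = λ · Σ |w|   (sum over all weights w)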

• Penalize the absolute value of the weights


• Unlike L2, the weights may be reduced to zero
• Useful when we are trying to compress our model
• Otherwise, prefer L2 over it
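A minimal code sketch of both penalty terms (assuming numpy; w is a weight matrix and lam is λ):

import numpy as np

def l2_penalty(w, lam):
    # Sum of squared weights: drives weights toward (but not exactly to) zero
    return lam * np.sum(w ** 2)

def l1_penalty(w, lam):
    # Sum of absolute weights: can drive weights exactly to zero
    return lam * np.sum(np.abs(w))

# cost = loss + l2_penalty(w, lam)    # or + l1_penalty(w, lam)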

Gradient-based Optimization
What is Gradient Descent?

• Algorithm that operates iteratively to find the optimal values for weights
• Requires user-defined learning rate, and initial
weight values
• Steps: (Iterative)
1. Start with initial values of weights
2. Calculate cost
3. Update values using learning rate and update
function if cost is not acceptable
4. Stop if cost is acceptable
Gradient-based Optimization
Gradient descent
• Common cost functions:
• Mean squared error
• Cross-entropy loss (log loss)
• Cost, J(θ) is dependent on weight, θ
• To determine minimum point determine derivative of J(θ)
with respect to θ
Gradient-based Optimization
Gradient descent
• The gradient descent process updates each weight as θi = θi + ∆θi, where ∆θi = −α · ∂J(θ)/∂θi
• ∆θi is the change in weight
• Set a learning rate, α, to control the size of the change (step)
Gradient-based Optimization
Gradient descent

• For one weight, determine the derivative (gradient) with respect to only that one weight
• For multiple weights, determine partial derivative
(gradient) with respect to each weight
• As you approach a local optimum, the slope will approach zero
• Once the slope for the current parameter reaches zero, the parameter value stops updating
• This results in convergence and signifies that we can stop the iterative process
Gradient-based Optimization
Training ANN using gradient Descent

• Use gradient descent to update each of the weights
• Recompute the cost function with the new weights
• Repeat until the cost function converges to its minimum value
• During each iteration we perform forward
propagation to compute the outputs and
backward propagation to compute the errors
• One complete iteration is known as an epoch
• Keep checking cost function after each epoch to
watch the amount of error as network is trained
Gradient-based Optimization
Training ANN using gradient Descent

• Using the updated weights, run forward propagation and calculate J(θ).
• Go backwards and repeat the process until the output equals the target output, or the error is at a minimum
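A minimal gradient-descent training loop for a linear model with a mean-squared-error cost (assuming numpy; the data and hyperparameters are illustrative):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((100, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(100)

theta = np.zeros(3)                      # 1. start with initial weight values
alpha = 0.1                              # learning rate
for epoch in range(200):
    pred = X @ theta                     # forward pass
    grad = X.T @ (pred - y) / len(y)     # gradient of the MSE cost w.r.t. theta
    theta -= alpha * grad                # update: theta = theta - alpha * gradient
    cost = np.mean((pred - y) ** 2)      # 2. watch the cost after each epoch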
Gradient-based Optimization
Stochastic Gradient Descent

• Variant of Gradient Descent


• Tries to update the model’s parameters more frequently
• Model parameters are altered after computation of loss on each training
sample/ batches of samples
• Instead of taking complete dataset for each iteration, randomly select the
batches of data.
• Take few samples from the dataset as batches
• Randomly shuffle the data at each iteration
• Thus, SGD uses a higher number of iterations to reach the local minima
• Due to an increase in the number of iterations, the overall computation time
increases

Gradient-based Optimization
Stochastic Gradient Descent

• Even after increasing the number of iterations, the computation cost is still less
than that of the gradient descent optimizer
• If the data is enormous and computational time is an essential factor,
stochastic gradient descent should be preferred over gradient descent
algorithm.
• Example: if a dataset contains 1000 rows, SGD updates the model parameters 1000 times (with batch size = 1) in one cycle of the dataset, instead of one time as in Gradient Descent
θ = θ − α⋅∇J(θ; x(i), y(i))   (the update is computed for each sample)
• As the model parameters are frequently updated, parameters have high
variance and fluctuations in loss functions at different intensities

Gradient-based Optimization
Stochastic Gradient Descent
• Advantages:
• Frequent updates of model parameters hence,
converges in less time
• Requires less memory as no need to store values of
loss functions
• May get new minima
• Disadvantages:
• High variance in model parameters
• May overshoot even after achieving the global minimum
• To get the same convergence as gradient descent, the learning rate needs to be reduced slowly

Gradient-based Optimization
Stochastic Gradient Descent
[Figure: optimization trajectories of Gradient Descent vs Stochastic Gradient Descent]

• The trajectory of the variables in SGD is much noisier than in gradient descent
• This is due to the stochastic nature of the gradient
• Even after 50 steps the quality is still not good, and it does not improve after additional steps
• The solution to these conflicting goals is to reduce the learning rate dynamically as optimization progresses
Gradient-based Optimization
Stochastic Gradient Descent
Dynamic Learning Rate
• Replacing the fixed learning rate with a time-dependent learning rate η(t) adds to the complexity of controlling convergence of an optimization algorithm
• We need to figure out how rapidly η(t) should decay:
• If it decays too quickly, we stop optimizing prematurely
• If it decays too slowly, we waste too much time on optimization
• There are a few basic strategies used for adjusting η(t) over time
Gradient-based Optimization
Types of Dynamic Learning Rate
• Exponential
• Polynomial
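In common formulations (assuming initial rate η₀ and positive decay constants λ, α, β):

η(t) = η₀ · e^(−λ·t)          (exponential decay)
η(t) = η₀ · (1 + β·t)^(−α)    (polynomial decay)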
Gradient-based Optimization
Mini-batch Stochastic Gradient Descent
• There are two extremes in the approach to gradient based learning
– Use the full dataset to compute gradients and to
update parameters, one pass at a time.
– Process one observation at a time to make progress.
– Each of them has its own drawbacks.
• Gradient Descent is not particularly data efficient whenever data is very large
• Stochastic Gradient Descent is not particularly computationally efficient since CPUs and
GPUs cannot exploit the full power of vectorization.
• Mini-batch Stochastic Gradient Descent optimizer is useful in such cases
• Cost function in mini-batch gradient descent is noisier than the batch gradient descent
algorithm but smoother than that of the stochastic gradient descent algorithm.

Gradient-based Optimization
Mini-batch Stochastic Gradient Descent

• Best among all the variations of gradient descent algorithms


• It is an improvement on both SGD and standard gradient descent
• It updates the model parameters after every batch
• Dataset is divided into various batches and after every batch, the parameters are updated.
θ=θ−α⋅∇J(θ; B(i)), where {B(i)} are the batches of training examples.
• Advantages:
• Frequently updates the model parameters and also has less variance.
• Requires medium amount of memory.
• Disadvantages:
• Introduces an extra hyperparameter, the mini-batch size, which needs to be tuned to achieve the required accuracy
• Generally, a batch size of 32 is considered appropriate; in some cases, it results in poor final accuracy
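A minimal mini-batch SGD sketch (assuming numpy and the same linear-model cost as before; batch size and learning rate are illustrative):

import numpy as np

rng = np.random.default_rng(0)
X = rng.standard_normal((1000, 3))
y = X @ np.array([1.0, -2.0, 0.5]) + 0.1 * rng.standard_normal(1000)

theta, alpha, batch_size = np.zeros(3), 0.1, 32
for epoch in range(20):
    idx = rng.permutation(len(X))            # randomly shuffle the data each epoch
    for start in range(0, len(X), batch_size):
        B = idx[start:start + batch_size]    # one batch B(i) of training examples
        grad = X[B].T @ (X[B] @ theta - y[B]) / len(B)
        theta -= alpha * grad                # theta = theta - alpha * grad J(theta; B(i))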

Gradient-based Optimization
All types of Gradient Descent have some challenges
• Choosing an optimum value of the learning rate: if the learning rate is too small, gradient descent may take ages to converge
• Having a constant learning rate for all the parameters: there may be some parameters which need not be changed at the same rate
• May get trapped at local minima
Gradient-based Optimization
Momentum
• Intuition:
• If a person is repeatedly asked to move in the same direction, the person probably gains confidence and starts taking bigger steps in that direction
• Just as a ball gains momentum while rolling down a slope
• Stochastic gradient descent takes a much noisier path than the gradient descent algorithm
• Therefore, it requires a larger number of iterations to reach the optimal minimum, and hence computation is very slow
• Momentum is used for reducing the high variance in SGD
• Momentum helps reach convergence in less time
Gradient-based Optimization
SGD with Momentum
• Update of weights and biases in SGD, and in SGD with momentum (a common formulation of both updates is sketched below)
• The current update depends on the previous gradient, which in turn depended on the one before, and so on


• This accelerates SGD to converge faster and reduces the oscillations
• The momentum term β is usually set to 0.9
• Effect of momentum reduces with each time step by β times
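In one common formulation (assuming a velocity term v, momentum coefficient β, and learning rate α):

SGD:                θ = θ − α · ∇J(θ)
SGD with momentum:  v(t) = β · v(t−1) + α · ∇J(θ);   θ = θ − v(t)

Unrolling the recursion shows each past gradient's contribution shrinking by a factor of β per time step.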

Gradient-based Optimization
SGD with Momentum

[Figure: trajectories of SGD without momentum vs SGD with momentum]

• If a high value of momentum is used, the possibility of skipping the optimal minimum also increases
• This might result in poor accuracy and even more oscillations
• Advantages:
• Reduces the oscillations and high variance of the
parameters.
• Converges faster than gradient descent.
• Disadvantages:
• One more hyper-parameter is added which needs
to be selected manually and accurately.

Gradient-based Optimization
Adaptive Gradient Algorithm (Adagrad)
• Value of learning rate can change the pace of training
• Learning rate is constant in GD, SGD and SGD with momentum
• Adagrad does not use momentum
• For sparse feature values (where most of the values are zero), a high learning rate is required to boost the small gradient updates
• For dense data, learning rate can be low
• The solution is to have an adaptive learning rate that can change according to the
input provided
• Adagrad optimizer decays learning rate in proportion to the updated history of
the gradients
• It means that when there are larger updates, the history element is accumulated,
and therefore it reduces the learning rate and vice versa.
• One disadvantage of this approach is that the learning rate decays aggressively
and after some time it approaches zero

Gradient-based Optimization
Adagrad
• The learning rate is modified for a given weight at a given time, based on previous gradients
• Store the sum of the squares of the gradients up to time step t (a standard form of this update is sketched below)
• ϵ is a smoothing term that avoids division by zero
• Decay the learning rate for parameters in proportion to their update history
Gradient-based Optimization
Adagrad
• The history of the gradients is accumulated in αt
• The smaller the accumulated gradient, the smaller the αt value will be, leading to a bigger learning rate (because αt divides η)
• It makes big updates for less frequent parameters and small steps for frequent parameters
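A minimal Adagrad step sketch (assuming numpy; grad is the current gradient, eta the base learning rate, and accum the accumulated history αt):

import numpy as np

def adagrad_step(theta, grad, accum, eta=0.01, eps=1e-8):
    # Accumulate the squared-gradient history: alpha_t = alpha_{t-1} + g_t**2
    accum = accum + grad ** 2
    # A bigger history means a smaller effective learning rate for that parameter
    theta = theta - eta / np.sqrt(accum + eps) * grad
    return theta, accum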

Gradient-based Optimization
Adagrad
Advantages:
• Learning rate changes for each training parameter.
• Don’t need to manually tune the learning rate.
• Able to train on sparse data.
• Reaches convergence at a higher speed
Disadvantages:
• Computationally expensive, as a squared-gradient history must be computed and stored for every parameter
• It decreases the learning rate aggressively and monotonically
• There might be a point when the learning rate becomes extremely small, because the squared gradients in the denominator keep accumulating, and thus the denominator keeps on increasing
• Due to small learning rates, the model eventually becomes unable to acquire more knowledge, and hence the accuracy of the model is compromised

Gradient-based Optimization
Adadelta
• An extension of Adagrad
• Attempts to solve Adagrad's radically diminishing learning rates
• Instead of summing up all the past squared gradients from time step 1 to t, Adadelta uses exponentially weighted averages over the squared gradients (a standard form is sketched below)
• The typical β value is 0.9 or 0.95
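A standard form of this exponentially weighted accumulation (assuming decay factor β):

E[g²](t) = β · E[g²](t−1) + (1 − β) · g(t)²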

Gradient-based Optimization
Adaptive Moment Estimation (Adam)

• Intuition: we don’t want to roll so fast that we jump over the minimum; we want to decrease the velocity a little bit for a careful search.
• Adam is the best optimizer
• Requires less time and is efficient
• For sparse data use the optimizers with dynamic learning rate.
• Combines the power of Adadelta and momentum-based SGD
• The ability of momentum SGD to hold the history of updates, combined with the adaptive learning rate provided by Adadelta, makes it a powerful method

Gradient-based Optimization
Adam
• Adam maintains an exponentially weighted average of the past gradients (first moment) and an exponentially weighted average of the past squared gradients (second moment), and combines the two in its update (a standard form is sketched below)
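A standard formulation of the Adam step as a minimal numpy sketch (assuming the usual defaults β₁ = 0.9, β₂ = 0.999; t is the 1-based time step and names are illustrative):

import numpy as np

def adam_step(theta, grad, m, v, t, eta=0.001, b1=0.9, b2=0.999, eps=1e-8):
    m = b1 * m + (1 - b1) * grad          # EWA of past gradients (first moment)
    v = b2 * v + (1 - b2) * grad ** 2     # EWA of past squared gradients (second moment)
    m_hat = m / (1 - b1 ** t)             # bias corrections for early time steps
    v_hat = v / (1 - b2 ** t)
    theta = theta - eta * m_hat / (np.sqrt(v_hat) + eps)
    return theta, m, v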
Gradient-based Optimization

Adam

• Advantages:
• The method is fast and converges rapidly.
• Rectifies the vanishing learning rate and high variance.
• Disadvantages:
• Computationally costly.

Gradient-based Optimization
RMSProp

• It is an improvement on the Adagrad optimizer
• It is similar to Adadelta
• Adadelta changes the learning rate by accumulating the square of the gradients and computing an exponential average at the end of each iteration
• RMSProp changes the learning rate by accumulating the square of the gradients at the end of each step
• Update rule (a standard form is sketched below):
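A standard RMSProp update (assuming decay factor β and learning rate η):

E[g²](t) = β · E[g²](t−1) + (1 − β) · g(t)²
θ = θ − η / √(E[g²](t) + ϵ) · g(t)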

Gradient-based Optimization

RMSProp

• Advantage: the algorithm converges quickly and requires less tuning than gradient descent algorithms and their variants
• Disadvantage: the learning rate has to be defined manually, and the value of β doesn't work for every application

Challenges in Neural Network Optimization

• While gradient descent and its repurposing in the form of backpropagation has been one of the greatest breakthroughs in machine learning, the optimization of neural networks remains an unsolved problem.
Challenges in Neural Network Optimization

• Optimizers get caught in local minima that are deep enough.
• Admittedly, there are clever solutions that sometimes get around these issues, like momentum, which can carry optimizers over large hills; stochastic gradient descent; or batch normalization, which smooths the error space.
• Local minima are still the root cause of many branching problems in neural networks, though.
Challenges in Neural Network Optimization

• Because optimizers are so tempted by local minima, even if an optimizer manages to escape one, it takes a really, really long time.
• Gradient descent is generally a lengthy method because
of its slow convergence rate, even with adaptations for
large datasets like batch gradient descent.
Challenges in Neural Network Optimization

• Gradient descent is especially sensitive to the initialization of the optimizer.
• For instance, performance may be much better if
the optimizer is initialized near the second local
minima instead of the first, but this is all
determined randomly.
Challenges in Neural Network Optimization

• Learning rates dictate how confident and risky the optimizer is; setting too high a learning rate may cause it to overlook the global minimum, whereas too low a learning rate causes the runtime to blow up.
• To address this problem, learning rates have evolved
with decaying, but choosing the rate of decay, among
many other variables dictating the learning rate, is
difficult.
Challenges in Neural Network Optimization

• Gradient descent requires gradients, meaning that it is prone to gradient-based problems like the vanishing and exploding gradient problems, in addition to its inability to handle non-differentiable functions.
Gradient free Optimization

• There are some interesting optimization methods that are not based on gradients but are used to improve the performance of neural networks; they work exceptionally well in some scenarios and not so well in others.
• Regardless of how well they perform on a specific task,
however, they are fascinating, creative, and a promising area
of research for the future of machine learning.
Gradient free Optimization
Particle Swarm Optimization
• Particle Swarm Optimization is a population-based method that defines a
set of ‘particles’ that explore the search space, attempting to find a
minimum.
• PSO iteratively improves a candidate solution with respect to a certain quality metric.
• It solves the problem by having a population of potential solutions (‘particles’) and moving them around according to simple mathematical rules based on each particle’s position and velocity.
• Each particle’s movement is influenced by the local position it believes is best, but it is also attracted by the best known positions in the search space (found by other particles).
• In theory, the swarm moves over several iterations towards the best
solutions.
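A minimal PSO sketch for minimizing a toy function (assuming numpy; the inertia and attraction coefficients 0.7, 1.5, 1.5 are illustrative):

import numpy as np

def f(x):                                    # toy loss: sphere function
    return np.sum(x ** 2, axis=1)

rng = np.random.default_rng(0)
pos = rng.uniform(-5, 5, size=(30, 2))       # 30 particles in a 2-D search space
vel = np.zeros_like(pos)
pbest = pos.copy()                           # best position each particle has seen
gbest = pbest[np.argmin(f(pbest))]           # best position seen by the swarm

for step in range(100):
    r1, r2 = rng.random(pos.shape), rng.random(pos.shape)
    # Velocity: inertia + pull toward personal best + pull toward global best
    vel = 0.7 * vel + 1.5 * r1 * (pbest - pos) + 1.5 * r2 * (gbest - pos)
    pos = pos + vel
    better = f(pos) < f(pbest)               # update personal bests
    pbest[better] = pos[better]
    gbest = pbest[np.argmin(f(pbest))]       # update the swarm's best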
Gradient free Optimization
Particle Swarm Optimization

PSO is a fascinating idea: it is much less sensitive to initialization than neural networks, and the communication between particles on certain findings could prove to be a very efficient method of searching sparse and large areas.
Gradient free Optimization
Particle Swarm Optimization
• Because Particle Swarm Optimization is not gradient-based, it does not require the optimization problem to be differentiable;
• hence using PSO to optimize a neural network or any other algorithm
would allow more freedom and less sensitivity on the choice of
activation function or equivalent role in other algorithms.
• Additionally, it makes little to no assumptions about the problem being
optimized and can search very large spaces.
• Population-based methods tend to be a fair bit more computationally
expensive than gradient-based optimizers, but not necessarily so.

Gradient free Optimization
Particle Swarm Optimization
• Because the algorithm is so open and non-rigid (as evolution-based algorithms often are), one can control
– the number of particles,
– the speed at which they move,
– the amount of information that is shared globally,
and so on;
• just like one might tune learning rates in a neural
network.
Gradient free Optimization
Surrogate optimization
• Surrogate optimization is a method of
optimization that attempts to model the loss
function with another well-established function
to find the minima.
• The technique samples ‘data points’ from the loss
function, meaning it tries different values for
parameters (the x) and stores the value of the
loss function (the y).
• After a sufficient number of data points have
been collected, a surrogate function is fitted to
the collected data.
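A minimal surrogate sketch (assuming numpy, a 1-D loss, and a polynomial surrogate; in practice the surrogate might instead be an RBF, GP, or MARS model, as noted later):

import numpy as np

def loss(x):                                  # true loss: expensive and jagged
    return np.sin(3 * x) + 0.5 * x ** 2

xs = np.random.default_rng(0).uniform(-3, 3, 40)   # sample 'data points' (the x)
ys = loss(xs)                                       # store the loss values (the y)

coefs = np.polyfit(xs, ys, deg=4)             # fit a degree-4 polynomial surrogate
crit = np.roots(np.polyder(coefs))            # stationary points via the derivative
crit = crit[np.isreal(crit)].real             # keep real candidates only
x_min = crit[np.argmin(np.polyval(coefs, crit))]   # surrogate's global minimum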
Gradient free Optimization
Surrogate optimization

• Because finding the minima of polynomials is a very well-studied subject, and there exists a host of very efficient methods to find the global minimum of a polynomial using derivatives, we can assume that the global minimum of the surrogate function is the same as that of the loss function.
• Surrogate optimization is technically a non-iterative method, although the training of
the surrogate function is often iterative; additionally, it is technically a no-gradient
method, although often effective mathematical methods to find the global minima of
the modelling function are based on derivatives.
• However, because both the iterative and gradient-based properties are ‘secondary’ to
surrogate optimization, it can handle large data and non-differentiable optimization
problems.
Gradient free Optimization
Surrogate optimization
• Optimization using a surrogate function is quite clever in a few ways:
• It is essentially smoothing out the surface of the true loss function, which
reduces the jagged local minima that cause so much of the extra training time
in neural networks.
• It projects a difficult problem into a much easier one: whether it’s a
polynomial, RBF, GP, MARS, or another surrogate model, the task of finding
the global minima is boosted with mathematical knowledge.
• Overfitting the surrogate model is not really much of an issue, because even
with a fair bit of overfitting, the surrogate function is still more smooth and
less jagged than the true loss function. Along with many other standard
considerations in building more mathematically-inclined models that have
been simplified, training surrogate models is hence much easier.
• Surrogate optimization is not limited by the view of where it currently is, in that it sees the ‘entire function’, as opposed to gradient descent, which must continually make risky choices on whether it thinks there will be a deeper minimum over the next hill.
Gradient free Optimization
Surrogate optimization
• Surrogate optimization is almost always faster than gradient
descent methods, but often at the cost of accuracy.
• Using surrogate optimization may only be able to pinpoint the rough location of the global minimum, but this can still be tremendously beneficial.
• An alternative is a hybrid model; a surrogate optimization is used to
bring the neural network parameters to the rough location, from
which gradient descent can be used to find the exact global minima.
• Another is to use the surrogate model to guide the optimizer’s
decisions, since the surrogate function can
– a) ‘see ahead’ and
– b) is less sensitive to specific ups and downs of the loss function.
Gradient free Optimization
Simulated Annealing
• Simulated Annealing is a concept based on annealing in metallurgy, in
which a material can be heated above its recrystallization
temperature to reduce its hardness and alter other physical and
occasionally chemical properties, then allowing the material to
gradually cool and become rigid again.
• Using the notion of slow cooling, simulated annealing slowly
decreases the probability of accepting worse solutions as the solution
space is explored.
• Because accepting worse solutions allows for a broader search for the global minimum (think: cross the hill to reach a deeper valley), simulated annealing assumes that the possibilities are properly represented and explored in the first iterations.
• As time progresses, the algorithm moves away from exploration and towards exploitation.
Gradient free Optimization
Simulated Annealing
• The following is a rough outline of how
simulated annealing algorithms work:
– The temperature is set at some initial positive value
and progressively approaches zero.
– At each time step, the algorithm randomly chooses
a solution close to the current one, measures its
quality, and moves to it depending on the current
temperature (probability of accepting better or
worse solutions).
– Ideally, by the time the temperature reaches zero, the algorithm has converged on a global-minimum solution.
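A minimal simulated-annealing sketch following that outline (assuming numpy, a 1-D loss, and a simple geometric cooling schedule):

import numpy as np

def loss(x):
    return np.sin(3 * x) + 0.5 * x ** 2

rng = np.random.default_rng(0)
x = rng.uniform(-3, 3)                    # current solution
T = 1.0                                   # initial positive temperature
while T > 1e-3:                           # temperature progressively approaches zero
    x_new = x + rng.normal(0, 0.5)        # random solution close to the current one
    delta = loss(x_new) - loss(x)
    # Always accept improvements; accept worse moves with probability exp(-delta/T)
    if delta < 0 or rng.random() < np.exp(-delta / T):
        x = x_new
    T *= 0.99                             # slow cooling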
Gradient free Optimization
Simulated Annealing
• The simulation can be performed with
kinetic equations or with stochastic
sampling methods.
• Simulated Annealing was used to
solve the travelling salesman problem,
which tries to find the shortest
distance between hundreds of
locations, represented by data points.
• Obviously, the number of possible combinations is enormous, but simulated annealing, with its reminiscence of reinforcement learning, performs very well.
Gradient free Optimization
Simulated Annealing
• Simulated annealing performs especially well in
scenarios where an approximate solution is
required in a short period of time, outperforming
the slow pace of gradient descent.
• Like surrogate optimization, it can be used in
hybrid with gradient descent for the benefits of
both: the speed of simulated annealing and the
accuracy of gradient descent.
Gradient free Optimization

• Non-gradient methods for optimization are fascinating because of the creativity many of them utilize, not being restricted by the mathematical chains of gradients.
• No one expects no-gradient methods to ever go mainstream, because gradient-based optimization performs so well even considering its many problems.
• However, harnessing the power of no-gradient and gradient-
based methods with hybrid optimizers demonstrates
extremely high potential, especially in an era where we are
reaching a computational limit.
