Unit 1
Input variables (X) → f → Output variables (Y)
Y = f(X)
• The machine learns the target mapping function from the training data, but the form of the function is unknown
• An algorithm learns this target mapping function from the training data
• Based on what we know about the input data, we evaluate different machine learning algorithms and see which is better at approximating the underlying function
• Different algorithms make different assumptions or biases about the
form of the function and how it can be learned.
Nonparametric Machine Learning Algorithms
• Consider the k-nearest neighbors algorithm, which makes predictions for a new data instance based on the k most similar training patterns (a minimal code sketch appears after the list below).
• The method does not assume anything about the form of the mapping function
other than patterns that are close are likely to have a similar output variable.
• Some more examples of popular nonparametric machine learning algorithms are:
– k-Nearest Neighbors
– Decision Trees like CART and C4.5
– Support Vector Machines
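As an illustration of the k-nearest neighbors idea described above, here is a minimal sketch, assuming NumPy arrays and plain Euclidean distance; the function name and default k are illustrative, not taken from the slides:

```python
import numpy as np

def knn_predict(X_train, y_train, x_new, k=3):
    # Distances from the new instance to every stored training pattern
    dists = np.linalg.norm(X_train - x_new, axis=1)
    # Indices of the k most similar (closest) training patterns
    nearest = np.argsort(dists)[:k]
    # Majority vote over their labels gives the prediction
    labels, counts = np.unique(y_train[nearest], return_counts=True)
    return labels[np.argmax(counts)]
```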
Regularization
• Assume that the regularization effect is so high that some of the weights become nearly equal to zero
• This will result in a much simpler linear network and slight underfitting of the training data
• Such a large value of the regularization coefficient is not useful
• Optimize the value of the regularization coefficient in order to obtain a well-fitted model
• There are three very popular and efficient regularization techniques called L1, L2, and
dropout
• L1 and L2 are the most common types of regularization
• These update the cost function, J(θ), by adding another term known as the regularization term:
Cost function = Loss + Regularization term
• Due to the addition of this regularization term, the values of the weight matrices decrease; the underlying assumption is that a neural network with smaller weight matrices leads to simpler models
L2 Regularization
• The most common of all regularization techniques; also commonly known as weight decay or Ridge Regression
• The regularization term is the squared Euclidean norm (L2 norm) of the weight matrices, i.e. the sum over all squared weight values of a weight matrix
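Written out, the L2-regularized cost has the following standard form; λ denotes the regularization coefficient (the symbol is an assumption, since the slide's own formula is not reproduced here):

```latex
J_{\mathrm{reg}}(\theta) \;=\; J(\theta) \;+\; \frac{\lambda}{2}\sum_{i} w_i^{2}
```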
L1 Regularization
• Also known as Lasso regression
• The regularization term is the sum of the absolute values of the weight parameters in a weight matrix:
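In the same notation, the L1-regularized cost (standard Lasso form, with λ again the regularization coefficient) is:

```latex
J_{\mathrm{reg}}(\theta) \;=\; J(\theta) \;+\; \lambda \sum_{i} \lvert w_i \rvert
```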
Gradient-based Optimization
What is Gradient Descent?
• Gradient descent iteratively updates the model parameters θ in the direction of the negative gradient of the cost function J(θ):
θ = θ − α⋅∇J(θ)
where α is the learning rate and ∇J(θ) is the gradient of the cost function with respect to the parameters
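A minimal sketch of this update loop in Python; the helper name, the toy quadratic cost, and the hyperparameter values are illustrative assumptions:

```python
import numpy as np

def gradient_descent(grad_J, theta0, alpha=0.1, n_iters=100):
    # Plain (batch) gradient descent: repeat theta = theta - alpha * grad J(theta)
    theta = np.asarray(theta0, dtype=float)
    for _ in range(n_iters):
        theta = theta - alpha * grad_J(theta)
    return theta

# Usage on a toy cost J(theta) = (theta - 3)^2, whose gradient is 2 * (theta - 3)
theta_star = gradient_descent(lambda t: 2.0 * (t - 3.0), theta0=[0.0])
# theta_star ends up close to [3.0]
```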
Stochastic Gradient Descent
• Even after increasing the number of iterations, the computation cost is still less
than that of the gradient descent optimizer
• If the data is enormous and computational time is an essential factor,
stochastic gradient descent should be preferred over gradient descent
algorithm.
• Example: if the dataset contains 1000 rows, SGD updates the model parameters 1000 times (with batch size = 1) in one pass over the dataset, instead of once as in Gradient Descent
θ = θ − α⋅∇J (the update is computed for each training sample)
• Because the model parameters are updated so frequently, they have high variance, and the loss function fluctuates with varying intensity
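A minimal per-sample SGD sketch for linear regression with squared loss; the function name, the choice of loss, and the hyperparameters are illustrative assumptions, not from the slides:

```python
import numpy as np

def sgd_linear_regression(X, y, alpha=0.01, n_epochs=10, seed=0):
    # Per-sample SGD (batch size = 1): one epoch makes len(X) parameter updates
    rng = np.random.default_rng(seed)
    w = np.zeros(X.shape[1])
    b = 0.0
    for _ in range(n_epochs):
        for i in rng.permutation(len(X)):      # visit samples in a new random order each epoch
            err = X[i] @ w + b - y[i]          # prediction error on a single sample
            w -= alpha * err * X[i]            # theta = theta - alpha * grad of the per-sample loss
            b -= alpha * err
    return w, b
```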
• Advantages:
• Frequent updates of the model parameters, hence convergence in less time
• Requires less memory, as there is no need to store the loss values over the whole dataset
• May find new (possibly better) minima
• Disadvantages:
• High variance in the model parameters
• May overshoot even after reaching the global minimum
• To get the same convergence as gradient descent, the learning rate needs to be reduced slowly
(Figure: comparison of the optimization paths taken by Stochastic Gradient Descent and Gradient Descent)
Dynamic Learning Rate
• Replacing the fixed learning rate with a time-dependent learning rate η(t) adds to the complexity of controlling the convergence of an optimization algorithm
• We need to figure out how rapidly η(t) should decay:
• If it decays too quickly, we stop optimizing prematurely
• If it decays too slowly, we waste too much time on optimization
• There are a few basic strategies that are used for adjusting η over time
Types of Dynamic Learning Rate
• Exponential decay
• Polynomial decay
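Commonly used forms of these schedules, given here as assumed standard forms since the slide's formulas are not reproduced; η₀ is the initial learning rate, t the iteration, and λ, α, β are decay hyperparameters:

```latex
\eta(t) = \eta_0 \, e^{-\lambda t}            % exponential decay
\eta(t) = \eta_0 \, (\beta t + 1)^{-\alpha}   % polynomial decay
```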
Mini-batch Stochastic Gradient Descent
• There are two extremes in the approach to gradient-based learning:
– Use the full dataset to compute gradients and to update parameters, one pass at a time
– Process one observation at a time to make progress
– Each of them has its own drawbacks
• Gradient Descent is not particularly data efficient whenever the dataset is very large
• Stochastic Gradient Descent is not particularly computationally efficient, since CPUs and GPUs cannot exploit the full power of vectorization
• The Mini-batch Stochastic Gradient Descent optimizer is useful in such cases (a minimal sketch follows below)
• The cost function in mini-batch gradient descent is noisier than in the batch gradient descent algorithm but smoother than in the stochastic gradient descent algorithm
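A minimal mini-batch SGD sketch, again for linear regression with squared loss; the names, batch size, and other hyperparameters are illustrative assumptions:

```python
import numpy as np

def minibatch_sgd(X, y, alpha=0.01, batch_size=32, n_epochs=10, seed=0):
    # Gradients are averaged over small random batches: cheaper than full-batch GD,
    # better vectorized and less noisy than per-sample SGD
    rng = np.random.default_rng(seed)
    w, b = np.zeros(X.shape[1]), 0.0
    for _ in range(n_epochs):
        order = rng.permutation(len(X))
        for start in range(0, len(X), batch_size):
            idx = order[start:start + batch_size]
            err = X[idx] @ w + b - y[idx]              # errors for the whole mini-batch
            w -= alpha * (X[idx].T @ err) / len(idx)   # averaged gradient step
            b -= alpha * err.mean()
    return w, b
```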
All types of Gradient Descent have some challenges
• Choosing an optimum value of the learning rate: if the learning rate is too small, then gradient descent may take ages to converge
• They use a constant learning rate for all the parameters, although some parameters may not need to be changed at the same rate
• May get trapped at local minima
Momentum
• Intuition:
• If a person is repeatedly asked to move in the same direction, they probably gain confidence and start taking bigger steps in that direction
• Just as a ball gains momentum while rolling down a slope
• Stochastic gradient descent takes a much more noisy path than the gradient
descent algorithm
• Therefore, it requires a larger number of iterations to reach the optimal minimum, and hence computation is very slow
• Momentum is used for reducing high variance in SGD
• Momentum helps reach convergence in less time
SGD with Momentum
• Update the weights and bias in SGD using an additional velocity term that accumulates past gradients (standard form; γ is the momentum coefficient, typically about 0.9):
v = γ⋅v + α⋅∇J(θ)
θ = θ − v
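The same update as a one-step Python sketch; the function name and default values are illustrative:

```python
def sgd_momentum_step(theta, v, grad, alpha=0.01, gamma=0.9):
    # One SGD-with-momentum update: the velocity v accumulates past gradients,
    # which damps the oscillations of plain SGD
    v = gamma * v + alpha * grad     # v = gamma * v + alpha * grad J(theta)
    theta = theta - v                # theta = theta - v
    return theta, v
```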
Adaptive Gradient Algorithm (Adagrad)
• Value of learning rate can change the pace of training
• Learning rate is constant in GD, SGD and SGD with momentum
• Adagrad does not use momentum
• For sparse features (where most of the values are zero), a high learning rate is required to boost the small gradient values
• For dense features, the learning rate can be lower
• The solution is to have an adaptive learning rate that can change according to the input provided
• The Adagrad optimizer decays the learning rate in proportion to the update history of the gradients
• This means that when there are larger updates, more history is accumulated, which in turn reduces the learning rate, and vice versa
• One disadvantage of this approach is that the learning rate decays aggressively
and after some time it approaches zero
• The learning rate is modified for a given weight at a given time, based on previous gradients
• Store the sum of the squares of the gradients up to time step t: αt = αt−1 + gt²
• The update becomes θ = θ − (η / √(αt + ϵ))⋅gt, where ϵ is a smoothing term that avoids division by zero
• The learning rate for each parameter thus decays in proportion to its update history
• The history of the gradients is accumulated in αt
• The smaller the accumulated gradient, the smaller the αt value will be, leading to a bigger effective learning rate (because αt divides η)
• It therefore makes big updates for less frequent parameters and small steps for frequent parameters
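A one-step Adagrad sketch in Python corresponding to the update above; the names and default values are illustrative:

```python
import numpy as np

def adagrad_step(theta, hist, grad, eta=0.01, eps=1e-8):
    # hist plays the role of alpha_t: a running sum of squared gradients, so
    # frequently updated parameters get a smaller effective learning rate
    hist = hist + grad ** 2
    theta = theta - eta * grad / (np.sqrt(hist) + eps)
    return theta, hist
```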
Advantages:
• Learning rate changes for each training parameter.
• Don’t need to manually tune the learning rate.
• Able to train on sparse data.
• Reaches convergence at a higher speed
Disadvantages:
• Computationally more expensive, as the squared-gradient history has to be computed and stored for every parameter.
• It decreases the learning rate aggressively and monotonically
• There might be a point when the learning rate becomes extremely small, because the squared gradients in the denominator keep accumulating and the denominator therefore keeps on increasing
• Due to small learning rates, the model eventually becomes unable to acquire
more knowledge, and hence the accuracy of the model is compromised
Adadelta
• An extension of Adagrad that attempts to solve its radically diminishing learning rates
• Instead of summing up all the past squared gradients from time step 1 to t, it uses an exponentially weighted average over the gradients
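A common way to write the exponentially weighted average of squared gradients used here (ρ is the decay rate, e.g. 0.9):

```latex
E[g^{2}]_t = \rho\, E[g^{2}]_{t-1} + (1-\rho)\, g_t^{2}
```

The step is then scaled by the square root of this running average rather than by Adagrad's ever-growing sum, so the effective learning rate no longer decays monotonically towards zero; the full Adadelta rule additionally replaces the global learning rate η by a running average of past parameter updates.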
Adaptive Moment Estimation (Adam)
• Intuition: we don't want to roll so fast that we jump over the minimum; we want to decrease the velocity a little bit for a careful search
• Adam is widely regarded as the best general-purpose optimizer
• Requires less time and is efficient
• For sparse data, use optimizers with a dynamic learning rate
• Combines the power of Adadelta and momentum-based SGD
• The power of momentum SGD to hold a history of updates, together with the adaptive learning rate provided by Adadelta, makes it a powerful method
• Adam keeps an exponentially weighted average of the past gradients (first moment) and an exponentially weighted average of the past squared gradients (second moment)
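For reference, the standard Adam update built from these two moving averages; β₁ ≈ 0.9, β₂ ≈ 0.999 and ϵ are the usual defaults, and the slide's own formulas are not reproduced here:

```latex
m_t = \beta_1 m_{t-1} + (1-\beta_1)\, g_t            % moving average of past gradients
v_t = \beta_2 v_{t-1} + (1-\beta_2)\, g_t^{2}        % moving average of past squared gradients
\hat{m}_t = \frac{m_t}{1-\beta_1^{t}}, \qquad \hat{v}_t = \frac{v_t}{1-\beta_2^{t}}   % bias correction
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{\hat{v}_t} + \epsilon}\, \hat{m}_t
```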
• Advantages:
• The method is very fast and converges rapidly.
• Rectifies the vanishing learning rate and high variance problems.
• Disadvantages:
• Computationally costly.
RMSProp
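The slide's formula is not reproduced here; the standard RMSProp update, which also shows the role of the decay rate β mentioned below, is:

```latex
E[g^{2}]_t = \beta\, E[g^{2}]_{t-1} + (1-\beta)\, g_t^{2}
\theta_{t+1} = \theta_t - \frac{\eta}{\sqrt{E[g^{2}]_t} + \epsilon}\, g_t
```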
• Advantage: the algorithm converges quickly and requires less tuning than gradient descent algorithms and their variants
• Disadvantage: the learning rate has to be defined manually, and the suggested value of β does not work for every application
Challenges in Neural Network Optimization
• Gradient-based methods face several challenges in neural network optimization, such as getting trapped in the jagged local minima of the loss surface; gradient-free methods such as surrogate optimization and simulated annealing are alternatives
Gradient-free Optimization
Surrogate Optimization
• Surrogate optimization fits a simpler model, the surrogate function (for example a polynomial), to sampled points of the true loss function and then optimizes that model instead.
• Because finding the minima of polynomials is a very well-studied subject, and a host of very efficient derivative-based methods exist for finding the global minimum of a polynomial, we can assume that the global minimum of the surrogate function is close to that of the loss function.
• Surrogate optimization is technically a non-iterative method, although training the surrogate function is often iterative; it is also technically a gradient-free method, although effective mathematical methods for finding the global minimum of the modelling function are often based on derivatives.
• However, because both the iterative and gradient-based properties are 'secondary' to surrogate optimization, it can handle large data and non-differentiable optimization problems.
• Optimization using a surrogate function is quite clever in a few ways:
• It is essentially smoothing out the surface of the true loss function, which
reduces the jagged local minima that cause so much of the extra training time
in neural networks.
• It projects a difficult problem onto a much easier one: whether the surrogate is a polynomial, an RBF (radial basis function) model, a GP (Gaussian process), MARS (multivariate adaptive regression splines), or another surrogate model, the task of finding the global minimum can draw on a large body of mathematical knowledge.
• Overfitting the surrogate model is not really much of an issue, because even with a fair bit of overfitting, the surrogate function is still smoother and less jagged than the true loss function. Together with the other ways in which these more mathematically inclined models are simplified, this makes training surrogate models much easier.
• Surrogate optimization is not limited to a view of where it currently is: it sees the 'entire function', as opposed to gradient descent, which must continually make risky choices about whether it thinks there will be a deeper minimum over the next hill.
Surrogate Annealing
• Surrogate optimization is almost always faster than gradient
descent methods, but often at the cost of accuracy.
• Using surrogate optimization may only be able to pinpoint the rough location of the global minimum, but this can still be tremendously beneficial.
• An alternative is a hybrid model: surrogate optimization is used to bring the neural network parameters to the rough location, from which gradient descent can be used to find the exact global minimum.
• Another is to use the surrogate model to guide the optimizer's decisions, since the surrogate function
– a) can 'see ahead' and
– b) is less sensitive to the specific ups and downs of the loss function.
Simulated Annealing
• Simulated Annealing is a concept based on annealing in metallurgy, in
which a material can be heated above its recrystallization
temperature to reduce its hardness and alter other physical and
occasionally chemical properties, and is then allowed to cool gradually and become rigid again.
• Using the notion of slow cooling, simulated annealing slowly
decreases the probability of accepting worse solutions as the solution
space is explored.
• Because accepting worse solutions allows for a broader search for the global minimum (think of crossing the hill to reach a deeper valley), simulated annealing assumes that the possibilities are properly represented and explored in the first iterations.
• As time progresses, the algorithm moves away from exploration and towards exploitation.
• The following is a rough outline of how simulated annealing algorithms work (a minimal code sketch follows the list):
– The temperature is set at some initial positive value
and progressively approaches zero.
– At each time step, the algorithm randomly chooses a solution close to the current one, measures its quality, and moves to it with a probability that depends on that quality and on the current temperature (better solutions are always accepted, worse ones only with a temperature-dependent probability).
– Ideally, by the time the temperature reaches zero, the algorithm has converged on a global-minimum solution.
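A minimal simulated-annealing sketch following this outline; the cost and neighbour functions, the cooling schedule, and the constants are illustrative assumptions:

```python
import math
import random

def simulated_annealing(cost, neighbour, x0, T0=1.0, cooling=0.995, n_steps=10000):
    # cost(x) scores a candidate solution; neighbour(x) proposes a nearby one
    x, fx, T = x0, cost(x0), T0
    best, fbest = x, fx
    for _ in range(n_steps):
        cand = neighbour(x)
        delta = cost(cand) - fx
        # Always accept improvements; accept worse moves with probability exp(-delta / T)
        if delta < 0 or random.random() < math.exp(-delta / T):
            x, fx = cand, fx + delta
        if fx < fbest:
            best, fbest = x, fx
        T *= cooling   # slow "cooling": exploration gradually gives way to exploitation
    return best, fbest
```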
• The simulation can be performed with
kinetic equations or with stochastic
sampling methods.
• Simulated Annealing has been used to solve the travelling salesman problem, which tries to find the shortest route through hundreds of locations, represented by data points.
• Obviously, the number of possible routes is enormous, but simulated annealing, with its reminiscence of reinforcement learning, performs very well.
• Simulated annealing performs especially well in
scenarios where an approximate solution is
required in a short period of time, outperforming
the slow pace of gradient descent.
• Like surrogate optimization, it can be used in a hybrid with gradient descent to get the benefits of both: the speed of simulated annealing and the accuracy of gradient descent.