
LESSON 4

ARTIFICIAL INTELLIGENCE
MACHINE LEARNING
ALGORITHMS

GRADIENT DESCENT
AI and ML Libraries

Armadillo
NICTA research center, Australia (original), and independent contributors (present)
C++ | Linear algebra and scientific computing
Used for: Bioinformatics, Computer vision, Econometrics, Pattern recognition, Signal processing, and Statistics.

FANN (Fast Artificial Neural Network Library)
Steffen Nissen (original), several collaborators (present)
C | Developing multi-layer feed-forward artificial neural nets; binds to C#, Python, and others
Used for: Aerospace engineering, AI, Biology, Environmental sciences, Genetics, Image recognition, and Machine learning.

Keras
François Chollet (original), various contributors (present)
Python | Deep learning
Provides: Activation functions, Layers, Objectives, and Optimizers.
GRADIENT DESCENT
AI and ML Libraries

TensorFlow
Google Brain Team
Python | C# | Deep learning
TensorFlow is an open-source library for numerical computation and large-scale machine learning that eases the process of acquiring data, training models, serving predictions, and refining future results. TensorFlow bundles together machine learning and deep learning models and algorithms.

OpenNN
International Center for Numerical Methods in Engineering (original), Artelnics (present)
C++ | Advanced analytics and neural networks implementation
• Capable of implementing any number of layers of non-linear processing units for supervised learning.
• Enables multiprocessing programming using OpenMP.
• Features data mining algorithms as a bundle of functions integrated into other software tools through an API.
• More than just a library: a general-purpose AI software package.
GRADIENT DESCENT

The system was written in several languages, including Java, C++, Prolog, and C#, and runs on SUSE Linux Enterprise Server 11.

https://youtu.be/lI-M7O_bRNg
GRADIENT DESCENT
 Gradient descent is an optimization algorithm, grounded in calculus and linear algebra, that is commonly used to train machine learning models and neural networks.

 Training data helps these models learn over time, and the cost function within gradient descent acts as a barometer, gauging the model's accuracy with each iteration of parameter updates.

 Until the cost function is close to or equal to zero, the model continues to adjust its parameters to yield the smallest possible error.

 Once machine learning models are optimized for accuracy, they can be
powerful tools for artificial intelligence (AI) and computer science
applications.

Recall the formula for a line, y = mx + b, where m represents the slope and b is the intercept on the y-axis. Linear regression moves this line up or down to fit the points in the graph. The gradient descent algorithm behaves similarly, but it operates on a convex function, such as a bowl-shaped cost curve.
GRADIENT DESCENT
Gradient Descent Procedure

1) Initialize the coefficient
2) Evaluate the cost function for the coefficient
3) Get the derivative of the cost
4) Update the coefficient using the learning rate
5) Repeat steps 2-4 until the cost stops decreasing (or a set number of iterations is reached).

Local Minima - points that appear to be minima but are not where the function actually takes its minimum value.

Global Minimum - the point at which the function takes its minimum value.
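As an illustration, here is a minimal Python sketch of the five-step procedure above, applied to a simple convex cost function J(w) = (w - 3)^2. The cost function, starting value, learning rate, and iteration count are all illustrative assumptions, not values from the slides.

# Minimal gradient descent sketch of the five-step procedure.
# The cost function and all constants are illustrative assumptions.

def cost(w):
    return (w - 3.0) ** 2          # convex cost with its global minimum at w = 3

def d_cost(w):
    return 2.0 * (w - 3.0)         # derivative of the cost

w = 0.0                            # 1) initialize the coefficient
learning_rate = 0.1

for step in range(50):
    j = cost(w)                    # 2) evaluate the cost of the coefficient
    grad = d_cost(w)               # 3) get the derivative of the cost
    w = w - learning_rate * grad   # 4) update the coefficient with the learning rate
    if j < 1e-8:                   # 5) repeat until the cost stops decreasing
        break

print(w)  # approaches 3.0, the global minimum

Because the cost is convex, every run from any starting point ends at the same global minimum; with a non-convex cost, the same loop could instead settle in one of the local minima described above.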
GRADIENT DESCENT
Explanation

For a known value y and a prediction guess computed from the input x with y = mx + b:

error/loss = y - guess

The cost function J sums the squared error over the training examples:

J = Σ (guess_i - y_i)²  for i = 1 … m  = Σ (e_i)²

By the chain rule, the derivative of the cost with respect to the slope is

dJ/dm = (dJ/de) × (de/dm)

and each update scales this term, ([2] * error * x), by the learning rate α:

update = ([2] * error * x) * α

Goal: minimize the error (optimization)
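A minimal sketch of this update in Python, fitting only the slope m of guess = m * x (with b fixed at 0 for brevity). The data, learning rate α, and iteration count are illustrative assumptions; the update uses dJ/dm = -2 * error * x with error = y - guess, matching the ([2] * error * x) * α term above.

# Fit the slope m of guess = m * x by minimizing J = sum((y - guess)^2).
# Data, alpha, and the iteration count are illustrative assumptions.

xs = [1.0, 2.0, 3.0, 4.0]
ys = [2.0, 4.0, 6.0, 8.0]    # generated by y = 2x, so the true slope is 2

m = 0.0                      # initial guess for the slope
alpha = 0.01                 # learning rate

for _ in range(200):
    # dJ/dm summed over the examples: -2 * error * x, where error = y - guess
    grad = sum(-2.0 * (y - m * x) * x for x, y in zip(xs, ys))
    m = m - alpha * grad     # gradient descent update

print(m)  # converges toward 2.0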


GRADIENT DESCENT
Example and Analogy
GRADIENT DESCENT
Types

There are three types of gradient descent learning algorithms: batch gradient descent, stochastic gradient descent, and mini-batch gradient descent.

1) Batch gradient descent

Batch gradient descent sums the error for each point in the training set, updating the model only after all training examples have been evaluated. This process is referred to as a training epoch.

While this batching provides computational efficiency, it can still have a long processing time for large training datasets, since all of the data must be kept in memory. Batch gradient descent usually produces a stable error gradient and convergence, but sometimes that convergence point is not ideal, finding a local minimum rather than the global one.
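One possible sketch of this in Python with NumPy, fitting y = mx + b: the gradient is computed over the entire training set and the parameters are updated once per epoch. The synthetic data, learning rate, and epoch count are illustrative assumptions.

import numpy as np

# Illustrative synthetic data generated from y = 2x + 1
x = np.linspace(0.0, 1.0, 100)
y = 2.0 * x + 1.0

m, b = 0.0, 0.0
lr = 0.1

for epoch in range(2000):
    guess = m * x + b
    error = y - guess
    # Gradients averaged over ALL training examples,
    # so the parameters are updated only once per epoch.
    grad_m = (-2.0 * error * x).mean()
    grad_b = (-2.0 * error).mean()
    m -= lr * grad_m
    b -= lr * grad_b

print(m, b)  # approaches 2.0 and 1.0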
GRADIENT DESCENT
Types

2) Stochastic gradient descent

Stochastic gradient descent (SGD) processes one training example at a time during each epoch, updating the parameters after every individual example in the dataset.

Since only a single training example needs to be held at a time, it is easier to store in memory. While these frequent updates can offer more detail and speed, they can reduce computational efficiency compared to batch gradient descent. The frequent updates also produce noisy gradients, but this noise can be helpful for escaping a local minimum and finding the global one.
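A minimal sketch of the same fitting problem with a stochastic update: the parameters are adjusted after every individual training example rather than once per epoch. The data, learning rate, and epoch count are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 100)
y = 2.0 * x + 1.0

m, b = 0.0, 0.0
lr = 0.05

for epoch in range(50):
    for i in rng.permutation(len(x)):     # visit the examples in random order
        error = y[i] - (m * x[i] + b)     # error for ONE example
        m += lr * 2.0 * error * x[i]      # update immediately (noisy gradient)
        b += lr * 2.0 * error

print(m, b)  # approaches 2.0 and 1.0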
GRADIENT DESCENT
Types

3) Mini-batch gradient descent

Mini-batch gradient descent combines concepts from both batch gradient descent and stochastic gradient descent. It splits the training dataset into small batches and performs an update on each of those batches. This approach strikes a balance between the computational efficiency of batch gradient descent and the speed of stochastic gradient descent.
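A minimal sketch of the mini-batch variant for the same fitting problem: the training set is shuffled, split into small batches, and the parameters are updated once per batch. The data, learning rate, epoch count, and batch size are illustrative assumptions.

import numpy as np

rng = np.random.default_rng(0)
x = np.linspace(0.0, 1.0, 100)
y = 2.0 * x + 1.0

m, b = 0.0, 0.0
lr = 0.1
batch_size = 16                               # illustrative choice

for epoch in range(200):
    order = rng.permutation(len(x))           # shuffle, then split into batches
    for start in range(0, len(x), batch_size):
        idx = order[start:start + batch_size]     # one small batch
        error = y[idx] - (m * x[idx] + b)
        m += lr * (2.0 * error * x[idx]).mean()   # one update per batch
        b += lr * (2.0 * error).mean()

print(m, b)  # approaches 2.0 and 1.0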
GRADIENT DESCENT
Performance Tips

Plot Cost versus Time: Collect and plot the cost values calculated by the algorithm on each iteration. The expectation for a well-performing gradient descent run is a decrease in cost on each iteration. If the cost does not decrease, try reducing your learning rate.
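For example, a history of cost values can be recorded during training and plotted afterwards; the synthetic data and training loop below are illustrative assumptions, and matplotlib is only needed for the final plot.

import numpy as np
import matplotlib.pyplot as plt

x = np.linspace(0.0, 1.0, 100)
y = 2.0 * x + 1.0
m, b, lr = 0.0, 0.0, 0.1

costs = []
for epoch in range(200):
    error = y - (m * x + b)
    costs.append((error ** 2).mean())   # record the cost on each iteration
    m += lr * (2.0 * error * x).mean()
    b += lr * (2.0 * error).mean()

plt.plot(costs)                         # the curve should fall on each iteration;
plt.xlabel("iteration")                 # if it does not, reduce the learning rate
plt.ylabel("cost")
plt.show()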

Learning Rate: The learning rate value is a small real value such as 0.1, 0.001 or 0.0001. Try different
values for your problem and see which works best.

Rescale Inputs: The algorithm will reach the minimum cost faster if the shape of the cost function is not skewed and distorted. You can achieve this by rescaling all of the input variables (X) to the same range, such as [0, 1] or [-1, 1].
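For example, min-max scaling maps each input variable to the [0, 1] range; the sample data here is an illustrative assumption.

import numpy as np

X = np.array([[2.0, 150.0],
              [4.0, 300.0],
              [6.0, 450.0]])    # illustrative raw inputs on very different scales

# Min-max rescaling of each column (input variable) to [0, 1]
X_scaled = (X - X.min(axis=0)) / (X.max(axis=0) - X.min(axis=0))
print(X_scaled)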

Few Passes: Stochastic gradient descent often does not need more than 1-to-10 passes through the training
dataset to converge on good or good enough coefficients.

Plot Mean Cost: The updates for each training dataset instance can result in a noisy plot of cost over time
when using stochastic gradient descent. Taking the average over 10, 100, or 1000 updates can give you a
better idea of the learning trend for the algorithm.
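For example, a simple moving average smooths the noisy per-update costs produced by stochastic gradient descent; the cost history below is simulated as an illustrative assumption rather than taken from a real run.

import numpy as np

rng = np.random.default_rng(0)
# Simulated noisy per-update costs from an SGD run (illustrative)
costs = np.exp(-np.linspace(0, 5, 1000)) + 0.05 * rng.random(1000)

window = 100                      # average over the last 100 updates
mean_costs = np.convolve(costs, np.ones(window) / window, mode="valid")
print(mean_costs[:5])             # smoother view of the learning trend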

From Machinelearningmastery.com
