How to Fix the Vanishing Gradient Problem Using the Rectified Linear Activation Function


The vanishing gradients problem is one example of unstable behavior that you may encounter when training a deep neural network.

It describes the situation where a deep multilayer feed-forward network or a recurrent neural network
is unable to propagate useful gradient information from the output end of the model back to the
layers near the input end of the model.

The result is the general inability of models with many layers to learn on a given dataset, or their tendency to converge prematurely to a poor solution.

Many fixes and workarounds have been proposed and investigated, such as alternate weight
initialization schemes, unsupervised pre-training, layer-wise training, and variations on gradient
descent. Perhaps the most common change is the use of the rectified linear activation function that
has become the new default, instead of the hyperbolic tangent activation function that was the default
through the late 1990s and 2000s.
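
As a quick illustration of the two functions (a minimal NumPy sketch, not code from the tutorial):

import numpy as np

def tanh(x):
    # saturates toward -1 and 1, so its derivative shrinks toward zero for large |x|
    return np.tanh(x)

def relu(x):
    # rectified linear: passes positive inputs unchanged and zeros out the rest,
    # so its derivative is exactly 1 for positive inputs and does not shrink
    return np.maximum(0.0, x)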
In this tutorial, you will discover how to diagnose a vanishing gradient problem when training a
neural network model and how to fix it using an alternate activation function and weight initialization
scheme.

After completing this tutorial, you will know:

- The vanishing gradients problem limits the development of deep neural networks with classically popular activation functions such as the hyperbolic tangent.
- How to fix a deep neural network Multilayer Perceptron for classification using ReLU and He weight initialization.
- How to use TensorBoard to diagnose a vanishing gradient problem and confirm the impact of ReLU in improving the flow of gradients through the model.
Kick-start your project with my new book Better Deep Learning, including step-by-step
tutorials and the Python source code files for all examples.
Let’s get started.

- Updated Oct/2019: Updated for Keras 2.3 and TensorFlow 2.0.


How to Fix the Vanishing Gradient Problem Using the Rectified Linear Activation Function

Photo by Liam Moloney, some rights reserved.

Tutorial Overview
This tutorial is divided into five parts; they are:

1. Vanishing Gradients Problem
2. Two Circles Binary Classification Problem
3. Multilayer Perceptron Model for Two Circles Problem
4. Deeper MLP Model with ReLU for Two Circles Problem
5. Review Average Gradient Size During Training

Vanishing Gradients Problem
Neural networks are trained using stochastic gradient descent.

This involves first calculating the prediction error made by the model and using the error to estimate
a gradient used to update each weight in the network so that less error is made next time. This error
gradient is propagated backward through the network from the output layer to the input layer.
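
As a minimal sketch of a single update (plain NumPy, with hypothetical weight and gradient values):

import numpy as np

weights = np.array([0.5, -0.3, 0.8])    # hypothetical weights for one layer
gradient = np.array([0.1, -0.2, 0.05])  # hypothetical error gradient for those weights
learning_rate = 0.01

# step each weight against its gradient so that less error is made next time
weights = weights - learning_rate * gradient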
It is desirable to train neural networks with many layers, as the addition of more layers increases the
capacity of the network, making it capable of learning a large training dataset and efficiently
representing more complex mapping functions from inputs to outputs.

A problem with training networks with many layers (e.g. deep neural networks) is that the gradient
diminishes dramatically as it is propagated backward through the network. The error may be so small
by the time it reaches layers close to the input of the model that it may have very little effect. As
such, this problem is referred to as the “vanishing gradients” problem.
Vanishing gradients make it difficult to know which direction the parameters should move to
improve the cost function …

— Page 290, Deep Learning, 2016.


In fact, the error gradient can be unstable in deep neural networks and not only vanish, but also
explode, where the gradient exponentially increases as it is propagated backward through the
network. This is referred to as the “exploding gradient” problem.
The term vanishing gradient refers to the fact that in a feedforward network (FFN) the
backpropagated error signal typically decreases (or increases) exponentially as a function of the
distance from the final layer.

— Random Walk Initialization for Training Very Deep Feedforward Networks, 2014.
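
To make the exponential decay concrete, here is a toy NumPy calculation (an illustrative sketch assuming a chain of 20 tanh units, each with a pre-activation value of 1.0):

import numpy as np

layers = 20
# the derivative of tanh(x) is 1 - tanh(x)^2, which is at most 1 and often much smaller
local_grad = 1.0 - np.tanh(1.0) ** 2   # about 0.42
# backpropagation multiplies the local derivatives layer by layer
print(local_grad ** layers)            # about 3e-8: the gradient has effectively vanished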
Vanishing gradients is a particular problem with recurrent neural networks as the update of the
network involves unrolling the network for each input time step, in effect creating a very deep
network that requires weight updates. A modest recurrent neural network may have 200-to-400 input
time steps, resulting conceptually in a very deep network.

The vanishing gradients problem may manifest in a Multilayer Perceptron as a slow rate of improvement during training and perhaps premature convergence, e.g. continued training does not result in any further improvement. Inspecting the changes to the weights during training, we would see more change (i.e. more learning) occurring in the layers closer to the output layer and less change occurring in the layers close to the input layer.
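
One way to perform this inspection in Keras is the TensorBoard callback; a sketch (the log directory and fit arguments here are placeholders):

from tensorflow.keras.callbacks import TensorBoard

# record histograms of each layer's weights every epoch, so that layer-by-layer
# change can be compared between the input and output ends of the model
tb = TensorBoard(log_dir='logs/two_circles', histogram_freq=1)
# model.fit(trainX, trainy, epochs=500, callbacks=[tb], verbose=0)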

There are many techniques that can be used to reduce the impact of the vanishing gradients problem
for feed-forward neural networks, most notably alternate weight initialization schemes and use of
alternate activation functions.
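
For example, in Keras both fixes amount to two arguments on a hidden layer (a sketch; the layer width of 5 is illustrative):

from tensorflow.keras.layers import Dense

# ReLU avoids the saturating regions of the hyperbolic tangent, and He
# initialization scales the starting weights to suit the ReLU
hidden = Dense(5, activation='relu', kernel_initializer='he_uniform')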
