
Batch Normalization

Introduction
• Normalization - bringing numerical data onto a common scale without distorting its shape
• Reason - the neural network processes the data more easily and generalizes better
• Neural networks process data not as individual samples but as batches (see the short sketch below)
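
As a minimal illustration (a NumPy sketch, not taken from the slides; the feature values are made up), standardizing a batch of features to zero mean and unit variance looks like this:

import numpy as np

# A small batch of 4 samples with 3 features on very different scales
X = np.array([[1.0, 200.0, 0.001],
              [2.0, 180.0, 0.002],
              [3.0, 220.0, 0.003],
              [4.0, 210.0, 0.004]])

# Standardize each feature: subtract its mean, divide by its standard deviation
X_norm = (X - X.mean(axis=0)) / X.std(axis=0)

print(X_norm.mean(axis=0))  # approximately 0 for every feature
print(X_norm.std(axis=0))   # approximately 1 for every feature

The shape of each feature's distribution is preserved; only its location and scale change.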
Why Batch Normalization
• Initially, the input X is normalized before entering the neural network
• But as the data passes through the layers, the activations at the later layers are no longer on the same scale
• This happens because applying the activation function to the data at each layer leads to an internal covariate shift in the data (illustrated in the sketch below)
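
A small sketch (illustrative only; the layer sizes, initialization, and ReLU activation are assumptions, not from the slides) showing how the scale of the activations can drift as a normalized batch passes through several layers:

import numpy as np

rng = np.random.default_rng(0)

x = rng.normal(0.0, 1.0, size=(32, 64))    # normalized input batch: mean ~0, std ~1

# Pass the batch through a few randomly initialized layers with ReLU activations
for layer in range(4):
    W = rng.normal(0.0, 0.5, size=(64, 64))
    x = np.maximum(0.0, x @ W)              # linear transform followed by ReLU
    print(f"layer {layer + 1}: mean = {x.mean():.2f}, std = {x.std():.2f}")

# The printed means and standard deviations drift further from 0 and 1 at each
# layer, so the deeper activations are no longer on the input's normalized scale.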
Internal Covariate Shift
• Suppose a model classifies images into two classes: dog or not dog
• Example: the training set contains only white-dog images
• These images will have a certain distribution
• So the model parameters are trained for that distribution
• If we then get non-white dog images, these have a different distribution
• So the model needs to change its parameters accordingly
• Hence the distribution of the hidden activations also needs to change
• This change in the hidden activations is known as internal covariate shift
• Data distribution - the arrangement of the data points within the dataset
• Internal covariate shift - in deep learning, the target each layer is trying to fit keeps changing during training due to the continuous updates of the weights and biases
• Batch normalization helps us stabilize this moving target, making our
task easier.
How Batch Normalization Works
• It works by normalizing the output of a previous activation layer by
subtracting the batch mean and dividing by the batch standard
deviation.
• However, forcing the outputs to zero mean and unit variance may not match the distribution the layer actually needs to represent.
• To tackle this, batch normalization introduces two learnable
parameters, gamma and beta, which can shift and scale the
normalized values.
• Two-step process:
• Step 1 - the input is normalized
• Step 2 - scaling and offsetting are performed
• Step 1
• Normalization of the input data, so that:
• Mean = 0
• SD = 1
• In this step we take the batch of activations coming from layer h and first calculate the mean of these hidden activations
• m is the number of neurons at layer h
• The next step is to calculate the standard deviation of the hidden activations
• Using μ and σ, we can then normalize the hidden activation values (the formulas are written out below)
• ε (epsilon) - a smoothing term that ensures numerical stability by preventing division by zero
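
Written out in LaTeX (using the slides' notation, where h_i are the hidden activations of layer h and m is the number of neurons in that layer), the Step 1 formulas are the standard batch-normalization statistics:

\mu = \frac{1}{m} \sum_{i=1}^{m} h_i

\sigma = \sqrt{\frac{1}{m} \sum_{i=1}^{m} (h_i - \mu)^2}

\hat{h}_i = \frac{h_i - \mu}{\sqrt{\sigma^2 + \epsilon}}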
Rescaling and Offsetting (Step 2)
• Two components, γ (gamma) and β (beta), are used
• These are learnable parameters that let the network scale and shift the normalized values appropriately for each batch (see the equation and sketch below)
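
In equation form (continuing the notation above), Step 2 applies the learnable scale and shift to each normalized activation:

y_i = \gamma \hat{h}_i + \beta

Putting both steps together, here is a minimal NumPy sketch of the transform described in these slides (not a reference implementation; it computes the statistics per hidden unit over the batch, which is the usual convention, and initializes gamma to 1 and beta to 0 as is typically done before training):

import numpy as np

def batch_norm(h, gamma, beta, eps=1e-5):
    """Two-step batch normalization of a batch of hidden activations h."""
    # Step 1: normalize to zero mean and unit variance
    mu = h.mean(axis=0)
    var = h.var(axis=0)
    h_hat = (h - mu) / np.sqrt(var + eps)   # eps prevents division by zero
    # Step 2: rescale and offset with the learnable parameters
    return gamma * h_hat + beta

# Example: a batch of 8 samples with 4 hidden units, not on a common scale
h = np.random.randn(8, 4) * 5.0 + 3.0
gamma = np.ones(4)    # learnable scale, initialized to 1
beta = np.zeros(4)    # learnable shift, initialized to 0

out = batch_norm(h, gamma, beta)
print(out.mean(axis=0))  # ~0 per unit
print(out.std(axis=0))   # ~1 per unit (since gamma = 1 and beta = 0)

During training, gamma and beta are updated by backpropagation along with the other weights, so the network can recover whatever scale and offset works best for each layer.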
Benefits
• Speeds up learning: By reducing internal covariate shift, it helps the
model train faster.
• Regularizes the model: It adds a little noise to your model, and in some
cases, you might not even need to use dropout or other regularization
techniques.
• Allows higher learning rates: Gradient descent usually requires small
learning rates for the network to converge. Batch normalization helps us
use much larger learning rates, speeding up the training process.
• Overall, training becomes faster and more stable (a short usage sketch follows)
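
As a practical illustration (this uses PyTorch, which the slides do not mention; the layer sizes and learning rate are arbitrary), batch normalization is typically inserted between a linear layer and its activation:

import torch
import torch.nn as nn

# A small classifier with batch normalization after the hidden linear layer
model = nn.Sequential(
    nn.Linear(784, 256),
    nn.BatchNorm1d(256),   # normalizes the 256 hidden activations over the batch
    nn.ReLU(),
    nn.Linear(256, 10),
)

# Thanks to batch normalization, a relatively large learning rate is often usable
optimizer = torch.optim.SGD(model.parameters(), lr=0.1)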
