Batch Normalization

Shusen Wang
Feature Scaling for Linear Models
Why Feature Scaling?
People's feature vectors: $\mathbf{x}_1, \cdots, \mathbf{x}_n \in \mathbb{R}^2$.
• The 1st dimension is a person's income (in dollars).
• Assume it is drawn from the Gaussian distribution $N(3000, 400^2)$.
• The 2nd dimension is a person's height (in inches).
• Assume it is drawn from the Gaussian distribution $N(69, 3^2)$.

• Hessian matrix of the least squares regression model:

  $\mathbf{H} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_i \mathbf{x}_i^\top = \begin{bmatrix} 9137.3 & 206.6 \\ 206.6 & 4.8 \end{bmatrix} \times 10^3$.

• Condition number: $\dfrac{\lambda_{\max}(\mathbf{H})}{\lambda_{\min}(\mathbf{H})} = 9.2 \times 10^4$.

Bad condition number means slow convergence of gradient descent!
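To make the claim concrete, here is a minimal NumPy sketch (my own illustration, not from the slides; the sample size and random seed are arbitrary) that simulates the two features above and checks the condition number of the Hessian:

```python
import numpy as np

# Simulate n people: income in dollars ~ N(3000, 400^2), height in inches ~ N(69, 3^2).
rng = np.random.default_rng(0)
n = 100_000
income = rng.normal(3000.0, 400.0, size=n)
height = rng.normal(69.0, 3.0, size=n)
X = np.stack([income, height], axis=1)               # shape (n, 2)

# Hessian of the least-squares objective (up to a constant factor): H = (1/n) X^T X.
H = X.T @ X / n
eigvals = np.linalg.eigvalsh(H)
print(H)
print("condition number:", eigvals.max() / eigvals.min())   # on the order of 10^4 to 10^5
```

Rescaling the columns (income in thousands of dollars, height in feet) before forming $\mathbf{H}$ reproduces the much smaller condition number shown on the next slide.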
Why Feature Scaling?
People's feature vectors: $\mathbf{x}_1, \cdots, \mathbf{x}_n \in \mathbb{R}^2$.
• The 1st dimension is a person's income (in thousands of dollars).
• Assume it is drawn from the Gaussian distribution $N(3, 0.4^2)$.
• The 2nd dimension is a person's height (in feet).
• Assume it is drawn from the Gaussian distribution $N(5.75, 0.25^2)$.
• Only the units (the metric) change; the data are the same.
• Hessian matrix of linear regression:

  $\mathbf{H} = \frac{1}{n} \sum_{i=1}^{n} \mathbf{x}_i \mathbf{x}_i^\top = \begin{bmatrix} 9.1 & 17.2 \\ 17.2 & 33.1 \end{bmatrix}$.

• Condition number: $\dfrac{\lambda_{\max}(\mathbf{H})}{\lambda_{\min}(\mathbf{H})} = 281.7$. (As opposed to $9.2 \times 10^4$.)
Feature Scaling for 1D Data
Assume the samples $x_1, \cdots, x_n$ are one-dimensional.

• Min-max normalization: $x_i' = \dfrac{x_i - \min_j x_j}{\max_j x_j - \min_j x_j}$.

• After the scaling, the samples $x_1', \cdots, x_n'$ are in $[0, 1]$.
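A minimal NumPy sketch of min-max normalization (my own illustration, not from the slides):

```python
import numpy as np

x = np.array([3.0, 7.0, 1.0, 9.0, 5.0])

# Min-max normalization: rescale the samples into [0, 1].
x_scaled = (x - x.min()) / (x.max() - x.min())
print(x_scaled)   # [0.25 0.75 0.   1.   0.5 ]
```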


Feature Scaling for 1D Data
Assume the samples $x_1, \cdots, x_n$ are one-dimensional.

• Standardization: $x_i' = \dfrac{x_i - \hat{\mu}}{\hat{\sigma}}$.

• $\hat{\mu} = \frac{1}{n} \sum_{i=1}^{n} x_i$ is the sample mean.
• $\hat{\sigma}^2 = \frac{1}{n-1} \sum_{i=1}^{n} (x_i - \hat{\mu})^2$ is the sample variance.

• After the scaling, the samples $x_1', \cdots, x_n'$ have zero mean and unit variance.
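A matching sketch for standardization (again my own illustration); `ddof=1` gives the $\frac{1}{n-1}$ sample variance defined above:

```python
import numpy as np

x = np.array([3.0, 7.0, 1.0, 9.0, 5.0])

# Standardization: subtract the sample mean, divide by the sample std (ddof=1 -> 1/(n-1)).
mu_hat = x.mean()
sigma_hat = x.std(ddof=1)
x_scaled = (x - mu_hat) / sigma_hat
print(x_scaled.mean(), x_scaled.std(ddof=1))   # approximately 0.0 and 1.0
```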
Feature Scaling for High-Dim Data

• Independently perform feature scaling for every feature.


• E.g., when scaling the “height” feature, ignore the “income” feature.
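A short sketch of per-feature scaling on a 2-D data matrix (my own illustration): each column is standardized using only its own statistics, so the "height" column never sees the "income" column.

```python
import numpy as np

# Rows are people, columns are features: [income in dollars, height in inches].
X = np.array([[2500.0, 66.0],
              [3200.0, 70.0],
              [3600.0, 72.0]])

# Standardize each column independently (axis=0).
X_scaled = (X - X.mean(axis=0)) / X.std(axis=0, ddof=1)
print(X_scaled.mean(axis=0))   # approximately [0, 0]
```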
Batch Normalization
Batch Normalization: Standardization of Hidden Layers

• Let $\mathbf{x}^k \in \mathbb{R}^d$ be the output of the $k$-th hidden layer.

• $\hat{\boldsymbol{\mu}} \in \mathbb{R}^d$: sample mean of $\mathbf{x}^k$ evaluated on a batch of samples.
• $\hat{\boldsymbol{\sigma}} \in \mathbb{R}^d$: sample std of $\mathbf{x}^k$ evaluated on a batch of samples.
  (Non-trainable: just record them in the forward pass; use them in the backpropagation.)
• $\boldsymbol{\gamma} \in \mathbb{R}^d$: scaling parameter (trainable).
• $\boldsymbol{\beta} \in \mathbb{R}^d$: shifting parameter (trainable).

• Standardization: $z_j^k = \dfrac{x_j^k - \hat{\mu}_j}{\hat{\sigma}_j + 0.001}$, for $j = 1, \cdots, d$.

• Scale and shift: $x_j^{k+1} = z_j^k \cdot \gamma_j + \beta_j$, for $j = 1, \cdots, d$.
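A minimal NumPy sketch of this forward computation on one batch (my own illustration; the 0.001 in the denominator follows the slide, and the batch statistics are taken per dimension):

```python
import numpy as np

def batch_norm_forward(X, gamma, beta, eps=0.001):
    """X: (batch_size, d) activations of the k-th layer; gamma, beta: (d,) trainable."""
    mu_hat = X.mean(axis=0)                    # per-dimension batch mean, shape (d,)
    sigma_hat = X.std(axis=0)                  # per-dimension batch std, shape (d,)
    Z = (X - mu_hat) / (sigma_hat + eps)       # standardization
    X_out = Z * gamma + beta                   # scale and shift
    return X_out, Z, mu_hat, sigma_hat         # keep Z, mu, sigma for backpropagation

rng = np.random.default_rng(0)
X = rng.normal(5.0, 2.0, size=(32, 4))         # a batch of 32 samples, d = 4
gamma, beta = np.ones(4), np.zeros(4)
X_out, Z, mu_hat, sigma_hat = batch_norm_forward(X, gamma, beta)
print(Z.mean(axis=0), Z.std(axis=0))           # approximately 0 and 1 per dimension
```

The returned $\mathbf{z}^k$, $\hat{\boldsymbol{\mu}}$, and $\hat{\boldsymbol{\sigma}}$ are kept because the backward pass, derived next, needs them.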
Backpropagation for Batch Normalization Layer

• Standardization: $z_j^k = \dfrac{x_j^k - \hat{\mu}_j}{\hat{\sigma}_j + 0.001}$, for $j = 1, \cdots, d$.

• Scale and shift: $x_j^{k+1} = z_j^k \cdot \gamma_j + \beta_j$, for $j = 1, \cdots, d$.

  $\mathbf{x}^k \;\longrightarrow\; \text{standardization} \;\longrightarrow\; \mathbf{z}^k \;\longrightarrow\; \text{scaling \& shifting} \;\longrightarrow\; \mathbf{x}^{k+1}$

  Standardization uses $\hat{\boldsymbol{\mu}}, \hat{\boldsymbol{\sigma}}$ (non-trainable); scaling & shifting uses $\boldsymbol{\gamma}, \boldsymbol{\beta}$ (trainable).

We know $\dfrac{\partial L}{\partial x_j^{k+1}}$ from the backpropagation (from the top down to $\mathbf{x}^{k+1}$).

• Use $\dfrac{\partial L}{\partial \gamma_j} = \dfrac{\partial L}{\partial x_j^{k+1}} \cdot \dfrac{\partial x_j^{k+1}}{\partial \gamma_j} = \dfrac{\partial L}{\partial x_j^{k+1}} \cdot z_j^k$ to update $\gamma_j$.

• Use $\dfrac{\partial L}{\partial \beta_j} = \dfrac{\partial L}{\partial x_j^{k+1}} \cdot \dfrac{\partial x_j^{k+1}}{\partial \beta_j} = \dfrac{\partial L}{\partial x_j^{k+1}}$ to update $\beta_j$.

• Compute $\dfrac{\partial L}{\partial z_j^k} = \dfrac{\partial L}{\partial x_j^{k+1}} \cdot \dfrac{\partial x_j^{k+1}}{\partial z_j^k} = \dfrac{\partial L}{\partial x_j^{k+1}} \cdot \gamma_j$.

• Compute $\dfrac{\partial L}{\partial x_j^k} = \dfrac{\partial L}{\partial z_j^k} \cdot \dfrac{\partial z_j^k}{\partial x_j^k} = \dfrac{\partial L}{\partial z_j^k} \cdot \dfrac{1}{\hat{\sigma}_j + 0.001}$ and pass it to the lower layers.
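A sketch of these gradient formulas in NumPy, continuing the forward-pass sketch above (my own illustration; like the slide, it treats $\hat{\boldsymbol{\mu}}$ and $\hat{\boldsymbol{\sigma}}$ as constants and sums the per-sample gradients over the batch for $\boldsymbol{\gamma}$ and $\boldsymbol{\beta}$):

```python
import numpy as np

def batch_norm_backward(dX_out, Z, sigma_hat, gamma, eps=0.001):
    """dX_out: dL/dx^(k+1), shape (batch_size, d), received from the layer above."""
    dgamma = (dX_out * Z).sum(axis=0)          # dL/dgamma_j = sum_i dL/dx_ij^(k+1) * z_ij^k
    dbeta = dX_out.sum(axis=0)                 # dL/dbeta_j  = sum_i dL/dx_ij^(k+1)
    dZ = dX_out * gamma                        # dL/dz_j^k   = dL/dx_j^(k+1) * gamma_j
    dX = dZ / (sigma_hat + eps)                # dL/dx_j^k, passed to the lower layers
    return dX, dgamma, dbeta
```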
Batch Normalization Layer in Keras
Batch Normalization Layer
• Let $\mathbf{x}^k \in \mathbb{R}^d$ be the output of the $k$-th hidden layer.
• $\hat{\boldsymbol{\mu}}, \hat{\boldsymbol{\sigma}} \in \mathbb{R}^d$: non-trainable parameters.
• $\boldsymbol{\gamma}, \boldsymbol{\beta} \in \mathbb{R}^d$: trainable parameters.

Difficulty: There are $4d$ parameters that must be stored in memory, and $d$ can be very large!

Example:
• The 1st Conv layer in VGG16 Net outputs a $150 \times 150 \times 64$ tensor, so $d = 150 \times 150 \times 64 = 1.44$M.
• The number of parameters in a single Batch Normalization layer would then be $4d = 5.76$M.

Solution:
• Make the 4 parameter tensors $1 \times 1 \times 64$, instead of $150 \times 150 \times 64$.
• How? Use one scalar parameter for each slice (e.g., a $150 \times 150$ matrix) of the tensor.
• Of course, you can instead make the parameters $150 \times 1 \times 1$ or $1 \times 150 \times 1$.
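A short Keras sketch (my own check of the point above) confirming that `layers.BatchNormalization()` keeps one parameter per channel by default (`axis=-1`), so a $150 \times 150 \times 64$ feature map gets $4 \times 64 = 256$ parameters rather than $4 \times 1.44$M:

```python
from keras import models, layers

model = models.Sequential()
# BatchNormalization normalizes over the last axis (the channels) by default.
model.add(layers.BatchNormalization(input_shape=(150, 150, 64)))
model.summary()   # 256 parameters: gamma, beta (trainable) and mean, variance (non-trainable), 64 each
```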
CNN for Digit Recognition

3. Use Batch Normalization

from keras import models
from keras import layers

model = models.Sequential()
# Block 1: Conv -> BatchNorm -> ReLU -> MaxPool (BN after Conv, before Activation).
model.add(layers.Conv2D(10, (5, 5), input_shape=(28, 28, 1)))
model.add(layers.BatchNormalization())
model.add(layers.Activation('relu'))
model.add(layers.MaxPooling2D((2, 2)))
# Block 2: Conv -> BatchNorm -> ReLU -> MaxPool.
model.add(layers.Conv2D(20, (5, 5)))
model.add(layers.BatchNormalization())
model.add(layers.Activation('relu'))
model.add(layers.MaxPooling2D((2, 2)))
# Classifier head: Flatten -> Dense -> BatchNorm -> ReLU -> softmax over the 10 digits.
model.add(layers.Flatten())
model.add(layers.Dense(100))
model.add(layers.BatchNormalization())
model.add(layers.Activation('relu'))
model.add(layers.Dense(10, activation='softmax'))

# print the summary of the model.
model.summary()


CNN for Digit Recognition

Train the model (with Batch Normalization) on MNIST ($n$ = 50,000).

Train the model (without Batch Normalization) on MNIST ($n$ = 50,000).

Without Batch Normalization, it takes 10x more epochs to converge.
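For completeness, a hedged sketch of how the training step could look with the model built above (the slides do not show this code; the optimizer, batch size, and epoch count are my own placeholder choices, and `validation_split=1/6` leaves $n = 50{,}000$ training samples as on the slide):

```python
from keras.datasets import mnist
from keras.utils import to_categorical

# Load MNIST, reshape to (28, 28, 1) images in [0, 1], and one-hot encode the labels.
(x_train, y_train), _ = mnist.load_data()
x_train = x_train.reshape(-1, 28, 28, 1).astype('float32') / 255.0
y_train = to_categorical(y_train, 10)

model.compile(optimizer='rmsprop',
              loss='categorical_crossentropy',
              metrics=['accuracy'])
model.fit(x_train, y_train, epochs=10, batch_size=32, validation_split=1/6)
```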
Summary
Feature Scaling

• Make the scales of all the features comparable.

• Why? Better-conditioned Hessian matrix → faster convergence.
• Methods:
  • Min-Max Normalization: scale the features to [0, 1].
  • Standardization: every feature has zero mean and unit variance.
Batch Normalization

• Feature standardization for the hidden layers.

• Why? Faster convergence.
• 2 trainable parameters: shifting and scaling.
• 2 non-trainable parameters: mean and variance.
• Keras provides layers.BatchNormalization().
• Put the BN layer after Conv and before Activation.
