AMLD – Deep Learning in PyTorch

6. Visualizations, auto-encoders and GANs

François Fleuret
February 10, 2018


Visualizing the processing

A simple approach to analyzing “what a network is doing” is to compute the
derivative of the output with respect to the input.

Zeiler and Fergus (2014) proposed to improve this approach by constructing a
deconvolutional network and computing the “activating pattern” of a sample.

Springenberg et al. (2014) improved upon deconvolution with guided
back-propagation, which aims at the best of both worlds: discarding structures
which would not contribute positively to the final response, and discarding
structures which are not already present.

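So we get three visualization methods, which differ only in the quantities
propagated through ReLU layers during the back-pass. With s the input of the
ReLU, x its output, and l the quantity being differentiated:

• back-propagation (Erhan et al., 2009; Simonyan et al., 2013):

  $\frac{\partial l}{\partial x} \, 1_{\{s > 0\}}$,

• deconvolution (Zeiler and Fergus, 2014):

  $\frac{\partial l}{\partial x} \, 1_{\{\frac{\partial l}{\partial x} > 0\}}$,

• guided back-propagation (Springenberg et al., 2014):

  $\frac{\partial l}{\partial x} \, 1_{\{s > 0\}} \, 1_{\{\frac{\partial l}{\partial x} > 0\}}$.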
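
As a small numerical illustration of the three rules, the sketch below applies them to one hand-picked ReLU input and one hand-picked incoming gradient. The values of `s` and `grad_x` are hypothetical and chosen only so that every combination of signs appears; nothing here comes from a real network.

```python
import torch

# s: values entering a ReLU during the forward pass (hypothetical)
s = torch.tensor([1.5, -2.0, 0.5, -0.1])
# grad_x: gradient of l with respect to the ReLU output (hypothetical)
grad_x = torch.tensor([0.3, 0.7, -0.4, -0.9])

forward_mask = (s > 0).float()        # 1{s > 0}
backward_mask = (grad_x > 0).float()  # 1{dl/dx > 0}

# Quantity propagated below the ReLU by each method
backprop = forward_mask * grad_x                  # standard back-propagation
deconv = backward_mask * grad_x                   # deconvolution
guided = forward_mask * backward_mask * grad_x    # guided back-propagation

print(backprop, deconv, guided)
```

Only the first component survives guided back-propagation, since it is the only one with both a positive forward input and a positive incoming gradient.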

These procedures can be implemented simply in PyTorch by changing the
backward pass of nn.ReLU.

The class nn.Module provides methods to register “hook” functions that are
called during the forward or the backward pass, and can implement a different
computation for the latter.

For instance:

>>> x = Variable(Tensor([ 1.23, -4.56 ]))
>>> m = nn.ReLU()
>>> m(x)
Variable containing:
 1.2300
 0.0000
[torch.FloatTensor of size 2]
>>> def my_hook(module, input, output):
...     print(str(module) + ' got ' + str(input[0].size()))
>>> handle = m.register_forward_hook(my_hook)
>>> m(x)
ReLU () got torch.Size([2])
Variable containing:
 1.2300
 0.0000
[torch.FloatTensor of size 2]
>>> handle.remove()
>>> m(x)
Variable containing:
 1.2300
 0.0000
[torch.FloatTensor of size 2]

Using hooks, we can implement the deconvolution as follows:

import PIL.Image
import torch
from torch import nn
from torch.autograd import Variable
from torch.nn import functional as F
from torchvision import models, transforms, utils

def relu_backward_deconv_hook(module, grad_input, grad_output):
    # the trailing comma makes the return value a one-element tuple
    return F.relu(grad_output[0]),

def equip_model_deconv(model):
    for m in model.modules():
        if isinstance(m, nn.ReLU):
            m.register_backward_hook(relu_backward_deconv_hook)

def grad_view(model, image_name):
    to_tensor = transforms.ToTensor()
    img = to_tensor(PIL.Image.open(image_name))
    img = 0.5 + 0.5 * (img - img.mean()) / img.std()

    if torch.cuda.is_available(): img = img.cuda()

    input = Variable(img.view(1, img.size(0), img.size(1), img.size(2)), \
                     requires_grad = True)

    output = model(input)

    result, = torch.autograd.grad(output.max(), input)

    result = result.data / result.data.max() + 0.5

    return result

model = models.vgg16(pretrained = True)
model.eval()
model = model.features
# keep the model on the same device as the image
if torch.cuda.is_available(): model.cuda()
equip_model_deconv(model)
result = grad_view(model, 'blacklab.jpg')
utils.save_image(result, 'blacklab-vgg16-deconv.png')

The code is the same for the guided back-propagation, except for the hooks:

def relu_forward_gbackprop_hook(module, input, output):
    module.input_kept = input[0]

def relu_backward_gbackprop_hook(module, grad_input, grad_output):
    return F.relu(grad_output[0]) * F.relu(module.input_kept).sign(),

def equip_model_gbackprop(model):
    for m in model.modules():
        if isinstance(m, nn.ReLU):
            m.register_forward_hook(relu_forward_gbackprop_hook)
            m.register_backward_hook(relu_backward_gbackprop_hook)
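
The hooks above use the PyTorch API of early 2018. As a hedged sketch of how the same guided back-propagation rule might look with more recent releases, the block below uses `register_full_backward_hook` and plain tensors instead of `Variable`. The two-layer model and the hook names are hypothetical stand-ins chosen to keep the example small; they are not the VGG16 setup of the slides.

```python
import torch
import torch.nn as nn
import torch.nn.functional as F

def forward_hook(module, inputs, output):
    # Remember the ReLU input s so the backward hook can build 1{s > 0}
    module.input_kept = inputs[0].detach()

def backward_hook(module, grad_input, grad_output):
    # Guided back-propagation: keep positive gradients at positive inputs
    mask = (module.input_kept > 0).to(grad_output[0].dtype)
    return (F.relu(grad_output[0]) * mask,)

torch.manual_seed(0)
# Hypothetical toy model; non-inplace ReLU keeps the hooks simple
model = nn.Sequential(nn.Linear(4, 3), nn.ReLU(), nn.Linear(3, 1))

for m in model.modules():
    if isinstance(m, nn.ReLU):
        m.register_forward_hook(forward_hook)
        m.register_full_backward_hook(backward_hook)

x = torch.randn(2, 4, requires_grad=True)
grad, = torch.autograd.grad(model(x).sum(), x)
print(grad)
```

For this toy model the result can be checked by hand: the gradient reaching the ReLU is the weight row of the last layer, so the modified gradient with respect to the input is relu(W2) masked by 1{s > 0} and multiplied by W1.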

Experiments with an AlexNet-like network. Original images + deconvolution (or
filters) for the top-9 activations for channels picked randomly.

(Zeiler and Fergus, 2014)

Maximum response samples

Another approach to get an intuition of the information actually encoded in the
weights of a convnet consists of optimizing from scratch a sample to maximize
the activation f of a chosen unit, or the sum over an activation map.

Doing so generates images with high frequencies, which tend to activate units a
lot. For instance these images maximize the responses of the units “bathtub”
and “lipstick” respectively (yes, this is strange, we will come back to it).

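To prevent this from happening, we add a penalty over the edge amplitude at
several scales, of the form

$h(x) = \sum_{s \geq 0} \| \delta^s(x) - g \circledast \delta^s(x) \|^2$

where g is a Gaussian kernel, $\circledast$ denotes convolution, and δ is a
factor-2 downscale operator, so that $\delta^s$ downscales by a factor $2^s$.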

class MultiScaleEdgeEnergy(nn.Module):
    def __init__(self):
        super(MultiScaleEdgeEnergy, self).__init__()
        k = Tensor([[1, 4, 6, 4, 1]])
        k_5x5_pseudo_gaussian = k.t().mm(k).view(1, 1, 5, 5)
        k_5x5_pseudo_gaussian /= k_5x5_pseudo_gaussian.sum()
        self.k_5x5_pseudo_gaussian = Parameter(k_5x5_pseudo_gaussian)
        k_2x2_uniform = Tensor([[0.25, 0.25], [0.25, 0.25]]).view(1, 1, 2, 2)
        self.k_2x2_uniform = Parameter(k_2x2_uniform)

    def forward(self, x):
        if x.size(1) > 1:
            # dealing with multiple channels by unfolding them as
            # many one-channel images
            result = self(x.view(x.size(0) * x.size(1), 1, x.size(2), x.size(3)))
            result = result.view(x.size(0), -1).sum(1)
        else:
            result = 0.0
            while x.size(2) > 5 and x.size(3) > 5:
                blurry = F.conv2d(x, self.k_5x5_pseudo_gaussian, padding = 2)
                result += (x - blurry).view(x.size(0), -1).pow(2).sum(1)
                x = F.conv2d(x, self.k_2x2_uniform, stride = 2, padding = 1)

        return result

Then, the optimization of the image per se is straightforward:

model = models.vgg16(pretrained = True)
model.eval()
edge_energy = MultiScaleEdgeEnergy()
input = Tensor(1, 3, 224, 224).normal_(0, 0.01)

if torch.cuda.is_available():
    model.cuda()
    edge_energy.cuda()
    input = input.cuda()

input = Variable(input, requires_grad = True)

optimizer = optim.Adam([input], lr = 1e-1)

for k in range(250):
    output = model(input)
    score = edge_energy(input) - output[0, 700] # paper towel
    optimizer.zero_grad()
    score.backward()
    optimizer.step()

result = input.data
result = 0.5 + 0.1 * (result - result.mean()) / result.std()
torchvision.utils.save_image(result, 'result.png')

(take a second to think about the beauty of autograd)

VGG16, maximizing a channel of the 4th convolution layer

VGG16, maximizing a channel of the 7th convolution layer

VGG16, maximizing a unit of the 10th convolution layer

VGG16, maximizing a unit of the 13th (and last) convolution layer

VGG16, maximizing a unit of the output layer

“King crab” “Samoyed” (that’s a fluffy dog)

VGG16, maximizing a unit of the output layer

“Hourglass” “Paper towel”

VGG16, maximizing a unit of the output layer

“Ping-pong ball” “Steel arch bridge”

VGG16, maximizing a unit of the output layer

“Sunglass” “Geyser”

These results show that the parameters of a network trained for classification
carry enough information to generate identifiable large-scale structures.

Although the training is discriminative, the resulting model has strong
generative capabilities.

It also gives an intuition of the accuracy and shortcomings of the resulting
global compositional model.

Adversarial examples

In spite of their very good predictive capabilities, deep neural networks are quite
sensitive to adversarial inputs, that is to inputs crafted to make them behave
incorrectly (Szegedy et al., 2014).

The simplest strategy to exhibit such behavior is to optimize the input to
maximize the loss.

Let x be an image, y its proper label, f(x; w) the network’s prediction, and L
the cross-entropy loss. We can construct an adversarial example by maximizing
the loss. To do so, we iterate a “gradient ascent” step:

$x_{k+1} = x_k + \eta \nabla_{|x} L(f(x_k; w), y).$

After a few iterations, this procedure will reach a sample x̌ whose class is not y .

The counter-intuitive result is that the resulting misclassified images are
indistinguishable from the original ones to a human eye.

input = Variable(input, requires_grad = True)

model = torchvision.models.alexnet(pretrained = True)

cross_entropy = nn.CrossEntropyLoss()
optimizer = optim.SGD([input], lr = 1e-1)

if torch.cuda.is_available():
    model.cuda()
    cross_entropy.cuda()

target = model(input).data.max(1)[1].view(-1)

if torch.cuda.is_available(): target = target.cuda()
target = Variable(target)

for k in range(15):
    output = model(input)
    loss = - cross_entropy(output, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
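
To make the gradient-ascent loop easy to run without downloading AlexNet, here is a minimal sketch on a toy problem. The linear classifier, the random input, and the variable names (`x_adv`, `losses`, `relative_change`) are all assumptions introduced for illustration; only the structure of the loop mirrors the code above.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-in for the pretrained network: a frozen linear classifier
model = nn.Linear(10, 3)
for p in model.parameters():
    p.requires_grad_(False)

cross_entropy = nn.CrossEntropyLoss()

x = torch.randn(1, 10)                    # stand-in for the original image
target = model(x).argmax(dim=1)           # class predicted before the attack
x_adv = x.clone().requires_grad_(True)    # the sample being optimized

optimizer = torch.optim.SGD([x_adv], lr=1e-1)

losses = []
for k in range(15):
    loss = cross_entropy(model(x_adv), target)
    losses.append(loss.item())
    optimizer.zero_grad()
    (-loss).backward()                    # minimizing -loss is gradient ascent
    optimizer.step()

final_loss = cross_entropy(model(x_adv), target).item()
relative_change = ((x_adv.detach() - x).norm() / x.norm()).item()
print(losses[0], final_loss, relative_change)
```

The loss on the initially predicted class grows along the iterations, and `relative_change` gives the size of the perturbation relative to the norm of the original input.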

1.02% 0.27%

Predicted classes

Nb. iterations   Image #1             Image #2
0                Weimaraner           desktop computer
1                Weimaraner           desktop computer
2                Labrador retriever   desktop computer
3                Labrador retriever   desktop computer
4                Labrador retriever   desktop computer
5                brush kangaroo       desktop computer
6                brush kangaroo       desktop computer
7                sundial              desktop computer
8                sundial              desktop computer
9                sundial              desktop computer
10               sundial              desktop computer
11               sundial              desktop computer
12               sundial              desktop computer
13               sundial              desktop computer
14               sundial              desk

Many applications such as image synthesis, denoising, super-resolution, speech
synthesis, compression, etc. require going beyond classification and regression,
and explicitly modeling a high-dimensional signal.

This modeling consists of finding “meaningful degrees of freedom” that describe
the signal, and are of lower dimension.

An autoencoder combines an encoder f that embeds the original space X into
a latent space of lower dimension F, and a decoder g to map back to X, such
that their composition g ◦ f is [close to] the identity on the data.

Latent space F

Original space X

Training consists of minimizing the reconstruction error, for instance estimated
with the MSE

$\hat{w}_f, \hat{w}_g = \operatorname*{argmin}_{w_f, w_g} \frac{1}{N} \sum_{n=1}^{N} \| x_n - g(f(x_n; w_f); w_g) \|^2.$

Deep Autoencoders

A deep autoencoder combines an encoder composed of convolutional layers, and
a decoder composed of the reciprocal transposed convolution layers.

To run a simple example on MNIST, we consider the following model, where
dimension reduction is obtained through filter sizes and strides > 1, avoiding
max-pooling layers.

AutoEncoder (
  (encoder): Sequential (
    (0): Conv2d(1, 32, kernel_size=(5, 5), stride=(1, 1))
    (1): ReLU (inplace)
    (2): Conv2d(32, 32, kernel_size=(5, 5), stride=(1, 1))
    (3): ReLU (inplace)
    (4): Conv2d(32, 32, kernel_size=(4, 4), stride=(2, 2))
    (5): ReLU (inplace)
    (6): Conv2d(32, 32, kernel_size=(3, 3), stride=(2, 2))
    (7): ReLU (inplace)
    (8): Conv2d(32, 8, kernel_size=(4, 4), stride=(1, 1))
  )
  (decoder): Sequential (
    (0): ConvTranspose2d(8, 32, kernel_size=(4, 4), stride=(1, 1))
    (1): ReLU (inplace)
    (2): ConvTranspose2d(32, 32, kernel_size=(3, 3), stride=(2, 2))
    (3): ReLU (inplace)
    (4): ConvTranspose2d(32, 32, kernel_size=(4, 4), stride=(2, 2))
    (5): ReLU (inplace)
    (6): ConvTranspose2d(32, 32, kernel_size=(5, 5), stride=(1, 1))
    (7): ReLU (inplace)
    (8): ConvTranspose2d(32, 1, kernel_size=(5, 5), stride=(1, 1))
  )
)
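
The slides show only the printed structure of the model, not its class definition. The block below is one possible definition that would produce the same layers, written as a sketch with recent PyTorch. The constructor arguments `embedding_dim` and `nb_channels` follow the call that appears in the training code, but their default values (8 and 32) and the `encode` / `decode` method names are assumptions.

```python
import torch
import torch.nn as nn

class AutoEncoder(nn.Module):
    def __init__(self, embedding_dim=8, nb_channels=32):
        super().__init__()
        c = nb_channels
        # 1x28x28 -> 24 -> 20 -> 9 -> 4 -> embedding_dim x 1 x 1
        self.encoder = nn.Sequential(
            nn.Conv2d(1, c, kernel_size=5, stride=1), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, kernel_size=5, stride=1), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, kernel_size=4, stride=2), nn.ReLU(inplace=True),
            nn.Conv2d(c, c, kernel_size=3, stride=2), nn.ReLU(inplace=True),
            nn.Conv2d(c, embedding_dim, kernel_size=4, stride=1),
        )
        # Mirror of the encoder with transposed convolutions
        self.decoder = nn.Sequential(
            nn.ConvTranspose2d(embedding_dim, c, kernel_size=4, stride=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(c, c, kernel_size=3, stride=2),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(c, c, kernel_size=4, stride=2),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(c, c, kernel_size=5, stride=1),
            nn.ReLU(inplace=True),
            nn.ConvTranspose2d(c, 1, kernel_size=5, stride=1),
        )

    def encode(self, x):
        return self.encoder(x)

    def decode(self, z):
        return self.decoder(z)

    def forward(self, x):
        return self.decode(self.encode(x))

model = AutoEncoder()
x = torch.randn(2, 1, 28, 28)       # a batch of two MNIST-sized inputs
z = model.encode(x)
y = model(x)
print(z.shape, y.shape)
```

Running it on a random MNIST-sized batch is a quick way to confirm that the latent code has 8 values per sample and that the reconstruction has the same size as the input.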

Tensor sizes / operations (encoder)

1×28×28
    nn.Conv2d(1, 32, kernel_size=5, stride=1)
32×24×24
    nn.Conv2d(32, 32, kernel_size=5, stride=1)
32×20×20
    nn.Conv2d(32, 32, kernel_size=4, stride=2)
32×9×9
    nn.Conv2d(32, 32, kernel_size=3, stride=2)
32×4×4
    nn.Conv2d(32, 8, kernel_size=4, stride=1)
8×1×1

Tensor sizes / operations (decoder)

8×1×1
    nn.ConvTranspose2d(8, 32, kernel_size=4, stride=1)
32×4×4
    nn.ConvTranspose2d(32, 32, kernel_size=3, stride=2)
32×9×9
    nn.ConvTranspose2d(32, 32, kernel_size=4, stride=2)
32×20×20
    nn.ConvTranspose2d(32, 32, kernel_size=5, stride=1)
32×24×24
    nn.ConvTranspose2d(32, 1, kernel_size=5, stride=1)
1×28×28
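
The spatial sizes listed for the encoder and the decoder follow from the usual size formulas for convolutions and transposed convolutions. The sketch below recomputes them in plain Python; the helper names `conv_out` and `conv_transpose_out` are hypothetical, and the formulas are the simplified versions without dilation or output padding.

```python
def conv_out(size, kernel_size, stride=1, padding=0):
    # Output size of a convolution along one spatial dimension
    return (size + 2 * padding - kernel_size) // stride + 1

def conv_transpose_out(size, kernel_size, stride=1, padding=0):
    # Output size of a transposed convolution along one spatial dimension
    return (size - 1) * stride - 2 * padding + kernel_size

# (kernel_size, stride) of the five encoder layers, in order
encoder_layers = [(5, 1), (5, 1), (4, 2), (3, 2), (4, 1)]

encoder_sizes = [28]
for kernel_size, stride in encoder_layers:
    encoder_sizes.append(conv_out(encoder_sizes[-1], kernel_size, stride))

# The decoder applies the same kernel sizes and strides in reverse order
decoder_sizes = [1]
for kernel_size, stride in reversed(encoder_layers):
    decoder_sizes.append(conv_transpose_out(decoder_sizes[-1], kernel_size, stride))

print(encoder_sizes)  # [28, 24, 20, 9, 4, 1]
print(decoder_sizes)  # [1, 4, 9, 20, 24, 28]
```

With these kernel sizes and strides, each transposed convolution exactly undoes the size change of the corresponding convolution, which is why the decoder ends at 28×28.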

Training is achieved with MSE and Adam:

model = AutoEncoder(embedding_dim, nb_channels)

mse_loss = nn.MSELoss()

if torch.cuda.is_available():
    model.cuda()
    mse_loss.cuda()

optimizer = optim.Adam(model.parameters(), lr = 1e-3)

for epoch in range(args.nb_epochs):
    for input, _ in iter(train_loader):
        if torch.cuda.is_available(): input = input.cuda()
        input = Variable(input)
        output = model(input)
        loss = mse_loss(output, input)
        model.zero_grad()
        loss.backward()
        optimizer.step()

X (original samples)

g ◦ f (X ) (CNN, d = 8)

g ◦ f (X ) (PCA, d = 8)

To get an intuition of the latent representation, we can pick two samples x and
x′ at random and interpolate samples along the line in the latent space

$\forall x, x' \in X^2, \ \alpha \in [0, 1], \quad \xi(x, x', \alpha) = g((1 - \alpha) f(x) + \alpha f(x')).$
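
The interpolation ξ can be sketched in a few lines. In the block below, `f` and `g` are untrained linear layers used as hypothetical stand-ins for the trained encoder and decoder, and `interpolate` is a helper name introduced for this example; with the real autoencoder, only the definitions of `f` and `g` would change.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for the trained encoder f and decoder g
f = nn.Linear(784, 8)
g = nn.Linear(8, 784)

def interpolate(f, g, x, x_prime, nb_steps=5):
    # alpha goes from 0 to 1; one row of the result per value of alpha
    alphas = torch.linspace(0, 1, nb_steps).view(-1, 1)
    with torch.no_grad():
        z = (1 - alphas) * f(x) + alphas * f(x_prime)
        return g(z)

x = torch.rand(1, 784)
x_prime = torch.rand(1, 784)
samples = interpolate(f, g, x, x_prime)
print(samples.shape)
```

The first and last rows of `samples` are the reconstructions g(f(x)) and g(f(x′)), and the rows in between correspond to points on the segment joining the two latent codes.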

[Figure: two samples x and x′ in the original space X, and their images f(x) and f(x′) in the latent space F]

Autoencoder interpolation (d = 8)

And we can assess the generative capabilities of the decoder g by introducing a
[simple] density model q Z over the latent space F, sample there, and map the
samples into the image space X with g .

We can for instance use a Gaussian model with diagonal covariance matrix,

$f(X) \sim \mathcal{N}(\hat{m}, \hat{\Delta})$

where $\hat{m}$ is a vector and $\hat{\Delta}$ a diagonal matrix, both estimated on training data.
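
A minimal sketch of this sampling procedure follows. As before, `f` and `g` are untrained linear layers standing in for the trained encoder and decoder, and `train_input` is random data standing in for the training set; all three are assumptions made so the example runs on its own.

```python
import torch
import torch.nn as nn

torch.manual_seed(0)

# Hypothetical stand-ins for the trained encoder, decoder, and training set
f = nn.Linear(784, 8)
g = nn.Linear(8, 784)
train_input = torch.rand(1000, 784)

with torch.no_grad():
    z = f(train_input)            # latent codes of the training samples
    m_hat = z.mean(dim=0)         # estimated mean vector
    std_hat = z.std(dim=0)        # square roots of the diagonal covariance entries

    # Sample from N(m_hat, diag(std_hat^2)) and decode
    z_new = m_hat + std_hat * torch.randn(16, z.size(1))
    generated = g(z_new)

print(generated.shape)
```

Because the covariance is diagonal, sampling reduces to scaling independent standard normal values by the per-dimension standard deviation and shifting by the mean.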

Latent space F

Original space X

Autoencoder sampling (d = 8)

Autoencoder sampling (d = 16)

Autoencoder sampling (d = 32)

These results are unsatisfying, because the density model used on the latent
space F is too simple and inadequate.

Building a “good” model amounts to our original problem of modeling an
empirical distribution, although it may now be in a lower dimension space.

Generative Adversarial Networks

A different approach to learning high-dimensional generative models is the
Generative Adversarial Network proposed by Goodfellow et al. (2014).

The idea behind GANs is to train two networks jointly:

• A generator G to map a Z following a [simple] fixed distribution to the
  desired “real” distribution, and
• a discriminator D to classify data points as “real” or “fake” (i.e. from G).

The approach is adversarial since the two networks have antagonistic objectives.

Training GANs is difficult and involves a lot of technical details regarding the
network structure and the optimization procedure.

Radford et al. (2015) proposed a simple convolutional architecture for
generating small images.

Figure 1: DCGAN generator used for LSUN scene modeling. A 100 dimensional uniform distribution
Z is projected to a small spatial extent convolutional representation with many feature maps.
A series of four fractionally-strided convolutions (in some recent papers, these are wrongly called
deconvolutions) then convert this high level representation into a 64 × 64 pixel image. Notably, no
fully connected or pooling layers are used.

(Radford et al., 2015)

We can have a look at the reference implementation provided in

# default nz = 100, ngf = 64

class _netG(nn.Module):
    def __init__(self, ngpu):
        super(_netG, self).__init__()
        self.ngpu = ngpu
        self.main = nn.Sequential(
            # input is Z, going into a convolution
            nn.ConvTranspose2d(nz, ngf * 8, 4, 1, 0, bias = False),
            nn.BatchNorm2d(ngf * 8),
            nn.ReLU(True),
            # state size. (ngf * 8) x 4 x 4
            nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias = False),
            nn.BatchNorm2d(ngf * 4),
            nn.ReLU(True),
            # state size. (ngf * 4) x 8 x 8
            nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias = False),
            nn.BatchNorm2d(ngf * 2),
            nn.ReLU(True),
            # state size. (ngf * 2) x 16 x 16
            nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias = False),
            nn.BatchNorm2d(ngf),
            nn.ReLU(True),
            # state size. (ngf) x 32 x 32
            nn.ConvTranspose2d(ngf, nc, 4, 2, 1, bias = False),
            nn.Tanh()
            # state size. (nc) x 64 x 64
        )

# default nz = 100, ndf = 64

class _netD(nn.Module):
    def __init__(self, ngpu):
        super(_netD, self).__init__()
        self.ngpu = ngpu
        self.main = nn.Sequential(
            # input is (nc) x 64 x 64
            nn.Conv2d(nc, ndf, 4, 2, 1, bias = False),
            nn.LeakyReLU(0.2, inplace = True),
            # state size. (ndf) x 32 x 32
            nn.Conv2d(ndf, ndf * 2, 4, 2, 1, bias = False),
            nn.BatchNorm2d(ndf * 2),
            nn.LeakyReLU(0.2, inplace = True),
            # state size. (ndf * 2) x 16 x 16
            nn.Conv2d(ndf * 2, ndf * 4, 4, 2, 1, bias = False),
            nn.BatchNorm2d(ndf * 4),
            nn.LeakyReLU(0.2, inplace = True),
            # state size. (ndf * 4) x 8 x 8
            nn.Conv2d(ndf * 4, ndf * 8, 4, 2, 1, bias = False),
            nn.BatchNorm2d(ndf * 8),
            nn.LeakyReLU(0.2, inplace = True),
            # state size. (ndf * 8) x 4 x 4
            nn.Conv2d(ndf * 8, 1, 4, 1, 0, bias = False),
            nn.Sigmoid()
        )

# custom weights initialization called on netG and netD
def weights_init(m):
    classname = m.__class__.__name__
    if classname.find('Conv') != -1:
        m.weight.data.normal_(0.0, 0.02)
    elif classname.find('BatchNorm') != -1:
        m.weight.data.normal_(1.0, 0.02)
        m.bias.data.fill_(0)

criterion = nn.BCELoss()

input = torch.FloatTensor(opt.batchSize, 3, opt.imageSize, opt.imageSize)
noise = torch.FloatTensor(opt.batchSize, nz, 1, 1)
fixed_noise = torch.FloatTensor(opt.batchSize, nz, 1, 1).normal_(0, 1)
label = torch.FloatTensor(opt.batchSize)
real_label = 1
fake_label = 0

fixed_noise = Variable(fixed_noise)

# setup optimizer
optimizerD = optim.Adam(netD.parameters(), lr = opt.lr, betas = (opt.beta1, 0.999))
optimizerG = optim.Adam(netG.parameters(), lr = opt.lr, betas = (opt.beta1, 0.999))

for epoch in range(opt.niter):
    for i, data in enumerate(dataloader, 0):

        # (1) Update D network: maximize log(D(x)) + log(1 - D(G(z)))

        # train with real
        netD.zero_grad()
        real_cpu, _ = data
        batch_size = real_cpu.size(0)
        if opt.cuda:
            real_cpu = real_cpu.cuda()
        input.resize_as_(real_cpu).copy_(real_cpu)
        label.resize_(batch_size).fill_(real_label)
        inputv = Variable(input)
        labelv = Variable(label)

        output = netD(inputv)
        errD_real = criterion(output, labelv)
        errD_real.backward()
        D_x = output.data.mean()

        # train with fake
        noise.resize_(batch_size, nz, 1, 1).normal_(0, 1)
        noisev = Variable(noise)
        fake = netG(noisev)
        labelv = Variable(label.fill_(fake_label))
        output = netD(fake.detach())
        errD_fake = criterion(output, labelv)
        errD_fake.backward()
        D_G_z1 = output.data.mean()
        errD = errD_real + errD_fake
        optimizerD.step()

        # (2) Update G network: maximize log(D(G(z)))

        netG.zero_grad()
        # fake labels are real for generator cost
        labelv = Variable(label.fill_(real_label))
        output = netD(fake)
        errG = criterion(output, labelv)
        errG.backward()
        D_G_z2 = output.data.mean()
        optimizerG.step()

Note that this update implements the − log(D(G (z))) trick.
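
To see why this trick matters, the sketch below compares the two possible generator objectives on a few hand-picked discriminator outputs. The `logits` values are hypothetical pre-sigmoid outputs of D on fake samples; the point is only to check that the binary cross-entropy with “real” labels equals − log(D(G(z))), and to compare gradient magnitudes when D confidently rejects the fakes.

```python
import torch
import torch.nn.functional as F

# Hypothetical pre-sigmoid outputs of D on generated samples
logits = torch.tensor([-4.0, -2.0, 0.0], requires_grad=True)
d_fake = torch.sigmoid(logits)              # D(G(z))

# Generator loss as in the training loop: BCE against "real" labels
bce = F.binary_cross_entropy(d_fake, torch.ones_like(d_fake))
manual = -torch.log(d_fake).mean()          # -log(D(G(z))), averaged

# Gradient of the -log(D(G(z))) objective with respect to the logits
grad_trick, = torch.autograd.grad(bce, logits, retain_graph=True)

# Gradient of the original objective log(1 - D(G(z))) for comparison
saturating = torch.log(1 - d_fake).mean()
grad_sat, = torch.autograd.grad(saturating, logits)

print(bce.item(), manual.item())
print(grad_trick, grad_sat)
```

When D(G(z)) is close to 0, the gradient of log(1 − D(G(z))) with respect to the logit almost vanishes, whereas the gradient of − log(D(G(z))) stays large, which gives the generator a usable signal early in training.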

Real images from LSUN’s “bedroom” class.

Fake images after 1 epoch (3M images)

Fake images after 20 epochs

