Neural Networks, Convolutions, GANs

AMLD – Deep Learning in PyTorch
6. Visualizations, auto-encoders and GANs
François Fleuret
http://fleuret.org/amld/
February 10, 2018
ÉCOLE POLYTECHNIQUE
FÉDÉRALE DE LAUSANNE
Visualizing the processing
A simple approach to analyzing “what a network is doing” is to compute the
derivative of the output with respect to the input.
Zeiler and Fergus (2014) proposed to improve this approach by constructing a
deconvolutional network to compute the “activating pattern” of a sample.
Springenberg et al. (2014) improved upon deconvolution with guided
back-propagation, which aims at the best of both worlds: discarding structures
which would not contribute positively to the final response, and discarding
structures which are not already present.
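So we get three visualization methods, which differ only in the quantities
propagated through ReLU layers during the back-pass. Writing s for the input of
a ReLU layer and x for its output:
• back-propagation (Erhan et al., 2009; Simonyan et al., 2013):
  $\frac{\partial l}{\partial x} \, 1_{\{s > 0\}}$,
• deconvolution (Zeiler and Fergus, 2014):
  $\frac{\partial l}{\partial x} \, 1_{\{\frac{\partial l}{\partial x} > 0\}}$,
• guided back-propagation (Springenberg et al., 2014):
  $\frac{\partial l}{\partial x} \, 1_{\{s > 0\}} \, 1_{\{\frac{\partial l}{\partial x} > 0\}}$.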
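The three rules can be compared on a handful of numbers. The sketch below is a plain-Python illustration, not the PyTorch implementation; the function name `relu_backward` and the sample values are assumptions made for this example. It takes s as the ReLU input and the incoming quantity as the gradient with respect to the ReLU output.

```python
def relu_backward(s, grad_x, rule):
    """Quantity sent backward through a ReLU x = max(0, s).

    s      -- list of ReLU inputs (pre-activations)
    grad_x -- list of incoming values dl/dx (w.r.t. the ReLU output)
    rule   -- 'backprop', 'deconv' or 'guided'
    """
    out = []
    for s_i, g_i in zip(s, grad_x):
        forward_active = s_i > 0      # indicator 1{s > 0}
        positive_grad = g_i > 0       # indicator 1{dl/dx > 0}
        if rule == 'backprop':
            keep = forward_active
        elif rule == 'deconv':
            keep = positive_grad
        elif rule == 'guided':
            keep = forward_active and positive_grad
        else:
            raise ValueError('unknown rule: ' + rule)
        out.append(g_i if keep else 0.0)
    return out

# Hypothetical values covering the four sign combinations.
s = [1.0, 1.0, -1.0, -1.0]
grad_x = [2.0, -3.0, 2.0, -3.0]
for rule in ('backprop', 'deconv', 'guided'):
    print(rule, relu_backward(s, grad_x, rule))
```

Only the first entry, where the unit was active and the incoming gradient is positive, survives under all three rules.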
These procedures can be implemented simply in PyTorch by changing the
nn.ReLU’s backward pass.
The class nn.Module provides methods to register “hook” functions that are
called during the forward or the backward pass, and can implement a different
computation for the latter.
For instance
>>> x = Variable(Tensor([1.23, -4.56]))
>>> m = nn.ReLU()
>>> m(x)
Variable containing:
 1.2300
 0.0000
[torch.FloatTensor of size 2]
>>> def my_hook(module, input, output):
...     print(str(module) + ' got ' + str(input[0].size()))
...
>>> handle = m.register_forward_hook(my_hook)
>>> m(x)
ReLU() got torch.Size([2])
 1.2300
 0.0000
>>> handle.remove()
>>> m(x)
 1.2300
 0.0000
Using hooks, we can implement the deconvolution as follows:
import PIL.Image, torch
from torch import nn
from torch.autograd import Variable
from torch.nn import functional as F
from torchvision import models, transforms, utils

def relu_backward_deconv_hook(module, grad_input, grad_output):
    # propagate only the positive part of the incoming gradient
    return F.relu(grad_output[0]),

def equip_model_deconv(model):
    for m in model.modules():
        if isinstance(m, nn.ReLU):
            m.register_backward_hook(relu_backward_deconv_hook)

def grad_view(model, image_name):
    to_tensor = transforms.ToTensor()
    img = to_tensor(PIL.Image.open(image_name))
    img = 0.5 + 0.5 * (img - img.mean()) / img.std()
    if torch.cuda.is_available(): img = img.cuda()
    input = Variable(img.view(1, img.size(0), img.size(1), img.size(2)),
                     requires_grad = True)
    output = model(input)
    result, = torch.autograd.grad(output.max(), input)
    result = result.data / result.data.max() + 0.5
    return result

model = models.vgg16(pretrained = True)
model.eval()
model = model.features
# keep the model on the same device as the image
if torch.cuda.is_available(): model.cuda()
equip_model_deconv(model)
result = grad_view(model, 'blacklab.jpg')
utils.save_image(result, 'blacklab-vgg16-deconv.png')
The code is the same for the guided back-propagation, except the hooks
themselves:
def relu_forward_gbackprop_hook(module, input, output):
    module.input_kept = input[0]

def relu_backward_gbackprop_hook(module, grad_input, grad_output):
    return F.relu(grad_output[0]) * F.relu(module.input_kept).sign(),

def equip_model_gbackprop(model):
    for m in model.modules():
        if isinstance(m, nn.ReLU):
            m.register_forward_hook(relu_forward_gbackprop_hook)
            m.register_backward_hook(relu_backward_gbackprop_hook)
Figure panels: Original, Gradient, Deconvolution, Guided-backprop.
Experiments with an AlexNet-like network. Original images + deconvolution (or
filters) for the top-9 activations for channels picked randomly
(Zeiler and Fergus, 2014).
Maximum response samples
Another approach to get an intuition of the information actually encoded in the
weights of a convnet consists of optimizing from scratch a sample to maximize
the activation f of a chosen unit, or the sum over an activation map.
Doing so generates images with high frequencies, which tend to activate units a
lot. For instance these images maximize the responses of the units “bathtub”
and “lipstick” respectively (yes, this is strange, we will come back to it).
To prevent this from happening, we add a penalty over the edge amplitude at
several scales, of the form
$h(x) = \sum_{s \geq 0} \| \delta^s(x) - g \circledast \delta^s(x) \|^2$
where g is a Gaussian kernel, δ is a factor-2 downscale operator, and δ^s
denotes δ applied s times.
class MultiScaleEdgeEnergy(nn.Module):
    def __init__(self):
        super(MultiScaleEdgeEnergy, self).__init__()
        k = Tensor([[1, 4, 6, 4, 1]])
        k_5x5_pseudo_gaussian = k.t().mm(k).view(1, 1, 5, 5)
        k_5x5_pseudo_gaussian /= k_5x5_pseudo_gaussian.sum()
        self.k_5x5_pseudo_gaussian = Parameter(k_5x5_pseudo_gaussian)
        k_2x2_uniform = Tensor([[0.25, 0.25], [0.25, 0.25]]).view(1, 1, 2, 2)
        self.k_2x2_uniform = Parameter(k_2x2_uniform)

    def forward(self, x):
        if x.size(1) > 1:
            # deal with multiple channels by unfolding them into as
            # many one-channel images
            result = self(x.view(x.size(0) * x.size(1), 1, x.size(2), x.size(3)))
            result = result.view(x.size(0), -1).sum(1)
        else:
            result = 0.0
            while x.size(2) > 5 and x.size(3) > 5:
                blurry = F.conv2d(x, self.k_5x5_pseudo_gaussian, padding = 2)
                result += (x - blurry).view(x.size(0), -1).pow(2).sum(1)
                x = F.conv2d(x, self.k_2x2_uniform, stride = 2, padding = 1)
        return result
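To see why this penalty discourages high frequencies, here is a one-dimensional sketch of the same idea in plain Python. It is a simplification made for illustration, not the module above: the signal is a list rather than an image, the downscale averages non-overlapping pairs without padding, and the names `blur`, `downscale` and `edge_energy` are assumptions of this example.

```python
def blur(x):
    """Convolve x with the kernel [1, 4, 6, 4, 1] / 16, zero-padding by 2."""
    k = [1 / 16, 4 / 16, 6 / 16, 4 / 16, 1 / 16]
    padded = [0.0, 0.0] + list(x) + [0.0, 0.0]
    return [sum(k[j] * padded[i + j] for j in range(5)) for i in range(len(x))]

def downscale(x):
    """Factor-2 downscale: average of non-overlapping pairs."""
    return [(x[i] + x[i + 1]) / 2 for i in range(0, len(x) - 1, 2)]

def edge_energy(x):
    """Sum over scales of the squared difference between x and its blur."""
    total = 0.0
    while len(x) > 5:
        total += sum((a - b) ** 2 for a, b in zip(x, blur(x)))
        x = downscale(x)
    return total

smooth = [i / 31 for i in range(32)]        # slow ramp from 0 to 1
rough = [float(i % 2) for i in range(32)]   # alternating 0, 1, 0, 1, ...
print('smooth:', edge_energy(smooth))
print('rough: ', edge_energy(rough))
```

The alternating signal differs from its blurred version almost everywhere, so its penalty is much larger than that of the ramp, whose only contributions come from the zero-padded borders.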
Then, the optimization of the image per se is straightforward:
model = models.vgg16(pretrained = True)
model.eval()

edge_energy = MultiScaleEdgeEnergy()

input = Tensor(1, 3, 224, 224).normal_(0, 0.01)

if torch.cuda.is_available():
    model.cuda()
    edge_energy.cuda()
    input = input.cuda()

input = Variable(input, requires_grad = True)

optimizer = optim.Adam([input], lr = 1e-1)

for k in range(250):
    output = model(input)
    score = edge_energy(input) - output[0, 700] # paper towel
    optimizer.zero_grad()
    score.backward()
    optimizer.step()

result = input.data
result = 0.5 + 0.1 * (result - result.mean()) / result.std()
torchvision.utils.save_image(result, 'result.png')
(take a second to think about the beauty of autograd)
VGG16, maximizing a channel of the 4th convolution layer
VGG16, maximizing a channel of the 7th convolution layer
VGG16, maximizing a unit of the 10th convolution layer
VGG16, maximizing a unit of the 13th (and last) convolution layer
VGG16, maximizing a unit of the output layer
“King crab” “Samoyed” (that’s a fluffy dog)
“Hourglass” “Paper towel”
“Ping-pong ball” “Steel arch bridge”
“Sunglass” “Geyser”
These results show that the parameters of a network trained for classification
carry enough information to generate identifiable large-scale structures.
Although the training is discriminative, the resulting model has strong
generative capabilities.
It also gives an intuition of the accuracy and shortcomings of the resulting
global compositional model.
Adversarial examples
In spite of their very good predictive capabilities, deep neural networks are quite
sensitive to adversarial inputs, that is, to inputs crafted to make them behave
incorrectly (Szegedy et al., 2014).
The simplest strategy to exhibit such behavior is to optimize the input to
maximize the loss.
Let x be an image, y its proper label, f(x; w) the network’s prediction, and L
the cross-entropy loss. We can construct an adversarial example by maximizing
the loss. To do so, we iterate a “gradient ascent” step:
$x_{k+1} = x_k + \eta \nabla_{|x} L(f(x_k; w), y).$
After a few iterations, this procedure will reach a sample x̌ whose class is not y.
The counter-intuitive result is that the resulting misclassified images are
indistinguishable from the original ones to a human eye.
input = Variable(input, requires_grad = True)

model = torchvision.models.alexnet(pretrained = True)
model.eval() # disable dropout so that predictions are deterministic

cross_entropy = nn.CrossEntropyLoss()
optimizer = optim.SGD([input], lr = 1e-1)

if torch.cuda.is_available():
    model.cuda()
    cross_entropy.cuda()

target = model(input).data.max(1)[1].view(-1)
if torch.cuda.is_available(): target = target.cuda()
target = Variable(target)

for k in range(15):
    output = model(input)
    loss = - cross_entropy(output, target)
    optimizer.zero_grad()
    loss.backward()
    optimizer.step()
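The gradient-ascent update itself can be followed by hand on a model small enough to differentiate analytically. The sketch below is a toy illustration only: it replaces the network by a two-dimensional logistic classifier, and the parameters `w`, `b`, the input `x0` and the step size `eta` are hypothetical values chosen for this example. In such a low dimension the perturbation is not small; the near-invisible perturbations described above arise with high-dimensional inputs such as images.

```python
import math

def predict_proba(x, w, b):
    """Probability of class 1 under a logistic model, a stand-in for f(x; w)."""
    a = sum(w_i * x_i for w_i, x_i in zip(w, x)) + b
    return 1.0 / (1.0 + math.exp(-a))

def loss_grad_wrt_input(x, y, w, b):
    """Gradient of the cross-entropy loss w.r.t. the input x: (p - y) * w."""
    p = predict_proba(x, w, b)
    return [(p - y) * w_i for w_i in w]

w, b = [2.0, -1.0], 0.0   # hypothetical fixed "trained" parameters
x0 = [1.0, 0.5]           # hypothetical input, initially classified as 1
y = 1                     # its proper label
eta = 0.5                 # step size

x = list(x0)
for k in range(15):
    if predict_proba(x, w, b) < 0.5:
        break             # the predicted class is no longer y
    g = loss_grad_wrt_input(x, y, w, b)
    x = [x_i + eta * g_i for x_i, g_i in zip(x, g)]   # gradient ascent step

print('original   ', x0, predict_proba(x0, w, b))
print('adversarial', x, predict_proba(x, w, b))
```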
Figure panels: Original, Adversarial, Differences.
Relative perturbation $\| x - \check{x} \| / \| x \|$: 1.02% and 0.27% for the two images.

Predicted classes

Nb. iterations   Image #1             Image #2
0                Weimaraner           desktop computer
1                Weimaraner           desktop computer
2                Labrador retriever   desktop computer
5                brush kangaroo       desktop computer
6                brush kangaroo       desktop computer
7                sundial              desktop computer
14               sundial              desk
Autoencoders
Many applications such as image synthesis, denoising, super-resolution, speech
synthesis, compression, etc. require going beyond classification and regression,
and explicitly modeling a high-dimensional signal.
This modeling consists of finding “meaningful degrees of freedom” that describe
the signal, and are of lesser dimension.
An autoencoder combines an encoder f that embeds the original space X into
a latent space of lower dimension F, and a decoder g to map back to X, such
that their composition g ◦ f is [close to] the identity on the data.
Figure: the original space X and the latent space F.
Training consists of minimizing the reconstruction error, for instance estimated
with the MSE
$\hat{w}_f, \hat{w}_g = \operatorname{argmin}_{w_f, w_g} \frac{1}{N} \sum_{n=1}^{N} \| x_n - g(f(x_n; w_f); w_g) \|^2.$
Deep Autoencoders
A deep autoencoder combines an encoder composed of convolutional layers, and
a decoder composed of the reciprocal transposed convolution layers.
To run a simple example on MNIST, we consider the following model, where
dimension reduction is obtained through filter sizes and strides > 1, avoiding
max-pooling layers.
AutoEncoder (
  (encoder): Sequential (
    (0): Conv2d(1, 32, kernel_size=(5, 5), stride=(1, 1))
    (1): ReLU (inplace)
  )
  (decoder): Sequential (
    (0): ConvTranspose2d(8, 32, kernel_size=(4, 4), stride=(1, 1))
  )
)
Encoder
Tensor sizes / operations
1×28×28
  nn.Conv2d(1, 32, kernel_size=5, stride=1)
32×24×24
32×20×20
32×9×9
32×4×4
8×1×1
Decoder
Tensor sizes / operations (kernel size of the producing layer in parentheses)
8×1×1
  nn.ConvTranspose2d(8, 32, kernel_size=4, stride=1)
32×4×4 (kernel 4)
32×9×9 (kernel 3)
32×20×20 (kernel 4)
32×24×24 (kernel 5)
1×28×28 (kernel 5)
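The spatial sizes listed above can be checked with the usual size formulas for unpadded convolutions and transposed convolutions. In the sketch below the strides are not taken from the text: they are assumptions chosen so that the arithmetic matches the listed sizes, and the encoder kernel sizes are assumed to mirror those of the decoder. The helper names are also made up for this example.

```python
def conv_out(size, kernel, stride=1, padding=0):
    """Output size of a convolution along one spatial dimension."""
    return (size + 2 * padding - kernel) // stride + 1

def conv_transpose_out(size, kernel, stride=1, padding=0):
    """Output size of a transposed convolution along one spatial dimension."""
    return (size - 1) * stride - 2 * padding + kernel

# (kernel, stride) per layer; strides are inferred, not stated in the text.
encoder = [(5, 1), (5, 1), (4, 2), (3, 2), (4, 1)]
decoder = [(4, 1), (3, 2), (4, 2), (5, 1), (5, 1)]

enc_sizes = [28]
for kernel, stride in encoder:
    enc_sizes.append(conv_out(enc_sizes[-1], kernel, stride))

dec_sizes = [enc_sizes[-1]]
for kernel, stride in decoder:
    dec_sizes.append(conv_transpose_out(dec_sizes[-1], kernel, stride))

print('encoder:', enc_sizes)
print('decoder:', dec_sizes)
```

Under these assumptions the encoder goes 28, 24, 20, 9, 4, 1 and the decoder retraces the same sizes in reverse.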
Training is achieved with MSE and Adam.
model = AutoEncoder(embedding_dim, nb_channels)
mse_loss = nn.MSELoss()

if torch.cuda.is_available():
    model.cuda()
    mse_loss.cuda()

optimizer = optim.Adam(model.parameters(), lr = 1e-3)

for epoch in range(args.nb_epochs):
    for input, _ in iter(train_loader):
        if torch.cuda.is_available(): input = input.cuda()
        input = Variable(input)
        output = model(input)
        loss = mse_loss(output, input)
        model.zero_grad()
        loss.backward()
        optimizer.step()
X (original samples)
g ◦ f(X) (CNN, d = 8)
g ◦ f(X) (PCA, d = 8)
To get an intuition of the latent representation, we can pick two samples x and
x′ at random and interpolate samples along the line in the latent space
$\forall x, x' \in X^2, \ \alpha \in [0, 1], \ \xi(x, x', \alpha) = g((1 - \alpha) f(x) + \alpha f(x')).$
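The interpolation formula can be sketched with a toy encoder and decoder. Both are hypothetical stand-ins invented for this example (a pair-averaging map from R^4 to R^2 and a duplicating map back), not the trained networks; only the function `interpolate` follows the formula for ξ.

```python
def f(x):
    """Hypothetical encoder: averages adjacent pairs, R^4 -> R^2."""
    return [(x[0] + x[1]) / 2, (x[2] + x[3]) / 2]

def g(z):
    """Hypothetical decoder: duplicates each latent coordinate, R^2 -> R^4."""
    return [z[0], z[0], z[1], z[1]]

def interpolate(x, x_prime, alpha):
    """xi(x, x', alpha) = g((1 - alpha) f(x) + alpha f(x'))."""
    z = [(1 - alpha) * a + alpha * b for a, b in zip(f(x), f(x_prime))]
    return g(z)

x = [0.0, 0.0, 1.0, 1.0]
x_prime = [1.0, 1.0, 0.0, 0.0]
for alpha in (0.0, 0.25, 0.5, 0.75, 1.0):
    print(alpha, interpolate(x, x_prime, alpha))
```

With α = 0 the result is the reconstruction of x, with α = 1 that of x′, and the intermediate values move along the segment joining the two codes in the latent space.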
Figure: x and x′ in the original space X are mapped by f to f(x) and f(x′) in
the latent space F.
Autoencoder interpolation (d = 8)
And we can assess the generative capabilities of the decoder g by introducing a
[simple] density model q Z over the latent space F, sample there, and map the
samples into the image space X with g .
We can for instance use a Gaussian model with diagonal covariance matrix,
$f(X) \sim \mathcal{N}(\hat{m}, \hat{\Delta})$
where $\hat{m}$ is a vector and $\hat{\Delta}$ a diagonal matrix, both estimated on training data.
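Estimating such a model and sampling from it takes only a few lines. The sketch below uses plain Python; the latent codes in `codes` are made-up numbers standing in for f applied to training samples, and the function names are assumptions of this example. A sampled code would then be passed through the decoder g.

```python
import random
import statistics

def fit_diagonal_gaussian(codes):
    """Per-dimension mean and standard deviation of a list of latent codes."""
    dims = list(zip(*codes))
    mean = [statistics.mean(d) for d in dims]
    std = [statistics.pstdev(d) for d in dims]   # square roots of the diagonal
    return mean, std

def sample_latent(mean, std, rng):
    """Draw one code, each dimension sampled independently."""
    return [rng.gauss(m, s) for m, s in zip(mean, std)]

# Hypothetical two-dimensional codes standing in for f(x_n).
codes = [[0.0, 10.0], [2.0, 10.0], [4.0, 13.0], [2.0, 7.0]]
mean, std = fit_diagonal_gaussian(codes)

rng = random.Random(0)
z = sample_latent(mean, std, rng)
print('mean', mean, 'std', std)
print('sampled code', z)
```

A diagonal covariance ignores any dependence between latent dimensions.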
Figure: samples drawn in the latent space F and mapped to the original space X.
Autoencoder sampling (d = 8, d = 16, d = 32)
These results are unsatisfying, because the density model used on the latent
space F is too simple and inadequate.
Building a “good” model amounts to our original problem of modeling an
empirical distribution, although it may now be in a lower dimension space.
Generative Adversarial Networks
A different approach to learning high-dimensional generative models is that of
Generative Adversarial Networks (GANs), proposed by Goodfellow et al. (2014).
The idea behind GANs is to train two networks jointly:
• A generator G to map a Z following a [simple] fixed distribution to the
  desired “real” distribution, and
• a discriminator D to classify data points as “real” or “fake” (i.e. from G).
The approach is adversarial since the two networks have antagonistic objectives.
Training GANs is difficult and involves many technical details regarding the
network structure and the optimization procedure.
Radford et al. (2015) proposed a simple convolutional architecture for
generating small images.
Figure 1: DCGAN generator used for LSUN scene modeling. A 100 dimensional uniform
distribution Z is projected to a small spatial extent convolutional representation
with many feature maps. A series of four fractionally-strided convolutions (in some
recent papers, these are wrongly called deconvolutions) then convert this high level
representation into a 64 × 64 pixel image. Notably, no fully connected or pooling
layers are used.
(Radford et al., 2015)
We can have a look at the reference implementation provided in
https://github.com/pytorch/examples.git
# default nz = 100, ngf = 64
class _netG(nn.Module):
    def __init__(self, ngpu):
        super(_netG, self).__init__()
        self.ngpu = ngpu
        self.main = nn.Sequential(
            # input is Z, going into a convolution
            nn.ConvTranspose2d(nz, ngf * 8, 4, 1, 0, bias = False),
            nn.BatchNorm2d(ngf * 8),
            nn.ReLU(True),
            # state size. (ngf * 8) x 4 x 4
            nn.ConvTranspose2d(ngf * 8, ngf * 4, 4, 2, 1, bias = False),
            nn.BatchNorm2d(ngf * 4),
            nn.ReLU(True),
            # state size. (ngf * 4) x 8 x 8
            nn.ConvTranspose2d(ngf * 4, ngf * 2, 4, 2, 1, bias = False),
            nn.BatchNorm2d(ngf * 2),
            nn.ReLU(True),
            # state size. (ngf * 2) x 16 x 16
            nn.ConvTranspose2d(ngf * 2, ngf, 4, 2, 1, bias = False),
            nn.BatchNorm2d(ngf),
            nn.ReLU(True),
            # state size. (ngf) x 32 x 32
            nn.ConvTranspose2d(ngf, nc, 4, 2, 1, bias = False),
            nn.Tanh()
            # state size. (nc) x 64 x 64
        )
# default nz = 100, ndf = 64
class _netD(nn.Module):
    def __init__(self, ngpu):
        super(_netD, self).__init__()
        self.ngpu = ngpu
        self.main = nn.Sequential(
            # input is (nc) x 64 x 64
            nn.Conv2d(nc, ndf, 4, 2, 1, bias = False),
            nn.LeakyReLU(0.2, inplace = True),
            # state size. (ndf) x 32 x 32
            nn.Conv2d(ndf, ndf * 2, 4, 2, 1, bias = False),
            nn.BatchNorm2d(ndf * 2),
            nn.LeakyReLU(0.2, inplace = True),
            # state size. (ndf * 2) x 16 x 16
            nn.Conv2d(ndf * 2, ndf * 4, 4, 2, 1, bias = False),
            nn.BatchNorm2d(ndf * 4),
            nn.LeakyReLU(0.2, inplace = True),
            # state size. (ndf * 4) x 8 x 8
            nn.Conv2d(ndf * 4, ndf * 8, 4, 2, 1, bias = False),
            nn.BatchNorm2d(ndf * 8),
            nn.LeakyReLU(0.2, inplace = True),
            # state size. (ndf * 8) x 4 x 4
            nn.Conv2d(ndf * 8, 1, 4, 1, 0, bias = False),
            nn.Sigmoid()
        )
# custom weights initialization called on netG and netD
def weights_init(m):
    classname = m.__class__.__name__
    if classname.find('Conv') != -1:
        m.weight.data.normal_(0.0, 0.02)
    elif classname.find('BatchNorm') != -1:
        m.weight.data.normal_(1.0, 0.02)
        m.bias.data.fill_(0)
criterion = nn.BCELoss()

input = torch.FloatTensor(opt.batchSize, 3, opt.imageSize, opt.imageSize)
noise = torch.FloatTensor(opt.batchSize, nz, 1, 1)
fixed_noise = torch.FloatTensor(opt.batchSize, nz, 1, 1).normal_(0, 1)
label = torch.FloatTensor(opt.batchSize)
real_label = 1
fake_label = 0

fixed_noise = Variable(fixed_noise)

# setup optimizer
optimizerD = optim.Adam(netD.parameters(), lr = opt.lr, betas = (opt.beta1, 0.999))
optimizerG = optim.Adam(netG.parameters(), lr = opt.lr, betas = (opt.beta1, 0.999))

for epoch in range(opt.niter):
    for i, data in enumerate(dataloader, 0):
        # (1) Update D network: maximize log(D(x)) + log(1 - D(G(z)))
        # train with real
        netD.zero_grad()
        real_cpu, _ = data
        batch_size = real_cpu.size(0)
        if opt.cuda:
            real_cpu = real_cpu.cuda()
        input.resize_as_(real_cpu).copy_(real_cpu)
        label.resize_(batch_size).fill_(real_label)
        inputv = Variable(input)
        labelv = Variable(label)

        output = netD(inputv)
        errD_real = criterion(output, labelv)
        errD_real.backward()
        D_x = output.data.mean()

        # train with fake
        noise.resize_(batch_size, nz, 1, 1).normal_(0, 1)
        noisev = Variable(noise)
        fake = netG(noisev)
        labelv = Variable(label.fill_(fake_label))
        output = netD(fake.detach())
        errD_fake = criterion(output, labelv)
        errD_fake.backward()
        D_G_z1 = output.data.mean()
        errD = errD_real + errD_fake
        optimizerD.step()

        # (2) Update G network: maximize log(D(G(z)))
        netG.zero_grad()
        # fake labels are real for generator cost
        labelv = Variable(label.fill_(real_label))
        output = netD(fake)
        errG = criterion(output, labelv)
        errG.backward()
        D_G_z2 = output.data.mean()
        optimizerG.step()
Note that this update implements the −log(D(G(z))) trick.
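A short numerical check helps to see what this trick changes for the generator. The sketch below is an illustration under one assumption: D is written as a sigmoid applied to a score a, which matches the final nn.Sigmoid of the discriminator, and the function names are made up for this example. It compares the derivative with respect to a of log(1 − D), which the generator would otherwise minimize, with that of −log(D), which the update above minimizes.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def grad_saturating(a):
    """d/da of log(1 - sigmoid(a)), which equals -sigmoid(a)."""
    return -sigmoid(a)

def grad_non_saturating(a):
    """d/da of -log(sigmoid(a)), which equals -(1 - sigmoid(a))."""
    return -(1.0 - sigmoid(a))

# a is the discriminator's pre-sigmoid score on a generated sample;
# very negative values mean the sample is confidently rejected as fake.
for a in (-6.0, -2.0, 0.0, 2.0):
    print(a, sigmoid(a), grad_saturating(a), grad_non_saturating(a))
```

When the discriminator confidently rejects a generated sample (D close to 0), the first derivative nearly vanishes while the second stays close to −1 in magnitude, so the generator still receives a usable gradient.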
Real images from LSUN’s “bedroom” class.
Fake images after 1 epoch (3M images)
Fake images after 20 epochs
References
D. Erhan, Y. Bengio, A. Courville, and P. Vincent. Visualizing higher-layer features of a
deep network. Technical Report 1341, Departement IRO, Université de Montréal, 2009.
I. J. Goodfellow, J. Pouget-Abadie, M. Mirza, B. Xu, D. Warde-Farley, S. Ozair,
A. Courville, and Y. Bengio. Generative adversarial networks. CoRR, abs/1406.2661,
2014.
A. Radford, L. Metz, and S. Chintala. Unsupervised representation learning with deep
convolutional generative adversarial networks. CoRR, abs/1511.06434, 2015.
K. Simonyan, A. Vedaldi, and A. Zisserman. Deep inside convolutional networks:
Visualising image classification models and saliency maps. CoRR, abs/1312.6034, 2013.
J. Springenberg, A. Dosovitskiy, T. Brox, and M. Riedmiller. Striving for simplicity: The all
convolutional net. CoRR, abs/1412.6806, 2014.
C. Szegedy, W. Zaremba, I. Sutskever, J. Bruna, D. Erhan, I. Goodfellow, and R. Fergus.
Intriguing properties of neural networks. In International Conference on Learning
Representations (ICLR), 2014.
M. D. Zeiler and R. Fergus. Visualizing and understanding convolutional networks. In
European Conference on Computer Vision (ECCV), 2014.
