
12 Deep Learning

In the previous chapter we discussed the basics of neural networks and how they can be trained. In this chapter, we continue with deep neural networks composed of many hidden layers. We discuss how such models can be trained efficiently, different deep network architectures, and some applications where they are used successfully.

12.1 Introduction

When a linear model is not sufficient, one possibility is to define new features that are nonlinear functions of the input, for example, higher-order terms, and then build a linear model in the space of those features; we discussed this in section 10.2. This requires us to know what such good basis functions are. Another possibility is to use one of the feature extraction methods we discussed in chapter 6 to learn the new space, for example, PCA or Isomap; such methods have the advantage that they are trained on data, but they are generally unsupervised.
Currently the best approach seems to be to use a multilayer perceptron
(MLP) that extracts such features in its hidden layer, as we discussed
in chapter 11. The MLP has the advantage that the first layer (feature
extraction) and the second layer (where those features are combined to
predict the final output) are learned together in a supervised manner.
In chapter 11, we introduced the basic perceptron with a single layer and the MLP with one hidden layer. Though the MLP with one hidden layer is a universal approximator, that is, it can approximate any function with arbitrary accuracy, it may need a very large number of hidden units to achieve that. It has long been known empirically that a "long and thin" network not only has fewer parameters (and hence
smaller complexity) than a "short and fat" one but also achieves better generalization. This is the idea behind deep neural networks. Deep neural networks have many hidden layers, and, starting from the raw input, each hidden layer combines the values in its preceding layer and learns more complicated functions of the input. Successive hidden layers correspond to more abstract representations until we get to the final layer, where the outputs are learned in terms of these most abstract concepts. We will see, for example, the convolutional neural network for image recognition in section 12.5, where, starting from local groups of pixels, we get to edges, then to corners, and so on, until we get to a digit. We know that an image is composed of such primitives at different levels, and we use this knowledge to define the connectivity and the overall architecture.

If we used an MLP where each hidden unit is connected to all the inputs, the network would have no knowledge that the inputs are face images, or even that the input is two-dimensional. Using a convolutional network where hidden units are fed with localized two-dimensional patches of the image is an inductive bias and, as such, is a way of feeding in this information about local correlations so that correct abstractions can be learned at different levels.

In deep learning, the idea is to learn feature levels of increasing abstraction with minimum human contribution (Bengio 2009; Goodfellow, Bengio, and Courville 2016). This is because in most applications we do not know what structure there is in the input, or we may know it only for the first few layers of basic feature extraction, and any dependencies there are should be automatically discovered during training. It is this extraction of dependencies, or patterns, or regularities, that allows abstraction and learning general descriptions.

Though deep learning is attractive, it has various potential problems. First, a deep network is much larger, with more weights and processing units, and hence needs more memory and more computation; luckily, GPUs and multicore architectures are getting cheaper every day, which helps us with that. Second, a deep network has more free parameters, and training it is more difficult. Not only do we have a more complicated error surface, but the fact that there are many hidden layers is also problematic. For example, in backpropagating the error to an early layer, because of the chain rule we multiply the derivatives in all the layers afterward, and the gradient may vanish or explode. This has led to many heuristics to finetune the basic gradient descent method, and we will discuss those in section 12.3. Finetuning the network architecture and connectivity is also useful, as we will see in section 12.6.

The old approach was to train one layer at a time (Hinton and Salakhutdinov 2006). The aim of each hidden layer is to extract the salient features in the data fed to it, and a method such as the autoencoder that we discussed in section 11.9 can be used for this purpose; there is the extra advantage that we can use unlabeled data for this. So, starting from the raw input, we train an autoencoder, and the encoded representation learned in its hidden layer is then used as input to train the next autoencoder, and so on, until we get to the final layer, which is trained in a supervised manner with the labeled data. Once the layers are trained in this way, one by one, they are all stacked, and the whole network is finetuned with the labeled data. Today, with more computing power and labeled data available, the whole deep network can be trained in a supervised manner. Still, using an unsupervised method to initialize the weights may sometimes work better than random initialization; learning can then be done much faster and with fewer labeled data. Or, earlier feature extraction layers can be "transferred" from another network trained on larger but similar data.

Deep learning methods are attractive mainly because they need less manual interference. We do not need to craft the right features or suitable basis functions or kernels (chapter 14), or worry about optimizing the exact network architecture. Once we have data (and these days we have "big" data) and sufficient computation available, we just wait and let the learning algorithm discover by itself all that is necessary.

The idea of multiple layers of increasing abstraction that lies underneath deep learning is intuitive. Not only in vision, with handwritten digits or face images, but in many applications, we can think of layers of abstraction, and discovering such abstract representations allows a better description of the problem.

Consider machine translation, starting with an English sentence, which is a sequence of characters. In multiple levels of processing and abstraction that are learned automatically from a very large corpus of English sentences, so as to code the lexical, syntactic, and semantic rules of the English language, we get to the most abstract representation. Now consider the same sentence in French. The levels of processing learned this time from a French corpus would be different, but if two sentences mean the same thing, at the most abstract, language-independent level they should have very similar representations.
So we can view machine translation as basically composed of an encoder and a decoder back to back. The encoder takes an English sentence and generates a language-independent embedding of the meaning, and the decoder takes this and synthesizes the corresponding French sentence. Training is done end to end: The training data contains pairs of English and French sentences, and all intermediate processing is learned. That is, all the lexical, syntactic, and semantic constraints that are necessary for analysis and synthesis are learned from data and implemented in the many layers of the deep encoder and decoder networks. This is basically how state-of-the-art machine translation is learned and done (Wu et al. 2016).

We are familiar with this type of processing in computer science; consider, for example, how a compiler works: There is the front end that does the lexical and syntactic analyses and generates an intermediate representation, and the back end takes this as input and generates code for a particular processor. One can adapt the compiler to different languages and processors by changing only the front or back end, respectively.

We have many other examples of the same approach in machine learning as well. For example, consider the task of generating captions for images (Vinyals et al. 2015). The encoder takes the image as input, analyzes it in a deep network, and generates an embedding, which is an abstract representation of the image content; the decoder takes this and generates an English sentence as the caption.

We also see this in game playing, for example, in the Atari player (Mnih et al. 2013). The encoder takes the screen image and generates a compressed embedding, and the decoder takes this and decides on the action. Again, the whole network is trained end to end. Previously, such a system would be handled in two stages: There would be one vision module taking an image and generating an intermediate representation, and there would be an action module taking the intermediate representation and deciding the action; the two modules would typically be designed by different groups of people with different areas of expertise who would sit down together and decide on the intermediate representation.

Now with deep, end-to-end training, the two modules are trained in a coupled manner; our training data consists of pairs of input at the very beginning and output at the very end, for example, pairs of an image and the best action, and it is the learning algorithm that decides on the intermediate representation. All we need is a large enough network and lots and lots of data to cover as many combinations as possible. Let us now see how this is done.

12.2 How to Train Multiple Hidden Layers

12.2.1 Rectified Linear Unit

In chapter 11, we used sigmoid and tanh as nonlinear activation functions. An activation function for hidden units that has become popular recently with deep networks is the rectified linear unit (ReLU) (Nair and Hinton 2010; Glorot, Bordes, and Bengio 2011), which is defined as

(12.1)  ReLU(a) = a if a > 0, and 0 otherwise

Though ReLU is not differentiable at a = 0, we use it anyway; we use the left derivative:

(12.2)  ReLU'(a) = 1 if a > 0, and 0 otherwise

One advantage of ReLU is that, because it does not saturate (unlike sigmoid and tanh), updates can still be done for large positive a. The other advantage is that, for some inputs, some hidden unit activations will be zero, meaning that we will have a sparse representation. Sparse representations lead to faster training (where there are fewer nonzero hidden units, there is less interference between them); they are also easier to interpret. The disadvantage of ReLU is that, because the derivative is zero for a ≤ 0, there is no further training if, for a hidden unit, the weighted sum somehow becomes negative. This implies that one should be careful in initializing the weights so that the initial activation for all hidden units is positive.

In the leaky ReLU (Maas, Hannun, and Ng 2013), the output is also linear on the negative side but with a smaller slope, just enough to make sure that there will be updates for negative activations, albeit small:

(12.3)  leaky-ReLU(a) = a if a > 0, and αa otherwise

For example, α is taken as 0.01, which is then the derivative for a ≤ 0.
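As a small illustration, here is a NumPy sketch of these activation functions and their derivatives, following equations 12.1 to 12.3; the function names and the default α = 0.01 are ours, everything else follows the text.

import numpy as np

def relu(a):
    # ReLU(a) = a if a > 0, and 0 otherwise (equation 12.1)
    return np.maximum(0.0, a)

def relu_prime(a):
    # Using the left derivative at a = 0, so ReLU'(0) = 0 (equation 12.2)
    return (a > 0).astype(float)

def leaky_relu(a, alpha=0.01):
    # Linear with a small slope alpha on the negative side (equation 12.3)
    return np.where(a > 0, a, alpha * a)

def leaky_relu_prime(a, alpha=0.01):
    return np.where(a > 0, 1.0, alpha)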
12.2.2 Initialization

One should be careful when initializing the weights in a deep network. In chapter 11, we said that we can choose initial weight values uniformly randomly from the interval [-0.01, 0.01], and when we have an application with tens of inputs and hidden units, this is sufficient. If a unit is receiving inputs from thousands of units or more, the range should be scaled down correspondingly.

The aim is always to make sure that we start with activations where the derivative is nonzero. In the case of sigmoid and tanh, the derivative is maximum at zero, so we need to initialize so that we get small values and so that we do not accidentally get a value less than -5 or greater than 5. For ReLU, we need to make sure that we get a nonnegative value; for example, initially weights can similarly be very close to zero, and we can have a positive bias.
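As an illustration of one such scheme, the following sketch shrinks the initial weight range as the fan-in grows, so that the initial weighted sums stay small, and uses a small positive bias for ReLU units; the exact scaling rule here is an assumption made for the sake of the example, not a prescription from the text.

import numpy as np

rng = np.random.default_rng(0)

def init_layer(fan_in, fan_out, scale=0.01, relu_bias=0.1):
    # Shrink the uniform range with the fan-in so that the weighted sum
    # of fan_in inputs stays in the region where the derivative is nonzero.
    r = scale / np.sqrt(fan_in)
    W = rng.uniform(-r, r, size=(fan_out, fan_in))
    # A small positive bias keeps the initial ReLU activations nonnegative.
    b = np.full(fan_out, relu_bias)
    return W, b

W1, b1 = init_layer(fan_in=1000, fan_out=128)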
12.2.3 Generalizing Backpropagation to Multiple Hidden Layers

In chapter 11, we derived the update equations for a multilayer perceptron with a single hidden layer. When we have multiple hidden layers, in each layer we have hidden units receiving input from the preceding layer, taking a weighted sum with their own modifiable weights, and giving their value after passing the sum through the nonlinear activation function.

Let us say that we have an MLP with two hidden layers, and let us say that this is a regression task with multiple outputs. Given input x^t, we have

        z_{1h}^t = f(w_{1h}^T x^t) = f(Σ_j w_{1hj} x_j^t)

        z_{2l}^t = f(w_{2l}^T z_1^t) = f(Σ_h w_{2lh} z_{1h}^t)

        y_i^t = v_i^T z_2^t = Σ_l v_{il} z_{2l}^t + v_{i0}

where f denotes the activation function, for example, sigmoid, tanh, or ReLU; w_{1h} and w_{2l} are the first- and second-layer weights; z_{1h} and z_{2l} are the units on the first and second hidden layers with H_1 and H_2 hidden units; and v are the third-layer weights. Training such a network is similar, except that, to train the first-layer weights, we need to backpropagate over one more layer.

Let us say that we are minimizing the sum of squared errors:

        E = (1/2) Σ_t Σ_{i=1}^K (r_i^t - y_i^t)^2

There are K outputs, and the range of t depends on whether we are doing online, minibatch, or batch learning. The error at output i is

        e_i^t = r_i^t - y_i^t

and we update the weights to that output as

        Δv_{il} = η Σ_t e_i^t z_{2l}^t

The error at hidden unit l in the second hidden layer is the sum of the backpropagated errors, each multiplied by the weight on its path, which indicates the total "responsibility" of the hidden unit:

        e_{2l}^t = Σ_i e_i^t v_{il}

and to update a weight to that hidden unit, we multiply this by the derivative of the activation function and its input:

        Δw_{2lh} = η Σ_t e_{2l}^t f'(z_{2l}^t) z_{1h}^t

Similarly, the error at hidden unit h in the first hidden layer is the sum of the backpropagated errors, each multiplied by the weight on its path:

        e_{1h}^t = Σ_l e_{2l}^t w_{2lh}

and again, to update a weight to that hidden unit, we multiply the backpropagated error, the derivative of the activation function, and its input:

        Δw_{1hj} = η Σ_t e_{1h}^t f'(z_{1h}^t) x_j^t

The whole process involves first the propagation of information in the forward direction of calculating the hidden units, and then in the backward direction of calculating the errors.
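A minimal NumPy sketch of one stochastic update for such a two-hidden-layer regression network follows; biases are omitted for brevity, and the sizes and learning rate are illustrative. Note that when the error is carried back across the second hidden layer, the sketch also multiplies by that layer's activation derivative, so that each update is the exact gradient of the squared error for that instance.

import numpy as np

rng = np.random.default_rng(0)

def f(a):            # tanh as the hidden activation
    return np.tanh(a)

def f_prime(z):      # tanh derivative written in terms of the output z
    return 1.0 - z ** 2

D, H1, H2, K = 4, 8, 6, 2                 # input, hidden, and output sizes
W1 = rng.uniform(-0.1, 0.1, (H1, D))      # first-layer weights
W2 = rng.uniform(-0.1, 0.1, (H2, H1))     # second-layer weights
V = rng.uniform(-0.1, 0.1, (K, H2))       # third-layer weights
eta = 0.1

def update(x, r):
    global W1, W2, V
    z1 = f(W1 @ x)                        # first hidden layer
    z2 = f(W2 @ z1)                       # second hidden layer
    y = V @ z2                            # linear outputs for regression
    e = r - y                             # errors at the outputs
    e2 = (V.T @ e) * f_prime(z2)          # backpropagated to layer 2
    e1 = (W2.T @ e2) * f_prime(z1)        # backpropagated to layer 1
    V += eta * np.outer(e, z2)            # outer products give the updates
    W2 += eta * np.outer(e2, z1)
    W1 += eta * np.outer(e1, x)

update(rng.normal(size=D), np.array([0.5, -0.5]))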
Let us generalize. Assume we have an arbitrary weight w_{khj} from layer k-1 to layer k, and let us see first how it is used and then how it is updated (see figure 12.1).

We start with the forward direction, in which we calculate z values as the weighted sum of the earlier z values (z_{0j} = x_j are the inputs):

(12.4)  z_{kh} = f(Σ_j w_{khj} z_{k-1,j})

Summation goes over all the units that are connected to this unit in the previous layer; we call this the fan-in. If k is the output layer, we use an activation function suitable for the application, for example, linear for regression and softmax for multiclass classification.

In the backward direction, we calculate e values as the weighted sum of the later e values:

(12.5)  e_{kh}^t = Σ_l e_{k+1,l}^t w_{k+1,lh}

Summation goes over all the units that this unit is connected to in the following layer; we call this the fan-out. If k is the output layer, there is no layer k+1, and e_{kh}^t is the derivative of the error function with respect to the output z_{kh}. The update uses this backpropagated error:

(12.6)  Δw_{khj} = η Σ_t e_{kh}^t f'(z_{kh}^t) z_{k-1,j}^t

If k is the output layer and we have an output activation g, we use g'. So first we go in the forward direction using equation 12.4, starting with k = 1 and incrementing it until we get to the output layer. Then we go in the backward direction, starting at the output and decrementing k until we get to the first hidden layer. In some architectures the layers are not fully connected; in such a case, the summations in the forward and backward directions are accordingly limited to the fan-in and fan-out.

Figure 12.1 In the forward direction, when we use weights, we calculate unit values as a weighted sum of unit values in the previous layer. In the backward direction, when we update weights, we calculate errors as a weighted sum of errors in the following layer.

12.3 Improving Training Convergence

Gradient descent has various advantages. It is simple. It is also local; that is, the change in a weight uses only the values of the presynaptic and postsynaptic units (the units before and after the connection) and the error (suitably backpropagated). When online training is used, it does not need to store the training set and can adapt as the task to be learned changes. But by itself, gradient descent converges slowly. Methods such as the conjugate gradient and second-order methods are computationally intensive and cannot be used easily with deep networks with many weights (Goodfellow, Bengio, and Courville 2016). However, simple techniques improve the performance of gradient descent considerably, making it feasible in real applications.

12.3.1 Momentum

Let us say that w_i is any weight in an MLP in any layer, including the biases. At each parameter update, successive gradient values may be so different that large oscillations may occur and slow convergence. Here, t is the time index, which is the epoch number in batch learning and the iteration number in online or minibatch learning. The idea is to take a running average by incorporating the previous update in the current change, as if there is a momentum due to previous updates.

For each weight w_i, we keep an additional value s_i, originally zero, that keeps a running average of the past gradients, and that is how the weight is updated:

(12.7)  s_i^t = α s_i^{t-1} + (1 - α) ∂E^t/∂w_i
        Δw_i^t = -η s_i^t

The hyperparameter 0 < α < 1 is the decay or forgetting parameter of the running average and is generally taken as 0.9. Momentum is especially useful when online or minibatch learning is used, where as a result we get an effect of averaging and smooth the trajectory during convergence. The disadvantage is that the s_i values need to be stored in extra memory.

12.3.2 Adaptive Learning Factor

In gradient descent, the learning factor η determines the magnitude of the change to be made in the parameter. Typically it takes a value less than or equal to 0.2. A very early heuristic was to make it adaptive for faster convergence, where it is kept large when learning takes place and is decreased when learning slows down:

(12.8)  Δη = +a if E^{t+τ} < E^t, and -bη otherwise

Thus we increase η by a constant amount (or keep it unchanged) if the error on the training set decreases, and decrease it geometrically if the error increases. Because E may oscillate from one epoch to another, it is a better idea to take the average of the past few epochs as E^t.

In a deep network with many hidden layers and units, the contributions of weights to the error are not the same, and one can adapt the learning factor separately for each weight, depending on convergence along that direction. The idea in such methods is to accumulate the past error gradient magnitudes for each weight and then make the learning factor inversely proportional to that. In directions where the gradient is small, the learning factor is larger to compensate, and it can be smaller along directions where the gradient is already large.

In AdaGrad (Duchi, Hazan, and Singer 2011), we keep a running sum of all the past gradients; in RMSProp (Hinton 2012), we keep a running average where we give more importance to more recent values. For any weight w_i, the update rule is

(12.9)  Δw_i^t = -(η/√(r_i^t)) ∂E^t/∂w_i

where r_i is the accumulated past gradient, one for each weight in the network. It is initially zero and updated after each epoch as

(12.10)  r_i^t = ρ r_i^{t-1} + (1 - ρ) (∂E^t/∂w_i)^2

where ρ is generally taken as 0.999. Adam (Kingma and Ba 2014) is an extension that also includes a momentum factor. At iteration t, we have

(12.11)  s_i^t = α s_i^{t-1} + (1 - α) ∂E^t/∂w_i
         r_i^t = ρ r_i^{t-1} + (1 - ρ) (∂E^t/∂w_i)^2

(12.12)  ŝ_i^t = s_i^t/(1 - α^t)  and  r̂_i^t = r_i^t/(1 - ρ^t)
         Δw_i^t = -(η/√(r̂_i^t)) ŝ_i^t

Apologies for the two different uses of the superscript t here: With s_i^t, for example, t denotes an index; with α^t and ρ^t, it denotes the tth power.

In the terminology of Adam, short for adaptive moments, s_i^t is the first moment (it uses the gradient) and r_i^t is the second moment (it uses the square of the gradient), and the versions of these with a hat on top that we calculate in equation 12.12 are the so-called bias-corrected values: Because both s_i^t and r_i^t start from 0, and because both α and ρ are close to 1, in early iterations the estimated values will be biased toward 0; the divisions correct for that. When t gets large (which happens later, when the corresponding hyperparameter is closer to 1), its tth power will be close to 0, and the division in the bias correction will be almost by 1.
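The following NumPy sketch implements the momentum update of equation 12.7 and the Adam update of equations 12.11 and 12.12; the small eps added inside the square root is a standard implementation detail for numerical safety and is not part of the equations in the text.

import numpy as np

eta, alpha, rho, eps = 0.001, 0.9, 0.999, 1e-8

def momentum_step(w, s, grad):
    # Equation 12.7: running average of past gradients, then the step
    s = alpha * s + (1 - alpha) * grad
    return w - eta * s, s

class Adam:
    def __init__(self, shape):
        self.s = np.zeros(shape)      # first moment (uses the gradient)
        self.r = np.zeros(shape)      # second moment (uses its square)
        self.t = 0
    def step(self, w, grad):
        self.t += 1
        self.s = alpha * self.s + (1 - alpha) * grad      # equation 12.11
        self.r = rho * self.r + (1 - rho) * grad ** 2
        s_hat = self.s / (1 - alpha ** self.t)            # equation 12.12,
        r_hat = self.r / (1 - rho ** self.t)              # bias-corrected
        return w - eta * s_hat / (np.sqrt(r_hat) + eps)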
12.3.3 Batch Normalization

We know from earlier chapters (e.g., section 11.3) that it is always a good idea to z-normalize the inputs to have zero mean and unit variance so that all inputs are in the same scale. The advantage is that, in some training data, for an attribute, all values may be clustered in some small range, and normalization spreads them; when the inputs are in the same scale, the corresponding weights are also in the same scale, so initializing them in the same range or using the same learning rate makes sense.

A similar case can also be made for the hidden units, and this is the idea behind batch normalization (Ioffe and Szegedy 2015): We normalize hidden unit values before applying the activation function, such as sigmoid or ReLU. Let us call that weighted sum a_j. For each batch or minibatch, for each hidden unit j, we calculate the mean m_j and standard deviation s_j of its values, and we first z-normalize:

(12.13)  ã_j = (a_j - m_j)/s_j

We can then map these to have an arbitrary learned shift β_j and scale γ_j:

(12.14)  â_j = γ_j ã_j + β_j

and then we apply the activation function.

There are a number of important points here. First, m_j and s_j are calculated anew for each batch, and we see immediately that batch normalization is not meaningful with online learning or very small minibatches. Second, γ_j and β_j are parameters that are initialized and updated (after each batch or minibatch) using gradient descent, just like the connection weights, so they require extra memory and computation. These new parameters allow each hidden unit to have its arbitrary (but learned) mean and standard deviation for its activation.

Note also that if we normalize using equation 12.13, an incoming bias to hidden unit j (in calculating a_j) is useless; it is a constant added for all instances and will be subtracted out. Actually, β_j will now act as the bias.

Once learning is done, we can go over the whole training data (or a large enough subset) and calculate m_j and s_j to use during testing, when we see one instance at a time.

One reason why batch normalization helps learning is the same as why we normalize inputs: It allows a better use of the range. In some minibatches, all instances may lead to very similar a_j; renormalization makes the differences between them clearer. There is another reason that is specific to hidden units: Note that the value of each hidden unit in a deep network depends on all of the weights before it, and these are continuously updated. This means that, as learning proceeds, for the same training data, a_j values shift up and down all together when the earlier weights change in this or that way. When we normalize, we get rid of such oscillations and can focus on the relative differences.

We have here an example of covariate shift, which normally is the difference between the training and test distributions. For example, in face recognition, all training faces may have been taken under ideal conditions, but the lighting may not be as good in the test images. Robustness to such shifts is an important problem in machine learning in general. Here, if you consider all the layers before a_j, it is as if in successive epochs the distribution of the data coming out of them is constantly changing (because of changes in the parameters in those layers), and batch normalization helps with that.
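A minimal sketch of the forward computation of equations 12.13 and 12.14 for one minibatch follows; A holds the weighted sums a_j for the whole minibatch, gamma and beta are the learned per-unit parameters, and the small eps that guards against division by zero is an implementation detail not in the text.

import numpy as np

def batch_norm(A, gamma, beta, eps=1e-5):
    # A has shape (minibatch size, number of hidden units)
    m = A.mean(axis=0)              # per-unit mean over the minibatch
    s = A.std(axis=0)               # per-unit standard deviation
    A_norm = (A - m) / (s + eps)    # z-normalize (equation 12.13)
    return gamma * A_norm + beta    # scale and shift (equation 12.14)

A = np.random.default_rng(0).normal(size=(32, 4))
out = batch_norm(A, gamma=np.ones(4), beta=np.zeros(4))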
12.4 Regularization

We know from chapter 11 that, just like any other estimator, the complexity of an MLP needs to be adjusted to the complexity of the problem that underlies the data. A network that is too small has bias; a network that is too large has variance. In this chapter we are concerned with networks that have many layers, each with many hidden units and hence too many connections, and we need to find ways to keep variance in check. This approach is inherent in deep learning: We do not want to be bothered with finetuning the exact architecture; we start with a large network and expect learning to use whatever portion of it is needed, ignoring what is extra.

Of course, the main factor that helps us is that nowadays we have much larger data sets. Big data implies not only that we have data that well reflects the main characteristics of the data but also that we have instances for very rare cases; these latter help us distinguish rare but legitimate cases from noise.

This section discusses different ways we can regularize deep neural networks. Some of these are applicable to other learning methods as well, and some are special to neural networks.

12.4.1 Hints

In certain applications we have information about the underlying application, and it is always helpful if we can integrate it into learning. Hints are properties of the target function that are known to us independent of the training examples (Abu-Mostafa 1995).
Figure 12.2 The identity of the object does not change when it is translated, rotated, or scaled. Note that this may not always be true, or may be true only up to a point: 'b' and 'q' are rotated versions of each other. These are hints that can be incorporated into the learning process to make learning easier.

In image recognition, there are invariance hints: The identity of an object does not change when it is rotated, translated, or scaled (see figure 12.2). Hints are auxiliary information that can be used to guide the learning process and are especially useful when the training set is limited. There are different ways in which hints can be used:

1. Hints can be used to create virtual examples. For example, knowing that the object is invariant to scale, from a given training example we can generate multiple copies at different scales and add them to the training set with the same label. This has the advantage that we increase the training set and do not need to modify the learner in any way. The problem may be that too many examples may be needed for the learner to learn the invariance.

2. The invariance may be implemented as a preprocessing stage. For example, optical character readers have a preprocessing stage where the input character image is centered and normalized for size and slant. This is the easiest solution, when it is possible.

3. The hint may be incorporated into the network structure. Convolutional networks with localized connections and weight sharing, which we will discuss in section 12.5, are one example.

4. The hint may also be incorporated into the error function. Let us say we know that x and x' are the same from the application's point of view, where x' may be a "virtual example" of x; for example, x' may be a slightly translated version of x: f(x) = f(x'), where f(x) is the function we would like to approximate. Let us denote by g(x|θ) our approximating neural network, where θ are its weights. Then, for all such pairs (x, x'), we define the penalty function

        E_h = Σ_{x,x'} [g(x|θ) - g(x'|θ)]^2

   and add it as an extra term to the usual error function:

        E' = E + λ_h · E_h

   This is a penalty term penalizing the cases where our predictions do not obey the hint, and λ_h is the weight of such a penalty (Abu-Mostafa 1995); a small sketch of this is given at the end of this subsection.

One way this is used is to get robustness to noise. For example, let us say we have an image recognition problem where we know that there can be occlusions. We generate virtual examples with parts synthetically occluded and introduce them as additional examples with the same label. The network trained with these will learn to distribute its input over all of the image so that it is minimally affected by missing data in parts. We saw an example of this with noisy autoencoders in section 11.10.

Yet another example is the tangent prop (Simard et al. 1992), where the transformation against which we are defining the hint, for example, rotation by an angle, is modeled by a function. The usual error function is modified (by adding another term) so as to allow parameters to move along this line of transformation without changing the error.
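The two main uses above can be sketched in a few lines of NumPy; g below stands for the approximating network and is assumed to be a callable, and the wrap-around translation is just a cheap way of generating virtual examples for the illustration.

import numpy as np

def translated_copies(x_img, shifts=(-1, 1)):
    # Virtual examples: slightly translated versions that share the label
    # (np.roll wraps around the border, which is good enough for a sketch)
    return [np.roll(x_img, s, axis=1) for s in shifts]

def hint_penalty(g, theta, pairs):
    # E_h: sum of squared output differences over pairs (x, x') that the
    # hint says should be treated the same
    return sum(np.sum((g(x, theta) - g(xp, theta)) ** 2) for x, xp in pairs)

# The total error to minimize is then E' = E + lambda_h * hint_penalty(...)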
12.4.2 Weight Decay

If a weight is close to zero, then the input or hidden unit before that weight effectively does not participate in the calculation; the higher the weight in magnitude, the more it is taken into account, effectively increasing the complexity of the model. That is why we initialize weights with values close to zero, and that is why stopping early is a regularizer. As learning proceeds, more weights move away from zero; it is as if the complexity (and hence the variance) of the model increases with training, and though the training error may decrease, the validation error may level off and then start increasing. We stop when we observe that this is happening.

When all the incoming weights are close to zero, the weighted sum is also close to zero (assuming the bias is also very small), which implies that we are in the middle section of a sigmoid (or tanh), which is almost linear. If we have successive hidden layers that are almost linear, because the linear combination of linear models is also linear, together they act as one linear model, which effectively corresponds to pruning the hidden layers in between.

Even if we start with a weight close to zero, because of some noisy instances it may move away from zero; the idea in weight decay is to add some small constant background force that always pulls a weight toward zero, unless it is absolutely necessary that it be large (in magnitude) to decrease error. For any weight w_i, the update rule is

(12.15)  Δw_i = -η ∂E/∂w_i - λ'w_i

This is equivalent to doing gradient descent on an error function with an added penalty term, penalizing networks with many nonzero weights (λ' = ηλ):

(12.16)  E' = E + (λ/2) Σ_i w_i^2

where E is the usual classification or regression error (negative log likelihood). The sum in the second term is over all weights in all layers, and its effect is like that of a spring that pulls each weight to zero. Starting from a value close to zero, unless the actual error gradient is large and causes an update, due to the second term the weight will gradually decay to zero.

The hyperparameter λ determines the relative importance of the error on the training set and the complexity due to nonzero parameters, and thus determines the speed of decay: With a large λ, weights will be pulled to zero no matter what the training error is; with a small λ, there is not much penalty for nonzero weights. Like any hyperparameter, λ is finetuned using cross-validation. This approach of removing unnecessary parameters is known as ridge regression in statistics.

That simpler networks are better generalizers is a hint we implement by adding a penalty term. Note that we are not saying that simple networks are always better than large networks; we are saying that, if we have two networks that have the same training error, the simpler one, the one with fewer weights, has a higher probability of better generalizing to new examples.

If you consider the vector of all weights, the second term in equation 12.16 corresponds to the Euclidean (L2) norm of that vector, and hence another name for weight decay is L2 regularization. Another possibility is

(12.17)  E' = E + (λ/2) Σ_i |w_i|

called L1 regularization, which forces more of the w_i to be zero and leads to sparser representations.
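In code, weight decay and the two penalties look as follows; the function names are ours, and the update implements equation 12.15 with λ' = ηλ.

import numpy as np

def weight_decay_step(w, grad, eta=0.01, lam=1e-4):
    # Equation 12.15: the usual gradient step plus a constant pull to zero
    return w - eta * grad - eta * lam * w

def l2_penalty(weights, lam):
    # (lambda/2) sum of squared weights over all layers (equation 12.16)
    return 0.5 * lam * sum(np.sum(w ** 2) for w in weights)

def l1_penalty(weights, lam):
    # (lambda/2) sum of absolute weights (equation 12.17)
    return 0.5 * lam * sum(np.sum(np.abs(w)) for w in weights)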
There is a Bayesian interpretation of weight decay. Remember that the Bayesian approach considers the parameters as random variables drawn from a prior distribution. In neural networks, the parameters are the connection weights w_i drawn from a prior distribution p(w_i). The posterior probability given the data is

(12.18)  p(w|X) = p(X|w)p(w)/p(X)

where w is the vector of all weights of the network. The MAP estimate is the mode of the posterior:

(12.19)  w_MAP = argmax_w log p(w|X)

Taking the log of equation 12.18, we get

(12.20)  log p(w|X) = log p(X|w) + log p(w) + C

The first term on the right is the log likelihood, and the second is the log of the prior. If the weights are independent and the prior is taken as Gaussian, N(0, 1/(2λ)), we have

(12.21)  p(w) = Π_i p(w_i)  where  p(w_i) = c · exp[-w_i^2/(2 · (1/(2λ)))]

Then maximizing equation 12.20 is equivalent to minimizing

(12.22)  E' = E + λ||w||^2

which is the same as equation 12.16. From this perspective, a large λ assumes small variability in the parameters, puts a larger force on them to be close to zero, and takes the prior into account more than the data; if λ is small, then the allowed variability of the parameters is larger and the force to move them to zero is weaker. We talk more about this in chapter 17 on Bayesian estimation. For example, we will also see there that L1 regularization assumes a Laplace prior that is more peaked around zero (see section 17.4). The use of Bayesian estimation in training MLPs is treated in MacKay (1992a, 1992b).

Empirically it has been shown that, after training, most of the weights of an MLP are distributed normally around zero, justifying the use of weight decay. But this may not always be the case. Nowlan and Hinton (1992) proposed soft weight sharing, where weights are drawn from a mixture of Gaussians, allowing them to form multiple clusters, not one. Also, these clusters may be centered anywhere and not necessarily at zero, and they have variances that are also modifiable. This changes the prior of equation 12.21 to a mixture of M ≥ 2 Gaussians:

(12.23)  p(w_i) = Σ_{j=1}^M α_j p_j(w_i)

where α_j are the component proportions and p_j(w_i) ~ N(m_j, s_j^2) are the component Gaussians. M is set by the user, and α_j, m_j, and s_j are learned from the data. Using such a prior and augmenting the error function with its log during training, the weights converge to decrease error and are also grouped automatically to increase the log prior.

12.4.3 Dropout

Figure 12.3 In dropout, the output of a random subset of the units are set to zero, and backpropagation is done on the remaining smaller network.

Previously we talked about the advantage of adding noise to inputs, which makes the network more robust in that the network learns not to rely too much on particular inputs. Say we have an image data set where a group of pixels in the background, simply by luck, always takes similar values for examples of a class. In such a case, the network can very quickly, but wrongly, learn to base its decision for that class on them. But if we introduce additional examples where those pixels are missing or corrupted, the dependency on them will be reduced. A similar case can also be made for hidden units, and it has also been proposed to add noise to hidden units, thereby distributing decisions to whole layers rather than specific units (Poole, Sohl-Dickstein, and Ganguli 2014).

The extreme case is to discard inputs or hidden units completely. In dropout, we have a hyperparameter p, and we drop the input or hidden unit with probability p, that is, set its output to zero, or keep it with probability 1 - p (Srivastava et al. 2014). p is adjusted using cross-validation; with more inputs or hidden units in a layer, we can afford a larger p (see figure 12.3).

In each batch or minibatch, for each unit independently, we decide randomly to keep it or not. Let us say that p = 0.25. So, on average, we remove a quarter of the units, and we do backpropagation as usual on the remaining network for that batch or minibatch. We need to make up for the loss of units, though: In each layer, we divide the activation of the remaining units by 1 - p to make sure that they provide a vector of similar magnitude to the next layer. There is no dropout during testing.

In each batch or minibatch, a smaller network (with smaller variance) is trained. Thus dropout is effectively sampling from a pool of possible networks of different depths and widths. The resulting effect is that, as we mentioned above, instead of giving large weights to few units in the preceding layer, dropout gives small weights to many, just like we have in L2 regularization. Dropout has been found to be very effective in regularizing deep neural networks.

There is a version called dropconnect that drops or keeps connections independently, which allows a larger set of possible networks to sample from, and this may be preferable in smaller networks (Wan et al. 2013).
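A sketch of the dropout computation for one layer follows, using the convention in the text: drop with probability p during training, rescale the surviving units by 1/(1 - p), and do nothing at test time.

import numpy as np

rng = np.random.default_rng(0)

def dropout(z, p, training=True):
    if not training:
        return z                         # no dropout during testing
    keep = rng.random(z.shape) >= p      # True for the units we keep
    return z * keep / (1.0 - p)          # rescale to a similar magnitude

z = np.ones(8)
print(dropout(z, p=0.25))                # on average a quarter are zeroed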
12.5 Convolutional Layers

12.5.1 The Idea

In getting a better fit to the data, one possibility is to adapt the structure of the network to better represent the constraints of the underlying problem; this is a type of inductive bias and implies reflecting whatever information we have regarding the data in the architecture of the network.

In some applications the input has local structure. For example, in vision we know that nearby pixels are correlated and that there are local features like edges and corners; any object, such as a handwritten digit, may be defined as a combination of such primitives. Similarly, in speech, locality is in time, and inputs close in time can be grouped as speech primitives; by combining these primitives, longer utterances, for example, speech phonemes, may be defined. In such a case, when designing the MLP, hidden units are not connected to all input units, because not all inputs are correlated. Instead, we define hidden units that define a window over the input space and are connected to only a small, local subset of the inputs. This decreases the fan-in and therefore the number of free parameters (LeCun et al. 1989).

We can repeat this in successive layers, where each layer is connected to a small number of local units below and checks for a more complicated feature by combining the features below in a larger part of the input space, until we get to the output layer (see figure 12.4). For example, starting from pixels as input, the first hidden layer units may learn to check for edges of various orientations. Then, by combining edges, the second hidden layer units can learn to check for combinations of edges, for example, arcs, corners, and line ends, and then, combining them in upper layers, the units can look for semicircles, rectangles, or, in the case of a face recognition application, eyes, mouth, and other facial features. This is an example of a hierarchical cone, where features get more complex, more abstract, and less local as we go up the network, until we get to classes. An early architecture based on this idea is the neocognitron (Fukushima 1980). In this section, we discuss the convolutional neural network, where such features are learned from data.

Figure 12.4 In a convolutional MLP, each unit is connected to a local group of units below it and checks for a particular feature, for example, an edge in vision. The outputs of these are then combined by a unit in a later layer that learns a more abstract feature, for example, a corner, in a larger region. Only one hidden unit is shown for each region; typically there are many, to check for different local features.

One may think that with such a network, where we have units checking for different features in all local patches, we may end up with too many parameters, but as we will see, we reduce the number of parameters by weight sharing. Taking the example of visual recognition again, we can see that, when we look for features like oriented edges, they may be present in different parts of the input space. So instead of defining independent hidden units learning different features in different parts of the input space, we have copies of the same hidden units looking at different parts of the input space. This is very simple to implement: During learning, we calculate the gradients on different patches, each with its own input, and then we average these gradients and make a single update on all copies. This implies a single parameter that defines the weight on multiple connections. Also, because the update on a weight is based on gradients for several inputs, it is as if the training set is effectively multiplied.

12.5.2 Formalization

We call the incoming weights of a hidden unit a convolution filter or a kernel. Each filter has a size, such as 3 x 3, that defines its scope. This filter is applied at all positions in the input; for example, the 3 x 3 filter is elementwise multiplied by every 3 x 3 patch of the image, returning a value for each position. Converting a matrix to a vector, these are 9-dimensional vectors, and the convolution becomes a dot product, as we always have with hidden units. To this result, we typically add a bias, and then we apply a nonlinearity, such as ReLU. So a convolution is a type of hidden unit whose input is restricted to a subset of the input, here a local region, called a receptive field. The output composed for all positions, arranged in the same way, for example, two-dimensional, is called a feature map; see Dumoulin and Visin (2016) for a guide to convolutional networks.
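A naive single-channel convolution layer, with zero padding so that the feature map keeps the input size, can be sketched as follows: a dot product with the kernel at every position, plus a bias, passed through ReLU. Loops are used for clarity rather than speed, and the example filter is only illustrative.

import numpy as np

def conv2d(image, kernel, bias=0.0):
    k = kernel.shape[0]              # square k x k kernel, k odd
    pad = k // 2
    padded = np.pad(image, pad)      # zero padding on all sides
    H, W = image.shape
    fmap = np.empty((H, W))
    for i in range(H):
        for j in range(W):
            patch = padded[i:i + k, j:j + k]
            fmap[i, j] = np.sum(patch * kernel) + bias
    return np.maximum(0.0, fmap)     # ReLU nonlinearity

img = np.random.default_rng(0).random((8, 8))
edge = np.array([[-1.0, 0.0, 1.0]] * 3)   # a 3 x 3 vertical-edge-like filter
fmap = conv2d(img, edge)                  # still 8 x 8, thanks to padding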
Each filter checks for one feature; remember that the dot product takes its maximum value when the two vectors are the same, and the result decreases as the dissimilarity between the input in that patch and the convolution kernel increases. On the edges of the image, if we do not want to lose information, we need to pad on all sides; for example, for a 3 x 3 filter, we add pixels (typically with value zero) of width 1 on all sides before applying the convolution. So if the image is, say, 100 x 100, the result of the filter is still 100 x 100 (otherwise it would have been reduced to 98 x 98).

Note that, from a neural network perspective, this implies 10,000 hidden units, each looking at one position, each with nine weights and one bias; but those weights and biases are the same, that is, they are shared. So there are 100,000 connections but only 10 free parameters.

Frequently we have multiple filters, because we want to check for more than one feature. Each filter has its own convolution matrix, and the feature map for each filter is called a channel. Sometimes the original input is already in multiple channels; for example, color images are typically given as three separate channels of red, green, and blue intensities. When we apply a filter to multiple channels, for each position we apply the same filter separately to each channel and get a value for each. We then sum those up, add a bias, and pass the result through an activation function such as ReLU.

Let us say that, in the previous layer, we have three channels, each of size 100 x 100, which we denote as 3 x 100 x 100. Let us say we want to check for six features. Assume that the filter size is 3 x 3 and that the input is padded. The output is then 6 x 100 x 100. For each of the six features, at every position (i, j) in the image, we apply the same 3 x 3 filter to all three channels, sum the resulting three values, add a bias, and pass it through ReLU, and that will be the (i, j) value for the corresponding output channel. So the 3 x 3 convolution sums over the 9 pixels in each channel, and we also sum over the three channels to get an output. Note that the same convolution is used on all incoming channels, so we have weight sharing there, too.

There is a neat idea called a 1 x 1 convolution, where the filter is 1 x 1 (a single value); this allows us to sum over the channels without summing over pixels.

A network may be composed of multiple convolution layers, one after the other. Typically, as we go up, we expect the features to get more abstract (and larger), and the image size in parallel to get smaller. The resolution decreases, and we get some small invariance to translations. To downsample, instead of using the value at every location, we can skip over some. For example, with a stride of 2, we take the value at every other pixel (in both dimensions).

Figure 12.5 The 3 x 3 convolutional layer applies the convolution at all points. The pooling layer pools the values in a 2 x 2 patch, and a stride of 2 lowers the resolution by half.

Typically, before striding we use pooling, where for a small patch we take the maximum or average of the values. For example, let us say the original image is 100 x 100, and we apply a 3 x 3 kernel at all positions. Then we have a pooling of size 2 x 2, where of those 4 values we choose the maximum, and then we can use a stride of 2 because we will have the same values nearby. We thus end up with a 50 x 50 feature map. We typically talk about a pooling layer, where a pooling and a striding are done, but note that this is a layer where the operation is fixed and there are no learned parameters (see figure 12.5).
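Max pooling with a stride can be sketched the same way; with size = stride = 2 this halves the resolution, as in figure 12.5.

import numpy as np

def max_pool(fmap, size=2, stride=2):
    H, W = fmap.shape
    out_h = (H - size) // stride + 1
    out_w = (W - size) // stride + 1
    out = np.empty((out_h, out_w))
    for i in range(out_h):
        for j in range(out_w):
            r, c = i * stride, j * stride
            out[i, j] = fmap[r:r + size, c:c + size].max()
    return out

fmap = np.arange(16.0).reshape(4, 4)
print(max_pool(fmap))        # [[ 5.  7.] [13. 15.]]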
In image recognition, typically in successive convolutional layers, the image size gets smaller but the number of channels increases. The resolution decreases, but we allow for a larger set of abstract features. The total number of units decreases, because the image size, which enters squared, is decreasing faster than the number of channels is increasing. Once we have a small enough image with many channels, and when we believe we have a compressed representation of the image, we move to a fully connected layer.

Let us say that, in the final convolutional layer, we have a 10 x 10 image in 12 channels, and we have 512 hidden units in the next fully connected layer. Each of those 512 hidden units is fed with all the 1,200 units in the previous layer; it is as if the 12 x 10 x 10 is "flattened" before it is fed to the fully connected layer.

A fully connected layer does not encode the position information explicitly, and it detects abstract features over the whole image. For example, a very early convolution unit may correspond to an edge of a particular orientation at some specific location in the image, a later convolution unit may detect a circle in the top left of the image, and a fully connected layer unit may be turned on if there is a face anywhere in the image.

Typically we have one or more fully connected layers, and then we have the output layer. Roughly speaking, we can consider the convolution layer(s) as an analyzer that breaks down the image, generating an intermediate representation, and the fully connected layer(s) as a synthesizer that generates the output from this intermediate representation. In section 11.10 we discussed the autoencoder model, and we can consider what we have here as an extension of that model, where the analyzer corresponds to the encoder and the synthesizer corresponds to the decoder, except that the output can be any other type of information and not necessarily the input itself.

Given the whole network, the weights are all trained together in a supervised manner. For weights that are shared, as we discussed above, we calculate the gradients as usual, and then we average them and make the update on all copies. Note that in such a network most of the computation is done in the convolutional layers, but most of the free parameters are in the fully connected layers, so our efforts at regularization (using, e.g., weight decay or dropout) should focus on those fully connected layers.

12.5.3 Examples: LeNet-5 and AlexNet

Figure 12.6 LeNet-5 network for recognition of handwritten digits (LeCun et al. 1998).

Let us examine an example of a convolutional network. LeNet-5, the earliest success story of convolutional and deep networks, was a commercial product used to read handwritten checks (LeCun et al. 1998).

The network architecture is given in figure 12.6. The image is 32 x 32 and is already padded. The first convolution layer uses 5 x 5 kernels (and biases), and there are six channels. There are 122,304 connections but only 156 parameters. Then there is a 2 x 2 average pooling layer with a stride of 2. LeNet-5 actually has trainable parameters in its pooling layer: We sum the four values, multiply the sum with a weight, add a bias, and then pass it through the tanh activation function. This layer has 12 parameters. The next convolution layer also uses 5 x 5 convolutions, followed by 2 x 2 pooling, in 16 channels. Those add another 1,516 + 32 parameters. The 400-dimensional output is then fed to two fully connected layers of 120 and 84 units each, and finally we have 10 output units. These contain 48,120 and 10,164 parameters, respectively. LeNet-5 uses Euclidean radial basis function units at the output, which we will discuss in chapter 13, but this is an older approach, and using softmax at the output has since become the norm for classification.
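Some of the parameter counts quoted above are easy to verify; the arithmetic below spells them out. (The 1,516 parameters of the second convolution layer depend on a partial connection scheme between channels that the text does not detail, so that count is not reproduced here.)

c1 = 6 * (5 * 5 + 1)      # first conv layer: six 5 x 5 kernels + biases = 156
s2 = 6 * 2                # pooling: one weight and one bias per channel = 12
f5 = 120 * (400 + 1)      # first fully connected layer = 48,120
f6 = 84 * (120 + 1)       # second fully connected layer = 10,164
print(c1, s2, f5, f6)     # 156 12 48120 10164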
ers were the limits for 11s day. Inc1dentally,
60,000 examp 1es. Th ose numb · . . ct d
h MNIST (Modified Nanonal lnsntute of Stan ar s
the data used there i_s t e ·ct ed in man)' deep neural
d et that has been \\1 eIYus
and Techno logy ) ata s , d , with test error less than half
network architectures since th en. No\\ a a)S
12 D
eep Lea,-, I
V J 2.5 Convo/utiona/ Layers
1119
339
338
'd d an easy "toy" data set.
ST is cons1 ere
percent, MN! t demonstrator of the deep convolutionaJ ne
The next unportan hevskY sutskever, and Hinton 2012) Whi ~ vork
exNet (Knz ' ' c,, i
was the Al . ed on a more difficult problem. AlexNe t has five s <1
larger network trai::r of which are followed by pooling layers anct con.
volutional layers, t ee making a total of 11 layers. It has appro . three
fully connected Jayerss, uses ReLU as the activation function Xl!ndr ately
li n parameter , , op 0
60 mil °. . ftmax at the output, and training is paraJleli Ut
for r egularizanon, so . . . . zect on
N t data set over which 1t 1s tramed contains 1 2 l'hn,,
GPUs. The Image e
f hich is 256 x 256, and there are 1,000 classes.
· -.,,,,_,,o
n
images each o w
was 3 7.5 percent, and the error reduced to 1 7 Figure 12. 7 Denconvolution also use .
The error ra te . ·0 Per- stead of downsampling using str'd s convolutwns. The difference is that, in-
'f the correct class was allowed to be m one of the top five ( t es to decrease · ·
cent 1 . bl b th out increase image size by expandin d fi . . unage stze. we upsample to
of l,OOO). These values were cons1dera Y e tter an the state of the g an lltng Ln the missing values by zero.
dl·t was the success of AlexNet that demons trated that d
mtman h . .
convolutional networks actually scale up; tha t 1s, With deep er and larger
networks and more data, problems that are much more difficult than ginningI of this chap. ter. Let us say the input is 1,000 di mens1.on al . For
handwritten recognition can be learned end to end. examp e, we s t art with an au toencoder that has 1,000 inputs, 250 hidden
uruts , and 1,000 outpu ts. Once that is trained, its encoder can be used
to m ap d a ta to 250 dimensions, and Wi th those another autoencoder can
12.5.4 Extensions be trained with 250 inputs, 100 hidden uni ts, and 250 outputs. Now if
A deep convolutional network has many hyperparame ters tha t need to we pu t those two encoders back to back, we have a two-layer network
be adjusted for good performance. We have to d ecide how many convo- w:tth 1,000 mputs, and then 250 hidden units in its firs t layer and J OO
Jutional and full layers to use, and for each convolutional layer we need hidd en units in the seco nd layer. Both trainings involve only one hid-
to decide on the filter size, stride, and number of channels. We need to d en layer, so they are fast, and training can be done With unlabeled da ta.
decide where and how we do pooling. For fully connected layer s, we need STACKED This a pproach of stacked autoencoders was popular in the early days of
to decide on the number of hidden units. Adjusting all these hyperpa- AUTOENCOOERS d eep learning when GPUs and labeled data were still scarce. Nowadays,
rameters takes many trial runs. research ers prefer to train end to end using labeled data.
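A minimal sketch of this greedy scheme follows, with small illustrative sizes standing in for 1,000-250-100: a one-hidden-layer autoencoder (tanh encoder, linear decoder) is trained by plain batch gradient descent, its encoder is applied to the data, and a second autoencoder is trained on the codes.

import numpy as np

rng = np.random.default_rng(0)

def train_autoencoder(X, n_hidden, eta=0.01, epochs=100):
    n_in = X.shape[1]
    W = rng.uniform(-0.1, 0.1, (n_in, n_hidden))   # encoder weights
    b = np.zeros(n_hidden)
    V = rng.uniform(-0.1, 0.1, (n_hidden, n_in))   # decoder weights
    c = np.zeros(n_in)
    for _ in range(epochs):
        Z = np.tanh(X @ W + b)                     # encode
        E = (Z @ V + c) - X                        # reconstruction error
        dV, dc = Z.T @ E, E.sum(axis=0)
        dZ = (E @ V.T) * (1 - Z ** 2)              # backprop through tanh
        dW, db = X.T @ dZ, dZ.sum(axis=0)
        for P, G in ((W, dW), (b, db), (V, dV), (c, dc)):
            P -= eta * G / len(X)
    return W, b

X = rng.normal(size=(200, 20))              # stand-in for the raw data
W1, b1 = train_autoencoder(X, n_hidden=10)  # first autoencoder
Z1 = np.tanh(X @ W1 + b1)                   # its encoded representation
W2, b2 = train_autoencoder(Z1, n_hidden=5)  # second, stacked autoencoder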
TRANSFER LEARNING One possibility is transfer learning. Le t u s say we have another network DEEP AUTOENcoorns It is also possib le to train deep autoencoders with multiple hidd en lay-
that has already been trained on the same or some other data for a similar er s . If we have data where there is locality, such as images, early layers
task. We can then, for example, just copy its early convolutional layers are convo lu tional, followed by one or more fully connec ted layers until
and train only the la ter layers with our data on our task. ln such a case, we ge t to the intermediate code. ln such networks, the architecture of the
those early layers that we copy will act as a preprocessor, m apping our d ecoder is taken as the inverse of the encoder; we have one or mo re fully
original data to a representation that we believe will be good, and what we co nnected laye rs, and then we have deconvolution or transposed convo-
actually learn will be a smaller network tha t can b e trained on a smaller /utional layers, where we extract features in the inverse direc tion using
data set starting from that representation. again convolutions. lnstead of striding, where we skip over elements, we
One possible network that can be used is the autoencoder, which is a neural network trained to reconstruct its input at the output. The autoencoder network we discussed in section 11.10 has one hidden layer, and one possibility is to use that repeatedly, as we discussed at the beginning of this chapter. Let us say the input is 1,000 dimensional. For example, we start with an autoencoder that has 1,000 inputs, 250 hidden units, and 1,000 outputs. Once that is trained, its encoder can be used to map data to 250 dimensions, and with those another autoencoder can be trained with 250 inputs, 100 hidden units, and 250 outputs. Now if we put those two encoders back to back, we have a two-layer network with 1,000 inputs, and then 250 hidden units in its first layer and 100 hidden units in the second layer. Both trainings involve only one hidden layer, so they are fast, and training can be done with unlabeled data. STACKED AUTOENCODERS This approach of stacked autoencoders was popular in the early days of deep learning when GPUs and labeled data were still scarce. Nowadays, researchers prefer to train end to end using labeled data.
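The following is a minimal PyTorch sketch of this greedy, layer-by-layer scheme with the 1,000-250-100 sizes used above; the full-batch loop and the random data are placeholders for whatever unlabeled data is available:

    import torch
    import torch.nn as nn

    def train_autoencoder(data, n_in, n_hidden, epochs=200):
        # a one-hidden-layer autoencoder trained to reconstruct its input
        enc = nn.Sequential(nn.Linear(n_in, n_hidden), nn.Sigmoid())
        dec = nn.Linear(n_hidden, n_in)
        opt = torch.optim.Adam(list(enc.parameters()) + list(dec.parameters()))
        for _ in range(epochs):                   # full-batch for brevity
            opt.zero_grad()
            loss = nn.functional.mse_loss(dec(enc(data)), data)
            loss.backward()
            opt.step()
        return enc

    x = torch.randn(500, 1000)              # unlabeled 1,000-dimensional data
    enc1 = train_autoencoder(x, 1000, 250)  # first autoencoder: 1000 -> 250
    h = enc1(x).detach()
    enc2 = train_autoencoder(h, 250, 100)   # second autoencoder: 250 -> 100
    stacked = nn.Sequential(enc1, enc2)     # the two encoders back to back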
DEEP AUTOENCODERS It is also possible to train deep autoencoders with multiple hidden layers. If we have data where there is locality, such as images, early layers are convolutional, followed by one or more fully connected layers until we get to the intermediate code. In such networks, the architecture of the decoder is taken as the inverse of the encoder; we have one or more fully connected layers, and then we have deconvolution or transposed convolutional layers, where we extract features in the inverse direction using again convolutions. Instead of striding, where we skip over elements, we upsample by including new elements filled in with zero to increase the image size (see figure 12.7). We also decrease the number of channels as we go up in the decoder until we have the resolution and the number of channels of the input to the encoder.

Figure 12.7 Transposed convolutions. The difference is that, instead of downsampling using strides to decrease image size, we upsample to increase image size by expanding and filling in the missing values by zero.
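One possible (assumed) PyTorch layout of such a convolutional autoencoder, here for 28 × 28 single-channel images; the decoder mirrors the encoder, upsampling with transposed convolutions and decreasing the channels back to those of the input:

    import torch.nn as nn

    # encoder: strided convolutions halve the image size at each step
    encoder = nn.Sequential(
        nn.Conv2d(1, 16, kernel_size=3, stride=2, padding=1),   # 28x28 -> 14x14
        nn.ReLU(),
        nn.Conv2d(16, 32, kernel_size=3, stride=2, padding=1),  # 14x14 -> 7x7
        nn.ReLU())
    # decoder: transposed convolutions upsample, filling in zeros, while
    # the number of channels decreases back to that of the input
    decoder = nn.Sequential(
        nn.ConvTranspose2d(32, 16, kernel_size=3, stride=2,
                           padding=1, output_padding=1),        # 7x7 -> 14x14
        nn.ReLU(),
        nn.ConvTranspose2d(16, 1, kernel_size=3, stride=2,
                           padding=1, output_padding=1))        # 14x14 -> 28x28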
12.5.5 Multimodal Deep Networks

Previously we said that if we have multiple channels as input, the channels in the next layer take a sum of them using the same convolution matrix. Of course, this is not a strict requirement; one can always learn separate weights for the different channels. This defines a more flexible model but needs more data to train, and we need to have a good reason to expect those features to be different.

CHANNEL CONCATENATION In the inception network (Szegedy et al. 2014), another idea called channel concatenation was used. Let us say there is a single channel and from that we generate multiple channels; for example, we may have three where one uses a 3 × 3 filter, another 5 × 5, and another 7 × 7. Instead of summing up the results after the convolutions, we can just concatenate them as multiple channels. For example, if the input is 100 × 100 and we have five filters for each of the three, then the output is 15 × 100 × 100; we just create another channel instead of summing.
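A minimal PyTorch sketch of this channel concatenation, reproducing the 100 × 100 example with five filters for each of the three filter sizes:

    import torch
    import torch.nn as nn

    class ChannelConcat(nn.Module):
        # three parallel convolutions with different filter sizes; instead
        # of summing their results, we concatenate them as separate channels
        def __init__(self, in_channels, n_filters=5):
            super().__init__()
            # the padding keeps the spatial size unchanged
            self.c3 = nn.Conv2d(in_channels, n_filters, 3, padding=1)
            self.c5 = nn.Conv2d(in_channels, n_filters, 5, padding=2)
            self.c7 = nn.Conv2d(in_channels, n_filters, 7, padding=3)

        def forward(self, x):
            return torch.cat([self.c3(x), self.c5(x), self.c7(x)], dim=1)

    x = torch.randn(1, 1, 100, 100)    # a single 100 x 100 channel
    y = ChannelConcat(1)(x)            # y.shape is (1, 15, 100, 100)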
It is also possible to first process the channels for a number of layers separately and then fuse them at a more abstract level. This can be used when channels are fed with inputs in different modalities. For example, in speech recognition, in addition to the acoustic speech signal we may also have a video of the movement of the mouth. We will talk in more detail about methods for combining multiple sources in chapter 18, but it should be apparent that it is not always a good idea to concatenate very different pieces of data and feed them to a single network. At the lowest level, the features of the two modalities are most probably not correlated, so it does not make sense to have hidden units that are fed with both. What we can do is have separate columns for different modalities extracting increasingly abstract features, and then at some level those may feed to a shared hidden layer; then maybe after some more layers we get to the outputs. Both columns may have one or more convolutional layers and one or more fully connected layers before feeding to the shared layer. MULTIMODAL DEEP NETWORK This is called a multimodal deep network (Ngiam et al. 2011).
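A rough PyTorch sketch of such a two-column network; the column depths and layer sizes are arbitrary choices for illustration, not prescribed by the text:

    import torch
    import torch.nn as nn

    class MultimodalNet(nn.Module):
        # separate columns extract increasingly abstract features from each
        # modality; only then are they fused in a shared hidden layer
        def __init__(self, d_audio, d_video, n_hidden, n_out):
            super().__init__()
            self.audio_col = nn.Sequential(nn.Linear(d_audio, 64), nn.ReLU(),
                                           nn.Linear(64, 32), nn.ReLU())
            self.video_col = nn.Sequential(nn.Linear(d_video, 64), nn.ReLU(),
                                           nn.Linear(64, 32), nn.ReLU())
            self.shared = nn.Sequential(nn.Linear(64, n_hidden), nn.ReLU(),
                                        nn.Linear(n_hidden, n_out))

        def forward(self, audio, video):
            z = torch.cat([self.audio_col(audio), self.video_col(video)], dim=1)
            return self.shared(z)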
12.6 Tuning the Network Structure

12.6.1 Structure and Hyperparameter Search

Since the very early days of neural networks, researchers have investigated whether the ideal network structure can be learned, while also learning the parameters. The idea is to adapt the network structure, for example, the number of hidden layers and units, during training incrementally, without needing a manual finetuning of such hyperparameters. One may view weight decay (see section 12.4.2) from such a perspective as a destructive approach where we start with a large network and prune unnecessary weights. Another possibility is the constructive approach where we start from a small network and add units and associated connections if needed (figure 12.8).

Figure 12.8 Two examples of constructive algorithms. Dynamic node creation adds a unit to an existing layer. Cascade correlation adds each unit as a new hidden layer connected to all the previous layers. Dashed lines denote the newly added unit/connections. Bias units/weights are omitted for clarity.

DYNAMIC NODE CREATION In dynamic node creation (Ash 1989), an MLP with one hidden layer with one hidden unit is trained, and after convergence, if the error is still high, another hidden unit is added. The incoming weights of the newly added unit and its outgoing weight are initialized randomly and trained with the previously existing weights, which are not reinitialized but continue from their previous values.

CASCADE CORRELATION In cascade correlation (Fahlman and Lebiere 1990), each added unit is a new hidden unit in another hidden layer. Every hidden layer has only one unit that is connected to all of the hidden units preceding it and the inputs. The previous weights are frozen and are not trained; only the incoming and outgoing weights of the newly added unit are trained.
Dynamic node creation adds a new hidden unit to an existing hidden layer and never adds another hidden layer. Cascade correlation always adds a new hidden layer with a single unit. The ideal constructive method should be able to decide when to introduce a new hidden layer and when to add a unit to an existing layer. This is an open research problem.
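As an illustration of the constructive idea, a small sketch in Python (torch tensors) of the growth step in dynamic node creation; the surrounding training loop and convergence test are omitted, and the 0.1 scale for the new random weights is an arbitrary choice:

    import torch

    # the growth step of dynamic node creation: existing weights continue
    # from their previous values; only the new unit's incoming weights (a
    # new row of w1) and outgoing weight (a new column of w2) start random
    def add_hidden_unit(w1, b1, w2):
        # w1: (H, d) input-to-hidden, b1: (H,), w2: (1, H) hidden-to-output
        w1 = torch.cat([w1, 0.1 * torch.randn(1, w1.shape[1])])
        b1 = torch.cat([b1, torch.zeros(1)])
        w2 = torch.cat([w2, 0.1 * torch.randn(w2.shape[0], 1)], dim=1)
        return w1, b1, w2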
Incremental algorithms are interesting because they correspond to modifying not only the parameters but also the structure during learning. Expanding this idea, we can think of a space defined not only by such structural hyperparameters but also by others that affect learning, such as the learning rate or the momentum factor, and we want to find the best configuration in this space. AUTOMATED MACHINE LEARNING The idea behind automated machine learning (AutoML) is to automate not only the optimization of parameters, as is typically done, but all steps in building a machine learning system, including the optimization of hyperparameters.

There are various approaches. One is population-based training (Jaderberg et al. 2017), where there are a number of models with different hyperparameters trained all together, with information transfer between models allowing an exploration of the vicinity of promising models, just like in genetic algorithms. NETWORK ARCHITECTURE SEARCH Another interesting work is network architecture search (Zoph and Le 2016), where the model description is generated as a sequence of actions and the best sequence is learned using reinforcement learning (see chapter 19).
12.6.2 Skip Connections

We saw above that, in cascade correlation, connections skip layers, which we can also have with other networks. For example, a hidden unit may be connected not only to all the units in its preceding layer, as usual, but also to units in a layer much earlier. Consider a unit h at layer l (see figure 12.9(a)):

(12.24) z_{lh} = f( Σ_i w_{l,h,i} z_{l-1,i} + Σ_j v_{l,h,j} z_{l-L,j} )

SKIP CONNECTIONS where w_{l,h,i} are the weights from all units i of layer l − 1 to unit h of layer l (this is what we normally have between successive layers) and v_{l,h,j} are the skip connections from all units j in layer l − L.

Figure 12.9 (a) The skip connections to layer l; (b) g_l are the gating units that mediate the effect of the skip connections. Both allow shorter paths to the output.

We can view information coming from two paths; the usual path uses all the layers between 1 and l; the second is a shorter path because it bypasses all the layers between L and l and, as such, can be regarded as a simpler model. A similar approach underlies residual networks (He et al. 2015); we can say that the shorter path is the main prediction and the longer path (with extra computation) provides any necessary residual.

The advantage of such skip connections is also apparent in training. We saw in section 12.2 that when we backpropagate, to calculate the gradient at any layer, we multiply the error by all the weights on the path after that layer. When we have deep networks with many layers, if those weights are greater than 1, the gradient may become very large after many multiplications; if they are smaller than 1, the gradient may become very small. VANISHING/EXPLODING GRADIENTS This is the problem of vanishing/exploding gradients. When we have skip connections, we effectively define shorter paths to the output, and this helps propagate the error back to earlier layers without gradients exploding or vanishing. Using such residual connections, He et al. (2015) were able to train networks with 151 layers.
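A minimal PyTorch sketch of a residual block in this spirit (the layer sizes and the use of two convolutions are illustrative choices):

    import torch.nn as nn

    class ResidualBlock(nn.Module):
        # the skip connection is the shorter path; the two convolutions on
        # the longer path learn the residual that is added to it
        def __init__(self, channels):
            super().__init__()
            self.conv1 = nn.Conv2d(channels, channels, 3, padding=1)
            self.conv2 = nn.Conv2d(channels, channels, 3, padding=1)
            self.relu = nn.ReLU()

        def forward(self, x):
            r = self.conv2(self.relu(self.conv1(x)))   # longer path: residual
            return self.relu(x + r)                    # shorter path: skip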
12.6.3 Gating Units

Equation 12.24 takes a direct sum of the two paths. An alternative is to take a weighted average and also learn those weights (figure 12.9(b)):

(12.25) z_{lh} = f( g_{lh} Σ_i w_{l,h,i} z_{l-1,i} + (1 − g_{lh}) Σ_j v_{l,h,j} z_{l-L,j} )

where g_{lh} are also learned from data during backpropagation (gate values are underlined to tell them apart from activations). In the simplest HIGHWAY NETWORKS case, such gates are constants; in highway networks (Srivastava, Greff, and Schmidhuber 2015), they depend on the z_{l-L} values in the layer where the two paths split:
(12.26) g_{lh} = sigmoid( a_{lh}^T z_{l-L} + a_{lh0} )

The parameters of the gating unit, a_{lh}, a_{lh0}, are also learned using gradient descent, and the sigmoid guarantees that g_{lh} is between 0 and 1.
ADAPTIVE DROPOUT A similar method is adaptive dropout (Ba and Frey 2013), where whether a hidden unit is dropped out or not depends on a parameter, and that parameter is also learned. It seems more suitable to call it a gating parameter rather than a dropout parameter, because it is learned from data and is not a parameter of an external stochastic process.
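Gating of this kind is easy to express in code. A sketch of equations 12.25 and 12.26 as a highway-style layer in PyTorch, written for the simplified case where the skip path carries z unchanged:

    import torch
    import torch.nn as nn

    class HighwayLayer(nn.Module):
        # the gate g, a sigmoid of the input as in equation 12.26, takes a
        # weighted average of the transformed and the untransformed paths
        # as in equation 12.25 (the skip path here carries z unchanged)
        def __init__(self, d):
            super().__init__()
            self.transform = nn.Linear(d, d)   # the usual weighted path
            self.gate = nn.Linear(d, d)        # a_lh and a_lh0 of eq. 12.26

        def forward(self, z):
            g = torch.sigmoid(self.gate(z))
            return g * torch.tanh(self.transform(z)) + (1 - g) * z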

12.7 Learning Sequences


12.7.1 Example Tasks

Until now, we have been concerned with cases where the input is fed once, all together. In some applications, the input is a sequence, for example, in time. In others, the output may be sequential. Examples are as follows:

• Sequence recognition. This is the assignment of a given sequence to one of several classes. Speech recognition is one example where the input sequence is the spoken speech and the output is the code of the word spoken. That is, the input changes in time but the output comes once at the end.
• Sequence reproduction. Here, after seeing part of a given sequence, the system should predict the rest. Time-series prediction is one example where the input is, for example, the sequence of daily temperatures during the past week and the output is the prediction for the next day.
• Sequence association. This is the most general case where a particular output sequence is given as output after a specific input sequence. The input and output sequences may be different in form and length. An example is translation: The input is a sentence, which is a sequence of characters, in the source language, and the output is its translation in the target language, which is another sequence of characters.

Figure 12.10 A time-delay neural network. Inputs in a time window of length T are delayed in time until we can feed all T inputs as the input vector to the MLP.

12.7.2 Time-Delay Neural Networks

The easiest way to recognize a temporal sequence is by converting it to a spatial sequence. Then any method discussed up to this point can be TIME-DELAY NEURAL NETWORK used for classification. In a time-delay neural network (Waibel et al. 1989), previous inputs are delayed in time so as to synchronize with the final input, and all are fed together as input to the system (see figure 12.10). Backpropagation can then be used to train the weights. To extract features local in time, one can have convolutional units and weight sharing in the early layers. The main restriction of this architecture is that all sequences should be of the same length.
the target language, which is another sequence of characters. such a ne twork can be use d to imp
12.7.3 Recurrent Networks

RECURRENT NETWORK In a recurrent network, in addition to the feedforward connections, units have self-connections or connections to units in the previous layers. This recurrency acts as a short-term memory and lets the network remember what happened in the past.

Most frequently, one uses a partially recurrent network where a limited number of recurrent connections are added to an MLP (see figure 12.11). This combines the advantage of the nonlinear approximation ability of the MLP with the temporal representation ability of the recurrency, and such a network can be used to implement any of the three temporal association tasks. It is also possible to have hidden units in the recurrent backward connections, which are known as context units.

Figure 12.11 Examples of MLP with partial recurrency. Recurrent connections are shown with dashed lines: (a) self-connections in the hidden layer, (b) self-connections in the output layer, and (c) connections from the output to the hidden layer. Combinations of these are also possible.
Let us consider the operation of the recurrent network shown in figure 12.12(a):

(12.27) z_h^t = f( Σ_j w_{hj} x_j^t + Σ_l r_{hl} z_l^{t-1} )

(12.28) y^t = g( Σ_h v_h z_h^t )

We see that the activation of hidden units at time t also depends on the activations at time t − 1 through the recurrent weights r_{hl}. The z_l^0 are taken as zero. The activation function f at the hidden units is generally taken as sigmoid or tanh, and the output activation g depends on the application.

In equation 12.27, inside f the first part is the contribution of the current input, and the second part is what is remembered from the past. z^{t-1} contains a compressed representation of all the inputs seen until t, and its effect is combined with that of the current input at time t. The values of w_{hj} and r_{hl} determine how much weight we give to these two sources.
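A NumPy sketch of equations 12.27 and 12.28, with illustrative sizes and random weights, run over a short random sequence:

    import numpy as np

    d, H = 3, 5                       # illustrative sizes
    rng = np.random.default_rng(0)
    W = rng.normal(size=(H, d))       # w_hj: input-to-hidden weights
    R = rng.normal(size=(H, H))       # r_hl: recurrent weights
    v = rng.normal(size=H)            # v_h: hidden-to-output weights

    z = np.zeros(H)                   # z^0 is taken as zero
    for x_t in rng.normal(size=(10, d)):     # a sequence of ten inputs
        z = np.tanh(W @ x_t + R @ z)         # equation 12.27
        y = 1 / (1 + np.exp(-(v @ z)))       # equation 12.28, sigmoid g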
The training of a recurrent neural network is also done using backpropagation. If we have required output given at time t, we can calculate the loss at time t and use gradient descent to update v_h, w_{hj}, and r_{hl}. UNFOLDING IN TIME If the required output comes at the end of the sequence, we unfold in time. That is, we write the equivalent feedforward network by creating a separate copy of each unit and connection at different times (see figure 12.12(b)). The resulting network can be trained as usual, with the additional requirement that all copies of a connection should remain identical. The solution, as in weight sharing, is to sum up the different weight changes BACKPROPAGATION THROUGH TIME in time and change the weight by the average. This is called backpropagation through time (Rumelhart, Hinton, and Williams 1986).

Figure 12.12 Backpropagation through time: (a) recurrent network and (b) its equivalent unfolded network that behaves identically in four steps.

An important problem in training recurrent neural networks is vanishing gradients; the unfolded network is deep, and as we can see in figure 12.12(b), backpropagating requires multiplying r_{hl} weights many times. One way to combat this is by using gating units, as we describe next.
12.7.4 Long Short-Term Memory Unit

LONG SHORT-TERM MEMORY The long short-term memory (LSTM) unit (Hochreiter and Schmidhuber 1997) is a more complex form of the recurrent unit. It does not just take a simple sum of the effects of the input and the past, but uses a set of gating units that can turn these effects on or off, looking at the input. These gates have their own parameters that are also trained during backpropagation.

The augmented input to an LSTM unit, x', is the concatenation of the current input x^t and the past hidden unit activations z^{t-1}. All decisions use this augmented input; that is, they are based on what is seen now and what is remembered from the past.

In LSTM unit h, first, the effect of the input is calculated as

(12.29) c_h = tanh( w_c^T x' )

The cumulative internal state of unit h is a weighted sum of the past state and current input:

(12.30) s_h^t = f_h · s_h^{t-1} + i_h · c_h

where f_h and i_h are the forget and input gates, respectively (gate values are underlined to tell them apart from activations):

f_h = sigmoid( w_f^T x' )
i_h = sigmoid( w_i^T x' )

The overall value is calculated as

(12.31) z_h^t = o_h · tanh( s_h^t )

with the output gate defined as

(12.32) o_h = sigmoid( w_o^T x' )

The input activation c_h and all three gates see the extended input and have their own weight vectors that are trained from data. One variant of LSTM adds peephole connections; this means adding the vector of s_h to the augmented input x' of the gating units (Gers, Schraudolph, and Schmidhuber 2002).

The gates allow each unit to selectively decide how much to take the current input or the past into account, and how much to give out as output. They help with learning long-term dependencies. In equation 12.30, if i_h is close to 0 and f_h close to 1, the state of the hidden unit is copied to the next state; this allows remembering inputs seen in the past without changing them. Operationally, this will be just like skipping layers, bypassing unnecessary additional complexity. It is not surprising that highway networks (equation 12.26) and LSTM were proposed by the same research group.

Let us say we have an application where we process sentences one word at a time. Compare the following two sentences:

"The man who entered the house was in his thirties."
"The woman who entered the house was in her thirties."

Whether the correct pronoun is "his" or "her" depends on whether the subject is "man" or "woman", and that word appears a number of words before. To be able to generate such a sentence, for example, or to be able to generate a translation of it, at the time of the pronoun the network should remember the subject; some part of the hidden representation should be set when the subject is set, and it should remain intact until we get to the pronoun. This is difficult for an ordinary recurrent network, where we continually average over the hidden units, but LSTM units with their gates turning on and remaining on for some time (until maybe another event happens to turn them off) can learn such dependencies.

LSTM nowadays is considered to be the state-of-the-art model for diverse sequence learning applications from translation to caption generation. The more flexible architecture allows longer and more complex sequences to be learned, recognized, and produced.
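One step of a layer of LSTM units, written in NumPy to mirror equations 12.29 through 12.32; the weight matrices act on the augmented input x' and are assumed to be trained elsewhere:

    import numpy as np

    def sigmoid(a):
        return 1 / (1 + np.exp(-a))

    def lstm_step(x_t, z_prev, s_prev, Wc, Wf, Wi, Wo):
        xp = np.concatenate([x_t, z_prev])  # augmented input x' = [x^t, z^{t-1}]
        c = np.tanh(Wc @ xp)                # equation 12.29: input activation
        f = sigmoid(Wf @ xp)                # forget gate
        i = sigmoid(Wi @ xp)                # input gate
        s = f * s_prev + i * c              # equation 12.30: internal state
        o = sigmoid(Wo @ xp)                # equation 12.32: output gate
        return o * np.tanh(s), s            # equation 12.31: new z^t and state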
12.7.5 Gated Recurrent Unit

GATED RECURRENT UNIT The gated recurrent unit (GRU) (Cho et al. 2014) is a simpler version of LSTM with fewer gates and less computation.

The effect of the input is calculated as

(12.33) c_h = tanh( w_{cx}^T x^t + r_h · w_{cz}^T z^{t-1} )

Note that this differs from LSTM's equation 12.29 in that w_c is split into w_{cx} and w_{cz}, and there is a reset gate r_h and an input gate i_h:

(12.34) r_h = sigmoid( w_r^T x' ) and i_h = sigmoid( w_i^T x' )

If r_h is 0, the past is forgotten, hence the name reset gate. The new hidden unit value is calculated as the weighted average of the past and the present:

(12.35) z_h^t = (1 − i_h) · z_h^{t-1} + i_h · c_h

Note that, unlike LSTM's equation 12.30, there is not a separate forget gate; it is taken as 1 minus the input gate. We also see that, unlike LSTM, there is no intermediate s_h and no output gate; whatever value is calculated is the value of hidden unit h. As a simpler version of LSTM, GRU finds successful applications in many sequence learning tasks (Chung et al. 2014).

The basic recurrent unit takes a simple weighted average of the current input and the past hidden activations. LSTM replaces it with a more complex model; actually, we can consider an LSTM as a small specific MLP structure with internal values and gates, just like hidden units, that have parameters that are trained; GRU corresponds to a simpler MLP.
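The corresponding GRU step in NumPy, mirroring equations 12.33 through 12.35:

    import numpy as np

    def sigmoid(a):
        return 1 / (1 + np.exp(-a))

    # one step of a layer of GRU units; note that there is no internal
    # state s and no output gate
    def gru_step(x_t, z_prev, Wcx, Wcz, Wr, Wi):
        xp = np.concatenate([x_t, z_prev])           # augmented input x'
        r = sigmoid(Wr @ xp)                         # reset gate (eq. 12.34)
        i = sigmoid(Wi @ xp)                         # input gate (eq. 12.34)
        c = np.tanh(Wcx @ x_t + r * (Wcz @ z_prev))  # equation 12.33
        return (1 - i) * z_prev + i * c              # equation 12.35: new z^t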
12.8 Generative Adversarial Network

GENERATIVE ADVERSARIAL NETWORK The generative adversarial network (GAN) is not a deep learning method per se. It uses two deep neural networks, but the novelty of GAN is in the way the task is defined rather than how it is implemented.

Let us say that we have a training set of faces and that our aim is to be able to generate new faces; that is, we want to learn a generator that can synthesize new images that everyone will recognize as legitimate face images, but different from the ones we already have in the training set.

Our training instances X = {x^t} are drawn from an unknown probability distribution p(x); what we would like to do is first estimate and then sample from p(x). Of course, one can always try any of the parametric, semiparametric, or nonparametric methods for density estimation, discussed respectively in chapters 5, 7, and 8, but GAN uses a different and a very interesting approach that bypasses the estimation of p(x).

The underlying idea in GAN is somehow reminiscent of word2vec (section 11.11) in that to solve the original unsupervised problem we define an auxiliary supervised problem that helps us with the original task. A GAN is composed of two learners, a generator and a discriminator, which are implemented as two deep neural networks (Goodfellow et al. 2014).

We start with the training set of "true" examples, X = {x^t}, each of which is d-dimensional. These are valid face images. We want to learn a generator G that takes a low-dimensional z as input and gives out a "fake" example that we denote by G(z). The z are drawn from some prior multivariate distribution, typically a k-dimensional multivariate Gaussian with mean zero, unit variance on all dimensions, and zero covariances, that is, N(0, I).

G is implemented as a deep neural network that has k inputs, d outputs, and a number of hidden layers in between. If we want to generate an image, the final layers will be deconvolutional, just as we have with a deep autoencoder (see section 12.5.4); the similarity between a generator and the decoder of an autoencoder is a topic we will come back to shortly. G takes a low-dimensional distribution and learns to expand and distort it suitably to cover the target x distribution.

In addition to G, GAN also uses a discriminator D that is trained to discriminate true and fake examples. D is another deep network that takes d-dimensional x as its input and, after some hidden layers, convolutional or full, has a single sigmoid output, because it is trained on a two-class task: For all true x^t, the label is 1; for all G(z) that are generated by G (from some z), the label is 0 (see figure 12.13).

Figure 12.13 In GAN, G is the generator that generates fake examples from random z. D is the discriminator that is trained to discriminate such fake G(z) from true x of the training set.

The GAN criterion is

(12.36) Σ_t log D(x^t) + Σ_{z∼p(z)} log(1 − D(G(z)))

which is maximized by D and minimized by G. D is trained so that its output is as large as possible for a true instance and as small as possible for a fake instance. G is trained to fool D; it is trained to generate fakes for which D gives large outputs.
During training, as G gets better in generating fakes, D gets better in detecting them, which in turn forces G to generate much better fakes, and so on.

Starting from random weights for both, training of the two is alternated. For a fixed G, we update the weights of D so that the criterion of equation 12.36 is maximized. This makes D a good detector for a fixed G. Then we fix D and update the weights of G to minimize the second term, which is equivalent to maximizing

Σ_{z∼p(z)} log D(G(z))

which makes G a better faker for a fixed D.
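One alternation of this training, sketched in PyTorch; G, D, and their optimizers are assumed given, and the log terms of equation 12.36 are expressed through the binary cross-entropy loss:

    import torch
    import torch.nn as nn

    # G maps k-dimensional z to d-dimensional fakes; D outputs a value in
    # (0, 1); x is a batch of true examples
    def gan_step(G, D, opt_G, opt_D, x, k):
        bce = nn.BCELoss()
        ones, zeros = torch.ones(len(x), 1), torch.zeros(len(x), 1)
        z = torch.randn(len(x), k)                   # z drawn from N(0, I)

        # D step: for fixed G, maximize log D(x) + log(1 - D(G(z)))
        opt_D.zero_grad()
        d_loss = bce(D(x), ones) + bce(D(G(z).detach()), zeros)
        d_loss.backward()
        opt_D.step()

        # G step: for fixed D, maximize log D(G(z))
        opt_G.zero_grad()
        g_loss = bce(D(G(z)), ones)
        g_loss.backward()
        opt_G.step()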
A variant of GAN that frequently leads to better generators uses the WASSERSTEIN LOSS Wasserstein loss (Arjovsky, Chintala, and Bottou 2017):

(12.37) E[D(x^t)] − E_{z∼p(z)}[D(G(z))]

which is maximized by D and minimized by G (which is equivalent to maximizing the second term).

In this case, D is not a classifier but a regressor; it has a single linear output, which is a score of the "trueness" of its input. Wasserstein loss measures the difference between the average score of true instances and the average score of fake instances: D tries to maximize this; G tries to minimize this, by maximizing the average score of the fakes. A similar criterion is used by the loss-sensitive GAN (Qi 2017).

GAN is currently one of the hottest topics of research in machine learning, and we are seeing various applications as well as extensions. One of them is the bidirectional GAN (Donahue, Krähenbühl, and Darrell 2016) and the equivalent adversarially learned inference (Dumoulin et al. 2016), where there is an additional encoder model that is trained to output z for a given x. This new component works in the opposite direction of the generator, which works like a decoder in going from z to x.

Training a GAN is difficult because we have two deep networks to train and the task of adjusting the hyperparameters is doubled. G is trained by backpropagating through D, which adds to the difficulty. One frequent problem is mode collapse. There may be different ways of being a true instance, but the generator learns only a few of those.

An additional problem is that there is no good evaluation measure to detect the quality of a trained generator. There are frequently used criteria, which basically compare the distribution of true and fake instances, but they are not always very informative. As a result, the best way still is to generate a screen full of fakes and manually evaluate them by looking at them. This is very subjective and limits the use of GAN to applications where this is possible, such as images.

12.9 Notes

Deep learning algorithms that we discuss in this chapter currently constitute the most popular approach in machine learning. The main reason for this is they are very successful in many domains, and in important applications, such as image recognition, speech recognition, and translation, they have led to significant improvements. Many commercial products that do some form of pattern recognition today use a variant of deep learning.

Neural networks are easy to deploy. They need less manual tuning than other models, and with enough data and a large network, one can quickly achieve a good level of performance. Nowadays it is easy to find data. Computing is getting cheaper too. Another factor is the availability of open software; there are many free software packages that help in building and training deep neural networks. All these contribute to the fact that one can have a prototype ready and running in a short amount of time with little cost. The commercial interest is helpful in that it provides resources that have not been previously available; it also introduces unnecessary hype and unrealistic expectations.

The main attraction of deep neural networks is that they are trained end to end. Many applications fit this encoder-decoder structure where the encoder takes the raw input and generates some intermediate representation and the decoder takes this and generates the output. The advantage of end-to-end training is that it is enough to provide the network with example pairs of input at the very beginning and desired output at the very end, and any transformation that needs to be done in between by the encoder and the decoder, as well as the format of the intermediate representation, is automatically learned.

For example, in translation the encoder, which typically uses LSTM, takes a sentence in the source language one word at a time and generates a high-dimensional distributed "thought vector"; the decoder takes this and, using another LSTM, generates a sentence in the target language, again one word at a time.
To generate a caption for an image, the encoder is a convolutional network that analyzes the image and generates a distributed representation, and the decoding LSTM generates a sequence of words that constitute the caption. We will see other examples in chapter 19 where the decoder is an action module: In playing a board game, the encoder analyzes the board image and generates an intermediate code; the action module takes this and decides on the best move.
. ' all the success stories, there are a number of problems too 0 c.w,' = T) ( r ' - y' )x: - () ,\ L TT;(w,) (w, - , m ; )
Despite . th h . · ne I Sj
. . .. m with neural networks 1s at t ey are not mterpret bi
maJor cnnos h a e where
. way of validating whether t ey learne d the correct t k,
so there 1s n 0 . . . as ·.
. ·gnm·cant research on the topic, hidden uruts remain hidd 7TJ( w ,) = C<; PJ(w,)
Despite s1 . . . · en L1cc, p i( w, l
d when we look at some hidden urut Ill some !Ilte rrnedia te layer of '
an twork we have no idea what it has learned, or if it is really O a is the pos terio r probabilit, that w belo t
d ~m ' . . -~ u pd ated to both d h ' ngs O component j. The weigh t is
essary. Supporting this is the existence of adversana/ _examples (Good- f h ecrease t e cross-entropy and move it c.loser to the mean
ADVERSARIAL 0 t e neare St Gauss ian. Using such a sc.heme, we can al so update the mLxture
fellow, Shlens, and Szegedy 2014): Small perturbat10ns Ill the input may
EXAMPLES pa rameters, fo r example
sometimes cause large changes in the output. This is an indicator that the
way such networks generalize from training examples is not completely C.mJ = T] t\ L TT;(w,) ( w, -,m, l
understood, and it may jus t be that some of thes e deep ne tworks work Sj

simply as gigantic lookup tables. The ) is c.lose to I if it is high!) probable that w , comes from componen t
77J ( w,

These observations indicate that we need to b e very careful before us- J: m _such _a case, m 1 is updat ed to be c.loser to the weight 11·, it represe nts.
ing such networks in critical applications. They als o show tha t the prob- Th.is 1s an iterative clustering procedure, and "e wt.II discuss such methods in
lem of machine learning is far from being s olved and d eep learning is an more detail in cha pter 13 (see e.g., equation 13.5).
important but yet another step toward this aim. 6. In some s tudies, different dropout rates are used fo r mput and hid den unit s.
Discuss when th.is "ill be app ropriate.
7. In designing a convolu tional layer for an image recognition da ta, what are the
12.10 Exercises factors tha t affect the filter size. number of filters. and stride?
L Derive the update equations for an MLP that uses ReLU in its hidden units. 8. Propose a one-dimensional convolution la\ er for sentence classification where
Assume one hidden layer of H units and one output trained for regression. the basic inputs are characters.
2. Some researchers say that in batch normalization cal culating new statistics 9. Incremental learning of the s tructure of an ~[Lp can be ,iewed as a s tate space
for each minibatch and normalizing accordingly is equivalent to adding some search. What are the operators? \I hat is the goodness function? What typ e of
noise to hidden unit outputs and that this has a r egularizing effect. Comment. search s tra tegies are appropriate? Define these in such a way that dynamic
no de cr eation and cascade co rrelation are special instantiations.
3. Propose a hint that helps in image recognition and explain how it can be
mcorporated into the learning process. 10. For the MLP giwn in figure \".12, deri\e the update equations fo r the un-
4· In weight decay, does it make sense to use different ,\ in different layers? fo lded network
1 1. Derive th e update equations for the parameters of the gating uni ts in highway
s. Derive th e update equations for soft weight sharing.
networks.
SliOUJTION: Assume a single-layer network for two-class classification for sim· 12. Compare LSTM and GRU units and propose a new unit mechanism dis tinc t
p City:
from those, justifying it b) an application.
Y' = sigmoid ( w;x/) 13. Ass uming a GAN composed of a linear G and a linear D, deriw the up d a te
equa tions fo r training them for (al ,·anilla GAN and (bl \\·asserstein GAN .

You might also like