ML Unit-5
12.1 Introduction

[...] that the input is two-dimensional. Using a convolutional network where hidden units are fed with localized two-dimensional patches in the image is an inductive bias and as such is a way to feed this information of local correlations such that correct abstractions can be learned at different levels.

In deep learning, the idea is to learn feature levels of increasing abstraction with minimum human contribution (Bengio 2009; Goodfellow, Bengio, and Courville 2016). This is because in most applications, we do not know what structure there is in the input, or we may know it only in the first few layers of basic feature extraction, and any dependencies there are should be automatically discovered during training. It is this extraction of dependencies, or patterns, or regularities, that allows abstraction and learning general descriptions.

Though deep learning is attractive, it has various potential problems. First, a deep network is much larger, with more weights and processing units, and hence needs more memory and more computation; luckily, GPUs and multicore architectures are getting cheaper every day, which helps us with that. Second, a deep network has more free parameters, and training it is more difficult. Not only do we have a more complicated error surface, but the fact that there are many hidden layers is also problematic. For example, in backpropagating the error to an early layer, because of the chain rule we multiply the derivatives in all the layers afterward, and the gradient may vanish or explode. This has led to many heuristics to finetune the basic gradient descent method, and we will discuss those in section 12.3. Finetuning the network architecture [...]

[...] sometimes work better than random initialization: learning can be done much faster and with fewer labeled data. Or, earlier feature extraction layers can be "transferred" from another network trained on larger but similar data.

Deep learning methods are attractive mainly because they need less manual interference. We do not need to craft the right features or suitable basis functions or kernels (chapter 14), or worry about optimizing the exact network architecture. Once we have data (and these days we have "big" data) and sufficient computation available, we just wait and let the learning algorithm discover by itself all that is necessary.

The idea of multiple layers of increasing abstraction that lies underneath deep learning is intuitive. Not only in vision (in handwritten digits or face images) but in many applications, we can think of layers of abstraction, and discovering such abstract representations allows a better description of the problem.

Consider machine translation. For example, starting with an English sentence, which is a sequence of characters, in multiple levels of processing and abstraction that are learned automatically from a very large corpus of English sentences to code the lexical, syntactic, and semantic rules of the English language, we would get to the most abstract representation. Now consider the same sentence in French. The levels of processing learned this time from a French corpus would be different, but if two sentences mean the same thing, at the most abstract, language-independent level, they should have very similar representations.

So we can view machine translation as basically composed of an encoder and a decoder back to back. The encoder takes an English sentence and generates a language-independent embedding of the meaning, and the decoder takes this and synthesizes the corresponding French sentence. Training is done end to end: The training data contains pairs of English and French sentences, and all intermediate processing is learned. That is, all the lexical, syntactic, and semantic constraints that are necessary for analysis and synthesis are learned from data and implemented in the many layers of the deep encoder and decoder networks. This is basically how state-of-the-art machine translation is learned and done (Wu et al. 2016).

We are familiar with this type of processing in computer science; this is, for example, how a compiler works: There is the front end that does the lexical and syntactic analyses and generates an intermediate representation, and the back end takes this as input and generates code for a particular processor. One can adapt the compiler to different languages and processors by changing only the front or back end, respectively.

We have many other examples of the same approach in machine learning as well. For example, consider the task of generating captions for images (Vinyals et al. 2015). The encoder takes the image as input, analyzes it in a deep network, and generates an embedding, which is an abstract representation of the image content; the decoder takes this and generates an English sentence as the caption.

We also see this in game playing, for example, in the Atari player (Mnih et al. 2013). The encoder takes the screen image and generates a compressed embedding, and the decoder takes this and decides on the action. Again, the whole network is trained end to end. Previously, such a system would be handled in two stages: There would be one vision module taking an image and generating an intermediate representation, and there would be an action module taking the intermediate representation and deciding the action; the two modules would typically be designed by different groups of people with different areas of expertise who would sit down together and decide on the intermediate representation.

Now with deep, end-to-end training, the two modules are trained in a coupled manner; our training data consists of pairs of input at the very beginning and the output at the very end, for example, pairs of an image and the best action, and it is the learning algorithm that decides on the intermediate representation. All we need is a large enough network and lots and lots of data to cover as many combinations as possible.

Let us now see how this is done.

12.2 How to Train Multiple Hidden Layers

12.2.1 Rectified Linear Unit

In chapter 11, we used sigmoid and tanh as nonlinear activation functions. An activation function for hidden units that has become popular recently with deep networks is the rectified linear unit (ReLU) (Nair and Hinton 2010; Glorot, Bordes, and Bengio 2011), which is defined as

(12.1)  $\mathrm{ReLU}(a) = \begin{cases} a & \text{if } a > 0 \\ 0 & \text{otherwise} \end{cases}$

Though ReLU is not differentiable at $a = 0$, we use it anyway; we use the left derivative:

(12.2)  $\mathrm{ReLU}'(a) = \begin{cases} 1 & \text{if } a > 0 \\ 0 & \text{otherwise} \end{cases}$

One advantage of ReLU is that, because it does not saturate (unlike sigmoid and tanh), updates can still be done for large positive $a$. The other advantage is that, for some inputs, some hidden unit activations will be zero, meaning that we will have a sparse representation. Sparse representations lead to faster training (where there are fewer nonzero hidden units, there is less interference between them); they are also easier to interpret. The disadvantage of ReLU is that, because the derivative is zero for $a \le 0$, there is no further training if, for a hidden unit, the weighted sum somehow becomes negative. This implies that one should be careful in initializing the weights so that the initial activation for all hidden units is positive.

In the leaky ReLU (Maas, Hannun, and Ng 2013), the output is also linear on the negative side but with a smaller slope, just enough to make sure that there will be updates for negative activations, albeit small:

(12.3)  $\text{leaky-ReLU}(a) = \begin{cases} a & \text{if } a > 0 \\ \alpha a & \text{otherwise} \end{cases}$

For example, $\alpha$ is taken as 0.01, which is the derivative for $a \le 0$.

12.2.2 Initialization

One should be careful when initializing the weights in a deep network. In chapter 11, we said that we can choose initial weight values uniformly
randomly from the interval $[-0.01, 0.01]$, and when [...] the range should be scaled down correspondingly.

The aim is always to make sure that we start with activations where the derivative is nonzero. In the case of sigmoid and tanh, the derivative is maximum at zero, so we need to initialize so that we get small values and so that we do not accidentally get a value less than $-5$ or greater than $5$. For ReLU, we need to make sure that we get a nonnegative value; for example, initially weights can similarly be very close to zero and we can have a positive bias.

12.2.3 Generalizing Backpropagation to Multiple Hidden Layers

In chapter 11, we derived the update equations for a multilayer perceptron with a single hidden layer. When we have multiple hidden layers, in each layer we have hidden units receiving input from its preceding layer, taking a weighted sum with its own modifiable weights and giving out its value after passing the sum through the nonlinear activation function.

Let us say that we have an MLP with two hidden layers, and let us say that this is a regression task with multiple outputs. Given input $\mathbf{x}^t$, we have

$z_{1h}^t = f(\mathbf{w}_{1h}^T \mathbf{x}^t) = f\left(\sum_{j=0}^{d} w_{1hj} x_j^t\right)$

$z_{2l}^t = f(\mathbf{w}_{2l}^T \mathbf{z}_1^t) = f\left(\sum_{h=0}^{H_1} w_{2lh} z_{1h}^t\right)$

$y_i^t = \mathbf{v}_i^T \mathbf{z}_2^t = \sum_{l=1}^{H_2} v_{il} z_{2l}^t + v_{i0}$

where $f$ denotes the activation function, for example, sigmoid, tanh, or ReLU; $\mathbf{w}_{1h}$ and $\mathbf{w}_{2l}$ are the first- and second-layer weights; $z_{1h}$ and $z_{2l}$ are the units on the first and second hidden layers with $H_1$ and $H_2$ hidden units; and $\mathbf{v}$ are the third-layer weights. Training such a network is similar except that, to train the first-layer weights, we need to backpropagate over one more layer.

The error function is

$E = \frac{1}{2} \sum_t \sum_{i=1}^{K} (r_i^t - y_i^t)^2$

There are $K$ outputs, and the range of $t$ depends on whether we are doing online, minibatch, or batch learning. The error at output $i$ is

$e_i^t = r_i^t - y_i^t$

and we update the weights to that output as

$\Delta v_{il} = \eta \sum_t e_i^t z_{2l}^t$

The error at hidden unit $l$ in the second hidden layer is the sum of the backpropagated errors each multiplied by the weight on its path, which indicates the total "responsibility" of the hidden unit

$e_{2l}^t = \sum_i e_i^t v_{il}$

and to update a weight to that hidden unit, we multiply this by the derivative of the activation function, and its input

$\Delta w_{2lh} = \eta \sum_t e_{2l}^t f'(z_{2l}^t) z_{1h}^t$

Similarly, the error at hidden unit $h$ in the first hidden layer is the sum of the backpropagated errors, each passed through the derivative of the unit it comes from and multiplied by the weight on its path

$e_{1h}^t = \sum_l e_{2l}^t f'(z_{2l}^t) w_{2lh}$

and again to update a weight to that hidden unit, we multiply the backpropagated error, the derivative of the activation function, and its input

$\Delta w_{1hj} = \eta \sum_t e_{1h}^t f'(z_{1h}^t) x_j^t$

The whole process involves first the propagation of information in the forward direction of calculating the hidden units, and then in the backward direction of calculating the errors.

Let us generalize. Assume we have arbitrary weight $w_{khj}$ from layer $k - 1$ to $k$, and let us see first how it is used and then how it is updated (see figure 12.1).
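The two-hidden-layer updates derived above can be checked numerically. The following is a minimal sketch under stated assumptions, not the book's code: it uses tanh for $f$, a single training instance, index 0 of every weight row as the bias weight, and hypothetical names (`forward`, `updates`, `numeric_grad`). With $\eta = 1$, each update should equal the negative of a finite-difference estimate of $\partial E / \partial w$.

```python
import math
import random

def f(a):
    # Activation function; tanh is an assumption made for this sketch.
    return math.tanh(a)

def f_prime(a):
    # Derivative of tanh, evaluated at the weighted sum a.
    return 1.0 - math.tanh(a) ** 2

def forward(x, W1, W2, V):
    # x, z1 and z2 carry a leading 1.0 that plays the role of the bias unit.
    a1 = [sum(w * xj for w, xj in zip(row, x)) for row in W1]
    z1 = [1.0] + [f(a) for a in a1]
    a2 = [sum(w * zh for w, zh in zip(row, z1)) for row in W2]
    z2 = [1.0] + [f(a) for a in a2]
    y = [sum(v * zl for v, zl in zip(row, z2)) for row in V]
    return a1, z1, a2, z2, y

def error(x, r, W1, W2, V):
    y = forward(x, W1, W2, V)[4]
    return 0.5 * sum((ri - yi) ** 2 for ri, yi in zip(r, y))

def updates(x, r, W1, W2, V, eta):
    a1, z1, a2, z2, y = forward(x, W1, W2, V)
    e = [ri - yi for ri, yi in zip(r, y)]                       # output errors
    dV = [[eta * e[i] * z2[l] for l in range(len(z2))] for i in range(len(V))]
    # Error at second-layer unit l: output errors weighted by v_il (skip bias column).
    e2 = [sum(e[i] * V[i][l + 1] for i in range(len(V))) for l in range(len(W2))]
    d2 = [e2[l] * f_prime(a2[l]) for l in range(len(W2))]
    dW2 = [[eta * d2[l] * z1[h] for h in range(len(z1))] for l in range(len(W2))]
    # Error at first-layer unit h: second-layer errors weighted by w_2lh.
    e1 = [sum(d2[l] * W2[l][h + 1] for l in range(len(W2))) for h in range(len(W1))]
    d1 = [e1[h] * f_prime(a1[h]) for h in range(len(W1))]
    dW1 = [[eta * d1[h] * x[j] for j in range(len(x))] for h in range(len(W1))]
    return dW1, dW2, dV

def rand_matrix(rows, cols):
    return [[random.uniform(-0.5, 0.5) for _ in range(cols)] for _ in range(rows)]

random.seed(0)
d, H1, H2, K = 3, 4, 3, 2
W1, W2, V = rand_matrix(H1, d + 1), rand_matrix(H2, H1 + 1), rand_matrix(K, H2 + 1)
x = [1.0] + [random.uniform(-1, 1) for _ in range(d)]
r = [random.uniform(-1, 1) for _ in range(K)]

def numeric_grad(mat, i, j, eps=1e-6):
    # Central finite difference of E with respect to mat[i][j].
    old = mat[i][j]
    mat[i][j] = old + eps
    e_plus = error(x, r, W1, W2, V)
    mat[i][j] = old - eps
    e_minus = error(x, r, W1, W2, V)
    mat[i][j] = old
    return (e_plus - e_minus) / (2 * eps)

dW1, dW2, dV = updates(x, r, W1, W2, V, eta=1.0)
# An update is -eta * dE/dw, so update + numeric gradient should be close to 0.
max_gap = max(abs(dW1[0][1] + numeric_grad(W1, 0, 1)),
              abs(dW2[1][2] + numeric_grad(W2, 1, 2)),
              abs(dV[0][3] + numeric_grad(V, 0, 3)))
```

Note that the first-layer error uses `d2`, the second-layer error already multiplied by the activation derivative; dropping that factor makes the finite-difference check fail for the first-layer weights.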
Figure 12.1 In the forward direction, when we use weights, we calculate unit values as a weighted sum of unit values in the previous layer. In the backward direction, when we update weights, we calculate errors as a weighted sum of errors in the following layer.

We start with the forward direction, in which we calculate $z$ values as the weighted sum of the earlier $z$ values ($z_{0j} = x_j$ are the inputs):

[...]

Summation goes over all the units that this unit is connected to in the following layer; we call this the fan-out. If $k$ is the output layer, there is no layer $k + 1$ and $e_{kh}$ is the derivative of the error function with respect to output $z_{kh}$. The update uses this backpropagated error:

[...]

12.3 Improving Training Convergence

Gradient descent has various advantages. It is simple. It is also local; that is, the change in a weight uses only the values of the presynaptic and postsynaptic units (the units before and after the connection) and the error (suitably backpropagated). When online training is used, it does not need to store the training set and can adapt as the task to be learned changes. But by itself, gradient descent converges slowly. Methods such as the conjugate gradient and second-order methods are computationally intensive and cannot be used easily with deep networks with many weights (Goodfellow, Bengio, and Courville 2016). However, simple techniques improve the performance of gradient descent considerably, making it feasible in real applications.
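To make the locality of gradient descent concrete, here is a minimal sketch of one online update for a network with an arbitrary number of layers. It rests on assumptions not taken from the text: ReLU hidden units, linear outputs, squared error, index 0 of every weight row as the bias weight, and hypothetical names (`forward`, `online_step`, `sq_error`). Weights start very close to zero with a positive bias, as suggested for ReLU in section 12.2.2.

```python
import random

def relu(a):
    return a if a > 0 else 0.0

def relu_prime(a):
    return 1.0 if a > 0 else 0.0

def forward(x, layers):
    zs = [[1.0] + list(x)]   # zs[k]: input to layer k, with a leading bias unit
    pre = []                 # pre[k]: weighted sums computed by layer k
    for k, W in enumerate(layers):
        a = [sum(w * z for w, z in zip(row, zs[-1])) for row in W]
        pre.append(a)
        if k == len(layers) - 1:
            zs.append(a)                              # linear output layer
        else:
            zs.append([1.0] + [relu(v) for v in a])   # hidden layer plus bias unit
    return zs, pre

def sq_error(x, r, layers):
    y = forward(x, layers)[0][-1]
    return 0.5 * sum((ri - yi) ** 2 for ri, yi in zip(r, y))

def online_step(x, r, layers, eta):
    zs, pre = forward(x, layers)
    delta = [ri - yi for ri, yi in zip(r, zs[-1])]    # errors at the outputs
    for k in range(len(layers) - 1, -1, -1):
        W = layers[k]
        below = None
        if k > 0:
            # Error of unit j in the layer below: weighted sum over its fan-out,
            # computed before the weights of layer k are changed.
            below = [relu_prime(pre[k - 1][j]) *
                     sum(delta[h] * W[h][j + 1] for h in range(len(W)))
                     for j in range(len(pre[k - 1]))]
        for h, row in enumerate(W):
            for j in range(len(row)):
                # Local rule: value before the connection times error after it.
                row[j] += eta * delta[h] * zs[k][j]
        delta = below

random.seed(0)
sizes = [3, 4, 4, 2]   # inputs, two hidden layers, outputs (arbitrary choice)
layers = [[[0.1] + [random.uniform(-0.01, 0.01) for _ in range(n_in)]
           for _ in range(n_out)]
          for n_in, n_out in zip(sizes[:-1], sizes[1:])]
x, r = [0.5, -1.0, 2.0], [1.0, -1.0]

before = sq_error(x, r, layers)
for _ in range(20):
    online_step(x, r, layers, eta=0.05)
after = sq_error(x, r, layers)
```

Each weight change reads only `zs[k][j]` and `delta[h]`, the two quantities at the ends of that connection, which is the sense in which the rule is local.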
[...] For each weight $w_i$, we keep an additional value $s_i$, originally zero, that keeps a running average of the past gradients, and that is how we update:

(12.7)  $s_i^t = \alpha s_i^{t-1} + (1 - \alpha) \frac{\partial E^t}{\partial w_i}$

        $\Delta w_i^t = -\eta s_i^t$

The hyperparameter $0 < \alpha < 1$ is the decay or forgetting parameter of the running average and is generally taken as 0.9. Momentum is especially useful when online or minibatch learning is used, where as a result we get an effect of averaging and smooth the trajectory during convergence. The disadvantage is that the $s_i$ values should be stored in extra memory.

12.3.2 Adaptive Learning Factor

In gradient descent, the learning factor $\eta$ determines the magnitude of change to be made in the parameter. Typically it takes a value less than or equal to 0.2. A very early heuristic was to make it adaptive for faster convergence, where it is kept large when learning takes place and is decreased when learning slows down:

(12.8)  $\Delta \eta = \begin{cases} +a & \text{if } E^{t+\tau} < E^t \\ -b\eta & \text{otherwise} \end{cases}$

Thus we increase $\eta$ by a constant amount (or keep it unchanged) if the error on the training set decreases and decrease it geometrically if it increases. Because $E$ may oscillate from one epoch to another, it is a better idea to take the average of the past few epochs as $E^t$.

In a deep network with many hidden layers and units, the contributions of weights to the error are not the same, and one can adapt the learning factor separately for each weight, depending on convergence along that direction. The idea in such methods is to accumulate the past error gradient magnitudes for each weight and then make the learning factor inversely proportional to that. In directions where the gradient is small, the learning factor is larger to compensate, and it can be smaller along directions where the gradient is already large.

In AdaGrad (Duchi, Hazan, and Singer 2011), we keep a running sum of all the past gradients; in RMSProp (Hinton 2012), we keep a running average, where we give more importance to more recent values. For any weight $w_i$, the update rule is

(12.9)  $\Delta w_i^t = -\frac{\eta}{\sqrt{r_i^t}} \frac{\partial E^t}{\partial w_i}$

where $r_i$ is the accumulated past gradient, one for each weight in the network. It is initially zero and updated after each epoch as

(12.10)  $r_i^t = \rho r_i^{t-1} + (1 - \rho) \left| \frac{\partial E^t}{\partial w_i} \right|^2$

where $\rho$ is generally taken as 0.999. Adam (Kingma and Ba 2014) is an extension that also includes a momentum factor. At iteration $t$, we have

(12.11)  $s_i^t = \alpha s_i^{t-1} + (1 - \alpha) \frac{\partial E^t}{\partial w_i}$

         $r_i^t = \rho r_i^{t-1} + (1 - \rho) \left| \frac{\partial E^t}{\partial w_i} \right|^2$

(12.12)  $\tilde{s}_i^t = \frac{s_i^t}{1 - \alpha^t}$ and $\tilde{r}_i^t = \frac{r_i^t}{1 - \rho^t}$

         $\Delta w_i^t = -\eta \frac{\tilde{s}_i^t}{\sqrt{\tilde{r}_i^t}}$

Apologies for the two different uses of the superscript $t$ here: With $s_i^t$, for example, $t$ denotes an index; with $\alpha^t$ and $\rho^t$, it denotes the $t$th power.

In terminology of Adam, short for Adaptive moments, $s_i^t$ is the first moment (it uses the gradient) and $r_i^t$ is the second moment (it uses the square of the gradient), and the versions of these with ~ on top that we calculate in equation 12.12 are the so-called bias-corrected values: Because both $s_i^t$ and $r_i^t$ start from 0 and because both $\alpha$ and $\rho$ are close to 1, in early iterations, the estimated values will be biased toward 0; the divisions correct for those. When $t$ gets large (which happens later, when the corresponding hyperparameter is closer to 1), its $t$th power will be close to 0, and the division in bias correction will be almost by 1.

12.3.3 Batch Normalization

We know from earlier chapters (e.g., section 11.3) that it is always a good idea to z-normalize the inputs to have zero mean and unit variance so
that all inputs are in the same scale. The advantage is that, for some attribute, all values may be clustered in some small range, and normalization spreads them; when the inputs are in the same scale, the corresponding weights are also in the same scale, so initializing them in the same range or using the same learning rate makes sense.

A similar case can also be made for the hidden units, and this is the idea behind batch normalization (Ioffe and Szegedy 2015). We normalize hidden unit values before applying the activation function, such as sigmoid or ReLU. Let us call that weighted sum $a_j$. For each batch or minibatch, for each hidden unit $j$ we calculate the mean $m_j$ and standard deviation $s_j$ of its values, and we first z-normalize:

(12.13)  $\tilde{a}_j = \frac{a_j - m_j}{s_j}$

We can then map these to have arbitrary scale $\gamma_j$ and mean $\beta_j$

(12.14)  $\hat{a}_j = \gamma_j \tilde{a}_j + \beta_j$

and then we apply the activation function.

There are a number of important points here. First, $m_j$ and $s_j$ are calculated anew for each batch, and we see immediately that batch normalization is not meaningful with online learning or very small minibatches. Second, $\gamma_j$ and $\beta_j$ are parameters that are initialized and updated (after each batch or minibatch) using gradient descent, just like the connection weights. So they require extra memory and computation. These new parameters allow each hidden unit to have its arbitrary (but learned) mean and standard deviation for its activation.

Note also that if we normalize using equation 12.13, an incoming bias to hidden unit $j$ (in calculating $a_j$) is useless; it is a constant added for all instances and will be subtracted out; actually, $\beta_j$ will now act as the bias.

Once learning is done, we can go over the whole training data (or a large enough subset) and calculate $m_j$ and $s_j$ to use during testing, when we see one instance at a time.

One reason why batch normalization helps learning is the same as why we normalize inputs: It allows a better use of the range. In some minibatches, all instances may lead to very similar $a_j$; renormalization makes the differences between them clearer. There is another reason that is specific to hidden units: Note that the value of each hidden unit in a deep network depends on all of the weights before it, and these are continuously updated. This means that, as learning proceeds, for the same training data, $a_j$ values shift up and down all together when the earlier weights change in this or that way. When we normalize, we get rid of such oscillations and can focus on the relative differences.

We have here an example of covariate shift, which normally is the difference between the training and test distributions. For example, in face recognition all training faces may have been taken under ideal conditions, but the lighting may not be as good in test images. Robustness to such shifts is an important problem in machine learning in general. Here, if you consider all the layers before $a_j$, it is as if in successive epochs the distribution of the data coming out of them is constantly changing (because of changes in the parameters in those layers), and batch normalization helps with that.

12.4 Regularization

We know from chapter 11 that, just like any other estimator, the complexity of an MLP needs to be adjusted to the complexity of the problem that underlies the data. A network that is too small has bias; a network that is too large has variance. In this chapter we are concerned with networks that have many layers, each with many hidden units and hence too many connections, and we need to find ways to keep variance in check. This approach is inherent in deep learning: We do not want to be bothered with finetuning the exact architecture; we start with a large network and expect learning to use whatever portion of it is needed, ignoring what is extra.

Of course, the main factor that helps us is that nowadays we have much larger data sets. Big data implies not only that we have data that well reflects the main characteristics of the data but also that we have instances for very rare cases; these latter help us distinguish rare but legitimate cases from noise.

This section discusses different ways we can regularize deep neural networks. Some of these are applicable to other learning methods as well, and some are special for neural networks.

12.4.1 Hints

In certain applications we have information about the underlying application, and it is always helpful if we can integrate it into learning. Hints [...]
Figure 12.2 The identity of the object does not change when it is translated, rotated, or scaled. Note that this may not always be true, or may be true up to a point: 'b' and 'q' are rotated versions of each other. These are hints that can be incorporated into the learning process to make learning easier.

In image recognition, there are invariance hints: The identity of an object does not change when it is rotated, translated, or scaled (see figure 12.2). Hints are auxiliary information that can be used to guide the learning process and are especially useful when the training set is limited. There are different ways in which hints can be used:

1. Hints can be used to create virtual examples. For example, knowing that the object is invariant to scale, from a given training example, we can generate multiple copies at different scales and add them to the training set with the same label. This has the advantage that we increase the training set and do not need to modify the learner in any way. The problem may be that too many examples may be needed for the learner to learn the invariance.

2. The invariance may be implemented as a preprocessing stage. For example, optical character readers have a preprocessing stage where the input character image is centered and normalized for size and slant. This is the easiest solution, when it is possible.

3. The hint may be incorporated into the network structure. Convolutional networks with localized connections and weight sharing, which we will discuss in section 12.5, is one example.

4. The hint may also be incorporated into the error function. Let us say we know that $x$ and $x'$ are the same from the application's point of view, where $x'$ may be a "virtual example" of $x$; for example, $x'$ may be a slightly translated version of $x$: $f(x) = f(x')$, when $f(x)$ is the function we would like to approximate. Let us denote by $g(x|\theta)$ our approximating neural network, where $\theta$ are its weights. Then, for all such pairs $(x, x')$, we define the penalty function

   $E_h = \sum_{x, x'} \left[ g(x|\theta) - g(x'|\theta) \right]^2$

   and add it as an extra term to the usual error function:

   $E' = E + \lambda_h \cdot E_h$

   This is a penalty term penalizing the cases where our predictions do not obey the hint, and $\lambda_h$ is the weight of such a penalty (Abu-Mostafa 1995).

   One way this is used is to get robustness to noise. For example, let us say we have an image recognition problem where we know that there can be occlusions. We generate virtual examples with parts synthetically occluded and introduce them as additional examples with the same label. The network trained with these will learn to distribute its input to all of the image so that it is minimally affected by missing data in parts. We saw an example of this with noisy autoencoders in section 11.10.

   Yet another example is the tangent prop (Simard et al. 1992), where the transformation against which we are defining the hint (for example, rotation by an angle) is modeled by a function. The usual error function is modified (by adding another term) so as to allow parameters to move along this line of transformation without changing the error.

12.4.2 Weight Decay

If a weight is close to zero, then the input or hidden unit before that weight effectively does not participate in the calculation, and the higher the weight in magnitude, the more it is taken into account, effectively increasing the complexity of the model. That is why we initialize weights with values close to zero, and that is why stopping early is a regularizer. As learning proceeds, more weights move away from zero; it is as if the complexity (and hence the variance) of the model increases with training, and though the training error may decrease, validation error may
level off and then start increasing. We stop when we observe that this is happening.

When all the incoming weights are close to zero, the weighted sum is also close to zero (assuming the bias is also very small), which implies that we are in the middle section of a sigmoid (or tanh) that is almost linear. If we have successive hidden layers that are almost linear, because the linear combination of linear models is also linear, together they act as one linear model, which effectively corresponds to pruning the hidden layers in between.

Even if we start with a weight close to zero, because of some noisy instances, it may move away from zero; the idea in weight decay is to add some small constant background force that always pulls a weight toward zero, unless it is absolutely necessary that it be large (in magnitude) [...]

[...] have two networks that have the same training error, the simpler one (the one with fewer weights) has a higher probability of better generalizing to new examples.

If you consider the vector of all weights, the second term in equation 12.16 corresponds to the Euclidean ($L_2$) norm of that vector, and hence another name for weight decay is $L_2$ regularization. Another possibility is

(12.17)  $E' = E + \frac{\lambda}{2} \sum_i |w_i|$

called $L_1$ regularization, which forces more of the $w_i$ to be zero and leads to sparser representations.

There is a Bayesian interpretation of weight decay. Remember that the
[...] If $\lambda$ is small, then the allowed variability of the parameters is larger and the force to move them to zero is weaker. We talk more about this in chapter 17 on Bayesian estimation. For example, we will also see that $L_1$ regularization assumes a Laplace prior that is more peaked around zero (see section 17.4). The use of Bayesian estimation in training MLPs is treated in MacKay (1992a, 1992b).

Empirically, it has been shown that after training most of the weights of an MLP are distributed normally around zero, justifying the use of weight decay. But this may not always be the case. Nowlan and Hinton (1992) proposed soft weight sharing, where weights are drawn from a mixture of Gaussians, allowing them to form multiple clusters, not one. Also, these clusters may be centered anywhere and not necessarily at zero, and they have variances that are also modifiable. This changes the prior of equation 12.21 to a mixture of $M \ge 2$ Gaussians:

(12.23)  $p(w_i) = \sum_{j=1}^{M} \alpha_j p_j(w_i)$

where $\alpha_j$ are the component proportions and $p_j(w_i) \sim \mathcal{N}(m_j, s_j^2)$ are the component Gaussians. $M$ is set by the user, and $\alpha_j$, $m_j$, and $s_j$ are learned from the data. Using such a prior and augmenting the error function with its log during training, the weights converge to decrease error and also are grouped automatically to increase the log prior.

12.4.3 Dropout

Previously we talked about the advantage of adding noise to inputs, which makes the network more robust in that the network learns not to rely too much on particular inputs. Say we have an image data set where a group of pixels in the background, simply by luck, always takes similar values for examples of a class. In such a case, the network can very quickly, but wrongly, learn to base its decision for that class on them. But if we introduce additional examples where those pixels are missing or corrupted, the dependency on them will be reduced. A similar case can also be made for hidden units, and it has also been proposed to add noise to hidden units, thereby distributing decisions to whole layers rather than specific units (Poole, Sohl-Dickstein, and Ganguli 2014).

The extreme case is to discard inputs or hidden units completely. In dropout, we have a hyperparameter $p$, and we drop the input or hidden unit with probability $p$, that is, set its output to zero, or keep it with probability $1 - p$ (Srivastava et al. 2014). $p$ is adjusted using cross-validation; with more inputs or hidden units in a layer, we can afford a larger $p$ (see figure 12.3).

Figure 12.3 In dropout, the outputs of a random subset of the units are set to zero, and backpropagation is done on the remaining smaller network.

In each batch or minibatch, for each unit independently we decide randomly to keep it or not. Let us say that $p = 0.25$. So, on average, we remove a quarter of the units and we do backpropagation as usual on the remaining network for that batch or minibatch. We need to make up for the loss of units, though: In each layer, we divide the activation of the remaining units by $1 - p$ to make sure that they provide a vector of similar magnitude to the next layer. There is no dropout during testing.

In each batch or minibatch, a smaller network (with smaller variance) is trained. Thus dropout is effectively sampling from a pool of possible networks of different depths and widths. The resulting effect is that, as we mentioned above, instead of giving large weights to few units in the preceding layer, dropout gives small weights to many, just like we have in $L_2$ regularization. Dropout has been found to be very effective in regularizing deep neural networks.

There is a version called dropconnect that drops out or not connections independently, which allows a larger set of possible networks to sample from, and this may be preferable in smaller networks (Wan et al. 2013).

12.5 Convolutional Layers

12.5.1 The Idea

In getting a better fit to the data, one possibility is to adapt the structure of the network to better represent the constraints of the underlying
[...] of the input space, we have copies of the same hidden units looking at different parts of the input space. This is very simple to implement: During [...]

[...] in the same way, for example, 2-dimensional, is called a feature map; see Dumoulin and Visin (2016) for a guide to convolutional networks.
12.5.5 Multimodal Deep Networks

Previously we said that if we have multiple channels as input, then we can, in the next layer, take a sum of them using the same convolution matrix. Of course, this is not a strict requirement; one can always learn separate weights for the different channels. This defines a more flexible model but needs more data to train, and we need to have a good reason to expect those features to be different.

In the inception network (Szegedy et al. 2014), another idea called channel concatenation was used. Let us say there is a single channel and from that we generate multiple channels; for example, we may have three where one uses a 3 x 3 filter, another 5 x 5, and another 7 x 7. Instead of summing up the results after the convolutions, we can just concatenate them as multiple channels. For example, if the input is 100 x 100 and we have five filters for each of the three, then the output is 15 x 100 x 100; we just create another channel instead of summing.

It is also possible to first process the channels for a number of layers separately and then fuse them at a more abstract level. This can be used when channels are fed with inputs in different modalities. For example, in speech recognition, in addition to the acoustic speech signal we may also have a video of the movement of the mouth. We will talk in more detail about methods for combining multiple sources in chapter 18, but it should be apparent that it is not always a good idea to concatenate very different pieces of data and feed them to a single network. At the lowest level, the features of the two modalities are most probably not correlated, so it does not make sense to have hidden units that are fed with both. What we can do is have separate columns for different modalities extracting increasingly abstract features, and then at some level those may feed to a shared hidden layer; then maybe after some more layers we get to the outputs. Both columns may have one or more convolutional layers and one or more fully connected layers before feeding to the shared layer. This is called a multimodal deep network (Ngiam et al. 2011).

12.6 Tuning the Network Structure

12.6.1 Structure and Hyperparameter Search

Since the very early days of neural networks, researchers have investigated whether the ideal network structure can be learned, while also learning the parameters. The idea is to adapt the network structure, for example, the number of hidden layers and units, incrementally during training, without needing a manual fine-tuning of such hyperparameters. One may view weight decay (see section 12.4.2) from such a perspective as a destructive approach where we start with a large network and prune unnecessary weights. Another possibility is the constructive approach where we start from a small network and add units and associated connections if needed (figure 12.8).

Figure 12.8 Two examples of constructive algorithms: dynamic node creation and cascade correlation. Dynamic node creation adds a unit to an existing layer. Cascade correlation adds each unit as a new hidden layer connected to all the previous layers. Dashed lines denote the newly added unit/connections. Bias units/weights are omitted for clarity.

In dynamic node creation (Ash 1989), an MLP with one hidden layer with one hidden unit is trained, and after convergence, if the error is still high, another hidden unit is added. The incoming weights of the newly added unit and its outgoing weight are initialized randomly and trained with the previously existing weights that are not reinitialized but continue from their previous values.

In cascade correlation (Fahlman and Lebiere 1990), each added unit is a new hidden unit in another hidden layer. Every hidden layer has only one unit that is connected to all of the hidden units preceding it and the inputs. The previous, existing weights are frozen and are not trained; only the incoming and outgoing weights of the newly added unit are trained.
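The structural side of the two constructive algorithms can be sketched in a few lines. This is only an illustration of the bookkeeping, not of the training loop, and it omits bias weights as figure 12.8 does; the names `add_hidden_unit` and `cascade_fan_in` and the initialization range `scale` are assumptions made for this example.

```python
import random

def add_hidden_unit(W, v, n_inputs, rng, scale=0.01):
    """Dynamic node creation: append one unit to the single hidden layer.

    W holds one row of incoming weights per hidden unit and v holds the
    outgoing weights. Existing weights are kept as they are; only the new
    unit's weights are drawn at random (the range `scale` is an assumption).
    """
    new_row = [rng.uniform(-scale, scale) for _ in range(n_inputs)]
    new_out = rng.uniform(-scale, scale)
    return W + [new_row], v + [new_out]

def cascade_fan_in(n_inputs, k):
    """Cascade correlation: the k-th added unit (k = 1, 2, ...) forms its own
    layer and is fed by all inputs plus the k - 1 units added before it."""
    return n_inputs + (k - 1)

rng = random.Random(0)
W, v = [[0.5, -0.3]], [1.2]      # a trained network with one hidden unit
W2, v2 = add_hidden_unit(W, v, n_inputs=2, rng=rng)
```

After the call, `W2[0]` and `v2[0]` still hold the previously trained values, so training can continue from them, whereas in cascade correlation those earlier weights would be frozen.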
Dynamic node creation adds a new hidden unit to an existing hidden layer and never adds another hidden layer. Cascade correlation always adds a new hidden layer with a single unit. The ideal constructive method should be able to decide when to introduce a new hidden layer and when to add a unit to an existing layer. This is an open research problem.

Incremental algorithms are interesting because they correspond to modifying not only the parameters but also the structure during learning. More generally, we can think of a space defined not only by such structural hyperparameters but also by others that affect learning, such as the learning rate or the momentum factor, and we want to find the best configuration in this space. The idea behind automated machine learning (AutoML) is to automate not only the optimization of parameters, as is typically done, but all steps in building a machine learning system, including the optimization of hyperparameters.

There are various approaches. One is population-based training (Jaderberg et al. 2017), where there are a number of models with different hyperparameters trained all together, with information transfer between models allowing an exploration of the vicinity of promising models, just like in genetic algorithms. Another interesting work is network architecture search (Zoph and Le 2016), where the model description is generated as a sequence of actions and the best sequence is learned using reinforcement learning (see chapter 19).

12.6.2 Skip Connections

We saw above that, in cascade correlation, connections skip layers, which we can also have with other networks. For example, a hidden unit may be connected not only to all the units in its preceding layer, as usual, but also to units in a layer much earlier. Consider a unit at layer $l$ (see figure 12.9(a)):

(12.24) $z_{lh} = f\left(\sum_i w_{l,h,i}\, z_{l-1,i} + \sum_j v_{l,h,j}\, z_{l-L,j}\right)$

where $w_{l,h,i}$ are the weights from all units $i$ of layer $l-1$ to unit $h$ of layer $l$ (this is what we normally have between successive layers) and $v_{l,h,j}$ are the skip connections from all units $j$ in layer $l-L$.

We can view information coming from two paths; the usual path uses all the layers between 1 and $l$; the second is a shorter path because it bypasses all the layers between $l-L$ and $l$ and, as such, can be regarded as a simpler model. A similar approach underlies residual networks (He et al. 2015); we can say that the shorter path is the main prediction and the longer path (with extra computation) provides any necessary residual.

Figure 12.9 (a) $v_l$ are the skip connections to layer $l$; (b) $g_l$ are the gating units that mediate the effect of the skip connections. Both allow shorter paths to the output.

The advantage of such skip connections is also apparent in training. We saw in section 12.2 that when we backpropagate, to calculate the gradient at any layer, we multiply the error by all the weights on the path after that layer. When we have deep networks with many layers, if those weights are greater than 1, the gradient may become very large after many multiplications; if they are smaller than 1, the gradient may become very small. This is the problem of vanishing/exploding gradients. When we have skip layers, we effectively define shorter paths to the output, and this helps propagate the error back to earlier layers without gradients exploding or vanishing. Using such residual connections, He et al. (2015) were able to train networks with 152 layers.

12.6.3 Gating Units

Equation 12.24 takes a direct sum of the two paths. An alternative is to take a weighted average and also learn those weights (figure 12.9(b)):

(12.25) $z_{lh} = f\left(g_{lh} \sum_i w_{l,h,i}\, z_{l-1,i} + (1 - g_{lh}) \sum_j v_{l,h,j}\, z_{l-L,j}\right)$
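To make the difference between equations 12.24 and 12.25 concrete, here is a minimal sketch of a single unit computed both ways. It assumes the gate value `g` is simply given as a number in [0, 1]; in a real network it would itself be computed and learned. The function names and the example weights are hypothetical.

```python
import math

def dot(w, z):
    """Weighted sum of activations z with weights w."""
    return sum(wi * zi for wi, zi in zip(w, z))

def skip_unit(w, z_prev, v, z_skip, f=math.tanh):
    """Equation 12.24: direct sum of the usual path and the skip path."""
    return f(dot(w, z_prev) + dot(v, z_skip))

def gated_unit(g, w, z_prev, v, z_skip, f=math.tanh):
    """Equation 12.25: the gate g in [0, 1] mixes the two paths."""
    return f(g * dot(w, z_prev) + (1.0 - g) * dot(v, z_skip))

w, z_prev = [0.2, 0.4], [1.0, 0.5]   # weights from layer l-1 and its activations
v, z_skip = [0.3], [2.0]             # skip weights from layer l-L and its activations

z_sum = skip_unit(w, z_prev, v, z_skip)
z_mix = gated_unit(0.5, w, z_prev, v, z_skip)
```

With `g = 1` the gated unit ignores the skip path and with `g = 0` it uses only the skip path, so a learned gate can move smoothly between the deeper and the shallower model.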
Let us consider the operation of the recurrent network shown in figure 12.12(a):

(12.27) $z_h^t = f\left(\sum_j w_{hj}\, x_j^t + \sum_l r_{hl}\, z_l^{t-1}\right)$

(12.28) $y^t = g\left(\sum_h v_h\, z_h^t\right)$

Figure 12.12 Backpropagation through time: (a) recurrent network and (b) its equivalent unfolded network that behaves identically in four steps.

We see that the activation of hidden units at time $t$ also depends on the activations at time $t-1$ through the recurrent weights $r_{hl}$. The $z_l^0$ are taken as zero. The activation function $f$ at the hidden units is generally taken as sigmoid or tanh, and the output activation $g$ depends on the application.

In equation 12.27, inside $f$ the first part is the contribution of the current input, and the second part is what is remembered from the past. $z^{t-1}$ contains a compressed representation of all the inputs seen until $t$, and its effect is combined with that of the current input at time $t$. The values of $w_{hj}$ and $r_{hl}$ determine how much we give weight to these two sources.

The training of a recurrent neural network is also done using backpropagation. If we have required output given at time $t$, we can calculate the loss at time $t$ and use gradient descent to update $v_h$, $w_{hj}$, and $r_{hl}$. If the required output comes at the end of the sequence, we unfold in time. That is, we write the equivalent feedforward network by creating a separate copy of each unit and connection at different times (see figure 12.12(b)). The resulting network can be trained as usual, with the additional requirement that all copies of a connection should remain identical. The solution, as in weight sharing, is to sum up the different weight changes in time and change the weight by the average. This is called backpropagation through time (Rumelhart, Hinton, and Williams 1986).

An important problem in training recurrent neural networks is vanishing gradients; the unfolded network is deep, and as we can see in figure 12.12(b), backpropagating requires multiplying $r_{hl}$ weights many times. One way to combat this is by using gating units, as we describe next.
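The forward pass of equations 12.27 and 12.28 amounts to applying the same weights once per time step, which is exactly what the unfolded network does. The following is a minimal sketch under the assumption of plain Python lists for the weights and an identity output activation by default; the name `rnn_forward` is hypothetical.

```python
import math

def rnn_forward(xs, W, R, v, f=math.tanh, g=lambda a: a):
    """Run equations 12.27 and 12.28 over a sequence of input vectors.

    W[h][j] are input weights, R[h][l] recurrent weights, v[h] output weights.
    The same W, R, and v are reused at every time step (shared weights).
    """
    H = len(W)
    z = [0.0] * H                              # z^0 is taken as zero
    ys = []
    for x in xs:
        # The right-hand side uses the previous z; the new z replaces it afterward.
        z = [f(sum(W[h][j] * x[j] for j in range(len(x))) +
               sum(R[h][l] * z[l] for l in range(H)))
             for h in range(H)]
        ys.append(g(sum(v[h] * z[h] for h in range(H))))
    return ys, z

# One input, one hidden unit: an input seen at t = 1 still affects the output at t = 2.
ys, z_last = rnn_forward([[1.0], [0.0]], W=[[1.0]], R=[[0.5]], v=[1.0])
```

Setting every entry of `R` to zero removes the memory, and each output then depends only on the current input.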
12.7.4 Long Short-Term Memory Unit

The long short-term memory (LSTM) unit (Hochreiter and Schmidhuber 1997) is a more complex form of the recurrent unit. It does not just take a simple sum of the effects of the input and the past, but uses a set of gating units that turn these effects on or off, looking at the input. These gates have their own parameters that are also trained during backpropagation.

The augmented input to an LSTM unit, $\boldsymbol{x}'$, is the concatenation of the current input $\boldsymbol{x}^t$ and the past hidden unit activations $\boldsymbol{z}^{t-1}$. All decisions use this augmented input; that is, they are based on what is seen now and what is remembered from the past.

In LSTM unit $h$, first, the effect of the input is calculated as

(12.29) $c_h^t = \tanh\left((\boldsymbol{w}_h^c)^T \boldsymbol{x}'\right)$

The cumulative internal state of unit $h$ is a weighted sum of the past state and current input:

(12.30) $s_h^t = \underline{f}_h^t \cdot s_h^{t-1} + \underline{i}_h^t \cdot c_h^t$

where $\underline{f}_h^t$ and $\underline{i}_h^t$ are the forget and input gates, respectively (gate values are underlined to tell them apart from activations):

$\underline{f}_h^t = \text{sigmoid}\left((\boldsymbol{w}_h^f)^T \boldsymbol{x}'\right)$

$\underline{i}_h^t = \text{sigmoid}\left((\boldsymbol{w}_h^i)^T \boldsymbol{x}'\right)$

The overall value is calculated as

(12.31) $z_h^t = \underline{o}_h^t \cdot \tanh(s_h^t)$

with the output gate defined as

(12.32) $\underline{o}_h^t = \text{sigmoid}\left((\boldsymbol{w}_h^o)^T \boldsymbol{x}'\right)$

The input activation $c_h$ and all three gates see the extended input and have their own weight vectors that are trained from data. One variant of LSTM adds peephole connections; this means adding the vector of $s_h$ to the augmented input $\boldsymbol{x}'$ of the gating units (Gers, Schraudolph, and Schmidhuber 2002).

The gates allow each unit to selectively decide how much to take the current input or the past into account, and how much to give out as output. They help with learning long-term dependencies. In equation 12.30, if $\underline{i}_h$ is close to 0 and $\underline{f}_h$ close to 1, the state of the hidden unit is copied to the next state; this allows remembering inputs seen in the past without changing them. Operationally this will be just like a skip layer, bypassing unnecessary additional complexity. It is not surprising that highway networks (equation 12.26) and LSTM were proposed by the same research group.

Let us say we have an application where we process sentences one word at a time. Compare the following two sentences:

"The man who entered the house was in his thirties."
"The woman who entered the house was in her thirties."

Whether the correct pronoun is "his" or "her" depends on whether the subject is "man" or "woman," and that word appears a number of words before. To be able to generate such a sentence, for example, or to be able to generate a translation of it, at the time of the pronoun the network should remember the subject; some part of the hidden representation should be set when the subject is set, and it should remain intact until we get to the pronoun. This is difficult for an ordinary recurrent network, where we continually average over the hidden units, but LSTM units with their gates turning on and remaining on for some time (until maybe another event happens to turn them off) can learn such dependencies.

LSTM nowadays is considered to be the state-of-the-art model for diverse sequence learning applications from translation to caption generation. The more flexible architecture allows longer and more complex sequences to be learned, recognized, and produced.

12.7.5 Gated Recurrent Unit

The gated recurrent unit (GRU) (Cho et al. 2014) is a simpler version of LSTM with fewer gates and less computation.

The effect of the input is calculated as

(12.33) $c_h^t = \tanh\left((\boldsymbol{w}_h^x)^T \boldsymbol{x}^t + \underline{r}_h^t \cdot (\boldsymbol{w}_h^z)^T \boldsymbol{z}^{t-1}\right)$

Note that this differs from LSTM's equation 12.29 in that $\boldsymbol{w}^c$ is split into $\boldsymbol{w}^x$ and $\boldsymbol{w}^z$, and there is a reset gate $\underline{r}$ and an input gate $\underline{i}$:

(12.34) $\underline{r}_h^t = \text{sigmoid}\left((\boldsymbol{w}_h^r)^T \boldsymbol{x}'\right)$ and $\underline{i}_h^t = \text{sigmoid}\left((\boldsymbol{w}_h^i)^T \boldsymbol{x}'\right)$
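Returning to the LSTM unit of equations 12.29 to 12.32, one time step of a single unit can be sketched as follows. This is an illustrative reading of those equations, assuming plain lists for the weight vectors and no bias terms; the function name `lstm_unit_step` and the example weights are hypothetical.

```python
import math

def sigmoid(a):
    return 1.0 / (1.0 + math.exp(-a))

def dot(w, x):
    return sum(wi * xi for wi, xi in zip(w, x))

def lstm_unit_step(x_t, z_prev, s_prev, w_c, w_f, w_i, w_o):
    """One time step of a single LSTM unit h.

    x_t: current input vector; z_prev: past hidden activations z^{t-1};
    s_prev: past internal state of this unit; w_*: weight vectors over x'.
    """
    x_aug = list(x_t) + list(z_prev)       # augmented input x'
    c = math.tanh(dot(w_c, x_aug))         # (12.29) effect of the input
    f = sigmoid(dot(w_f, x_aug))           # forget gate
    i = sigmoid(dot(w_i, x_aug))           # input gate
    s = f * s_prev + i * c                 # (12.30) internal state
    o = sigmoid(dot(w_o, x_aug))           # (12.32) output gate
    z = o * math.tanh(s)                   # (12.31) value of the unit
    return z, s

# Forget gate driven toward 1 and input gate toward 0: the state is carried over.
z, s = lstm_unit_step([1.0], [0.0], 0.7,
                      w_c=[0.3, 0.0], w_f=[50.0, 0.0],
                      w_i=[-50.0, 0.0], w_o=[0.0, 0.0])
```

In this example the internal state stays at its previous value, which is the "copy the state to the next step" behavior that lets the unit remember something like the subject of a sentence.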
If $\underline{r}_h$ is 0, the past is forgotten, hence the name, reset gate. The hidden unit value is calculated as the weighted average of the past and the present:

(12.35) $z_h^t = (1 - \underline{i}_h^t) \cdot z_h^{t-1} + \underline{i}_h^t \cdot c_h^t$

Note that unlike LSTM's equation 12.30, there is not a separate forget gate; it is taken as 1 minus the input gate. We also see that, unlike LSTM, there is no intermediate $s_h$ and no output gate; whatever value is calculated is the value of hidden unit $h$. As a simpler version of LSTM, GRU finds successful applications in many sequence learning tasks (Chung et al. 2014).

The basic recurrent unit takes a simple weighted average of the current input and the past hidden activations. LSTM replaces it with a more complex model; actually we can consider an LSTM as a small specific MLP structure with internal values and gates, just like hidden units, that have parameters that are trained; GRU corresponds to a simpler MLP.

12.8 Generative Adversarial Network

The generative adversarial network (GAN) is not a deep learning method per se. It uses two deep neural networks, but the novelty of GAN is in the way the task is defined rather than how it is implemented.

Let us say that we have a training set of faces and that our aim is to be able to generate new faces; that is, we want to learn a generator that can synthesize new images that everyone will recognize as legitimate face images, but different from the ones we already have in the training set.

Our training instances $X = \{\boldsymbol{x}^t\}_t$ are drawn from an unknown probability distribution $p(\boldsymbol{x})$; what we would like to do is first estimate and then sample from $p(\boldsymbol{x})$. Of course, one can always try any of the parametric, semiparametric, or nonparametric methods for density estimation, discussed respectively in chapters 5, 7, and 8, but GAN uses a different and very interesting approach that bypasses the estimation of $p(\boldsymbol{x})$.

The underlying idea in GAN is somewhat reminiscent of word2vec (section 11.11) in that to solve the original unsupervised problem we define an auxiliary supervised problem that helps us with the original task. A GAN is composed of two learners, a generator and a discriminator, which are implemented as two deep neural networks (Goodfellow et al. 2014).

We start with the training set of "true" examples, $X = \{\boldsymbol{x}^t\}_t$, each of which is $d$-dimensional. These are valid face images. We want to learn a generator $G$ that takes a low-dimensional $\boldsymbol{z}$ as input and gives out a "fake" example that we denote by $G(\boldsymbol{z})$. The $\boldsymbol{z}$ are drawn from some prior multivariate distribution, typically a $k$-dimensional multivariate Gaussian with mean zero, unit variance on all dimensions, and zero covariances, that is, $\mathcal{N}_k(\boldsymbol{0}, \boldsymbol{I})$.

$G$ is implemented as a deep neural network that has $k$ inputs, $d$ outputs, and a number of hidden layers in between. If we want to generate an image, the final layers will be deconvolutional, just as we have with a deep autoencoder (see section 12.5.4); the similarity between a generator and the decoder of an autoencoder is a topic we will come back to shortly. $G$ takes a low-dimensional distribution and learns to expand and distort it suitably to cover the target $\boldsymbol{x}$ distribution.

In addition to $G$, GAN also uses a discriminator $D$ that is trained to discriminate true and fake examples. $D$ is another deep network that takes $d$-dimensional $\boldsymbol{x}$ as its input and, after some hidden layers, convolutional or full, has a single sigmoid output, because it is trained on a two-class task: For all true $\boldsymbol{x}^t$, the label is 1; for all $G(\boldsymbol{z})$ that are generated by $G$ (from some $\boldsymbol{z}$), the label is 0 (see figure 12.13).

Figure 12.13 In GAN, $G$ is the generator that generates fake examples from random $\boldsymbol{z}$. $D$ is the discriminator that is trained to discriminate such fake $G(\boldsymbol{z})$ from true $\boldsymbol{x}$ of the training set.

The GAN criterion is

(12.36) $\sum_t \log D(\boldsymbol{x}^t) + \sum_{\boldsymbol{z} \sim p(\boldsymbol{z})} \log\left(1 - D(G(\boldsymbol{z}))\right)$

which is maximized by $D$ and minimized by $G$. $D$ is trained so that its output is as large as possible for a true instance and as small as possible for a fake instance. $G$ is trained to fool $D$; it is trained to generate fakes for which $D$ gives large outputs. During training, as $G$ gets better in
generating fakes, $D$ gets better in detecting them, which in turn forces $G$ to generate much better fakes, and so on.

Starting from random weights for both, training of the two is alternated. For a fixed $G$, we update the weights of $D$ so that the criterion is maximized. This makes $D$ a good detector for a fixed $G$. Then we fix $D$ and update the weights of $G$ to minimize the second term, which is equivalent to maximizing

$\sum_{\boldsymbol{z} \sim p(\boldsymbol{z})} \log D(G(\boldsymbol{z}))$

which makes $G$ a better faker for a fixed $D$.

A variant of GAN that frequently leads to better generators uses the Wasserstein loss (Arjovsky, Chintala, and Bottou 2017):

(12.37) $E[D(\boldsymbol{x}^t)] - E_{\boldsymbol{z} \sim p(\boldsymbol{z})}[D(G(\boldsymbol{z}))]$

which is maximized by $D$ and minimized by $G$ (which is equivalent to maximizing the second term).

In this case, $D$ is not a classifier but a regressor; it has a single linear output, which is a score of the "trueness" of its input. Wasserstein loss measures the difference between the average score of true instances and the average score of fake instances: $D$ tries to maximize this; $G$ tries to minimize this, by maximizing the average score of the fakes. A similar criterion is used by the loss-sensitive GAN (Qi 2017).

GAN is currently one of the hottest topics of research in machine learning, and we are seeing various applications as well as extensions. One of them is the bidirectional GAN (Donahue, Krähenbühl, and Darrell 2016) and the equivalent adversarially learned inference (Dumoulin et al. 2016), where there is an additional encoder model that is trained to output $\boldsymbol{z}$ for a given $\boldsymbol{x}$. This new component works in the opposite direction of the generator that works like a decoder in going from $\boldsymbol{z}$ to $\boldsymbol{x}$.

Training a GAN is difficult because we have two deep networks to train and the task of adjusting the hyperparameters is doubled. $G$ is trained by backpropagating through $D$, which adds to the difficulty. One frequent problem is mode collapse. There may be different ways of being a true instance, but the generator learns only a few of those.

An additional problem is that there is no good evaluation measure to detect the quality of a trained generator. There are frequently used criteria, which basically compare the distribution of true and fake instances, but they are not always very informative. As a result, the best way still is to generate a screen full of fakes and manually evaluate them by looking at them. This is very subjective and limits the use of GAN to applications where this is possible, such as images.

12.9 Notes

Deep learning algorithms that we discuss in this chapter currently constitute the most popular approach in machine learning. The main reason for this is that they are very successful in many domains, and in important applications, such as image recognition, speech recognition, and translation, they have led to significant improvements. Many commercial products that do some form of pattern recognition today use a variant of deep learning.

Neural networks are easy to deploy. They need less manual tuning than other models, and with enough data and a large network, one can quickly achieve a good level of performance. Nowadays it is easy to find data. Computing is getting cheaper too. Another factor is the availability of open software; there are many free software packages that help in building and training deep neural networks. All these contribute to the fact that one can have a prototype ready and running in a short amount of time with little cost. The commercial interest is helpful in that it provides resources that have not been previously available; it also introduces unnecessary hype and unrealistic expectations.

The main attraction of deep neural networks is that they are trained end to end. Many applications fit this encoder-decoder structure where the encoder takes the raw input and generates some intermediate representation and the decoder takes this and generates the output. The advantage of end-to-end training is that it is enough to provide the network with example pairs of input at the very beginning and desired output at the very end, and any transformation that needs to be done in between by the encoder and the decoder, as well as the format of the intermediate representation, is automatically learned.

For example, in translation the encoder, which typically uses LSTM, takes a sentence in the source language one word at a time and generates a high-dimensional distributed "thought vector"; the decoder takes this and, using another LSTM, generates a sentence in the target language, again one word at a time. To generate a caption for an image, the encoder is a convolutional network that analyzes the image and generates a distributed content vector, and the decoding LSTM generates a sequence of words that constitute the caption. We will see other examples in chapter 19 where the decoder is an action module: In playing a board game, the encoder analyzes the board image and generates an intermediate code; the action module takes this and decides on the best move.

Despite all the success stories, there are a number of problems too. One major criticism of neural networks is that they are not interpretable, so there is no way of validating whether they learned the correct task. Despite significant research on the topic, hidden units remain hidden, and when we look at some hidden unit in some intermediate layer of a deep network, we have no idea what it has learned, or if it is really necessary. Supporting this is the existence of adversarial examples (Goodfellow, Shlens, and Szegedy 2014): Small perturbations in the input may sometimes cause large changes in the output. This is an indicator that the way such networks generalize from training examples is not completely understood, and it may just be that some of these deep networks work simply as gigantic lookup tables.

These observations indicate that we need to be very careful before using such networks in critical applications. They also show that the problem of machine learning is far from being solved and deep learning is an important but yet another step toward this aim.

12.10 Exercises

1. Derive the update equations for an MLP that uses ReLU in its hidden units. Assume one hidden layer of H units and one output trained for regression.

2. Some researchers say that in batch normalization calculating new statistics for each minibatch and normalizing accordingly is equivalent to adding some noise to hidden unit outputs and that this has a regularizing effect. Comment.

3. Propose a hint that helps in image recognition and explain how it can be incorporated into the learning process.

4. In weight decay, does it make sense to use different $\lambda$ in different layers?

5. Derive the update equations for soft weight sharing.

SOLUTION: Assume a single-layer network for two-class classification for simplicity:

$y^t = \text{sigmoid}\left(\sum_i w_i x_i^t\right)$

The augmented error is

$E' = -\sum_t \left[r^t \log y^t + (1 - r^t) \log(1 - y^t)\right] - \lambda \sum_i \log \sum_j \alpha_j\, p_j(w_i)$

where $p_j(w_i) \sim \mathcal{N}(m_j, s_j^2)$. Note that $\{w_i\}_i$ includes all the weights including the bias. When we use gradient descent, we get

$\Delta w_i = \eta \sum_t (r^t - y^t)\, x_i^t - \eta \lambda \sum_j \pi_j(w_i) \frac{w_i - m_j}{s_j^2}$

where

$\pi_j(w_i) = \frac{\alpha_j\, p_j(w_i)}{\sum_l \alpha_l\, p_l(w_i)}$

is the posterior probability that $w_i$ belongs to component $j$. The weight is updated to both decrease the cross-entropy and move it closer to the mean of the nearest Gaussian. Using such a scheme, we can also update the mixture parameters, for example

$\Delta m_j = \eta \lambda \sum_i \pi_j(w_i) \frac{w_i - m_j}{s_j^2}$

The $\pi_j(w_i)$ is close to 1 if it is highly probable that $w_i$ comes from component $j$; in such a case, $m_j$ is updated to be closer to the weight $w_i$ it represents. This is an iterative clustering procedure, and we will discuss such methods in more detail in chapter 13 (see, e.g., equation 13.5).

6. In some studies, different dropout rates are used for input and hidden units. Discuss when this will be appropriate.

7. In designing a convolutional layer for image recognition data, what are the factors that affect the filter size, number of filters, and stride?

8. Propose a one-dimensional convolution layer for sentence classification where the basic inputs are characters.

9. Incremental learning of the structure of an MLP can be viewed as a state space search. What are the operators? What is the goodness function? What type of search strategies are appropriate? Define these in such a way that dynamic node creation and cascade correlation are special instantiations.

10. For the MLP given in figure 12.12, derive the update equations for the unfolded network.

11. Derive the update equations for the parameters of the gating units in highway networks.

12. Compare LSTM and GRU units and propose a new unit mechanism distinct from those, justifying it by an application.

13. Assuming a GAN composed of a linear G and a linear D, derive the update equations for training them for (a) vanilla GAN and (b) Wasserstein GAN.
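As a starting point for experimenting with exercise 13, the two criteria of section 12.8 can be evaluated numerically once the outputs of D are available. The sketch below assumes those outputs are given as plain lists (probabilities for the vanilla GAN, unbounded scores for the Wasserstein variant) and replaces expectations with sample averages; the function names are hypothetical and no training is performed.

```python
import math

def gan_criterion(d_true, d_fake):
    """Equation 12.36 for discriminator outputs in (0, 1).

    d_true: D(x^t) on true instances; d_fake: D(G(z)) on generated ones.
    D wants this value large, G wants it small.
    """
    return (sum(math.log(d) for d in d_true) +
            sum(math.log(1.0 - d) for d in d_fake))

def wasserstein_criterion(s_true, s_fake):
    """Equation 12.37 with expectations replaced by sample averages."""
    return sum(s_true) / len(s_true) - sum(s_fake) / len(s_fake)

# A discriminator that separates true from fake well scores higher on 12.36
# than one whose outputs are close to 0.5 everywhere.
sharp = gan_criterion([0.9, 0.8], [0.1, 0.2])
unsure = gan_criterion([0.6, 0.5], [0.4, 0.5])
gap = wasserstein_criterion([2.0, 4.0], [1.0, 1.0])
```

Alternating training would raise these values with respect to the weights of D and lower them with respect to the weights of G, which is what the update equations asked for in the exercise express.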