HW 08

Chapter 6  Multilayer Neural Networks
Problem Solutions

Section 6.2

1. Consider a three-layer network with linear units throughout, having input vector x, hidden-unit vector y, and output vector z. For such a linear system we have y = W_1 x and z = W_2 y for two matrices W_1 and W_2. Thus we can write the output as

   z = W_2 y = W_2 W_1 x = W_3 x

for the matrix W_3 = W_2 W_1. But this is precisely the equation of a two-layer network having connection matrix W_3. Thus a three-layer network with linear units throughout can be implemented by a two-layer network with appropriately chosen connections.

Consequently, a nonlinearly separable problem cannot be solved by a three-layer network with linear hidden units. To see this, suppose such a problem could be solved by a three-layer network with linear hidden units. Then, by the argument above, it could equivalently be solved by a two-layer network, and hence the problem would be linearly separable. But by assumption the problem is not linearly separable, which is a contradiction, and so the conclusion holds.

2. Fourier's theorem shows that a three-layer neural network with sigmoidal hidden units can act as a universal approximator. Consider a two-dimensional input and a single output z(x_1, x_2) = z(x). Fourier's theorem states

   z(x) \simeq \sum_{f_1} \sum_{f_2} A_{f_1 f_2} \cos(f_1 x_1) \cos(f_2 x_2).

(a) Fourier's theorem, as stated above, can be rewritten with the trigonometric identity

   \cos\alpha \, \cos\beta = \tfrac{1}{2}\cos(\alpha + \beta) + \tfrac{1}{2}\cos(\alpha - \beta)

to give

   z(x_1, x_2) \simeq \sum_{f_1} \sum_{f_2} \frac{A_{f_1 f_2}}{2} \bigl[\cos(f_1 x_1 + f_2 x_2) + \cos(f_1 x_1 - f_2 x_2)\bigr].

(b) We want to show that \cos(x), or indeed any continuous function, can be approximated by the linear combination

   f(x) \simeq f(x_0) + \sum_i \bigl[f(x_{i+1}) - f(x_i)\bigr] \frac{\mathrm{Sgn}[x - x_i] + 1}{2}.

The Fourier series of a function f(x) at a point x_0 converges to f(x_0) if f(x) is of bounded variation in some interval (x_0 - h, x_0 + h) centered on x_0. A function is of bounded variation if, given any partition of the interval a = x_0 < x_1 < x_2 < \cdots < x_{n-1} < x_n = b, the sums \sum_{k=1}^{n} |f(x_k) - f(x_{k-1})| are bounded.

3. Each w_{kj} is computed in 2c operations; thus the weights are computed in (2d + 2c) n_H time. The time for one iteration is the time to compute \Delta w plus the time to add \Delta w to w. In summary, the time complexity of one iteration is 3 d n_H + 5 n_H c + c + 2 n_H, and the total time complexity is this per-iteration cost multiplied by the number of iterations.

4. Equation 20 in the text gives the sensitivity at a hidden unit,

   \delta_j = f'(\mathrm{net}_j) \sum_{k=1}^{c} w_{kj} \delta_k.

For a four-layer or higher network, the sensitivity at a hidden unit is likewise of this form, with the sum taken over the sensitivities of the units in the layer immediately above.

10. We express the derivative of a sigmoid in terms of the sigmoid itself, for positive constants a and b, in the following cases.

(a) For f(net) = 1/(1 + e^{a\,\mathrm{net}}), the derivative is

   \frac{df(\mathrm{net})}{d\,\mathrm{net}} = \frac{-a\, e^{a\,\mathrm{net}}}{(1 + e^{a\,\mathrm{net}})^2} = -a f(\mathrm{net})\bigl(1 - f(\mathrm{net})\bigr).

(b) For f(net) = a \tanh(b\,\mathrm{net}), the derivative is

   \frac{df(\mathrm{net})}{d\,\mathrm{net}} = a b \bigl(1 - \tanh^2(b\,\mathrm{net})\bigr) = \frac{b}{a}\bigl(a^2 - f^2(\mathrm{net})\bigr).

11. We use the following notation: the activations at the first (input), second, third, and fourth (output) layers are x_i, y_j, v_l, and z_k, respectively, and the indexes are clear from usage. The number of units in the first hidden layer is n_{H1}, and the number in the second hidden layer is n_{H2}.

Algorithm 0 (Four-layer backpropagation)
1  begin initialize x, w, criterion \theta, \eta, m \leftarrow 0
2     do m \leftarrow m + 1; x^m \leftarrow randomly chosen pattern
3        w_{ji} \leftarrow w_{ji} + \eta \delta_j x_i;  w_{lj} \leftarrow w_{lj} + \eta \delta_l y_j;  w_{kl} \leftarrow w_{kl} + \eta \delta_k v_l
4     until \|\nabla J(w)\| < \theta
5     return w
6  end

Section 6.4

12. Suppose the input-to-hidden weights are all set to the same value, say w_0, so that w_{ji} = w_0 for every j and i. Then each hidden unit receives the same net activation,

   \mathrm{net}_j = \sum_{i=1}^{d} w_{ji} x_i = w_0 \sum_{i=1}^{d} x_i,

and hence each hidden unit computes the same output y_j = f(\mathrm{net}_j).
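The closed-form derivatives quoted in Problem 10 are easy to sanity-check numerically. The short script below is my own check rather than part of the original solution; the constants a, b and the grid of test points are arbitrary choices. It compares each expression against a central finite difference.

```python
import numpy as np

a, b = 1.7, 0.9          # arbitrary positive constants
eps = 1e-6               # finite-difference step
net = np.linspace(-3.0, 3.0, 13)

def f1(x):
    # (a) f(net) = 1 / (1 + exp(a*net))
    return 1.0 / (1.0 + np.exp(a * x))

def f2(x):
    # (b) f(net) = a * tanh(b*net)
    return a * np.tanh(b * x)

# central finite differences
num1 = (f1(net + eps) - f1(net - eps)) / (2 * eps)
num2 = (f2(net + eps) - f2(net - eps)) / (2 * eps)

# closed forms expressed in terms of the sigmoid itself
ana1 = -a * f1(net) * (1.0 - f1(net))    # -a f (1 - f)
ana2 = (b / a) * (a**2 - f2(net)**2)     # (b/a)(a^2 - f^2)

print(np.max(np.abs(num1 - ana1)))   # both differences come out ~1e-9 or smaller
print(np.max(np.abs(num2 - ana2)))
```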
19. The assumption that the network can represent the underlying true distribution is not used before Eq. 28 in the text. For Eq. 29, however, we invoke g_k(x; w) = P(\omega_k|x), which is used in

   \int \bigl[g_k(x; w) - P(\omega_k|x)\bigr] p(x)\, dx = 0.

This is true only when the above assumption is met. If the assumption is not met, the gradient descent procedure yields the closest projection to the posterior probability within the class of functions spanned by the network.

20. Recall the equation

   p(y|\omega_k) = e^{A(\tilde{w}_k) + B(y, \phi) + \tilde{w}_k^t y}.

(a) Given p(y|\omega_k), we use Bayes' theorem to write the posterior as

   P(\omega_k|y) = \frac{p(y|\omega_k)\, P(\omega_k)}{\sum_m p(y|\omega_m)\, P(\omega_m)}.

(b) We interpret A(\cdot), \tilde{w} and \phi as follows:

   P(\omega_k|y) = \frac{e^{A(\tilde{w}_k) + B(y,\phi) + \tilde{w}_k^t y}\, P(\omega_k)}{\sum_m e^{A(\tilde{w}_m) + B(y,\phi) + \tilde{w}_m^t y}\, P(\omega_m)} = \frac{e^{\mathrm{net}_k}}{\sum_m e^{\mathrm{net}_m}},

where \mathrm{net}_k = A(\tilde{w}_k) + B(y, \phi) + \tilde{w}_k^t y + \ln P(\omega_k). Since B(y, \phi) is common to every class, it cancels between numerator and denominator; thus \tilde{w}_k is the weight vector describing the separating hyperplane, the remaining terms A(\tilde{w}_k) + \ln P(\omega_k) act as the bias, and the factor e^{A(\tilde{w}_k)} combines with the prior P(\omega_k).

21. Backpropagation with softmax output units is done in the usual manner; all that must be evaluated differently are the sensitivities at the output and hidden-layer units, in particular

   \frac{\partial z_k}{\partial \mathrm{net}_k} = z_k (1 - z_k).

(a) We are given the following terms:

   \mathrm{net}_j = \sum_{i=1}^{d} w_{ji} x_i, \qquad \mathrm{net}_k = \sum_{j=1}^{n_H} w_{kj} y_j, \qquad y_j = f(\mathrm{net}_j), \qquad z_k = \frac{e^{\mathrm{net}_k}}{\sum_{m=1}^{c} e^{\mathrm{net}_m}},

and the error function

   J = \frac{1}{2} \sum_{k=1}^{c} (t_k - z_k)^2.

To derive the learning rule we have to compute \partial J/\partial w_{kj} and \partial J/\partial w_{ji}. We start with the former:

   \frac{\partial J}{\partial w_{kj}} = \frac{\partial J}{\partial \mathrm{net}_k} \frac{\partial \mathrm{net}_k}{\partial w_{kj}}.

We next compute \partial J/\partial \mathrm{net}_k. We have

   \frac{\partial z_s}{\partial \mathrm{net}_k} = \begin{cases} z_k(1 - z_k) & \text{if } s = k \\ -z_s z_k & \text{if } s \neq k \end{cases}

and finally

   \frac{\partial J}{\partial z_s} = -(t_s - z_s).

Putting these together we get

   \frac{\partial J}{\partial \mathrm{net}_k} = \sum_{s \neq k} (t_s - z_s) z_s z_k - (t_k - z_k) z_k (1 - z_k).

We use \partial \mathrm{net}_k/\partial w_{kj} = y_j and obtain

   \frac{\partial J}{\partial w_{kj}} = y_j \Bigl[ \sum_{s \neq k} (t_s - z_s) z_s z_k - (t_k - z_k) z_k (1 - z_k) \Bigr].

Now we have to find the input-to-hidden weight contribution to J. By the chain rule we have

   \frac{\partial J}{\partial w_{ji}} = \frac{\partial J}{\partial y_j} \frac{\partial y_j}{\partial \mathrm{net}_j} \frac{\partial \mathrm{net}_j}{\partial w_{ji}}.

We can at once find the last two partial derivatives:

   \frac{\partial y_j}{\partial \mathrm{net}_j} = f'(\mathrm{net}_j) \qquad \text{and} \qquad \frac{\partial \mathrm{net}_j}{\partial w_{ji}} = x_i.

For the first we use \partial \mathrm{net}_k/\partial y_j = w_{kj} and find

   \frac{\partial J}{\partial y_j} = \sum_{k=1}^{c} \frac{\partial J}{\partial \mathrm{net}_k} \frac{\partial \mathrm{net}_k}{\partial y_j} = \sum_{k=1}^{c} w_{kj} \Bigl[ \sum_{s \neq k} (t_s - z_s) z_s z_k - (t_k - z_k) z_k (1 - z_k) \Bigr].

We put all this together and find

   \frac{\partial J}{\partial w_{ji}} = x_i f'(\mathrm{net}_j) \sum_{k=1}^{c} w_{kj} \Bigl[ \sum_{s \neq k} (t_s - z_s) z_s z_k - (t_k - z_k) z_k (1 - z_k) \Bigr].

Of course, the learning rule is then

   \Delta w_{ji} = -\eta \frac{\partial J}{\partial w_{ji}}, \qquad \Delta w_{kj} = -\eta \frac{\partial J}{\partial w_{kj}},

where the derivatives are as given above.

(b) We are given the cross-entropy criterion function

   J_{ce} = \sum_{k=1}^{c} t_k \ln \frac{t_k}{z_k}.

The learning rule derivation is the same as in part (a) except that we need to replace \partial J/\partial z_k with \partial J_{ce}/\partial z_k. Note that for the cross-entropy criterion we have \partial J_{ce}/\partial z_k = -t_k/z_k. Then, following the steps in part (a), we have

   \frac{\partial J_{ce}}{\partial w_{kj}} = y_j \Bigl[ \sum_{s \neq k} t_s z_k - t_k (1 - z_k) \Bigr].

Similarly, we have

   \frac{\partial J_{ce}}{\partial w_{ji}} = x_i f'(\mathrm{net}_j) \sum_{k=1}^{c} w_{kj} \Bigl[ \sum_{s \neq k} t_s z_k - t_k (1 - z_k) \Bigr].

Of course, the learning rule is then

   \Delta w_{ji} = -\eta \frac{\partial J_{ce}}{\partial w_{ji}}, \qquad \Delta w_{kj} = -\eta \frac{\partial J_{ce}}{\partial w_{kj}},

where the derivatives are as given above.

22. In the two-category case, if g_1 \simeq P(\omega_1|x), then 1 - g_1 \simeq P(\omega_2|x), since we can assume the categories are mutually exclusive and exhaustive. But 1 - g_1 can be computed by a network with input-to-hidden weights identical to those used to compute g_1. From Eq. 27 in the text, we know that

   \int \bigl[g_1(x; w) - P(\omega_1|x)\bigr]^2 p(x)\, dx + \int \bigl[g_2(x; w) - P(\omega_2|x)\bigr]^2 p(x)\, dx

is a minimum; this implies that every term in the above expression is minimized.
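The hidden-to-output gradient derived in Problem 21(a) can also be verified numerically. The sketch below is my own check, not part of the manual: it compares the closed-form \partial J/\partial w_{kj} for softmax outputs with a sum-of-squares error against a central finite difference. The layer sizes, weights, hidden outputs y, and targets t are arbitrary test values.

```python
import numpy as np

rng = np.random.default_rng(1)
nH, c = 5, 4
y = rng.normal(size=nH)        # hidden-unit outputs y_j (arbitrary)
W = rng.normal(size=(c, nH))   # hidden-to-output weights w_kj (arbitrary)
t = rng.normal(size=c)         # target vector t_k (arbitrary)

def J(W):
    """Sum-of-squares error with softmax output units."""
    net = W @ y
    z = np.exp(net) / np.exp(net).sum()
    return 0.5 * np.sum((t - z) ** 2)

# closed-form gradient from Problem 21(a):
# dJ/dw_kj = y_j [ sum_{s != k} (t_s - z_s) z_s z_k - (t_k - z_k) z_k (1 - z_k) ]
net = W @ y
z = np.exp(net) / np.exp(net).sum()
grad = np.empty_like(W)
for k in range(c):
    cross = sum((t[s] - z[s]) * z[s] * z[k] for s in range(c) if s != k)
    grad[k] = y * (cross - (t[k] - z[k]) * z[k] * (1 - z[k]))

# central finite-difference gradient, one weight at a time
eps = 1e-6
num = np.empty_like(W)
for k in range(c):
    for j in range(nH):
        Wp, Wm = W.copy(), W.copy()
        Wp[k, j] += eps
        Wm[k, j] -= eps
        num[k, j] = (J(Wp) - J(Wm)) / (2 * eps)

print(np.max(np.abs(grad - num)))   # agreement to roughly 1e-9 or better
```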
Section 6.7

23. Consider the weight update rules given by Eqs. 12 and 23 in the text.

(a) The weight updates contain the factor \eta f'(\mathrm{net}). If we take the sigmoid f_b(h) = \tanh(bh), we can write

   f_b(h) = \frac{2}{1 + e^{-2bh}} - 1,

and therefore

   f'_b(h) = \frac{4 b\, e^{-2bh}}{(1 + e^{-2bh})^2} = 4bD, \qquad D = \frac{e^{-2bh}}{(1 + e^{-2bh})^2}.

Clearly 0 < D \le 1/4 for all b and h. If we assume D is constant, then the product (\eta/\gamma) f'_{\gamma b}(h) equals \eta f'_b(h); that is, scaling the gain b by \gamma while scaling the learning rate \eta by 1/\gamma leaves the weight increments, and hence the convergence time, unchanged. The assumption is approximately true as long as |bh| is very small or is kept constant in spite of the change in b.

(b) If the input data are scaled by 1/\alpha, then the increment of the weights at each step is exactly the same, so the convergence time is kept the same. (However, if the network is a multilayer one, the input scaling should be applied to the hidden units' outputs as well.)
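A brief numerical illustration of part (a), written here as a check rather than taken from the manual: it verifies the identity f'_b(h) = 4bD, the bound D \le 1/4, and the near-exact compensation \eta \to \eta/\gamma when the gain is scaled b \to \gamma b. The values of b, \gamma, \eta and the grid of h values are arbitrary.

```python
import numpy as np

b, gamma, eta = 0.8, 3.0, 0.1          # arbitrary gain, scale factor, learning rate
h = np.linspace(-2.0, 2.0, 81)

# identity: d/dh tanh(bh) = 4 b e^{-2bh} / (1 + e^{-2bh})^2 = 4 b D
D = np.exp(-2 * b * h) / (1 + np.exp(-2 * b * h)) ** 2
lhs = b * (1 - np.tanh(b * h) ** 2)    # derivative of tanh(bh) computed directly
rhs = 4 * b * D
print(np.max(np.abs(lhs - rhs)))       # ~0: the identity holds

# the factor D is bounded: 0 < D <= 1/4, with equality at h = 0
print(D.max() <= 0.25)                 # True

# compensation eta -> eta/gamma for the scaled gain gamma*b:
# (eta/gamma) f'_{gamma b}(h) is approximately eta f'_b(h) while |bh| is small
fprime_scaled = gamma * b * (1 - np.tanh(gamma * b * h) ** 2)
ratio = (eta / gamma) * fprime_scaled / (eta * lhs)
print(ratio[np.abs(b * h) < 0.05])     # entries close to 1 near h = 0
```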
