HW 08
Each term w_kj y_j is computed in 2c operations; thus the contributions through the hidden units are computed in [2d + 2c] n_H time. Thus, the time for one iteration is the time to compute the hidden and output activations plus the time to compute and apply the weight updates; collecting terms, the time complexity per iteration is

    3 d n_H + 5 n_H c + c + 2 n_H.

Here, the number of iterations is proportional to the number of training patterns n, and thus the time complexity is

    (3 d n_H + 5 n_H c + c + 2 n_H) n.

4. Equation 20 in the text gives the sensitivity at a hidden unit,

    delta_j = f'(net_j) * Sum_k w_kj delta_k.

For a four-layer or higher-layer neural network, the sensitivity of a hidden unit is likewise given by the weighted sum of the sensitivities of the units in the layer above, multiplied by f'(net) at the unit in question; for instance, in the second hidden layer

    delta_l = f'(net_l) * Sum_k w_kl delta_k,

and in the first hidden layer

    delta_j = f'(net_j) * Sum_l w_lj delta_l.

CHAPTER 6. MULTILAYER NEURAL NETWORKS

10. We express the derivative of a sigmoid in terms of the sigmoid itself, for positive constants a and b, in the following cases.

(a) For f(net) = 1/(1 + e^{a net}), the derivative is

    df(net)/dnet = -a e^{a net}/(1 + e^{a net})^2 = -a f(net)(1 - f(net)).

(b) For f(net) = a tanh(b net), the derivative is

    df(net)/dnet = a b (1 - tanh^2(b net)) = (b/a)(a^2 - f^2(net)).

11. We use the following notation: the activations at the first (input), second, third, and fourth (output) layers are x_i, y_j, v_l, and z_k, respectively, and the indexes are clear from usage. The number of units in the first hidden layer is n_H1, and the number in the second hidden layer is n_H2.

Algorithm 0 (Four-layer backpropagation)

1 begin initialize x, w, criterion theta, learning rate eta, m <- 0
2   do m <- m + 1; x^m <- randomly chosen pattern
3     w_kl <- w_kl + eta delta_k v_l; w_lj <- w_lj + eta delta_l y_j; w_ji <- w_ji + eta delta_j x_i; until ||grad J(w)|| < theta
4 return w
5 end

Section 6.4

12. Suppose the input-to-hidden weights are all set equal to the same value, say w0; then w_ji = w0 for all i and j. Then we have

    net_j = Sum_i w_ji x_i = w0 Sum_i x_i,

which is identical for every hidden unit j, and hence y_j = f(net_j) is also the same for every j.

(b) Already shown above.

19. The assumption that the network can represent the underlying true distribution is not used before Eq. 28 in the text. For Eq. 29, however, we invoke g_k(x; w) = P(omega_k|x), which is used for

    Integral [g_k(x; w) - P(omega_k|x)] p(x) dx = 0.

This is true only when the above assumption is met. If the assumption is not met, the gradient descent procedure yields the closest projection to the posterior probability in the class of functions spanned by the network.

20.
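The two derivative identities in Problem 10 are easy to confirm with a finite-difference check. Below is a minimal Python sketch; the particular constants a and b, the test points, and the helper names are arbitrary choices for illustration, not from the text.

```python
import math

# Finite-difference check of the Problem 10 closed forms:
#   (a) f(net) = 1/(1 + e^{a net})  has  f' = -a f (1 - f)
#   (b) f(net) = a tanh(b net)      has  f' = (b/a)(a^2 - f^2)

def logistic(net, a):
    """f(net) = 1 / (1 + e^{a*net})."""
    return 1.0 / (1.0 + math.exp(a * net))

def scaled_tanh(net, a, b):
    """f(net) = a tanh(b*net)."""
    return a * math.tanh(b * net)

def numeric_deriv(f, x, eps=1e-6):
    """Central-difference approximation to df/dx."""
    return (f(x + eps) - f(x - eps)) / (2 * eps)

a, b = 1.7, 0.66
for net in (-2.0, -0.5, 0.0, 0.5, 2.0):
    f1 = logistic(net, a)
    assert abs(numeric_deriv(lambda h: logistic(h, a), net)
               - (-a * f1 * (1 - f1))) < 1e-6
    f2 = scaled_tanh(net, a, b)
    assert abs(numeric_deriv(lambda h: scaled_tanh(h, a, b), net)
               - (b / a) * (a * a - f2 * f2)) < 1e-6
print("derivative identities verified")
```

Expressing the derivative through f itself is what makes the backpropagation sensitivities cheap: the forward-pass activations can be reused in the backward pass.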
Recall the equation

    P(y|omega_k) = e^{A(w_k) + B(y, phi) + w_k^t y}.

(a) Given P(y|omega_k), we use Bayes' Theorem to write the posterior as

    P(omega_k|y) = P(y|omega_k) P(omega_k) / Sum_m P(y|omega_m) P(omega_m).

(b) We interpret A(.), w_k and phi as follows. Substituting the given form,

    P(omega_k|y) = e^{A(w_k) + B(y, phi) + w_k^t y} P(omega_k) / Sum_m e^{A(w_m) + B(y, phi) + w_m^t y} P(omega_m)
                 = e^{net_k} / Sum_m e^{net_m},

where net_k = A(w_k) + B(y, phi) + w_k^t y + ln P(omega_k). The term B(y, phi) is common to all classes and cancels in the ratio; thus A(w_k) + ln P(omega_k) plays the role of the bias, and w_k is the weight vector describing the separating hyperplane.

21. Backpropagation with softmax is done in the usual manner; all that must be evaluated differently are the sensitivities at the output and hidden layer units, since for the softmax output

    dz_k/dnet_k = z_k (1 - z_k).

(a) We are given the following terms:

    net_j = Sum_{i=1}^d x_i w_ji,
    net_k = Sum_{j=1}^{n_H} y_j w_kj,
    y_j = f(net_j),
    z_k = e^{net_k} / Sum_{m=1}^c e^{net_m},

and the error function

    J = (1/2) Sum_{k=1}^c (t_k - z_k)^2.

To derive the learning rule we have to compute dJ/dw_kj and dJ/dw_ji. We start with the former:

    dJ/dw_kj = (dJ/dnet_k)(dnet_k/dw_kj).

We next compute dJ/dnet_k. We also have

    dz_s/dnet_k = z_s(1 - z_s)   if s = k,
                = -z_s z_k       if s != k,

and finally

    dJ/dz_s = (-1)(t_s - z_s).

Putting these together we get

    dJ/dnet_k = Sum_{s != k} (t_s - z_s) z_s z_k - (t_k - z_k) z_k (1 - z_k).

We use dnet_k/dw_kj = y_j and obtain

    dJ/dw_kj = y_j [ Sum_{s != k} (t_s - z_s) z_s z_k - (t_k - z_k) z_k (1 - z_k) ].

Now we have to find the input-to-hidden weight contribution to J. By the chain rule we have

    dJ/dw_ji = (dJ/dy_j)(dy_j/dnet_j)(dnet_j/dw_ji).

We can at once find the last two partial derivatives: dy_j/dnet_j = f'(net_j) and dnet_j/dw_ji = x_i. For the first we have

    dJ/dy_j = Sum_k (dJ/dnet_k)(dnet_k/dy_j)
            = Sum_k w_kj [ Sum_{s != k} (t_s - z_s) z_s z_k - (t_k - z_k) z_k (1 - z_k) ].

We put all this together and find

    dJ/dw_ji = x_i f'(net_j) Sum_k w_kj [ Sum_{s != k} (t_s - z_s) z_s z_k - (t_k - z_k) z_k (1 - z_k) ].

Of course, the learning rule is then

    Delta w_kj = -eta dJ/dw_kj,    Delta w_ji = -eta dJ/dw_ji,

where the derivatives are as given above.
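The output-weight gradient derived in part (a) can be verified against a numerical gradient. Here is a minimal Python sketch; the layer sizes, random seed, target vector, and function names are my own assumptions for illustration.

```python
import math, random

# Numerical check of the Problem 21(a) output-weight gradient
#   dJ/dw_kj = y_j * [ sum_{s != k} (t_s - z_s) z_s z_k - (t_k - z_k) z_k (1 - z_k) ]
# for J = 1/2 sum_k (t_k - z_k)^2 with softmax outputs.

random.seed(0)
n_H, c = 4, 3
y = [random.uniform(-1, 1) for _ in range(n_H)]          # hidden activations
W = [[random.uniform(-1, 1) for _ in range(n_H)] for _ in range(c)]
t = [1.0, 0.0, 0.0]                                      # target vector

def forward(W):
    """Softmax outputs z_k from the hidden activations y."""
    net = [sum(W[k][j] * y[j] for j in range(n_H)) for k in range(c)]
    mx = max(net)
    e = [math.exp(nk - mx) for nk in net]                # numerically stable
    s = sum(e)
    return [ek / s for ek in e]

def J(W):
    z = forward(W)
    return 0.5 * sum((t[k] - z[k]) ** 2 for k in range(c))

z = forward(W)
eps = 1e-6
for k in range(c):
    for j in range(n_H):
        analytic = y[j] * (sum((t[s] - z[s]) * z[s] * z[k]
                               for s in range(c) if s != k)
                           - (t[k] - z[k]) * z[k] * (1 - z[k]))
        W[k][j] += eps; Jp = J(W)
        W[k][j] -= 2 * eps; Jm = J(W)
        W[k][j] += eps                                   # restore the weight
        assert abs((Jp - Jm) / (2 * eps) - analytic) < 1e-6
print("softmax gradient formula verified")
```

The extra sum over s != k is exactly the coupling the softmax introduces: perturbing net_k changes every output z_s, not just z_k.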
(b) We are given the cross-entropy criterion function

    J_ce = Sum_{k=1}^c t_k ln(t_k / z_k).

The learning rule derivation is the same as in part (a) except that we need to replace dJ/dz_s with dJ_ce/dz_s. Note that for the cross-entropy criterion we have dJ_ce/dz_s = -t_s/z_s. Then, following the steps in part (a), we have

    dJ_ce/dw_kj = y_j Sum_{s != k} t_s z_k - y_j t_k (1 - z_k).

Similarly, we have

    dJ_ce/dw_ji = x_i f'(net_j) Sum_k w_kj [ Sum_{s != k} t_s z_k - t_k (1 - z_k) ].

Of course, the learning rule is then

    Delta w_kj = -eta dJ_ce/dw_kj,    Delta w_ji = -eta dJ_ce/dw_ji,

where the derivatives are as given above.

22. In the two-category case, if g_1 ~ P(omega_1|x), then 1 - g_1 ~ P(omega_2|x), since we can assume the categories are mutually exclusive and exhaustive. But 1 - g_1 can be computed by a network with input-to-hidden weights identical to those used to compute g_1. From Eq. 27 in the text, we know that

    Integral [g_1(x; w) - P(omega_1|x)]^2 p(x) dx + Integral [g_2(x; w) - P(omega_2|x)]^2 p(x) dx

is a minimum; this implies that each term in the sum is minimized.

Section 6.7

23. Consider the weight update rules given by Eqs. 12 and 23 in the text.

(a) The weight updates contain the factor eta f'(net). If we take the sigmoid f_b(h) = tanh(bh), we have

    f_b(h) = 2/(1 + e^{-2bh}) - 1,
    f'_b(h) = 4b e^{-2bh}/(1 + e^{-2bh})^2 = 4bD,

where D = e^{-2bh}/(1 + e^{-2bh})^2. Clearly, 0 < D <= 0.25 for all b and h. If we scale the learning rate as eta/b, then the product (eta/b) f'_b(h) = 4 eta D will be equal to eta f'_1(h), which tells us the increment in the weight values will be the same, preserving the convergence time. This assumption will be approximately true as long as |bh| is very small or is kept constant in spite of the change in b.

(b) If the input data is scaled by 1/alpha, then the net activation h is scaled by 1/alpha as well, and the increment of the weights at each step will be essentially the same. Therefore the convergence time will be kept the same. (However, if the network is a multi-layer one, the input scaling should be applied to the hidden units' outputs as well.)
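The claims in Problem 23(a) — the factorization f'_b(h) = 4bD, the bound D <= 1/4, and the approximate equality of the scaled increments for small |bh| — can be checked numerically. A short Python sketch follows; the sample values of b, h, and eta are arbitrary choices for illustration.

```python
import math

# Numerical illustration of Problem 23(a): for f_b(h) = tanh(b h) the
# derivative factors as f'_b(h) = 4 b D with D = e^{-2bh}/(1 + e^{-2bh})^2,
# D never exceeds 1/4, and the scaled step (eta/b) f'_b(h) approaches
# eta f'_1(h) as |b h| -> 0.

def fprime(b, h):
    """Derivative of tanh(b*h) with respect to h."""
    return b * (1.0 - math.tanh(b * h) ** 2)

def D(b, h):
    """The factor D = e^{-2bh} / (1 + e^{-2bh})^2."""
    e = math.exp(-2.0 * b * h)
    return e / (1.0 + e) ** 2

eta = 0.1
for b in (0.5, 1.0, 3.0):
    for h in (-1.0, -0.1, 0.0, 0.1, 1.0):
        # factorization f'_b(h) = 4 b D
        assert abs(fprime(b, h) - 4.0 * b * D(b, h)) < 1e-12
        # bound on D, with the maximum 1/4 attained at h = 0
        assert 0.0 < D(b, h) <= 0.25

# for small |b h| the scaled increments agree closely
b, h = 3.0, 1e-4
assert abs((eta / b) * fprime(b, h) - eta * fprime(1.0, h)) < 1e-6
print("Problem 23(a) identities verified")
```

The bound on D is why the raw update factor eta f'_b(h) grows linearly with b: all of the b-dependence outside D is the explicit 4b, which the eta/b rescaling cancels.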