30 Years of Adaptive Neural Networks:
Perceptron, Madaline, and Backpropagation
BERNARD WIDROW, FELLOW, IEEE, AND MICHAEL A. LEHR
Fundamental developments in feedforward artificial neural networks from the past thirty years are reviewed. The central theme of this paper is a description of the history, origination, operating characteristics, and basic theory of several supervised neural network training algorithms including the Perceptron rule, the LMS algorithm, three Madaline rules, and the backpropagation technique. These methods were developed independently, but with the perspective of history they can all be related to each other. The concept underlying these algorithms is the "minimal disturbance principle," which suggests that new information should be injected into a network in a manner that disturbs stored information to the smallest extent possible.
This year marks the 30th anniversary of the Perceptron rule and the LMS algorithm, two early rules for training adaptive elements. Both algorithms were first published in 1960. In the years following these discoveries, many new techniques have been developed in the field of neural networks, and the discipline is growing rapidly. One early development was Steinbuch's Learning Matrix [1], a pattern recognition machine based on linear discriminant functions. At the same time, Widrow and his students devised Madaline Rule I (MRI), the earliest popular learning rule for neural networks with multiple adaptive elements [2]. Other early work included the "mode-seeking" technique of Stark, Okajima, and Whipple [3]. This was probably the first example of competitive learning in the literature, though it could be argued that earlier work by Rosenblatt on "spontaneous learning" [4], [5] deserves this distinction. Further pioneering work on competitive learning and self-organization was performed in the 1970s by von der Malsburg [6] and Grossberg [7]. Fukushima explored related ideas with his biologically inspired Cognitron and Neocognitron models [8], [9].
Manuscript received September 12, 1989; revised April 13, 1990. This work was sponsored by the SDIO Innovative Science and Technology Office and managed by ONR under contract no. NIDOTESS-GTI, by the Dept. of the Army Belvoir RD&E Center under contracts no. OAAK 7087 P-4 and no. DAAK 708K 001, by a grant from the Lockheed Missiles and Space Co., by NASA under contract no. NCA209, and by Rome Air Development Center under contract no. 3060338 D0, subcontract no. E21.
The authors are with the Information Systems Laboratory, Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA.
IEEE Log Number 9038824.
Widrow devised a reinforcement learning algorithm called "punish/reward" or "bootstrapping" [10], [11] in the mid-1960s. This can be used to solve problems when uncertainty about the error signal causes supervised training methods to be impractical. A related reinforcement learning approach was later explored in a classic paper by Barto, Sutton, and Anderson on the "credit assignment" problem [12]. Barto et al.'s technique is also somewhat reminiscent of Albus's adaptive CMAC, a distributed table-look-up system based on models of human memory [13], [14].
In the 1970s Grossberg developed his Adaptive Resonance Theory (ART), a number of novel hypotheses about the underlying principles governing biological neural systems [15]. These ideas served as the basis for later work by Carpenter and Grossberg involving three classes of ART architectures: ART 1 [16], ART 2 [17], and ART 3 [18]. These are self-organizing neural implementations of pattern clustering algorithms. Other important theory on self-organizing systems was pioneered by Kohonen with his work on feature maps [19], [20].
In the early 1980s, Hopfield and others introduced outer product rules as well as equivalent approaches based on the early work of Hebb [21] for training a class of recurrent (signal feedback) networks now called Hopfield models [22], [23]. More recently, Kosko extended some of the ideas of Hopfield and Grossberg to develop his adaptive Bidirectional Associative Memory (BAM) [24], a network model employing differential as well as Hebbian and competitive learning laws. Other significant models from the past decade include probabilistic ones such as Hinton, Sejnowski, and Ackley's Boltzmann Machine [25], [26] which, to oversimplify, is a Hopfield model that settles into solutions by a simulated annealing process governed by Boltzmann statistics. The Boltzmann Machine is trained by a clever two-phase Hebbian-based technique.
While these developments were taking place, adaptive systems research at Stanford traveled an independent path. After devising their Madaline I rule, Widrow and his students developed uses for the Adaline and Madaline. Early applications included, among others, speech and pattern recognition [27], weather forecasting [28], and adaptive controls [29]. Work then switched to adaptive filtering and adaptive signal processing [30] after attempts to develop learning rules for networks with multiple adaptive layers were unsuccessful. Adaptive signal processing proved to be a fruitful avenue for research with applications involving adaptive antennas [31], adaptive inverse controls [32], adaptive noise canceling [33], and seismic signal processing [30]. Outstanding work by Lucky and others at Bell Laboratories led to major commercial applications of adaptive filters and the LMS algorithm to adaptive equalization in high-speed modems [34], [35] and to adaptive echo cancellers for long distance telephone and satellite circuits [36]. After 20 years of research in adaptive signal processing, the work in Widrow's laboratory has once again returned to neural networks.
The first major extension of the feedforward neural network beyond Madaline I took place in 1971 when Werbos developed a backpropagation training algorithm which, in 1974, he first published in his doctoral dissertation [37].¹ Unfortunately, Werbos's work remained almost unknown in the scientific community. In 1982, Parker rediscovered the technique [39] and in 1985, published a report on it at M.I.T. [40]. Not long after Parker published his findings, Rumelhart, Hinton, and Williams [41], [42] also rediscovered the technique and, largely as a result of the clear framework within which they presented their ideas, they finally succeeded in making it widely known.
The elements used by Rumelhart et al. in the backpropagation network differ from those used in the earlier Madaline architectures. The adaptive elements in the original Madaline structure used hard-limiting quantizers (signums), while the elements in the backpropagation network use only differentiable nonlinearities, or "sigmoid" functions.² In digital implementations, the hard-limiting quantizer is more easily computed than any of the differentiable nonlinearities used in backpropagation networks. In 1987, Widrow, Winter, and Baxter looked back at the original Madaline I algorithm with the goal of developing a new technique that could adapt multiple layers of adaptive elements using the simpler hard-limiting quantizers. The result was Madaline Rule II (MRII) [43].
David Andes of the U.S. Naval Weapons Center, China Lake, CA, modified Madaline Rule II in 1988 by replacing the hard-limiting quantizers in the Adaline with sigmoid functions, thereby inventing Madaline Rule III (MRIII). Widrow and his students were first to recognize that this rule is mathematically equivalent to backpropagation.
The outline above gives only a partial view of the discipline, and many landmark discoveries have not been mentioned. Needless to say, the field of neural networks is quickly becoming a vast one, and in one short survey we could not hope to cover the entire subject in any detail. Consequently, many significant developments, including some of those mentioned above, are not discussed in this paper. The algorithms described are limited primarily to
¹We should note, however, that in the field of variational calculus the idea of error backpropagation through nonlinear systems existed centuries before Werbos first thought to apply this concept to neural networks. In the past 25 years, these methods have been used widely in the field of optimal control, as discussed by Le Cun [38].
²The term "sigmoid" is usually used in reference to monotonically increasing "S-shaped" functions, such as the hyperbolic tangent. In this paper, however, we generally use the term to denote any smooth nonlinear function at the output of a linear adaptive element. In other papers, these nonlinearities go by a variety of names, such as "squashing functions," "activation functions," "transfer characteristics," or "threshold functions."
those developed in our laboratory at Stanford, and to related techniques developed elsewhere, the most important of which is the backpropagation algorithm. Section II explores fundamental concepts, Section III discusses adaptation and the minimal disturbance principle, Sections IV and V cover error correction rules, Sections VI and VII delve into steepest-descent rules, and Section VIII provides a summary.
Information about the neural network paradigms not discussed in this paper can be obtained from a number of other sources, such as the concise survey by Lippmann [44], and the collection of classics by Anderson and Rosenfeld [45]. Much of the early work in the field from the 1960s is carefully reviewed in Nilsson's monograph [46]. A good view of some of the more recent results is presented in Rumelhart and McClelland's popular three-volume set [47]. A paper by Moore [48] presents a clear discussion about ART 1 and some of Grossberg's terminology. Another resource is the DARPA Study report [49] which gives a very comprehensive and readable "snapshot" of the field in 1988.
I. FUNDAMENTAL CONCEPTS
Today we can build computers and other machines that perform a variety of well-defined tasks with celerity and reliability unmatched by humans. No human can invert matrices or solve systems of differential equations at speeds rivaling modern workstations. Nonetheless, many problems remain to be solved to our satisfaction by any man-made machine, but are easily disentangled by the perceptual or cognitive powers of humans, and often lower mammals, or even fish and insects. No computer vision system can rival the human ability to recognize visual images formed by objects of all shapes and orientations under a wide range of conditions. Humans effortlessly recognize objects in diverse environments and lighting conditions, even when obscured by dirt, or occluded by other objects. Likewise, the performance of current speech-recognition technology pales when compared to the performance of the human adult who easily recognizes words spoken by different people, at different rates, pitches, and volumes, even in the presence of distortion or background noise.
The problems solved more effectively by the brain than by the digital computer typically have two characteristics: they are generally ill defined, and they usually require an enormous amount of processing. Recognizing the character of an object from its image on television, for instance, involves resolving ambiguities associated with distortion and lighting. It also involves filling in information about a three-dimensional scene which is missing from the two-dimensional image on the screen. An infinite number of three-dimensional scenes can be projected into a two-dimensional image. Nonetheless, the brain deals well with this ambiguity, and using learned cues usually has little difficulty correctly determining the role played by the missing dimension.
As anyone who has performed even simple filtering operations on images is aware, processing high-resolution images requires a great deal of computation. Our brains accomplish this by utilizing massive parallelism, with millions and even billions of neurons in parts of the brain working together to solve complicated problems. Because solid-state operational amplifiers and logic gates can compute many orders of magnitude faster than current estimates of the computational speed of neurons in the brain, we may soon be able to build relatively inexpensive machines with the ability to process as much information as the human brain. This enormous processing power will do little to help us solve problems, however, unless we can utilize it effectively. For instance, coordinating many thousands of processors, which must efficiently cooperate to solve a problem, is not a simple task. If each processor must be programmed separately, and if all contingencies associated with various ambiguities must be designed into the software, even a relatively simple problem can quickly become unmanageable. The slow progress over the past 25 years or so in machine vision and other areas of artificial intelligence is testament to the difficulties associated with solving ambiguous and computationally intensive problems on von Neumann computers and related architectures.
Thus, there is some reason to consider attacking certain problems by designing naturally parallel computers, which process information and learn by principles borrowed from the nervous systems of biological creatures. This does not necessarily mean we should attempt to copy the brain part for part. Although the bird served to inspire development of the airplane, birds do not have propellers, and airplanes do not operate by flapping feathered wings. The primary parallel between biological nervous systems and artificial neural networks is that each typically consists of a large number of simple elements that learn and are able to collectively solve complicated and ambiguous problems.
Today, most artificial neural network research and application is accomplished by simulating networks on serial computers. Speed limitations keep such networks relatively small, but even with small networks some surprisingly difficult problems have been tackled. Networks with fewer than 180 neural elements have been used successfully in vehicular control simulations [50], speech generation [51], [52], and undersea mine detection [49]. Small networks have also been used successfully in airport explosive detection [53], expert systems [54], [55], and scores of other applications. Furthermore, efforts to develop parallel neural network hardware are meeting with some success, and such hardware should be available in the future for attacking more difficult problems, such as speech recognition [56], [57].
Whether implemented in parallel hardware or simulated on a computer, all neural networks consist of a collection of simple elements that work together to solve problems. A basic building block of nearly all artificial neural networks, and most other adaptive systems, is the adaptive linear combiner.
A. The Adaptive Linear Combiner
The adaptive linear combiner is diagrammed in Fig. 1. Its output is a linear combination of its inputs. In a digital implementation, this element receives at time k an input signal vector or input pattern vector X_k = [x_0k, x_1k, x_2k, ..., x_nk]^T and a desired response d_k, a special input used to effect learning. The components of the input vector are weighted by a set of coefficients, the weight vector W_k = [w_0k, w_1k, w_2k, ..., w_nk]^T. The sum of the weighted inputs is then computed, producing a linear output, the inner product s_k = X_k^T W_k. The components of X_k may be either continuous analog values or binary values. The weights are essentially continuously variable, and can take on negative as well as positive values.

Fig. 1. Adaptive linear combiner.
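As a concrete illustration, the weighted-sum operation described above can be sketched in a few lines of Python (a minimal sketch; the function name and example values are ours, not from the paper):

```python
import numpy as np

def linear_combiner(x, w):
    """Adaptive linear combiner: returns the linear output s_k,
    the inner product of input pattern X_k and weight vector W_k."""
    return float(np.dot(x, w))

# Example at time k: a constant bias input x_0 = +1 plus three signal inputs
x_k = np.array([1.0, 0.5, -1.0, 2.0])   # input pattern vector X_k
w_k = np.array([0.2, 0.4, -0.3, 0.1])   # weight vector W_k
s_k = linear_combiner(x_k, w_k)          # s_k = X_k^T W_k = 0.9
```

The weights may be positive or negative, and the inputs may be analog or binary, exactly as described in the text.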
During the training process, input patterns and corresponding desired responses are presented to the linear combiner. An adaptation algorithm automatically adjusts the weights so that the output responses to the input patterns will be as close as possible to their respective desired responses. In signal processing applications, the most popular method for adapting the weights is the simple LMS (least mean square) algorithm [58], [59], often called the Widrow-Hoff delta rule [42]. This algorithm minimizes the sum of squares of the linear errors over the training set. The linear error ε_k is defined to be the difference between the desired response d_k and the linear output s_k during presentation k. Having this error signal is necessary for adapting the weights. When the adaptive linear combiner is embedded in a multi-element neural network, however, an error signal is often not directly available for each individual linear combiner and more complicated procedures must be devised for adapting the weight vectors. These procedures are the main focus of this paper.
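To make the delta rule concrete, here is a minimal Python sketch of LMS training of a single linear combiner. The learning rate mu and the two-pattern training set are illustrative choices of ours, not values from the paper:

```python
import numpy as np

def lms_step(w, x, d, mu=0.05):
    """One Widrow-Hoff (LMS) update. The linear error eps_k = d_k - s_k
    drives the weight change: W_{k+1} = W_k + 2*mu*eps_k*X_k."""
    s = np.dot(x, w)          # linear output s_k
    eps = d - s               # linear error
    return w + 2.0 * mu * eps * x

# Toy training set: (input pattern with bias component x_0 = +1, desired response)
training_set = [(np.array([1.0,  1.0,  1.0]),  1.0),
                (np.array([1.0, -1.0, -1.0]), -1.0)]

w = np.zeros(3)
for _ in range(100):                  # repeated presentations of the training set
    for x, d in training_set:
        w = lms_step(w, x, d)

# After training, each linear output closely matches its desired response.
```

Note that each update needs the error signal ε_k directly; as the text explains, it is precisely the absence of such a per-element error signal inside a multi-element network that motivates the more elaborate procedures covered later.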
B. A Linear Classifier—The Single Threshold Element
The basic building block used in many neural networks is the "adaptive linear element," or Adaline³ [58] (Fig. 2). This is an adaptive threshold logic element. It consists of an adaptive linear combiner cascaded with a hard-limiting quantizer, which is used to produce a binary ±1 output, y_k = sgn(s_k). The bias weight w_0k, which is connected to a constant input x_0 = +1, effectively controls the threshold level of the quantizer.
In single-element neural networks, an adaptive algorithm (such as the LMS algorithm, or the Perceptron rule) is often used to adjust the weights of the Adaline so that it responds correctly to as many patterns as possible in a training set that has binary desired responses. Once the weights are adjusted, the responses of the trained element can be tested by applying various input patterns. If the Adaline responds correctly with high probability to input patterns that were not included in the training set, it is said that generalization has taken place. Learning and generalization are among the most useful attributes of Adalines and neural networks.
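Continuing the sketch above, an Adaline can be modeled as a linear combiner followed by a signum quantizer. The weight values below are an illustrative choice of ours that makes the element realize a logical AND of two ±1 inputs:

```python
import numpy as np

def adaline_output(x, w):
    """Adaline: adaptive linear combiner cascaded with a hard-limiting
    quantizer (signum), giving a binary +/-1 output y_k = sgn(s_k).
    Here x[0] is the constant bias input +1, and the bias weight w[0]
    sets the quantizer's effective threshold level."""
    s = np.dot(x, w)                   # linear output s_k of the combiner
    return 1.0 if s >= 0.0 else -1.0   # hard-limiting quantizer

# With bias weight -1.5, the output is +1 only when x1 + x2 >= 1.5,
# i.e., only when both +/-1 inputs are "true": a logical AND.
w = np.array([-1.5, 1.0, 1.0])
y = adaline_output(np.array([1.0, 1.0, 1.0]), w)   # both inputs true -> +1
```

Shifting the bias weight moves the decision threshold, which is exactly the role the text assigns to w_0k.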
Linear Separability: With n binary inputs and one binary
³In the neural network literature, such elements are often referred to as "adaptive neurons." However, in a conversation between David Hubel of Harvard Medical School and Bernard Widrow, Dr. Hubel pointed out that the Adaline differs from the biological neuron in that it contains not only the neural cell body, but also the input synapses and a mechanism for training them.