30 Years of Adaptive Neural Networks:
Perceptron, Madaline, and Backpropagation
BERNARD WIDROW, FELLOW, IEEE, AND MICHAEL A. LEHR
Fundamental developments in feedforward artificial neural networks from the past thirty years are reviewed. The central theme of this paper is a description of the history, origination, operating characteristics, and basic theory of several supervised neural network training algorithms including the Perceptron rule, the LMS algorithm, three Madaline rules, and the backpropagation technique. These methods were developed independently, but with the perspective of history they can all be related to each other. The concept underlying these algorithms is the "minimal disturbance principle," which suggests that new information should be injected into a network in a manner that disturbs stored information to the smallest extent possible.
This year marks the 30th anniversary of the Perceptron rule and the LMS algorithm, two early rules for training adaptive elements. Both algorithms were first published in 1960. In the years following these discoveries, many new techniques have been developed in the field of neural networks, and the discipline is growing rapidly. One early development was Steinbuch's Learning Matrix [1], a pattern recognition machine based on linear discriminant functions. At the same time, Widrow and his students devised Madaline Rule I (MRI), the earliest popular learning rule for neural networks with multiple adaptive elements [2]. Other early work included the "mode-seeking" technique of Stark, Okajima, and Whipple [3]. This was probably the first example of competitive learning in the literature, though it could be argued that earlier work by Rosenblatt on "spontaneous learning" [4], [5] deserves this distinction. Further pioneering work on competitive learning and self-organization was performed in the 1970s by von der Malsburg [6] and Grossberg [7]. Fukushima explored related ideas with his biologically inspired Cognitron and Neocognitron models [8], [9].
Manuscript received September 12, 1989; revised April 13, 1990. This work was sponsored by the SDIO Innovative Science and Technology Office and managed by ONR under contract no. NIDOTESS-GTI, by the Dept. of the Army Belvoir RD&E Center under contracts no. OAAK 7087 P-4 and no. DAAK 708K 001, by a grant from the Lockheed Missiles and Space Co., by NASA under contract no. NCA209, and by Rome Air Development Center under contract no. 3060338 D0, subcontract no. E21.
The authors are with the Information Systems Laboratory, Department of Electrical Engineering, Stanford University, Stanford, CA 94305, USA.
IEEE Log Number 9038824.
Widrow devised a reinforcement learning algorithm called "punish/reward" or "bootstrapping" [10], [11] in the mid-1960s. This can be used to solve problems when uncertainty about the error signal causes supervised training methods to be impractical. A related reinforcement learning approach was later explored in a classic paper by Barto, Sutton, and Anderson on the "credit assignment" problem [12]. Barto et al.'s technique is also somewhat reminiscent of Albus's adaptive CMAC, a distributed table-look-up system based on models of human memory [13], [14].
In the 1970s Grossberg developed his Adaptive Resonance Theory (ART), a number of novel hypotheses about the underlying principles governing biological neural systems [15]. These ideas served as the basis for later work by Carpenter and Grossberg involving three classes of ART architectures: ART 1 [16], ART 2 [17], and ART 3 [18]. These are self-organizing neural implementations of pattern clustering algorithms. Other important theory on self-organizing systems was pioneered by Kohonen with his work on feature maps [19], [20].
In the early 1980s, Hopfield and others introduced outer product rules as well as equivalent approaches based on the early work of Hebb [21] for training a class of recurrent (signal feedback) networks now called Hopfield models [22], [23]. More recently, Kosko extended some of the ideas of Hopfield and Grossberg to develop his adaptive Bidirectional Associative Memory (BAM) [24], a network model employing differential as well as Hebbian and competitive learning laws. Other significant models from the past decade include probabilistic ones such as Hinton, Sejnowski, and Ackley's Boltzmann Machine [25], [26] which, to oversimplify, is a Hopfield model that settles into solutions by a simulated annealing process governed by Boltzmann statistics. The Boltzmann Machine is trained by a clever two-phase Hebbian-based technique.
While these developments were taking place, adaptive systems research at Stanford traveled an independent path. After devising their Madaline I rule, Widrow and his students developed uses for the Adaline and Madaline. Early applications included, among others, speech and pattern recognition [27], weather forecasting [28], and adaptive controls [29]. Work then switched to adaptive filtering and adaptive signal processing [30] after attempts to develop learning rules for networks with multiple adaptive layers were unsuccessful. Adaptive signal processing proved to be a fruitful avenue for research with applications involving adaptive antennas [31], adaptive inverse controls [32], adaptive noise canceling [33], and seismic signal processing [30]. Outstanding work by Lucky and others at Bell Laboratories led to major commercial applications of adaptive filters and the LMS algorithm to adaptive equalization in high-speed modems [34], [35] and to adaptive echo cancellers for long distance telephone and satellite circuits [36]. After 20 years of research in adaptive signal processing, the work in Widrow's laboratory has once again returned to neural networks.
The first major extension of the feedforward neural network beyond Madaline I took place in 1971 when Werbos developed a backpropagation training algorithm which, in 1974, he first published in his doctoral dissertation [37].¹ Unfortunately, Werbos's work remained almost unknown in the scientific community. In 1982, Parker rediscovered the technique [39] and in 1985, published a report on it at M.I.T. [40]. Not long after Parker published his findings, Rumelhart, Hinton, and Williams [41], [42] also rediscovered the technique and, largely as a result of the clear framework within which they presented their ideas, they finally succeeded in making it widely known.
The elements used by Rumelhart et al. in the backpropagation network differ from those used in the earlier Madaline architectures. The adaptive elements in the original Madaline structure used hard-limiting quantizers (signums), while the elements in the backpropagation network use only differentiable nonlinearities, or "sigmoid" functions.² In digital implementations, the hard-limiting quantizer is more easily computed than any of the differentiable nonlinearities used in backpropagation networks. In 1987, Widrow, Winter, and Baxter looked back at the original Madaline I algorithm with the goal of developing a new technique that could adapt multiple layers of adaptive elements using the simpler hard-limiting quantizers. The result was Madaline Rule II (MRII) [43].
David Andes of the U.S. Naval Weapons Center, China Lake, CA, modified Madaline Rule II in 1988 by replacing the hard-limiting quantizers in the Adaline with sigmoid functions, thereby inventing Madaline Rule III (MRIII). Widrow and his students were first to recognize that this rule is mathematically equivalent to backpropagation.
The outline above gives only a partial view of the discipline, and many landmark discoveries have not been mentioned. Needless to say, the field of neural networks is quickly becoming a vast one, and in one short survey we could not hope to cover the entire subject in any detail. Consequently, many significant developments, including some of those mentioned above, are not discussed in this paper. The algorithms described are limited primarily to
¹We should note, however, that in the field of variational calculus the idea of error backpropagation through nonlinear systems existed centuries before Werbos first thought to apply this concept to neural networks. In the past 25 years, these methods have been used widely in the field of optimal control, as discussed by Le Cun [38].
²The term "sigmoid" is usually used in reference to monotonically increasing "S-shaped" functions, such as the hyperbolic tangent. In this paper, however, we generally use the term to denote any smooth nonlinear function at the output of a linear adaptive element. In other papers, these nonlinearities go by a variety of names, such as "squashing functions," "activation functions," "transfer characteristics," or "threshold functions."
those developed in our laboratory at Stanford, and to related techniques developed elsewhere, the most important of which is the backpropagation algorithm. Section II explores fundamental concepts, Section III discusses adaptation and the minimal disturbance principle, Sections IV and V cover error correction rules, Sections VI and VII delve into steepest-descent rules, and Section VIII provides a summary.
Information about the neural network paradigms not discussed in this paper can be obtained from a number of other sources, such as the concise survey by Lippmann [44], and the collection of classics by Anderson and Rosenfeld [45]. Much of the early work in the field from the 1960s is carefully reviewed in Nilsson's monograph [46]. A good view of some of the more recent results is presented in Rumelhart and McClelland's popular three-volume set [47]. A paper by Moore [48] presents a clear discussion about ART 1 and some of Grossberg's terminology. Another resource is the DARPA Study report [49] which gives a very comprehensive and readable "snapshot" of the field in 1988.
I. FUNDAMENTAL CONCEPTS
Today we can build computers and other machines that perform a variety of well-defined tasks with celerity and reliability unmatched by humans. No human can invert matrices or solve systems of differential equations at speeds rivaling modern workstations. Nonetheless, many problems remain to be solved to our satisfaction by any man-made machine, but are easily disentangled by the perceptual or cognitive powers of humans, and often lower mammals, or even fish and insects. No computer vision system can rival the human ability to recognize visual images formed by objects of all shapes and orientations under a wide range of conditions. Humans effortlessly recognize objects in diverse environments and lighting conditions, even when obscured by dirt, or occluded by other objects. Likewise, the performance of current speech-recognition technology pales when compared to the performance of the human adult who easily recognizes words spoken by different people, at different rates, pitches, and volumes, even in the presence of distortion or background noise.
The problems solved more effectively by the brain than by the digital computer typically have two characteristics: they are generally ill defined, and they usually require an enormous amount of processing. Recognizing the character of an object from its image on television, for instance, involves resolving ambiguities associated with distortion and lighting. It also involves filling in information about a three-dimensional scene which is missing from the two-dimensional image on the screen. An infinite number of three-dimensional scenes can be projected into a two-dimensional image. Nonetheless, the brain deals well with this ambiguity, and using learned cues usually has little difficulty correctly determining the role played by the missing dimension.
As anyone who has performed even simple filtering operations on images is aware, processing high-resolution images requires a great deal of computation. Our brains accomplish this by utilizing massive parallelism, with millions and even billions of neurons in parts of the brain working together to solve complicated problems. Because solid-state operational amplifiers and logic gates can compute many orders of magnitude faster than current estimates of the computational speed of neurons in the brain, we may soon be able to build relatively inexpensive machines with the ability to process as much information as the human brain. This enormous processing power will do little to help us solve problems, however, unless we can utilize it effectively. For instance, coordinating many thousands of processors, which must efficiently cooperate to solve a problem, is not a simple task. If each processor must be programmed separately, and if all contingencies associated with various ambiguities must be designed into the software, even a relatively simple problem can quickly become unmanageable. The slow progress over the past 25 years or so in machine vision and other areas of artificial intelligence is testament to the difficulties associated with solving ambiguous and computationally intensive problems on von Neumann computers and related architectures.
Thus, there is some reason to consider attacking certain problems by designing naturally parallel computers, which process information and learn by principles borrowed from the nervous systems of biological creatures. This does not necessarily mean we should attempt to copy the brain part for part. Although the bird served to inspire development of the airplane, birds do not have propellers, and airplanes do not operate by flapping feathered wings. The primary parallel between biological nervous systems and artificial neural networks is that each typically consists of a large number of simple elements that learn and are able to collectively solve complicated and ambiguous problems.
Today, most artificial neural network research and application is accomplished by simulating networks on serial computers. Speed limitations keep such networks relatively small, but even with small networks some surprisingly difficult problems have been tackled. Networks with fewer than 180 neural elements have been used successfully in vehicular control simulations [50], speech generation [51], [52], and undersea mine detection [49]. Small networks have also been used successfully in airport explosive detection [53], expert systems [54], [55], and scores of other applications. Furthermore, efforts to develop parallel neural network hardware are meeting with some success, and such hardware should be available in the future for attacking more difficult problems, such as speech recognition [56], [57].
Whether implemented in parallel hardware or simulated on a computer, all neural networks consist of a collection of simple elements that work together to solve problems. A basic building block of nearly all artificial neural networks, and most other adaptive systems, is the adaptive linear combiner.
A. The Adaptive Linear Combiner
The adaptive linear combiner is diagrammed in Fig. 1. Its output is a linear combination of its inputs. In a digital implementation, this element receives at time k an input signal vector or input pattern vector X_k = [x_0k, x_1k, x_2k, ..., x_nk]^T and a desired response d_k, a special input used to effect learning. The components of the input vector are weighted by a set of coefficients, the weight vector W_k = [w_0k, w_1k, w_2k, ..., w_nk]^T. The sum of the weighted inputs is then computed, producing a linear output, the inner product s_k = X_k^T W_k. The components of X_k may be either continuous analog values or binary values. The weights are essentially continuously variable, and can take on negative as well as positive values.

Fig. 1. Adaptive linear combiner.
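As a concrete illustration, the weighted-sum operation described above can be sketched in a few lines of Python (a minimal sketch; the function name and example values are ours, not from the paper):

```python
import numpy as np

def linear_combiner(x, w):
    """Adaptive linear combiner: returns the linear output s_k,
    the inner product of input pattern X_k and weight vector W_k."""
    return float(np.dot(x, w))

# Example at time k: a constant bias input x_0 = +1 plus three signal inputs
x_k = np.array([1.0, 0.5, -1.0, 2.0])   # input pattern vector X_k
w_k = np.array([0.2, 0.4, -0.3, 0.1])   # weight vector W_k
s_k = linear_combiner(x_k, w_k)          # s_k = X_k^T W_k = 0.9
```

The weights may be positive or negative, and the inputs may be analog or binary, exactly as described in the text.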
During the training process, input patterns and corresponding desired responses are presented to the linear combiner. An adaptation algorithm automatically adjusts the weights so that the output responses to the input patterns will be as close as possible to their respective desired responses. In signal processing applications, the most popular method for adapting the weights is the simple LMS (least mean square) algorithm [58], [59], often called the Widrow-Hoff delta rule [42]. This algorithm minimizes the sum of squares of the linear errors over the training set. The linear error ε_k is defined to be the difference between the desired response d_k and the linear output s_k during presentation k. Having this error signal is necessary for adapting the weights. When the adaptive linear combiner is embedded in a multi-element neural network, however, an error signal is often not directly available for each individual linear combiner and more complicated procedures must be devised for adapting the weight vectors. These procedures are the main focus of this paper.
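To make the delta rule concrete, here is a minimal Python sketch of LMS training of a single linear combiner. The learning rate mu and the two-pattern training set are illustrative choices of ours, not values from the paper:

```python
import numpy as np

def lms_step(w, x, d, mu=0.05):
    """One Widrow-Hoff (LMS) update. The linear error eps_k = d_k - s_k
    drives the weight change: W_{k+1} = W_k + 2*mu*eps_k*X_k."""
    s = np.dot(x, w)          # linear output s_k
    eps = d - s               # linear error
    return w + 2.0 * mu * eps * x

# Toy training set: (input pattern with bias component x_0 = +1, desired response)
training_set = [(np.array([1.0,  1.0,  1.0]),  1.0),
                (np.array([1.0, -1.0, -1.0]), -1.0)]

w = np.zeros(3)
for _ in range(100):                  # repeated presentations of the training set
    for x, d in training_set:
        w = lms_step(w, x, d)

# After training, each linear output closely matches its desired response.
```

Note that each update needs the error signal ε_k directly; as the text explains, it is precisely the absence of such a per-element error signal inside a multi-element network that motivates the more elaborate procedures covered later.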
B. A Linear Classifier—The Single Threshold Element
The basic building block used in many neural networks is the "adaptive linear element," or Adaline³ [58] (Fig. 2). This is an adaptive threshold logic element. It consists of an adaptive linear combiner cascaded with a hard-limiting quantizer, which is used to produce a binary ±1 output, y_k = sgn(s_k). The bias weight w_0k, which is connected to a constant input x_0 = +1, effectively controls the threshold level of the quantizer.
In single-element neural networks, an adaptive algorithm (such as the LMS algorithm, or the Perceptron rule) is often used to adjust the weights of the Adaline so that it responds correctly to as many patterns as possible in a training set that has binary desired responses. Once the weights are adjusted, the responses of the trained element can be tested by applying various input patterns. If the Adaline responds correctly with high probability to input patterns that were not included in the training set, it is said that generalization has taken place. Learning and generalization are among the most useful attributes of Adalines and neural networks.
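Continuing the sketch above, an Adaline can be modeled as a linear combiner followed by a signum quantizer. The weight values below are an illustrative choice of ours that makes the element realize a logical AND of two ±1 inputs:

```python
import numpy as np

def adaline_output(x, w):
    """Adaline: adaptive linear combiner cascaded with a hard-limiting
    quantizer (signum), giving a binary +/-1 output y_k = sgn(s_k).
    Here x[0] is the constant bias input +1, and the bias weight w[0]
    sets the quantizer's effective threshold level."""
    s = np.dot(x, w)                   # linear output s_k of the combiner
    return 1.0 if s >= 0.0 else -1.0   # hard-limiting quantizer

# With bias weight -1.5, the output is +1 only when x1 + x2 >= 1.5,
# i.e., only when both +/-1 inputs are "true": a logical AND.
w = np.array([-1.5, 1.0, 1.0])
y = adaline_output(np.array([1.0, 1.0, 1.0]), w)   # both inputs true -> +1
```

Shifting the bias weight moves the decision threshold, which is exactly the role the text assigns to w_0k.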
Linear Separability: With n binary inputs and one binary
³In the neural network literature, such elements are often referred to as "adaptive neurons." However, in a conversation between David Hubel of Harvard Medical School and Bernard Widrow, Dr. Hubel pointed out that the Adaline differs from the biological neuron in that it contains not only the neural cell body, but also the input synapses and a mechanism for training them.