
CHAPTER 17

LEARNING

That men do not learn very much from the lessons of history is the most important of all the lessons of history.

-Aldous Huxley
(1894-1963), English Writer and Author

17.1 WHAT IS LEARNING?


One of the most often heard criticisms of AI is that machines cannot be called intelligent until they are able to learn to do new things and to adapt to new situations, rather than simply doing as they are told to do. There can be little question that the ability to adapt to new surroundings and to solve new problems is an important characteristic of intelligent entities. Can we expect to see such abilities in programs? Ada Augusta, one of the earliest philosophers of computing, wrote that

The Analytical Engine has no pretensions whatever to originate anything. It can do whatever we know how to order it to perform. [Lovelace, 1961]

This remark has been interpreted by several AI critics as saying that computers cannot learn. In fact, it does not say that at all. Nothing prevents us from telling a computer how to interpret its inputs in such a way that its performance gradually improves.
Rather than asking in advance whether it is possible for computers to "learn," it is much more enlightening to try to describe exactly what activities we mean when we say "learning" and what mechanisms could be used to enable us to perform those activities. Simon [1983] has proposed that learning denotes

...changes in the system that are adaptive in the sense that they enable the system to do the same task or tasks drawn from the same population more efficiently and more effectively the next time.

As thus defined, learning covers a wide range of phenomena. At one end of the spectrum is skill refinement. People get better at many tasks simply by practicing. The more you ride a bicycle or play tennis, the better you get. At the other end of the spectrum lies knowledge acquisition. As we have seen, many AI programs draw heavily on knowledge as their source of power. Knowledge is generally acquired through experience, and such acquisition is the focus of this chapter.

Knowledge acquisition itself includes many different activities. Simple storing of computed information, or rote learning, is the most basic learning activity. Many computer programs, e.g., database systems, can be said to "learn" in this sense, although most people would not call such simple storage learning. However, many AI programs are able to improve their performance substantially through rote-learning techniques, and we will look at one example in depth: the checker-playing program of Samuel [1963].

Another way we learn is through taking advice from others. Advice taking is similar to rote learning, but high-level advice may not be in a form simple enough for a program to use directly in problem solving. The advice may need to be first operationalized, a process explored in Section 17.3.
People also learn through their own problem-solving experience. After solving a complex problem, we remember the structure of the problem and the methods we used to solve it. The next time we see the problem, we can solve it more efficiently. Moreover, we can generalize from our experience to solve related problems more easily. In contrast to advice taking, learning from problem-solving experience does not usually involve gathering new knowledge that was previously unavailable to the learning program. That is, the program remembers its experiences and generalizes from them, but does not add to the transitive closure1 of its knowledge, in the sense that an advice-taking program would, i.e., by receiving stimuli from the outside world. In large problem spaces, however, efficiency gains are critical. Practically speaking, learning can mean the difference between solving a problem rapidly and not solving it at all. In addition, programs that learn through problem-solving experience may be able to come up with qualitatively better solutions in the future.
Another form of learning that does involve stimuli from the outside is learning from examples. We often
learn to classify things in the world without being given explicit rules. For example, adults can differentiate
between cats and dogs, but small children often cannot. Somewhere along the line, we induce a method for
telling cats from dogs based on seeing numerous examples of each. Learning from examples usually involves a teacher who helps us classify things by correcting us when we are wrong. Sometimes, however, a program can discover things without the aid of a teacher.
AI researchers have proposed many mechanisms for doing the kinds of learning described above. In this chapter, we discuss several of them. But keep in mind throughout this discussion that learning is itself a problem-solving process. In fact, it is very difficult to formulate a precise definition of learning that distinguishes it from other problem-solving tasks. Thus it should come as no surprise that, throughout this chapter, we will make extensive use of both the problem-solving mechanisms and the knowledge representation techniques that were presented in Parts I and II.

17.2 ROTE LEARNING


When a computer stores a piece of data, it is performing a rudimentary form of learning. After all, this act of storage presumably allows the program to perform better in the future (otherwise, why bother?). In the case of data caching, we store computed values so that we do not have to recompute them later. When computation is more expensive than recall, this strategy can save a significant amount of time. Caching has been used in AI programs to produce some surprising performance improvements. Such caching is known as rote learning.
In Chapter 12, we mentioned one of the earliest game-playing programs, Samuel's checkers program [Samuel, 1963]. This program learned to play checkers well enough to beat its creator. It exploited two kinds of learning: rote learning, which we look at now, and parameter (or coefficient) adjustment, which is described in Section 17.4.1. Samuel's program used the minimax search procedure to explore checkers game trees. As

1 The transitive closure of a program's knowledge is that knowledge plus whatever the program can logically deduce from it.

with all such programs, time constraints permitted it to search only a few levels in the tree. (The exact number varied depending on the situation.) When it could search no deeper, it applied its static evaluation function to the board position and used that score to continue its search of the game tree. When it finished searching the tree and propagating the values backward, it had a score for the position represented by the root of the tree. It could then choose the best move and make it. But it also recorded the board position at the root of the tree and the backed-up score that had just been computed for it. This situation is shown in Fig. 17.1(a). Now suppose that in a later game, the situation shown in Fig. 17.1(b) were to arise. Instead of using the static evaluation function to compute a score for position A, the stored value for A can be used. This creates the effect of having searched an additional several ply, since the stored value for A was computed by backing up values from exactly such a search.
Fig. 17.1 Storing Backed-Up Values
Rote learning of this sort is very simple. It does not appear to involve any sophisticated problem-solving capabilities. But even it shows the need for some capabilities that will become increasingly important in more complex learning systems. These capabilities include:
• Organized Storage of Information - In order for it to be faster to use a stored value than to recompute it, there must be a way to access the appropriate stored value quickly. In Samuel's program, this was done by indexing board positions by a few important characteristics, such as the number of pieces. But as the complexity of the stored information increases, more sophisticated techniques are necessary.
• Generalization - The number of distinct objects that might potentially be stored can be very large. To keep the number of stored objects down to a manageable level, some kind of generalization is necessary. In Samuel's program, for example, the number of distinct objects that could be stored was equal to the number of different board positions that can arise in a game. Only a few simple forms of generalization were used in Samuel's program to cut down that number. All positions are stored as though White is to move. This cuts the number of stored positions in half. When possible, rotations along the diagonal are also combined. Again, though, as the complexity of the learning process increases, so too does the need for generalization.

At this point, we have begun to see one way in which learning is similar to other kinds of problem solving. Its success depends on a good organizational structure for its knowledge base.
17.3 LEARNING BY TAKING ADVICE


A computer can do very little without a program for it to run. When a programmer writes a series of instructions into a computer, a rudimentary kind of learning is taking place: the programmer is a sort of teacher, and the computer is a sort of student. After being programmed, the computer is now able to do something it previously could not. Executing the program may not be such a simple matter, however. Suppose the program is written

in a high-level language like LISP. Some interpreter or compiler must intervene to change the teacher's instructions into code that the machine can execute directly.
People process advice in an analogous way. In chess, the advice "fight for control of the center of the board" is useless unless the player can translate the advice into concrete moves and plans. A computer program might make use of the advice by adjusting its static evaluation function to include a factor based on the number of center squares attacked by its own pieces.
Mostow [1983] describes a program called FOO, which accepts advice for playing hearts, a card game. A human user first translates the advice from English into a representation that FOO can understand. For example, "Avoid taking points" becomes:

(avoid (take-points me) (trick))


FOO must operationalize this advice by turning it into an expression that contains concepts and actions FOO can use when playing the game of hearts. One strategy FOO can follow is to UNFOLD an expression by replacing some term by its definition. By UNFOLDing the definition of avoid, FOO comes up with:

(achieve (not (during (trick) (take-points me))))

FOO considers the advice to apply to the player called "me." Next, FOO UNFOLDs the definition of trick:

(achieve (not (during
                (scenario
                  (each p1 (players) (play-card p1))
                  (take-trick (trick-winner)))
                (take-points me))))

In other words, the player should avoid taking points during the scenario consisting of (1) players playing cards and (2) one player taking the trick. FOO then uses case analysis to determine which steps could cause one to take points. It rules out step 1 on the basis that it knows of no intersection of the concepts take-points and play-card. But step 2 could affect taking points, so FOO UNFOLDs the definition of take-points:

(achieve (not (there-exists c1 (cards-played)
                (there-exists c2 (point-cards)
                  (during (take (trick-winner) c1)
                          (take me c2))))))

This advice says that the player should avoid taking point cards during the process of the trick-winner taking the trick. The question for FOO now is: Under what conditions does (take me c2) occur during (take (trick-winner) c1)? By using a technique called partial match, FOO hypothesizes that points will be taken if me = trick-winner and c2 = c1. It transforms the advice into:

(achieve (not (and (have-points (cards-played))
                   (= (trick-winner) me))))

This means "Do not win a trick that has points." We have not traveled very far conceptually from "avoid taking points," but it is important to note that the current vocabulary is one that FOO can understand in terms of actually playing the game of hearts. Through a number of other transformations, FOO eventually settles on:

(achieve (=> (and (in-suit-led (card-of me))
                  (possible (trick-has-points)))
             (low (card-of me))))

In other words, when playing a card that is the same suit as the card that was played first, if the trick possibly contains points, then play a low card. At last, FOO has translated the rather vague advice "avoid taking points" into a specific, usable heuristic. FOO is able to play a better game of hearts after receiving this advice. A human can watch FOO play, detect new mistakes, and correct them through yet more advice, such as "play high cards when it is safe to do so." The ability to operationalize knowledge is critical for systems that learn from a teacher's advice. It is also an important component of explanation-based learning, another form of learning discussed in Section 17.6.

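The core move in this dialogue is UNFOLDing: repeatedly replacing a term by its stored definition until the advice is expressed in concepts the player can act on. The toy Python sketch below illustrates the idea on nested-list expressions; the dictionary of definitions and the expression syntax are simplified stand-ins invented for this example, not FOO's actual representation.

# A toy sketch of operationalization by UNFOLDing definitions.
# The definitions below are illustrative, not FOO's real knowledge base.
DEFINITIONS = {
    "avoid": lambda event, during: ["achieve", ["not", ["during", during, event]]],
    "trick": lambda: ["scenario",
                      ["each", "p1", ["players"], ["play-card", "p1"]],
                      ["take-trick", ["trick-winner"]]],
}

def unfold(expr):
    """Replace any term that has a definition by that definition, recursively."""
    if not isinstance(expr, list):
        return expr
    head, *args = expr
    if head in DEFINITIONS:
        return unfold(DEFINITIONS[head](*[unfold(a) for a in args]))
    return [head] + [unfold(a) for a in args]

advice = ["avoid", ["take-points", "me"], ["trick"]]
print(unfold(advice))
# -> ['achieve', ['not', ['during', ['scenario', ...], ['take-points', 'me']]]]

A real operationalizer also needs the case analysis and partial-match steps described above; unfolding alone only rewrites the advice into lower-level vocabulary.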
17.4 LEARNING IN PROBLEM-SOLVING


In the last section, we saw how a problem solver could improve its performance by taking advice from a teacher. Can a program get better without the aid of a teacher? It can, by generalizing from its own experiences.
17.4.1 Learning by Parameter Adjustment
Many programs rely on an evaluation procedure that combines information from several sources into a single
summary statistic. Game-playing programs do this in their static evaluation functions, in which a variety of
factors, such as piece advantage and mobility, are combined into a single score reflecting the desirability of a
particular board position. Pattern classification programs often combine several features to determine the
correct category into which a given stimulus should be placed. In designing such programs, it is often difficult to know a priori how much weight should be attached to each feature being used. One way of finding the correct weights is to begin with some estimate of the correct settings and then to let the program modify the
settings on the basis of its experience. Features that appear to be good predictors of overall success will have
their weights increased, while those that do not will have their weights decreased, perhaps even to the point of
being dropped entirely.
Samuel's checkers program [Samuel, 1963] exploited this kind of learning in addition to the rote learning
described above, and it provides a good example of its use. As its static evaluation function, the program used
a polynomial of the form
c1t1 + c2t2 + ... + c16t16
The t terms are the values of the sixteen features that contribute to the evaluation. The c terms are the coefficients (weights) that are attached to each of these values. As learning progresses, the c values will change.
The most important question in the design of a learning program based on parameter adjustment is "When should the value of a coefficient be increased and when should it be decreased?" The second question to be answered is then "By how much should the value be changed?" The simple answer to the first question is that the coefficients of terms that predicted the final outcome accurately should be increased, while the coefficients of poor predictors should be decreased. In some domains, this is easy to do. If a pattern classification program uses its evaluation function to classify an input and it gets the right answer, then all the terms that predicted that answer should have their weights increased. But in game-playing programs, the problem is more difficult. The program does not get any concrete feedback from individual moves. It does not find out for sure until the end of the game whether it has won. But many moves have contributed to that final outcome. Even if the program wins, it may have made some bad moves along the way. The problem of appropriately assigning responsibility to each of the steps that led to a single outcome is known as the credit assignment problem.
Samuel's program exploits one technique, albeit imperfect, for solving this problem. Assume that the initial values chosen for the coefficients are good enough that the total evaluation function produces values

that are fairly reasonable measures of the correct score even if they are not as accurate as we hope to get them. Then this evaluation function can be used to provide feedback to itself. Move sequences that lead to positions with higher values can be considered good (and the terms in the evaluation function that suggested them can be reinforced).
Because of the limitations of this approach, however, Samuel's program did two other things, one of which provided an additional test that progress was being made and the other of which generated additional nudges to keep the process out of a rut:
• When the program was in learning mode, it played against another copy of itself. Only one of the copies altered its scoring function during the game; the other remained fixed. At the end of the game, if the copy with the modified function won, then the modified function was accepted. Otherwise, the old one was retained. If, however, this happened very many times, then some drastic change was made to the function in an attempt to get the process going in a more profitable direction.
• Periodically, one term in the scoring function was eliminated and replaced by another. This was possible because, although the program used only sixteen features at any one time, it actually knew about thirty-eight. This replacement differed from the rest of the learning procedure since it created a sudden change in the scoring function rather than a gradual shift in its weights.
This process of learning by successive modifications to the weights of terms in a scoring function has many limitations, mostly arising out of its lack of exploitation of any knowledge about the structure of the problem with which it is dealing and the logical relationships among the problem's components. In addition, because the learning procedure is a variety of hill climbing, it suffers from the same difficulties as do other hill-climbing programs. Parameter adjustment is certainly not a solution to the overall learning problem. But it is often a useful technique, either in situations where very little additional knowledge is available or in programs in which it is combined with more knowledge-intensive methods. We have more to say about this type of learning in Chapter 18.
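A minimal sketch of this idea is given below: a linear evaluation function whose coefficients are nudged toward features that correlated with the final outcome of a game. The specific update rule, the learning rate, and the crude credit assignment (every position in the game shares the blame equally) are illustrative assumptions, not Samuel's actual procedure.

# Sketch of learning by parameter adjustment for a linear evaluation function.
# The update rule and learning rate are illustrative assumptions.
def evaluate(weights, features):
    """Score a position as the weighted sum c1*t1 + c2*t2 + ... + cn*tn."""
    return sum(c * t for c, t in zip(weights, features))

def adjust(weights, game_positions, outcome, rate=0.01):
    """Reinforce terms that pointed toward the eventual outcome (+1 win, -1 loss)."""
    new_weights = list(weights)
    for features in game_positions:                 # crude credit assignment
        for i, t in enumerate(features):
            new_weights[i] += rate * outcome * t
    return new_weights

weights = [0.5, 0.5, 0.5]                           # three features for brevity
game = [[2, 0, 1], [3, 1, 0], [4, 1, 2]]            # feature vectors seen during one game
weights = adjust(weights, game, outcome=+1)         # the program won this game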

17.4.2 Learning with Macro-Operators


We saw in Section 17.2 how rote learning was used in the context of a checker-playing program. Similar
techniques can be used in more general problem-solving programs. The idea is the same: to avoid expensive
recomputation. For example, suppose you are faced with the problem of getting to the downtown post office.
Your solution may involve getting in your car, starting it, and driving along a certain route. Substantial planning
may go into choosing the appropriate route, but you need not plan about how to go about starting your car. You are free to treat START-CAR as an atomic action, even though it really consists of several actions: sitting down, adjusting the mirror, inserting the key, and turning the key. Sequences of actions that can be treated as a whole are called macro-operators.
Macro-operators were used in the early problem-solving system STRIPS [Fikes and Nilsson, 1971; Fikes et al., 1972]. We discussed the operator and goal structures of STRIPS in Section 13.1, but STRIPS also has a learning component. After each problem-solving episode, the learning component takes the computed plan and stores it away as a macro-operator, or MACROP. A MACROP is just like a regular operator except that it consists of a sequence of actions, not just a single one. A MACROP's preconditions are the initial conditions of the problem just solved, and its postconditions correspond to the goal just achieved. In its simplest form, the caching of previously computed plans is similar to rote learning.
Suppose we are given an initial blocks world situation in which ON(C, B) and ON(A, Table) are both true. STRIPS can achieve the goal ON(A, B) by devising a plan with the four steps UNSTACK(C, B), PUTDOWN(C), PICKUP(A), STACK(A, B). STRIPS now builds a MACROP with preconditions ON(C, B), ON(A, Table) and postconditions ON(C, Table), ON(A, B). The body of the MACROP consists of the four steps just mentioned. In future planning, STRIPS is free to use this complex macro-operator just as it would use any other operator.
But rarely will STRIPS see the exact same problem twice. New problems will differ from previous problems.
We would still like the problem solver to make efficient use of the knowledge it gained from its previous experiences. By generalizing MACROPs before storing them, STRIPS is able to accomplish this. The simplest idea for generalization is to replace all of the constants in the macro-operator by variables. Instead of storing the MACROP described in the previous paragraph, STRIPS can generalize the plan to consist of the steps UNSTACK(x1, x2), PUTDOWN(x1), PICKUP(x3), STACK(x3, x2), where x1, x2, and x3 are variables. This plan can then be stored with preconditions ON(x1, x2), ON(x3, Table) and postconditions ON(x1, Table), ON(x3, x2). Such a MACROP can now apply in a variety of situations.
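The sketch below shows this simplest generalization step directly: walk over a ground plan and its pre- and postconditions, replacing each distinct constant with a fresh variable. The tuple-based encoding of steps and conditions is an assumption made for illustration, not STRIPS's internal representation.

# Sketch of the simplest MACROP generalization: constants become variables.
# The (name, args) tuple encoding of steps and conditions is an illustrative assumption.
def generalize_macrop(steps, preconds, postconds, keep=("Table",)):
    """Replace every constant (except those in `keep`) with a variable x1, x2, ..."""
    mapping = {}
    def var_for(const):
        if const in keep:
            return const
        if const not in mapping:
            mapping[const] = "x" + str(len(mapping) + 1)
        return mapping[const]
    def rewrite(literals):
        return [(name, tuple(var_for(a) for a in args)) for name, args in literals]
    return rewrite(steps), rewrite(preconds), rewrite(postconds)

plan = [("UNSTACK", ("C", "B")), ("PUTDOWN", ("C",)),
        ("PICKUP", ("A",)), ("STACK", ("A", "B"))]
pre  = [("ON", ("C", "B")), ("ON", ("A", "Table"))]
post = [("ON", ("C", "Table")), ("ON", ("A", "B"))]
print(generalize_macrop(plan, pre, post))
# -> UNSTACK(x1, x2), PUTDOWN(x1), PICKUP(x3), STACK(x3, x2), with matching conditions

As the discussion that follows shows, this purely syntactic replacement can overgeneralize, which is why STRIPS re-proves the generalized plan to recover constraints.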
Generalization is not so easy, however. Sometimes constants must retain their specific values. Suppose our domain included an operator called STACK-ON-B(x), with preconditions that both x and B be clear, and with postcondition ON(x, B). Consider the same problem as above:

start: ON(C, B)          goal: ON(A, B)

STRIPS might come up with the plan UNSTACK(C, B), PUTDOWN(C), STACK-ON-B(A). Let's generalize this plan and store it as a MACROP. The precondition becomes ON(x3, x2), the postcondition becomes ON(x1, x2), and the plan itself becomes UNSTACK(x3, x2), PUTDOWN(x3), STACK-ON-B(x1).
Now, suppose we encounter a slightly different problem:

start: ON(E, C)          goal: ON(A, C)
       ON(D, B)

The generalized MACROP we just stored seems well-suited to solving this problem if we let x1 = A, x2 = C, and x3 = E. Its preconditions are satisfied, so we construct the plan UNSTACK(E, C), PUTDOWN(E), STACK-ON-B(A). But this plan does not work. The problem is that the postcondition of the MACROP is overgeneralized. This operation is only useful for stacking blocks onto B, which is not what we need in this new example. In this case, this difficulty will be discovered when the last step is attempted. Although we cleared C, which is where we wanted to put A, we failed to clear B, which is where the MACROP is going to
try to put it. Since B is not clear, STACK-ON-B cannot be executed. If B had happened to be clear, the
MACROP would have executed to completion, but it would not have accomplished the stated goal.
In reality, STRIPS uses a more complex generalization procedure. First, all constants are replaced by variables. Then, for each operator in the parameterized plan, STRIPS reevaluates its preconditions. In our example, the preconditions of steps 1 and 2 are satisfied, but the only way to ensure that B is clear for step 3 is to assume that block x2, which was cleared by the UNSTACK operator, is actually block B. Through "re-proving" that the generalized plan works, STRIPS locates constraints of this kind.
More recent work on macro-operators appears in Korf [1985b]. It turns out that the set of problems for which macro-operators are critical are exactly those problems with nonserializable subgoals. Nonserializability means that working on one subgoal will necessarily interfere with the previous solution to another subgoal. Recall that we discussed such problems in connection with nonlinear planning (Section 13.5). Macro-operators can be useful in such cases, since one macro-operator can produce a small global change in the world, even though the individual operators that make it up produce many undesirable local changes.

For example, consider the 8-puzzle. Once a program has correctly placed the first four tiles, it is difficult to place the fifth tile without disturbing the first four. Because disturbing previously solved subgoals is detected as a bad thing by heuristic scoring functions, it is strongly resisted. For many problems, including the 8-puzzle and Rubik's cube, weak methods based on heuristic scoring are therefore insufficient. Hence, we either need domain-specific knowledge, or else a new weak method. Fortunately, we can learn the domain-specific knowledge we need in the form of macro-operators. Thus, macro-operators can be viewed as a weak method for learning. In the 8-puzzle, for example, we might have a macro (a complex, prestored sequence of operators) for placing the fifth tile without disturbing any of the first four tiles externally (although in fact they are disturbed within the macro itself). Korf [1985b] gives an algorithm for learning a complete set of macro-operators. This approach contrasts with STRIPS, which learned its MACROPs gradually, from experience. Korf's algorithm runs in time proportional to the time it takes to solve a single problem without macro-operators.
17.4.3 Learning by Chunking


Chunking is a process similar in flavor to macro-operators. The idea of chunking comes from the psychological
literature on memory and problem solving. Its computational basis is in production systems, of the type studied in Chapter 6. Recall that in that chapter we described the SOAR system and discussed its use of control knowledge. SOAR also exploits chunking [Laird et al., 1986] so that its performance can increase with experience. In fact, the designers of SOAR hypothesize that chunking is a universal learning method,
i.e., it can account for all types of learning in intelligent systems.
SOAR solves problems by firing productions, which are stored in long-term memory. Some of those
firings turn out to be more useful than others. When SOAR detects a useful sequence of production firings, it creates a chunk, which is essentially a large production that does the work of an entire sequence of smaller ones. As in MACROPs, chunks are generalized before they are stored.
Recall from Section 6.5 that SOAR is a uniform processing architecture. Problems like choosing which subgoals to tackle and which operators to try (i.e., search control problems) are solved with the same mechanisms as problems in the original problem space. Because the problem solving is uniform, chunking can be used to learn general search control knowledge in addition to operator sequences. For example, if SOAR tries several
different operators, but only one leads to a useful path in the search space, then SOAR builds productions that
help it choose operators more wisely in the future.
SOAR has used chunking to replicate the macro-operator results described in the last section. In solving
the 8-puzzle, for example, SOAR learns how to place a given tile without permanently disturbing the previously placed tiles. Given the way that SOAR learns, several chunks may encode a single macro-operator, and one chunk may participate in a number of macro sequences. Chunks are generally applicable toward any goal state. This contrasts with macro tables, which are structured toward reaching a particular goal state from any initial state. Also, chunking emphasizes how learning can occur during problem solving, while macro tables are usually built during a preprocessing stage. As a result, SOAR is able to learn within trials as well as across trials. Chunks learned during the initial stages of solving a problem are applicable in the later stages of the same problem-solving episode. After a solution is found, the chunks remain in memory, ready for use in the next problem.
The price that SOAR pays for this generality and flexibility is speed. At present, chunking is inadequate for duplicating the contents of large, directly computed macro-operator tables.
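As a rough illustration of what building a chunk involves, the sketch below composes a useful sequence of production firings into one larger production: its conditions are the facts the sequence needed from outside, and its actions are the facts the sequence added. The production representation is a simplification invented for this example, not SOAR's.

# Toy sketch of chunking: collapse a sequence of production firings into one production.
# Each production is (conditions, actions), both sets of ground facts (a simplification).
def build_chunk(firing_sequence):
    """Return one production roughly equivalent to firing the whole sequence."""
    external_conditions = set()   # facts the sequence needed but did not produce itself
    produced = set()              # facts added by earlier firings in the sequence
    for conditions, actions in firing_sequence:
        external_conditions |= (conditions - produced)
        produced |= actions
    return external_conditions, produced

p1 = ({"goal(place-tile-5)", "blank-adjacent(5)"}, {"operator(slide-5)"})
p2 = ({"operator(slide-5)", "tile-at(5, src)"}, {"tile-at(5, dst)"})
print(build_chunk([p1, p2]))
# conditions: goal(place-tile-5), blank-adjacent(5), tile-at(5, src)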
17.4.4 The Utility Problem
PRODIGY [Minton et al., 1989], which we described in Section 6.5, also acquires control knowledge automatically. PRODIGY employs several learning mechanisms. One mechanism uses explanation-based learning (EBL), a learning method we discuss in Section 17.6. PRODIGY can examine a trace of its own problem-solving behavior and try to explain why certain paths failed. The program uses those explanations to formulate control rules that help the problem solver avoid those paths in the future. So while SOAR learns from examples of successful problem solving, PRODIGY also learns from its failures.
A major contribution of the work on EBL in PRODIGY [Minton, 1988] was the identification of the utility problem in learning systems. While new search control knowledge can be of great benefit in solving future problems efficiently, there are also some drawbacks. The learned control rules can take up large amounts of memory and the search program must take the time to consider each rule at each step during problem solving. Considering a control rule amounts to seeing if its postconditions are desirable and seeing if its preconditions are satisfied. This is a time-consuming process. So while learned rules may reduce problem-solving time by directing the search more carefully, they may also increase problem-solving time by forcing the problem solver to consider them. If we only want to minimize the number of node expansions in the search space, then the more control rules we learn, the better. But if we want to minimize the total CPU time required to solve a problem, we must consider this trade-off.
PRODIGY maintains a utility measure for each control rule. This measure takes into account the average savings provided by the rule, the frequency of its application, and the cost of matching it. If a proposed rule has a negative utility, it is discarded (or "forgotten"). If not, it is placed in long-term memory with the other rules. It is then monitored during subsequent problem solving. If its utility falls, the rule is discarded. Empirical experiments have demonstrated the effectiveness of keeping only those control rules with high utility. Utility considerations apply to a wide range of learning systems. For example, for a discussion of how to deal with large, expensive chunks in SOAR, see Tambe and Rosenbloom [1989].
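The trade-off PRODIGY manages can be summarized in a one-line estimate: a rule is worth keeping only if the search time it saves, weighted by how often it applies, exceeds the time spent matching it. The formula and numbers below are an illustrative reading of that idea, not PRODIGY's exact accounting.

# Sketch of a utility measure for a learned control rule (illustrative, not PRODIGY's exact formula).
def utility(avg_savings, application_freq, match_cost):
    """Expected benefit per problem minus the cost of testing the rule on every problem."""
    return avg_savings * application_freq - match_cost

# A rule that saves 40 time units when it fires, fires on 5% of problems, and costs
# 3 units of matching effort per problem has negative utility and would be forgotten:
if utility(avg_savings=40, application_freq=0.05, match_cost=3) < 0:
    print("discard (or 'forget') the rule")     # 40 * 0.05 - 3 = -1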

17.5 LEARNING FROM EXAMPLES: INDUCTION


Classification is the process of assigning to a particular input the name of a class to which it belongs. The classes from which the classification procedure can choose can be described in a variety of ways. Their definition will depend on the use to which they will be put.
Classification is an important component of many problem-solving tasks. In its simplest form, it is presented as a straightforward recognition task. An example of this is the question "What letter of the alphabet is this?" But often classification is embedded inside another operation. To see how this can happen, consider a problem-solving system that contains the following production rule:

If:   the current goal is to get from place A to place B, and
      there is a WALL separating the two places
then: look for a DOORWAY in the WALL and go through it.

To use this rule successfully, the system's matching routine must be able to identify an object as a wall. Without this, the rule can never be invoked. Then, to apply the rule, the system must be able to recognize a doorway.
Before classification can be done, the classes it will use must be defined. This can be done in a variety of ways, including:
• Isolate a set of features that are relevant to the task domain. Define each class by a weighted sum of values of these features. Each class is then defined by a scoring function that looks very similar to the scoring functions often used in other situations, such as game playing. Such a function has the form:

  c1t1 + c2t2 + c3t3 + ...

Each t corresponds to a value of a relevant parameter, and each c represents the weight to be attached to the corresponding t. Negative weights can be used to indicate features whose presence usually constitutes negative evidence for a given class.

For example, if the task is weather prediction, the parameters can be such measurements as rainfall and location of cold fronts. Different functions can be written to combine these parameters to predict sunny, cloudy, rainy, or snowy weather.
• Isolate a set of features that are relevant to the task domain. Define each class as a structure composed of those features.
For example, if the task is to identify animals, the body of each type of animal can be stored as a structure, with various features representing such things as color, length of neck, and feathers.
There are advantages and disadvantages to each of these general approaches. The statistical approach taken by the first scheme presented here is often more efficient than the structural approach taken by the second. But the second is more flexible and more extensible.
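As a concrete illustration of the first approach, the sketch below scores an input's features against a weighted scoring function for each class and picks the class with the highest score. The feature names and weights are invented for the example.

# Sketch of classification by per-class weighted scoring functions (weights are illustrative).
CLASS_WEIGHTS = {
    "rainy": {"rainfall": 2.0, "cold_front_nearby": 1.0, "humidity": 0.5},
    "sunny": {"rainfall": -2.0, "cold_front_nearby": -0.5, "humidity": -0.2},
}

def classify(features):
    """Return the class whose scoring function c1*t1 + c2*t2 + ... is largest."""
    def score(weights):
        return sum(w * features.get(name, 0.0) for name, w in weights.items())
    return max(CLASS_WEIGHTS, key=lambda cls: score(CLASS_WEIGHTS[cls]))

print(classify({"rainfall": 3.0, "cold_front_nearby": 1.0, "humidity": 0.8}))   # rainy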
Regardless of the way that classes are to be described, it is often difficult to construct, by hand, good class definitions. This is particularly true in domains that are not well understood or that change rapidly. Thus the idea of producing a classification program that can evolve its own class definitions is appealing. This task of constructing class definitions is called concept learning, or induction. The techniques used for this task must, of course, depend on the way that classes (concepts) are described. If classes are described by scoring functions, then concept learning can be done using the technique of coefficient adjustment described in Section 17.4.1. If, however, we want to define classes structurally, some other technique for learning class definitions is necessary. In this section, we present three such techniques.

17.5.1 Winston's Learning Program


Winston [1975] describes an early structural concept learning program. This program operated in a simple blocks world domain. Its goal was to construct representations of the definitions of concepts in the blocks domain. For example, it learned the concepts House, Tent, and Arch shown in Fig. 17.2. The figure also shows an example of a near miss for each concept. A near miss is an object that is not an instance of the concept in question but that is very similar to such instances.

Fig. 17.2 Some Blocks World Concepts

The program started with a line drawing of a blocks world structure. It used procedures such as the one
described in Section 14.3 to analyze the drawing and
construct a semantic net representation of the structural description of the object(s). This structural description was then provided as input to the learning program. An example of such a structural description for the House of Fig. 17.2 is shown in Fig. 17.3(a). Node A represents the entire structure, which is composed of two parts: node B, a Wedge, and node C, a Brick. Figures 17.3(b) and 17.3(c) show descriptions of the two Arch structures of Fig. 17.2. These descriptions are identical except for the types of the objects on the top; one is a Brick while the other is a Wedge. Notice that the two supporting objects are related not only by left-of and right-of links, but also by a does-not-marry link, which says that the two objects do not marry. Two objects marry if they have faces that touch and they have a common edge. The marry relation is critical in the definition of an Arch. It is the difference between the first arch structure and the near miss arch structure shown in Fig. 17.2.
The basic approach that Winston's program took to the problem of concept formation can be described as follows:
1. Begin with a structural description of one known instance of the concept. Call that description the concept definition.
Fig. 17.3 Structural Descriptions

2. Examine descriptions of other known instances of the concept. Generalize the definition to include them.
3. Examine descriptions of near misses of the concept. Restrict the definition to exclude these.
Steps 2 and 3 of this procedure can be interleaved.
Steps 2 and 3 of this procedure rely heavily on a comparison process by which similarities and differences between structures can be detected. This process must function in much the same way as does any other matching process, such as one to determine whether a given production rule can be applied to a particular problem state. Because differences as well as similarities must be found, the procedure must perform not just literal but also approximate matching. The output of the comparison procedure is a skeleton structure describing the commonalities between the two input structures. It is annotated with a set of comparison notes that describe specific similarities and differences between the inputs.
To see how this approach works, we trace it through the process of learning what an arch is. Suppose that the arch description of Fig. 17.3(b) is presented first. It then becomes the definition of the concept Arch. Then suppose that the arch description of Fig. 17.3(c) is presented. The comparison routine will return a structure similar to the two input structures except that it will note that the objects represented by the nodes labeled C are not identical. This structure is shown in Fig. 17.4.

Fig. 17.4 The Comparison of Two Arches

The c-note link from node C describes the difference found by the comparison routine. It notes that the difference occurred in the isa link, and that in the first structure the isa link pointed to Brick, and in the second it pointed to Wedge. It also notes that if we were to follow isa links from Brick and Wedge, these links would eventually merge. At this point, a new description of the concept Arch can be generated. This description could say simply that node C must be either a Brick or a Wedge. But since this particular disjunction has no previously known significance, it is probably better to trace up the isa hierarchies of Brick and Wedge until they merge. Assuming that that happens at the node Object, the Arch definition shown in Fig. 17.5 can be built.

Fig. 17.5 The Arch Description after Two Examples
Next, suppose that the near miss arch shown in Fig. 17.2 is presented. This time, the comparison routine will note that the only difference between the current definition and the near miss is in the does-not-marry link between nodes B and D. But since this is a near miss, we do not want to broaden the definition to include it. Instead, we want to restrict the definition so that it is specifically excluded. To do this, we modify the link does-not-marry, which may simply be recording something that has happened by chance to be true of the small number of examples that have been presented. It must now say must-not-marry. The Arch description at this point is shown in Fig. 17.6. Actually, must-not-marry should not be a completely new link. There must be some structure among link types to reflect the relationship between marry, does-not-marry, and must-not-marry.

Fig. 17.6 The Arch Description after a Near Miss
marry.
Notice how the problem-solving and knowledge representation techniques we covered in earlier chapters are brought to bear on the problem of learning. Semantic networks were used to describe block structures, and an isa hierarchy was used to describe relationships among already known objects. A matching process was used to detect similarities and differences between structures, and hill climbing allowed the program to evolve a more and more accurate concept definition.
This approach to structural concept learning is not without its problems. One major problem is that a teacher must guide the learning program through a carefully chosen sequence of examples. In the next section, we explore a learning technique that is insensitive to the order in which examples are presented.
17.5.2 Version Spaces
Mitchell [1977; 1978] describes another approach to concept learning called version spaces. The goal is the same: to produce a description that is consistent with all positive examples but no negative examples in the training set. But while Winston's system did this by evolving a single concept description, version spaces work by maintaining a set of possible descriptions and evolving that set as new examples and near misses are presented. As in the previous section, we need some sort of representation language for examples so that we can describe exactly what the system sees in an example. For now, we assume a simple frame-based language, although version spaces can be constructed for more general representation languages. Consider Fig. 17.7, a frame representing an individual car.

Car023
origin : Japan
manufacturer : Honda
color : Blue
decade : 1970
type : Economy
Fig. 17.7 An Example of the Concept Car
Now, suppose that each slot may contain only the discrete values shown in Fig. 17.8. The choice of features and values is called the bias of the learning system. By being embedded in a particular program and by using particular representations, every learning system is biased, because it learns some things more easily than others. In our example, the bias is fairly simple; e.g., we can learn concepts that have to do with car manufacturers, but not car owners. In more complex systems, the bias is less obvious. A clear statement of the bias of a learning system is very important to its evaluation.
origin ∈ {Japan, USA, Britain, Germany, Italy}
manufacturer ∈ {Honda, Toyota, Ford, Chrysler, Jaguar, BMW, Fiat}
color ∈ {Blue, Green, Red, White}
decade ∈ {1950, 1960, 1970, 1980, 1990, 2000}
type ∈ {Economy, Luxury, Sports}
Fig. 17.8 Representation Language for Cars

Concept descriptions, as well as training examples, can be stated in terms of these slots and values. For example, the concept "Japanese economy car" can be represented as in Fig. 17.9. The names x1, x2, and x3 are variables. The presence of x2, for example, indicates that the color of a car is not relevant to whether the car is a Japanese economy car. Now the learning problem is: Given a representation language such as in Fig. 17.8, and given positive and negative training examples such as those in Fig. 17.7, how can we produce a concept description such as that in Fig. 17.9 that is consistent with all the training examples?

origin : Japan
manufacturer : x1
color : x2
decade : x3
type : Economy
Fig. 17.9 The Concept "Japanese economy car"
Before we proceed to the version space algorithm, we should make some observations about the representation. Some descriptions are more general than others. For example, the description in Fig. 17.9 is more general than the one in Fig. 17.7. In fact, the representation language defines a partial ordering of descriptions. A portion of that partial ordering is shown in Fig. 17.10.
The entire partial ordering is called the concept space, and can be depicted as in Fig. 17.11. At the top of the concept space is the null description, consisting only of variables, and at the bottom are all the possible training instances, which contain no variables. Before we receive any training examples, we know that the target concept lies somewhere in the concept space. For example, if every possible description is an instance of the intended concept, then the null description is the concept definition, since it matches everything. On the other hand, if the target concept includes only a single example, then one of the descriptions at the bottom of the concept space is the desired concept definition. Most target concepts, of course, lie somewhere in between these two extremes.
As we process training examples, we want to refine our notion of where the target concept might lie. Our current hypothesis can be represented as a subset of the concept space called the version space. The version space is the largest collection of descriptions that is consistent with all the training examples seen so far.
Fig. 17.10 Partial Ordering of Concepts Specified by the Representation Language

How can we represent the version space? The version space is simply a set of descriptions, so an initial idea is to keep an explicit list of those descriptions. Unfortunately, the number of descriptions in the concept space is exponential in the number of features and values. So enumerating them is prohibitive. However, it turns out that the version space has a concise representation. It consists of two subsets of the concept space. One subset, called G, contains the most general descriptions consistent with the training examples seen so far; the other subset, called S, contains the most specific descriptions consistent with the training examples. The version space is the set of all descriptions that lie between some element of G and some element of S in the partial order of the concept space.

Fig. 17.11 Concept and Version Spaces

This representation of the version space is not only efficient for storage, but also for modification. Intuitively, each time we receive a positive training example, we want to make the S set more general. Negative training examples serve to make the G set more specific. If the S and G sets converge, our range of hypotheses will narrow to a single concept description. The algorithm for narrowing the version space is called the candidate elimination algorithm.
361
Lem 1111w
ISSi.. L$-'I'

Algorithm: Candidate Elimination

Given: A representation language and a set of positive and negative examples expressed in that language.
Compute: A concept description that is consistent with all the positive examples and none of the negative examples.

1. Initialize G to contain one element: the null description (all features are variables).
2. Initialize S to contain one element: the first positive example.
3. Accept a new training example.
   If it is a positive example, first remove from G any descriptions that do not cover the example. Then, update the S set to contain the most specific set of descriptions in the version space that cover the example and the current elements of the S set. That is, generalize the elements of S as little as possible so that they cover the new training example.
   If it is a negative example, first remove from S any descriptions that cover the example. Then, update the G set to contain the most general set of descriptions in the version space that do not cover the example. That is, specialize the elements of G as little as possible so that the negative example is no longer covered by any of the elements of G.
4. If S and G are both singleton sets, then if they are identical, output their value and halt. If they are both singleton sets but they are different, then the training cases were inconsistent. Output this result and halt. Otherwise, go to step 3.
Let us trace the operation of the candidate elimination algorithm. Suppose we want to learn the concept of "Japanese economy car" from the examples in Fig. 17.12. G and S both start out as singleton sets. G contains the null description (see Fig. 17.11), and S contains the first positive training example. The version space now contains all descriptions that are consistent with this first example:2
(1) origin: Japan, mfr: Honda,    color: Blue,  decade: 1980, type: Economy   (+)
(2) origin: Japan, mfr: Toyota,   color: Green, decade: 1970, type: Sports    (-)
(3) origin: Japan, mfr: Toyota,   color: Blue,  decade: 1990, type: Economy   (+)
(4) origin: USA,   mfr: Chrysler, color: Red,   decade: 1980, type: Economy   (-)
(5) origin: Japan, mfr: Honda,    color: White, decade: 1980, type: Economy   (+)

Fig. 17.12 Positive and Negative Examples of the Concept "Japanese economy car"

G = {(x1, x2, x3, x4, x5)}
S = {(Japan, Honda, Blue, 1980, Economy)}

Now we are ready to process the second example. The G set must be specialized in such a way that the negative example is no longer in the version space. In our representation language, specialization involves replacing variables with constants. (Note: The G set must be specialized only to descriptions that are within the current version space, not outside of it.) Here are the available specializations:

2 In this example, we skip slot names in the descriptions; we just list values in the order in which the slots have been shown in the preceding figure.

G = {(x1, Honda, x3, x4, x5), (x1, x2, Blue, x4, x5), (x1, x2, x3, 1980, x5), (x1, x2, x3, x4, Economy)}

The S set is unaffected by the negative example. Now we come to the third example, a positive one. The first order of business is to remove from the G set any descriptions that are inconsistent with the positive example. Our new G set is:

G = {(x1, x2, Blue, x4, x5), (x1, x2, x3, x4, Economy)}

We must now generalize the S set to include the new example. This involves replacing constants with variables. Here is the new S set:

S = {(Japan, x2, Blue, x4, Economy)}

At this point, the S and G sets specify a version space (a space of candidate descriptions) that can be translated roughly into English as: "The target concept may be as specific as 'Japanese, blue economy car,' or as general as either 'blue car' or 'economy car.'"
Next, we get another negative example, a car whose origin is USA. The S set is unaffected, but the G set must be specialized to avoid covering the new example. The new G set is:

G = {(Japan, x2, Blue, x4, x5), (Japan, x2, x3, x4, Economy)}

We now know that the car must be Japanese, because all of the descriptions in the version space contain
Japan as origin. 3 Our final example is a positive one. We first remove from the G set any descriptions that are
inconsistent with it, leaving:

G = {(Japan, x2, x3, x4, Economy)}

We then generalize the S set to include the new example:

S = {(Japan, x2, x3, x4, Economy)}

S and G are both singletons, so the algorithm has converged on the target concept. No more examples are needed.
There are several things to note about the candidate elimination algorithm. First, it is a least-commitment algorithm. The version space is pruned as little as possible at each step. Thus, even if all the positive training examples are Japanese cars, the algorithm will not reject the possibility that the target concept may include cars of other origin, until it receives a negative example that forces the rejection. This means that if the training data are sparse, the S and G sets may never converge to a single description; the system may learn only partially specified concepts. Second, the algorithm involves exhaustive, breadth-first search through the version space. We can see this in the algorithm for updating the G set. Contrast this with the depth-first behavior of Winston's learning program. Third, in our simple representation language, the S set always contains exactly one element, because any two positive examples always have exactly one generalization. Other representation languages may not share this property.

3 It could be the case that our target concept is "not Chrysler," but we will ignore this possibility because our representation language is not powerful enough to express negation and disjunction.

The version space approach can be applied to a wide variety of learning tasks and representation languages. The algorithm above can be extended to handle continuously valued features and hierarchical knowledge (see Exercises). However, version spaces have several deficiencies. One is the large space requirements of the exhaustive, breadth-first search mentioned above. Another is that inconsistent data, also called noise, can cause the candidate elimination algorithm to prune the target concept from the version space prematurely. In the car example above, if the third training instance had been mislabeled (-) instead of (+), the target concept "Japanese economy car" would never be reached. Also, given enough erroneous negative examples, the G set can be specialized so far that the version space becomes empty. In that case, the algorithm concludes that no concept fits the training examples.
One solution to this problem [Mitchell, 1978] is to maintain several G and S sets. One G set is consistent with all the training instances, another is consistent with all but one, another with all but two, etc. (and the same for the S set). When an inconsistency arises, the algorithm switches to G and S sets that are consistent with most, but not all, of the training examples. Maintaining multiple version spaces can be costly, however, and the S and G sets are typically very large. If we assume bounded inconsistency, i.e., that instances close to the target concept boundary are the most likely to be misclassified, then more efficient solutions are possible. Hirsh [1990] presents an algorithm that runs as follows. For each instance, we form a version space consistent with that instance plus other nearby instances (for some suitable definition of nearby). This version space is then intersected with the one created for all previous instances. We keep accepting instances until the version space is reduced to a small set of candidate concept descriptions. (Because of inconsistency, it is unlikely that the version space will converge to a singleton.) We then match each of the concept descriptions against the entire data set, and choose the one that classifies the instances most accurately.
Another problem with the candidate elimination algorithm is the learning of disjunctive concepts. Suppose we wanted to learn the concept of "European car," which, in our representation, means either a German, British, or Italian car. Given positive examples of each, the candidate elimination algorithm will generalize to cars of any origin. Given such a generalization, a negative instance (say, a Japanese car) will only cause an inconsistency of the type mentioned above.
Of course, we could simply extend the representation language to include disjunctions. Thus, the concept
space would hold descriptions such as "Blue car of German or British origin" and "Italian sports car or
German luxury car." This approach has two drawbacks. First, the concept space becomes much larger and
specialization becomes intractable. Second, generalization can easily degenerate to the point where the S set
contains simply one large disjunction of all positive instances. We must somehow force generalization while
allowing for the introduction of di sjunctive descriptions. Mitchell [ 1978) gives an iterative approach that
involves several passes through the training data. On each pass, the algorithm huilds a concept that CO\'ers the
largest number of posiiive training instances without covering any negative training i,~stances. At the end of
the pru.&, the posi tive training in stances covered by the new concept are removed from the training St."t , and the
ne~ concept then becomes one di sjunct in the eventual disj unctive concept description. When all positive
!raining instanceb have been removed, we are left with II disjunctivt:) concept that covers nll of them without
covering any negative in stances.
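The covering loop can be sketched as follows (our illustration; find_conjunct stands in for any single-concept learner over the conjunctive feature language, here a simple greedy generalizer).

# Sketch of the iterative covering strategy for disjunctive concepts.
# Descriptions are attribute tuples with '?' as a wildcard.
def covers(desc, example):
    return all(d in ("?", v) for d, v in zip(desc, example))

def find_conjunct(positives, negatives):
    desc = positives[0]                              # seed with one positive example
    for p in positives[1:]:
        candidate = tuple(a if a == b else "?" for a, b in zip(desc, p))
        if not any(covers(candidate, n) for n in negatives):
            desc = candidate                         # safe to generalize
    return desc

def learn_disjunction(positives, negatives):
    disjuncts = []
    remaining = list(positives)
    while remaining:
        d = find_conjunct(remaining, negatives)
        disjuncts.append(d)
        remaining = [p for p in remaining if not covers(d, p)]
    return disjuncts

pos = [("Germany", "Luxury"), ("Britain", "Sports"), ("Italy", "Sports")]
neg = [("Japan", "Economy"), ("USA", "Luxury")]
print(learn_disjunction(pos, neg))   # e.g. [('Germany', 'Luxury'), ('?', 'Sports')]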
There are a number of other complexities, including the way in which features interact with one another.
For example, if the origin of a car is Japan, then the manufacturer cannot be Chrysler. The version space
algorithm as described above makes no use of such information. Also, in our example, it would be more
natural to replace the decade slot with a continuously valued year field. We would have to change our procedures
for updating the S and G sets to account for this kind of numerical data.
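One common way to handle such a numeric slot (a sketch of the general idea, not a procedure from the text) is to represent its constraint as an interval: positive examples widen the interval in an S description, while negative examples narrow a boundary of a G description just enough to exclude them without uncovering the positives.

# Sketch: interval handling for a continuously (here: integer) valued attribute
# such as year.  A positive example widens the S interval; a negative example
# narrows a G interval just enough to exclude it while still covering S.
def generalize_interval(interval, positive_value):
    lo, hi = interval
    return (min(lo, positive_value), max(hi, positive_value))

def specialize_interval(interval, negative_value, s_interval):
    lo, hi = interval
    s_lo, s_hi = s_interval
    if negative_value < lo or negative_value > hi:
        return interval                      # negative already excluded
    if negative_value < s_lo:                # negative lies below all positives
        return (negative_value + 1, hi)
    if negative_value > s_hi:                # negative lies above all positives
        return (lo, negative_value - 1)
    return None                              # negative falls inside S: noisy data

s_years = (1980, 1980)                            # first positive example
s_years = generalize_interval(s_years, 1976)      # -> (1976, 1980)
g_years = specialize_interval((1900, 2000), 1959, s_years)   # -> (1960, 2000)
print(s_years, g_years)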

17.5.3 Decision Trees


A third approach to concept learning is the induction of decision trees, as exemplified by the ID3 program of
Quinlan [1986]. ID3 uses a tree representation for concepts, such as the one shown in Fig. 17.13. To classify a
particular input, we start at the top of the tree and answer questions until we reach a leaf, where the classification
is stored. Fig. 17.13 represents the familiar concept "Japanese economy car." ID3 is a program that builds
decision trees automatically, given positive and negative instances of a concept.4

Fig. 17.13 A Decision Tree (the root tests origin? with branches Italy, Japan, and USA; the Japan branch tests
type? with branches Sports, Economy, and Luxury; the Economy leaf is labeled (+), all other leaves (-))
ID3 uses an iterative method to build up decision trees, preferring simple trees over complex ones, on the
theory that simple trees are more accurate classifiers of future inputs. It begins by choosing a random subset
of the training examples. This subset is called the window. The algorithm builds a decision tree that correctly
classifies all examples in the window. The tree is then tested on the training examples outside the window. If
all the examples are classified correctly, the algorithm halts. Otherwise, it adds a number of training examples
to the window and the process repeats. Empirical evidence indicates that the iterative strategy is more efficient
than considering the whole training set at once.
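The window strategy itself is just an outer loop around whatever tree-building routine is used. A schematic sketch follows (ours, not Quinlan's code; the build/classify stand-ins simply memorize the window and back off to its majority label, where a real implementation would induce an actual tree).

# Sketch of the window strategy: fit the window, test on everything else,
# and grow the window with misclassified examples until nothing is missed.
def train_with_window(examples, build, classify, initial=4, batch=4):
    window = list(examples[:initial])      # ID3 actually picks this subset at random
    while True:
        tree = build(window)
        wrong = [e for e in examples if classify(tree, e[0]) != e[1]]
        if not wrong:
            return tree
        window += wrong[:batch]            # add some misclassified examples

def build(window):                         # trivial stand-in for tree induction
    table = {attrs: label for attrs, label in window}
    labels = [label for _, label in window]
    return table, max(set(labels), key=labels.count)

def classify(tree, attrs):
    table, default = tree
    return table.get(attrs, default)

data = [(("Japan", "Economy"), True)] * 6 + [(("USA", "Sports"), False)] * 2
print(train_with_window(data, build, classify))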
So how does ID3 actually construct decision trees? Building a node means choosing some attribute to test.
At a given point in the tree, some attributes will yield more information than others. For example, testing the
attribute color is useless if the color of a car does not help us to classify it correctly. Ideally, an attribute will
separate training instances into subsets whose members share a common label (e.g., positive or negative). In
that case, branching is terminated, and the leaf nodes are labeled.
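This notion of "more information" is usually made precise with an entropy measure: the attribute chosen is the one whose test yields the largest information gain. A small sketch of that computation (the entropy and gain formulas are the standard ones; the four example cars are invented):

# Sketch: picking the attribute with the highest information gain.
from collections import Counter
from math import log2

def entropy(labels):
    total = len(labels)
    return -sum((c / total) * log2(c / total) for c in Counter(labels).values())

def information_gain(examples, attribute):
    labels = [label for _, label in examples]
    remainder = 0.0
    for value in {attrs[attribute] for attrs, _ in examples}:
        subset = [label for attrs, label in examples if attrs[attribute] == value]
        remainder += (len(subset) / len(examples)) * entropy(subset)
    return entropy(labels) - remainder

cars = [({"origin": "Japan", "type": "Economy", "color": "Blue"},  True),
        ({"origin": "Japan", "type": "Sports",  "color": "Blue"},  False),
        ({"origin": "USA",   "type": "Sports",  "color": "Green"}, False),
        ({"origin": "Japan", "type": "Economy", "color": "Green"}, True)]

best = max(["origin", "type", "color"], key=lambda a: information_gain(cars, a))
print(best)   # 'type': it separates these four examples perfectly; color gains nothing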
There are many variations on this basic algorithm. For example, when we add a test that has more than two
branches, it is possible that one branch has no corresponding training instances. In that case, we can either leave
the node unlabeled, or we can attempt to guess a label based on statistical properties of the set of instances being
tested at that point in the tree. Noisy input is another issue. One way of handling noisy input is to avoid building
new branches if the information gained is very slight. In other words, we do not want to overcomplicate the tree
to account for isolated noisy instances. Another source of uncertainty is that attribute values may be unknown.
For example, a patient's medical record may be incomplete. One solution is to guess the correct branch to take;
another solution is to build special "unknown" branches at each node during learning.
When the concept space is very large, decision tree learning algorithms run more quickly than their version
space cousins. Also, disjunction is more straightforward. For example, we can easily modify Fig. 17.13 to
represent the disjunctive concept "American car or Japanese economy car," simply by changing one of the
negative (-) leaf labels to positive (+). One drawback to the ID3 approach is that huge, complex decision
trees can be difficult for humans to understand, and so a decision tree system may have a hard time explaining
the reasons for its classifications.

17.6 EXPLANATION-BASED LEARNING


The previous section illustrated how we can induce concept descriptions from positive and negative examples.
Learning complex concepts using these procedures typically requires a substantial number of training instances.

4 Actually, the decision tree representation is more general: leaves can denote any of a number of classes, not just positive
and negative.


But people seem to be able to learn quite a bit from single examples. Consider a chess player who, as Black,
has reached the position shown in Fig. 17.14. The position is called a "fork" because the white knight attacks
both the black king and the black queen. Black must move the king, thereby leaving the queen open to capture.
From this single experience, Black is able to learn quite a bit about the fork trap: the idea is that if any piece x
attacks both the opponent's king and another piece y, then piece y will be lost. We don't need dozens of positive
and negative examples of fork positions in order to draw these conclusions. From just one experience, we can
learn to avoid this trap in the future and perhaps to use it to our own advantage.

Fig. 17.14 A Fork Position in Chess

What makes such single-example learning possible? The answer, not surprisingly, is knowledge. The chess
player has plenty of domain-specific knowledge that can be brought to bear, including the rules of chess and
any previously acquired strategies. That knowledge can be used to identify the critical aspects of the training
example. In the case of the fork, we know that the double simultaneous attack is important while the precise
position and type of the attacking piece is not.
Much of the recent work in machine learning has moved away from the empirical, data-intensive approach
described in the last section toward this more analytical, knowledge-intensive approach. A number of
independent studies led to the characterization of this approach as explanation-based learning. An EBL
system attempts to learn from a single example x by explaining why x is an example of the target concept. The
explanation is then generalized, and the system's performance is improved through the availability of this
knowledge.
Mitchell et al. [1986] and DeJong and Mooney [1986] both describe general frameworks for EBL programs
and give general learning algorithms. We can think of EBL programs as accepting the following as input:
• A Training Example-What the learning program "sees" in the world, e.g., the car of Fig. 17.7
• A Goal Concept-A high-level description of what the program is supposed to learn
• An Operationality Criterion-A description of which concepts are usable
• A Domain Theory-A set of rules that describe relationships between objects and actions in a domain
From this, EBL computes a generalization of the training example that is sufficient to describe the goal
concept, and also satisfies the operationality criterion.
Let's look more closely at this specification. The training example is a familiar input-it is the same thing
as the example in the version space algorithm. The goal concept is also familiar, but in previous sections, we
have viewed the goal concept as an output of the program, not an input. The assumption here is that the goal
concept is not operational, just like the high-level card-playing advice described in Section 17.3. An EBL
program seeks to operationalize the goal concept by expressing it in terms that a problem-solving program
can understand. These terms are given by the operationality criterion. In the chess example, the goal concept
might be something like "bad position for Black," and the operationalized concept would be a generalized
description of situations similar to the training example, given in terms of pieces and their relative positions.
The last input to an EBL program is a domain theory, in our case, the rules of chess. Without such knowledge,
it is impossible to come up with a correct generalization of the training example.

Explanation-based generalization (EBG) is an algorithm for EBL described in Mitchell et al. [1986]. It
has two steps: (1) explain and (2) generalize. During the first step, the domain theory is used to prune away all
the unimportant aspects of the training example with respect to the goal concept. What is left is an explanation
of why the training example is an instance of the goal concept. This explanation is expressed in terms that
satisfy the operationality criterion. The next step is to generalize the explanation as far as possible while still

describing the goal concept. Following our chess example, the first EBL step chooses to ignore White's
pawns, king, and rook, and constructs an explanation consisting of White's knight, Black's king, and Black's
queen, each in their specific positions. Operationality is ensured: all chess-playing programs understand the
basic concepts of piece and position. Next, the explanation is generalized. Using domain knowledge, we find
that moving the pieces to a different part of the board is still bad for Black. We can also determine that other
pieces besides knights and queens can participate in fork attacks.
In reality, current EBL methods run into difficulties in domains as complex as chess, so we will not pursue
this example further. Instead, let's look at a simpler case. Consider the problem of learning the concept Cup
[Mitchell et al., 1986]. Unlike the arch-learning program of Section 17.5.1, we want to be able to generalize
from a single example of a cup. Suppose the example is:
• Training Example:
owner(Object23, Ralph) ∧ has-part(Object23, Concavity12) ∧
is(Object23, Light) ∧ color(Object23, Brown) ∧ ...
Clearly, some of the features of Object23 are more relevant to its being a cup than others. So far in this
chapter, we have seen several methods for isolating relevant features. These methods all require many positive
and negative examples. In EBL we instead rely on domain knowledge, such as:
• Domain Knowledge:
is(x, Light) ∧ has-part(x, y) ∧ isa(y, Handle) ➔ liftable(x)
has-part(x, y) ∧ isa(y, Bottom) ∧ is(y, Flat) ➔ stable(x)
has-part(x, y) ∧ isa(y, Concavity) ∧ is(y, Upward-Pointing) ➔ open-vessel(x)

We also need a goal concept to operationalize:


• Goal Concept: Cup
x is a Cup if x is liftable, stable, and open-vessel.
• Operationality Criterion: Concept definition must be expressed in purely structural terms (e.g., Light,
Flat, etc.).
Given a training example and a functional description, we want to build a general structural description of
a cup. The first step is to explain why Object23 is a cup. We do this by constructing a proof, as shown in
Fig. 17.15. Standard theorem-proving techniques can be used to find such a proof.

Fig. 17.15 An Explanation (a proof tree: Cup(Object23) follows from liftable(Object23), stable(Object23),
and open-vessel(Object23); liftable(Object23) rests on is(Object23, Light), has-part(Object23, Handle16), and
isa(Handle16, Handle); stable(Object23) rests on has-part(Object23, Bottom19), isa(Bottom19, Bottom), and
is(Bottom19, Flat); open-vessel(Object23) rests on has-part(Object23, Concavity12), isa(Concavity12, Concavity),
and is(Concavity12, Upward-Pointing))



Notice that the proof isolates the relevant features of the training example; nowhere in the proof do the
predicates owner and color appear. The proof also serves as a basis for a valid generalization. If we gather up
all the assumptions and replace constants with variables, we get the following description of a cup:

has-part(x, y) ∧ isa(y, Concavity) ∧ is(y, Upward-Pointing) ∧
has-part(x, z) ∧ isa(z, Bottom) ∧ is(z, Flat) ∧
has-part(x, w) ∧ isa(w, Handle) ∧ is(x, Light)

This definitioi~ sati sfies th e operationality criterion and could be used by a robot to classify objects.
Simply replacing constants by variables worked in this example, but in some cases it is necessary to retain
certain constants. To catch these cases, we must reprove the goal. This process, which we saw earlier in our
discussion of learning in STRIPS, is called goal regression.
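Read operationally, the generalized description is just a structural test that a classifier could apply directly. The following sketch (our hypothetical encoding, not from the text; the predicate and object names follow the example above) stores the training facts as tuples and checks the learned rule:

# Sketch: the operationalized Cup rule applied to a simple fact base.
facts = {
    ("has-part", "Object23", "Concavity12"), ("isa", "Concavity12", "Concavity"),
    ("is", "Concavity12", "Upward-Pointing"),
    ("has-part", "Object23", "Bottom19"), ("isa", "Bottom19", "Bottom"),
    ("is", "Bottom19", "Flat"),
    ("has-part", "Object23", "Handle16"), ("isa", "Handle16", "Handle"),
    ("is", "Object23", "Light"),
    ("owner", "Object23", "Ralph"), ("color", "Object23", "Brown"),  # irrelevant to Cup
}

def holds(pred, a, b):
    return (pred, a, b) in facts

def parts(x):
    return [y for (p, a, y) in facts if p == "has-part" and a == x]

def is_cup(x):
    # x is light and has an upward-pointing concavity, a flat bottom, and a handle
    return (holds("is", x, "Light")
            and any(holds("isa", y, "Concavity") and holds("is", y, "Upward-Pointing")
                    for y in parts(x))
            and any(holds("isa", z, "Bottom") and holds("is", z, "Flat")
                    for z in parts(x))
            and any(holds("isa", w, "Handle") for w in parts(x)))

print(is_cup("Object23"))   # True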
As we have seen, EBL depends strongly on a domain theory. Given such a theory, why are examples
needed at all? We could have operationalized the goal concept Cup without reference to an example, since the
domain theory contains all of the requisite information. The answer is that examples help to focus the learning
on relevant operationalizations. Without an example cup, EBL is faced with the task of characterizing the
entire range of objects that satisfy the goal concept. Most of these objects will never be encountered in the real
world. and so the result will be overly general.
Providing a tractable domain theory is a difficult task. There is evidence that humans do not learn with very
primitive relations. Instead, they create incomplete and inconsistent domain theories. For example, returning
to chess, such a theory might include concepts like "weak pawn structure." Getting EBL to work in ill-
structured domain theories is an active area of research (see, e.g., Tadepalli [1989]).
EBL shares many features of all the learning methods described in earlier sections. Like concept learning,
EBL begins with a positive example of some concept. As in learning by advice taking, the goal is to
operationalize some piece of knowledge. And EBL techniques, like the techniques of chunking and macro-
operators, are often used to improve the performance of problem-solving engines. The major difference between
EBL and other learning methods is that EBL programs are built to take advantage of domain knowledge.
Since learning is just another kind of problem solving, it should come as no surprise that there is leverage to
be found in knowledge.

17.7 DISCOVERY
Learning is the process by which one entity acquires knowledge. Usually that knowledge is already possessed
by some number of other entities who may serve as teachers. Discovery is a restricted form of learning in
which one entity acquires knowledge without the help of a teacher.5 In this section, we look at three types of
automated discovery systems.

17.7.1 AM: Theory-Driven Discovery


Discovery is certainly learning. But it is also, perhaps more clearly than other kinds of learning, problem-solving.
Suppose that we want to build a program to discover things, for example, in mathematics. We expect
that such a program would have to rely heavily on the problem-solving techniques we have discussed. In fact,
one such program was written by Lenat [1977; 1982]. It was called AM, and it worked from a few basic
concepts of set theory to discover a good deal of standard number theory.

5 Sometimes there is no one in the world who has the knowledge we seek. In that case, the kind of action we must take is
called scientific discovery.

AM exploited a variety of general-purpose AI techniques. It used a frame system to represent mathematical
concepts. One of the major activities of AM is to create new concepts and fill in their slots. An example of an
AM concept is shown in Fig. 17.16. AM also uses heuristic search, guided by a set of 250 heuristic rules
representing hints about activities that are likely to lead to "interesting" discoveries. Examples of the kind of
heuristics AM used are shown in Fig. 17.17. Generate-and-test is used to form hypotheses on the basis of a
small number of examples and then to test the hypotheses on a larger set to see if they still appear to hold.
Finally, an agenda controls the entire discovery process. When the heuristics suggest a task, it is placed on a
central agenda, along with the reason that it was suggested and the strength with which it was suggested. AM
operates in cycles, each time choosing the most promising task from the agenda and performing it.
name : Prime-Numbers
definitions :
origin : Number-of-divisors-of(x) = 2
predicate-calculus : Prime(x) ⇔ (∀z) (z | x ⇒ (z = 1 ⊕ z = x))

iterative : (for x > 1): For i from 2 to √x, ¬(i | x)
examples : 2, 3, 5, 7, 11 , 13, 17
boundary : 2, 3
boundary-failures : 0, 1
failures : 12
generalizations : Number, numbers with an even number of divisors
specializations : Odd primes, prime pairs, prime uniquely addables
conjecs : Unique factorization, Goldbach's conjecture, extremes of number-of-divisors-of
intus : A metaphor to the effect that primes are the building blocks of all numbers
analogies :
Maximally divisible numbers are converse extremes of number-of-divisors-of
Factor a nonsimple group into simple groups
interest : Conjectures tying primes to times, to divisors of, to related operations
worth : 800
Fig. 17.16 An AM Concept: Prime Number
• If f is a function from A to B and B is ordered, then consider the elements of A that are mapped into extremal
elements of B. Create a new concept representing this subset of A.
• If some (but not most) examples of some concept X are also examples of another concept Y, create a new
concept representing the intersection of X and Y.
• If very few examples of a concept X are found, then add to the agenda the task of finding a generalization of X.
Fig. 17.17 Some AM Heuristics
In one run, AM discovered the concept of prime numbers. How did it do that? Having stumbled onto the
natural numbers, AM explored operations such as addition, multiplication, and their inverses. It created the
concept of divisibility and noticed that some numbers had very few divisors. AM has a built-in heuristic that
tells it to explore extreme cases. It attempted to list all numbers with zero divisors (finding none), one divisor
(finding one: 1), and two divisors. AM was instructed to call the last concept "primes." Before pursuing this
concept, AM went on to list numbers with three divisors, such as 49. AM tried to relate this property with
other properties of 49, such as its being odd and a perfect square. AM generated other odd numbers and other
perfect squares to test its hypotheses. A side effect of determining the equivalence of perfect squares with
numbers with three divisors was to boost the "interestingness" rating of the divisor concept. This led AM to
investigate ways in which a number could be broken down into factors. AM then noticed that there was only
one way to break a number down into prime factors (known as the Unique Factorization Theorem).
Since breaking down numbers into multiplicative components turned out to be interesting, AM decided,
by analogy, to pursue additive components as well. It made several uninteresting conjectures, such as that
every number could be expressed as a sum of 1's. It also found more interesting phenomena, such as that
many numbers were expressible as the sum of two primes. By listing cases, AM determined that all even
numbers greater than 2 seemed to have this property. This conjecture, known as Goldbach's Conjecture, is
widely believed to be true, but a proof of it has yet to be found in mathematics.
AM contains a great many general-purpose heuristics such as the ones it used in this example. Often
different heuristics point to the same place. For example, while AM discovered prime numbers using a heuristic
that involved looking at extreme cases, another way to derive prime numbers is to use the following two rules:
• If there is a strong analogy between A and B but there is a conjecture about A that does not hold for all
elements of B, define a new concept that includes the elements of B for which it does hold.
• If there is a set whose complement is much rarer than itself, then create a new concept representing the
complement.
There is a strong analogy between addition and multiplication of natural numbers. But that analogy breaks
down when we observe that all natural numbers greater than 1 can be expressed as the sum of two smaller
natural numbers (excluding the identity). This is not true for multiplication. So the first heuristic described
above suggests the creation of a new concept representing the set of composite numbers. Then the second
heuristic suggests creating a concept representing the complement of that, namely the set of prime numbers.
on was: "Why was AM ever turned off?"
Two major quest ions came out of the work on AM. One questi
facts about numb ers, possibly facts unkno wn
That is, why didn't AM simp] y keep discov ering new interesting
mance was limite d by the static nature of its
to human mathe matic s? Lenat [1983 b] conte nds that AM 's perfor
it was worki ng evolv ed away from the initial
heuristics. As the progr am progr essed , the conce pts with which
conce pts stayed the same. To remed y this
ones, while the heuris tics that were availa ble to work on those
conce pts that could be create d and modif ied
problem, it was sugge sted that heuris tics be treated as full-fledged
n, and analo gy) as are conce pts in the task
by the same sorts of proce sses (such as generalization, specializatio
the doma in of "Heur etics, " the study of heuris tics
domain . In other words , AM would run in discov ery mode in
sion of AM called EUR1SKO (Lena t. 1983a ]
themsel ves, a<; well as in the doma in of numb er theory. An exten
was designed with this goal in mind.
The other question was: "Why did AM work as well as it did?" One source of power for AM was its huge
collection of heuristics about what constitute interesting things. But AM had another less obvious source of
power, namely, the natural relationship between number theoretical concepts and their compact representations
in AM [Lenat and Brown, 1983]. AM worked by syntactically mutating old concept definitions (stored
essentially as short LISP programs) in the hopes of finding new, interesting concepts. It turns out that a
mutation in a small LISP program very likely results in another well-formed, meaningful LISP program. This
accounts for AM's ability to generate so many novel concepts. But while humans interpret AM as exploring
number theory, it was actually exploring the space of small LISP programs. AM succeeded in large part
because of this intimate relationship between number theory and LISP programs. When AM and EURISKO
were applied to other domains, including the study of heuristics themselves, problems arose. Concepts in
these domains were larger and more complex than number theory concepts, and the syntax of the representation
language no longer closely mirrored the semantics of the domain. As a result, syntactic mutation of a concept
definition almost always resulted in an ill-formed or useless concept, severely hampering the discovery
procedure.
. 'W_e nn_1s~ be care ful how we interp ret what
Perhap~ Lhe moral of AM is th at leurni ng is u trk ky business
'.111 1111~ltc 1t bh1s towar d teami ng conce pts in
our AJ program ~ are doing I RiLchi e und Hu_n~a , I9H41. ~ M had
1t possible to understand why AM perfo rmed
number Lheory . OnJy afler that bius was exp l1 <.: 1tly re<.:og n1 zud was
Well in one doma in and poorly in uno ther.

17.7.2 BACON: Data-Driven Discovery


AM showed how discovery might occur in a theoretical setting. Empirical scientists see things somewhat
differently. They are confronted with data from the world and must make sense of it. They make hypotheses,
and in order to validate them, they design and execute experiments. Scientific discovery has inspired a number
of computer models. Langley et al. [1981a] present a model of data-driven scientific discovery that has been
implemented as a program called BACON, named after Sir Francis Bacon, an early philosopher of science.
BACON begins with a set of variables for a problem. For example, in the study of the behavior of gases,
some variables are p, the pressure on the gas, V, the volume of the gas, n, the amount of gas in moles, and T,
the temperature of the gas. Physicists have long known a law, called the ideal gas law, that relates these
variables. BACON is able to derive this law on its own. First, BACON holds the variables n and T constant,
performing experiments at different pressures p1, p2, and p3. BACON notices that as the pressure increases,
the volume V decreases. Therefore, it creates a theoretical term pV. This term is constant. BACON systematically
moves on to vary the other variables. It tries an experiment with different values of T, and finds that pV
changes. The two terms are linearly related with an intercept of 0, so BACON creates a new term pV/T.
Finally, BACON varies the term n and finds another linear relation between n and pV/T. For all values of n, p,
V, and T, pV/nT = 8.32. This is, in fact, the ideal gas law. Fig. 17.18 shows BACON's reasoning in a tabular
format.

n   T     p     V        pV        pV/T     pV/nT
1   300   100   24.96
1   300   200   12.48
1   300   300    8.32    2496
1   310                  2579.2
1   320                  2662.4     8.32
2   320                            16.64
3   320                            24.96     8.32
Fig. 17.18 BACON Discovering the Ideal Gas Law
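Two of those heuristics, noting constancies and forming a new term when two quantities vary together, can be illustrated with a rough sketch on data like that of Fig. 17.18 (our own illustration; BACON's actual machinery is considerably more general):

# Rough sketch of two BACON-style heuristics on gas data: note when a term is
# constant, and form a ratio term when two quantities vary proportionally.
def is_constant(values, tol=1e-6):
    return max(values) - min(values) < tol

# observations: (n, T, p, V), with V chosen so that p*V = 8.32 * n * T
observations = [(1, 300, 100, 24.96), (1, 300, 200, 12.48), (1, 300, 300, 8.32),
                (1, 310, 100, 25.792), (1, 320, 100, 26.624),
                (2, 320, 100, 53.248), (3, 320, 100, 79.872)]

# Step 1: with n and T fixed, p*V is constant -> introduce the term pV
pv = [p * V for n, T, p, V in observations if (n, T) == (1, 300)]
print("pV constant at n=1, T=300:", is_constant(pv), round(pv[0], 1))

# Step 2: with n fixed, pV varies linearly with T -> introduce pV/T
pv_over_T = [p * V / T for n, T, p, V in observations if n == 1]
print("pV/T constant at n=1:", is_constant(pv_over_T), round(pv_over_T[0], 2))

# Step 3: pV/T varies linearly with n -> introduce pV/(nT), constant everywhere (8.32)
pv_over_nT = [p * V / (n * T) for n, T, p, V in observations]
print("pV/nT constant:", is_constant(pv_over_nT), round(pv_over_nT[0], 2))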

BACON has been used to discover a wide variety of scientific laws, such as Kepler's third law, Ohm's law,
the conservation of momentum, and Joule's law. The heuristics BACON uses to discover the ideal gas law
include noting constancies, finding linear relations, and defining theoretical terms. Other heuristics allow
BACON to postulate intrinsic properties of objects and to reason by analogy. For example, if BACON finds
a regularity in one set of parameters, it will attempt to generate the same regularity in a similar set of parameters.
Since BACON's discovery procedure is state-space search, these heuristics allow it to reach solutions while
visiting only a small portion of the search space. In the gas example, BACON comes up with the ideal gas law
using a minimal number of experiments.
A better understanding of the science of scientific discovery may lead one day to programs that display
true creativity. Much more work must be done in areas of science that BACON does not model, such as
determining what data to gather, choosing (or creating) instruments to measure the data, and using analogies
to previously understood phenomena. For a thorough discussion of scientific discovery programs. see Langley
et al. [1987].

17.7.3 Clustering
A third type of discovery, called clustering, is very similar to induction, as we described it in Section 17.5. In
inductive learning, a program learns to classify objects based on the labelings provided by a teacher. In
clustering, no class labelings are provided. The program must discover for itself the natural classes that exist
for the objects, in addition to a method for classifying instances.
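A minimal way to see the difference from supervised induction is a simple iterative grouping procedure such as k-means, sketched below purely as an illustration of finding natural groups in unlabeled data.

# Sketch: discovering two clusters in unlabeled one-dimensional data with k-means.
def kmeans(points, k=2, iterations=10):
    centers = points[:k]                      # naive initialization
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        for x in points:
            nearest = min(range(k), key=lambda i: abs(x - centers[i]))
            clusters[nearest].append(x)
        centers = [sum(c) / len(c) if c else centers[i]
                   for i, c in enumerate(clusters)]
    return centers, clusters

data = [1.0, 1.2, 0.8, 5.0, 5.3, 4.9]         # two natural groups
centers, clusters = kmeans(data)
print(centers)    # roughly [1.0, 5.07]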
and punishment. Researchers hoped that by studying the learning mechanisms of animals, they might build
learning machines from very simple parts. Such hopes proved elusive. However, the field of neural network
learning has seen a resurgence in recent years, partly as a result of the discovery of powerful new learning
algorithms. Chapter 18 describes these algorithms in detail.
While neural network models are based on a computational "brain metaphor," a number of other learning
techniques make use of a metaphor based on evolution. In this work, learning occurs through a selection
process that begins with a large population of random programs. Learning algorithms inspired by evolution
are called genetic algorithms [Holland, 1975; de Jong, 1988; Goldberg, 1989]. GAs have been dealt with in
greater detail in Chapter 23.

---
SUMMA RY

The most important thing to conclude from our study of automated learning is that learning itself is a problem-solving
process. We can cast various learning strategies in terms of the methods of Chapters 2 and 3.
• Learning by taking advice
- Initial state: high-level advice
- Final state: an operational rule
- Operators: unfolding definitions, case analysis, matching, etc.
• Learning from examples
- Initial state: collection of positive and negative examples
- Final state: concept description
- Search algorithms: candidate elimination, induction of decision trees
• Learning in problem solving
- Initial state: solution traces to example problems
- Final state: new heuristics for solving new problems efficiently
- Heuristics for search: generalization, explanation-based learning, utility considerations
• Discovery
- Initial state: some environment
- Final state: unknown
- Heuristics for search: interestingness, analogy, etc.
A learning machine is the dream system of AI. As we have seen in previous chapters, the key to intelligent
behavior is having a lot of knowledge. Getting all of that knowledge into a computer is a staggering task. One
hope of sidestepping the task is to let computers acquire knowledge independently. as people do. We do not
yet have programs that can extend themselves indefinitely. But we have discovered some of the reasons for
our failure to create such systems. If we look at actual learning programs, we find that the more knowledge a
program starts with, the more it can learn. This finding is satisfying, in the sense that it corroborates our other
discoveries about the power of knowledge. But it is also unpleasant, because it seems that fully self-extending
systems are, for the present , still out of reach.
Research in machine learning has gone through several cycles of popularity. Timing is always an important
consideration. A learning program needs to acquire new knowledge and new problem-solving abilities, but
knowledge and problem-solving are topics still under intensive study. If we do not understand the nature of
the thing we want to learn , learning is difficult. Not surprisingly, the most successful learning programs
operate in fairly well-understood areas (like planning), and not in less well-understood areas (like natural
language understanding).
