17
LEARNING
That men do not learn very much from the lessons of history is the most important of all the lessons of history.
-Aldous Huxley
(1894-1963), English Writer and Author
The Analytical Engine has no pretensions whatever to originate anything. It can do whatever we know how to order it to perform. [Lovelace, 1961]

This remark has been interpreted by several AI critics as saying that computers cannot learn. In fact, it does not say that at all. Nothing prevents us from telling a computer how to interpret its inputs in such a way that its performance gradually improves.
Rather than asking in advance whether it is possible for computers to "learn," it is much more enlightening to try to describe exactly what activities we mean when we say "learning" and what mechanisms could be used to enable us to perform those activities. Simon [1983] has proposed that learning denotes

...changes in the system that are adaptive in the sense that they enable the system to do the same task or tasks drawn from the same population more efficiently and more effectively the next time.

As we have seen throughout this book, AI programs rely heavily on knowledge as their source of power. Knowledge is generally acquired through experience, and such acquisition is the focus of this chapter.
Knowledge acquisition itself includes many different activities. Simple storing of computed information, or rote learning, is the most basic learning activity. Many computer programs, e.g., database systems, can be said to "learn" in this sense, although most people would not call such simple storage learning. However, many AI programs are able to improve their performance substantially through rote-learning techniques, and we will look at one example in depth: the checker-playing program of Samuel [1963].
Another way we learn is through taking advice from others. Advice taking is similar to rote learning, but high-level advice may not be in a form simple enough for a program to use directly in problem solving. The advice may need to be first operationalized, a process explored in Section 17.3.
People also learn through their own problem-solving experience. After solving a complex problem, we remember the structure of the problem and the methods we used to solve it. The next time we see the problem, we can solve it more efficiently. Moreover, we can generalize from our experience to solve related problems more easily. In contrast to advice taking, learning from problem-solving experience does not usually involve gathering new knowledge that was previously unavailable to the learning program. That is, the program remembers its experiences and generalizes from them, but does not add to the transitive closure¹ of its knowledge, in the sense that an advice-taking program would, i.e., by receiving stimuli from the outside world. In large problem spaces, however, efficiency gains are critical. Practically speaking, learning can mean the difference between solving a problem rapidly and not solving it at all. In addition, programs that learn through problem-solving experience may be able to come up with qualitatively better solutions in the future.
Another form of learning that does involve stimuli from the outside is learning from examples. We often learn to classify things in the world without being given explicit rules. For example, adults can differentiate between cats and dogs, but small children often cannot. Somewhere along the line, we induce a method for telling cats from dogs based on seeing numerous examples of each. Learning from examples usually involves a teacher who helps us classify things by correcting us when we are wrong. Sometimes, however, a program can discover things without the aid of a teacher.
AI researchers have proposed many mechanisms for doing the kinds of learning described above. In this chapter, we discuss several of them. But keep in mind throughout this discussion that learning is itself a problem-solving process. In fact, it is very difficult to formulate a precise definition of learning that distinguishes it from other problem-solving tasks. Thus it should come as no surprise that, throughout this chapter, we will make extensive use of both the problem-solving mechanisms and the knowledge representation techniques that were presented in Parts I and II.
¹The transitive closure of a program's knowledge is that knowledge plus whatever the program can logically deduce from it.
Fig. 17.1 Storing Backed-Up Values
Rote learning of this sort is very simple. It does not appear to involve any sophisticated problem-solving capabilities. But even it shows the need for some capabilities that will become increasingly important in more complex learning systems. These capabilities include:

• Organized Storage of Information - In order for it to be faster to use a stored value than to recompute it, there must be a way to access the appropriate stored value quickly. In Samuel's program, this was done by indexing board positions by a few important characteristics, such as the number of pieces. But as the complexity of the stored information increases, more sophisticated techniques are necessary.
• Generalization - The number of distinct objects that might potentially be stored can be very large. To keep the number of stored objects down to a manageable level, some kind of generalization is necessary. In Samuel's program, for example, the number of distinct objects that could be stored was equal to the number of different board positions that can arise in a game. Only a few simple forms of generalization were used in Samuel's program to cut down that number. All positions are stored as though White is to move. This cuts the number of stored positions in half. When possible, rotations along the diagonal are also combined. Again, though, as the complexity of the learning process increases, so too does the need for generalization.

At this point, we have begun to see one way in which learning is similar to other kinds of problem solving. Its success depends on a good organizational structure for its knowledge base.
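To make these two capabilities concrete, here is a minimal sketch, in Python, of Samuel-style rote learning for a game player. The position encoding (a frozenset of piece/square pairs), the move generator, and the material-count evaluation are all invented stand-ins; only the idea of storing backed-up values under a normalized index comes from the discussion above.

```python
# A minimal sketch of rote learning in the style of Samuel's checkers program:
# backed-up lookahead values are stored under a normalized index so they can
# be reused instead of recomputed. All game details here are hypothetical.

stored_scores = {}   # organized storage: normalized position -> backed-up value

def flip_colors(position):
    """Swap the two sides so every position can be indexed 'White to move'."""
    swap = {"W": "B", "B": "W"}
    return frozenset((swap[piece], square) for piece, square in position)

def canonicalize(position, white_to_move):
    """A simple generalization: index each position as if White is to move.
    (Samuel's program also folded together rotations along the diagonal.)"""
    return position if white_to_move else flip_colors(position)

def static_value(position):
    """Trivial stand-in evaluation: material count."""
    return sum(+1 if piece == "W" else -1 for piece, _ in position)

def moves(position, white_to_move):
    """Game-specific successor generator; left empty in this sketch."""
    return []

def backed_up_value(position, white_to_move, depth):
    key = canonicalize(position, white_to_move)
    if key in stored_scores:                 # rote learning pays off here
        return stored_scores[key]
    successors = moves(position, white_to_move)
    if depth == 0 or not successors:
        value = static_value(position)
    else:                                    # negamax backup of the lookahead
        value = max(-backed_up_value(s, not white_to_move, depth - 1)
                    for s in successors)
    stored_scores[key] = value               # remember the computed value
    return value
```

Indexing every position as if White is to move is what lets one stored entry serve two mirror-image situations, which is exactly the halving of the table described above.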
FOO considers the advice to apply to the player called "me." Next, FOO UNFOLDs the definition of trick. In other words, the player should avoid taking points during the scenario consisting of (1) players playing cards and (2) one player taking the trick. FOO then uses case analysis to determine which steps could cause one to take points. It rules out step 1 on the basis that it knows of no intersection of the concepts take-points and play-card. But step 2 could affect taking points, so FOO UNFOLDs the definition of take-points.

This advice says that the player should avoid taking point cards during the process of the trick-winner taking the trick. The question for FOO now is: Under what conditions does (take me c2) occur during (take (trick-winner) c1)? By using a technique called partial match, FOO hypothesizes that points will be taken if me = trick-winner and c2 = c1. It transforms the advice into a statement that means "Do not win a trick that has points." We have not traveled very far conceptually from "avoid taking points," but it is important to note that the current vocabulary is one that FOO can understand in terms of actually playing the game of hearts. Through a number of other transformations, FOO eventually settles on an operational version of the advice.
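The core operation FOO leans on here, unfolding a defined term inside a piece of advice, is easy to sketch. The s-expression encoding, the definition shown, and the symbol names below are our own illustrative stand-ins, not Mostow's actual FOO vocabulary.

```python
# A toy sketch of one FOO-style operationalization step: UNFOLD replaces a
# defined symbol by its definition wherever it occurs in the advice. The
# definition of "trick" below is an invented approximation.

DEFINITIONS = {
    # trick := players each play a card, then the trick-winner takes the trick
    "trick": ["scenario",
              ["each", "p", ["players"], ["play-card", "p"]],
              ["take-trick", ["trick-winner"]]],
}

def unfold(expr):
    """Recursively replace every defined symbol by its definition."""
    if isinstance(expr, str):
        return DEFINITIONS.get(expr, expr)
    return [unfold(sub) for sub in expr]

advice = ["avoid", ["take-points", "me"], ["during", "trick"]]
print(unfold(advice))
# ['avoid', ['take-points', 'me'],
#  ['during', ['scenario', ['each', 'p', ['players'], ['play-card', 'p']],
#              ['take-trick', ['trick-winner']]]]]
```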
...that are fairly reasonable measures of the correct score even if they are not as accurate as we hope to get them. Then this evaluation function can be used to provide feedback to itself. Move sequences that lead to positions with higher values can be considered good (and the terms in the evaluation function that suggested them can be reinforced).

Because of the limitations of this approach, however, Samuel's program did two other things, one of which provided an additional test that progress was being made and the other of which generated additional nudges to keep the process out of a rut:

• When the program was in learning mode, it played against another copy of itself. Only one of the copies altered its scoring function during the game; the other remained fixed. At the end of the game, if the copy with the modified function won, then the modified function was accepted. Otherwise, the old one was retained. If, however, this happened very many times, then some drastic change was made to the function in an attempt to get the process going in a more profitable direction.
• Periodically, one term in the scoring function was eliminated and replaced by another. This was possible because, although the program used only sixteen features at any one time, it actually knew about thirty-eight. This replacement differed from the rest of the learning procedure since it created a sudden change in the scoring function rather than a gradual shift in its weights.

This process of learning by successive modifications to the weights of terms in a scoring function has many limitations, mostly arising out of its lack of exploitation of any knowledge about the structure of the problem with which it is dealing and the logical relationships among the problem's components. In addition, because the learning procedure is a variety of hill climbing, it suffers from the same difficulties as do other hill-climbing programs. Parameter adjustment is certainly not a solution to the overall learning problem. But it is often a useful technique, either in situations where very little additional knowledge is available or in programs in which it is combined with more knowledge-intensive methods. We have more to say about this type of learning in Chapter 18.
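The weight-update idea itself can be sketched compactly. What follows is a hedged illustration of coefficient adjustment, not Samuel's actual correlation-based update; the learning rate and the error definition are our own choices, and the "observed" value would come from deeper search or the outcome of the game.

```python
# A hedged sketch of parameter (coefficient) adjustment for a linear scoring
# function: nudge each term's weight toward whatever correlated with the
# later, better-informed value.

def score(weights, features):
    """Linear evaluation function: weighted sum of feature values."""
    return sum(w * f for w, f in zip(weights, features))

def adjust(weights, features, predicted, observed, rate=0.01):
    """Shift each weight so the prediction moves toward the value that
    deeper search (or the game's result) later suggested."""
    error = observed - predicted
    return [w + rate * error * f for w, f in zip(weights, features)]

weights = [0.5, -0.2, 1.0]          # one weight per board feature
features = [3, 1, 0]                # feature values for some position
predicted = score(weights, features)
weights = adjust(weights, features, predicted, observed=2.5)
print(weights)
```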
...just mentioned. In future planning, STRIPS is free to use this complex macro-operator just as it would use any other operator.

But STRIPS will rarely see the exact same problem twice. New problems will differ from previous problems. We would still like the problem solver to make efficient use of the knowledge it gained from its previous experiences. By generalizing MACROPs before storing them, STRIPS is able to accomplish this. The simplest idea for generalization is to replace all of the constants in the macro-operator by variables. Instead of storing the MACROP described in the previous paragraph, STRIPS can generalize the plan to consist of the steps UNSTACK(x1, x2), PUTDOWN(x1), PICKUP(x3), STACK(x3, x2), where x1, x2, and x3 are variables. This plan can then be stored with preconditions ON(x1, x2), ON(x3, Table) and postconditions ON(x1, Table), ON(x3, x2). Such a MACROP can now apply in a variety of situations.
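This simplest generalization step is easy to state in code. The sketch below only does the naive constants-to-variables replacement, reusing the same variable for repeated constants; as the text goes on to show, that is sometimes too liberal. The list-of-tuples plan representation is an assumption.

```python
# A minimal sketch of naive MACROP generalization: replace every constant in
# a stored plan by a variable, reusing one variable per distinct constant.

def generalize(plan):
    """plan: list of (operator, args) tuples with constant arguments."""
    table = {}
    def var_for(constant):
        if constant not in table:
            table[constant] = "x%d" % (len(table) + 1)
        return table[constant]
    return [(op, tuple(var_for(a) for a in args)) for op, args in plan]

macrop = [("UNSTACK", ("C", "B")), ("PUTDOWN", ("C",)),
          ("PICKUP", ("A",)), ("STACK", ("A", "B"))]
print(generalize(macrop))
# [('UNSTACK', ('x1', 'x2')), ('PUTDOWN', ('x1',)),
#  ('PICKUP', ('x3',)), ('STACK', ('x3', 'x2'))]
```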
Generalization is not so easy, however. Sometimes constants must retain their specific values. Suppose our domain included an operator called STACK-ON-B(x), with preconditions that both x and B be clear, and with postcondition ON(x, B). Consider the same problem as above:

start: ON(C, B)
goal: ON(A, B)

STRIPS might come up with the plan UNSTACK(C, B), PUTDOWN(C), STACK-ON-B(A). Let's generalize this plan and store it as a MACROP. The precondition becomes ON(x3, x2), the postcondition becomes ON(x1, x2), and the plan itself becomes UNSTACK(x3, x2), PUTDOWN(x3), STACK-ON-B(x1).
Now, suppose we encounter a slightly different problem:

start: ON(E, C)
goal: ON(A, C)

The generalized MACROP we just stored seems well-suited to solving this problem if we let x1 = A, x2 = C, and x3 = E. Its preconditions are satisfied, so we construct the plan UNSTACK(E, C), PUTDOWN(E), STACK-ON-B(A). But this plan does not work. The problem is that the postcondition of the MACROP is overgeneralized. This operation is only useful for stacking blocks onto B, which is not what we need in this new example. In this case, this difficulty will be discovered when the last step is attempted. Although we cleared C, which is where we wanted to put A, we failed to clear B, which is where the MACROP is going to try to put it. Since B is not clear, STACK-ON-B cannot be executed. If B had happened to be clear, the MACROP would have executed to completion, but it would not have accomplished the stated goal.
In reality, STRIPS uses a more complex generalization procedure. First, all constants are replaced by variables. Then, for each operator in the parameterized plan, STRIPS reevaluates its preconditions. In our example, the preconditions of steps 1 and 2 are satisfied, but the only way to ensure that B is clear for step 3 is to assume that block x2, which was cleared by the UNSTACK operator, is actually block B. Through "re-proving" that the generalized plan works, STRIPS locates constraints of this kind.
More recent work on macro-operators appears in Korf [1985b]. It turns out that the set of problems for which macro-operators are critical are exactly those problems with nonserializable subgoals. Nonserializability means that working on one subgoal will necessarily interfere with the previous solution to another subgoal. Recall that we discussed such problems in connection with nonlinear planning (Section 13.5). Macro-operators can be useful in such cases, since one macro-operator can produce a small global change in the world, even though the individual operators that make it up produce many undesirable local changes.
For example, consider the 8-puzzle. Once an agent has correctly placed the first four tiles, it is difficult to place the fifth tile without disturbing the first four. Because disturbing previously solved subgoals is detected as a bad thing by heuristic scoring functions, it is strongly resisted. For many problems, including the 8-puzzle and Rubik's cube, weak methods based on heuristic scoring are therefore insufficient. Hence, we either need domain-specific knowledge, or else a new weak method. Fortunately, we can learn the domain-specific knowledge we need in the form of macro-operators. Thus, macro-operators can be viewed as a weak method for learning. In the 8-puzzle, for example, we might have a macro (a complex, prestored sequence of operators) for placing the fifth tile without disturbing any of the first four tiles externally (although in fact they are disturbed within the macro itself). Korf [1985b] gives an algorithm for learning a complete set of macro-operators. This approach contrasts with STRIPS, which learned its MACROPs gradually, from experience. Korf's algorithm runs in time proportional to the time it takes to solve a single problem without macro-operators.
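The following toy sketch shows the sense in which a macro is just a prestored operator sequence applied as a unit. The 8-puzzle encoding and the particular two-move macro are invented; a real macro table in Korf's style would hold one such sequence per tile placement.

```python
# A toy sketch of applying a prestored macro-operator in the 8-puzzle. The
# state is a 3x3 board flattened into a tuple, '_' marks the blank. No
# legality checks are done in this sketch.

def apply_move(state, move):
    """Swap the blank with the neighbor in the named direction."""
    deltas = {"up": -3, "down": +3, "left": -1, "right": +1}
    s = list(state)
    b = s.index("_")
    t = b + deltas[move]
    s[b], s[t] = s[t], s[b]
    return tuple(s)

def apply_macro(state, macro):
    """Apply every primitive move in the macro; intermediate states may
    disturb earlier subgoals, but the net effect is one small change."""
    for move in macro:
        state = apply_move(state, move)
    return state

state = ("1", "2", "3", "4", "_", "5", "6", "7", "8")
print(apply_macro(state, ["right", "down"]))
```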
For example, if the task is weather prediction, the parameters can be such measurements as rainfall and the location of cold fronts. Different functions can be written to combine these parameters to predict sunny, cloudy, rainy, or snowy weather.

• Isolate a set of features that are relevant to the task domain. Define each class as a structure composed of those features.

For example, if the task is to identify animals, the body of each type of animal can be stored as a structure, with various features representing such things as color, length of neck, and feathers.

There are advantages and disadvantages to each of these general approaches. The statistical approach taken by the first scheme presented here is often more efficient than the structural approach taken by the second. But the second is more flexible and more extensible.

Regardless of the way that classes are to be described, it is often difficult to construct, by hand, good class definitions. This is particularly true in domains that are not well understood or that change rapidly. Thus the idea of producing a classification program that can evolve its own class definitions is appealing. This task of constructing class definitions is called concept learning, or induction. The techniques used for this task must, of course, depend on the way that classes (concepts) are described. If classes are described by scoring functions, then concept learning can be done using the technique of coefficient adjustment described in Section 17.4.1. If, however, we want to define classes structurally, some other technique for learning class definitions is necessary. In this section, we present three such techniques.
Winston's program learned blocks world concepts such as the House, Tent, and Arch shown in Fig. 17.2. The figure also shows an example of a near miss for each concept. A near miss is an object that is not an instance of the concept in question but that is very similar to such instances.

Fig. 17.2 Some Blocks World Concepts

The program started with a line drawing of a blocks world structure. It used procedures such as the one described in Section 14.3 to analyze the drawing and construct a semantic net representation of the structural description of the object(s). This structural description was then provided as input to the learning program. An example of such a structural description for the House of Fig. 17.2 is shown in Fig. 17.3(a). Node A represents the entire structure, which is composed of two parts: node B, a Wedge, and node C, a Brick. Figures 17.3(b) and 17.3(c) show descriptions of the two Arch structures of Fig. 17.2. These descriptions are identical except for the types of the objects on the top: one is a Brick while the other is a Wedge. Notice that the two supporting objects are related not only by left-of and right-of links, but also by a does-not-marry link, which says that the two objects do not marry. Two objects marry if they have faces that touch and they have a common edge. The marry relation is critical in the definition of an Arch. It is the difference between the first arch structure and the near miss arch structure shown in Fig. 17.2.

The basic approach that Winston's program took to the problem of concept formation can be described as follows:

1. Begin with a structural description of one known instance of the concept. Call that description the concept definition.
Fig. 17.3 Structural Descriptions: (a) the House, (b) and (c) the Two Arches
2. Examine descriptions of other known instances of the concept. Generalize the definition to include them.
3. Examine descriptions of near misses of the concept. Restrict the definition to exclude these.

Steps 2 and 3 of this procedure can be interleaved.

Steps 2 and 3 of this procedure rely heavily on a comparison process by which similarities and differences between structures can be detected. This process must function in much the same way as does any other matching process, such as one to determine whether a given production rule can be applied to a particular problem state. Because differences as well as similarities must be found, the procedure must perform not just literal but also approximate matching. The output of the comparison procedure is a skeleton structure describing the commonalities between the two input structures. It is annotated with a set of comparison notes that describe specific similarities and differences between the inputs.

To see how this approach works, we trace it through the process of learning what an arch is. Suppose that the arch description of Fig. 17.3(b) is presented first. It then becomes the definition of the concept Arch. Then suppose that the arch description of Fig. 17.3(c) is presented. The comparison routine will return a structure similar to the two input structures except that it will note that the objects represented by the nodes labeled C are not identical. This structure is shown in Fig. 17.4. The c-note link from node C describes the difference the comparison routine found between the two inputs.

Fig. 17.4 The Comparison of Two Arches
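A bare-bones version of this comparison step might look like the sketch below. The dictionary representation of a structural description is our own simplification of Winston's semantic nets, and the two toy descriptions are stand-ins for Figs. 17.3(b) and (c).

```python
# A small sketch of the structural comparison step: collect the links two
# descriptions share into a skeleton, and emit "comparison notes" (c-notes)
# for links where they differ.

def compare(desc1, desc2):
    skeleton, c_notes = {}, []
    for key in desc1.keys() & desc2.keys():
        if desc1[key] == desc2[key]:
            skeleton[key] = desc1[key]            # shared structure
        else:
            c_notes.append((key, desc1[key], desc2[key]))
    return skeleton, c_notes

arch_b = {("B", "supported-by"): "D", ("C", "isa"): "Brick"}
arch_c = {("B", "supported-by"): "D", ("C", "isa"): "Wedge"}
print(compare(arch_b, arch_c))
# e.g. ({('B', 'supported-by'): 'D'}, [(('C', 'isa'), 'Brick', 'Wedge')])
```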
Car023
  origin: Japan
  manufacturer: Honda
  color: Blue
  decade: 1970
  type: Economy

Fig. 17.7 An Example of the Concept Car
Suppose that each slot may contain only the discrete values shown in Fig. 17.8. The choice of features and values is called the bias of the learning system. By being embedded in a particular program and by using particular representations, every learning system is biased, because it learns some things more easily than others. In our example, the bias is fairly simple (e.g., we can learn concepts that have to do with car manufacturers, but not car owners). In more complex systems, the bias is less obvious. A clear statement of the bias of a learning system is very important to its evaluation.

origin ∈ {Japan, USA, Britain, Germany, Italy}
manufacturer ∈ {Honda, Toyota, Ford, Chrysler, Jaguar, BMW, ...}
color ∈ {Blue, Green, Red, White}

Fig. 17.8 The Discrete Values Allowed in Each Slot
origin: x1
mfr: x2
color: x3
decade: x4
type: x5

Fig. 17.12 Positive and Negative Examples of the Concept "Japanese Economy Car"
G = {(x1, x2, x3, x4, x5)}
S = {(Japan, Honda, Blue, 1980, Economy)}

Now we are ready to process the second example. The G set must be specialized in such a way that the negative example is no longer in the version space. In our representation language, specialization involves replacing variables with constants. (Note: The G set must be specialized only to descriptions that are within the current version space, not outside of it.) Here are the available specializations:
The S set is unaffected by the negative example. Now we come to the third example, a positive one. The first order of business is to remove from the G set any descriptions that are inconsistent with the positive example. Our new G set is:

We must now generalize the S set to include the new example. This involves replacing constants with variables. Here is the new S set:

At this point, the S and G sets specify a version space (a space of candidate descriptions) that can be translated roughly into English as: "The target concept may be as specific as 'Japanese, blue economy car,' or as general as either 'blue car' or 'economy car.'"

Next, we get another negative example, a car whose origin is USA. The S set is unaffected, but the G set must be specialized to avoid covering the new example. The new G set is:

We now know that the car must be Japanese, because all of the descriptions in the version space contain Japan as origin.³ Our final example is a positive one. We first remove from the G set any descriptions that are inconsistent with it, leaving:

S and G are both singletons, so the algorithm has converged on the target concept. No more examples are needed.
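A compact sketch of the update rules for this conjunctive representation is given below. The '?' marker plays the role of a variable slot, and the particular training instances are invented stand-ins for the (garbled) examples of Fig. 17.12, chosen so that the run matches the narration above.

```python
# A compact sketch of candidate elimination for the conjunctive car language.

ANY = "?"

def covers(h, example):
    return all(hi == ANY or hi == ei for hi, ei in zip(h, example))

def generalize(s, positive):
    """Minimal generalization of S: mismatching slots become variables."""
    return tuple(si if si == ei else ANY for si, ei in zip(s, positive))

def specialize(g, s, negative):
    """Replace one variable at a time with S's constant, keeping only
    specializations that exclude the negative (and still cover S)."""
    return [g[:i] + (s[i],) + g[i + 1:]
            for i in range(len(g))
            if g[i] == ANY and s[i] != negative[i]]

S = ("Japan", "Honda", "Blue", "1980", "Economy")    # first positive example
G = [(ANY,) * 5]

negative = ("Japan", "Toyota", "Green", "1970", "Sports")
G = [spec for g in G for spec in specialize(g, S, negative)]

positive = ("Japan", "Toyota", "Blue", "1990", "Economy")
G = [g for g in G if covers(g, positive)]            # prune inconsistent G
S = generalize(S, positive)

print(S)  # ('Japan', '?', 'Blue', '?', 'Economy') -- Japanese blue economy car
print(G)  # [('?', '?', 'Blue', '?', '?'), ('?', '?', '?', '?', 'Economy')]
```

Note how after the third example S and G bracket exactly the version space described in the text: as specific as "Japanese, blue economy car," as general as "blue car" or "economy car."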
There are several things to note about the candidate elimination algorithm. First, it is a least-commitment algorithm. The version space is pruned as little as possible at each step. Thus, even if all the positive training examples are Japanese cars, the algorithm will not reject the possibility that the target concept may include cars of other origin until it receives a negative example that forces the rejection. This means that if the training data are sparse, the S and G sets may never converge to a single description; the system may learn only partially specified concepts. Second, the algorithm involves exhaustive, breadth-first search through the version space. We can see this in the algorithm for updating the G set. Contrast this with the depth-first behavior of Winston's learning program. Third, in our simple representation language, the S set always contains exactly one element, because any two positive examples always have exactly one generalization. Other representation languages may not share this property.
³It could be the case that our target concept is "not Chrysler," but we will ignore this possibility because our representation language is not powerful enough to express negation and disjunction.
The version space approach can be applied to a wide variety of learning tasks and representation languages. The algorithm above can be extended to handle continuously valued features and hierarchical knowledge (see Exercises). However, version spaces have several deficiencies. One is the large space requirements of the exhaustive, breadth-first search mentioned above. Another is that inconsistent data, also called noise, can cause the candidate elimination algorithm to prune the target concept from the version space prematurely. In the car example above, if the third training instance had been mislabeled (-) instead of (+), the target concept "Japanese economy car" would never be reached. Also, given enough erroneous negative examples, the G set can be specialized so far that the version space becomes empty. In that case, the algorithm concludes that no concept fits the training examples.

One solution to this problem [Mitchell, 1978] is to maintain several G and S sets. One G set is consistent with all the training instances, another is consistent with all but one, another with all but two, etc. (and the same for the S set). When an inconsistency arises, the algorithm switches to G and S sets that are consistent with most, but not all, of the training examples. Maintaining multiple version spaces can be costly, however, and the S and G sets are typically very large. If we assume bounded inconsistency, i.e., that instances close to the target concept boundary are the most likely to be misclassified, then more efficient solutions are possible. Hirsh [1990] presents an algorithm that runs as follows. For each instance, we form a version space consistent with that instance plus other nearby instances (for some suitable definition of nearby). This version space is then intersected with the one created for all previous instances. We keep accepting instances until the version space is reduced to a small set of candidate concept descriptions. (Because of inconsistency, it is unlikely that the version space will converge to a singleton.) We then match each of the concept descriptions against the entire data set, and choose the one that classifies the instances most accurately.
Another problem with the candidate elimination algorithm is the learning of disjunctive concepts. Suppose we wanted to learn the concept of "European car," which, in our representation, means either a German, British, or Italian car. Given positive examples of each, the candidate elimination algorithm will generalize to cars of any origin. Given such a generalization, a negative instance (say, a Japanese car) will only cause an inconsistency of the type mentioned above.

Of course, we could simply extend the representation language to include disjunctions. Thus, the concept space would hold descriptions such as "Blue car of German or British origin" and "Italian sports car or German luxury car." This approach has two drawbacks. First, the concept space becomes much larger and specialization becomes intractable. Second, generalization can easily degenerate to the point where the S set contains simply one large disjunction of all positive instances. We must somehow force generalization while allowing for the introduction of disjunctive descriptions. Mitchell [1978] gives an iterative approach that involves several passes through the training data. On each pass, the algorithm builds a concept that covers the largest number of positive training instances without covering any negative training instances. At the end of the pass, the positive training instances covered by the new concept are removed from the training set, and the new concept then becomes one disjunct in the eventual disjunctive concept description. When all positive training instances have been removed, we are left with a disjunctive concept that covers all of them without covering any negative instances.
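The iterative approach can be sketched as a covering loop. The single-conjunct learner below is a deliberately crude stand-in for whatever concept learner is used on each pass (it undergeneralizes); only the remove-covered-positives-and-repeat structure corresponds to Mitchell's method as described above. The example data, including the Fiat entry, are invented.

```python
# A hedged sketch of the covering approach to disjunctive concepts: on each
# pass, build one conjunct that covers some positives and no negatives, then
# remove the positives it covers and repeat.

ANY = "?"

def covers(h, e):
    return all(hi == ANY or hi == ei for hi, ei in zip(h, e))

def learn_conjunct(positives, negatives):
    """Generalize from a seed positive, slot by slot, as long as no
    negative example becomes covered. Deliberately crude."""
    h = positives[0]
    for i in range(len(h)):
        candidate = h[:i] + (ANY,) + h[i + 1:]
        if not any(covers(candidate, n) for n in negatives):
            h = candidate
    return h

def learn_disjunction(positives, negatives):
    disjuncts, remaining = [], list(positives)
    while remaining:
        d = learn_conjunct(remaining, negatives)
        disjuncts.append(d)
        remaining = [p for p in remaining if not covers(d, p)]
    return disjuncts

positives = [("Germany", "BMW"), ("Britain", "Jaguar"), ("Italy", "Fiat")]
negatives = [("Japan", "Honda"), ("USA", "Ford")]
print(learn_disjunction(positives, negatives))
```

Each pass is guaranteed to cover at least its own seed example, so the loop terminates with a disjunction covering every positive and no negative.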
There are a number of other complexities, including the way in which features interact with one another. For example, if the origin of a car is Japan, then the manufacturer cannot be Chrysler. The version space algorithm as described above makes no use of such information. Also, in our example, it would be more natural to replace the decade slot with a continuously valued year field. We would have to change our procedures for updating the S and G sets to account for this kind of numerical data.
⁴Actually, the decision tree representation is more general: leaves can denote any of a number of classes, not just positive and negative.
But people seem to be able to learn quite a bit from single examples. Consider a chess player who, as Black, has reached the position shown in Fig. 17.14. The position is called a "fork" because the white knight attacks both the black king and the black queen. Black must move the king, thereby leaving the queen open to capture. From this single experience, Black is able to learn quite a bit about the fork trap: the idea is that if any piece x attacks both the opponent's king and another piece y, then piece y will be lost. We don't need to see dozens of positive and negative examples of fork positions in order to draw these conclusions. From just one experience, we can learn to avoid this trap in the future and perhaps to use it to our own advantage.

Fig. 17.14 A Fork Position in Chess

What makes such single-example learning possible? The answer, not surprisingly, is knowledge. The chess player has plenty of domain-specific knowledge that can be brought to bear, including the rules of chess and any previously acquired strategies. That knowledge can be used to identify the critical aspects of the training example. In the case of the fork, we know that the double simultaneous attack is important while the precise position and type of the attacking piece is not.
Much of the recent work in machine learning has moved away from the empirical, data-intensive approach described in the last section toward this more analytical, knowledge-intensive approach. A number of independent studies led to the characterization of this approach as explanation-based learning. An EBL system attempts to learn from a single example x by explaining why x is an example of the target concept. The explanation is then generalized, and the system's performance is improved through the availability of this knowledge.

Mitchell et al. [1986] and DeJong and Mooney [1986] both describe general frameworks for EBL programs and give general learning algorithms. We can think of EBL programs as accepting the following as input:

• A Training Example - What the learning program "sees" in the world, e.g., the car of Fig. 17.7
• A Goal Concept - A high-level description of what the program is supposed to learn
• An Operationality Criterion - A description of which concepts are usable
• A Domain Theory - A set of rules that describe relationships between objects and actions in a domain

From this, EBL computes a generalization of the training example that is sufficient to describe the goal concept, and also satisfies the operationality criterion.
Let's look more closely at this specification. The training example is a familiar input: it is the same thing as the example in the version space algorithm. The goal concept is also familiar, but in previous sections, we have viewed the goal concept as an output of the program, not an input. The assumption here is that the goal concept is not operational, just like the high-level card-playing advice described in Section 17.3. An EBL program seeks to operationalize the goal concept by expressing it in terms that a problem-solving program can understand. These terms are given by the operationality criterion. In the chess example, the goal concept might be something like "bad position for Black," and the operationalized concept would be a generalized description of situations similar to the training example, given in terms of pieces and their relative positions. The last input to an EBL program is a domain theory, in our case, the rules of chess. Without such knowledge, it is impossible to come up with a correct generalization of the training example.

Explanation-based generalization (EBG) is an algorithm for EBL described in Mitchell et al. [1986]. It has two steps: (1) explain and (2) generalize. During the first step, the domain theory is used to prune away all the unimportant aspects of the training example with respect to the goal concept. What is left is an explanation of why the training example is an instance of the goal concept. This explanation is expressed in terms that satisfy the operationality criterion. The next step is to generalize the explanation as far as possible while still
describing the goal concept. Following our chess example, the first EBL step chooses to ignore White's pawns, king, and rook, and constructs an explanation consisting of White's knight, Black's king, and Black's queen, each in their specific positions. Operationality is ensured: all chess-playing programs understand the basic concepts of piece and position. Next, the explanation is generalized. Using domain knowledge, we find that moving the pieces to a different part of the board is still bad for Black. We can also determine that other pieces besides knights and queens can participate in fork attacks.
In reality, current EBL methods run into difficulties in domains as complex as chess, so we will not pursue this example further. Instead, let's look at a simpler case. Consider the problem of learning the concept Cup [Mitchell et al., 1986]. Unlike the arch-learning program of Section 17.5.1, we want to be able to generalize from a single example of a cup. Suppose the example is:

• Training Example:
  owner(Object23, Ralph) ∧ has-part(Object23, Concavity12) ∧
  is(Object23, Light) ∧ color(Object23, Brown) ∧ ...

Clearly, some of the features of Object23 are more relevant to its being a cup than others. So far in this chapter, we have seen several methods for isolating relevant features. These methods all require many positive and negative examples. In EBL we instead rely on domain knowledge, such as:

• Domain Knowledge:
  is(x, Light) ∧ has-part(x, y) ∧ isa(y, Handle) → liftable(x)
  has-part(x, y) ∧ isa(y, Bottom) ∧ is(y, Flat) → stable(x)
  has-part(x, y) ∧ isa(y, Concavity) ∧ is(y, Upward-Pointing) → open-vessel(x)
Cup(Object23)
  liftable(Object23)
    is(Object23, Light)
    has-part(Object23, Handle16)
    isa(Handle16, Handle)
  stable(Object23)
  open-vessel(Object23)
    has-part(Object23, Concavity12)
    isa(Concavity12, Concavity)
    isa(Concavity12, Upward-Pointing)
This definition satisfies the operationality criterion and could be used by a robot to classify objects. Simply replacing constants by variables worked in this example, but in some cases it is necessary to retain certain constants. To catch these cases, we must reprove the goal. This process, which we saw earlier in our discussion of learning in STRIPS, is called goal regression.
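Here is a heavily simplified sketch of the two EBG steps for the Cup example. The propositional rules below are stand-ins for the predicate-calculus domain theory above, and the generalization step is reduced to a comment; real EBG regresses the goal through the proof, retaining constants wherever goal regression demands it.

```python
# A hedged sketch of EBG step 1 (explain): backward-chain through the domain
# rules, keeping only the leaf facts the proof actually uses and pruning
# everything else (like ownership and color).

RULES = {  # head -> list of body conditions
    "cup":         ["liftable", "stable", "open-vessel"],
    "liftable":    ["is-light", "has-handle"],
    "stable":      ["has-flat-bottom"],
    "open-vessel": ["has-upward-concavity"],
}

FACTS = {"is-light", "has-handle", "has-flat-bottom",
         "has-upward-concavity", "owned-by-ralph", "is-brown"}

def explain(goal):
    """Return the leaf facts supporting the goal."""
    if goal in RULES:
        leaves = []
        for sub in RULES[goal]:
            leaves += explain(sub)
        return leaves
    if goal in FACTS:
        return [goal]
    raise ValueError("cannot explain " + goal)

print(explain("cup"))
# ['is-light', 'has-handle', 'has-flat-bottom', 'has-upward-concavity']
# Step 2 (generalize) would conjoin these leaves with Object23 replaced by
# a variable x, yielding an operational definition of Cup(x).
```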
As we have seen, EBL depends strongly on a domain theory. Given such a theory, why are examples needed at all? We could have operationalized the goal concept Cup without reference to an example, since the domain theory contains all of the requisite information. The answer is that examples help to focus the learning on relevant operationalizations. Without an example cup, EBL is faced with the task of characterizing the entire range of objects that satisfy the goal concept. Most of these objects will never be encountered in the real world, and so the result will be overly general.

Providing a tractable domain theory is a difficult task. There is evidence that humans do not learn with very primitive relations. Instead, they create incomplete and inconsistent domain theories. For example, returning to chess, such a theory might include concepts like "weak pawn structure." Getting EBL to work in ill-structured domain theories is an active area of research (see, e.g., Tadepalli [1989]).
EBL shares many features of all the learning methods described in earlier sections. Like concept learning, EBL begins with a positive example of some concept. As in learning by advice taking, the goal is to operationalize some piece of knowledge. And EBL techniques, like the techniques of chunking and macro-operators, are often used to improve the performance of problem-solving engines. The major difference between EBL and other learning methods is that EBL programs are built to take advantage of domain knowledge. Since learning is just another kind of problem solving, it should come as no surprise that there is leverage to be found in knowledge.
17.7 DISCOVERY
Learning is the process by which one entity acquires knowledge. Usually that knowledge is already possessed by some number of other entities who may serve as teachers. Discovery is a restricted form of learning in which one entity acquires knowledge without the help of a teacher.⁵ In this section, we look at three types of automated discovery systems.

⁵Sometimes there is no one in the world who has the knowledge we seek. In that case, the kind of action we must take is called scientific discovery.
iterative: (for x > 1): For i from 2 to sqrt(x), ¬(i | x)
examples: 2, 3, 5, 7, 11, 13, 17
  boundary: 2, 3
  boundary-failures: 0, 1
  failures: 12
generalizations: Number, numbers with an even number of divisors
specializations: Odd primes, prime pairs, prime uniquely addables
conjecs: Unique factorization, Goldbach's conjecture, extremes of number-of-divisors-of
intus: A metaphor to the effect that primes are the building blocks of all numbers
analogies:
  Maximally divisible numbers are converse extremes of number-of-divisors-of
  Factor a nonsimple group into simple groups
interest: Conjectures tying primes to times, to divisors-of, to related operations
worth: 800

Fig. 17.16 An AM Concept: Prime Number
• If f is a function from A to B and B is ordered, then consider the elements of A that are mapped into extremal elements of B. Create a new concept representing this subset of A.
• If some (but not most) examples of some concept X are also examples of another concept Y, create a new concept representing the intersection of X and Y.
• If very few examples of a concept X are found, then add to the agenda the task of finding a generalization of X.

Fig. 17.17 Some AM Heuristics
In one run, AM discovered the concept of prime numbers. How did it do that? Having stumbled onto the natural numbers, AM explored operations such as addition, multiplication, and their inverses. It created the concept of divisibility and noticed that some numbers had very few divisors. AM has a built-in heuristic that tells it to explore extreme cases. It attempted to list all numbers with zero divisors (finding none), one divisor (finding one: 1), and two divisors. AM was instructed to call the last concept "primes." Before pursuing this concept, AM went on to list numbers with three divisors, such as 49. AM tried to relate this property with other properties of 49, such as its being odd and a perfect square. AM generated other odd numbers and other perfect squares to test its hypotheses. A side effect of determining the equivalence of perfect squares with numbers with three divisors was to boost the "interestingness" rating of the divisor concept. This led AM to investigate ways in which a number could be broken down into factors. AM then noticed that there was only one way to break a number down into prime factors (known as the Unique Factorization Theorem).
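The arithmetic behind this episode is easy to reproduce. The loop below is ours, not AM's (AM worked over frame-represented concepts and an agenda of tasks), but it shows what "explore extreme cases of number-of-divisors-of" turns up.

```python
# Group the integers by how many divisors they have and inspect the extremes,
# mimicking the exploration that led AM to primes.

from collections import defaultdict

def divisors(n):
    return [d for d in range(1, n + 1) if n % d == 0]

by_divisor_count = defaultdict(list)
for n in range(1, 50):
    by_divisor_count[len(divisors(n))].append(n)

print(by_divisor_count[1])   # [1]             -- exactly one divisor
print(by_divisor_count[2])   # [2, 3, 5, ...]  -- the "primes"
print(by_divisor_count[3])   # [4, 9, 25, 49]  -- squares of primes
```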
Since breaking down numbers into multiplicative components turned out to be interesting, AM decided, by analogy, to pursue additive components as well. It made several uninteresting conjectures, such as that every number can be expressed as a sum of 1's. It also found more interesting phenomena, such as the conjecture that every even number greater than 2 can be written as the sum of two primes (Goldbach's conjecture).
n    T      p      V        pV        pV/T     pV/nT
1    300    100    24.96    2496      8.32     8.32
1    300    200    12.48    2496      8.32     8.32
1    300    300    8.32     2496      8.32     8.32
1    310                    2579.2    8.32     8.32
1    320                    2662.4    8.32     8.32
2    320                              16.64    8.32
3    320                              24.96    8.32

Fig. 17.18 BACON Discovering the Ideal Gas Law
BACON has been used to discover a wide variety of scientific laws, such as Kepler's third law, Ohm's law, the conservation of momentum, and Joule's law. The heuristics BACON uses to discover the ideal gas law include noting constancies, finding linear relations, and defining theoretical terms. Other heuristics allow BACON to postulate intrinsic properties of objects and to reason by analogy. For example, if BACON finds a regularity in one set of parameters, it will attempt to generate the same regularity in a similar set of parameters. Since BACON's discovery procedure is state-space search, these heuristics allow it to reach solutions while visiting only a small portion of the search space. In the gas example, BACON comes up with the ideal gas law using a minimal number of experiments.
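A toy version of two of these heuristics, noting constancies and defining a theoretical term as the product of two columns, is sketched below. The tolerance and the product-only search are our simplifications; BACON also looks for linear relations, ratios, and more.

```python
# A hedged sketch of two BACON-style heuristics: notice when a column of
# data is constant, and try products of column pairs as new theoretical
# terms. Run here on the p and V columns at fixed n and T.

def is_constant(values, tol=1e-6):
    return max(values) - min(values) <= tol * max(abs(v) for v in values)

def discover(columns):
    """columns: dict name -> list of numbers, all the same length."""
    laws, names = [], list(columns)
    for name in names:
        if is_constant(columns[name]):
            laws.append("%s = %.4g" % (name, columns[name][0]))
    for a in names:                       # define a theoretical term a*b
        for b in names:
            if a < b:
                product = [x * y for x, y in zip(columns[a], columns[b])]
                if is_constant(product):
                    laws.append("%s * %s = %.4g" % (a, b, product[0]))
    return laws

data = {"p": [100.0, 200.0, 300.0], "V": [24.96, 12.48, 8.32]}
print(discover(data))   # ['V * p = 2496'] at fixed n and T
```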
A better understanding of the science of scientific discovery may lead one day to programs that display true creativity. Much more work must be done in areas of science that BACON does not model, such as determining what data to gather, choosing (or creating) instruments to measure the data, and using analogies to previously understood phenomena. For a thorough discussion of scientific discovery programs, see Langley et al. [1987].
17.7.3 Clustering

A third type of discovery, called clustering, is very similar to induction, as we described it in Section 17.5. In inductive learning, a program learns to classify objects based on the labelings provided by a teacher. In clustering, no class labelings are provided. The program must discover for itself the natural classes that exist for the objects, in addition to a method for classifying instances.
...and punishment. Researchers hoped that by studying the learning mechanisms of animals, they might build learning machines from very simple parts. Such hopes proved elusive. However, the field of neural network learning has seen a resurgence in recent years, partly as a result of the discovery of powerful new learning algorithms. Chapter 18 describes these algorithms in detail.

While neural network models are based on a computational "brain metaphor," a number of other learning techniques make use of a metaphor based on evolution. In this work, learning occurs through a selection process that begins with a large population of random programs. Learning algorithms inspired by evolution are called genetic algorithms [Holland, 1975; de Jong, 1988; Goldberg, 1989]. GAs have been dealt with in detail in Chapter 23.
SUMMARY
The most important thing to conclude from our study of automated learning is that learning itself is a problem-solving process. We can cast various learning strategies in terms of the methods of Chapters 2 and 3.

• Learning by taking advice
  - Initial state: high-level advice
  - Final state: an operational rule
  - Operators: unfolding definitions, case analysis, matching, etc.
• Learning from examples
  - Initial state: collection of positive and negative examples
  - Final state: concept description
  - Search algorithms: candidate elimination, induction of decision trees
• Learning in problem solving
  - Initial state: solution traces to example problems
  - Final state: new heuristics for solving new problems efficiently
  - Heuristics for search: generalization, explanation-based learning, utility considerations
• Discovery
  - Initial state: some environment
  - Final state: unknown
  - Heuristics for search: interestingness, analogy, etc.
A learning machine is the dream system of AI. As we have seen in previous chapters, the key to intelligent behavior is having a lot of knowledge. Getting all of that knowledge into a computer is a staggering task. One hope of sidestepping the task is to let computers acquire knowledge independently, as people do. We do not yet have programs that can extend themselves indefinitely. But we have discovered some of the reasons for our failure to create such systems. If we look at actual learning programs, we find that the more knowledge a program starts with, the more it can learn. This finding is satisfying, in the sense that it corroborates our other discoveries about the power of knowledge. But it is also unpleasant, because it seems that fully self-extending systems are, for the present, still out of reach.

Research in machine learning has gone through several cycles of popularity. Timing is always an important consideration. A learning program needs to acquire new knowledge and new problem-solving abilities, but knowledge and problem solving are topics still under intensive study. If we do not understand the nature of the thing we want to learn, learning is difficult. Not surprisingly, the most successful learning programs operate in fairly well-understood areas (like planning), and not in less well-understood areas (like natural language understanding).