
Cognition, 34 (1990) 137-195

A computational learning model for metrical phonology*

B. ELAN DRESHER, University of Toronto
JONATHAN B. KAYE, University of London

Received July 27, 1987, final revision accepted June 1, 1989

Abstract
Dresher, B.E., and Kaye, J.D., 1990. A computational learning model for metrical phonology. Cognition, 34: 137-195

One of the major challenges to linguistic theory is the solution of what has been termed the projection problem. Simply put, linguistics must account for the fact that, starting from a data base that is both unsystematic and relatively small, a human child is capable of constructing a grammar that mirrors, for all intents and purposes, the adult system. In this article we shall address ourselves to the question of the learnability of a postulated subsystem of phonological structure: the stress system. We shall describe a computer program which is designed to acquire this subpart of linguistic structure. Our approach follows the principles and parameters model of Chomsky (1981a, b). This model is particularly interesting both from a computational point of view and with respect to the development of learning theories. We encode the relevant aspects of universal grammar (UG) - those aspects of linguistic structure that are presumed innate

*This research was funded in part by a grant from the Social Sciences and Humanities Research Council of Canada (no. 410-84-0319). This project was first proposed by Jean-Roger Vergnaud, and we thank him for his encouragement and many stimulating discussions. We are indebted to Andrew Mackie, who wrote a program in PL/I for generating metrical trees as a first step toward a learning program for metrical phonology. His discussion of this program (Mackie, 1981) is extremely clear and insightful, and we have greatly benefited from the many interesting suggestions contained in his work. We are grateful to Peter Roosen-Runge for his help and advice on programming, and for his comments and ideas at various stages of this research. We would also like to thank the anonymous Cognition referees for their many constructive suggestions. Portions of this research were presented at the annual meetings of NELS (1985), LSA (1985), and GLOW (1986), as well as at various colloquia and workshops. We wish to thank the following people for their valuable suggestions and comments: Manfred Bierwisch, Jack Chambers, Russell Greiner, Michael Hammond, Norbert Hornstein, and Kenneth Wexler. We alone are responsible for any errors occurring in the text. Requests for reprints may be sent to B. Elan Dresher, Department of Linguistics, University of Toronto, Toronto, Ontario, Canada M5S 1A1, or to Jonathan Kaye, Department of Linguistics, SOAS, University of London, London WC1E 7HP, UK.

0010-0277/90/$18.20 © 1990, Elsevier Science Publishers B.V.


and thus present in every linguistic system. The learning process consists of fixing a number of parameters which have been shown to underlie stress systems and which should, in principle, lead the learner to the postulation of the system from which the primary linguistic data (i.e., the input to the learner) is drawn. We go on to explore certain formal and substantive properties of this learning system. Questions such as cross-parameter dependencies, determinism, subsets, and incremental versus all-at-once learning are raised and discussed in the article. The issues raised by this study provide another perspective on the formal structure of stress systems and the learnability of parameter systems in general.

A central goal of modern linguistic theory is to explain how, on the basis of limited data, a person is able to attain the grammar of his or her language. There is reason to suppose that, with regard to fundamental properties and organizational principles, human languages are not infinitely and arbitrarily varied; rather, particular grammars are constructed in accordance with a set of universal principles which themselves may be analyzed into a series of parameters (see Chomsky, 1981a, b). Acquisition of grammar, on this approach, becomes a matter of correctly fixing the parameters for the grammar one is acquiring. We will discuss several issues that arise from an attempt to construct an explicit learning theory for stress systems along these lines. This learning theory is instantiated in a computer program capable of taking appropriate data from any language as input, and which then attempts to learn the grammar of stress which generates the data it has been exposed to. By learn, we mean that, given input data and a model of universal grammar (UG) which includes a set of open parameters, the program contains a procedure which can correctly fix the parameters, and can then apply the system so as to generate well-formed strings. Why stress? The choice of stress systems is prompted by several considerations. First, there currently exists a well-developed theory of stress within a parametric framework, namely the metrical theory. Second, accentual systems are fairly well documented, both with respect to the basic data, and with respect to detailed analyses of the data (see Halle and Vergnaud, 1987b; Hayes, 1981; Hyman, 1977, for surveys). Third, accentual systems can be studied in relative independence of other aspects of grammar. Finally, the parameters of metrical theory exhibit intricate interactions which surpass in complexity the syntactic parameters studied so far.
The issues raised by this study thus provide another perspective on the learnability of parameter systems, with consequences extending beyond the domain of stress. Why computers? The computer is a useful tool in the study of interacting parameters which combine to create systems of some complexity. The learning program can serve as a laboratory for exploring families of learning theories with a variety of different properties. It should be stressed that our project is closely tied to a certain view of the nature of the language faculty, and in particular of the nature of stress systems. Thus, we adopt as our model of UG a particular parameterized version of a tree-based metrical theory. This theory has characteristics that make it a possible candidate for a theory of UG, and it is therefore interesting to see to what extent it really can serve as the basis of an explicit learning theory. Finally, parameterized theories lend themselves quite well to computer modelling. Parameterized learning theories are only beginning to receive attention and appear to be a promising area of research. To the extent that we want to write a computer program that can emulate aspects of cognitive abilities, our concerns overlap with those of researchers in artificial intelligence (AI), though our approach is quite different. Rather than engage in extended epistemological discussion, let it suffice to say that we adopt a Chomskian view of human cognitive processes. We are of course aware that this view has not received universal acceptance in the AI literature. It is our hope that the results of our research will provide further justification for it.2 The nature of the problem we wish to address can be represented by the conventional diagram illustrated in (1):

(1) The projection problem

Given D, the data of language L, and a grammar, G, which is the grammar of L, how can one learn G from D? It is a basic assumption of the generative program that G is acquired via UG, a set of innate and universal cognitive

2The mere existence of a working program does not in itself demonstrate that our proposals with regard to the learning theory - or anything else - are correct. Rather, it shows that they meet fairly rudimentary criteria of adequacy. Nor would we insist that this criterion is necessarily applicable to all areas of linguistic research - one might discover and study insightful principles of grammar even if these are not related to a particular model of learning, or parsing, or production. For further discussion, see Dresher and Hornstein (1976, 1977) and Winograd (1977).


principles.3 However, as we shall show, if UG is construed as it usually is, as a set of rather abstract universal principles, such as the binding theory, or constraints on metrical structures, then even if we supply UG we will still not be able to account for how G was derived from D. In addition to UG, there must be a learning theory, LT, which relates D to UG. Suppose, for example, that UG dictates that a language may either have property P, or else it must have property Q, and that the possession of P or Q entails a number of other consequences. Now, it might well be true that this theory of UG imposes powerful constraints on possible grammars; nevertheless, it must still be supplemented by principles which enable learners to determine whether the language they are learning is a P language or a Q language. These are the principles we are calling the learning theory, and the need for them will become clear in the course of this discussion. What we shall do, then, is construct models of D, G, and UG, and then see what has to be added in order to achieve a theory with explanatory adequacy, that is, an account of how G could be learned from D. What must be added will be the principles of LT.
2. The data

We will begin with the data, D, which goes through several stages of preprocessing before it becomes the input to the metrical component. We assume the prior operation of rules that convert the signal into words and segments. We assume also that the various acoustic cues which indicate phonological stress are mapped into one of three degrees of stress. After this preliminary processing, the data relevant to learning word stress consists of words with vowels marked as bearing main stress (2), secondary stress (1), or no stress (0). The reader should bear in mind that these representations are drastically oversimplified for expository reasons. Some sample forms are given in (2):4

(2) Sample input data
va1ncou2ve0r (Vancoúver)
a2lge0bra0 (álgebra)

3It may, of course, turn out that the principles of UG are derived from other, perhaps more basic, innate principles; or we may find that they are particular linguistic manifestations of general cognitive principles that apply to other domains as well. Until any of these are shown, however, it is prudent to assume that they are innate and language-particular; see Piattelli-Palmarini (1980) for debate of these issues.
4For expository convenience we keep the ordinary English orthography in these representations; to be more precise, entries should be in phonemic form, e.g., /v a 1 n k u 2 v e 0 r/. The use of orthography here implies no claims about the role of orthography in the acquisition of stress.


Forms like those in (2) serve as the initial input into our model. However, they do not yet represent the input to the stress component. One principle that is shared by most theories of stress is that stress is sensitive to representations built on projections from syllable structure. Moreover, in many languages, stress is sensitive to syllable weight, or quantity. So the first step in the analysis of the input data in (2) involves parsing the words into syllables. The parser analyzes the syllable as containing an onset and a rime. The onset contains all the material before the syllable peak (usually the vowel), while the rime is divided into the nucleus, which contains the peak of the syllable, and a coda, which contains material which follows the nucleus. The words in (2) will be assigned the syllable structures shown in (3):

(3) Syllable structure of sample words
[Tree diagrams (3): each word is parsed into syllables (S), each syllable into an onset (On) and a rime (Ri), and each rime into a nucleus (Nu) and, where present, a coda (Co); shown for va1ncou2ve0r and a2lge0bra0.]

It is generally agreed that onsets are not relevant to stress rules.5 We will take this to be a principle of UG. Therefore, the input relevant to stress consists of rime structures, as shown in (4):

(4) Rime projections
[Tree diagrams (4): the rime projections of the same two words, with onsets stripped away; only the nucleus and coda material remains, each rime carrying its stress digit.]

In our model, the initial input data shown in (2) is transformed into the representations in (4) by the syllabic parser, which constructs the syllable structures in (3) and then strips off the onsets. The structures in (4), then, constitute the input relevant to the stress component.
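The onset-stripping step just described can be sketched in a few lines. This is only an illustrative toy, not the authors' program: it assumes the word has already been divided into syllables, and it uses a naive five-vowel inventory; the names parse_syllable and rime_projection are our own.

```python
VOWELS = set("aeiou")

def parse_syllable(syll):
    """Split one (already syllabified) string into onset, nucleus, coda."""
    i = 0
    while i < len(syll) and syll[i] not in VOWELS:
        i += 1                      # onset: everything before the first vowel
    j = i
    while j < len(syll) and syll[j] in VOWELS:
        j += 1                      # nucleus: the vowel sequence (the peak)
    return {"onset": syll[:i], "nucleus": syll[i:j], "coda": syll[j:]}

def rime_projection(syllables):
    """Strip onsets: the stress component sees only nucleus and coda."""
    return [(s["nucleus"], s["coda"])
            for s in map(parse_syllable, syllables)]

print(rime_projection(["van", "cou", "ver"]))
# [('a', 'n'), ('ou', ''), ('e', 'r')]
```

A branching rime here is simply one whose coda (or second vowel slot) is nonempty, which is the information the quantity-sensitivity parameters below will need.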

5It has been argued that there are a few cases where onsets do play a role in stress, though the mechanism by which they do so has been debated; see Davis (1988) for a review of the issue.


Let us turn now to the metrical theory which serves as our representation of UG. There currently exist a number of variants of metrical theory, differing from each other to various degrees with respect to the appropriate formalism for representing stress. Thus, some versions posit that stress is represented only by metrical trees (Giegerich, 1985; Hammond, 1984; Hayes, 1981); some posit only metrical grids (Prince, 1983; Selkirk, 1984); and some variants posit both trees and grids (Halle & Vergnaud, 1987a, b; Hayes, 1984, 1985; Liberman, 1975; Liberman & Prince, 1977). For purposes of this project, we are assuming a model with trees only, and no grid. While changes in the details of UG have consequences for learnability, the differences between the various versions of metrical theory do not, in our view, greatly affect the results of our discussion. All variants share certain basic assumptions, and it is these that are crucial to our study, and not the differences. In metrical theory, stress patterns, and hence the stress levels observed in the data, are controlled by metrical structures which are built on the rime structures in (4).6 These metrical structures take the form of labelled trees, where, in any group of sister nodes, one node is labelled Strong and the others are labelled Weak. The various possibilities of metrical structure construction and labelling are expressed in terms of a series of binary parameters.

3.1. The metrical parameters

Our model incorporates the eleven parameters listed in (5):

(5) Parameters of metrical theory
P1: The word-tree is strong on the [Left/Right]
P2: Feet are [Binary/Unbounded]
P3: Feet are built from the [Left/Right]
P4: Feet are strong on the [Left/Right]
P5: Feet are quantity sensitive (QS) [Yes/No]
P6: Feet are QS to the [Rime/Nucleus]
P7: A strong branch of a foot must itself branch [No/Yes]
P8A: There is an extrametrical syllable [No/Yes]
6In some languages (e.g., certain pitch accent languages, though not all of them), stress behaves in ways more characteristic of tonal systems, and may be subject to principles of autosegmental association and spreading rather than to metrical principles - see van der Hulst and Smith (1982) for a review of this issue. Which subtheory is in play cannot necessarily be determined simply from the phonetic manifestation of stress. Conversely, metrical principles may apply to phenomena other than stress: for example, alternating vowel lengthening in Tübatulabal (Prince, 1983, p. 66), and Chimwi:ni vowel shortening (Hayes, 1981, p. 67).


P8: It is extrametrical on the [Left/Right]
P9: A weak foot is defooted in clash [No/Yes]
P10: Feet are noniterative [No/Yes]

We will briefly illustrate the effects of these parameters, starting with P1. Main stress in a word is controlled by an unbounded word tree, in which either the leftmost or the rightmost node is labelled Strong. For example, the word tree in (6a) has been constructed with P1 [Left]; this gives initial stress, as in languages like Latvian or Hungarian. Setting P1 [Right] gives fixed final stress, as in French and Farsi.

(6) (a) Word-tree with P1 [Left] constructed on rime projections
[Tree diagram (6a): an unbounded word-tree (M) over the rime projections, with the leftmost node labelled Strong.]

Main stress is not necessarily confined to a peripheral syllable, since P1 can interact with other parameters to produce different results. For example, a peripheral syllable may be designated as extrametrical by P8A and P8, meaning it does not participate in the construction of the word-tree. Extrametricality can result in main stress falling on the second or penultimate syllable; in (6b), it falls on the second syllable, as in Lakota and Araucanian:

(6) (b) P1 [Left], leftmost syllable extrametrical (P8A [Yes], P8 [Left])

[Tree diagram (6b): the leftmost rime, marked (Ri), is extrametrical; the word-tree is built over the remaining rimes with its leftmost node Strong, so main stress falls on the second syllable.]

In languages such as these, only one syllable in each word is marked Strong, while all the rest are Weak. In many languages, however, syllables are first grouped together into feet, and then the word-tree is constructed on the feet.
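As a data structure, a grammar in this model is nothing more than a vector of settings for the parameters in (5). The record below is a hypothetical sketch of the (6b)-type system, with suspended parameters recorded as None; the values for the foot parameters, which are not at issue in (6), are shown at illustrative defaults of our own choosing.

```python
# Hypothetical parameter record for a (6b)-type language (Lakota type):
# initial syllable extrametrical, word-tree strong on the left.
lakota_type = {
    "P1":  "Left",    # word-tree strong on the left
    "P2":  "Binary",  # foot parameters: illustrative defaults only
    "P3":  "Left",    # feet built from the left
    "P4":  "Left",    # feet strong on the left
    "P5":  "No",      # quantity-insensitive
    "P6":  None,      # suspended: P5 is No
    "P7":  "No",      # strong branch need not branch
    "P8A": "Yes",     # there is an extrametrical syllable ...
    "P8":  "Left",    # ... on the left edge
    "P9":  "No",      # no defooting
    "P10": "No",      # feet iterate
}
```

Fixing a grammar, on this view, is just filling in such a record; the learning problem is deciding which data justify which entries.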


Every foot receives a stress; hence, languages with feet also exhibit secondary stresses. If a language has feet, a number of other parameters come into play. P2 allows feet to be at most binary, or else unbounded. Suppose we choose binary feet, which give rise to an alternating pattern of weak and strong syllables. We must then choose P3, the direction of construction, which may be either from left to right or from right to left. In addition, we must select P4, which allows each foot to be left-dominant or right-dominant. We present two illustrative examples: Maranungku, in (7), has P3 [Left] and P4 [Left] - that is, left-dominant feet constructed from the left; and Warao, in (8), has left-dominant feet constructed from the right - that is, P3 [Right], P4 [Left]:

(7) P3 [Left], P4 [Left] (Maranungku)
[Tree diagrams (7): left-dominant binary feet built from the left; (a) merepet, stress string 2 0 1; (b) yangarmata, stress string 2 0 1 0.]

(8) P3 [Right], P4 [Left] (Warao)
[Tree diagrams (8): left-dominant binary feet built from the right; (a) yapurukitanehase, stress string 1 0 1 0 1 0 2 0; (b) yiwaranae, stress string 0 1 0 2 0.]

Word-trees have been omitted from the examples; however, they would be constructed on the feet, with main stress devolving upon the strongest vowel in the strong foot. Depending on how various parameters are set, the strong syllable of the strong foot may be fairly far from the periphery; in this way, main stress can be quite variable, despite the limited choices in word-tree construction. Note one additional fact about the Warao word in (8b): the first syllable, being alone in a foot, ought to receive a secondary stress. Its stresslessness is due to the setting of the defooting parameter, P9, to [Yes]. Warao apparently does not tolerate stress clashes, wherein two adjacent syllables are stressed; hence, the nonbranching foot is defooted, and the first syllable does not receive a stress.7

The feet in (7) and (8) are not affected by the internal structure of the rimes on which they are constructed; in such languages, foot construction, and hence stress, is said to be insensitive to quantity (QI) - select P5 [No]. However, many languages have quantity-sensitive (QS) stress systems. Quantity sensitivity, in the theory being adopted here, means that a branching rime (or a branching nucleus, depending on the setting of P6) may not occupy a weak position in a foot; hence, the configurations in (9) are ill-formed:

(9) Quantity sensitivity
(a) P6 [Rime]: a branching Rime may not occupy the Weak branch of a foot; a branching Rime must be Strong.
(b) P6 [Nucleus]: a branching Nucleus may not occupy the Weak branch of a foot; a branching Nucleus must be Strong.

7Except where otherwise noted, all languages discussed in this article can be found in Hayes (1981) and Halle and Vergnaud (1987b), which should be referred to for further details and sources.
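The quantity-insensitive footing and labelling illustrated in (7) and (8), together with the clash-driven defooting of (8b), can be sketched as follows. This is our own simplified reconstruction, not the authors' program: syllables are just counted, the word-tree is reduced to picking the strongest foot, and defooting checks only a following clash.

```python
def binary_feet_stress(n, p1, p3, p4, p9=False):
    """Quantity-insensitive binary footing (sketch).
    n: number of syllables; p1: word-tree strong edge; p3: direction of
    foot construction; p4: strong edge within each foot; p9: defoot a
    nonbranching foot whose stress clashes with a following stress."""
    syl = list(range(n))
    if p3 == "Left":
        feet = [syl[i:i + 2] for i in range((0), n, 2)]
    else:                       # build from the right: leftover syllable first
        r = n % 2
        feet = ([syl[:1]] if r else []) + [syl[i:i + 2] for i in range(r, n, 2)]
    # the head (stressed syllable) of each foot
    heads = [f[0] if p4 == "Left" or len(f) == 1 else f[1] for f in feet]
    stress = [0] * n
    for h in heads:
        stress[h] = 1
    # word tree: the foot at the P1 edge carries main stress
    stress[heads[0] if p1 == "Left" else heads[-1]] = 2
    if p9:                      # simplified clash resolution, as in (8b)
        for f, h in zip(feet, heads):
            if len(f) == 1 and stress[h] == 1 and h + 1 < n and stress[h + 1]:
                stress[h] = 0
    return "".join(map(str, stress))

print(binary_feet_stress(4, "Left", "Left", "Left"))          # Maranungku: 2010
print(binary_feet_stress(8, "Right", "Right", "Left"))        # Warao: 10101020
print(binary_feet_stress(5, "Right", "Right", "Left", True))  # yiwaranae: 01020
```

With four parameters the sketch already reproduces both alternating patterns and the defooted initial syllable of (8b).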

Syllables containing branching rimes or nuclei are called heavy; those that do not are called light. It follows from (9) that in quantity-sensitive stress systems all heavy syllables are stressed. The presence of heavy syllables can considerably disrupt the smooth alternation of stresses we have observed up to here. Suppose, for example, that Maranungku feet were quantity-sensitive to a branching rime; the feet would appear as in (10):

(10) Left-dominant feet built from the left, QS to the rime
[Tree diagrams (10): merepet, all light, is footed as in (7), stress string 2 0 1; yangarmata, whose second syllable has a branching rime, is footed (ya)(ngar ma)(ta), stress string 2 1 0 1.]

The second syllable of yangarmata branches; hence, it may not occupy the weak branch of a foot. So the first syllable must form a unary foot by itself, and a new foot begins with the second syllable. All the examples we have considered to here have involved binary feet, P2 [B]. In languages with binary feet, main stress can never fall more than a certain number of syllables from the edge of a word. Barring other complications, that number is three, if we allow an extrametrical syllable; for stress must fall on either the first or the last foot, and a binary foot has at most two syllables. Also, in languages with binary feet, syllables are stressed by virtue of their position; in QS languages, syllable quantity also plays a role, but no heavy syllable may receive main stress if it occurs too many syllables from the edge of the word. Languages with unbounded feet, by contrast, display a different pattern. In these languages, quantity is the most important factor in stress assignment, while position is secondary. The typical formulation of main stress in such languages is: stress the [rightmost/leftmost] heavy syllable. This type of pattern can be accounted for by positing quantity-sensitive unbounded feet; an example is Eastern Cheremis:8

(11) P1 [Right], P2 [U], P5 [Yes], P6 [Nucleus], P4 [Left] (E. Cheremis)
[Tree diagrams (11): (a) a word with no heavy syllables, spanned by a single left-dominant foot, so main stress is initial; (b) a word containing heavy syllables, in which the rightmost heavy syllable heads the strong foot and bears main stress.]

Notice that in words with no heavy syllables, as in (11a), a single foot extends over the whole word, and main stress is dictated by the branching of the foot-tree, not the word-tree. The value of P7 in these languages is No, which means that the dominant (i.e., S) branch of a foot does not itself have to branch (i.e., contain a heavy syllable). If we set the value of P7 to Yes, then only heavy syllables can anchor a foot; light syllables that are not part of a foot are joined as stray syllables to the word-tree. Words not having a heavy syllable will receive main stress directly from the word-tree. A language with this pattern is Aguacatec Mayan, in which stress falls on the last syllable containing a long vowel, otherwise on the last syllable of a word:

8We assume secondary stresses in E. Cheremis, though we have no information as to whether they appear on the surface.


(12) P1 [Right], P2 [U], P5 [Yes], P6 [Nucleus], P7 [Yes] (Aguacatec Mayan)
[Tree diagrams (12): two words, transcribed alu? and kasa?; in one, the syllable containing a long vowel anchors a foot and bears main stress; in the other, which has no long vowel, main stress falls on the final syllable directly from the word-tree.]
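The two unbounded-foot patterns of (11) and (12) reduce to a simple decision procedure for the position of main stress. The sketch below is our own summary of that logic, not the authors' implementation; it returns a 0-based syllable index.

```python
def unbounded_qs_main_stress(weights, p1="Right", p4="Left", p7="No"):
    """Main-stress position in an unbounded QS system (sketch).
    weights: list of 'H'/'L'. If there are heavy syllables, the P1 edge
    selects the rightmost/leftmost heavy. If there are none: under P7 [No]
    a single foot spans the word and its P4-strong end is stressed
    (E. Cheremis type); under P7 [Yes] the word-tree assigns stress
    directly at the P1 edge (Aguacatec type)."""
    heavies = [i for i, w in enumerate(weights) if w == "H"]
    if heavies:
        return heavies[-1] if p1 == "Right" else heavies[0]
    if p7 == "Yes":
        return len(weights) - 1 if p1 == "Right" else 0
    return 0 if p4 == "Left" else len(weights) - 1

print(unbounded_qs_main_stress(["L", "H", "L", "H", "L"]))   # rightmost heavy: 3
print(unbounded_qs_main_stress(["L", "L", "L"]))             # Cheremis type: 0
print(unbounded_qs_main_stress(["L", "L", "L"], p7="Yes"))   # Aguacatec type: 2
```

Note how P7 only becomes observable in words with no heavy syllable, a point that matters for the cue-based learner discussed below.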

Finally, P10 is for languages in which words have only main stress, but no secondary stresses. P10 [Yes] means that only the strongest foot in a word tree will count, while other feet are suppressed.

3.2. Possible stress systems

Let us consider how many possible stress systems this set of parameters gives. We have 11 parameters, each of which can have two values. However, the parameters are not all independent, so we do not in fact have 2^11 (2048) systems. Here are the built-in dependencies:

(a) P6 is suspended unless P5 is QS, so P5 and P6 together yield 3 values.
(b) P8 is suspended unless P8A is Yes, so these together also have 3 values.
(c) If P5 is QI, unbounded feet become vacuous, so P2 must be Binary; so P2 with P5-P6 allows 5 rather than 6 possibilities.
(d) We assume that P7 requires P5 QS. Also, if P7 is Yes, then there is no observable difference between binary and unbounded feet. We will stipulate that P2 B is not available with P7 Yes; so P7 together with P2-P5-P6 has 7 values.
(e) P3 is suspended when P2 is U; P4 is also suspended when P7 is Yes (i.e., it is impossible to determine constituency in such systems in this model); so parameters P2, P3, P4, P5, P6, P7 together yield 18 independent values.

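The dependency bookkeeping can be checked by brute-force enumeration. The sketch below (our own, with parameter names following (5) and P9 set aside) normalizes each of the 2^10 raw settings by nulling out suspended parameters and discarding excluded combinations, then counts the distinct results.

```python
from itertools import product

def canonical(p1, p2, p3, p4, p5, p6, p7, p8a, p8, p10):
    """Null out suspended parameters; return None for excluded combinations."""
    if p5 == "QI":
        p6, p7 = None, "No"        # (a); and P7 presupposes QS, see (d)
        if p2 == "U":
            return None            # (c): QI unbounded feet are vacuous
    if p7 == "Yes" and p2 == "B":
        return None                # (d): binary feet unavailable with P7 Yes
    if p2 == "U":
        p3 = None                  # (e): direction moot for unbounded feet
    if p7 == "Yes":
        p4 = None                  # (e): constituency undeterminable
    if p8a == "No":
        p8 = None                  # (b): no extrametrical edge to choose
    return (p1, p2, p3, p4, p5, p6, p7, p8a, p8, p10)

systems = {c for c in (canonical(*combo) for combo in product(
    ["Left", "Right"],             # P1
    ["B", "U"],                    # P2
    ["Left", "Right"],             # P3
    ["Left", "Right"],             # P4
    ["QS", "QI"],                  # P5
    ["Rime", "Nucleus"],           # P6
    ["No", "Yes"],                 # P7
    ["No", "Yes"],                 # P8A
    ["Left", "Right"],             # P8
    ["No", "Yes"]))                # P10
    if c is not None}

print(len(systems))   # 216; adding a binary P9 doubles this to 432
```

The count agrees with the figure arrived at analytically in the text.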


Multiplying 18 by the remaining values of parameters P1 (2), P8A-P8 (3), P9 (2), and P10 (2) yields 432 different stress systems. However, P9 (Defooting) is not really a binary parameter; as we shall see below, there are a number of ways of defooting that languages may choose, and these increase the number of possible stress systems. Anticipating the results of our discussion in Section 5.4, it is better to let the learning system abstract away from destressing altogether. Removing P9 from consideration, the solution space consists of 216 distinct stress systems.

3.3. Why parameters?

Why use parameters at all to describe stress systems? A rich and highly structured theory of UG is otiose if the same results can be achieved by simpler means. It is therefore worth considering here some plausible alternatives which posit far less specific cognitive machinery. We can begin with the observation that stress patterns are sensitive to sequences of syllables and syllable weight. We have been assuming that there are three types of syllable that must be distinguished for purposes of stress: light syllables (L) and two kinds of heavy syllables (H): those that branch at the level of the Rime (R), and those that branch at the level of the Nucleus (N). Let us assume that any theory of stress ought to encode at least that much information, so we would not have to consider the infinite number of possible but nonexistent rules which relate stress to other features of the phonetic string. Imagine now a minimalist theory of stress which simply maps strings of weighted syllables (weight strings) into sequences of stresses (stress strings).9 Consider Maranungku: as we noted above, it has main stress on the initial syllable of a word, and secondary stresses on every odd syllable thereafter. All syllables have equal weight in Maranungku; to simplify the following discussion, let us assume that all syllables in this language are light.
Now the correspondence between weight strings and stress strings in Maranungku can be summarized in the following table (note that some arbitrary upper limit to the length of the string must be adopted; we will return to this point below):

9The discussion in this section was suggested by an anonymous Cognition referee, to whom we are indebted.


(13) Correspondence between weight strings and stress strings (Maranungku)

Weight string    Stress string
L                2
LL               20
LLL              201
LLLL             2010
LLLLL            20101
LLLLLL           201010
...              ...
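The table's pattern can be stated as a one-line function; this sketch of ours simply places 2 on the first syllable and 1 on every odd-numbered syllable after it (0-based even indices).

```python
def maranungku_stress(n):
    """Maranungku (sketch): main stress on the first syllable, secondary
    stress on every odd syllable thereafter."""
    return "".join("2" if i == 0 else ("1" if i % 2 == 0 else "0")
                   for i in range(n))

print([maranungku_stress(n) for n in range(1, 7)])
# ['2', '20', '201', '2010', '20101', '201010']
```

Unlike the table, the function needs no upper bound on word length - which is precisely the point made about tables below.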

In quantity-sensitive languages, such as Latin, we would have to distinguish syllable weights; again to simplify matters, let us assume only a distinction between light (L) and heavy (H) syllables. The following are tables for words up to three syllables long (Latin has no stressed light monosyllables, so the sequence L has been omitted from the tables):

(14) Correspondence between weight strings and stress strings (Latin)
(a) One- and two-syllable words

Weight string    H    LL   LH   HL   HH
Stress string    2    20   20   20   20

(b) Three-syllable words

Weight string    LLL  LLH  LHL  LHH  HLL  HLH  HHL  HHH
Stress string    200  200  020  020  200  200  020  020
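The Latin tables, too, collapse into a short rule - the familiar "penult if heavy, else antepenult" generalization. The following sketch (ours, restricted to main stress) regenerates every row of (14):

```python
def latin_stress(weights):
    """Classical Latin main stress (sketch): in words of three or more
    syllables, stress the penult if heavy, otherwise the antepenult;
    stress the first syllable of shorter words."""
    n = len(weights)
    if n <= 2:
        pos = 0
    else:
        pos = n - 2 if weights[n - 2] == "H" else n - 3
    return "".join("2" if i == pos else "0" for i in range(n))

print(latin_stress("LHL"), latin_stress("HLH"))   # 020 200
```

That every attested table seems to compress into such a rule, while arbitrary tables do not, is the argument developed next.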

Since the patterns of word stress can be listed in tables of this kind, one might be tempted to reduce the whole theory of stress to the tables themselves. For any language, we would simply record the stress string corresponding to each distinct weight string; the table generated by this procedure would be the grammar of stress for that language. A compelling argument against such a grammar has to do with the distribution of observed stress systems, as opposed to nonexistent ones. The metrical theory outlined above allows for a limited number of basic stress patterns. A grammar of tables is unable to express patterns at all. Even if supplemented with some kind of pattern seeker, we would expect to find various kinds of crazy stress systems:


(a) Crazy nonexistent patterns: e.g., main stress on the middle syllable in odd-syllabled words, and on the syllable to the right of the middle in even-syllabled words. There is a general pattern, but it is unnatural and has never been observed.
(b) Systems where stress placement for individual words is not crazy, but which add up to an unnatural overall pattern: e.g., stress the first syllable in words that are 2, 4, 5, and 8 syllables long, and stress the last syllable in words that are 3, 6, 7, and 9 syllables long.
(c) Finally, we would allow for systems with no discernible pattern at all, such as the one in (15):

(15) Correspondence between weight strings and stress strings (nonexistent)
(a) One- and two-syllable words

Weight string    L    H    LL   LH   HL   HH
Stress string    2    2    02   20   20   02

(b) Three-syllable words

Weight string    LLL  LLH  LHL  LHH  HLL  HLH  HHL  HHH
Stress string    200  020  020  200  002  200  002  200

The extent to which metrical theory cuts down on the number of possible stress systems can be seen by comparing the number of possible patterns allowed by unrestricted weight-string to stress-string mappings to those allowed by metrical theory. Weight strings range over the symbols L, R, and N; as before, however, we will assume that no language distinguishes all three types, so we will restrict weight strings to L and H. With this restriction there are 2^n weight strings for a word of n syllables. Stress strings range over 2, 1, and 0, but their occurrence is not completely free, as there must be one and only one 2 in a string. Therefore, in a string of n syllables, a 2 has n possible locations; the remaining n - 1 positions may be occupied freely by either 0 or 1, giving a total of n x 2^(n-1) stress strings for words of length n syllables. For any n, the number of possible mappings from the set of weight strings W to the set of stress strings S is |S|^|W|, that is, n x 2^(n-1) raised to the power 2^n. The number of distinct systems is this figure summed over all n. The figures for n less than 6 are given in (16):


(16) Number of possible mappings from weight strings to stress strings

Syllables    Weight strings    Stress strings       Mappings
n            W = 2^n           S = n x 2^(n-1)      S^W
1            2                 1                    1
2            4                 4                    256
3            8                 12                   ~4.3 x 10^8
4            16                32                   ~1.2 x 10^24
5            32                80                   ~7.9 x 10^60
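The entries in (16) follow directly from the two counting formulas just given; a few lines suffice to regenerate the table.

```python
def weight_strings(n):
    """|W| = 2^n: each syllable is L or H."""
    return 2 ** n

def stress_strings(n):
    """|S| = n * 2^(n-1): choose where the single 2 goes, then fill the
    remaining n-1 positions freely with 0 or 1."""
    return n * 2 ** (n - 1)

def mappings(n):
    """Unrestricted systems for length n: any function from W to S."""
    return stress_strings(n) ** weight_strings(n)

for n in range(1, 6):
    print(n, weight_strings(n), stress_strings(n), mappings(n))
# n = 3 gives 8, 12, and 12**8 = 429,981,696 (~4.3 x 10^8), as in (16)
```

Even the exact values confirm the qualitative point: the unrestricted solution space explodes hyper-exponentially in word length.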

The number of systems allowed by the metrical theory was calculated above to be 432 with a binary destressing parameter (P9). If we suppose that this parameter is multi-valued, the number of systems allowed remains of the order of 10^3, which contrasts dramatically with the values shown above. Even if we posit the existence of many more completely independent binary parameters, we will not approach the vast number of systems allowed by an unrestricted theory. Abstracting away from destressing makes for a more complicated calculation. However, we can underestimate the number of possibilities in an unrestricted mapping by eliminating secondary stress altogether, limiting the vocabulary of stress strings to 2 and 0. The solution space of the unrestricted theory comes down by many orders of magnitude, but still remains vast: 4^16 (~4.3 x 10^9) possibilities for strings of four syllables, for example. Another major shortcoming of the mapping-by-tables approach is that it requires us to set an arbitrary upper bound on word length. This is undesirable even for languages which happen to have relatively short words, if the upper bound is an arbitrary one. In many languages, however, the domain of word stress more closely approximates what we would call phrases in English, and relatively long words (nine, ten syllables and more) are not a rare occurrence. It is clear, therefore, that the metrical theory outlined above makes nontrivial claims about the number and nature of possible stress systems.

We have now presented samples of input data, as well as the type of UG we are assuming. A particular grammar G must be derived from UG by fixing the values of each parameter. How might this be done? The procedure that we have built into our model incorporates a set of cues which are closely tied

152

B. E. Dresher and J. D. Kaye

to the parameters. Each parameter (or in some cases, a group of parameters) is associated with a specific cue in the data which the learner looks for; the presence (and/or absence, if the set of input data is assumed to be closed) of some cue in the data set serves as a trigger for setting its associated parameter. A learning system of this kind, which we present beginning in Section 5, has interesting properties. One might think, though, that the learning of word stress ought to be a fairly straightforward affair. It is thus worthwhile for us first to consider some simpler, even pedestrian, approaches to the learning problem in this domain.

4.1. A learner with minimal UG

The shortcomings of unconstrained mappings of weight strings into stress strings carry over to learning systems based on such mappings. Thus, a learner could mimic the acquisition of any stress pattern by simply keeping a table of weight strings and entering the corresponding stress strings as the relevant data come in. Though the number of possible stress systems resulting from unconstrained mappings is astronomical, the size of the table for any particular system will not be very large, being equal to the number of different weight strings encountered. Therefore, learning of stress can be modelled with recourse to only the most minimal version of UG, and without metrical parameters.

Nevertheless, this type of model is no more satisfactory than the unconstrained UG associated with it. Why should stress systems be confined to such a small part of the solution space if learning is based on unrestricted weight-string to stress-string mappings? And again, such a learner only appears to learn stress patterns; what it learns is not the pattern itself, but only a part of its extension. It would be unable to project its grammar to assign stress to weight strings not yet encountered. We conclude that such learning theories are empirically inadequate.
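The table-keeping learner just described can be sketched in a few lines; the point of the sketch is precisely its failure to project beyond the forms it has memorized:

```python
class TableLearner:
    """Minimal-UG learner (sketch): memorizes weight-string -> stress-string
    pairs; its 'grammar' is just part of the pattern's extension."""

    def __init__(self):
        self.table = {}

    def observe(self, weight_string, stress_string):
        self.table[weight_string] = stress_string

    def predict(self, weight_string):
        # Returns None for any weight string not yet encountered:
        # the learner cannot assign stress to novel forms.
        return self.table.get(weight_string)
```

However many forms it has seen, `predict` has nothing to say about a weight string outside its table, which is the inadequacy noted above.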
Let us, then, turn to learners which are confined to the space of possible grammars demarcated by the parameter set. We have shown above that limiting stress systems to those allowed by the metrical parameters results in a relatively small solution space; and this space will remain small by computational standards even if many more binary parameters are added. Therefore, the parameters demonstrate their worth over the null hypothesis as a theory of UG. But this very success might lead us to question the need for a sophisticated learning theory. Since the number of possible stress systems is small, there are any number of ways of traversing the solution space and guaranteeing that a system is found that fits the data. Below we look at learners which illustrate two extremes that we will ultimately reject: a brute-force learner that traverses the solution space according to a fixed schedule regardless of


input; and an apparently prescient learner which extracts as much information as possible from each piece of data, because it has been primed in advance with precompiled lists.

4.2. A brute-force learner

Consider first a learning model equipped with a counting procedure, a matching procedure, and a metrical tree constructor. Since every parameter is binary, let each parameter have either the value 0 or 1; if the parameters are arrayed in fixed positions in a counter, then any parameter setting will correspond to a unique binary number. Let us continue to assume (counterfactually) that accentual systems are transparent: that is, words with identical syllable structure are accentuated in the same way. The learning model would proceed as follows:

(17) A brute-force learning procedure
(a) Assume a group of input forms D = {D1, D2, ..., Dn}.
(b) Initialize the parameter setting counter P to 0. Initialize j to 1.
(c) Select word Dj = current word. Let Dj-s be Dj without its stress markings.
(d) Increment P by 1. FAIL if P > L, where L = the number of possible stress systems.
(e) Apply this parameter setting to Dj-s.
(f) Match the derived form against Dj.
(g) If the match fails, reapply steps (d), (e), and (f) until successful.
(h) Assign Dj to parameter setting P.
(i) If there is a form Dk assigned to a setting Pk, where Pk < P, set current word = Dk, and go to (e).
(j) Increment j by 1 (if j < n; else RETURN P and END).
(k) Select word Dj.
(l) Go to (e).

This algorithm will find the parameter setting P (if one exists) which characterizes the accentual system for a set of forms D. The system is driven by a matching procedure. No cues for individual parameter values are required. One knows only that a given form is compatible or not with a particular counter of parameter values.
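Procedure (17) can be condensed into a sketch that makes its logic plain (here we simply re-test every form at each setting, rather than bookkeeping steps (h)-(i) incrementally; `derive` is a stand-in for the metrical tree constructor, and the toy `derive` used below is ours, not the paper's):

```python
def brute_force_learn(data, derive, num_settings):
    """Condensed sketch of procedure (17): enumerate parameter settings
    P = 1, 2, ... and return the first one whose derived stress strings
    match every observed form; None corresponds to FAIL."""
    for P in range(1, num_settings + 1):
        if all(derive(P, weight) == stress for weight, stress in data):
            return P
    return None
```

A toy instantiation: let setting P stress the P-th syllable. Then for data whose words all stress the second syllable, the procedure cranks through P = 1 before landing on P = 2.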
It is also worth noting that the order in which the counters are arranged is irrelevant to the ultimate success or failure of the model. Any ordering will give the same outcome, though some small gains in efficiency could be achieved by setting up the parameter counter P in such a way that the most common stress systems are characterized by the lowest values of P, and so will be tried first. This kind of learner simply cranks through every possible setting of the


parameters according to a preset schedule until it finds the right one. On such an approach, we must assume that evolution has precompiled the possible stress systems into a list, without, however, supplying the learner with any insight into the relation between the data and the parameters.

4.3. A learner with inside information

The search procedure can be made very efficient if we suppose that evolution has precompiled the set of possible stress systems in a much more elaborate fashion. Imagine a table with three columns: weight strings, stress strings, and a list of stress systems that fit each mapping:

(18) Precompiled weight-string to stress-string mappings

    Weight string   Stress string   Languages that fit the mapping
    LLL             200             Latin, Koya, Garawa, Latvian, ...
    LLL             002             Komi, French, A. Mayan, Greenlandic, ...
    NLL             200             Latin, Koya, Komi, Latvian, A. Mayan, ...
    LLN             002             Komi, French, A. Mayan, E. Cheremis, ...
    RLL             200             Latin, Koya, Garawa, Latvian, ...
    LLR             201             Maranungku, Hungarian, Koya, ...
    LLL             020             Lakota, Warao, Polish, S. Paiute, ...

The table in (18) illustrates a small portion of the complete table of mappings. In each row are listed the languages which are compatible with the given mapping. The language names are a shorthand for the parameter settings they exemplify; for example, Latin stands for the parameter set [P1 R, P2 B, P3 R, P4 L, P5 QS, P6 R, P7 No, P8A Yes, P8 Right, P9 Yes/No, P10 Yes]. A learning procedure can now be constructed whereby exposure to a word that instantiates a particular row of the table selects the languages represented on that row. Eventually, the system will converge on a unique setting. If the available data does not determine a unique solution, the most highly ranked system, in terms of some measure of markedness, is selected.10 Various more or less sophisticated versions of this type of learning system can be elaborated. In contradistinction to the brute-force learner, which can take no short-cuts through the solution space, this type of learner is endowed with extraordinary information about the relation of individual words to the parameter set. Exposure to any word will exclude large numbers of
10 Again we thank the anonymous Cognition reviewer for some interesting suggestions which have influenced this section.
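The primed learner's convergence is set intersection over table rows. A sketch of that procedure, using toy stand-ins for the table entries (these are not the paper's actual rows):

```python
def primed_learn(observations, table):
    """Sketch of the primed learner: each observed (weight string,
    stress string) pair indexes a row of a precompiled table like (18);
    the candidate languages are intersected across observations."""
    candidates = None
    for obs in observations:
        row = set(table.get(obs, ()))           # rows absent from the table fit nothing
        candidates = row if candidates is None else candidates & row
    return candidates
```

For example, a word instantiating the row (LLL, 200) followed by one instantiating (HLL, 200) would narrow the candidates to the languages on both rows.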


possible parameter sets. However, this type of learner is similar to the brute-force learner in that it also has no insight into how individual parameters are related to data. It succeeds because it has been primed in advance.

Because the search space is small, and we know of no empirical evidence that bears on how people go about setting the values of stress parameters, we can muster no convincing empirical arguments against these types of learning systems. Conceptually, there are a number of reasons to question their plausibility. Both the brute-force and the primed learner are completely inflexible. They gain minimal information from failure. The learning path of the brute-force learner continues inexorably towards termination, quite impervious to the nature of the failure of the previous trial. The only information retained is that a parameter counter is or is not compatible with the current input. Similarly, the primed learner cannot distinguish between having ten parameters set correctly and one wrong as against 11 parameters set wrong.11 A primed learner is incompatible with the idea, elegantly expressed in UG, that the diversity of stress patterns is not directly represented in the mind, but is the result of the interaction of a small number of independent parameters.

At this stage, our most compelling argument against these learning theories is that they are rather uninteresting in comparison with the type of cue-based learner incorporated in our model. It turns out that cues for given parameter values are in most cases identifiable, a fact that must be treated as serendipitous by the other theories. In like manner, there would be no reason to expect any particular relationship among the various parameters. Cue-based learners, by contrast, raise interesting questions about how the various parameters interact, their relation to cues, and numerous other issues, to which we turn in the following sections.
5.

The first problem we confront in constructing a cue-based learning system is that the data is relatively undifferentiated when compared with the input data relevant to the setting of syntactic parameters. Syntactic categories are projections of lexical categories that are usually assumed to be already known to the learner; thus, many studies suppose that learners can identify complements and heads, as well as anaphors and pronominals and so on, independently of the principles and parameters of UG which mention these terms (see Baker, 1979; Lightfoot, 1982; Wexler & Hamburger, 1973). Hence, it is usually fairly evident, if that is in fact the case (but see Wexler & Manzini, 1987), which data is relevant to setting each parameter. As far as metrical theory goes, however, the first task of the learning theory is to determine what aspects of the data are relevant to what parameter.

11 Note that it is difficult to develop a measure of closeness of fit for a parameter counter based on the correspondence of an input stress string to a derived stress string. A counter that is correct in every parameter except extrametricality might displace every stress by one syllable, resulting in no matches at all. Conversely, a parameter set that is off in a number of parameters might derive the correct stress strings in certain classes of words. In general, a procedure that zeros in on the correct settings by incrementing and decrementing derived stresses until they match an input is quite foreign to a parametric system of this kind, where small changes can have large effects and large changes can have small effects.
5.1. Appropriateness

Consider, by way of illustration, a language which has already been determined by the learner to have quantity-sensitive binary feet. Let us consider how a learner might now arrive at the correct value of parameters P3 and P4: that is, from which side of the word are the feet constructed, and how are they labelled? The most uninteresting approach to this problem would be by means of a brute-force learner which simply tries out every combination of P3 and P4 in a fixed order until it arrives at the correct settings. We have in fact implemented such a learner; the reasons for rejecting it were discussed above. On the other hand, a maximally perspicacious learner is also undesirable, as it leads to highly implausible results. For example, consider (19). We have seen that in quantity-sensitive systems all heavy syllables, represented by H, are stressed, while light syllables may be stressed (indicated by a capital L) or unstressed (indicated by a lower-case l), depending on their position. Now, it is possible to prove that a four-syllable word having the stress pattern H l L H shown in (19) can be generated only by P3 [Left], whatever the value of P4; the various possibilities and resulting foot structures are worked out in (19a-d). Hence, one might propose that this pattern serve as a cue for P3:

(19) An unprincipled diagnostic for P3 [Left]: H l L H
(a) P3 Left/P4 Left: [S W] feet built from left to right: [H l] [L] [H]
(b) P3 Left/P4 Right: [W S] feet built from left to right: [H] [l L] [H]
(c) P3 Right/P4 Left: [S W] feet built from right to left: [H] [L l] [H]
(d) P3 Right/P4 Right: [W S] feet built from right to left: [H] [L] [l H]

Note that this pattern serves as a cue only to a learner endowed with prior knowledge of the combined effects of these interacting parameters on this particular form. However, the learning theory cannot plausibly have access to such information. We do not want the learner to have to engage in extended computations, for that would be putting back into the LT what we have been trying to banish from the UG. We might suppose, alternatively, that the results of the computations are innately supplied; but that brings us back to the problematic primed learner discussed above. Therefore, we propose to constrain the learning theory by requiring a principled relation between its cues and the parameters of UG. Put informally, we require that cues must be appropriate to the parameters they are related to:

(20) Appropriateness
Cues must be appropriate to their parameters with regard to their scope and operation.

While difficult to formalize, it is easy to give some examples illustrating this notion. The diagnostic in (19) violates the Appropriateness Condition because the string it mentions is never a constituent of metrical theory. Rather, it represents two distinct structures, shown in (19a) and (19b), that, fortuitously, have the same value of P3; by a further happenstance, no structures with a different value of P3 can be constructed. Consider now the cues for P3 and P4 listed in (21), which do conform to the Appropriateness Condition (questions of extrametricality will not be taken into account for this example):

(21) Appropriate cues for P3 and P4 in QS systems
(a) P3 Left, P4 Left: Scanning from the left, a light syllable following anything must be l: [X l] →
(b) P3 Left, P4 Right: Scanning from the left, a light syllable preceding anything must be l: [l X] →
(c) P3 Right, P4 Left: Scanning from the right, a light syllable following anything must be l: ← [X l]
(d) P3 Right, P4 Right: Scanning from the right, a light syllable preceding anything must be l: ← [l X]

Consider (21a). If left-dominant feet are constructed from left to right, then a light syllable must become the weak right sister of an immediately preceding syllable. Therefore, the presence of a stressed light syllable in that position would supply a learner with positive evidence that this is not the correct parameter setting. These cues conform to the Appropriateness Condition because they mirror their parameters. They each mention an actual constituent of metrical theory, in this case, a potential foot. Like P3 and P4, they are restricted to looking at a window that has two elements in it. The terms they mention - light, heavy, immediately before or after - as well as the directional scan - from the left or from the right - are terms that the parameters must also be sensitive to. Adoption of the Appropriateness Condition
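Cue (21a) can be rendered as a short left-to-right scan. The greedy foot parse below is our simplification of the cue, not the model's actual implementation: each syllable heads a foot, and a following light syllable joins it as the weak right sister, where it must be unstressed.

```python
def violates_21a(syllables):
    """Sketch of cue (21a): with [S W] feet built left to right in a QS
    system, a light syllable that joins a preceding foot head sits in a
    weak position and must be unstressed. 'H' = heavy (stressed),
    'L' = stressed light, 'l' = unstressed light. Returns True when the
    data rule out P3 Left / P4 Left."""
    i = 0
    while i < len(syllables):
        # syllables[i] heads a foot; a following light syllable joins it
        # as the weak right sister
        if i + 1 < len(syllables) and syllables[i + 1] in ("L", "l"):
            if syllables[i + 1] == "L":
                return True        # stressed light syllable in a weak position
            i += 2                 # head plus weak light sister
        else:
            i += 1                 # unary foot (next syllable heavy, or none)
    return False
```

On the pattern H l L H of (19), the scan finds no stressed light syllable in a weak position, so P3 Left / P4 Left survives, as (19a) leads us to expect.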


ensures a desirable result, namely: no deduction or computation is permitted to the LT.

5.2. Robustness

Let us turn now to another property that a learning theory ought to have, and that is Robustness, given in (22):

(22) Robustness
Core parameters must be learnable despite disturbances to the basic patterns caused by opacity, language-particular rules, exceptions, etc.

By way of illustration, let us say we are trying to learn the pattern of Warao in (8). Suppose that the learner has already determined that it has quantity-insensitive binary feet, and let us consider how it might now arrive at the correct value of parameters P3 and P4: that is, how could it determine that Warao has left-dominant feet built from the right edge of the word? There are four configurations generated by P3 and P4: feet can be built from the left or from the right, and the feet can be left-dominant or right-dominant. Each one of these four configurations corresponds, in its ideal form, to a characteristic pattern of alternating stresses (we can ignore the difference between 1 and 2 stresses here). Suppose, then, that we simply try to fit the stress patterns of the input words to each of these patterns in turn until we get a consistent fit. Thus, the word in (8a) is consistent also with left-dominant feet built from the left; but (8b) is not. Forms like (8b) lead the learner to reject this configuration, and try another one. The problem with this test is that the correct answer - left-dominant feet built from the right - does not match the pattern of (8b), either. The expected pattern is given in (23a): in words with an odd number of syllables, we expect the first syllable to be stressed - in (8b), it is unstressed. Now, we know why that is; it is because the initial syllable is defooted, causing the loss of its stress. This change in the pattern, though relatively minor, is enough to derail a learner looking for the pattern of (23a):

(23) Left-dominant feet built from the right (V = stressed, v = unstressed)
(a) Basic pattern:           even: V v V v V v    odd: V V v V v
(b) Defooting 1 (Warao):     even: V v V v V v    odd: v V v V v
(c) Defooting 2 (Garawa):    even: V v V v V v    odd: V v v V v
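The basic pattern (23a) is easy to generate mechanically, which makes the mismatch with the defooted forms concrete. A minimal sketch:

```python
def basic_pattern(n):
    """Pattern (23a): left-dominant binary feet built from the right.
    An odd-length word gets an initial unary (hence stressed) foot.
    'V' = stressed, 'v' = unstressed."""
    prefix = "V" if n % 2 else ""
    return prefix + "Vv" * (n // 2)
```

A learner matching input words against `basic_pattern(len(word))` would wrongly reject the Warao form v V v V v, which differs from the expected V V v V v only in its defooted first syllable.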

To remedy this situation, we might attempt to incorporate the effects of Defooting into the test for P3 and P4. The learner, observing that there are no stress clashes in the data, could be made to assume that Defooting might be at work; thus alerted, it would attempt to undo any possible effects of


Defooting before trying the tests for P3 and P4. This procedure would first convert (23b) to (23a), and Warao would pass the test for left-dominant feet built from the right. Such a procedure would work for a language like Warao; we have in fact implemented it.12

The problem with this approach, though, is that it is extremely fragile; the learner must know in advance every factor that might cause a deviation from the ideal patterns, and must be able to correct for all of them perfectly. This is the opposite of robust: in a robust system, if you know a little, then you know a lot; but this is a situation where if you don't know something, you don't know anything. With this in mind, consider now Garawa, shown in (24):

(24) P3 [Right], P4 [Left] (Garawa)

(a) , F,
s : +

(b)

7
;

F
y--

F
yw i

F
w I py

Ri Ri Ri Ri /\ /\ I I wat j impagu 2 0 10

Ri Ri Ri
I I /\

Ri Ri
I /\

Ri Ri
I I

Ri
I

Ri
I

na~i~inmukunjinami~a 200 10
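The undo-defooting repair described above, which converts (23b) back to (23a) before the P3/P4 test, might be sketched as follows. The function name and the guessing strategy are ours; the point is its blindness to any defooting it was not built for:

```python
def undo_initial_defooting(pattern):
    """Fragile pre-test repair (sketch): guess that a word beginning with
    an unstressed syllable had its initial foot removed, and restore that
    stress before testing P3/P4. Handles Warao-style defooting (23b) only;
    it is blind to the Garawa pattern (23c)."""
    if pattern.startswith("v"):
        return "V" + pattern[1:]
    return pattern
```

The repair converts the Warao form into the basic pattern, but it leaves a Garawa form untouched, so the subsequent match against (23a) still fails there.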

Garawa can be analyzed as having the same basic pattern as Warao; however, in (24b), the expected stress clash is resolved by defooting the second foot, not the first one.13 If the learner is unaware of this possibility, it will not succeed in learning P3 and P4.14

We might therefore propose to expand the Defooting parameter to include the possibility represented by Garawa. Rather than one Defooting parameter, then, we may have several: it may be more correct to think of Defooting as a module containing a set of parameters. However, if Destressing is indeed the province of a whole other component of the grammar, the fragile learner must in effect embed its learning of this module inside its test for direction and labelling of trees. To the extent that Destressing contains elements not entirely foreseen by UG (i.e., language-particular variations that must be learned), the learner will be unable to learn P3 and P4. But even if Destressing is entirely controlled by prespecified parameters, the learning strategy required is inelegant, and leads to a very complex computation, even though it may result in a solution.

Consider how this computation would go: the learner learning, say, Garawa, having determined that it has bounded quantity-insensitive feet, now wishes to figure out the values of P3 and P4. Since it has to start somewhere, say it checks first for P3L and P4L: left-dominant feet built from the left. Hence, it checks for the pattern S W S W ..., and noticing an absence of stress clashes, it first sends the forms to the Destressing module, which attempts to reconstruct any stresses that may have been removed. Notice there is of necessity a certain amount of guesswork involved at this stage, for there may be several ways in which to destress. Nevertheless, Destressing tries something: say it does nothing to (24a), and changes the first 0 in (24b) to a 1. Now this is in fact correct. Destressing sends these forms back to the test for P3 and P4, which now attempts to match them against the pattern S W S W ... It succeeds with (24a); however, for (24b) it finds a mismatch. So something is wrong, but the learner does not know what; in the meantime, we have generated a search tree shown in (25):

12 Depending on how the test for extrametricality is ordered, the learner could also analyze Warao as having right-dominant feet built from right to left, with a final extrametrical syllable.
13 Hayes (1981, pp. 54-55) proposes a different analysis.
14 Perhaps the position of main stress plays a role here; however, the Garawa pattern of Defooting occurs also in Italian, when two secondary stresses clash.

(25) Embedded search paths

                      P3-P4 test
          /         /         \         \
        L-L       L-R        R-L       R-R
         |
        P9 (destressing possibilities)
       / | ... \
      1  2 ... n-1  n

The paths under P9 represent the various possibilities allowed to Destressing. So far, we have followed the path that ends at 1; when this path does not yield the correct result, the next reasonable path to try is the one that ends in 2. This follows from the search strategy which says: from the point of failure (here 1), backtrack to the lowest decision point, and then try the next possible path. This basic search logic would lead us to exhaust all possibilities of destressing for P3L-P4L first; only when these all fail do we return to the top of the tree and try another setting of P3-P4. In our case, Defooting was
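The embedded search over the tree in (25) can be sketched as two nested loops; the inner loop is exhausted before the learner returns to the top of the tree. The predicate `fits`, standing in for the match against the expected pattern, is a toy of ours:

```python
def embedded_search(forms, p34_settings, destressings, fits):
    """Sketch of the search in (25): for each P3-P4 setting, every
    destressing possibility under P9 is tried before the learner backs
    up to the top of the tree and tries the next setting."""
    for p34 in p34_settings:          # top of the tree: P3-P4 test
        for d in destressings:        # paths under P9
            if all(fits(p34, d, form) for form in forms):
                return p34, d         # first full match found
    return None                       # exhausted the whole tree
```

Note that even when the destressing guess is correct, failure of the overall match sends the learner on to the next destressing path, retreating from the correct guess, exactly the behavior criticized in the text.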


right on the first try - P3 was wrong. However, the learner has no way to know that, so in this case it would retreat from the correct destressing before ultimately getting everything right. In this domain, then, it acts like the maximally dull brute-force learner, which never knows in what respect it is wrong, and so advances toward and retreats from the correct solution by chance.

5.3. Determinism

This sort of learning strategy has been called nondeterministic in the sense of the term as it is usually used in the parsing literature. We may take the following to be criteria of strict determinism (Marcus, 1980):

(26) Deterministic parsing
1. No backtracking: the deterministic parser cannot back up when a parse does not work out; the nondeterministic one can return to its status quo ante and continue the parse.
2. No undoing of created substructures: once the parser has created a given structure, it may not subsequently be undone if the parse should fail.
3. Data-driven: a deterministic parser is data-driven to the extent that changes in its internal state are triggered, at least partially, by data from the input stream; a nondeterministic parser changes state according to a fixed agenda which is sensitive to data only to the extent that it indicates success or failure.

These considerations have been brought to bear on parsing, first by Marcus (1980), and then by Berwick (1985) and Berwick and Weinberg (1984). ATN parsers, of the sort developed by Woods (1970), have a preset schedule of preferences when faced with a choice: they thus are nondeterministic by (26), for they pursue a parse until failure, building structures and then backing up and dismantling them again. By contrast, Marcus aimed to develop a parser which did not back up, by providing it with a limited lookahead capability. While there are obvious computational advantages to deterministic parsing, Berwick (1985) argues that it also aids acquisition.
Thus, suppose the parser encounters a sentence containing a new structure unknown to it; the parse will fail. The parser will then attempt to modify its grammar so as to accommodate the new data; but to do that successfully, it is necessary to know where the failure occurred. A nondeterministic parser, which routinely can make use of unlimited backtracking, will characteristically fail backwards through the whole sentence, undoing correct as well as incorrect substructures. The same is not true of a deterministic parser; as Berwick and Weinberg point out (1984, p. 231), from the standpoint of learning, the effect of determinism and the restriction that rules refer only to bounded context is to pinpoint errors to a local radius about the point at which the error is detected. The ability to keep problems local aids in the learnability of grammars.

Determinism in the stress learner is implemented in the following way: the initial state of the system contains a counter of parameters set to their unmarked (0) values. An initial (0) parameter setting may be changed to its marked (1) value, but a change from marked (1) to unmarked (0) is interdicted. In other words, the decision to set a parameter value to its marked setting commits the system to this decision: it may not be changed. This procedure corresponds to the non-structure-destroying, non-backtracking properties of deterministic parsers. The learning theory can thus be viewed as a monotonic function returning a parameter counter for a given array of input forms.

Implementation of this type of learning theory requires a set of cues associated with individual parameter settings. Each parameter may be viewed as a switch which may only be turned on. To illustrate, consider a parameter counter (Pi Pj Pk) consisting of three parameters, i, j, and k, where the first counter position (from the left) represents the value of i, the second the value of j, and the third the value of k. Suppose that at some point i is set to its marked value; the current counter will now be (1 0 0), or, more simply, 100. Possible learning paths whose ultimate destination is 111 (all parameters at marked values) would be as follows:

(27) Learning paths
(a) Nondeterministic: 100 - 101 - 110 - 111
                      100 - 110 - 101 - 111
(b) Deterministic:    100 - 101 - 111
                      100 - 110 - 111
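The monotonic parameter counter just described can be sketched directly; the class and method names are ours:

```python
class ParameterCounter:
    """Monotonic parameter store (sketch): every parameter starts at its
    unmarked value 0 and may only be switched to its marked value 1;
    retreating from a marked value is interdicted."""

    def __init__(self, size):
        self.values = [0] * size

    def mark(self, i):
        self.values[i] = 1      # a commitment: it may not be changed back

    def unmark(self, i):
        raise ValueError("determinism: a marked value may not be retracted")
```

Because `unmark` is never permitted, a run of the learner traces exactly the deterministic paths of (27b), never the retreating sequences of (27a).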

In the kind of deterministic system implemented in our model, the learning paths in (27a) are excluded, because the sequences ... 101 - 110 ... and ... 110 - 101 ... are illegal. Once a parameter is set to its marked value, the system is committed to that value. Possible paths are thus those shown in (27b).

The constraint against retreating from a marked value must be suspended in one type of case, namely, where the default value of a parameter depends on the setting of another parameter. Consider, for example, P2, which determines whether feet are binary or unbounded. In our model, the default setting of P2 is [Unbounded] given a QS system, because a positive cue exists for detecting the presence of binary feet, but not unbounded feet. For reasons


discussed above, QI systems allow only binary feet; that means that when P5 is [QI], P2 must be set to [Binary]. This is in fact the initial configuration of the parameter set. When P5 changes to the marked setting [QS], both binary and unbounded feet become available, and P2 must revert to its default setting of [Unbounded]. Thus, P2 may follow a path from [Binary] to [Unbounded] and back to [Binary]. The violation of determinism in such cases is only apparent, however. The initial setting of [Binary] is required by UG; there is no other possible setting at that stage, and so it is incorrect to consider it as marked: it is as unmarked as it is allowed to be. The change back to [Unbounded] when that possibility becomes open keeps the parameter in its unmarked setting. By contrast, the subsequent change to [Binary], provoked by something in the data, represents a change to a true marked setting, and cannot be altered thereafter.15

There are a number of reasons for considering deterministic learning systems to be interesting models of phonological acquisition. First, the constraints imposed by the requirements of deterministic learning may at least partially account for certain properties exhibited by stress systems and may help to explain why they are organized the way they are. As we have seen, nondeterministic learners are completely unrevealing in this regard, for they simply search through a list of possible stress systems. Stress systems require no particular organization to be learned in this fashion. Second, the existence of cues meeting conditions of Appropriateness and Robustness is expected in the context of a deterministic learner, which requires such cues in order to operate. From the point of view of a nondeterministic learner, however, the existence of such cues must be treated as an accident, since they could have no possible basis in the learning theory. The only relevant features for such a learning theory are the finiteness and coherence of accentual systems. Third, Berwick and Weinberg's argument that deterministic parsers aid acquisition holds in our case as well. Thus far, we have been speaking of stress systems as if they were indeed finite, and capable of being exhaustively listed. While this may be true at a certain level of abstraction, most languages nevertheless have certain idiosyncrasies, not all of which may be foreseen in the parameter list. In such cases, a nondeterministic learner would simply fail; by contrast, a deterministic learner would still perhaps be able to fix

15 Borer and Wexler (1987) appear to have some such situation in mind in their account of the acquisition of causatives in English and Hebrew. At an early point, on their account, A-chain formation is not available to children, who must consequently set up a marked causative rule. When A-chain formation becomes available, the unmarked causative rule becomes the default setting, and the earlier marked rule is revoked.


some, perhaps most, parameters. Failure would be local; information derived even from failure would aid the learner in further efforts to acquire the system.16

Let us return, then, to a consideration of our problems with destressing. Recall that the interaction of destressing with the other parameters of the accentual system is what first led us in the direction of a nondeterministic learning strategy. It is interesting, therefore, that a deterministic solution exists in this case.

5.4. Determinism and destressing

We have seen that the attempt to learn destressing in the course of learning P3 and P4, the parameters for the direction of construction and labelling of foot trees, leads to a nonrobust and nondeterministic learning path. Fortunately, another approach is available. For it turns out that we can abstract away from destressing completely. Suppose the test for direction and labelling overlooks unstressed syllables in strong positions; then, only the presence of a stressed syllable in what should be a weak position will count as a violation. This cue is restrictive enough to rule out incorrect settings of P3 and P4; but at the same time, it is robust enough to see through the defootings of both Warao and Garawa. The cue has other advantages: it works equally for noniterated feet (P10 Yes), and it would work when faced with a language that has idiosyncratic destressing rules. In fact, this is just the cue (21) we have already proposed for quantity-sensitive feet.17

This conclusion raises an empirical and a theoretical prospect. Empirically, this cue would be confounded when faced with a language-particular footing rule that is analogous to the sort of defooting rules that appear to be quite common. If we have indeed found the correct cue for P3 and P4, we now have grounds for excluding rules which simply stress any unstressed syllable in a certain weak position.
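The robust version of the test can be sketched as a one-way check: given the weak positions a candidate P3/P4 setting predicts, only a stress in such a position counts against the setting. The encoding of weak positions as indices is our simplification:

```python
def stressed_in_weak_position(word, weak_positions):
    """Robust P3/P4 cue (sketch): an unstressed syllable in a strong
    position (e.g. a defooted one) is overlooked; only a stressed
    syllable ('V') where the candidate setting predicts a weak position
    counts as a violation."""
    return any(i < len(word) and word[i] == "V" for i in weak_positions)
```

For left-dominant feet built from the right over five syllables, the predicted weak positions are the second and fourth syllables of the alternating pattern (indices 2 and 4). Both the Warao form v V v V v and the Garawa form V v v V v pass, despite their defootings, while a setting predicting the opposite alternation is correctly ruled out.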
We do find languages where certain morphemes always receive a stress; but such phenomena can be easily detected by the learner. Also we find rhythm rules; but these do not simply stress an unstressed syllable; moreover, they tend to involve alternations, and these

16Although we have been contrasting deterministic and nondeterministic learners, it would be more accurate to talk in terms of a continuum of determinism: learners may be more or less deterministic in several different ways. It is an empirical question to what extent deterministic learners are in fact the correct model for acquisition, in the domain of stress systems as well as in other domains.
17The same principle appears in the metrical correspondence rules proposed by Halle and Keyser (1971, p. 169f.), whereby a line is deemed unmetrical if a stressed syllable occurs in a weak position, but not vice versa. We are grateful to Ray Jackendoff for calling this to our attention.


would help the learner. To the extent that this empirical prediction holds up, it lends support to our proposed cue.

Separating out destressing, and ordering it after stressing, also accords well with the way destressing has usually been thought of, as a series of operations that apply to the output of the stress rules. It is interesting, therefore, that the parameters of stress are learnable in the absence of knowledge about the details of destressing.

On a theoretical level, the pursuit of robust cues can lead, if successful, to a substantive notion of core grammar. It is useful to think of the grammar of a language as having a core, which is tightly constrained by UG, and a periphery, which departs in various ways from the principles of the core, and which consists of more or less idiosyncratic rules and exceptions. In any given case, however, there are few criteria for assigning any particular process to the core or to the periphery. The distinction can be bolstered by learnability considerations: the core of a grammar ought to be learnable by robust preprogrammed cues, cues which will not be misled by peripheral processes. The periphery would have to be learned by less principled means; the task, however, is much simplified once the core system is in place.

This line of reasoning may offer the beginning of an answer to a puzzle posed by Howard Lasnik. Lasnik wondered why linguists usually assume that core grammar should be considered regular, systematic and in general well behaved, while the periphery is assumed to contain phenomena that are formally more bizarre or complex. If core grammar is largely determined by UG, his reasoning went, then it is largely hard-wired into the brain and no particular constraints on its form would be required. Since peripheral processes lie outside of UG, it is there that one should seek the more constrained aspects of grammatical structure.
If, as we have suggested, core grammar is learned in a deterministic fashion while other procedures may be employed for peripheral phenomena, this could account for the highly structured nature of the former in contrast to the latter.

6. Subset theory and negative evidence

In a deterministic theory, the choice of the initial unmarked parameter values is crucial.18 The learner must never mistakenly set a parameter value to its
18This is true when the data is being presented in incremental fashion, and the learner is unable to draw conclusions from the absence of a cue. If, however, the learner has access to all the relevant data simultaneously, the absence of a cue can become positive evidence, and the choice of the initial parameter value becomes less significant. See Section 7.3 below for further discussion.


marked value. One hopes for a situation involving the existence of positive cues that unfailingly indicate the presence of a given parameter setting for a system. To the extent that positive evidence exists for both values of a given parameter, the question of markedness loses its relevance, at least as far as the learning theory is concerned. In many cases, however, positive evidence is available for only one value. In such cases, one assumes as the initial setting the value for which there is no positive evidence. The learner is driven to the marked value by encountering just such positive evidence.

In syntax, it has been argued by Berwick (1985) and by Wexler and Manzini (1987) that any learning theory constrains markedness theory to select subsets as the unmarked parameter value. Consider, for example, anaphora. Languages may vary according to what may constitute a proper antecedent for an anaphor. Some languages, such as Japanese, only allow antecedents in subject position. Other languages are less strict in terms of the position that may be occupied by the antecedent. In English, for example, an antecedent may be anything within the governing category of the anaphor. As the former set of possible antecedents is a subset of the latter, it must be chosen as the unmarked case. Positive evidence for changing to the English setting of this parameter would then be available in the form of an anaphor bound by a nonsubject in its governing category. The contrary choice (setting English as unmarked for this feature) would furnish no possible positive evidence that one is learning a language where anaphoric reference is limited to subject position. Indeed, Wexler (personal communication) suggests that children may deduce subset relations from UG.
In other words, the subset principle, which holds that the initial hypothesis always involves the more constrained system, is not a methodological imperative directed at linguists as to what the initial parameter values should be, but rather a learning strategy employed by children.

In phonology, the situation appears to be much more complex. In the case of accentual systems, the types of subset relations discussed in the literature on syntactic learning are simply absent. For any two accentual systems A and B, there exists some accentual pattern in A that may not occur in B, and vice versa. If subsets are to be defined on the output forms produced by the grammatical system under study, then it is clear that no such subsets will be found involving the parameters of accentual systems. It is far from obvious, however, that subsets must be based on output forms construed in this manner. The difficulty arises from a failure to consider the relationship between (1) a parameter setting, VP, (2) the forms, {Di}, resulting from its presence (always in conjunction with the presence of the other parameters in the system), and (3) the cue(s), CP, which reveal its presence. Work on syntactic learning models has tended to obscure these different factors.
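The subset-driven setting procedure just described can be put in schematic form. The following is a toy illustration of ours, not the authors' implementation; the encoding of the input data is invented for the sketch:

```python
# Toy illustration of subset-principle parameter setting: start at the
# subset (unmarked) value, and move to the superset value only on
# positive evidence -- a form the subset grammar cannot generate.
def antecedent_param(sentences):
    value = "subject-only"  # unmarked: antecedents restricted to subjects
    for s in sentences:
        if s["antecedent_position"] != "subject":
            # an anaphor bound by a nonsubject in its governing category
            value = "governing-category"
    return value
```

On this scheme, a learner of a Japanese-type language never leaves the initial value, since no datum ever triggers the change.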


Consider again anaphora. The parameter in question is characterized as the position of possible antecedents in the language. The forms generated by such languages will/will not involve antecedents in nonsubject positions. The cue utilized to detect the marked (superset) system is an antecedent in nonsubject position. In such cases, it is easy to neglect the differences between VP, {Di}, and CP.

Phonological parameters and the construction of learning models based on them highlight these differences, however. To take but one example, consider again the placement of main (word) stress, controlled by P1. This parameter may be set to [Left] (word stress left) or to [Right] (word stress right). As we noted above, the fact that this parameter has only two settings does not mean that main stress occurs invariably on the first or last syllable of a word. The direct and consistent result of this parameter setting (as well as most others in stress systems) is seen only in the production end: the construction of metrical structures. Its effects on the output forms are varied and depend on the settings of the other parameters of the system. For example, setting P1 [Right] (main stress on the right) results in invariable final stress in a language like Weri, which has binary, quantity-insensitive, right-dominant feet constructed from the right. Sample forms are illustrated below:

(28) P1 [Right], P2 [B], P5 [No], P3 [Right], P4 [Right] (Weri)
[metrical tree diagrams (a) and (b) not legible in this reproduction; both sample words carry main stress on the final syllable, with secondary stresses on alternating syllables to its left]

The same setting of P1 [Right], by contrast, has very different results in Eastern Cheremis (see (11) above), a language with unbounded, quantity-sensitive, left-dominant feet. Here, main stress appears on the last heavy


syllable of words which have at least one heavy syllable (11b), and on the initial syllable of words with no heavy syllables (11a).

It is thus a challenge to the constructor of learning systems to find a cue to some value of the word stress parameter, P1, that is constant in all parameter settings. In the case of P1, there are a few possible cues. Given the set of parameters assumed in our current model, the appearance of main stress to the left (resp. right) of a secondary stress is a positive cue for the [Left] (resp. [Right]) value of P1. This cue will not be present in the case of languages with no secondary stresses. Another cue, more in keeping with the Appropriateness Condition, would scan a foot-sized window at the left (right) edge of a word, looking for the presence of main stress; this window would vary with the values of P2 (binary vs. unbounded feet) and P5 (QI vs. QS), to which the cue for P1 would now be sensitive. Such examples make clear the potentially remote relationship between a parameter setting and the data which signals its presence in a grammar.

Let us return now to the question of subsets. It follows from the above discussion that subsets can be calculated only on the basis of the set of output forms as characterized by the cues involved in the identification of the parameter setting in question. The relevant set cannot simply be identified with the output words or sentences in the language. Concretely, let us consider (final) extrametricality. Languages in which final syllables are not extrametrical are characterized by final syllables which may or may not be stressed. In languages with final extrametricality, however, final syllables can never be stressed. Considering now only final syllables in both types of languages, we can see that the extrametrical cases (only unstressed final syllables) are a subset of the nonextrametrical cases (stressed or unstressed final syllables).
By this reasoning, it should be supposed that the child assumes it speaks an extrametrical accentual system, and revises this hypothesis in the face of contrary positive evidence, to wit the presence of final stressed syllables in the data. The extrametrical case, however, is by no means typical of all stress parameters. Furthermore, the suggested cue is not totally reliable in both directions. While it is true that a final stressed syllable is sure-fire proof that there is no right extrametricality, it will not always be the case that languages with no final extrametricality will invariably contain final stressed syllables. In such cases, the learner may be unable to determine the setting for extrametricality until it tests for P3 and P4.19
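The one-directional character of this cue can be made explicit in a small sketch (ours, following the subset reasoning just given: the default is the extrametrical value, and only a stressed final syllable forces a change; the encodings are invented):

```python
# Sketch of the one-directional extrametricality cue (P8): the learner
# starts from the extrametrical (subset) value and abandons it only on
# positive evidence. The absence of final stress proves nothing.
def p8_update(current, stress):
    """stress: list of stress levels per syllable, 0 = unstressed."""
    if stress[-1] != 0:
        return "No"      # a stressed final syllable rules out extrametricality
    return current       # otherwise keep the current value
```

Note the asymmetry: the cue can move the value from [Yes] to [No], but nothing in a single word can move it back, which is just the unreliability discussed above.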
19There are a few cases where our Learner is unable to determine whether or not there is extrametricality, and incorrectly leaves it at the default value, which we take to be [No]. The failure will become evident when the program attempts to apply the parameter settings passed on by the Learner to words in the language; the stress patterns derived will not match those of the input. For such situations we have a brute-force nondeter-


The reader will have noted that a good number of differences among stress systems involve contrasts in chirality: word stress, foot dominance, and construction direction do not appear to involve subsets, no matter what the cues for them are. In other cases, whether or not a subset relationship exists is clearly dependent on decisions involving the nature of the cues.

Consider, for example, the relation between QS and QI systems. Is this a subset relation, or one of a partitioned set? The answer depends on the cues chosen. Suppose that LT1 is a learning theory in which the cue for fixing this parameter involves a window that is exactly one syllable in length. In QI systems, there is no fundamental distinction between heavy and light syllables: all syllables can be either stressed or unstressed, depending on their position in the word. Similarly, light syllables in QS systems are stressed or not depending on position; heavy syllables, however, may not usually be unstressed (abstracting away now from the effects of destressing and extrametricality):

(29) Syllable types in QI and QS systems
             Stressed        Unstressed
 QI systems  Heavy, Light    Heavy, Light
 QS systems  Heavy, Light    Light
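LT1's one-syllable-window cue can be rendered as a short sketch (our formulation, not the authors' code; the weight and stress encodings are invented):

```python
# Sketch of LT1's single-syllable cue: per (29), only QI systems allow
# unstressed heavy syllables, so an unstressed heavy syllable is
# positive evidence for QI -- and under LT1 the unmarked value must
# therefore be QS.
def lt1_qi_cue(weights, stress):
    """weights: 'H'/'L' per syllable; stress: 0 = unstressed."""
    return any(w == "H" and s == 0 for w, s in zip(weights, stress))
```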

From the point of view of syllable types found, QS systems are clearly subsets of QI ones. The latter allow unstressed heavy syllables whereas the former do not. This implies, then, that the unmarked value of this parameter in LT1 should assume QS.

LT1, however, is not the only learning theory available for dealing with these cases. Consider LT2, where the test window is extended to include the whole word. Now, a different cue becomes available: two words having the same number of syllables, but different stress patterns, can be taken as a positive cue for QS (abstracting away now from exceptional words). This is because, in QI systems, all syllables have the same status with respect to metrical theory; hence, words with the same number of syllables are all alike from the point of view of the metrical parameters. In QS systems, by contrast, metrical structures are sensitive to the distinction between heavy and light syllables. We thus have the following equivalence classes of word types in QI and QS systems:
ministic back-up procedure which then tries out other parameter settings that have not been ruled out by the Learner. It remains an empirical question whether we can do away with the brute-force back-up altogether. Extrametricality is a particularly difficult problem in this regard.


(30) Word classes in QI and QS systems

                         2-syllable words   3-syllable words   4-syllable words
 QI: syllable = S        {SS}               {SSS}              {SSSS}
 QS: syllable = H or L   {LL}               {LLL} {LLH}        {LLLL}
                         {HL}               {HLL} {HLH}        {HLLL}
                         {LH}               {LHL} {LHH}        {LHLL}
                         {HH}               {HHL} {HHH}        ...
If we calculate subsets with regard to the number of equivalence classes established for words with n syllables, we see that the QI classes are a subset of the QS classes. Indeed, in LT2, we must suppose that QI is the unmarked case, since positive evidence - the existence of words with the same representations but different stress patterns - exists only for QS. A learner which starts by assuming QS will not receive positive evidence contradicting it. Rather, it would have to notice that all equivalence classes consisting of words of n syllables have the same stress pattern. Thus, LT2 requires a different subset relation, and a different markedness theory, from LT1. In general, there is a close relationship between an LT and markedness theory: the choice of an LT will dictate the choices of a markedness theory and vice versa. It should be noted also that the markedness theory derived from LT may be rather different from one based on more traditional kinds of linguistic evidence.
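LT2's cross-word cue can likewise be sketched (our formulation; the table data structure is an assumption of the sketch): the learner accumulates, per weight string, the stress strings observed with it, and a second distinct stress string for the same weight string is positive evidence for QS.

```python
# Sketch of LT2's whole-word cue: identical weight strings paired with
# different stress strings constitute positive evidence for QS, so
# under LT2 the unmarked value is QI.
def lt2_observe(table, weights, stress):
    """table maps weight strings to the set of stress strings seen."""
    seen = table.setdefault(tuple(weights), set())
    seen.add(tuple(stress))
    return len(seen) > 1   # True: positive evidence for QS
```

A learner that instead started at QS would have to rely on the absence of such pairs, which is negative evidence.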

7. The information available to the learner

An important question that must be addressed in the construction of learning models is that of the nature of the information that is available to the learner. In this section we will discuss the sorts of information that are provided to our learning model, and consider some alternative possibilities.

7.1. Syllable structure

As mentioned at the outset, we assume that the model may access the rime structures of the input data. Certain cues are sensitive to these structures, which are required to distinguish light from heavy syllables. In addition to the parameter for QS/QI, this distinction also plays a role in choosing between binary versus unbounded feet, as well as other aspects of foot construction and labelling.


7.2. The current parameter counter

If parameter cues were independent of each other, in the sense that they did not vary according to other parameter values, then the learner would not have to have knowledge of the current parameter counter. To the extent, however, that parameter cues depend on the values of other, already determined parameters, the learner must have access to these earlier values. For example, various cues presuppose a value for P5 and P6 - the parameters for QS/QI. When applying these cues, then, the learner must be able to access the current value of P5 and P6. The current counter is also crucial in the detection of nontransparent systems, such as those involving exceptions, morphological sensitivity, or coexisting stress patterns. These cases will be discussed in more detail below.

7.3. Incremental versus all-at-once learning theories
7.3.1. Incremental mode

Two proposals may be advanced concerning the manner in which input data is made available to the LT. An approach which appears to mirror the situation of a child is the Incremental Mode: words (in this case, but the same would hold for the data of any other relevant domain) are processed individually, with the learner having access to the rime projection and stress pattern of the current word, as well as the current parameter settings. An Incremental Mode learner cannot assume at any point that it has already seen all the relevant data, so it may not use negative evidence. In the most constrained version of an Incremental LT, the learner would also not be able to directly access previous words. The model, then, is of a stream of words passing through the learner, with only one word being processed at a time. Parameter cues must be limited to information that can be gleaned from a single word, analyzed with reference to current parameter settings.

7.3.2. Batch mode

If the learner is also allowed to refer to previous words it has encountered, it begins to approach a second type of LT, one which functions in All-at-once, or Batch Mode: input forms are accepted for some fixed interval, in which time all forms are processed at once. This LT has access to all the words it has been exposed to, and is therefore allowed to make cross-word comparisons. It has the capability of noticing, for example, that two words have the same number of syllables, but different stress patterns; recall the equivalence class test for QS/QI of LT2 demonstrated in (30) above. Such a test would be unavailable to the Incremental Mode LT, which would have to use some-


thing like the syllable test of LT1 in (29). The Batch Mode learner can also make use of negative evidence; thus, it can notice that there are no words in the data with some property Q, and draw conclusions from that. This possibility is also not open to a pure Incremental LT.

The Batch Mode LT incorporates the idealization to instantaneous acquisition assumed by most work in phonological theory (see Chomsky & Halle, 1968). In terms of actual acquisition, this model assumes a latency period during which a learner stores input, without attempting an analysis. Another possibility leading to the same result is that the early stages of acquisition play no decisive role with regard to the grammar ultimately learned. In either case, acquisition can be viewed as if it were instantaneous, with the learner having access to all the relevant data.

As the example with LT1 and LT2 shows, an Incremental Mode LT is more constrained and less powerful than a Batch Mode LT: any cue available to the former is available to the latter, but not vice versa. It is therefore interesting to see whether the extra power of the Batch Mode LT is indeed necessary in learning the parameters of metrical theory. At the present time, it appears that it is, for reasons that will now be made clear.

7.4. Cross-word comparisons

To illustrate the sort of problem that can arise for a strictly deterministic and purely Incremental Mode learner, consider again the test for quantity sensitivity. We have seen that a cross-word comparison test, as in (30) above, is not available to such a learner; another possible test which is available is the syllable-based test of (29). This test uses the absence of a stress on a heavy syllable as a positive cue for quantity insensitivity (QI). In our discussion of destressing, however, we observed that the lack of a stress can be a treacherous cue.
While the lack of a stress might indicate that a syllable is in a metrically weak position, there are three other sources of stresslessness in the system of parameters assumed in our model: a syllable might be extrametrical (P8); it might be destressed (P9); or feet might not be iterated (P10), in which case all secondary stresses are reduced. The cue for QI would have to be able to distinguish these cases from a true case of a metrically weak heavy syllable. At the present time, it is not clear whether it can do this consistently.

The cross-word test in (30) can be made available to an Incremental Mode learner if we weaken slightly the constraint on what information it has access to by allowing it to keep a table of weight-string to stress-string correspondences. Thus, when the learner encounters a new word, it records its weight string and stress string, which are all that is needed to apply the cross-word test. It is not necessary for the learner to recall the actual words it has heard,


but only their weight-string to stress-string mappings. This type of Incremental Mode learner must still use only positive cues, for it can never assume that what it has not yet seen will never occur.

That the LT could have access to cross-word comparisons of this type is supported by developmental studies which suggest that children in early stages of acquisition organize words in terms of preferred or canonical segmental and prosodic patterns (see Ingram, 1974; Macken, 1979; Menn, 1978). This kind of organization is compatible with the idea that children keep track of weight-string to stress-string mappings. Potentially relevant here is the distinction made by Bowerman (1987) between on-line and off-line models of acquisition. On-line theorists assume that learning takes place when the child is actually using language: processing forms, dealing with new constructions, etc. Off-line theorists (e.g., Bowerman, 1982, 1985; Karmiloff-Smith, 1986) also suppose that children (unconsciously) compare forms and extract regularities even when they are not confronted with new data. The on-line/off-line contrast does not line up with the Incremental Mode/Batch Mode one, for our discussion has been limited to on-line learning, where both types of LT can be entertained. Nor can we necessarily derive constraints on the LT from either of these models. However, a large role for off-line analysis would at least indirectly support the idea that the LT could have access to a variety of cross-word comparisons.20

7.5. Cross-parameter dependencies

Another type of problem confronts an Incremental Mode learner, however, even if modified as above. If there are cross-parameter dependencies - i.e., if the cue for one parameter depends on having established a result for the setting of another parameter - then a certain relation must hold for the cues of these parameters if we are to maintain strict determinism. This relation can be best illustrated with an example.
Suppose that the cue for a parameter, Pi, is a one-sided test, so that Pi retains the default value 0 until the learner encounters Ci in the data, at which point the value of Pi is set to 1. Suppose also that the cue for another parameter, Pj, varies with the value of Pi as follows: if Pi is set at 0, then the presence of Cj in the data is a positive cue to change the value of Pj to 1. It follows that the value of Pi must be established first. But notice that the learner is liable to make a serious mistake in setting Pj. As long as Pi remains set at the default value,
20For discussion, see Bowerman (1987) and the references cited there, as well as some of the other papers in MacWhinney (1987).


the learner will be treating the presence of Cj as a positive cue for setting Pj to 1. Suppose now that the data contains both Ci and Cj, and that the correct setting of parameters is Pi = 1 and Pj = 0. If the learner encounters Cj before Ci, though, it will incorrectly set Pj to 1 - an error from which it cannot recover.

This example is not hypothetical; it arises several times in our parameter set. One example concerns P1, the main stress parameter, P2, the parameter determining whether feet are binary or unbounded, and P5-P6, parameters determining QI/QS. To determine whether main stress is on the left or the right, the learner samples a foot-sized window at the left and right edge of a word: main stress should always appear in one of these windows. Clearly, this cue must know how big to make the window, and so it depends on information received from P2. The default setting of P2 is itself dependent on P5; if the stress system is QI, feet can only be binary, in which case P2 must be set to [Binary]. But if the system is QS, the default must be [Unbounded], since the cue for P2 is: assume that feet are unbounded unless you see a stressed light syllable that is not at an edge (note in addition that the notion of edge is itself relative to the current value of extrametricality, controlled by P8A and P8, and what counts as a light syllable depends on the values of P5 and P6). In our current model, the default value of P5 is [QI], in accordance with LT2 above: assume stress is QI until you see evidence that it is QS.

With this background, suppose now that we are faced with a language whose true parameter settings are P5 [QS] and P2 [Unbounded]. Suppose also that P5 has not yet seen the evidence it needs to change its default from [QI] to [QS]. The stage is now set for a fatal error to occur downstream from P5. For P2 will of necessity be set to [Binary] as long as P5 is [QI]; and the cue for P1 will be looking for main stress in the wrong window.
Depending on what data it is scanning, P1 is liable to move to an incorrect value it cannot retreat from. The cues for P3 and P4 can also go spectacularly wrong if they apply an incorrect parameter setting of P5 or P6 to misconstrue what syllables count as heavy; in such circumstances, it is impossible to determine the correct direction and labelling of feet.

What this type of example shows is that we are actually dealing with three parameter values, not two. For we must distinguish not just between a default, hence changeable, 0 and a marked, frozen 1, but also between a default 0 and a frozen 0. A frozen 0 is an unmarked parameter setting for which the learner has positive evidence. Given cross-parameter dependencies, a parameter Pj which depends on Pi can be evaluated with confidence only if the value of Pi is frozen.
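The ordering hazard just described can be made concrete in a toy simulation (ours; Ci and Cj are schematic data tokens, and marked values, once set, are frozen as in the text):

```python
# Toy simulation of the cross-parameter hazard: Pj's cue (Cj) is valid
# only while Pi = 0, and a marked value, once set, is frozen. If Cj
# arrives before Ci, Pj is frozen at an incorrect marked value.
def learn(stream):
    Pi = Pj = 0
    Pj_frozen = False
    for datum in stream:
        if datum == "Ci":
            Pi = 1
        elif datum == "Cj" and Pi == 0 and not Pj_frozen:
            Pj = 1
            Pj_frozen = True   # no retreat from a marked value
    return Pi, Pj

# Target grammar: Pi = 1, Pj = 0.
safe = learn(["Ci", "Cj"])    # (1, 0): Ci arrives first, Cj correctly ignored
fatal = learn(["Cj", "Ci"])   # (1, 1): Pj frozen at the wrong value
```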


How can we prevent an Incremental Mode learner from acting incorrectly on the basis of default values? One way would be if the default values of the various parameters lined up, so that no parameter had a cue that led to a marked value which was sensitive to an unmarked value of another parameter. If such a constraint held, the hypothetical situation sketched above would never arise. If this constraint does not hold, then the learner would have to wait before it acted on a default value until it was confident that the default value was correct. But in this case, it would start to approach the Batch Mode learner, in that it would have access to negative evidence, namely: if after a certain period there has been no occurrence of Ci, assume that there is no Ci.

Another way of maintaining an Incremental Mode learner would be to relax the constraint on retreating from marked parameter values in the following way: any time a parameter value changes, all parameters which depend on it revert to their default values.21 The rationale for this is that any information obtained under the influence of an incorrect parameter setting is potentially flawed, in ways that cannot be recovered. It is an empirical question which of these LTs is correct. Our current model implements a fairly powerful Batch Mode learner. It remains to be seen to what extent a more constrained Incremental Mode learner can be constructed. Given the above considerations, it appears that a learner that is both purely incremental and strictly deterministic is unworkable, given the current set of parameters and cues.
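The reversion strategy can be sketched as follows (our rendering, not the model's code; the dependency table is hypothetical, and only direct dependents revert in this simple version, where a fuller one might recurse):

```python
# Sketch of the relaxed Incremental-Mode strategy: when a parameter's
# value changes, every parameter whose cue depends on it reverts to its
# default, since evidence gathered under the old setting may be flawed.
def set_value(values, defaults, depends_on, param, new_value):
    values[param] = new_value
    for other, deps in depends_on.items():
        if param in deps and other != param:
            values[other] = defaults[other]   # re-learn under the new setting
    return values
```

For instance, with depends_on = {"P2": {"P5"}, "P1": {"P2"}}, resetting P5 from [QI] to [QS] sends P2 back to its default, to be re-learned under the new quantity-sensitivity setting.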
8. Scope of the model
We will discuss here how well our program, called YOUPIE (for Universal Phonology), performs with respect to the UG given it, and sketch some ways in which the model needs to be extended in order to provide a more comprehensive theory of stress acquisition.

8.1. Learning within the solution space delimited by UG

Within the limits of the UG provided it, YOUPIE does quite well in setting its parameters, as long as it is supplied with data that is not lacking crucial cues. For example, the word-based cue for QI/QS (LT2, discussed above in Section 6 and in the Appendix) operates by looking for two words with the same weight strings but with different stress strings. While determining that

21Compare the notion of linking of phonological features proposed by Chomsky and Halle (1968).


a system is QS is usually unproblematic, the Learner does less well than a linguist in making the further discrimination between QS [Rime] and QS [Nucleus]. For example, given only the three words mezen, shlaapdazhem, and shiinchdam, we can deduce that Eastern Cheremis is QS [Nucleus]. We, and the Learner, know it is QS because of mezen (2 0 0) versus shlaapdazhem (0 2 0). Figuring out that P6 is [Nucleus] requires some deduction. The contrast between the two trisyllabic words is not itself decisive, for if we have P6 [Rime], the weight strings are mezen (L L H) versus shlaapdazhem (H H H); as the weight strings are different, their different stress strings do not lead us to conclude that P6 [Rime] is incorrect. Rather, we notice the following:

(a) Stress falls on the second heavy syllable in shiinchdam.
(b) Therefore, main stress falls on the last heavy syllable.
(c) The final syllable in a word is not extrametrical: shiinchdam.
(d) The final syllable of shlaapdazhem is unstressed.
(e) Therefore, it cannot be heavy.

These are valid deductions, but our Learner does not operate this way. As we have argued throughout, we think there are advantages to a cue-based system over one that allows this kind of deductive reasoning (assuming now that such a component can be formalized). Interestingly, the costs of excluding deduction are rather minimal. For YOUPIE to arrive at P6, it needs to be exposed to a word either of the form (L L X) to contrast with mezen (L L R), or (N N N) or (N R R) to contrast with shlaapdazhem (N N R), or (N R) to contrast with shiinchdam (N N). The chances of encountering one of these contrasts in even a small number of common words are very high. In general, the lower bound on the data that YOUPIE needs to learn a grammar is so low as to be irrelevant to the situation of real acquisition. YOUPIE needs only a handful of representative words to correctly set its parameters. If words were supplied at random, it might need a few handfuls.

We do not know whether children have recourse to the sort of deduction demonstrated above to learn stress, since they learn grammars on the basis of much more data than is under discussion here. Relevant empirical evidence bearing on this question would be a language in which contrasts crucial to our Learner were systematically suppressed, but where the stress system is learned nevertheless; however, we do not know of such cases.
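The kind of contrast YOUPIE needs can be put in schematic form (a sketch of ours, not the program's actual code; weigh is an assumed function projecting a word's weight string under a given P6 setting):

```python
# Sketch: in a QS system, identical weight strings entail identical
# stress strings. A candidate P6 value is therefore eliminated when,
# under that value, two words project the same weight string but carry
# different stress strings. 'weigh' is an assumed projection function.
def eliminate_p6(words, stresses, p6, weigh):
    seen = {}
    for word, stress in zip(words, stresses):
        key = weigh(word, p6)
        if seen.setdefault(key, tuple(stress)) != tuple(stress):
            return True    # this P6 setting is inconsistent with the data
    return False
```

Run over both candidate values of P6, the test retains whichever value never produces such a clash, which is exactly the effect of the linguist's deduction above, obtained by cue rather than inference.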

Aside from some problems with extrametricality, discussed in the Appendix, the Learner performs quite well within the bounds of its theory.

8.2. Extending the parameter set
How narrow are these bounds in comparison with the distribution of stress systems in the world? The current set of parameters covers many of the basic cases treated by Hayes (1981) and Halle and Vergnaud (1987b), though these works also treat many languages whose stress patterns could not be accounted for in terms of these parameters alone. To bring the model to the point where it would equal their empirical coverage, a number of other parameters would have to be added. In the context of a learning model, the addition of a new parameter requires us also to include its associated cue; and the addition of any new cue could have consequences for the cues already established. If our current set of cues is correct, we would hope that the cues associated with new parameters would fit smoothly into the system without disrupting or invalidating any other cue. For some of the new parameters reviewed below, it is fairly easy to see what the associated cue is, and how it might be integrated into the model. Other cases pose more of a challenge.

8.2.1. Other kinds of destressing

Although we abstract away from destressing in general, the current model nevertheless obeys some restrictions on what may be destressed. In particular, we assume that a foot labelled Strong in the word-tree (i.e., a foot that would receive main stress) cannot be destressed: an expected stress pattern [2] [1 0] can appear as [2] [0 0], but not as [0] [2 0] in this theory (square brackets indicate foot boundaries).22 There are, however, languages like Creek, which have the following pattern: (a) build binary, right-dominant QS feet from left to right; (b) main stress is assigned to the rightmost foot. If the rightmost foot dominates only a single syllable, however, stress devolves upon the penultimate:

22Such a constraint is presupposed by the Trigger Prominence Principle proposed by Hammond (1984).

(31) P1 [Right], P3 [Left], P4 [Right], QS [Rime] (Creek)

[Metrical tree diagrams for two Creek words: in (a) the rightmost (branching) foot receives main stress; in (b), ifóci, the rightmost foot dominates a single syllable and stress surfaces on the penult.]
Creek is the mirror-image of Warao (8) with respect to foot construction. Main stress is on the right, as is shown by (31a); the accent in Creek is tonal, so there are no secondary stresses (P10 Yes). Creek differs from Warao, however, in words like (31b): if main stress is on the right, the final foot should be labelled S by the word tree, and the initial foot should be defooted, deriving the incorrect *ifocí. Hayes (1981, p. 61) derives main stress in Creek using a word tree labelled according to the rule: label the right node Strong iff it branches. This labelling convention entails a number of difficulties discussed by Prince (1983), and is not available in our parameter set. Halle and Vergnaud (1987b, pp. 59-60) order destressing in Creek before the assignment of main stress, and some version of this option could be added to the parameter set. It is interesting and encouraging that the Learner can nevertheless set all the relevant parameters of Creek correctly. A problem arises only when the model applies the parameters to generate stress patterns: the derived form ifocí (0 0 2) is not considered by the model as a match for the input ifóci (0 2 0). Thinking it has failed, YOUPIE then embarks on a fruitless search for other solutions. The success of the Learner under these conditions suggests that the current cues are on the right track, coming through robustly despite the surface disturbance of the pattern.
8.2.2. Other kinds of feet

A number of stress systems suggest that a larger inventory of foot types is necessary. Given the current parameters, heavy syllables occupy only one position in a QS foot, though they must be in the strong position. Possible left-dominant binary feet are then [H L] and [L L]. In some languages, feet appear to be constrained by weight requirements in which a light syllable counts as one unit (mora) and a heavy syllable counts as two. An illustration of some of the possibilities is given in (32):
(32) Foot construction: P3 [Left], P4 [Left], P5 [QS]
     Foot structure        Stress pattern    Language
(a)  [H L] [L ...          1 0 1 ...         Old English
(b)  [H] [L L] ...         1 1 0 ...         Cairene Arabic
(c)  [L L] [L ...          1 0 1 ...         Cairene Arabic
(d)  [L-L L] ...           1 0 0 ...         Old English
Both Old English and Cairene Arabic can be analyzed as having QS left-dominant binary feet constructed from the left, though each adds its own twist to this parameter setting. In (32a), Old English represents the expected analysis of a word of the form H L L ... The same sequence in Cairene is analyzed as in (32b), where the heavy syllable occupies both positions in the foot; alternatively, we can say that feet are limited to two moras.23 By contrast, the treatment of a sequence of light syllables in Cairene (32c) is as expected; here Old English (32d) departs from our miniature UG. Old English requires the strong branch of a foot to have at least two moras; a light syllable in strong position must be resolved, that is, grouped with a following syllable to make up the necessary weight.24 Thus, the foot [L-L L] in (32d) is not really ternary, but is a binary foot equivalent to [H L]. It is not at present clear to us how these systems should be incorporated into the model. Ternary feet have been reported in a number of languages, such as Cayuvava. However the relevant parameter is formulated,25 it appears that its cue has to follow the current cue for P2, which is in fact already a cue for boundedness, not binariness; the latter is a consequence of the fact that there are no other kinds of bounded feet. But if there are, then once P2 is set at [Bounded], further discriminations will have to be made.

23For various views on Cairene and related issues, see Prince (1983, p. 57ff.), Hayes (1987), and Halle and Vergnaud (1987b, pp. 60-64).
24This analysis of Old English is based on Dresher and Lahiri (1988); see also Lahiri and van der Hulst (1988).
25See Halle and Vergnaud (1987b, pp. 25-28) for discussion.

8.2.3. Other kinds of extrametricality

The current model allows only for extrametricality of syllables. There are some suggestions that extrametricality should also apply at other levels, to feet and to segments. Cairene Arabic exemplifies the latter case: although medial closed syllables of the form CVC count as heavy, in final position only superheavy syllables (CVCC) count as heavy. This follows if a final consonant is extrametrical. Brownlow (1988) shows how the model can be extended to recognize extrametrical segments.

8.2.4. Conflicting parameters for main and secondary stress

In our model, main stresses and secondary stresses are constructed in the same way, with main stress being simply the most prominent stress. In terms of parameters, only P1 refers specifically to main stress. It has been proposed, however, that there are languages where the main stress is constructed differently from the other stresses. In such cases, main stress is typically constructed first, then secondary stresses are filled in. Both situations are represented by Lenakel. Lenakel nouns are stressed as in Warao (8): main stress is usually penultimate, and secondary stresses fall on every even-numbered syllable before the main stress (see 8a); in words with an odd number of syllables, the initial syllable is destressed in clash (see 8b).

Secondary stress in Lenakel verbs, however, follows a different pattern, falling on odd-numbered syllables preceding the main stress, except for the syllable immediately preceding main stress: n8dyag~r??~dwcidamnirpzOn, tinag&nyasMvIn. The stress pattern of words with an odd number of syllables cannot be generated uniformly from right to left. Hammond (1984) proposes to assign a left-dominant foot from the right for main stress; then secondary stresses are assigned by building left-dominant feet from left to right. A stress on the syllable immediately before the main stress is destressed. Halle and Vergnaud (1987b) propose a number of analyses which involve more than one direction of foot construction. Cases have also been adduced where main stress is quantity-sensitive while secondary stress is not (e.g., English; see Hayes, 1981). If these analyses are correct, it appears that we have to allow for languages where main stress and secondary stress have different parameter values for some parameters. Incorporating this idea into the model entails no difficulties with regard to the parameters themselves, for we already have them; it would simply be a matter of creating separate counters for main and secondary stress. The difficult part is in diagnosing in a principled fashion that such a situation exists.

8.3. Effects of phonology

We noted at the outset that accentual systems are well suited to a learnability study of this sort because they are relatively independent of other aspects of grammar. Eventually, however, we must take into account interactions between stress and other parts of the grammar. Thus, various phonological processes can serve to obscure the underlying stress patterns of a language. A common example of this is a language with final stress which has a vowel epenthesis rule that causes stress to appear on the penultimate (e.g., Hebrew mélek from /malk/). Or unstressed vowels may be deleted, obscuring the alternating character of stress, as in Odawa. In these cases stress applies to underlying forms that must somehow be recovered by the Learner. In such cases, the stress Learner would presumably interact with other components of the grammar. In the Hebrew case, for example, the morpheme mélek 'king' also appears as malk, for example, malk-í 'my king'. Also, there would be information from the syllable structure component that lk is not a well-formed syllable coda. These facts, added to the surface exceptionality of penultimate stress, could lead the Learner to conclude that the stressable form of the morpheme 'king' is malk, not mélek. By not including the effects of phonology in our model, we no doubt ease the Learner's task, in comparison to a human learner. On the other hand, it should be recalled that the phonology often supplies additional evidence as to the nature of stress in a language. There are many cases in the literature where it is shown that processes like vowel harmony and vowel reduction, as well as consonant strengthenings or weakenings of various kinds, are sensitive to metrical structure. Stress shifts occasioned by the deletion of a stressed vowel are sometimes the main or the only evidence for establishing metrical constituency in a language.26
Therefore, there are cases where the job of the Learner is made more difficult by not having access to phonological evidence bearing on the values of the metrical parameters.

8.4. Morphology and stress

Our model abstracts away from the influence of morphology on stress, although any comprehensive account of stress and stress acquisition will have to take morphology into account. Two important sources of morphological influence involve the cycle and morphologically conditioned parameter settings; also relevant here are languages with lexical accent.

26See Halle and Vergnaud (1987a) and (1987b, p. 28f.) for a number of such cases.

8.4.1. Stress and the cycle

Our model implicitly assumes that the domain of word stress is the entire word considered as a weight string with no further internal structure. This must be true of monomorphemic words, and in many languages it is also true of morphologically complex words, which act like monomorphemes with respect to stress assignment. It is very often the case, however, that the morphological structure of a word influences its stress pattern. The basic main stress rule of English, for example, is the same as that of Latin: stress a penultimate syllable if it is heavy, otherwise stress the antepenultimate. The word solémnity is stressed on the antepenult in accordance with this pattern, but sólemnness violates the rule, since it has stress on the antepenult though its penult is a heavy syllable - compare agénda. The location of main stress in sólemnness must rather be explained with reference to the stress of sólemn, from which it is derived. In English, a class of suffixes including -ness, -less, -ful, etc., are stress neutral; that is, they neither receive stress nor influence its position. Stress neutrality can be accounted for by applying stress to a word domain which does not include these affixes. Stress applies to inner word domains also in the case of affixes which are not stress neutral. For example, the residual stress on the second syllable of elastícity is due not to the surface syllable pattern (cf. serendípity), but to the fact that it bears main stress in elástic. We must therefore suppose cyclic application of stress, first to elastic, and then to elasticity. A number of principles regulate the interaction of metrical structures assigned at different cyclic levels.27 A complete treatment of cyclic stress assignment in the learning program would require some interaction with a morphological component, and probably also with other parts of the phonology.
As for the stress learner itself, the presence of morphological influence would be signalled quite early by the existence of identical weight strings with differing stress strings. Cues for identifying the various possible modes of cyclic application remain to be formulated.

8.4.2. Morphologically conditioned parameter settings

In many languages, certain morphological classes have their own parameter settings for stress. For example, nouns may have a different stress pattern from verbs, or certain affixes may behave in a special way. The nontransparency of such cases would be easily detected, but the learner would have to distinguish them from cases of cyclic application on the one side, and simple exceptions on the other.

27See Kaisse and Shaw (1985) and the other articles in Phonology and the Lexicon in Phonology Yearbook 2 for a review of some proposals, mainly due to Kiparsky, and Halle and Vergnaud (1989a, b) for a different view.

8.4.3. Lexical accent

In all the cases discussed above, stress is assigned to words by rule. Some languages, such as Vedic and Lithuanian, distinguish between accented and unaccented morphemes. An accented morpheme is assigned an accent in the lexicon; thus, in such languages, the placement of stress is not determined only by rule, but is also influenced by whether or not a morpheme bears a lexical accent. Accented syllables in languages with lexical accent play much the same role as heavy syllables in QS languages. Thus, the basic accentuation principle (BAP) for the languages named above is: stress the leftmost accented vowel or, in the absence of accented vowels, the leftmost vowel (Halle & Vergnaud, 1987b, p. 84ff.); compare Aguacatec Mayan in (12) above. In QS languages, heavy syllables are systematically distinguished from light syllables by their structure; once the level of QS has been determined, any new word can be assigned its weight string upon inspection. Languages with lexical accent present an additional challenge, however, in that accented syllables do not necessarily have any distinctive structure. A learner of such a language essentially has the task of learning a QS-type system without knowing which syllables are light and which are heavy. Learning lexical accent shares some features with learning exceptions.

8.5. Exceptions

Exceptions could pose serious problems to a cue-based learner, and learners must have ways of detecting them and distinguishing them from the regular cases. In Section 9, we turn to the problems raised by exceptions, and make some suggestions for extending the learning model to handle them.
8.6. Developmental evidence

Ideally, a model of stress acquisition should not only reach the correct final state of the grammar being learned, but should also simulate the developmental progression of real children. However, we have not been able to find in the stress acquisition literature evidence that can be directly related to parameter setting. Interpreting the data of stress acquisition is complicated, especially at the early stages, by difficulties in articulation as well as the patterns of child phonology. For example, certain segments or sequences cause distortions of stress, as does the frequent reduction of syllables and

segments.28 Moreover, the effects of morphology and lexical exceptions can swamp the delicate observations required to establish shifts in parameter settings. Nevertheless, patterns of first and also second language acquisition remain a potentially rich source of data bearing on the learning model. If acquisition of stress is indeed guided by the cues and parameters of the model being proposed here, we would expect to find patterns of errors (in new words, not necessarily in known words whose stress pattern could simply be stored) that result from incorrectly set parameters. We might also expect stress acquisition to be characterized by relatively well-defined stages, of the sort that have been reported for aspects of syntactic and morphological acquisition. By virtue of its default settings, our current learning model (see Appendix A.2 for details) starts out assuming quantity-insensitive binary feet in all cases; in learning a language like English, it would then move to quantity-sensitive unbounded feet before arriving at the correct setting of quantity-sensitive binary feet. It is an open question whether similar fluctuations can be observed in the course of acquisition by children or nonnative speakers learning English. Can we detect a point in the acquisition of English, for example, when children become aware that stress is sensitive to quantity, or that feet are binary, or that stress is computed from the end of the word? Since in our model the setting of destressing parameters follows that of stress assignment, we expect that the acquisition of destressing would be relatively late; is that in fact the case? The answers to these and similar questions regarding the acquisition of stress in English and other languages should be, in principle, readily obtainable, though in practice they may require fairly subtle experiments; they would, moreover, shed much light on the acquisition of grammar in general.

9.
In the ideal case, the observed stress patterns of a language are precisely those that follow from the parameter counter of its accentual system. Deviations from this type of transparency may occur for a number of reasons: (1) language-particular rules (if such exist); (2) exceptions; (3) coexisting systems; and (4) morphological sensitivity. How are such phenomena to be accommodated within a learning system that strives for strict determinism? The danger is that certain conflicting items in the form of exceptions, language-specific rules (necessarily subcases of parametrically derived results),

28See Klein (1984) for discussion and references.

or coexisting systems (Latinate vs. native vocabulary in English), may drive the learner into setting parameters to their marked values when these decisions turn out to be (at least partially) wrong. The learner should be robust enough to weather these conditions without crashing, and to succeed in fixing its parameters. In terms of the learning system under discussion here, it is necessary to prevent the learner from receiving conflicting information. We shall consider two among the many possible ways of protecting the learner from these sorts of conflicts.

9.1. Etching

Certain types of exceptions, concurrent systems, and noise can be detected and/or filtered out of the input stream. Let us suppose the following model: each parameter has an accompanying counter associated with it. When the learner encounters a cue, no permanent change of state occurs within the system; rather, the counter associated with the parameter in question is incremented. In the case of an incremental learner, a temporary counter is also created, and it is this counter that is applied to the input form sans stress markings. The derived form is then compared with the original input form. A new form is then read in with the counter in its initial state. As cues are encountered, the corresponding counters are incremented until a saturation point is reached. At this point, the counter discharges, and the parameter to which it is associated is set. It should be emphasized that until this discharge the only lasting effects of input forms are on the counters. No permanent parameter settings take place until these counters are saturated. One can speculate as to an appropriate saturation level for these counters. The intervals separating the appearance of cues in the input stream may likewise play a role. Future experiments should enable us to determine the form of this procedure with greater precision.
Etching serves as a kind of buffer which shields the learner from various sorts of noise and relatively infrequent (i.e., in the input stream) exceptions. The probability of an individually exceptional word creating a lasting impact on the learner approaches zero, particularly with a judicious setting of the saturation point. One should recall that the counter is advanced by encountering any cue for its presence. Such cues may be expected to be found in any word of reasonable length. Note further that short words (monosyllabic or bisyllabic) are not fertile loci for cues. All systems converge at monosyllabic forms, and any bisyllabic stress pattern is derivable from an impressive number of different counters. Such words will have no effect on the counters, since, given a choice, the system will always assume the least marked counter compatible with the form in question.
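The counter mechanism just described can be sketched as follows. This is our own illustration, not part of YOUPIE; in particular, the saturation threshold is a free choice that the text deliberately leaves open.

```python
class ParameterCounter:
    """Sketch of the 'etching' buffer: cue sightings increment a counter,
    and only at saturation does the counter discharge, permanently
    setting its parameter to the marked value."""

    def __init__(self, saturation=3):   # saturation level is illustrative
        self.count = 0
        self.saturation = saturation
        self.discharged = False         # has the parameter been set?

    def see_cue(self):
        if self.discharged:             # settings, once made, are permanent
            return
        self.count += 1
        if self.count >= self.saturation:
            self.discharged = True      # counter discharges; parameter is set
```

An isolated exceptional word thus advances a counter at most once and leaves no lasting trace; only a steady stream of cues can push a counter to saturation.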

9.2. A theory of exceptions

As noted above, exceptions provide a challenge for a strictly deterministic learning system. One wishes to avoid situations in which the learner is fooled into setting a parameter to its marked value by an exceptional form. A first question that can be posed is whether exceptions are absolute or relative. An absolute exception is one involving a stress pattern which is not derivable from any parameter counter: it is thus exceptional with respect to all parameter counters. A relative exception is a stress pattern which is exceptional with respect to a given counter: it may be perfectly regular with respect to some other counter. If exceptions were always absolute, the detectability problem would be solved, for no amount of parameter setting would succeed in generating a form that would match the input form. In fact, the exceptions we have studied are relative rather than absolute. For example, stress in Polish usually falls on the penultimate syllable:

(33) Polish penultimate stress
     mówa  zúpa  kotlína  pokolénie  pomidóry

There is also a group of words, typically of Greek or Latin origin and usually ending in -ika or -yka, that are stressed on the antepenultimate syllable:

(34) Polish antepenultimate stress
     fonétyka  matemátyka  lógika  ópera  mínimum

The exceptional forms in (34) are quite regular in a system like that of Latin. A left-dominant, QS, binary foot is constructed on the right edge of the word, with the final syllable being extrametrical. Thus, penultimate syllables are stressed if heavy, otherwise the antepenultimate syllable is stressed. A stress pattern such as that of ópera is exceptional in Polish, but perfectly regular in Latin. Exceptions, then, are relative and not absolute, at least in the cases that we have considered up to now. We have seen that exceptional forms can appear in the data stream that are indistinguishable from regular forms in other systems. What is desired, then, is a system able to encounter exceptions at any point in the learning process, and which will still arrive at the correct parameter values. It can be shown that a deterministic, incremental learner requires that exceptions go invariably in the unmarked direction: that is, the normal words must drive the learner to a marked value for some parameter, while the exceptions must be unmarked. There are two possible scenarios:29
29Note that these scenarios are constructed without taking into account other possible strategies, such as etching, described above.

(a) The learner encounters the unexceptional forms first, and sets the relevant parameter to its marked value. Subsequently, it finds an exceptional form. This form is immediately detectable as incompatible with the current counter. It does not drive the system to the marked value of the parameter for which it is exceptional; indeed, the system has already been set to this marked value. The exceptionality will be detected when the Applier fails to obtain a match with the (exceptional) input form using the current counter. Such a form is flagged. The counter is reinitialized, and the learner finds the appropriate values for the form. This second counter is bound to flagged forms. It is not applied to unflagged words or morphemes. Notice that this situation is quite different from one causing a counter change in a marked direction (the normal learning path). In such a case, a parameter is set to its marked value based on positive evidence encountered in the data stream. The exceptional case is one where a given datum requires an unmarked parameter value where previous data have already driven the learner to its marked setting. The correct counter can never be arrived at without violating the monotonicity (strict determinism) requirement. Thus, such exceptional forms are detectable.

(b) Suppose the exceptional form is encountered first. This involves no parameter setting (at least in so far as the exceptional feature is concerned). The learner does not commit incorrectly to a marked value (which would be fatal). Eventually, normal forms are processed which will drive the learner to a marked state. Subsequent exceptional forms (including the one encountered initially) will be correctly flagged and assigned the exceptional counter.

This method can be extended in a natural way to handle language-specific rules (if such exist) and coexisting systems.
On the other hand, an incremental deterministic learner will have trouble if an exceptional form ever requires the marked value of a parameter, while the unexceptional forms require the unmarked value. For then the exceptional form will induce the learner to incorrectly commit to the marked value of the parameter; as such decisions cannot be revoked, the learner will then proceed to flag the unexceptional forms as exceptional. If the correct learning theory is indeed incremental and deterministic, then, it follows that exceptional forms should rather adhere to the scenario sketched above. Conversely, if we find that exceptions are generally unmarked relative to the regular forms, that would lend support to a deterministic, incremental learning model. For now, the theory of unmarked exceptions remains an intriguing hypothesis, which, like a number of other hypotheses discussed here, can only be explored in the context of the further development of the learning model, an enterprise which is still in its first stages.
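The flagging of relative exceptions can be caricatured in batch form: choose the settings that cover the most words, flag the residue, and learn a second counter bound only to the flagged forms. This is our own illustration (the real learner is incremental, not batch); `generate` is a deliberately trivial stand-in with just the penult/antepenult choice relevant to the Polish example.

```python
def generate(setting, weights):
    """Toy stress generator: one 'parameter' choosing penultimate
    versus antepenultimate stress (a stand-in for a full counter)."""
    n = len(weights)
    stressed = n - 2 if setting == 'penult' else n - 3
    return tuple(1 if i == stressed else 0 for i in range(n))

def learn_with_exceptions(words, settings_space, generate):
    """Pick the settings covering the most (weight, stress) pairs;
    mismatching words are flagged and bound to a second setting."""
    def coverage(s):
        return sum(generate(s, w) == st for w, st in words)
    main = max(settings_space, key=coverage)
    flagged = [(w, st) for w, st in words if generate(main, w) != st]
    second = None
    if flagged:
        second = max(settings_space,
                     key=lambda s: sum(generate(s, w) == st
                                       for w, st in flagged))
    return main, flagged, second

# Polish-like data: three regular penult words, one opera-type exception.
polish = [
    (('L',) * 2, (1, 0)),
    (('L',) * 3, (0, 1, 0)),
    (('L',) * 4, (0, 0, 1, 0)),
    (('L',) * 4, (0, 1, 0, 0)),   # antepenult: regular in Latin, not Polish
]
```

The regular words fix the main counter; the lone antepenult word is flagged and receives the Latin-like setting, mirroring the scenario in which the exceptional form never sets an unflagged parameter.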

Appendix

A.1. Description of the model

Here we present a brief description of the organization of YOUPIE, the stress learning program. The model is implemented in a version of MicroPROLOG.
The Syllable Parser

Words are input as described in Section 3, as strings of the form (word v a 1 n c o 2 u v e 0 r). The words are then parsed into syllables. From this syllable parse, two representations are derived for each word. The first is a rime projection with stress levels indicated, as in (4) above, which serves as the input to the learning system. A second representation is derived by removing the stress level indicators from the rime projection. These representations are used later by the Applier in building metrical structures.
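The two representations can be extracted mechanically from the input encoding, in which a stress digit follows the vowel it marks. A minimal Python sketch (the actual program is in MicroPROLOG, and full onset/rime syllabification is omitted here):

```python
def stress_string(tokens):
    """Collect the per-syllable stress levels, in order, from an input
    word encoded with a digit after each stress-bearing vowel."""
    return tuple(int(t) for t in tokens if t.isdigit())

def strip_stress(tokens):
    """Remove the stress indicators, leaving the segments from which
    the stressless rime projection is built."""
    return [t for t in tokens if not t.isdigit()]

# The example from the text: (word v a 1 n c o 2 u v e 0 r)
vancouver = ['v', 'a', '1', 'n', 'c', 'o', '2', 'u', 'v', 'e', '0', 'r']
```

Applied to the example, `stress_string` yields the stress pattern (1 2 0) and `strip_stress` recovers the bare segment string.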
The Classifier

From the Syllable Parser, the forms are sent to a Classifier, whose main function is to test that the system is transparent, that is, that there do not exist obvious conflicts whereby two words with identical syllable structure have different stress patterns. Such conflicts can arise from various sources: exceptions, morphological differences, the operation of phonological rules which conflate different underlying sources, etc. If there are such conflicts in the data, the program is alerted, for the learning model will be unable to arrive at a successful setting of parameters. It is at this point, presumably, that other linguistic modules can be consulted in an effort to resolve the contradictions. Such modules can include an exception analyzer along the lines proposed in Section 9, a morphological component, other parts of the phonology, and so on. We have not implemented these other components in our model.
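The transparency test amounts to grouping words by their weight strings and checking that each group carries a single stress pattern. A sketch of this check, in our own Python rather than the program's MicroPROLOG:

```python
def transparent(words):
    """Classifier check: no two words with identical syllable (weight)
    structure may carry different stress patterns.
    Returns (is_transparent, list of conflicting weight strings)."""
    patterns = {}
    for weights, stresses in words:
        patterns.setdefault(weights, set()).add(stresses)
    conflicts = [w for w, s in patterns.items() if len(s) > 1]
    return len(conflicts) == 0, conflicts
```

A conflict does not say which word is exceptional, only that the data are not transparent; resolving it is left to the other modules mentioned above.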
The Learner

If the Classifier does not find an obvious contradiction in the data, the program calls on the Learner to set the parameters.30 The Learner is equipped with a battery of cues, geared to the parameters. The cues are described in Section A.2 below. The Learner outputs a set of parameter values which are its hypothesis as to the grammar of the language in question.
30Passing the Classifier does not guarantee that there actually is a solution in terms of the metrical parameters, or, if there is, that the Learner will be able to find it on the basis of the available data.

The Applier

The Applier uses the parameter settings received from the Learner to build metrical structures and assign stress. The Applier functions as both a checker and a generator. As a checker, it builds metrical trees on the stressless rime projections set aside by the Syllable Parser. The trees are then used to assign stress. The derived stress pattern is then checked against the input data. If the Learner has arrived at the correct parameter settings, the derived forms and the original forms ought to match. Since the Learner abstracts away from destressing, the Applier will sometimes generate stresses which do not appear in the input, and must allow for this in the matching procedure. In practical terms, this means that if the input stress string is (0 2 0), and the Applier derives (1 2 0), this will count as a match, though the model flags the abstract extra stresses. Such flagged representations can serve as input to a Destressing module. The second function of the Applier is to generate the stress patterns of new words in the language which are not part of the input set. If it has been provided with the correct parameter settings, the Applier will be able to correctly predict the stress pattern of any new word, even if it has not previously encountered its particular weight string.
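The destressing-tolerant matching procedure can be sketched directly from the description above: a derived stress where the input has none is flagged rather than rejected, but every observed stress must be generated. A hedged Python sketch (function name ours):

```python
def matches(derived, observed):
    """Applier's match: derived stresses may exceed the observed ones
    (the surplus is flagged as input to a Destressing module), but
    every observed stress must be generated exactly.
    Returns (match?, positions of flagged abstract stresses)."""
    if len(derived) != len(observed):
        return False, []
    flagged = []
    for i, (d, o) in enumerate(zip(derived, observed)):
        if o == 0 and d > 0:
            flagged.append(i)       # abstract stress: destressing candidate
        elif d != o:
            return False, []        # a real mismatch; not a match
    return True, flagged
```

So derived (1 2 0) matches observed (0 2 0) with position 0 flagged, while derived (0 0 2) against observed (0 2 0) fails outright, which is exactly the Creek situation discussed in Section 8.2.1.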
The Cranker

Sometimes, the Learner does not succeed in setting all the parameters correctly. A typical case involves extrametricality. The only positive cue is one which excludes extrametricality; where it cannot be definitely ruled out, the Learner will sometimes be unable to arrive at a sure setting, and will assume no extrametricality as a conservative strategy. Where this turns out to be wrong, the Applier will eventually fail to match its derived forms to the input forms. In that case, the parameter set is passed to the Cranker, which is the brute-force dull learner described in Section 4.2. The Cranker looks for another legal setting of the parameter counter, and returns this value to the Applier. This new setting becomes the model's hypothesized grammar. As the Learner has increased in its ability to fix and exclude parameter values, the number of legal settings remaining open to the Cranker has decreased, to the point where they mostly involve extrametricality options. An improvement in the Learner's ability to deal with extrametricality could make the Cranker obsolete. As things stand, most failures on the part of the Learner are due to the fact that the language it is working on does not fall within the solution space of the model, and such failures cannot be remedied by the Cranker, since no setting of parameters will succeed.
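The Cranker's brute-force search can be sketched as an enumeration over whatever values the Learner left open, stopping at the first setting the Applier accepts. A Python sketch under our own naming (the real Cranker works over the full parameter counter):

```python
from itertools import product

def crank(open_params, derived_ok):
    """Enumerate combinations of the parameter values left open by the
    Learner until one passes the Applier's check (derived_ok); return
    None if none succeeds, i.e. the language lies outside the model's
    solution space."""
    names = list(open_params)
    for values in product(*(open_params[n] for n in names)):
        candidate = dict(zip(names, values))
        if derived_ok(candidate):
            return candidate
    return None
```

For example, with only the right-edge extrametricality option left open, the Cranker tries [No] and then [Yes]; if neither satisfies the Applier, it reports failure.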

190

B. E. Dresher and J.D. Kaye

A-2. The structure of the Learner

The Learner operates in Batch Mode, in the sense discussed above; that is, it can assume that it has access to all the relevant data at once, and so can draw conclusions from the absence of some cue, as well as its presence. Cross-parameter dependencies require the cues to be searched in a particular order. The order used by the Learner is given in (35):

(35) Order in which parameters are set in the Learner
P5: Feet are quantity sensitive [No/Yes]
P6: Feet are QS to the [Rime/Nucleus]
P10: Feet are noniterative [No/Yes]
P8A-P8L: There is an extrametrical syllable [No/Yes] on the Left
P8A-P8R: There is an extrametrical syllable [No/Yes] on the Right
P2: Feet are [Binary/Unbounded]
P1: The word-tree is strong on the [Left/Right]
P7: A strong branch of a foot must itself branch [No/Yes]
P3-P4: Feet are built from the [Left/Right] and strong on the [Left/Right]

We will now review the cues used by the Learner to set each parameter.
P5 and P6: Quantity sensitivity

The first parameter set is P5, which deals with quantity sensitivity (QS). Evidence for QS is the existence of at least two words with the same number of syllables, but with different stress patterns. If no such pairs are found (i.e., words with the same number of syllables have the same stress patterns), conclude the system is quantity insensitive (QI). If feet have been determined to be QS, then we have to test P6, the level of QS. This test works on the same idea as the test for QS, but at a further level of detail. In the theory we are working with, there are two possibilities: if we have QS [Rime], then closed syllables (R) and long nuclei (N) should act the same; if QS [Nucleus], then Rs should act the same as light open syllables (L). These possibilities are tried in order. First, words are converted into representations where Rs are treated as equivalent to Ns; if all words with the same syllable representations have the same stress patterns, we conclude P6 [Rime]. If not, we try P6 [Nucleus]: words are converted into representations where Rs are treated as equivalent to Ls; if we have uniform stress patterns, the test succeeds. For example, consider the Latin words represented in (36):

Recall that P9 Defooting is not set by the Learner, which abstracts away from destressing.

A computational learning model

191

(36) Various settings of P5 and P6 for words of Latin


Words      ho2mi0ne0m   a0mi2cu0s   di0ce2ndu0m
Stress     2 0 0        0 2 0       0 2 0
Rimes      L L R        L N R       L R R
QI         S S S        S S S       S S S
QS[Rime]   L L H        L H H       L H H
QS[Nuc]    L L L        L H L       L L L

These words have rime structures as indicated. If Latin were QI, then all syllables would count for equal weight, which can be indicated by representing every syllable by S. All the sample words now have the same representations, as far as the Learner is concerned. However, their stress patterns are not all the same; therefore, the Learner concludes that this level of representation is not adequate to discriminate words with regard to stress patterns, and sets P5 to QS. If Latin is QS [Rime], then closed syllables and syllables with branching nuclei ought to be equivalent; we can thus designate every light syllable by L, and every heavy syllable, including closed syllables, by H. Now words with the same structures have the same stress patterns, and the representation succeeds in giving a consistent set of mappings from weight strings to stress strings. For purposes of illustration, suppose we were to go on to test for QS [Nucleus]; now, only branching nuclei count as heavy, while all other syllables are light. This representation does not succeed, as we have two words with three light syllables but different stress patterns, indicating that the word classes are not correctly represented.

P10: Iterated feet

We assume that all languages have feet; however, in many languages only one foot, the one bearing main stress, is visible. Since, in our model, feet generally bear stress, the absence of any secondary stresses in the data is a diagnostic that feet are not iterated.
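The P5/P6 procedure illustrated with the Latin data in (36) can be sketched as follows. This is a simplified reconstruction of our own, not the program's code; the function names and the encoding of words as (weight string, stress string) pairs are hypothetical.

```python
def consistent(words, project):
    """True if all words with the same projected weight string share the
    same stress string.  Each word is a (weights, stress) pair, with
    weights drawn from 'L' (light), 'R' (closed), 'N' (long nucleus)."""
    seen = {}
    for weights, stress in words:
        key = tuple(project(w) for w in weights)
        if seen.setdefault(key, stress) != stress:
            return False
    return True

def set_quantity_sensitivity(words):
    """Try the representations in order: QI, then QS[Rime], then QS[Nucleus]."""
    if consistent(words, lambda w: 'S'):                          # all syllables equal
        return ('QI', None)
    if consistent(words, lambda w: 'L' if w == 'L' else 'H'):     # R and N both heavy
        return ('QS', 'Rime')
    if consistent(words, lambda w: 'H' if w == 'N' else 'L'):     # only N heavy
        return ('QS', 'Nucleus')
    return (None, None)   # no level of representation yields consistent mappings
```

Run on the Latin forms of (36), the QI projection fails (three words map to S S S with two different stress patterns), while the QS [Rime] projection succeeds, so the sketch returns `('QS', 'Rime')`.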
P8A, P8: Extrametricality

The next set of tests deals with extrametricality. These tests are asymmetrical: they can rule out extrametricality at an edge, but they cannot conclude that extrametricality definitely is present. The presence of a stress at the left or right edge of a word is enough, in this system of parameters, to rule out extrametricality at that edge. By contrast, the lack of any stresses at an edge may indicate extrametricality, but not necessarily. Consider in this connection Garawa (cf. (24) above).
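The asymmetry of the cue can be sketched as follows (our own illustration; the function name is hypothetical): a single edge stress anywhere in the data excludes extrametricality at that edge, whereas a stressless edge merely leaves the option open.

```python
def rule_out_extrametricality(stress_strings):
    """Edge-stress cue: a stress at an edge in any word rules out
    extrametricality at that edge.  Returns which settings remain
    possible; absence of edge stress is NOT proof of extrametricality."""
    left_possible = all(s[0] == 0 for s in stress_strings)
    right_possible = all(s[-1] == 0 for s in stress_strings)
    return {'Left': left_possible, 'Right': right_possible}
```

Note that a `True` value here only means "not yet excluded"; it is exactly this residual uncertainty that forces the later cues to loop through the surviving extrametricality options.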


We observed above that its stress pattern can be derived by building left-dominant feet from right to left, destressing a non-peripheral foot in clash. Some schematic forms are given in (37):

(37) Garawa with and without extrametricality

Words     Construction            Extrametrical on Right
S         [2]                     (0)
SS        [2 0]                   2 (0)
SSS       [2][1* 0]               2 0 (0)
SSSS      [2 0][1 0]              2 0 1 (0)
SSSSS     [2][1* 0][1 0]          2 0 0 1 (0)
SSSSSS    [2 0][1 0][1 0]         2 0 1 0 1 (0)
SSSSSSS   [2][1* 0][1 0][1 0]     2 0 0 1 0 1 (0)

In (37), S represents a syllable. In the second column we show the foot groupings according to our analysis; stresses marked with * are later destressed, so (2 1 0) becomes (2 0 0), and so on. Although all final syllables are stressless - the result of constructing a left-dominant foot at the right edge of the word - they may not be considered extrametrical. If they are, then the reader will be able to verify that no coherent solution, in terms of the parameter set assumed here, can be given to the resulting metrical pattern, which is shown in the third column.32
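For concreteness, the derivation underlying the second column of (37) can be reconstructed as in the following sketch (our own Python illustration, not the program's code; it hard-codes the Garawa settings: left-dominant binary feet built right to left, word tree strong on the left, and destressing of a secondary stress in clash with the main stress).

```python
def garawa_stress(n):
    """Derive the surface stress pattern of an n-syllable Garawa word
    under the analysis in (37): left-dominant binary feet from right to
    left, main stress on the leftmost foot, destressing under clash."""
    if n == 0:
        return ()
    stress = [0] * n
    pos = n - 2                 # head of the rightmost binary foot
    while pos > 0:
        stress[pos] = 1         # secondary stress on each foot head
        pos -= 2                # next foot to the left
    stress[0] = 2               # word tree strong on the left: main stress
    if n > 1 and stress[1] == 1:
        stress[1] = 0           # destress a secondary in clash with main stress
    return tuple(stress)
```

Thus a three-syllable word is first footed as (2 1 0) and then destressed to (2 0 0), while a four-syllable word surfaces as (2 0 1 0) with no destressing, matching the schematic forms above.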
P2: Binary versus unbounded feet

If a language has unbounded feet, only one peripheral light syllable in a word is eligible to receive stress. Therefore, if the Learner finds a nonperipheral stressed light syllable, or stress on both the rightmost and leftmost light syllable (not necessarily both in the same word), it can conclude that feet are bounded, that is, binary. The actual operation of this cue is complicated when extrametricality has not been ruled out, because extrametricality can affect what the peripheral syllable is. Thus, if extrametricality is in force, the second syllable from the edge may be stressed because it is peripheral, not because feet are bounded. In that case, the cue first assumes that extrametricality is in force, and will not conclude there are binary feet unless its evidence is unambiguous.

32. Outside the parameter set assumed here, there are of course possible solutions. For example, if we dissociate main stress from secondary stress, we can propose to construct a binary left-dominant foot at the left edge of the word; then, setting the final syllable extrametrical, build right-dominant feet from the right edge of the word, observing the condition that the foot on the left edge not be disturbed (this solution is reminiscent of that of Hayes (1981), who, however, did not suppose extrametricality). While languages have been reported in which main stress and secondary stress have different parameter settings, it is desirable to let that be the marked option.
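The core of the P2 boundedness cue, setting the extrametricality complication aside, can be sketched as follows (our own simplified reconstruction; the name and data encoding are hypothetical, with words as (weight string, stress string) pairs).

```python
def feet_are_binary(words):
    """P2 cue, ignoring extrametricality: with unbounded feet only one
    peripheral light syllable per word can bear stress.  A nonperipheral
    stressed light syllable, or stressed light syllables at both edges
    (possibly in different words), signals that feet are binary."""
    stressed_edges = set()
    for weights, stress in words:
        lights = [i for i, w in enumerate(weights) if w == 'L']
        for i in lights:
            if stress[i] > 0:
                if i not in (lights[0], lights[-1]):
                    return True                     # nonperipheral stressed light syllable
                if i == lights[0] and i != lights[-1]:
                    stressed_edges.add('left')
                if i == lights[-1] and i != lights[0]:
                    stressed_edges.add('right')
    return len(stressed_edges) == 2                 # both edges stressed across the corpus
```

In the full cue, as described above, this check would be repeated for each extrametricality option still open, and binary feet would be concluded only if every surviving option agrees.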
P1: Main stress

The cue for main stress utilizes the notion of a foot-sized window at the edge of a word. Main stress is confined to a peripheral foot; therefore, it should consistently appear in one of the windows which correspond to a peripheral foot. Such a test presupposes knowledge of P5 (QS or QI) as well as P2 (Binary or Unbounded feet), for these parameters tell us the dimensions of the window. Ideally, extrametricality should be known, since it plays a role here also; but where it has not been ruled out, we must again loop through all the possible values of extrametricality in applying this cue. Where feet are unbounded, the cue is not triggered by words with only light syllables, because it is not yet known whether such words have feet or not (see P7 below). There exist a number of other cues which could serve to indicate which side main stress is on. One easy but opportunistic test is to note the relative positions of main stress and any secondary stresses: if a secondary stress falls to the left, then main stress must be on the right (and vice versa). This is true in our parameter system, because main stress must always be assigned to a peripheral foot; thus, no foot bearing secondary stress can intervene between main stress and the edge to which it is assigned. Although this cue can often give a quicker result than the one we have implemented, it is not in the spirit of Appropriateness, as it involves noting the relation between main and secondary stress, and this sort of relation is not a part of the parameter for P1. As with most opportunistic tests, this one can be made invalid by a slight change to the parameter set.
Suppose, for example, that peripheral feet can be extrametrical at the word level, and yet preserve, as Selkirk (1984) suggests, some stress assigned at the foot level; in that case, there will be a secondary stress intervening between the main stress and the edge of the word, and this cue could jump to the wrong conclusion about which side main stress is on. Also, this test does not extend to languages where feet are not iterated, and so would in any case not be sufficient on its own.

P7: Obligatory branching (OB)

If this parameter is set to [Yes], then light syllables can never anchor a foot. Hence, a light syllable can receive stress only from the word-tree (by P1) in words where there are no heavy syllables. To determine the value of this parameter, then, the Learner looks to see if there are any stressed light syllables whose stress is not due to P1. If there are, then set P7 to [No]; otherwise, it is [Yes].
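Since P1 is set before P7 in the ordering of (35), the Learner already knows which edge carries main stress when this cue applies. The test might be sketched as follows (our own illustration, with hypothetical names; weight strings use 'L' for light, anything else counting as heavy).

```python
def obligatory_branching(words, main_stress_side):
    """P7 cue: feet require a branching (heavy) head unless some light
    syllable bears a stress not attributable to the word tree (P1),
    i.e. other than main stress in an all-light word."""
    for weights, stress in words:
        all_light = set(weights) == {'L'}
        main_pos = 0 if main_stress_side == 'Left' else len(weights) - 1
        for i, w in enumerate(weights):
            if w == 'L' and stress[i] > 0:
                if not (all_light and i == main_pos):
                    return 'No'     # a stressed light syllable not due to P1
    return 'Yes'
```

A stressed light syllable in a word containing a heavy syllable, for instance, immediately forces P7 to [No], since its stress cannot come from the word tree under this parameter setting.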


P3 and P4: Foot construction and dominance

It appears that these parameters must be set together. The cue is the one discussed at length in Section 5.1 and illustrated in (21): a moving window corresponding to each of the four possible settings of P3 and P4 sweeps across the word. Any setting in which a stressed syllable appears in what should be an unstressed position is ruled out.
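This elimination procedure can be sketched for the simplest case, QI binary iterative feet with no extrametricality (our own reconstruction; in the full cue the sweep is repeated for other settings of P5, P2, and the extrametricality options still open).

```python
from itertools import product

def possible_P3_P4(words):
    """Sweep the four settings of P3 (direction of construction) and
    P4 (dominant side) over each word, assuming QI binary iterative
    feet and no extrametricality.  A stress observed where a setting
    predicts a weak position rules that setting out."""
    def heads(n, direction, dominance):
        # Foot start positions, pairing syllables from the chosen edge.
        starts = range(0, n - 1, 2) if direction == 'Left' else range(n - 2, -1, -2)
        head = {s + (0 if dominance == 'Left' else 1) for s in starts}
        if n % 2 == 1:
            # A leftover single syllable forms a unary foot, hence a head.
            head.add(n - 1 if direction == 'Left' else 0)
        return head
    alive = set(product(['Left', 'Right'], ['Left', 'Right']))
    for stress in words:
        n = len(stress)
        for setting in list(alive):
            h = heads(n, *setting)
            if any(stress[i] > 0 and i not in h for i in range(n)):
                alive.discard(setting)
    return alive
```

Odd-length words alone can leave two settings alive (left-to-right strong-left and right-to-left strong-right predict the same head positions), which is why words of both parities are needed to pin the pair down.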

References

Baker, C.L. (1979). Syntactic theory and the projection problem. Linguistic Inquiry, 10, 533-581.
Berwick, R.C. (1985). The acquisition of syntactic knowledge. Cambridge, MA: MIT Press.
Berwick, R.C., & Weinberg, A.S. (1984). The grammatical basis of linguistic performance. Cambridge, MA: MIT Press.
Borer, H., & Wexler, K. (1987). The maturation of syntax. In T. Roeper & E. Williams (Eds.), Parameters and linguistic theory. Dordrecht: Reidel.
Bowerman, M. (1982). Reorganizational processes in lexical and syntactic development. In E. Wanner & L.R. Gleitman (Eds.), Language acquisition: The state of the art. Cambridge: Cambridge University Press.
Bowerman, M. (1985). Beyond communicative adequacy: From piecemeal knowledge to an integrated system in the child's acquisition of language. In K.E. Nelson (Ed.), Children's Language, Vol. 5. Hillsdale, NJ: Erlbaum.
Bowerman, M. (1987). Commentary: Mechanisms of language acquisition. In B. MacWhinney (Ed.), Mechanisms of language acquisition. Hillsdale, NJ: Erlbaum.
Brownlow, N.D. (1988). A computer-based learning theory for final segment extrametricality. Unpublished MS, Department of Linguistics, University of Toronto.
Chomsky, N. (1981a). Lectures on government and binding. Dordrecht: Foris.
Chomsky, N. (1981b). Principles and parameters in syntactic theory. In N. Hornstein & D. Lightfoot (Eds.), Explanation in linguistics (pp. 32-75). London: Longman.
Chomsky, N., & Halle, M. (1968). The sound pattern of English. New York: Harper & Row.
Davis, S. (1988). Syllable onsets as a factor in stress rules. Phonology, 5, 1-19.
Dresher, B.E., & Hornstein, N. (1976). On some supposed contributions of artificial intelligence to the scientific study of language. Cognition, 4, 321-398.
Dresher, B.E., & Hornstein, N. (1977). Reply to Winograd. Cognition, 5, 357-392.
Dresher, B.E., & Lahiri, A. (1986). Metrical coherence in Germanic. Unpublished MS, Department of Linguistics, University of Toronto, and the Max-Planck-Institut fur Psycholinguistik, Nijmegen.
Giegerich, H.J. (1985). Metrical phonology and phonological structure: German and English. Cambridge: Cambridge University Press.
Halle, M., & Keyser, S.J. (1971). English stress: Its form, its growth, and its role in verse. New York: Harper & Row.
Halle, M., & Vergnaud, J.-R. (1987a). Stress and the cycle. Linguistic Inquiry, 18, 45-84.
Halle, M., & Vergnaud, J.-R. (1987b). An essay on stress. Cambridge, MA: MIT Press.
Hammond, M. (1984). Constraining metrical theory: A modular theory of rhythm and destressing. UCLA Doctoral dissertation, distributed by IULC, Bloomington, IN.
Hayes, B.P. (1981). A metrical theory of stress rules. MIT Doctoral dissertation, distributed by IULC, Bloomington, IN.
Hayes, B.P. (1984). The phonology of rhythm in English. Linguistic Inquiry, 15, 33-74.
Hayes, B.P. (1985). Iambic and trochaic rhythm in stress rules. In Proceedings of the Berkeley Linguistics Society 11, Berkeley, CA.
Hayes, B.P. (1987). A revised parametric metrical theory. In J. McDonough & B. Plunkett (Eds.), Proceedings of NELS 17, Vol. 1 (pp. 274-289). Amherst, MA: GLSA, University of Massachusetts.
Hulst, H. van der, & Smith, N. (1982). An overview of autosegmental and metrical phonology. In H. van der Hulst & N. Smith (Eds.), The structure of phonological representations, Part 1. Dordrecht: Foris.
Hyman, L.M. (1977). Studies in stress and accent. Southern California Occasional Papers in Linguistics 4. Los Angeles: University of Southern California.
Ingram, D. (1974). Phonological rules in young children. Journal of Child Language, 1, 49-64.
Kaisse, E.M., & Shaw, P.A. (1985). On the theory of lexical phonology. Phonology Yearbook, 2, 1-30.
Karmiloff-Smith, A. (1986). From meta-processes to conscious access: Evidence from children's metalinguistic and repair data. Cognition, 23, 95-147.
Klein, H.B. (1984). Learning to stress: A case study. Journal of Child Language, 11, 375-390.
Lahiri, A., & Hulst, H. van der (1988). On foot typology. In J. Blevins & J. Carter (Eds.), Proceedings of NELS 18, Vol. 2 (pp. 286-299). Amherst, MA: GLSA, University of Massachusetts.
Liberman, M. (1975). The intonational system of English. Doctoral dissertation, MIT, Cambridge, MA.
Liberman, M., & Prince, A. (1977). On stress and linguistic rhythm. Linguistic Inquiry, 8, 249-336.
Lightfoot, D. (1982). The language lottery: Toward a biology of grammars. Cambridge, MA: MIT Press.
Macken, M.A. (1979). Developmental reorganization of phonology: A hierarchy of basic units of acquisition. Lingua, 49, 11-49.
Mackie, A. (1981). A computer simulation of metrical stress theory. Master's dissertation, Department of Linguistics, Brown University, Providence.
MacWhinney, B. (Ed.) (1987). Mechanisms of language acquisition. Hillsdale, NJ: Erlbaum.
Marcus, M.P. (1980). A theory of syntactic recognition for natural language. Cambridge, MA: MIT Press.
Menn, L. (1979). Pattern, control, and contrast in beginning speech: A case study in the development of word form and word function. Doctoral thesis, University of Illinois, distributed by the Indiana University Linguistics Club, Bloomington, IN.
Piattelli-Palmarini, M. (Ed.) (1980). Language and learning: The debate between Jean Piaget and Noam Chomsky. Cambridge, MA: Harvard University Press.
Prince, A.S. (1983). Relating to the grid. Linguistic Inquiry, 14, 19-100.
Selkirk, E.O. (1984). Phonology and syntax: The relation between sound and structure. Cambridge, MA: MIT Press.
Wexler, K., & Hamburger, H. (1973). On the insufficiency of surface data for the learning of transformational languages. In K.J. Hintikka, J.M.E. Moravcsik, & P. Suppes (Eds.), Approaches to natural language. Dordrecht: Reidel.
Wexler, K., & Manzini, M.R. (1987). Parameters and learnability in binding theory. In T. Roeper & E. Williams (Eds.), Parameters and linguistic theory. Dordrecht: Reidel.
Winograd, T. (1977). On some contested suppositions of generative linguistics about the scientific study of language. Cognition, 5, 151-179.
Woods, W. (1970). Transition network grammars for natural language analysis. Communications of the Association for Computing Machinery, 13(10).
