Lexical Encoding of Verbs in English and Bulgarian Rositsa Dekova



Department of Modern Languages, NTNU Trondheim, Norway, NO-4791
rositsa.dekova@hf.ntnu.no

Abstract

This paper focuses on the information that can be encoded in verbs as lexical entries, and on its formal representation in both English and Bulgarian. For this purpose, an existing but not very widespread framework is used: the Sign Model (Dimitrova-Vulchanova, 1996/99; Hellan and Dimitrova-Vulchanova 2000), which describes words as meaningful cells, including morpho-syntactic information. Based on corpus data and results from continuation tests, my research is an attempt to find a unified format for representing lexical entries not only within a single language, but also across languages.

1 Introduction

The knowledge that native speakers demonstrate suggests that something beyond word-specific idiosyncratic properties needs to be accounted for in the lexical representation of words. Information about the syntactic environment in which a particular word can appear should also be part of the lexical encoding, a finding that is particularly relevant for verbs. The assumption that only some participant information is encoded lexically is widespread across a number of different linguistic theories. Traditionally, participants in a situation denoted by a particular verb are divided into two main groups: arguments and adjuncts (or complements and modifiers). A verb selects a set of arguments. The adjuncts, in contrast, are neither required by nor dependent on the particular verb. They can co-occur with many other verbs. Furthermore, syntactic realization does not always overlap with semantic obligatoriness. Therefore, I will refer to the set of entities included in a situation as semantic participants (following the terms in Koenig et al. 2002), where lexically encoded semantic participants are called arguments, and non-lexically encoded semantic participants are called adjuncts. The terms complement and modifier are used respectively for their syntactic correlates.

2 Theoretical Background

Although it is widely accepted that the syntactic structure of many sentences is determined mostly or entirely by the participant information included in the lexical entries of verbs, there are no reliable syntactic criteria that can be used to delimit the set of items that can express lexically encoded participant information. In other words, there is no established set of necessary and sufficient criteria that can serve as a clear-cut basis for the distinction between information that is lexically encoded and information that is not (that is, the distinction between arguments and adjuncts). One very good solution, however, has been suggested in the paper Class Specificity and the Lexical Encoding of Participant Information (Koenig et al. 2002). The authors propose two criteria, semantic obligatoriness and verb class specificity, which jointly determine the argument status of participant information. They define lexically encoded information as that information which is accessed immediately upon recognition of a word. This is because only lexically encoded participant information is expected to play a role in the immediate representation that readers form for sentences. This information is said to be obligatory, that is, it is entailed to hold of the class of situations denoted by a word (ibid., p. 226), and it is also relatively specific to the corresponding verbs. Those two properties can be directly observed by language users, and therefore, according to Koenig et al., they can serve as criteria providing a basis for learning the distinction between arguments and adjuncts.
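The joint use of the two criteria can be read as a simple decision rule. The following is a minimal sketch, not a procedure given by Koenig et al.: the class name, the field names, and the hand-assigned judgments are all hypothetical. The instrument entries follow the cut/drink contrast discussed in this section, and the "time" entry is an illustrative assumption of a role that is obligatory but not specific to any verb class.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class ParticipantRole:
    """One participant role of a verb, with both judgments assigned by hand."""
    verb: str
    role: str
    semantically_obligatory: bool  # entailed by every situation the verb denotes
    class_specific: bool           # restricted to a relatively small class of verbs


def is_argument(p: ParticipantRole) -> bool:
    """A role counts as lexically encoded only if both criteria hold jointly."""
    return p.semantically_obligatory and p.class_specific


def classify(roles):
    """Label each (verb, role) pair as 'argument' or 'adjunct'."""
    return {
        (p.verb, p.role): "argument" if is_argument(p) else "adjunct"
        for p in roles
    }


# Hypothetical annotations, chosen for illustration only.
ROLES = [
    ParticipantRole("cut", "instrument", True, True),     # cutting always involves an instrument
    ParticipantRole("drink", "instrument", False, True),  # drinking does not require one
    ParticipantRole("cut", "time", True, False),          # assumed: obligatory, but not class-specific
]

if __name__ == "__main__":
    for key, label in classify(ROLES).items():
        print(key, label)
```

The rule is deliberately conjunctive: failing either criterion is enough for a role to be treated as an adjunct.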

The contrast between the verbs in sentences (1) and (2) illustrates the two criteria:

(1) He cut the paper with the scissors.
(2) She drank the cocktail with a straw.

While cut always describes a situation where an instrument is included, drink only allows an instrument to be included in some types of situations denoted by it. Thus, the instrument phrase with the scissors is both obligatory and specific for the verb cut, but one does not have to use an instrument to drink. Therefore an instrument should be included in the lexical representation of cut, and need not be present in the encoding of drink. Relying heavily on these two criteria, namely semantic obligatoriness and verb class specificity, I have been able to isolate the participant information that should be encoded in the lexical representation of a selected set of verbs.

3 My research

I have selected basic verb types in English and Bulgarian and examined their semantic properties with an account of their syntactic distribution. Special attention was paid to approximately 20 verbs, subgroups of what are called Verbs of Contact by Impact (as defined in Levin, 1993), along with verbs that include motion (in Levin's classification, those fall in the group of Throw Verbs). In order to determine the possible morphosyntactic environment of the verbs selected, I have partially analyzed the type of syntactic behaviour they exhibit in the available corpora. I have also tested the results of the analyses against native speaker judgments in two similar continuation tests for both languages.

3.1 Corpora used in the research

I have used the Brown and LOB corpora for English; for Bulgarian I have used a corpus that is still under construction in the Laboratory of Computer Modelling of Bulgarian Language, at the Bulgarian Academy of Sciences, where I did part of my field research. The corpus research was aimed at revealing the possible and/or preferred syntactic environment of the relevant verbs, as well as at an analysis of the most common semantic participants in relation to their syntactic distribution. For the purpose of this paper, I will only show a few illustrative examples for English (LOB corpus only) and Bulgarian. In addition, as the Bulgarian corpus is very large but does not yet allow for specific searches, only the first 100 occurrences have been analysed in detail for the more frequent verbs.

3.2 Results and examples

The results from the corpus research are summarised in Table 1 and Table 2, for English and Bulgarian, respectively. As expected, the verbs examined showed a strong tendency to occur in a syntactic environment that consisted of elements that are overt expressions of the semantic participants linked to the particular verb. Thus I could observe the relations between a semantic participant of a verb and the possible syntactic positions it can occupy when this verb is the main verb in the sentence. Special attention is paid to some of the information obtained from the corpus analyses, and its significance for the lexical representation of the verbs. The focus of the paper is on the results for Bulgarian, because it is not so well studied, and therefore perhaps more interesting to discuss. One peculiarity can be clearly seen in three of the Bulgarian verbs analyzed: proboda/stab, pljasna/slap, and potupam/tap. Aside from the appearance of the traditionally accepted complement (the direct object), which here is referred to as Limit (the participant that is affected or changed), we can see that there are also many phrases that are identified as BodyPart/Possessor (Levin's term, ibid.: 71), as illustrated by the following examples:

(3) i go probode v sartseto.
    and stabbed him in his heart.
(4) Tja pljasna Gabi po koljanoto.
    She slapped Gabi on her knee.
(5) i go potupah po ramoto.
    and tapped him on his shoulder.

The high frequency of those phrases with particular verbs suggests that the information conveyed by them is an important part of the lexical representation of the verbs. This, however, does not directly mean that we should include them as separate participants in the situation described. On the contrary, the Possessor Raising phrase specifies the Limit, or more precisely, the place of its contact with what is called the Launch-part, but it does not constitute a separate participant. Therefore, we should distinguish between different types of information, all of which are important for the lexical encoding of a particular verb, but which should not be treated in the same way. Another interesting example is the presence of path in the verb pljasna/slap:

(6) ...opashkata mu, ..., pljasna vav vodata.
    his tail, ..., slapped in the water.
(7) ...tja pljasna dolu...
    she slapped down (She fell down)
(8) ...i pljasna dolu varhu trevata.
    and slapped down onto the grass. (She fell down on the grass)

Whereas in sentence (6) the prepositional phrase in the water cannot be initially identified as path, the prepositional phrases in sentences (7) and (8) show that in the water should also be regarded as an overt expression of the end-of-path information that is lexically encoded in the verb. A similar behaviour is observed for other verbs of motion (see for example Dimitrova-Vulchanova, 2004).

4 The continuation tests

To determine what kind of participant information should be included in the lexical encoding of this particular set of verbs, I have used not only data from English and Bulgarian corpora, as described earlier, but also the results from two similar continuation tests conducted for both English and Bulgarian. These tests were developed to test native speakers' intuitions about the most prominent participants in a situation denoted by the target verbs.

4.1 Methodology of the tests

The tests were organized as follows: there were 50 to 60 sentences, containing as many as 20 target verbs, together with approximately the same number of sentences containing distracter verbs, evenly distributed among the target sentences. The first 30-40 sentences consisted only of a subject and a verb, while the last sentences also contained a direct object. All the participants in the tests were asked to complete the sentences without spending too much time on any of the items. I encouraged the participants to write down each continuation quickly, so that it would be the first thing that came to mind (additional literature on the methodology of similar tests can be found in Koenig et al. 2002, 2003). The main idea behind the continuation tests was to confirm the hypothesis that, if implicit participant information is lexically encoded, then it will play an important role in the immediate representation that readers form for sentences, and is therefore more likely to be used to continue a sentence. Thus I expected to receive a significantly higher percentage of continuations related to semantic participant information than of responses that do not include lexically encoded participant information.

4.2 Results and analysis
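The comparison that the continuation tests rely on is a tally: for each target verb, the share of continuations expressing a participant assumed to be lexically encoded is set against the share of all other responses. The following is a minimal sketch of such a tally, assuming the responses have already been labelled by hand. The function names and the toy labels are hypothetical and are not the study's data, and the set of categories counted as encoded has to be supplied per verb.

```python
from collections import Counter


def category_percentages(labels):
    """Return the percentage of responses falling into each category."""
    if not labels:
        raise ValueError("no responses to tally")
    counts = Counter(labels)
    total = len(labels)
    return {category: round(100 * n / total, 1) for category, n in counts.items()}


def encoded_share(labels, encoded_categories):
    """Percentage of responses whose category is assumed to be lexically encoded."""
    if not labels:
        raise ValueError("no responses to tally")
    hits = sum(1 for label in labels if label in encoded_categories)
    return round(100 * hits / len(labels), 1)


# Toy data: ten hand-labelled continuations for one hypothetical target verb.
toy_labels = ["Limit"] * 9 + ["Manner"]

if __name__ == "__main__":
    print(category_percentages(toy_labels))
    print(encoded_share(toy_labels, {"Limit"}))
```

Keeping the encoded categories as an explicit argument makes the assumption visible: the same set of responses yields a different share if, for example, an instrument is or is not treated as encoded for the verb in question.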

Some of the results for Bulgarian (in per cent) can be seen in Table 3, in the Appendix. The tests for English have not yet been fully completed, but the analyses so far are consistent with the results from the Bulgarian tests. There are higher percentages of continuations related to information about semantic participants: approximately 90% of the continuations for tap, stab, and cut can be defined as Limit. Since the answers differ widely from each other (e.g. continuations for cut included: the bread, John, her finger, her hair, her knee), it cannot be assumed that the results are merely due to the existence of a certain stereotype or a phraseological unit containing the target verb. The continuations provided for the sentences also confirmed the corpus analyses, as can be seen in the tables in the Appendix. In addition, the continuations for the second half of the sentences in the tests, which were virtually complete (as described earlier, the sentences had a subject, a verb, and a direct object), contained a high proportion of fillers that were consistent with the information assumed to be semantically encoded. Instrument/Body extension constituted 90% of the fillers for stab, 27% for cut, and 37% for tap.

5 The formal description

So far we have seen that the kind of participant information that can be lexically encoded is semantically obligatory and is restricted to a verb, or a verb class, in terms of selection. Based on the data from the corpus research, together with the results from the continuation tests, I have tried to find a suitable formalized lexical representation for the set of verbs selected, that is, a representation that will account for the considerable complexity and subtlety of their meaning. Following a proposal made by Hellan and Dimitrova-Vulchanova (2000), I assume that there is a set of lexical semantic factors that serves as the basis for predictions about the possible morpho-syntactic environment of a verb. One of the potential members of that set is called criteriality. In order to define criteriality, I must first briefly describe the structural unit constituting the meaning of a verb, called a cell (Dimitrova-Vulchanova 1996/99). A cell consists of two parts: an aspectual part and a dimensional part. The aspectual part specifies the following factors:

a. Situational vs. Non-Situational: reflecting whether what is expressed by the verb is situated in time or not.
b. Dynamic vs. Stative: relevant only for Situational verbs and reflecting whether some kind of change or Force emission is involved or not.
c. Monodevelopmental vs. Non-Monodevelopmental: depending on whether the dimensional part includes Monodevelopment or not.
d. Protracted vs. Non-Protracted: a contrast that is close to the traditional distinction durational vs. non-durational.

The dimensional part consists of a number of dimensions. Each of them reflects a different aspect of the involvement of one and the same participant in the situation denoted by the verb. It is a new decompositional approach to the traditional Theta-roles (Dowty, 1991). The importance has been shifted to the number of the participants, as well as to the differentiation of the sub-events constituting the main event. Each of those sub-events should be separately described in detail. All of them describe the situation as a whole. The dimensions may consist of one or more values. The dimension of Force, then, may incorporate the values of Source (the participant performing the action), Launch-Part (the part of the participant, if any, performing the action), and Limit (the item upon which the force has been performed). The Control dimension (for action that is under the control of a participant) will incorporate the values of the Controller, the Means, and the Target. The dimension of Monodevelopment (short for monotonic development) includes the value of a Monodeveloper (the one performing the monotonic development) together with information about the possible respects in which the development can take place, with Integrity, Location (mainly regarding path), and Quality being among the main cases. For many verbs a further dimension of Conditioning is possible in close relation with Monodevelopment. Conditioning applies when, in a given context, a given event or actor, called the Conditioner, is sufficient to release a certain event, called the Conditioned. For example, in John broke the window, John is the Conditioner for the event of breaking the window. In contrast, in The window broke, no Conditioner is identified and no Conditioning obtains in this usage of break.

A participant in a situation is thus defined by the set of values characterising co-indexed elements in the different dimensions. Furthermore, the meaning of a verb (the Cell) is identified with the conditions that have to be met by the participants in a situation so that it can count as being expressed by this particular verb. The notion of Criteriality, then, applies to the items of a cell that have properties by which the situation is easily identified as belonging to a certain type. The following items (in Hellan and Dimitrova-Vulchanova 2000) are defined as criterial:

1. An item with the value Monodeveloper
2. A Source whose Launch-part (a) behaves monotonically, or (b) is specified for inherent properties
3. A Limit with sustained contact
4. An item characterized for Posture
5. A Source for an iterative activity with a cumulative Target

With this theoretical approach as the basis of my research, I have attempted to find a unified format for representing lexical entries not only within a single language, but also across languages. Thus a more formal (and probably more accurate) comparison of verbs (their meaning, as well as their syntactic behaviour) can be achieved, as it becomes possible to compare verbs that encode more or less information than their correlates. A more in-depth analysis of the basic cell of the verb tap/potupam illustrates this approach:

Cell of tap/potupam
Global specification: +Protracted
Constituency of Development: Recursion based
Mode of recursion: iterative
Recursive unit: Cell n
Aspectual specification: +2-point
Element specification:
    Conditioning: Conditioner1
    Constituency: Fingers2
    Force: Source1, Launch-part2, Limit3
    Monodevelopment: Monodeveloper2
Conditioned2 Cell:
    Aspectual specification: +2-point
    Element specification:
        Monodevelopmental Element: 2
        Phasing: +2-point
        Medium: Location
        Line of Trajectory End: Contact with 3
        Limit3

Thus sentences (9) and (10) can be evaluated as described below:

(9) John tapped his fingers on the desk.
(10) His fingers tapped on the desk.

In the context of (9), John will be defined as the set of values (Conditioner1, Source1), his fingers as (Fingers2, Launch-part2, Mover2), and the desk as (Absorber3, Limit3). However, in the case of (10), the dimension of Conditioning will not be present. This representational format makes this not only possible but also easily predictable. The basic cell contains two items that can be counted as criterial: one of them is John, according to 2(a) (a Source whose Launch-part behaves monotonically), and the other one is his fingers, according to 1 (an item with the value Monodeveloper). The alternation described in Levin (1993) as Causative/Inchoative was incorrectly predicted by Levin to be impossible with the verb tap. According to Levin's criteria, verbs undergoing the Causative/Inchoative Alternation can be characterized as verbs of Change of State or Change of Location. The verb tap is a member of the Hit Verbs, a sub-group of Verbs of Contact by Impact, and of two other verb groups, Throw Verbs and Investigate Verbs, and was not predicted to allow this alternation. Levin may have come to this incorrect conclusion by overlooking the individual participants and the single sub-events in the situation, instead regarding the situation as a whole. As we have already seen, there is, in fact, change of location, but with regard to the Launch-part only.
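The idea that a participant is the set of values sharing one index across dimensions can be made concrete with a small data structure. The following is a minimal sketch, not part of the Sign Model itself: the class and function names are hypothetical, the cell of tap/potupam is entered by hand with only the values listed in its element specification, and the test for criterion 2(a) rests on a simplifying assumption (a Launch-part is taken to behave monotonically when it is co-indexed with the Monodeveloper).

```python
from dataclasses import dataclass, field


@dataclass
class Cell:
    """A verb cell: each dimension maps value names to participant indices."""
    verb: str
    dimensions: dict = field(default_factory=dict)

    def participant(self, index):
        """Return every value co-indexed with `index`, across all dimensions."""
        return {
            value
            for values in self.dimensions.values()
            for value, i in values.items()
            if i == index
        }

    def without(self, dimension):
        """Return a copy of the cell with one dimension left out."""
        kept = {d: dict(v) for d, v in self.dimensions.items() if d != dimension}
        return Cell(self.verb, kept)


def criterial_items(cell):
    """Map participant index to criterion label, for criteria 1 and 2(a) only.

    Assumption of this sketch: a Launch-part behaves monotonically when it
    is co-indexed with the Monodeveloper.
    """
    result = {}
    mono = cell.dimensions.get("Monodevelopment", {}).get("Monodeveloper")
    force = cell.dimensions.get("Force", {})
    if mono is not None:
        result[mono] = "1"
        if force.get("Launch-part") == mono and "Source" in force:
            result.setdefault(force["Source"], "2(a)")
    return result


# Hand-entered cell for sentence (9), John tapped his fingers on the desk.
TAP = Cell("tap/potupam", {
    "Conditioning": {"Conditioner": 1},
    "Constituency": {"Fingers": 2},
    "Force": {"Source": 1, "Launch-part": 2, "Limit": 3},
    "Monodevelopment": {"Monodeveloper": 2},
})

# Usage as in sentence (10): the Conditioning dimension is simply absent.
TAP_INCHOATIVE = TAP.without("Conditioning")

if __name__ == "__main__":
    print(TAP.participant(1))   # values defining John
    print(TAP.participant(2))   # values defining his fingers
    print(criterial_items(TAP))
```

Because the dimensions are stored separately, dropping Conditioning leaves the values of the Launch-part untouched, which mirrors the point made about (10): the alternation affects one dimension rather than the situation as a whole.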

Conclusion

I have tried to show that breaking up information for encoding into relevant semantic features, and using a suitable formal representation, are crucial in finding a unified format for representing lexical entries, not only within a single language, but also across languages. Thus a more formal (and probably more accurate) comparison of verbs (their meaning and their syntactic behaviour) can be achieved, because this approach makes it possible to compare verbs that encode more or less information than their correlates. It would also be very interesting to investigate whether the representational format presented in this paper can be integrated within some of the well-known lexical theories, such as Pustejovsky's Generative Lexicon (Pustejovsky, 1995). Thus an investigation of the possible optimal solutions will be pursued to describe those verbs that do not have semantic equivalents in another language. This will lead my research to a new stage: the creation of a VerbNet, as a Distributed Lexical Database, containing a network of verb classes with their semantic features.

Acknowledgments

I would like to thank my colleagues and friends at the Department of Modern Languages, NTNU, Trondheim, who supported me in my research, as well as my colleagues at the Laboratory of Computer Modelling of Bulgarian Language, at the Bulgarian Academy of Sciences, Sofia, where I collected my data for Bulgarian and who accepted me as part of their research team.

References

M. Dimitrova-Vulchanova, 1996/99. Verb Semantics, Diathesis and Aspect. Doctoral dissertation, NTNU (University of Trondheim)/LINCOM, Newcastle/München.
M. Dimitrova-Vulchanova, 2004. Paths in Verbs of Motion. Presented at the Argument Structure CASTLE Conference, November 4-6, 2004, Tromsø University.
D. Dowty, 1991. Thematic proto-roles and argument selection. Language, 67(3): 547-619.
L. Hellan and M. Dimitrova-Vulchanova, 2000. Criteriality and Grammatical Realization. Lexical Specification and insertion. CLIT series, John Benjamins.
J. P. Koenig, G. Mauner, and B. Bienvenue, 2002. Class Specificity and the Lexical Encoding of Participant Information. Brain and Language, 81: 224-235.
J. P. Koenig, G. Mauner, and B. Bienvenue, 2003. Arguments for Adjuncts. Cognition, 89: 67-103.
B. Levin, 1993. English Verb Classes and Alternations. Chicago and London: University of Chicago Press.
J. Pustejovsky, 1995. The Generative Lexicon. The MIT Press.

Appendix:
Table 1: The English corpus data
USAGE VERB cut stab slap Trans 69 4 8 Human/ Intrans Lit Fig part 23 43 source 86 43 37 pass 4 limit 1 2 source 2 2 1 pass 1 limit 3 1 pass 2 12 8 source 1 limit 15 source SUBJECT MetaInstr/ or phor body extens 4 limit 7 OBJECT Other Argument Adjunct 12 manner 1 manner 4 manner 3 manner 1 quantity

tap

14

15

6 source 58 limit 29 limit 5 instrument 2 limit 1 1 BPP 9 limit 2 source 3 part. loc. 2 BPP 16 limit 1 4 instrument 1 BPP

Table 2: The Bulgarian corpus data


USAGE VERB rezha (cut) proboda (stab) Trans 71 1 refl. 83 4 refl. Intr 21 7 sepassive Lit 90 Fig 12 SUBJECT Human/ Meta Instr /or part phor body ext. 39source 3 2/ 2 wings OBJECT Other Argument Adjunct 20 mann 8 loc 3 quant 9 mann 4 quant 1 time 1 loc

71 limit 5 source 12 instr 7 limit 2 BPP 84 limit 4 source 24 BPP 23 instr 3 (object) 4 (bird) -

72

15

34source

14

12

pljasna (slap) potupam (tap)

10

31

41

21source 1 (face)

3 (feet/tail)

93 5 refl.

100

76source

19 limit 3 mann 15 instr 3 quant 13 BPP 2 loc 6 path end 95 limit 13 mann 72 BPP 2 loc 7 instr/b.e. 1 time

Table 3: Results from the continuation test for Bulgarian


Sentence 3. Bob potupa Bob tapped 8. Lucy pljasna Lucy slapped 9. Margaret otrjaza Margaret cut 11. Billy probode Billy stabbed 26. Knigata pljasna The book slapped 28. Nozhat rezhe The knife cuts 32. Valnite pljaskaha The waves slapped 34. Lilly otrjaza hljaba Lilly cut the bread 36. Ann probode mesoto Ann stabbed the meat 38. Iva potupvashe po masata Iva tapped on the table 27 7 +2 Instr/ Body ext 10 37 Limit 90 +3 26 +10 93 90 +1 10 43 57 3 70 origin 3 end 10 47 27 47 Path Body-Part/ Possessor +16 +3 13 Loc Temp Manner +3 7 17 shamar slap 7 (s.o.) 3(-) 3 3(-) 7 3(-) 3 10(-) 26 Other

90

10 3 7(-)

37

53

Legend:
Argument: lexically encoded semantic participant
Adjunct: non-lexically encoded semantic participant
Instr/or body ext. (b.e.): instrument or body extension (hand, finger, leg, foot) used as an instrument
Human/part: human or part of a human (face, head, eyes, hair)
Trans: transitive usage of the verb
Intrans: intransitive usage of the verb
Lit: literal usage of the verb
Fig: figurative usage of the verb
Source: the participant performing the action
Limit: the participant that is affected or changed
BPP: body-part/possessor
Loc: (event) location
Part. loc: participant location
Pass: passive
Se-pass: se-passive (a certain type of passive in Bulgarian)
Refl: reflexive
Quant: quantity
Mann: manner
Temp: temporal (time)
(-): no continuation was provided
(s.o.): someone
+: refers to continuations provided in addition to the first one
