Learning Rules and Categorization Networks For Language Standardization
Figure 1: Number of words, rules and (initial) transformation classes for the various person-x data sets.

Figure 2: Cross-validated rule accuracy for the person-x and adjective-x data sets.
5 Experimental Setup and Results

5.1 Data

The dataset that we use is the list of foreign place names and their corresponding personal names from the AWS (Taalkommissie, 2009). For brevity, we only report on suffixation and back-formation, and exclude cases with variant morphemes, zero derivation and clipping, as well as all cases of paraphrasing. 732 instances are retained (from the original dataset of 1,034 instances).

A supplementary dataset consisting of adjectival derivations of place names was also taken from the AWS and treated in the same manner as the personal names; this dataset is used in Section 6.3 to verify certain of the findings. This set contains 786 instances.

5.2 Development of Predictor

The full dataset is highly irregular, containing many transformation classes that occur only once. We are interested in these irregularities (in order to identify words that may need further review), as well as in the more systematic phenomena that occur in the data. We therefore create different data sets; in each set (referred to as person-x) we only retain those instances whose transformation class occurs x or more times. (The person-1 set therefore contains all training data, including all exceptions, while the person-6 set only contains transformations that occur six or more times.) In Figure 1, the number of words and the number of unique transformation classes are displayed for each person-x data set.

In order to verify the accuracy of our extracted rules, we use 10-fold cross-validation to obtain a mean accuracy per data set, as depicted in Figure 2 (labeled "person"). We also generate a rule set from the training and test data combined; this larger set is used to extract categorization networks.

When the rule set is structured as a graph (called a rule network), the data can be interpreted as follows: the root node indicates the default transformation, which applies unless a child node is matched by the base-form; that child rule in turn only applies unless one of its own children matches the base-form (and so forth), in which case a more refined rule is applied. A small part of a rule network is displayed in Figure 3, with each node listing the end-of-word string of the base-form that triggers the rule, the transformation rule that will be applied, and the number of instances of the rule in the training data. The complete rule network is very large: 266 nodes for the person-1 data set, as indicated in Figure 1.

As expected, a large number of exceptional rules are generated, indicating much inconsistency in how derivations are formed. For the person-1 data set, 217 exceptions are identified. For each of these exceptions, alternatives are suggested in order of prototypicality by tracing the rule network, as illustrated for the base-form Smirna in Table 2. Automatically generated tables like these provide a practical tool for language standardization.
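The person-x filtering and the default-and-refinement traversal of a rule network can be sketched as follows. This is a minimal illustration, not the actual predictor: the transformation-class approximation, the toy rules, and the node layout are assumptions for the sake of the example, and the suffix rules shown are invented rather than taken from the AWS data.

```python
from collections import Counter

def person_x(pairs, x):
    """Keep only (base, derived) pairs whose transformation class
    (approximated here as the replaced suffix pair) occurs >= x times."""
    def tclass(base, derived):
        # Strip the longest common prefix; the remainder is the suffix rule.
        i = 0
        while i < min(len(base), len(derived)) and base[i] == derived[i]:
            i += 1
        return (base[i:], derived[i:])
    counts = Counter(tclass(b, d) for b, d in pairs)
    return [(b, d) for b, d in pairs if counts[tclass(b, d)] >= x]

# A toy rule network: each node = (trigger suffix, rule, children).
# The root's trigger is "" (the default transformation); a child node
# applies only when its longer trigger matches the end of the base-form.
ROOT = ("", "+er", [
    ("a", "-a+aan", [
        ("na", "-a+ees", []),
    ]),
    ("land", "-land+lander", []),
])

def trace(node, word, path=None):
    """Return candidate rules from most refined to default for a base-form,
    by descending into the deepest matching child at each level."""
    trigger, rule, children = node
    path = [rule] + (path or [])
    for child in children:
        if word.endswith(child[0]):
            return trace(child, word, path)
    return path
```

Tracing a base-form then yields a ranked list of alternatives: `trace(ROOT, "Smirna")` descends through the `a` and `na` nodes and returns the most refined rule first with the root default last, mirroring how the exception tables order suggestions by prototypicality.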
6 Discussion
6.1 Predictor
Acknowledgments
Van Huyssteen is jointly affiliated with North-West University. Support by NWU is hereby acknowledged.

Part of this research was made possible through a research grant by the South African National Research Foundation (FA207041600015).

We would like to extend our gratitude to Handré Groenewald and Martin Puttkammer for their help.
References

Combrink, J.G.H. 1990. Afrikaanse morfologie [Afrikaans morphology]. Pretoria: Academica.

Davel, M. & Barnard, E. 2004. A default-and-refinement approach to pronunciation prediction. Proceedings of the 15th Annual Symposium of the Pattern Recognition Association of South Africa. Grabouw, November 2004. pp. 119-123.

Davel, M. & Barnard, E. 2008. Pronunciation Prediction with Default & Refine. Computer Speech and Language. 22: 374-393.

Donaldson, B.C. 1993. A Grammar of Afrikaans. Berlin: Mouton de Gruyter.

Goldsmith, J. 2001. Unsupervised Learning of the Morphology of a Natural Language. Computational Linguistics. 27: 153-198.

Kempen, W. 1969. Samestelling, afleiding en woordsoortelike meerfunksionaliteit in Afrikaans [Compounding, derivation and change of part-of-speech category in Afrikaans]. Kaapstad: Nasou.

Langacker, R.W. 2008. Cognitive Grammar: A Basic Introduction. Oxford: Oxford University Press.

Shalonova, K. & Flach, P. 2007. Morphology learning using tree of aligned suffix rules. ICML Workshop: Challenges and Applications of Grammar Induction.

Taalkommissie (comp.). 2009. Afrikaanse Woordelys en Spelreëls [Afrikaans Wordlist and Spelling Rules]. Tenth edition. Kaapstad: Pharos Dictionaries.

Tuggy, D. 2005. Cognitive Approach to Word-Formation. In: Štekauer, P. & Lieber, R. (eds.). Handbook of Word-Formation. Dordrecht: Springer. pp. 233-265.
Van Huyssteen, G.B. 2010. (Re)defining Component Structures in Morphological Constructions: A Cognitive Grammar Perspective. In: Michel, S. & Onysko, A. (eds.). Cognitive Approaches to Word-Formation. Berlin: Mouton de Gruyter. pp. 97-126.

Figure 5: Categorization network for the person-4 data set.