Sequences in Language and Text
Quantitative Linguistics
Editors
Reinhard Köhler
Gabriel Altmann
Peter Grzybek
Advisory Editor
Relja Vulanović
Volume 69
Sequences in
Language and
Text
Edited by
George K. Mikros
Ján Mačutek
DE GRUYTER
MOUTON
ISBN 978-3-11-036273-2
e-ISBN (PDF) 978-3-11-036287-9
e-ISBN (EPUB) 978-3-11-039477-1
ISSN 0179-3616
www.degruyter.com
Foreword
Gustav Herdan says in his book Language as Choice and Chance that “we may
deal with language in the mass, or with language in the line”. Within the frame-
work of quantitative linguistics, the language-in-the-mass approach dominates;
however, the other direction of research has enjoyed increased attention in
the last few years.
This book documents some recent results in this area. It contains an introduction and a collection of thirteen papers by altogether twenty-two authors. The contributions, ordered alphabetically by the surname of the first author, cover
quite a broad spectrum of topics. They can be, at least roughly, divided into two
groups (admittedly, the distinction between the groups is a bit fuzzy).
The first of them consists of theoretically oriented papers. One could distin-
guish three subgroups here – development of text characteristics, either within
a text or in time (Andreev and Borisov, Čech, Köhler and Tuzzi, Rovenchak, Zör-
nig), linguistic motifs (Köhler, Mačutek and Mikros, Milička) and, in the last contribution from this group, Benešová and Čech demonstrate that, with respect to the Menzerath-Altmann law, a randomized text behaves differently from a "normal" one, thus confirming the role of the sequential structure of a text.
Four papers belonging to the other group focus (more or less) on appli-
cations, or at least they are inspired by real-world problems. Bavaud et al. apply
the apparatus of autocorrelation to different text properties. Pawłowski and Eder investigate the development of rhythmic patterns in a particular Old Czech text. A
prediction of sales trends based on the analysis of short texts from Twitter is pre-
sented by Rentoumi et al. Finally, Rama and Borin discuss different measures of
string similarities and their appropriateness for automatic text classification.
We hope that this book will provide an impetus for the further development of linguistic research in both the language-in-the-mass and the language-in-the-line directions. It goes without saying that the two approaches complement rather than exclude each other.
We would like to express our thanks to Gabriel Altmann, whose impulse stood at the beginning of this book, and to Reinhard Köhler for his continuous help and encouragement during the process of editing. Ján Mačutek would also like to acknowledge VEGA grant 2/0038/12, which supported him during the preparation of this volume.
Contents

Gabriel Altmann
Introduction 1
Radek Čech
Text length and the lambda frequency structure of a text 71
Reinhard Köhler
Linguistic Motifs 89
Jiří Milička
Is the Distribution of L-Motifs Inherited from the Word Length Distribution? 133
Andrij Rovenchak
Where Alice Meets Little Prince: Another approach to study language relationships 217
Peter Zörnig
A Probabilistic Model for the Arc Length in Quantitative Linguistics 231
Gabriel Altmann
Introduction

Sequences occur in texts, written or spoken, and not in language, which is considered a system. However, the historical development of some phenomenon in language can also be considered a sequence. In the first case we have a linear sequence of units represented as classes, or of properties represented by numbers resulting from measurements on the ordinal, interval or ratio scales. In the second case we have a linear sequence of years or centuries and some change of states. A textual sequence differs from a mathematical sequence; the latter can
usually be captured by a recurrence function. A textual sequence is rather a
repetition pattern displaying manifold regularities such as distances, lumping, strengthening or weakening towards the end of a sentence, chapter or text, oscillations, cohesion, etc. This holds both for units and their conceptually
constructed and measured properties, as well as for combinations of these
properties which are abstractions of higher degree.
The sequential study of texts may begin with the scrutinizing of repetitions.
Altmann (1988: 4f.) distinguished several types of repetition, some of which can be
studied sequentially: (a) Runs representing uninterrupted sequences of ident-
ical elements, especially formal elements such as word-length, metric patterns,
sentence types. (b) Aggregative repetitions arising from Skinner’s formal rein-
forcement which can be expressed by the distribution of distances or by a de-
creasing function of the number of identities. (c) The relaxing of strict identity
may yield aggregative repetitions of similar entities. (d) Cyclic repetitions occur-
ring especially in irregular rhythm, in prose rhythm, in sentence structure, etc.
Various other forms of sequences are known from text analyses. They will be studied thoroughly in the next chapters.
If we study the sequences of properties of text entities, we have an enormous field of research in view. An entity has as many properties as we are
able to discriminate at the given state of science. The number of properties in-
creases with time, as can be seen in 150 years of linguistic research. The proper-
ties are not “possessed” by the entities, they do not belong to the fictive essence
of entities; they are results of definitions and measurements ascribed to the
entities by us. In practice, we can freely devise new properties, but not all of them need turn out to be useful. A property is useful (reasonable) if it enables us, e.g., (a) …, (b) to state the mean of the property and to compare texts, setting up text classes, (c) to find tendencies depending (usually) on other properties, (d) to search for idiosyncrasies, (e) to set up an empirical control cycle joining several properties, and (f) to use the sequences in practice, e.g. in psychology and psychiatry.
Though, according to Bunge's dictum, "no laws, no theory" and "no theory, no explanation", we may try to find at least some "causes" or "supporting circumstances" or "motives" or "mechanisms" leading to the rise of sequences.
They are: (a) Restrictions of the inventories of units. The greater the restrictions
(= the smaller the inventory), the more frequently a unit must be repeated. A
certain sound has a stronger repetitiveness than a certain sentence because the
inventory of sentences is infinite. (b) The grammar of the given language, which does not allow many repetitions of the same word or word class in the immediate neighbourhood. (c) Thematic restriction, which forces the repetition of some words, terms etc., but hinders the use of some constructions. (d) Stylistic and aesthetic grounds, which may evoke some regularities like rhyme and rhythm while avoiding the repetition of words. (e) Perseveration, which reinforces repetitions, supports self-stimulation, etc.; this phenomenon is well known in psychology and psychiatry. (f) The flow of information: in didactic works it is ideal if words are
repeated in order to concentrate the text around the main theme. In press texts
the information flow is more rapid.
It is not always possible to find the ground for the rise of a given sequence. Speaking is a free activity; writing is full of pauses for thinking, corrections, additions or omissions – it is not always a spontaneous creation. Hence the
study of texts is a very complex task and the discovery of sequences is merely
one of the steps towards theory formation.
Sequences are secondary units and their establishment is a step in concept
formation. As soon as they are defined, one strives for the next definitions, viz.
the quantification of their properties and behaviour. By means of analogy, ab-
duction, intuition, or inductive generalization one achieves a state in which it is
possible to set up hypotheses. They may concern any aspect of sequences in-
cluding form, repetition, distances, perseveration, dependences, history, role,
etc. Of course, hypotheses are a necessary but not a sufficient component of a
theory. They may (and should) be set up at every stage of investigation but the
work attains its goal if we are able to set up a theory from which the given hy-
potheses may be derived deductively. Usually, this theory has a mathematical form and one must use the means of mathematics. The longest way is that of testing the given hypotheses. One sole text does not suffice, and one sole language is merely a good beginning. Often, a corroboration in an Indo-European language seduces us into considering the hypothesis sufficiently corroborated – but this is merely the first step. Our recommendation is: do not begin with English.
Ask a linguist who knows a different language (usually, linguists do) for data, and test your hypothesis at the same time on a different text sort. If the hypothesis can be sufficiently corroborated, try to find some links to other properties. Set
up a control cycle and expand it stepwise. The best example is Köhler's linguistic synergetics, containing units, properties, dependencies, forces and hypotheses. The image of such a system becomes more complex with every added entity, but this is the only way to construct theories in linguistics. They furnish us with laws, which are derived and corroborated hypotheses and thereby at the same time explanations, because phenomena can be subsumed under a set of laws.
There are no overall theories of language, there are only restricted ones.
They expand in the course of time, but none of them embraces language as a whole – just as in physics.
In the sequel we shall list some types of sequences in text.
(A) Symbolic sequences. If one orders the text entities into nominal classes – which can also be dichotomous – one obtains a sequence of symbols. Such classes are e.g. parts of speech; reduction to noun (N) and rest (R), in order to study the nominal style; to adjectives (A), verbs (V) and rest (R), in order to study the ornamentality vs. activity of the text; or to accentuated and non-accentuated syllables, in order to study rhythm, etc. Symbolic sequences can be studied as
runs, as a sequence of distances between equal symbols, as a devil’s staircase,
as thematic chains, etc.
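These first analyses are easy to operationalise. As an informal sketch (in Python, with a text already reduced to a sequence of class symbols; all names are ours), runs and distances between equal symbols can be computed as follows:

```python
from itertools import groupby

def runs(symbols):
    """Maximal uninterrupted repetitions of identical symbols,
    returned as (symbol, run length) pairs."""
    return [(s, sum(1 for _ in g)) for s, g in groupby(symbols)]

def distances(symbols, target):
    """Distances between successive occurrences of a given symbol."""
    positions = [i for i, s in enumerate(symbols) if s == target]
    return [b - a for a, b in zip(positions, positions[1:])]

# A text reduced to nouns (N) and rest (R), as in the nominal-style example:
text = list("NRRNNRNRRRN")
print(runs(text))            # [('N', 1), ('R', 2), ('N', 2), ('R', 1), ('N', 1), ('R', 3), ('N', 1)]
print(distances(text, "N"))  # [3, 1, 2, 4]
```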
(B) Numerical sequences may have different forms: there can be oscillation
in the values, there are distances between neighbours, one can compute the arc
length of the whole sequence, one can consider maxima and minima in the
parts of the sequence and compute Hurst’s coefficient; one can consider the
sequence as a fractal and compute its different properties; numerical sequences can be subdivided into monotone sub-sequences which have their own properties (e.g. length), probability distributions, etc. Numerical sequences may also represent a time series displaying autocorrelation or "seasonal oscillation"; they may represent a Fourier series, damped oscillation, climax, etc.
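Two of the numerical measures just listed can be sketched in a few lines; the polygonal-line definition of the arc length used here is one common operationalisation, not necessarily the one presupposed by the text:

```python
import math

def neighbour_distances(x):
    """Absolute differences between neighbouring values of the sequence."""
    return [abs(b - a) for a, b in zip(x, x[1:])]

def arc_length(x):
    """Length of the polygonal line through the points (i, x_i)."""
    return sum(math.sqrt(1 + (b - a) ** 2) for a, b in zip(x, x[1:]))

x = [5, 3, 4, 1, 1, 2]            # e.g. a sequence of sentence lengths
print(neighbour_distances(x))     # [2, 1, 3, 0, 1]
print(round(arc_length(x), 3))    # 9.227
```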
(C) Musical sequences are characteristic of styles and development stages, as has been shown already by Fucks (1968). Besides motifs defined formally and studied by Boroda (1973, 1982, 1988), there is the possibility of sequentially studying the complete full score, obtaining multidimensional sequences. In music we have, as a matter of fact, symbolic sequences, but the symbols have ordinal values, thus different types of computations can be performed.
Spoken text is a multidimensional sequence: there are accents, intonation and tones which are not marked in written texts. Even a street can be considered a simple sequence if we restrict our attention to one sole property. Observations are always simplified in order to make our conceptual operations possible and lucid. Nothing can be captured in its wholeness because we do not even know what it is.
We conjecture that textual sequences abide by laws, but there is a great number of boundary conditions which hinder theory construction. Most probably we shall never discover all boundary conditions, but we can help ourselves by introducing uninterpreted parameters which may, but need not, change with the change of the independent variable. In many laws derivable from the "unified theory" (cf. Wimmer, Altmann 2005) there is usually a constant representing the state of the language.
At the beginning of research, we set up elementary hypotheses and improve and specify them step by step. Sometimes we simply conjecture that there may be some trend, or some relation between two variables, and try
to find an empirical formula. In order to remove the "may be" formulation, we stand at the beginning of a long road which begins with quantification and continues with measurement, conjecturing, testing, systematization and explanation. Even a single hypothesis may keep many teams of researchers occupied, because a well-corroborated hypothesis means "corroborated in all languages".
And since there is no linguist having at his disposal all languages, every linguis-
tic hypothesis is a challenge for many generations of researchers. Consider, for example, Zipf's famous law, represented by a power function used in many scientific disciplines (http://www.nslij-genetics.org/wli/zipf). Originally it was only a conjecture based on visual inspection, made by Zipf without any theoretical foundation. Later on, one began to search for modifications of the function, found a number of exceptions and fundamental deviations from the empirical data depending on the prevailing "type" realized in the language, changed the theoretical foundations, etc. Hence, no hypothesis and none of its mathematical expressions
(models) is final. Nevertheless, the given state of the art helps us to orient ourselves, and testing the hypothesis on new data helps us to accept or to modify it.
Sometimes a deviation merely means that boundary conditions must be taken into account. As an example, consider the distribution of word length in languages: it has been tested in about 50 languages using ca. 3,000 texts (cf. http://www.gwdg.de/~kbest/litlist.htm), but in Slavic languages we obtain different results depending on whether we consider the non-syllabic prepositions as prefixes/clitics or as separate words. The same holds for Japanese postpositions, etc. Thus already
the first qualitative interpretation of facts may lead to different data and conse-
quently to different models.
In spite of the labile and ever-changing (historically, geographically, idiolectally, sociolectally, etc.) linguistic facts from which we construct our data, we cannot stay forever at the level of rule descriptions but must try to establish laws and systems of laws, that is, theories. But even theories may concern different aspects of reality. There is no overall theory. While Köhler (1986, 2005) in his synergetic linguistics concentrated on the links among punctual properties, here we look at the running text and try to capture its properties and find links among these sequential properties. The task is enormous and cannot be accomplished by an individual researcher. Even a team of researchers will need years to arrive at the boundary of the first stage of a theory. Nevertheless, one must begin.
References

Altmann, Gabriel. 1988. Wiederholungen in Texten. Bochum: Brockmeyer.
Boroda, Mojsej G. 1973. K voprosu o metroritmičeski elementarnoj edinice v muzyke. Bulletin of the Academy of Sciences of the Georgian SSR 71(3). 745–748.
Boroda, Mojsej G. 1982. Die melodische Elementareinheit. In Jurij K. Orlov, Mojsej G. Boroda & Isabella Š. Nadarejšvili (eds.), Sprache, Text, Kunst. Quantitative Analysen, 205–222. Bochum: Brockmeyer.
Boroda, Mojsej G. 1988. Towards a problem of basic structural units of a musical text. Musikometrika 1. 11–69.
Carroll, John B. 1968. Vectors of prose style. In Thomas A. Sebeok (ed.), Style in language, 283–292. Cambridge, Mass.: MIT Press.
Fucks, Wilhelm. 1968. Nach allen Regeln der Kunst. Stuttgart: Deutsche Verlags-Anstalt.
Köhler, Reinhard. 1986. Zur linguistischen Synergetik. Struktur und Dynamik der Lexik. Bochum: Brockmeyer.
Köhler, Reinhard. 2005. Synergetic linguistics. In Reinhard Köhler, Gabriel Altmann & Rajmund G. Piotrowski (eds.), Quantitative Linguistics. An International Handbook, 760–774. Berlin & New York: de Gruyter.
Wimmer, Gejza & Gabriel Altmann. 2005. Unified derivation of some linguistic laws. In Reinhard Köhler, Gabriel Altmann & Rajmund G. Piotrowski (eds.), Quantitative Linguistics. An International Handbook, 791–807. Berlin & New York: de Gruyter.
Sergey N. Andreev and Vadim V. Borisov
Linguistic Analysis Based on Fuzzy
Similarity Models
1 Introduction
The modern state of research in linguistics is characterized by the development and active use of various mathematical methods and models, as well as their combinations, to solve such complex problems as text attribution, automatic gender classification, classification of texts or individual styles, etc. (Juola 2006, Hoover 2007, Mikros 2009, Rudman 2002).
Among the important and defining features of linguistic research is the necessity to take into account the uncertainty, heuristic representation and subjectivity of estimation of the analyzed information, as well as the heterogeneity and different measurement scales of the characteristics.
All this leads to the necessity of using data mining methods, primarily methods of fuzzy analysis and modeling.
In empirical sciences the following possibilities of building fuzzy models
exist.
The first approach consists in the adaptation of fuzzy models which were built to solve problems in other spheres (technical diagnostics, decision making support, complex systems analysis). The advantages of this approach include the use of already accumulated experience and scientific potential. At present this approach prevails in empirical sciences.
The second approach presupposes the introduction of fuzziness into already existing models in order to take into account various types of uncertainty. Although this approach is rather effective, difficulties arise in the introduction of fuzziness into the models.
The third approach implies the creation of fuzzy models aimed at the solution of problems in the given branch of empirical science under conditions of uncertainty. In our opinion this approach is preferable, though at the same time it is the most difficult, due to the necessity to consider both the specific aspects of the problems and the characteristics of the developed fuzzy models (Borisov 2014).
This chapter deals with some issues of building fuzzy models within the third approach, using as an example fuzzy models for the estimation of the similarity of linguistic objects. The usage of such models forms the basis for solving several problems of linguistic analysis, including those considered in this chapter.
For variant (i), in addition to axioms A1) – A4), the following axiom must be fulfilled:
A5) ∀ p_i(a_j): h(p_1(a_j), …, p_n(a_j)) ≤ min(p_1(a_j), …, p_n(a_j)), i.e. the value of the generalized characteristic must not exceed the minimal value of the partial characteristics.
For variant (ii), in addition to axioms A1) – A4), the following axiom must be fulfilled:
A6) ∀ p_i(a_j): h(p_1(a_j), …, p_n(a_j)) ≥ max(p_1(a_j), …, p_n(a_j)), i.e. the value of the generalized characteristic must not be less than the maximal value of the partial characteristics.
For compromise strategies (variant (iii)), in addition to axioms A1) – A4), the following axiom must be fulfilled:
A7) ∀ p_i(a_j): min(p_1(a_j), …, p_n(a_j)) < h(p_1(a_j), …, p_n(a_j)) < max(p_1(a_j), …, p_n(a_j)), i.e. the value of the generalized characteristic lies between the values of the partial characteristics.
For the hybrid strategy (variant (iv)) the value of the generalized characteristic may be obtained by aggregation of the values of the partial characteristics on the basis of a symmetric sum such as the median med(p_1(a_j), p_2(a_j); a), a ∈ [0, 1].
Another variant of the hybrid strategy consists in using associative symmetric sums (excluding the median):
if max(p_1(a_j), …, p_n(a_j)) < 1/2, then h(p_1(a_j), …, p_n(a_j)) ≤ min(p_1(a_j), …, p_n(a_j));
if min(p_1(a_j), …, p_n(a_j)) > 1/2, then h(p_1(a_j), …, p_n(a_j)) ≥ max(p_1(a_j), …, p_n(a_j)).
This approach makes it possible, under the condition of equality of partial characteristics, to identify adequate operations for their aggregation.
h^(n)(p_1(a_j), …, p_n(a_j)) = h(h^(n−1)(p_1(a_j), …, p_{n−1}(a_j)), p_n(a_j)).
In case variant (i) is applied, borderline values for the partial characteristics are set, the fulfillment of which determines the acceptability of the generalized estimation of the object under study.
If variant (ii) is chosen, every partial characteristic is assigned a weight, with subsequent aggregation of the weighted partial characteristics using aggregation operations. The most widespread is the weighted sum of the partial characteristics:
h(p_1(a_j), …, p_n(a_j)) = Σ_{i=1}^{n} w_i p_i(a_j), with Σ_{i=1}^{n} w_i = 1.
We shall show the solution of these problems on the data of the poem The Rime of the Ancient Mariner by Samuel Coleridge and of its two translations, by Nikolay Gumilev and by Wilhelm Levik.
The main demands on similarity models for solving tasks of linguistic analysis include the following:
– the possibility to form a generalized characteristic on the basis of changing sets of partial characteristics;
– the possibility to aggregate heterogeneous characteristics (both quantitative and qualitative) which differ in measurement scales and range of variation;
– taking into consideration the different significance of partial characteristics in the generalized characteristic;
– taking into consideration the consistency of partial characteristics;
– the possibility to adjust (adapt) the model in a flexible way to changes in the number of characteristics (addition, exclusion) and to changes in the characteristics themselves (in consistency and significance).
1 The approach suggested in this chapter makes it possible to build similarity models both with and without a priori grouping of the characteristics. In both cases all the provisions of the method and of the similarity models remain valid.
(ii) subgroup of parts of speech, localized at the end of the line and not rhymed:
– p6 – number of unrhymed nouns;
– p7 – number of unrhymed verbs;
– p8 – number of unrhymed adjectives;
– p9 – number of unrhymed adverbs;
– p10 – number of unrhymed pronouns;
(iii) subgroup of parts of speech which are not localized at some position in the
line:
– p11 – number of nouns;
– p12 – number of verbs;
– p13 – number of adjectives;
– p14 – number of adverbs.
Table 2: Consistency matrix of part of speech localized rhymed characteristics in the poem The
Rime of the Ancient Mariner by S.T. Coleridge.
      p1   p2   p3   p4   p5
p1    –    HC   LC   MC   LC
p2    HC   –    LC   MC   NC
p3    LC   LC   –    HC   MC
p4    MC   MC   HC   –    MC
p5    LC   NC   MC   MC   –
In the same way, consistency matrices for localized unrhymed (Table 3) and unlocalized (Table 4) part of speech characteristics are formed.
Table 3: Consistency matrix of part of speech localized unrhymed characteristics in the poem
The Rime of the Ancient Mariner by S.T. Coleridge.
      p6   p7   p8   p9   p10
p6    –    MC   LC   LC   HC
p7    MC   –    LC   LC   MC
p8    LC   LC   –    MC   MC
p9    LC   LC   MC   –    LC
p10   HC   MC   MC   LC   –
Table 4: Consistency matrix of part of speech unlocalized characteristics in the poem The Rime
of the Ancient Mariner by S.T. Coleridge.
A symmetric matrix (Table 5) is formed, in which the element v_i/v_j denotes the degree of significance of the i-th characteristic as compared to the j-th characteristic.
Table 5: Matrix of pairwise comparisons of the significance of the characteristics from a group.

                    Characteristic number
          1       2      …      i      …      n
   1      1     v1/v2    …    v1/vi    …    v1/vn
   2    v2/v1     1      …    v2/vi    …    v2/vn
   …      …       …      …      …      …      …
   i    vi/v1   vi/v2    …      1      …    vi/vn
   …      …       …      …      …      …      …
   n    vn/v1   vn/v2    …    vn/vi    …      1

(the rows are likewise indexed by characteristic number)
For the analyzed example the following weights of the characteristics were obtained using the above-mentioned method:
– for the subgroup of part of speech localized rhymed features: {w1 = 0.12, w2 =
0.29, w3 = 0.25, w4 = 0.15, w5 = 0.19};
– for the subgroup of part of speech localized unrhymed features: {w6 = 0.11,
w7 = 0.30, w8 = 0.23, w9 = 0.14, w10 = 0.22};
– for the subgroup of part of speech unlocalized features: {w11 = 0.25, w12 = 0.3,
w13 = 0.3, w14 = 0.15};
Fig. 2: Fuzzy undirected graph which corresponds to the results of the estimation of consistency and significance in the subgroup of part of speech localized rhymed characteristics {p1, p2, p3, p4, p5} in The Rime of the Ancient Mariner by Coleridge.
In the same way, fuzzy graphs corresponding to the results of the estimation of consistency and significance in the other groups (subgroups) can be constructed.
Consider the method of building a similarity model using the data of the above-mentioned subgroup of localized rhymed part of speech characteristics {p1, p2, p3, p4, p5}.
The suggested method is based on an alternate search, in the fuzzy graph, for complete subgraphs G̃′ = (P̃′, R̃′) whose arcs correspond to one of the criterion levels of consistency c̃_i ∈ C̃ = {NC, LC, MC, HC, FC}. The search order of such subgraphs can vary.
The order of proceeding from the lowest to higher degrees of consistency allows "good" estimations of badly consistent characteristics not to be lost, because the operation of aggregation of the characteristics is usually aimed at the type of estimation in which the value of the generalized characteristic is defined by the worst value of the partial characteristics.
The order of proceeding from the most consistent to less consistent characteristics makes it possible to take into account "bad" estimations of highly consistent characteristics whose aggregation was based on the type of estimation in which the value of the generalized characteristic is defined by the highest value of the partial characteristics (Zernov 2007).
Consider the alternate search for subgraphs G̃′ = (P̃′, R̃′) following the procedure from the most consistent to less consistent characteristics in the subgroup {p1, p2, p3, p4, p5}.
The procedure starts with finding the highest level of consistency for the characteristics {p1, p2, p3, p4, p5} (Fig. 2).
The pairs of characteristics p1 and p2, and also p3 and p4, have HC ("high level of consistency"), the highest level in the fuzzy graph in Figure 2. These pairs of nodes form two complete subgraphs of maximum size.
In case there are several such subgraphs, these complete subgraphs of maximum size are sorted in order of decreasing total weight of their nodes (another method of analysing the subgraphs can also be applied). In Fig. 3a only those edges of the fuzzy graph are left which correspond to the given level of consistency HC.
Considering these subgraphs in succession, we merge all the nodes of each subgraph into one on the basis of the aggregation operation corresponding to the given level of consistency.
If the aggregation operation is not a multi-place operation h(p_1, …, p_q), q ∈ {1, …, n}, but a two-place nonassociative operation h(p_k, p_l), k, l ∈ {1, …, n}, k ≠ l, then the order of enumeration of the characteristics p_k and p_l may be set, e.g., according to decreasing weights.
In our case of a bisymmetric operation it is also possible to start with the nodes of the largest weight. Then, supposing that w_i ≤ w_{i+1}, i ∈ {1, …, q−1}, we get:
h*(p_1, …, p_q) = h(h(…(h(p_1, p_2), …), p_{q−1}), p_q).
For aggregating the characteristics p1 and p2, and also p3 and p4, the operations med(p1, p2; 0.75) and med(p3, p4; 0.75) from Table 1, corresponding to the specified level of consistency HC, are chosen.
Figure 3b shows the fuzzy graph obtained by merging the nodes p1 and p2, and p3 and p4, using the above-mentioned operations. The weights of the new nodes of the fuzzy graph are determined by summing the weights of the merged nodes.
Fig. 3: Subgraphs of the fuzzy graph which correspond to the HC "High Consistency" level (a); fuzzy graph with merged nodes (p1, p2) and (p3, p4), obtained by using the operations med(p1, p2; 0.75) and med(p3, p4; 0.75), respectively (b).
Then the edges connected with the merged nodes are removed from the set of edges of the fuzzy graph.
Afterwards, the levels of consistency of the edges adjacent to the merged nodes are determined. These levels of consistency are defined according to the chosen strategy of estimation (see the classification of approaches to building similarity models for linguistic analysis). In the given example we choose the type of estimation in which the value of the generalized characteristic is determined by the worst value of the partial characteristics.
Thus the similarity model of the subgroup of part of speech localized rhymed
characteristics for the poem The Rime of the Ancient Mariner can be expressed as
follows:
P_LR = min(med(p1, p2; 0.75), med(med(p3, p4; 0.75), p5; 0.5)).
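A minimal sketch of how such a model can be evaluated, assuming (in line with the median-based symmetric sums introduced above) that med(x, y; a) denotes the λ-median, i.e. the middle value of {x, y, a}; the operations are actually fixed by the chapter's Table 1, which is not reproduced in this excerpt:

```python
def med(x, y, a):
    """Two-place aggregation med(x, y; a), read here as the lambda-median:
    the middle value of {x, y, a} (an assumption; see the lead-in)."""
    return sorted([x, y, a])[1]

def p_lr(p1, p2, p3, p4, p5):
    """Similarity model of the localized rhymed subgroup:
    P_LR = min(med(p1, p2; 0.75), med(med(p3, p4; 0.75), p5; 0.5))."""
    return min(med(p1, p2, 0.75), med(med(p3, p4, 0.75), p5, 0.5))

# Partial characteristics normalized to [0, 1] (illustrative values):
print(p_lr(0.6, 0.8, 0.3, 0.5, 0.9))   # 0.5
```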
Similarity models for two other subgroups of characteristics are built in the
same way.
Figure 5 shows the fuzzy undirected graph which corresponds to the results of the estimation of consistency and significance in the subgroup of part of speech localized unrhymed characteristics {p6, p7, p8, p9, p10} for the poem The Rime of the Ancient Mariner.
Fig. 5: Fuzzy undirected graph which corresponds to the results of estimation of consistency
and significance in the subgroup of part of speech localized unrhymed characteristics
{p6, p7, p8, p9, p10} for the poem The Rime of the Ancient Mariner.
Having repeated all the steps of this method, we get the similarity model for the subgroup of part of speech localized unrhymed characteristics as follows:
P_LUR = med(med(med(p6, p10; 0.75), p7; 0.5), med(p8, p9; 0.5); 0.25).
Fig. 6: Fuzzy undirected graph which corresponds to the results of estimation of consistency
and significance in the subgroup of part of speech unlocalized characteristics
{p11, p12, p13, p14} for the poem The Rime of the Ancient Mariner.
P_ul = med(med(med(p11, p12; 0.5), p13; 0.5), p14; 0.5).
Here P_LR(a_j) is the estimation of linguistic object a_j using the subgroup of localized rhymed part of speech characteristics, obtained with the help of the similarity model P_LR; P_LUR(a_j) is the estimation of object a_j using the subgroup of localized unrhymed part of speech characteristics, obtained with the help of the similarity model P_LUR; and P_ul(a_j) is the estimation of object a_j using the subgroup of unlocalized part of speech characteristics, obtained with the help of the model P_ul.²
On the other hand, the estimation results obtained for the above-mentioned subgroups of characteristics may be aggregated into a generalized estimation for the whole group of part of speech characteristics.
In order to do this, according to the suggested method it is necessary to build a similarity model in which P_LR, P_LUR and P_ul in their turn function as "partial" characteristics, and P_POS is a generalized characteristic which is formed taking into account the consistency and significance of these "partial" characteristics.
Consider briefly some approaches to the analysis of linguistic objects using the
developed models.
2 The significance of the characteristics is taken into account directly during the calculation of the estimation of a particular linguistic object using the estimation models.
One approach consists in the estimation of the degree of similarity of the poem The Rime of the Ancient Mariner by Coleridge and of its two translations made by Gumilev and Levik.
In this case the model P_POS which was built for the original text is used.
Specific numerical values of the characteristics are set:
– for the original text by Coleridge – {p1(a1), …, pn(a1)};
– for the translation by Gumilev – {p1(a2), …, pn(a2)};
– for the translation by Levik – {p1(a3), …, pn(a3)}.
Then the values of P_POS(a1), P_POS(a2) and P_POS(a3) are calculated using the similarity model built for the group of part of speech characteristics in the original text:
P_POS(a_i) = med(med(P_LR(a_i), P_ul(a_i); 0.75), P_LUR(a_i); 0.75).
Then, using the values P_POS(a2) and P_POS(a3) of the translations, their similarity with the original, i.e. their proximity to the value of P_POS(a1), is established.
The obtained results of the estimation make it possible not only to rank the compared linguistic objects, but also to determine the degree of their similarity.
Table 6 contains the results of the estimation of 7 parts (chapters) of the
texts of the original poem by Coleridge The Rime of the Ancient Mariner and its
two translations by Gumilev and Levik, which were obtained using the
similarity model for part of speech characteristics of the original.
Table 6: Results of the estimation of 7 parts of the texts of the original poem by Coleridge The
Rime of the Ancient Mariner and its two translations by Gumilev and Levik, obtained using the
similarity model for part of speech characteristics of the original.
These results show that on the whole the translation by Gumilev is closer to the original in Parts 1–5, in which the idea of crime and punishment prevails. On the other hand, Part 6 and to some extent Part 7, in which the themes of atonement and the relief of the Mariner (and, as is implied in the text, of the writer himself) through storytelling are expressed, are translated by Levik with a higher degree of exactness, thus revealing the difference of approaches demonstrated by the two translators.
Gumilev is much closer in his translation with respect to the unrhymed characteristics and also to the unlocalized part of speech characteristics, whereas Levik pays more attention to the rhymed words. Rhyme is one of the important means of introducing vertical links into verse. Unrhymed words attract less attention, but together with the unlocalized characteristics they allow Gumilev to render the distribution of parts of speech in the line. In other words, Levik achieves better similarity in the vertical organization, Gumilev in the horizontal aspect.
The estimation was carried out with the following assumptions:
– For the given example the evaluation model is unchanged for all parts
(chapters) of the original poem. Obviously, it is possible to build different
similarity models for different chapters of the text, and then estimation can
be made for each chapter, using its “own” similarity model.
– The comparison of these linguistic objects in the given example was made
after normalizing the characteristics against the number of lines in the
corresponding chapter. It is reasonable also to examine the possibilities of
using this model with normalization of the values of characteristics against
their maximum values and taking into account their weights.
– While building the model in the given example we followed the strategy of examining the characteristics in the order "from the most consistent to the least consistent". Using the other strategy of building the model, one may widen the possibilities of the classification.
Fig. 7: Dynamics of the characteristic P_POS over the original text of The Rime of the Ancient Mariner by Coleridge and its translations by Gumilev and Levik.

Firstly, the characteristic P_POS is mostly influenced by P_ul. This can explain the higher values of P_POS in both translations as compared with the values of this characteristic in the original.
Secondly, for the original P_POS remains practically stable over the whole of the text. Some deviation – a tendency to decrease – takes place only in Part 6, with P_POS reaching its minimum value in Part 7. The same tendency is observed in both translations.
Thirdly, the changes of P_POS over the text of the poem correlate more closely between the original and the translation by Gumilev than between the original and the translation by Levik.
The same type of comparison was made for the dynamics of P_LR, P_LUR and P_ul over the text, for the original and the translations (Fig. 8–10).
The variation of the values of the two characteristics P_ul and P_LUR in all three texts is rather moderate. Some deviation from this tendency is observed in Levik's translation (Fig. 9) in Part 3, where the theme of punishment is introduced. On the other hand, the localized rhymed characteristics P_LR (Fig. 8) show a wide range of variation of their values in each text.
Fig. 8: Dynamics of the characteristic P_LR for the original text The Rime of the Ancient Mariner
by Coleridge and its translations by Gumilev and Levik.
Fig. 9: Dynamics of the characteristic P_LUR for the original text The Rime of the Ancient Mariner
by Coleridge and its translations by Gumilev and Levik.
Fig. 10: Dynamics of the characteristic P_ul for the original text The Rime of the Ancient Mariner
by Coleridge and its translations by Gumilev and Levik.
4 Conclusion
In this chapter the relevance of building fuzzy similarity models for linguistic
analysis has been discussed. The models are aimed at solving a wide range of
tasks under conditions of uncertainty:
– estimation of the degree of similarity of the original text (poem) and its
translations using different groups of characteristics;
– estimation of the similarity of parts of the compared poems (chapters);
– classification of texts and their features according to certain rules.
Similarity models for complicated linguistic objects have been analyzed, and their classification depending on the type of characteristics aggregation has been suggested. The relevance of building similarity models based on the fuzzy approach and aimed at solving various linguistic problems has been outlined.
The suggested fuzzy similarity models for linguistic analysis satisfy the following conditions:
– the generalized characteristic is formed on the basis of changing sets of partial characteristics;
– heterogeneous characteristics (both quantitative and qualitative) which differ in measurement scales, range of variation and weight can be aggregated;
– a similarity model can be built both with and without a priori grouping of the characteristics;
– a complex hierarchical structure of the similarity model can be created, in which the partial characteristics are divided into groups and subgroups;
– criterion levels of characteristics consistency are used for the identification of aggregation operations;
– the model can be adjusted (adapted) in a flexible way to changes in the number of characteristics (addition, exclusion) and to changes of the characteristics (in consistency and weight).
Acknowledgment
This research is supported by the State Task of the Ministry of Education and Science of the Russian Federation (the basic part, task # 2014/123, project # 2493) and by the Russian Foundation for the Humanities (project # 14-04-00266).
References
Andreev, Sergey. 2012. Literal vs. liberal translation – formal estimation. Glottometrics 23. 62–
69.
François Bavaud, Christelle Cocco and Aris Xanthos
Textual Navigation and Autocorrelation
1 Introduction
Much work in the field of quantitative text processing and analysis has adopted
one of two distinct symbolic representations: in the so-called bag-of-words
model, text is conceived as an urn from which units (typically words) are inde-
pendently drawn according to a specified probability distribution; alternatively,
the sequential model describes text as a categorical time series, where each unit
type occurs with a probability that depends, to some extent, on the context of
occurrence.
The present contribution pursues two related goals. First, it aims to general-
ise the standard sequential model of text by decoupling the order in which units
occur from the order in which they are read. The latter can be represented by a
Markov transition matrix between positions in the text, which makes it possible
to account for a variety of ways of navigating the text, including in particular non-
linear and non-deterministic ones. Markov transitions thus define textual neigh-
bourhoods as well as positional weights – the stationary distribution or promi-
nence of textual positions.
Building on the notion of textual neighbourhood, the second goal of this con-
tribution is to introduce a unified framework for textual autocorrelation, namely
the tendency for neighbouring positions to be more (or less) similar than ran-
domly chosen positions with respect to some observable property – for instance
whether the same unit types tend to occur, or units of similar length, or consisting
of similar sub-units, and so on. Inspired by spatial analysis (see e.g. Cressie 1991; Anselin 1995; Bavaud 2013), this approach relates the above-mentioned transition matrix (specifying neighbourhoodness) to a second matrix specifying the dissimilarity between textual positions.
The remainder of this contribution is organised as follows. Section 2 intro-
duces the foundations of the proposed formalism and illustrates them with toy
examples. Section 3 presents several case studies intended to show how the for-
malism and some of its extensions apply to more realistic research problems in-
volving, among others, morphosyntactic and semantic dissimilarities computed
in literary or hypertextual documents. Conclusion briefly summarises the key
ideas introduced in this contribution.
2 Formalism
Fig. 1: Fixations and ocular saccades during reading, illustrating the skimming navigation model. Source: Wikimedia Commons, from a study of speed reading made by Humanistlaboratoriet, Lund University (2005).
1 Boundary effects are, if necessary, dealt with by the usual techniques (reflecting boundary, periodic conditions, addition of a "rest state", etc.), supporting the formal fiction of an ever-reading agent embodied in stationary navigation.
Fig. 2: Word cloud (generated by http://www.wordle.net), aimed at illustrating the free or bag-
of-words navigation model with differing positional weights f.
2 Here and in the sequel, $1(A)$ denotes the indicator function of the event $A$, with value $1(A) = 1$ if $A$ is true, and $1(A) = 0$ otherwise.
3 Here and in the sequel, the notation "•" denotes summation over the replaced index, as in $a_{\bullet j} := \sum_i a_{ij}$, $a_{i\bullet} := \sum_j a_{ij}$ and $a_{\bullet\bullet} := \sum_{ij} a_{ij}$.
– $e_{ij} = \frac{1}{2rn}\,1(|i-j| \le r)\,1(i \ne j)$ for skimming with (undirected) jumps of maximum length $r$
– $e_{ij} = f_i f_j$ for the free or bag-of-words exchange matrix.

$$T^B = \frac{1}{4}\begin{pmatrix} 1 & 1 & 1 & 1\\ 1 & 1 & 1 & 1\\ 1 & 1 & 1 & 1\\ 1 & 1 & 1 & 1 \end{pmatrix} \qquad f = \begin{pmatrix} 1/4\\ 1/4\\ 1/4\\ 1/4 \end{pmatrix} \qquad E^B = \frac{1}{16}\begin{pmatrix} 1 & 1 & 1 & 1\\ 1 & 1 & 1 & 1\\ 1 & 1 & 1 & 1\\ 1 & 1 & 1 & 1 \end{pmatrix} \qquad (3)$$
Note that (2) can also be conceived as an example of (periodic) skimming with jumps of maximum length $r = 1$, which is indeed equivalent to (periodic) linear reading. Similarly, free navigation is equivalent to skimming with jumps of maximum size $n$, with the single difference that the latter forbids jumping towards the currently occupied position.
Let us now consider a matrix of dissimilarities $D_{ij}$ between pairs of positions $(i, j)$. Here and in the sequel, we restrict ourselves to squared Euclidean dissimilarities, i.e. of the form $D_{ij} = \|x_i - x_j\|^2$, where $\|\cdot\|$ denotes the Euclidean norm and $x_i$, $x_j$ are $p$-dimensional vectors, for some $p \ge 1$.
The average dissimilarity between a pair of randomly chosen positions defines the (global) inertia $\Delta$, while the average dissimilarity between a pair of neighbouring positions defines the local inertia $\Delta_{\mathrm{loc}}$:

$$\Delta := \frac{1}{2}\sum_{i,j} f_i f_j D_{ij} \qquad \Delta_{\mathrm{loc}} := \frac{1}{2}\sum_{i,j} e_{ij} D_{ij} \qquad (4)$$
A local inertia much smaller than the global inertia reflects the presence of textual autocorrelation: closer average similarity between neighbouring positions than between randomly chosen positions. The autocorrelation index $\delta := 1 - \Delta_{\mathrm{loc}}/\Delta$ is compared to its expected value $E_0(\delta)$ under the null hypothesis of absence of autocorrelation, whose distribution can be approximated as normal (e.g. Cliff and Ord 1981; Bavaud 2013), thus making the autocorrelation index significant at level $\alpha$ if

$$\left|\frac{\delta - E_0(\delta)}{\sqrt{\mathrm{Var}_0(\delta)}}\right| \ge u_{1-\alpha/2}, \qquad (6)$$

where $u_\alpha$ denotes the $\alpha$-th quantile of the standard normal distribution.
Toy example 1, continued: suppose that the types occurring at the four positions of example 1 are the following trigrams: $o(1) = \alpha\beta\gamma$, $o(2) = \alpha\delta\epsilon$, $o(3) = \epsilon\delta\zeta$, and $o(4) = \alpha\beta\gamma$. Define the (squared Euclidean) dissimilarity $D_{ij}$ as the number of characters by which the trigrams $o(i)$ and $o(j)$ differ:

$$D = \begin{pmatrix} 0 & 2 & 3 & 0\\ 2 & 0 & 2 & 2\\ 3 & 2 & 0 & 3\\ 0 & 2 & 3 & 0 \end{pmatrix}$$
4 Up to the bias associated with the contribution of self-comparisons: see (6).
Under linear periodic navigation (2), one gets the values $\Delta^A = 3/4$, $\Delta^A_{\mathrm{loc}} = 7/8$ and $\delta^A = -1/6$, higher than $E_0(\delta^A) = -1/3$: the dissimilarity between immediate neighbours under linear periodic navigation is on average (after subtracting the bias $E_0(\delta^A)$) smaller than the dissimilarity between randomly chosen pairs – although not significantly so by the above normal test.

By contrast, free navigation (3) yields $\Delta^B = \Delta^B_{\mathrm{loc}} = 3/4$ and $\delta^B = 0$, since local and global inertia here coincide by construction. Note that $E_0(\delta^B) = 0$ and $\mathrm{Var}_0(\delta^B) = 0$ in the case of free navigation, regardless of the values of $D$.
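The figures of toy example 1 can be checked numerically. The sketch below reads the linear periodic exchange matrix of (2), which is not fully reproduced in this excerpt, as assigning weight $1/(2n)$ to each ordered pair of cyclic neighbours:

```python
import numpy as np

# Position dissimilarities of toy example 1 (trigram character differences):
D = np.array([[0, 2, 3, 0],
              [2, 0, 2, 2],
              [3, 2, 0, 3],
              [0, 2, 3, 0]], dtype=float)

n = 4
f = np.full(n, 1 / n)                # uniform positional weights

# Exchange matrix of linear periodic navigation: weight 1/(2n) on each
# ordered pair of cyclic neighbours (i, i+1 mod n) -- our reading of (2).
E = np.zeros((n, n))
for i in range(n):
    j = (i + 1) % n
    E[i, j] = E[j, i] = 1 / (2 * n)

inertia = 0.5 * f @ D @ f            # global inertia: 0.75
inertia_loc = 0.5 * (E * D).sum()    # local inertia: 0.875
delta = 1 - inertia_loc / inertia    # autocorrelation index: -1/6
print(inertia, inertia_loc, delta)
```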
In most cases, the dissimilarity $D_{ij}$ between positions $i$ and $j$ depends only on the types $o(i) = a$ and $o(j) = b$ found at these positions. Thus, the calculation of the autocorrelation index can be based on the $v \times v$ type dissimilarity matrix $D_{ab}$ rather than on the $n \times n$ position dissimilarity matrix – which makes it possible to simplify both computation (since $v < n$ in general) and notation (cf. sections 3.2 and 3.3).
Here are some examples of squared Euclidean type dissimilarities, i.e. of the form $D_{ab} = \|x_a - x_b\|^2$ where $x_a \in \mathbb{R}^p$ are the $p$-dimensional coordinates of type $a$, recoverable by multidimensional scaling (see section 3.4):
– $D_{ab} = (x_a - x_b)^2$ where $x_a$ characterises type $a$, e.g. $x_a$ = "length of $a$" or $x_a = 1(a \in A)$ (presence-absence dissimilarity with respect to property $A$)
– $D_{ab} = 1(a \ne b)$, the discrete metric
– $D_{ab} = \left(\frac{1}{\pi_a} + \frac{1}{\pi_b}\right) 1(a \ne b)$, the weighted discrete metric, where $\pi_a > 0$ is the relative proportion of type $a$, with $\sum_a \pi_a = 1$ (Le Roux and Rouanet 2004; Bavaud and Xanthos 2005)
– $D_{ab} = \sum_k \frac{n_{\bullet k}}{n_{\bullet\bullet}} \left(\frac{n_{ak}\, n_{\bullet\bullet}}{n_{a\bullet}\, n_{\bullet k}} - \frac{n_{bk}\, n_{\bullet\bullet}}{n_{b\bullet}\, n_{\bullet k}}\right)^2$, the chi-square dissimilarity, used for composite types made of distinguishable features, where $n_{ak}$ is the type-feature matrix counting the occurrences of feature $k$ in type $a$.
whose margins specify the relative distribution of types: $\pi_a = \alpha_{a\bullet} = \alpha_{\bullet a} = \sum_i f_i\, 1(o(i) = a)$. Global and local inertias (4) can then be calculated as

$$\Delta = \frac{1}{2}\sum_{a,b} \pi_a \pi_b D_{ab} \qquad \Delta_{\mathrm{loc}} = \frac{1}{2}\sum_{a,b} \alpha_{ab} D_{ab} \qquad (7)$$
The Markov transition probability from term $a$ to term $b$ is now $\alpha_{ab}/\pi_a$. Following (5), the autocorrelation index $\delta = 1 - \Delta_{\mathrm{loc}}/\Delta$ has to be compared with its expected value under independence

$$E_0(\delta) = \frac{\sum_{a=1}^{v} \frac{\alpha_{aa}}{\pi_a} - 1}{v - 1} \qquad (8)$$
which generally differs from (5). Indeed, the permutation-invariance implied by the null hypothesis $H_0$ of absence of textual autocorrelation relies on permutations of positions $i = 1, \dots, n$ in (5), while it considers permutations of terms $a = 1, \dots, v$ in (8) – two natural although not equivalent assumptions. In the following, the position permutation test will be adopted by default, unless explicitly stated otherwise.
Toy example 1, continued: recall that the set of types occurring in the text of example 1 is $\{\alpha\beta\gamma, \alpha\delta\epsilon, \epsilon\delta\zeta\}$. The type dissimilarity $D_{ab}$ corresponding to the position dissimilarity $D_{ij}$ previously used is defined as the number of characters by which the trigrams $a$ and $b$ differ:

$$D = \begin{pmatrix} 0 & 2 & 3\\ 2 & 0 & 2\\ 3 & 2 & 0 \end{pmatrix}$$

Under linear periodic navigation (2), the type exchange matrix and type proportions are

$$(\alpha_{ab}) = \frac{1}{8}\begin{pmatrix} 2 & 1 & 1\\ 1 & 0 & 1\\ 1 & 1 & 0 \end{pmatrix} \qquad \pi = \frac{1}{4}\begin{pmatrix} 2\\ 1\\ 1 \end{pmatrix},$$

yielding inertias (7) $\Delta^A = 3/4$ and $\Delta^A_{\mathrm{loc}} = 7/8$ with $\delta = -1/6$, as already obtained.
3 Case studies

The next sections present several case studies involving in particular chi-square dissimilarities between composite types such as play lines, hypertext navigation, and semantic dissimilarities (further illustrations, including Markov iterations $W \to W^r$, may be found in Bavaud et al. 2012). Unless otherwise stated, we use the "skimming" navigation model defined in section 2.1 (slightly adapted to handle border effects) and let the maximum length of jumps vary as $r = 1, 2, 3, \dots$, yielding autocorrelation indices $\delta^{[r]}$ for neighbourhoods of size $r$, i.e. including $r$ neighbours to the left and $r$ neighbours to the right of each position. In particular, $\delta^{[1]}$ constitutes a generalisation of the Durbin-Watson statistic.
The play Sganarelle ou le Cocu imaginaire by Molière (1660) contains $n = 207$ lines declaimed by feminine or masculine characters (coded $s_i = 1$ or $s_i = 0$ respectively). Each line $i$ is also characterised by the number of occurrences $n_{ik}$ of each part-of-speech (POS) tag $k = 1, \dots, p$ as assigned by TreeTagger (Schmid 1994); here $p = 28$. The first few rows and columns of the data are represented in Table 1.
The length autocorrelation index (Figure 3 left) reveals that the length of lines tends to be strongly autocorrelated over neighbourhoods of size up to 5: long (short) lines tend to be surrounded at short range by long (short) lines. This might reflect the structure of the play, which comprises long passages declaiming general considerations on the human condition, and more action-oriented passages made of shorter lines.
Fig. 3: Autocorrelation index $\delta^{[r]}$ (circles), together with expected value (5) (solid line) and 95% confidence interval (6) (dashed lines). Left: length dissimilarity $D^{\mathrm{length}}$. Right: gender dissimilarity $D^{\mathrm{gender}}$.

Fig. 4: Autocorrelation index $\delta^{[r]}$ (same setting as Figure 3) for the chi-square dissimilarity $D^{\chi}$ (left) and the reduced chi-square dissimilarity $D^{R\chi}$ (right).
than feminine lines (64.7% vs. 35.3%); (ii) the probability of being followed by a line of differing gender is much lower for masculine lines than for feminine ones (44.4% vs. 82.2%). These factors combine to dominate the short-range preference for alternation.

The POS tag profile of lines tends to exhibit no autocorrelation, although the alignments observed in Figure 4 (left) are intriguing. The proportion of verbs (Figure 4 right) tends to be positively (but not significantly, presumably due to the small sample size $n = 207$) autocorrelated up to a range of 10, and negatively autocorrelated for a range between 20 and 30 – an observation whose interpretation requires further investigation.
Let the text be partitioned into documents $g = 1, \dots, m$; $i \in g$ denotes the membership of position $i$ in document $g$, and $\rho_g := \sum_{i \in g} f_i$ is the relative weight of document $g$. Consider now the free textual navigation confined within each document, with positional weights $f_i^g$ within document $g$, where $g[i]$ denotes the document to which position $i$ belongs. The associated exchange matrix $e_{ij} := \sum_{g=1}^{m} \rho_g\, f_i^g f_j^g$ is reducible, i.e. made out of $m$ disconnected submatrices. Note that $f$ obtains here as the margin of $E$, rather than as the stationary distribution of $T$, which is reducible and hence not regular. In this setup, the local inertia is nothing but the within-groups inertia

$$\Delta_{\mathrm{loc}} = \sum_g \rho_g \Delta_g =: \Delta_W \qquad \Delta_g := \frac{1}{2}\sum_{i,j} f_i^g f_j^g D_{ij}$$

and hence $\delta = \Delta_B/\Delta \ge 0$, where $\Delta_B = \Delta - \Delta_W = \sum_g \rho_g D_{g0}$ is the between-groups inertia, and $D_{g0}$ is the dissimilarity between the centroid of group $g$ and the overall centroid $0$. Here $\delta$, always non-negative, behaves as a kind of generalised $F$-ratio.
In practical applications, textual positional weights are uniform, and the free navigation within documents involves the familiar term-document matrix. The significance of $\delta = (\Delta - \Delta_{\mathrm{loc}})/\Delta$ can be tested by (6), where $\mathrm{trace}(W^2) = \mathrm{trace}(W) = m$ for the free within-documents navigation under the position permutation test (5).
When $D_{ab} = \left(\frac{n_{\bullet\bullet}}{n_{a\bullet}} + \frac{n_{\bullet\bullet}}{n_{b\bullet}}\right) 1(a \ne b)$ is the weighted discrete metric (section 2.3 c), the autocorrelation index turns out to be

$$\delta = \frac{\chi^2}{n_{\bullet\bullet}(v-1)} \qquad (13)$$

where $v$ is the number of types and $\chi^2$ the term-document chi-square.
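Relation (13) is straightforward to compute from a term-document matrix; a sketch (the variable names are ours):

```python
import numpy as np

def delta_weighted_discrete(N):
    """Autocorrelation index under free within-document navigation with the
    weighted discrete metric, via (13): delta = chi2 / (n.. * (v - 1))."""
    N = np.asarray(N, dtype=float)
    n = N.sum()
    expected = np.outer(N.sum(axis=1), N.sum(axis=0)) / n
    chi2 = ((N - expected) ** 2 / expected).sum()
    v = N.shape[0]                   # number of types (rows)
    return chi2 / (n * (v - 1))

# Term-document matrix of toy example 2 below (types alpha..eta in rows):
N = [[0, 2, 2, 4],
     [2, 0, 2, 0],
     [1, 1, 0, 0],
     [1, 0, 0, 0],
     [0, 1, 0, 1],
     [0, 0, 0, 2],
     [0, 0, 0, 1]]
print(delta_weighted_discrete(N))
```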
Toy example 2: consider $v = 7$ types represented by Greek letters, whose $n = n_{\bullet\bullet} = 20$ occurrences possess the same weight $f_i = 1/20$, and are distributed among $m = 4$ documents as "$\beta\beta\gamma\delta$", "$\alpha\alpha\gamma\epsilon$", "$\alpha\alpha\beta\beta$" and "$\alpha\alpha\alpha\alpha\epsilon\zeta\zeta\eta$" (Figure 5). The term-document matrix, term weights and document weights read

$$(n_{ag}) = \begin{array}{c|cccc} & g=1 & g=2 & g=3 & g=4 \\ \hline \alpha & 0 & 2 & 2 & 4 \\ \beta & 2 & 0 & 2 & 0 \\ \gamma & 1 & 1 & 0 & 0 \\ \delta & 1 & 0 & 0 & 0 \\ \epsilon & 0 & 1 & 0 & 1 \\ \zeta & 0 & 0 & 0 & 2 \\ \eta & 0 & 0 & 0 & 1 \end{array} \qquad \pi = \frac{1}{20}(8, 4, 2, 1, 2, 2, 1)^\top \qquad \rho = \frac{1}{5}(1, 1, 1, 2)^\top \qquad (14)$$
The discrete metric $D^B$ (section 2.3 b) reads

$$D^B = \begin{pmatrix} 0 & 1 & 1 & 1 & 1 & 1 & 1\\ 1 & 0 & 1 & 1 & 1 & 1 & 1\\ 1 & 1 & 0 & 1 & 1 & 1 & 1\\ 1 & 1 & 1 & 0 & 1 & 1 & 1\\ 1 & 1 & 1 & 1 & 0 & 1 & 1\\ 1 & 1 & 1 & 1 & 1 & 0 & 1\\ 1 & 1 & 1 & 1 & 1 & 1 & 0 \end{pmatrix},$$
and the weighted discrete metric $D^C$ (section 2.3 c):

$$D^C = \begin{pmatrix} 0 & 7.5 & 12.5 & 22.5 & 12.5 & 12.5 & 22.5\\ 7.5 & 0 & 15 & 25 & 15 & 15 & 25\\ 12.5 & 15 & 0 & 30 & 20 & 20 & 30\\ 22.5 & 25 & 30 & 0 & 30 & 30 & 40\\ 12.5 & 15 & 20 & 30 & 0 & 20 & 30\\ 12.5 & 15 & 20 & 30 & 20 & 0 & 30\\ 22.5 & 25 & 30 & 40 & 30 & 30 & 0 \end{pmatrix}.$$
The corresponding values of global inertias (7), local inertias (12) and textual au-
tocorrelation 𝛿𝛿𝛿𝛿 are given in Table 2 below.
Sganarelle, continued: consider the distribution of the 961 nouns and 1,204 verbs of the play Sganarelle among the $m = 24$ scenes of the play, treated here as documents (section 3.1). The autocorrelation index (13) for nouns associated with the weighted discrete metric takes on the value $\delta^{\mathrm{nouns}} = 0.0238$, lower than the expected value (5) $E_0(\delta^{\mathrm{nouns}}) = 0.0240$. For verbs, one gets $\delta^{\mathrm{verbs}} = 0.0198 > E_0(\delta^{\mathrm{verbs}}) = 0.0191$. Although not statistically significant, the sign of the differences reveals a lexical content within scenes more homogeneous for verbs than for nouns. Finer analysis can be obtained from Correspondence Analysis (see e.g. Greenacre 2007), performing a spectral decomposition of the chi-square in (13).
Consider a set 𝐺𝐺𝐺𝐺 of electronic documents 𝑔𝑔𝑔𝑔 = 1, … , 𝑚𝑚𝑚𝑚 containing hyperlinks at-
tached to a set 𝐴𝐴𝐴𝐴 of active terms and specified by a function 𝛼𝛼𝛼𝛼[𝑎𝑎𝑎𝑎] from 𝐴𝐴𝐴𝐴 to 𝐺𝐺𝐺𝐺,
associating each active term 𝑎𝑎𝑎𝑎 to a target document 𝑔𝑔𝑔𝑔 = 𝛼𝛼𝛼𝛼[𝑎𝑎𝑎𝑎]. A simple model of
hypertext navigation consists in clicking at each position occupied by an active
term, thus jumping to the target document, while staying in the same document
when meeting an inactive term; in both cases, the next position 𝑖𝑖𝑖𝑖 is selected as
𝑙𝑙𝑙𝑙
𝑓𝑓𝑓𝑓𝑖𝑖𝑖𝑖 in (11). This dynamics generates a document to document transition matrix
Φ = (𝜑𝜑𝜑𝜑𝑙𝑙𝑙𝑙ℎ ), involving the term-document matrix 𝑛𝑛𝑛𝑛𝑎𝑎𝑎𝑎𝑙𝑙𝑙𝑙 (10), as
Textual navigation and autocorrelation 47
$$\varphi_{gh} := \sum_a \frac{n_{ag}}{n_{\bullet g}}\, \tau_{(ag)h} \qquad (15)$$
where $\tau_{(ag)h}$ is the probability of jumping from term $a$ in document $g$ to document $h$, obeying $\tau_{(ag)\bullet} = 1$. In the present setup, $\tau_{(ag)h} = \mathbf{1}(h = \alpha[a])$ for $a \in A$ and $\tau_{(ag)h} = \mathbf{1}(h = g)$ for $a \notin A$. Alternative specifications taking into account clicking probabilities, or contextual effects (of term $a$ relative to its background $g$), could also be cast within this formalism. The document-to-document transition matrix obeys $\varphi_{gh} \geq 0$ and $\varphi_{g\bullet} = 1$, and the broad Markovian family of hypertext navigations (15) generalizes specific proposals such as the free within-documents setup, or the Markov chain associated with the PageRank algorithm (Page 2001).
By standard Markovian theory (e.g. Grinstead and Snell 1998), each document belongs to a single "communication-based" equivalence class, which is either transient, i.e. consisting of documents eventually unattainable by lack of incoming hyperlinks, or recurrent, i.e. consisting of documents visited again and again once the chain has entered the class. The chain is regular iff it is aperiodic and consists of a single recurrent class, in which case its evolution converges to the stationary distribution $s$ of $\Phi$ obeying $\sum_g s_g \varphi_{gh} = s_h$, which differs in general from the document weights $\rho_g = n_{\bullet g}/n_{\bullet\bullet}$.
In the regular case, textual autocorrelation for type dissimilarities (section 2.3) can be computed by means of (7), where (compare with (12))

$$\epsilon_{(ag)(ah)} := \frac{1}{2}\,\frac{n_{ag}\, n_{ah}}{n_{\bullet g}\, n_{\bullet h}}\left[\tau_{(ag)h}\, s_g + \tau_{(ah)g}\, s_h\right]$$

and

$$\pi_a = \epsilon_{(a\bullet)(\bullet\bullet)} = \sum_g \frac{n_{ag}}{n_{\bullet g}}\, s_g \;\neq\; \frac{n_{a\bullet}}{n_{\bullet\bullet}} \qquad \text{(hypertextual navigation)}.$$
Toy example 2, continued: let the active terms be $A = \{\alpha, \beta, \gamma, \delta\}$, with hyperlinks $\alpha[\alpha] = 1$, $\alpha[\beta] = 2$, $\alpha[\gamma] = 3$ and $\alpha[\delta] = 4$ (Figure 5). The transition matrix (15) turns out to be regular. From (14), the document-to-document transition matrix, its stationary distribution and the document weights are

$$\Phi = \begin{pmatrix} 0 & 1/2 & 1/4 & 1/4 \\ 1/2 & 1/4 & 1/4 & 0 \\ 1/2 & 1/2 & 0 & 0 \\ 1/2 & 0 & 0 & 1/2 \end{pmatrix} \qquad s = \begin{pmatrix} 1/3 \\ 1/3 \\ 1/6 \\ 1/6 \end{pmatrix} \qquad \rho = \begin{pmatrix} 1/5 \\ 1/5 \\ 1/5 \\ 2/5 \end{pmatrix}.$$
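As a concreteness check, the construction (15) and the stationary distribution can be reproduced numerically. The following base-R sketch (variable names and the power-iteration shortcut are our own choices, not part of the original formalism) rebuilds $\Phi$ from the term-document matrix (14) and the hyperlink map:

```r
n <- matrix(c(0, 2, 2, 4,   # alpha
              2, 0, 2, 0,   # beta
              1, 1, 0, 0,   # gamma
              1, 0, 0, 0,   # delta
              0, 1, 0, 1,   # epsilon
              0, 0, 0, 2,   # zeta
              0, 0, 0, 1),  # eta
            nrow = 7, byrow = TRUE,
            dimnames = list(c("alpha", "beta", "gamma", "delta",
                              "epsilon", "zeta", "eta"), NULL))
target <- c(alpha = 1, beta = 2, gamma = 3, delta = 4)  # hyperlinks g = alpha[a]
m <- ncol(n)
Phi <- matrix(0, m, m)
for (g in 1:m) for (a in rownames(n)) {
  h <- if (a %in% names(target)) target[[a]] else g  # jump to target, or stay
  Phi[g, h] <- Phi[g, h] + n[a, g] / sum(n[, g])     # equation (15)
}
s <- rep(1 / m, m)              # stationary distribution by power iteration
for (i in 1:200) s <- drop(s %*% Phi)
round(s, 4)                     # 0.3333 0.3333 0.1667 0.1667, as displayed above
```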
The document magnification factor $s_g/\rho_g$ equals 1.67, 1.67, 0.83 and 0.42 for the 4 documents of toy example 2. Similarly, the term magnification factor $n_{\bullet\bullet}\pi_a/n_{a\bullet}$ is 1.04 for $\alpha$, 0.83 for $\beta$, 1.67 for $\gamma$, 1.67 for $\delta$, 1.04 for $\epsilon$, 0.42 for $\zeta$ and 0.42 for $\eta$.
Fig. 5: Hypertextual navigation (toy example 2) between $m = 4$ documents containing $|A| = 4$ active terms $\alpha$, $\beta$, $\gamma$, and $\delta$.
Table 2: (Toy example 2) terms autocorrelation is positive under free within-document navigation, but negative under hypertextual navigation. Here $E_0(\delta)$ refers to the term permutation test (8).
Table 2 summarizes the resulting textual autocorrelation for the three term dissimilarities already investigated: systematically, hyperlink navigation strongly increases the heterogeneity of terms as measured by the local inertia, since each of the active terms $\alpha$, $\beta$, $\gamma$ and $\delta$ points towards documents not containing it.
Textual autocorrelation in "WikiTractatus": strict application of the above formalism to real data requires full knowledge of a finite, isolated network of $m$ hyperlinked documents.
Fig. 6: Entry page of "WikiTractatus", constituting the document "les mots" (left). Free-within navigation weights $\rho_g$ versus hypertextual weights $s_g$, logarithmic scale (right).
Semantic similarities have been systematically investigated in the last two decades, using in particular reference word taxonomies expressing "ontological" relationships (e.g. Resnik 1999). In WordNet (Miller et al. 1990), words, and in particular nouns and verbs, are grouped into synsets, i.e. sets of cognitive synonyms, and each synset represents a different concept. Hyponymy expresses inclusion between concepts: the relation "concept $c_1$ is an instance of concept $c_2$" is denoted $c_1 \leq c_2$, and $c_1 \vee c_2$ represents the least general concept subsuming both $c_1$ and $c_2$. For instance, in the toy ontology of Figure 7, cat $\leq$ animal and cat $\vee$ dog = animal.
Fig. 7: Toy noun ontology made up of 7 concepts: numbers in bold are probabilities (16), numbers in italic are similarities (17), and the underlined number is the dissimilarity between bicycle and car according to (18).
Based on a reference corpus (hereafter the Brown corpus, Kučera and Francis 1967), the probability $p(c)$ of concept $c$ can be estimated as the proportion of word tokens whose sense $C(w)$ is an instance of concept $c$. Thus, representing the number of occurrences of word $w$ by $n(w)$,

$$p(c) := \frac{\sum_w n(w)\,\mathbf{1}(C(w) \leq c)}{\sum_w n(w)} \qquad (16)$$
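The estimate (16) amounts to propagating token counts up the hyponymy tree. A minimal sketch in R, on an invented fragment of the toy ontology of Figure 7 (the token counts are made up for illustration, not Brown-corpus values):

```r
parent <- c(cat = "animal", dog = "animal", bicycle = "vehicle",
            car = "vehicle", animal = "entity", vehicle = "entity")
n_w <- c(cat = 10, dog = 6, bicycle = 3, car = 1)   # token counts n(w)
concepts <- unique(c(names(parent), parent))
p <- setNames(numeric(length(concepts)), concepts)
for (w in names(n_w)) {        # each token of w counts for its sense C(w)
  c_ <- w                      # and for every concept subsuming it
  while (!is.na(c_)) {
    p[c_] <- p[c_] + n_w[[w]]
    c_ <- if (c_ %in% names(parent)) parent[[c_]] else NA
  }
}
p <- p / sum(n_w)    # equation (16); the root concept gets p(entity) = 1
round(p, 2)
```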
According to TreeTagger (Schmid 1994), the short story The Masque of the Red Death by Edgar Allan Poe (1842) contains 497 positions occupied by nouns and 379 positions occupied by verbs. Similarities between nouns and between verbs can be obtained using the WordNet::Similarity interface (Pedersen et al. 2004), systematically using, in this case study, the most frequent sense of ambiguous concepts. Autocorrelation indices (for neighbourhoods of size $r$) calculated using the corresponding dissimilarities exhibit no noticeable pattern (Figure 8).
Fig. 8: Autocorrelation index δ[r] (same setting as Figure 3) in "The Masque of the Red Death"
for the semantic dissimilarity (18) for nouns (left) and for verbs (right).
This being said, the $p$-dimensional coordinates $x_a$ entering any squared Euclidean distance $D_{ab} = \|x_a - x_b\|^2$ can be recovered by (weighted) multidimensional scaling (MDS) (e.g. Torgeson 1958; Mardia et al. 1979), yielding orthogonal factorial coordinates $x_{a\alpha}$.
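A bare-bones illustration of this coordinate recovery in base R, using the matrix $D_C$ of toy example 2 as a stand-in for the semantic dissimilarities (unweighted classical MDS, not the weighted variant used in the case study):

```r
D <- matrix(c(   0,  7.5, 12.5, 22.5, 12.5, 12.5, 22.5,
               7.5,    0,   15,   25,   15,   15,   25,
              12.5,   15,    0,   30,   20,   20,   30,
              22.5,   25,   30,    0,   30,   30,   40,
              12.5,   15,   20,   30,    0,   20,   30,
              12.5,   15,   20,   30,   20,    0,   30,
              22.5,   25,   30,   40,   30,   30,    0),
            nrow = 7, byrow = TRUE)
x <- cmdscale(sqrt(D), k = 2)    # cmdscale() squares its input internally,
                                 # so a squared Euclidean D is passed as sqrt(D)
round(as.matrix(dist(x))^2, 1)   # approximates D up to the inertia kept in 2 dims
```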
Fig. 9: Screeplot for the MDS on semantic dissimilarities for nouns (left). Factorial coordinates $x_{a\alpha}$ and proportion of explained inertia for $\alpha = 1, 2$ (right).
Fig. 10: Autocorrelation index δ[r] for nouns in the first semantic dimension (left) and in the sec-
ond semantic dimension (right).
The first semantic coordinate for nouns in The Masque of the Red Death (Figure 9)
clearly contrasts abstract entities such as horror, pestilence, disease, hour, mean,
night, vision, or precaution, on the left, with physical entities such as window, roof,
wall, body, victim, glass, or visage, on the right, respectively defined in WordNet
as "a general concept formed by extracting common features from specific exam-
ples" and "an entity that has physical existence". Figure 10 (left) shows this first
coordinate to be strongly autocorrelated, echoing long-range semantic persis-
tence, in contrast to the second coordinate (Figure 10 right), whose interpretation
is more difficult.
Fig. 11: Screeplot for the MDS on semantic dissimilarities for verbs (left). Factorial coordinates $x_{a\alpha}$ and proportion of explained inertia for $\alpha = 1, 2$ (right).
Fig. 12: Autocorrelation index δ[r] for verbs in the first semantic dimension (left) and in the second semantic dimension (right).
For verbs (Figure 11), the first semantic coordinate differentiates stative verbs,
such as be, seem, or sound, from all other verbs, while the second semantic coor-
dinate differentiates the verb have from all other verbs. Figure 12 reveals that the
first coordinate is strongly autocorrelated, while the second coordinate is nega-
tively autocorrelated for neighbourhood ranges up to 2. Although the latter result
is not significant for $\alpha = 0.05$ according to (6), it is likely due to the use of have
as an auxiliary verb in past perfect and other compound verb tenses.
4 Conclusions
In this contribution, we have introduced a unified formalism for textual autocor-
relation, i.e. the tendency for neighbouring textual positions to be more (or less)
similar than randomly chosen positions. This approach to sequence and text
analysis is based on two primitives: (i) neighbourhoodness between textual posi-
tions, as determined by a Markov model of navigation, and formally represented
by the exchange matrix $E$; and (ii) (dis-)similarity between positions, as encoded in the (typically squared Euclidean) dissimilarity matrix $D$.
By varying $E$ and/or $D$, the proposed formalism recovers and revisits well-known statistical objects and concepts, such as the $F$-ratio, the chi-square and
Correspondence Analysis. It also gives a unified account of various representa-
tions commonly used for textual data analysis, in particular the sequential and
bag-of-words models, as well as the term-document matrix. It can also be ex-
tended to provide a model of hypertext navigation, where hyperlinks act as mag-
nifying (or reducing) glasses, modifying the relative weights of documents, and
altering (or not) textual autocorrelation.
This approach is applicable to any form of sequence and text analysis that
can be expressed in terms of dissimilarity between positions (or between types).
The presented case studies have aimed at illustrating this versatility by address-
ing lexical, morphosyntactic, and semantic properties of texts. As shown in the
latter case, squared Euclidean dissimilarities can be visualised and decomposed
into factorial components by multidimensional scaling; the textual autocorrela-
tion of each component can in turn be analysed and interpreted – yielding in par-
ticular a new means of dealing with semantically related problems.
References
Anselin, Luc. 1995. Local indicators of spatial association. Geographical Analysis 27(2). 93–
115.
Bavaud, François. 2013. Testing spatial autocorrelation in weighted networks: The modes per-
mutation test. Journal of Geographical Systems 15(3). 233–247.
Bavaud, François, Christelle Cocco & Aris Xanthos. 2012. Textual autocorrelation: Formalism and illustrations. In Anne Dister, Dominique Longrée & Gérald Purnelle (eds.), 11èmes journées internationales d'analyse statistique des données textuelles, 109–120. Liège: Université de Liège.
Bavaud, François & Aris Xanthos. 2005. Markov associativities. Journal of Quantitative Linguis-
tics 12(2-3). 123–137.
Cliff, Andrew D. & John K. Ord. 1981. Spatial processes: Models & applications. London: Pion.
Cressie, Noel A.C. 1991. Statistics for spatial data. New York: Wiley.
Greenacre, Michael. 2007. Correspondence analysis in practice, 2nd edn. London: Chapman
and Hall/CRC Press.
Grinstead, Charles M. & J. Laurie Snell. 1998. Introduction to probability. American Mathemati-
cal Society.
Kučera, Henry & W. Nelson Francis. 1967. Computational analysis of present-day American
English. Providence: Brown University Press.
Lebart, Ludovic. 1969. Analyse statistique de la contigüité. Publication de l'Institut de Statis-
tiques de l'Université de Paris 18. 81–112.
Le Roux, Brigitte & Henry Rouanet. 2004. Geometric data analysis. Kluwer: Dordrecht.
Mardia, Kanti V., John T. Kent & John M. Bibby. 1979. Multivariate analysis. New York: Aca-
demic Press.
Miller, George A., Richard Beckwith, Christiane Fellbaum, Derek Gross & Katherine Miller.
1990. WordNet: An on-line lexical database. International Journal of Lexicography 3(4).
235–244.
Moran, Patrick Alfred Pierce. 1950. Notes on continuous stochastic phenomena. Biometrika 37(1-2). 17–23.
Ourednik, André. 2010. Wikitractatus. http://wikitractatus.ourednik.info/ (accessed December
2012).
Page, Lawrence. 2001. Method for node ranking in a linked database. U.S. Patent No
6,285,999.
Pedersen, Ted, Siddharth Patwardhan & Jason Michelizzi. 2004. WordNet::Similarity – Measuring the relatedness of concepts. In Susan Dumais, Daniel Marcu & Salim Roukos (eds.), Proceedings of HLT-NAACL 2004: Demonstration Papers, 38–41. Boston: Association for Computational Linguistics.
Resnik, Philip. 1999. Semantic similarity in a taxonomy: An information-based measure and its
application to problems of ambiguity in natural language. Journal of Artificial Intelligence
Research 11. 95–130.
Schmid, Helmut. 1994. Probabilistic part-of-speech tagging using decision trees. Proceedings
of international conference on new methods in language processing. Manchester, UK.
Torgeson, Warren S. 1958. Theory and methods of scaling. New York: Wiley.
Martina Benešová and Radek Čech
Menzerath-Altmann law versus random model
1 Introduction
The Menzerath-Altmann law belongs among the most used and best empirically corroborated linguistic laws. It was first enunciated by Menzerath (1928), later given its mathematical formulation by Altmann (1980), and has recently been shown through the work of Hřebíček (1995), Andres (2010) and Andres et al. (2012) to be closely related to the fractal quality which can be observed in texts. In earlier experiments examining the fractality of texts, we observed (Andres & Benešová 2011; Benešová 2011) that the results obtained differ significantly depending on the manner in which the units of segmentation are chosen and defined; for example, the word can be grasped graphically as a sequence of graphemes between two blanks in a sentence, or regarded as a sound unit, a semantic unit, and so on. Importantly, although all the segmentation methods used are substantiated linguistically, the differences in the results (e.g., in the value of parameter b in formula (1)) are so striking that, in our opinion, an analysis of the relationship between the segmentation and the character of the model representing the Menzerath-Altmann law is strongly needed.
As a first step, we decided to scrutinize this relationship by using a random model of data building based on an original text sample. Surprisingly enough, despite the fact that testing models representing real systems of any kind against random models is quite usual in empirical science, the Menzerath-Altmann law has been tested against a random model only once in linguistics (Hřebíček 2007), to our knowledge¹. However, that test is not supported by a sufficient number of experiments. In sum, we pursue two aims. Firstly, the validity (or non-validity) of the Menzerath-Altmann law in randomly segmented text will be tested; secondly, because three random models are used (cf. Section 3), the character of the results of all models with regard to the real text will be analyzed.
1 In biology, Baixeries et al. (2012) tested the Menzerath-Altmann law by a random model.
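The definitions that follow presuppose the law in its usual truncated two-parameter power form; presumably this is the formula (1) referred to below:

$$y = A \cdot x^{-b} \qquad (1)$$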
where x is the length of the construct measured in the number of its constituents (in our experiments, the length of sentences measured in the number of their clauses; x ∈ ℕ), y is the average length of the constituents measured in the units of the immediately lower language level, i.e. in the constituent's constituents (in our experiment, the average length of clauses measured in the number of words they consist of; y ∈ ℚ), and A, b are positive real parameters. Graphically, parameter A determines how far from the x-axis the graph representing the particular MAL realization is positioned. Parameter b, however, is responsible firstly for the steepness of the curve and secondly, more essentially, for its falling or rising tendency (for b > 0 or b < 0, respectively). As the MAL expresses an inversely proportional relation between the length of the construct and the length of its constituents, parameter b is required to be a positive real number so that it reflects the substance of the MAL. According to Kelih (2010), parameter b in the MAL can clearly depend on the text type.
3 Random Models
Despite the fact that the MAL is considered to be one of the best corroborated linguistic laws, one can ask whether the relationships between the units engaged in the MAL are not the result of a random process. If so, the law would be put into question and, obviously, the observed relationships between constructs and constituents would have to be explained differently. The most striking example of the use of a random model and its impact on the explanation of language properties is Mandelbrot's (1953) and Miller's (1957) interpretations of Zipf's law, which triggered a debate (still ongoing) about the validity/non-validity of Zipf's law, on the one hand, and about the character of the random model, on the other (e.g., Miller, Chomsky 1963; Li 1992; Cohen et al. 1997; Ferrer i Cancho, Solé 2002; Mitzenmacher 2003; Ferrer i Cancho, Elvevåg 2010). Further, random models have also played an important role in the theory of complex networks (cf. Barabási, Albert 1999; Newman 2011), which has been used for the analysis of language, too (cf. Ferrer i Cancho 2010; Bibliography on Linguistic and Cognitive Networks²).
The main problem with the use of random models lies in the determination of randomness. Like any property of reality, randomness is not simply "given"; it is a concept which is defined with regard to the goal of the analysis under consideration. Consequently, there is no single "right" definition of randomness; it is always a matter of critical discussion and, therefore, any random model must be properly justified and described in detail. In our study, three random models are used; their characteristics are presented in the next section.
2 http://www.lsi.upc.edu/~rferrericancho/linguistic_and_cognitive_networks.html
clauses; the length of five clauses equals 6, the length of three clauses equals 5, and the length of two clauses equals 3. The relative frequency of clauses with length 6 is thus 0.5, with length 5 it is 0.3, and with length 3 it is 0.2.) Furthermore, the length of each clause is computed randomly as follows: for each clause, the number expressing its length is generated randomly with the probabilities given by the observed relative frequencies.
The second model (M2) is a slight modification of M1. Although the probabilities of the individual clause lengths in M1 correspond to those in the real text, the total number of words generated in M1 does not, i.e., the sum of the clause lengths in M1 is not equal to the total number of words in the real text. Therefore, the algorithm of M2 follows that of M1, extended by a condition which requires that the sum of the clause lengths in M2 correspond to the total number of words in the real text (technically, a loop is added to the program which repeats the generation of random numbers until the condition is fulfilled).
The third model (M3) is created by the following procedure³: firstly, one word is assigned to each clause, i.e., all clauses have length one; then, in each step another word is added to a randomly chosen clause, i.e., its length is increased by one. This procedure ends when the total number of words in the real text has been used up in the model. For the generation of random numbers, the free statistical software R (http://www.r-project.org/) was used.
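A compact sketch of all three generators in R (the clause-length vector below is the ten-clause illustration from the text above, not Havel's data; variable names are our own):

```r
len <- c(6, 6, 6, 6, 6, 5, 5, 5, 3, 3)        # observed clause lengths
tab  <- table(len)
vals <- as.numeric(names(tab))
prob <- as.numeric(tab) / length(len)         # empirical length probabilities

m1 <- sample(vals, length(len), replace = TRUE, prob = prob)  # model M1

m2 <- m1                                      # model M2: regenerate until the
while (sum(m2) != sum(len))                   # total word count matches the text
  m2 <- sample(vals, length(len), replace = TRUE, prob = prob)

m3 <- rep(1, length(len))                     # model M3: one word per clause,
for (i in seq_len(sum(len) - length(len))) {  # then distribute the remaining
  j <- sample(length(m3), 1)                  # words over random clauses
  m3[j] <- m3[j] + 1
}
```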
As for the language material, Václav Havel’s essay Moc bezmocných (The
Power of the Powerless)4 was analyzed. The text was segmented into sentences
and clauses manually.
5 Results
As mentioned above, the initial input data, which was later processed in a random way, came from Václav Havel's essay Moc bezmocných (The Power of the Powerless). Table 1 shows the figures which result from quantifying the original sample text.
3 This model was proposed by Ján Mačutek; we appreciate that he allowed us to use it in our
study and we thank him for a fruitful discussion and help with programming.
4 Available at: http://vaclavhavel.cz/showtrans.php?cat=eseje&val=2_eseje.html&typ=HTML
Table 1: The original sample text (Václav Havel’s essay The Power of the Powerless): x the
length of sentences (in clauses) – z their frequency – y the mean length of clauses (in words).
x	z	y
1	408	12.6373
2	260	9.5481
3	176	8.7992
4	94	8.8005
5	57	9.0140
6	20	7.6583
7	14	7.9388
8	4	10.2813
9	7	6.6984
10	4	8.1250
11	2	7.0000
14	1	9.0000
The data shown in Table 1⁵ was processed mathematically and statistically, as demonstrated e.g. in Andres et al. (2012). We obtained the parameters of the MAL: A = 11.9311 and b = 0.2235⁶ (for more information on MAL parameters cf. e.g. Cramer 2005). The value of parameter b shows that the empirical data of this particular sample behaves in accordance with the original assumption of the MAL, i.e. the relation between the length of constructs (sentences) and the length of their constituents is inversely proportional. The graph in Figure 1 is presented as evidence.
5 The data points with x≥8 were omitted in further data processing due to their low frequency
compared to the frequencies of other data points.
6 For estimating all the values of the MAL parameters A and b in this paper, linear regression and the method of least squares were used. The data was processed and the outcomes obtained by means of the free statistical software R.
7 All the confidence intervals in this paper were calculated by means of the free statistical soft-
ware R.
Fig. 1: Plotted data points of the observations presented in Table 1 and the graphical display of
the goodness-of-fit of the MAL model (R2=0.8737).
As is obvious from the above, the empirical input data follows the MAL in the outlined manner. Additionally, the data in Table 1 was randomly adjusted to obtain models M1, M2 and M3. To obtain a greater variety of data, we constructed five different random samples per model, cf. Tables 4a–e, 5a–e and 6a–e in the Appendix. The results (parameters A, b and the coefficients of determination R²) are presented in Tables 2a, 2b and 2c, respectively.
Table 2a: Parameters A, b and the coefficients of determination R2 extracted from the random
samples constructed in terms of Model 1.
Table 2b: Parameters A, b and the coefficients of determination R2 extracted from the random
samples constructed in terms of Model 2.
Table 2c: Parameters A, b and the coefficients of determination R2 extracted from the random
samples constructed in terms of Model 3.
It is obvious that the MAL assumption is satisfied, in terms of a positive parameter b, only in the following cases: M1 – random sample 3; M2 – random samples 2, 3, 4, 5; M3 – random sample 4. Their parameters b are all positive real numbers and, in addition, the tendency of the relation between the construct length and the length of its constituents is inversely proportional. However, only in two random samples do the respective coefficients of determination come close to or exceed fifty percent, namely M2 – random samples 2 and 4, with R² = 0.5196 and R² = 0.4771. Even these values are still lower than in the original real text sample. Figures 2a and 2b illustrate the two random samples with the highest and lowest degrees of goodness-of-fit.
We can also compute the confidence intervals for these random samples and check in this way whether they really satisfy the MAL assumption with 95% certainty. The confidence intervals are displayed in Table 3⁷; only those cases whose parameters b are positive real numbers are listed. The confidence intervals prove that the MAL assumption is not satisfied with 95% certainty, because the lower bound extends to negative real numbers in all the random samples.
Table 3: The random samples with b>0 supplied with the respective coefficients of determina-
tion and confidence intervals.
Model	random sample	coefficient of determination R²	confidence interval
1	3	0.4340	〈−0.0521; 0.0758〉
6 Conclusion
The results of the experiment reveal that the data generated by the random models does not fulfil the MAL. Consequently, the results can be viewed as another argument supporting the assumption that the MAL expresses one of the important mechanisms controlling human language behaviour.
Secondly, we wanted to explore which method of random modelling constructs data closest to the real text in terms of the MAL. We found that the largest number of models with properties close to the original text sample in terms of the MAL is produced by M2; this result is not very surprising, because M2 shares more characteristics with the original text than the other two random models.
Acknowledgments
Martina Benešová’s and Radek Čech's contribution is supported by the project
CZ.1.07/2.3.00/30.0004 POSTUP and the project Linguistic and lexicostatistic
analysis in cooperation of linguistics, mathematics, biology and psychology,
grant no. CZ.1.07/2.3.00/20.0161, which is financed by the European Social Fund
and the National Budget of the Czech Republic, respectively.
References
Altmann, Gabriel. 1980. Prolegomena to Menzerath’s law. In Rüdiger Grotjahn (ed.), Glot-
tometrika 2, 1–10. Bochum: Brockmeyer.
Andres, Jan. 2010. On a conjecture about the fractal structure of language. Journal of Quantita-
tive Linguistics 17(2). 101–122.
Andres, Jan & Martina Benešová. 2011. Fractal analysis of Poe’s Raven. Glottometrics 21. 73–
100.
Andres, Jan, Martina Benešová, Lubomír Kubáček & Jana Vrbková. 2012. Methodological note on the fractal analysis of texts. Journal of Quantitative Linguistics 19(1). 1–31.
Baixeries, Jaume, Antoni Hernández-Fernández & Ramon Ferrer-i-Cancho. 2012. Random models of Menzerath-Altmann law in genomes. BioSystems 107(3). 167–173.
Barabási, Albert-László & Réka Albert. 1999. Emergence of Scaling in Random Networks. Sci-
ence 286(5439). 509-512.
Benešová, Martina. 2011. Kvantitativní analýza textu se zvláštním zřetelem k analýze fraktální
[Quantitative analysis of text with special respect to fractal analysis]. Olomouc: Palacký
University dissertation.
Cohen, Avner, Rosario N. Mantegna & Shlomo Havlin. 1997. Numerical analysis of word fre-
quencies in artificial and natural language texts. Fractals 5(1). 95–104.
Cramer, Irene M. 2005. The parameters of the Altmann-Menzerath law. Journal of Quantitative
Linguistics 12(1). 41–52.
Ferrer i Cancho, Ramon. 2010. Network theory. In Patrick Colm Hogan (ed.), The Cambridge en-
cyclopaedia of the language sciences, 555–557. Cambridge: Cambridge University Press.
Ferrer i Cancho, Ramon & Ricard V. Solé. 2002. Zipf’s law and random texts. Advances in Com-
plex Systems 5(1). 1–6.
Ferrer i Cancho, Ramon & Brita Elvevåg. 2010. Random texts do not exhibit the real Zipf’s law-
like rank distribution. PLoS ONE 5(3). e9411.
Hřebíček, Luděk. 1995. Text levels. Language constructs, constituents and the Menzerath-Alt-
mann law. Trier: WVT.
Hřebíček, Luděk. 2007. Text in semantics. The principle of compositeness. Praha: Oriental Insti-
tute of the Academy of Sciences of the Czech Republic.
Kelih, Emmerich. 2010. Parameter interpretation of the Menzerath law: Evidence from Serbian.
In Peter Grzybek, Emmerich Kelih & Ján Mačutek (eds.), Text and Language. Structures,
functions, interrelations, quantitative perspectives, 71–79. Wien: Praesens.
Menzerath-Altmann law versus random model 67
Li, Wentian. 1992. Random texts exhibit Zipf's-law-like word frequency distribution. IEEE Transactions on Information Theory 38(6). 1842–1845.
Mandelbrot, Benoit. 1953. An informational theory of the statistical structure of language. In
Willis Jackson (ed.), Communication theory, 486–504. London: Butterworths.
Menzerath, Paul. 1928. Über einige phonetische Probleme. In Actes du premier congrés inter-
national de linguistes, 104–105. Leiden: Sijthoff.
Miller, George A. 1957. Some effects of intermittent silence. The American Journal of Psychol-
ogy 70(2). 311–314.
Miller, George A. & Noam Chomsky. 1963. Finitary models of language users. In R. Duncan
Luce, Robert R. Bush & Eugene Galanter (eds.), Handbook of mathematical psychology,
419–491. New York: Wiley.
Mitzenmacher, Michael. 2003. A brief history of generative models for power law and lognor-
mal distributions. Internet Mathematics 1(2). 226–251.
Appendix
Table 4: Model 1 – random samples 1–5: x the length of sentences (in clauses) – z their frequency – y the mean length of clauses (in words).
Table 5: Model 2 – random samples 1–5: x the length of sentences (in clauses) – z their frequency – y the mean length of clauses (in words).
Table 6: Model 3 – random samples 1–5: x the length of sentences (in clauses) – z their frequency – y the mean length of clauses (in words).
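Radek Čech
Text length and the lambda frequency structure of a text

The indicator discussed below is based on the arc length of the rank-frequency sequence; in its usual form (a sketch of the formula presupposed by the definition that follows, with notation taken from the surrounding text),

$$L = \sum_{r=1}^{V-1} \sqrt{1 + (f_r - f_{r+1})^2}$$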
where $f_r$ are the ordered absolute word frequencies ($r = 1, 2, \dots, V-1$) and $V$ is the highest rank (= vocabulary size), i.e. $L$ consists of the sum of Euclidean distances between ranked frequencies. The frequency structure is measured by the so-called lambda indicator
$$\Lambda = \frac{L \cdot \log_{10} N}{N} \qquad (2)$$
The authors state that this indicator successfully eliminates the impact of
text length, meaning that it can be used for a comparison of genre, authorship,
authorial development, language typology and so on; an analysis of 1185 texts
in 35 languages appears to prove the independence between lambda and text
length (cf. Popescu, Čech, Altmann 2011, p. 10 ff.).
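A minimal computation of (2) in R, on invented ranked frequencies rather than a real text:

```r
f <- c(12, 7, 5, 3, 2, 1, 1, 1)   # ranked word frequencies f_r
N <- sum(f)                       # text length in tokens
L <- sum(sqrt(1 + diff(f)^2))     # arc length: Euclidean distances between
                                  # neighbouring points (r, f_r)
Lambda <- L * log10(N) / N        # equation (2)
```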
However, if particular languages are analyzed separately, a dependence of
lambda on text length emerges, as presented in this study. Moreover, this study
reveals that the relationship between lambda and text length is not straightfor-
ward, unlike the relationship between text length and type-token ratio. Further,
the study presents a method for the empirical determination of the interval in
which lambda should be independent of text length; within this interval lambda
could be used for its original purpose, i.e. comparison of genre, authorship,
authorial development etc. (cf. Popescu et al. 2011).
Fig. 1: The relationship between lambda of individual chapters and their length in the novel The
Fateful Adventures of the Good Soldier Švejk During the World War written by Jaroslav Hašek.
Table 1: The length and lambda of individual chapters in the novel The Fateful Adventures of the
Good Soldier Švejk During the World War written by Jaroslav Hašek
Fig. 2: The relationship between lambda of individual chapters and their length in the novel
Oliver Twist written by Charles Dickens.
Table 2: The length and lambda of individual chapters in the novel Oliver Twist written by
Charles Dickens.
Fig. 3: The relationship between lambda of individual chapters and cumulative length in the
novel The Fateful Adventures of the Good Soldier Švejk During the World War written by Jaroslav
Hašek.
Table 3: The cumulative length and lambda of individual chapters in the novel The Fateful
Adventures of the Good Soldier Švejk During the World War written by Jaroslav Hašek.
Fig. 4: The relationship between lambda of individual chapters and cumulative length in the
novel Oliver Twist written by Charles Dickens.
Table 4: The cumulative length and lambda of individual chapters in the novel Oliver Twist
written by Charles Dickens.
The longer the text, the more the writer loses his subconscious control over some propor-
tions and keeps only the conscious control over contents, grammar, his aim, etc. But as
soon as parts of control disappear, the text develops its own dynamics and begins to abide
by some laws which are not known to the writer but work steadily in the background. The
process is analogous to that in physics: if we walk, we consider our activity as something
normal; but if we stumble, i.e. lose the control, gravitation manifests its presence and we
fall. That means, gravitation does not work ad hoc in order to worry us maliciously, but it
is always present, even if we do not realize it consciously. In writing, laws are present,
too, and they work at a level which is only partially accessible. One can overcome their
working, but one cannot eliminate them. On the other hand, if the writer slowly loses his
control of frequency structuring, a new order begins to arise by self-organization or by
some not perceivable background mechanism.
ing, eating, traveling), one’s mental state is probably changed; even if the au-
thor repeatedly reads the text, this change influences the character of frequency
structuring. In short, I would like to emphasize the difference between homoge-
neous text types (e.g. a personal letter, e-mail, poem, or short story) which are
written relatively continuously, on the one hand, and long texts written under
different circumstances, on the other.
Fig. 5: The relationship between lambda and text length in Czech (610 texts).
Fig. 6: The relationship between lambda and text length in English (218 texts).
In sum, it can be assumed that in the case of texts that are too short or too long, the author cannot control the frequency structure. So, the task is to find the interval of text length within which lambda can be considered independent of N.
Table 5: The mean lambdas of particular intervals of N in Czech; 615 texts were used for the
analysis.
Table 6: The mean lambdas of particular intervals of N in English; 218 texts were used for the
analysis.
For instance, for the difference between the first and the second interval in Czech one obtains

$$u = \frac{1.6950 - 1.5337}{\sqrt{0.000135 + 0.000206}} = 8.73$$
Because multiple u-tests on subsequent intervals are performed (which inflates the probability of a Type I error), the Bonferroni correction is used for an appropriate adjustment (cf. Miller 1981). Specifically, the critical value for a rejection of the null hypothesis is determined as follows:

$$p_i = \frac{\alpha/2}{n} \qquad (4)$$

where α is the significance level and n is the number of performed tests. For the significance level α = 0.05 and n = 13 (we perform 13 tests, cf. Tables 7 and 8) we get the corrected critical value u = 2.89. Thus we can state that there is a significant difference between the first and the second interval (cf. Table 5) at the significance level α = 0.05. All results are presented in Tables 7 and 8 and graphically in Figures 7 and 8, where subsequent intervals with non-significant differences are linked.
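The critical value can be checked in a few lines of R (a quick numerical companion to (4), nothing more):

```r
u      <- (1.6950 - 1.5337) / sqrt(0.000135 + 0.000206)  # 8.73
u_crit <- qnorm(1 - (0.05 / 2) / 13)   # two-sided, Bonferroni-corrected: 2.89
u > u_crit                             # TRUE: the first two intervals differ
```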
Table 7: The differences between subsequent intervals in Czech. The values of significant differences (at the significance level α = 0.05, corrected critical value u = 2.89) are boldfaced.
Table 8: The differences between subsequent intervals in English. The values of significant differences (at the significance level α = 0.05, corrected critical value u = 2.89) are boldfaced.
Fig. 7: The differences between subsequent intervals in Czech. Subsequent intervals with non-
significant differences are connected by the lines. The line does not mean that all the intervals
inside the line are not different; the line expresses non-significant differences only between
pairs of subsequent intervals.
Fig. 8: The differences between subsequent intervals in English. Subsequent intervals with
non-significant differences are connected by the lines. The line does not mean that all the
intervals inside the line are not different; the line expresses non-significant differences only
between pairs of subsequent intervals.
For an evaluation of the arc length L (cf. Section 1), it is first necessary to derive the theoretical maximum of the arc length, which is given as

$$L_{max} = (V - 1) + (f(1) - 1) \leq N - 1$$

where V is the number of word types (i.e. the maximum rank) and f(1) is the maximum frequency. The theoretical maximum of lambda is

$$\Lambda_{max} = \frac{L_{max}}{N}\,\log(N) = \left(\frac{N-1}{N}\right)\log(N) = \left(1 - \frac{1}{N}\right)\log(N) \qquad (6)$$

Since practically $N \gg 1$, $\Lambda_{max} \approx \log(N)$.
The comparison of the maximum theoretical value of lambda with the empirical findings is presented in Figures 9 and 10. It is clear that shorter texts are closer to the maximum than longer ones. Moreover, with increasing text length the difference between Λmax and Λ grows. Obviously, this result reveals that with
increasing text length the language user's ability to control the frequency struc-
ture of the text decreases, and consequently it is governed by some self-
regulating mechanism (cf. Popescu et al. 2012).
Fig. 9: Comparison of the maximum theoretical value of lambda (line) and the empirical find-
ings (dots) in Czech.
Fig. 10: Comparison of the maximum theoretical value of lambda (line) and the empirical find-
ings (dots) in English.
6 Conclusion
The study revealed the evident dependence of the lambda indicator on text
length. Consequently, this finding undermines the main methodological ad-
vantage of the lambda measurement and casts doubt upon many of the results
presented by Popescu et al. (2011). However, the specific relationship between
lambda and the text length, which is expressed graphically by a concave bow
(cf. Section 3), allows us to determine the interval of the text length in which the
lambda measurement is meaningful and can be used for original purposes.
Moreover, this determination is connected to theoretical reasoning which may
be enhanced by psycholinguistic or cognitive explanations.
To summarize, the lambda indicator can on the one hand be ranked among
all previous attempts which have tried unsuccessfully to eliminate the impact of
text length; on the other hand, however, its specificity means that the method
offers a potential use in comparisons of texts. Of course, only further analyses
can reveal the potential meaningfulness or meaninglessness of this method.
Acknowledgments
This study was supported by the project Linguistic and lexicostatistic analysis
in cooperation of linguistics, mathematics, biology and psychology, grant no.
CZ.1.07/2.3.00/20.0161, which is financed by the European Social Fund and the
National Budget of the Czech Republic.
References
Covington, Michael A. & Joe D. McFall. 2010. Cutting the Gordian Knot: The Moving Average
Type-Token Ratio (MATTR). Journal of Quantitative Linguistics 17(2). 94–100.
Hess, Carla W., Karen M. Sefton & Richard G. Landry. 1986. Sample size and type-token ratios
for oral language of preschool children. Journal of Speech and Hearing Research, 29. 129–
134.
Hess, Carla W., Karen M. Sefton & Richard G. Landry. 1989. The reliability of type-token ratios
for the oral language of school age children. Journal of Speech and Hearing Research,
32(3). 536–540
Martynenko, Gregory. 2010. Measuring lexical richness and its harmony. In Peter Grzybek,
Emmerich Kelih & Ján Mačutek (eds.), Text and language. Structures, Functions, In-
terrelations, 125-132. Wien: Praesens Verlag.
Miller, Rupert G. 1981. Simultaneous Statistical Inference. Berlin, Heidelberg: Springer.
Müller, Dieter. 2002. Computing the type token relation from the a priori distribution of types.
Journal of Quantitative Linguistics 9(3). 193-214.
Popescu, Ioan-Iovitz. 2007. Text ranking by the weight of highly frequent words. In Peter
Grzybek & Reinhard Köhler (eds.), Exact methods in the study of language and text, 555-
565. Berlin-New York: de Gruyter.
Popescu, Ioan-Iovitz, Gabriel Altmann, Peter Grzybek, Bijapur Dayaloo Jayaram, Reinhard
Köhler, Viktor Krupa, Ján Mačutek, Regina Pustet, Ludmila Uhlířová & Matumnal N. Vidya.
2009. Word frequency studies. Berlin-New York: Mouton de Gruyter.
Popescu, Ioan-Iovitz, Ján Mačutek & Gabriel Altmann. 2010. Word forms, style and typology.
Glottotheory, 3. 89-96.
Popescu, Ioan-Iovitz, Radek Čech & Gabriel Altmann. 2011. The lambda-structure of texts. Lüdenscheid: RAM-Verlag.
Popescu, Ioan-Iovitz, Radek Čech & Gabriel Altmann. 2012. Some geometric properties of
Slovak poetry. Journal of Quantitative Linguistics 19(2). 121-131.
Ratkowsky, David A., Maurice H. Halstead & Linda Hantrais. 1980. Measuring vocabulary rich-
ness in literary works. A new proposal and a re-assessment of some earlier measures.
Glottometrika 2. 125-147.
Richards, Brian. 1987. Type/token ratios: what do they really tell us? Journal of Child Lan-
guage 14(2). 201-209.
Tešitelová, Marie. 1972. On the so-called vocabulary richness. Prague Studies in Mathematical
Linguistics 3. 103-120.
Text Length and the lambda frequency structure of a text 87
Tuldava, Juhan. 1995. On the relation between text length and vocabulary size. In Juhan Tulda-
va (ed.), Methods in quantitative linguistics, 131-150. Trier: WVT.
Weizman, Michael. 1971. How useful is the logarithmic type-token ratio? Journal of Linguistics
7(2). 237-243.
Wimmer, Gejza. 2005. Type-token relation. In Reinhard Köhler, Gabriel Altmann & Rajmund G. Piotrowski (eds.), Handbook of Quantitative Linguistics, 361-368. Berlin: de Gruyter.
Wimmer, Gejza & Gabriel Altmann. 1999. On vocabulary richness. Journal of Quantitative Linguistics 6(1). 1-9.
Reinhard Köhler
Linguistic Motifs
1 Introduction
Quantitative linguistics has been concerned with units, properties, and their relations mostly in a way that ignores the syntagmatic, i.e. sequential, behaviour of the objects under study. The mathematical means and models employed reflect, in their majority, mass phenomena treated as samples taken from some populations, even though texts and corpora do not possess the statistical properties which are needed for many of the common methods. Nevertheless, with some caution, good and valid results can be obtained using probability distributions, functions, differential and difference equations, etc.
The present volume gives an overview of alternative methods which can be applied if the sequential structure of linguistic expressions, in general texts, is in the focus of an investigation. Here, a recently presented new unit will be introduced in order to provide a method which can give information about the sequential organisation of a text with respect to any linguistic unit and to any of its properties, without relying on a specific linguistic approach or grammar. Moreover, the method brings with it several advantages, which will be described below.
The construction of this unit, the motif (originally called segment or sequence, cf. Köhler 2006, 2008a,b; Köhler/Naumann 2008, 2009, 2010), was inspired by the so-called F-motiv for musical "texts" (Boroda 1982). Boroda was in search of a unit which could replace, for frequency studies in musical pieces, the word as used in linguistics. Units common in musicology were not usable for his purpose, and so he defined the "F-Motiv" with respect to the duration of the notes of a musical piece.
Thus, an L-motif is a continuous series of equal or increasing length values (e.g. of morphs, words or sentences).
“Word length studies are almost exclusively devoted to the problem of distributions.”
1 It may, e.g., be appropriate to go from right to left when a language with a syntactic left-branching preference is analyzed.
holds for polytextuality motifs, only that a single text does not suffice, of course, whereas polysemy must be looked up in a dictionary. Length motifs can be determined automatically to the extent to which the writing system reflects the units in which length is to be counted. Alphabetic scripts provide good conditions for character counts, whereas syllabic scripts favour syllable counts. Special circumstances, such as those of Chinese, are also helpful if syllables are to be counted. Syntactic complexity, depth of embedding and other more complicated properties can also be used to form motifs, but determining them automatically presupposes previously annotated text material. Even if segmentation into motifs cannot be done automatically, the important advantage remains that the result does not depend on any interpretation but is objective and unambiguous.
2. Segmentation into motifs is always exhaustive, i.e. no remainder is left. The successor of a numerical value in a sequence is always (1) larger than or equal to the given value, or (2) smaller. In the first case, the successor belongs to the current motif; in the second case, it starts a new motif. The last value in a text does not cause any problem; for the first one, we only have to add a single additional rule: it starts the first motif.
3. Motifs have an appropriate granularity. They can always be operationalised
in a way that segmentation takes place in the same order of magnitude as the
phenomena under analysis.
4. Motifs are scalable with respect to granularity. One and the same definition can be applied iteratively: it is possible to form motifs on the basis of length or frequency values etc. of motifs. A closer look reveals that there are two different modes in a scaled analysis. The first level of motifs is based on a property of genuine linguistic units, e.g. word length, which is counted in terms of, e.g., syllables, phones, characters, morphs, or even given as duration in milliseconds. On the second level, on which the length of length motifs is determined, only numbers exist. Thus, the concept of length is a different one than on the first level. The length motifs of length motifs of word length, or higher levels, however, do not add new aspects. It goes without saying that corresponding differences exist if other properties or units are used to form motifs.
The scaling mechanism can be used to generate infinitely many new kinds of motifs. Thus, frequency motifs of word length motifs can be formed as well as length motifs of frequency motifs of morph polysemy motifs. This may sound confusing and arbitrary; nevertheless, a number of such cross-category motifs have been used and proved to be useful for text classification purposes (cf. Köhler, Naumann 2010).
3 Application
Let us consider a text, e.g. one of the end-of-year speeches of Italian presidents,
in which the words have been replaced by their lengths measured in syllables.
The beginning of the corresponding sequence looks as follows:
2 2 1 1 2 1 3 2 3 2 2 1 1 1 4 1 5 1 1 2 1 2 2 1 2 1 4 3 2 1 1 1 4 1 3 4 2 1 1 4 2 2 2 2 4 2 1 4 ...
Forming L-motifs according to the definition given above yields (we do not need
the parentheses):
2-2
1-1-2
1-3
2-3
2-2
1-1-1-4
1-5
1-1-2
1-2-2
1-2
1-4
3
2
1-1-1-4
1-3-4
2
1-1-4
...
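The segmentation rule lends itself to a few lines of R (a sketch; the vector below is just the beginning of the syllable-length sequence quoted above):

```r
l_motifs <- function(x) {
  starts <- c(TRUE, x[-1] < x[-length(x)])  # a smaller value starts a new motif
  split(x, cumsum(starts))
}
syl <- c(2, 2, 1, 1, 2, 1, 3, 2, 3, 2, 2, 1, 1, 1, 4, 1, 5)
sapply(l_motifs(syl), paste, collapse = "-")
# "2-2" "1-1-2" "1-3" "2-3" "2-2" "1-1-1-4" "1-5"
```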
The rank-frequency distribution of the motifs in the complete text is given in Ta-
ble 1.
Fitting the Zipf-Mandelbrot distribution to the data yields an excellent result: The
value of the Chi-square statistic is 30.7651 with 131 degrees of freedom; the prob-
ability cannot be distinguished from 1.0. The parameters are estimated as a =
1.6468, b = 3.1218. Figure 1 gives an impression of the fit.
So far, the empirical description of one of the aspects of the statistical struc-
ture of motif sequences in texts was shown. Such results can be used for various
purposes, because the differences of the parameters of two empirical distribu-
tions can be tested for significance, e.g. for the comparison of authors, texts, and
text sorts, and for classification purposes (cf. Köhler, Naumann 2010). As to the question of which theoretical probability distribution should be expected for motifs, a general hypothesis was set up in Köhler (2006). It states that the frequency
distributions of motifs are similar to the distributions of the basic units. This hy-
pothesis was successfully tested in the cited work, in the cited papers by Köhler
and Naumann, and in Mačutek (2009).
Fig. 1: Bi-logarithmic graph of the Zipf-Mandelbrot distribution as fitted to the data from Table
1.
The following is an example which may illustrate that the underlying probability distribution of the basic units and their properties, and of the corresponding motifs, can be theoretically derived and thus explained. In Köhler, Naumann (2009),
sentence length was studied. As opposed to most studies on sentence length, this
quantity was measured in terms of the number of clauses. There are two main
reasons to do so. First, the word, the commonly used unit for this purpose, is not
the immediate constituent of the sentence (in the sense of the Menzerath-Alt-
mann Law). Second, frequency counts of sentence length based on the number
of words, display ragged distribution shapes with notoriously under-populated
classes. The data are usually pooled into intervals of at least ten but do not dis-
play smooth distributions nevertheless. The length motifs formed on the basis of
clause counts yielded an inventory of motif types, which turned out to be distrib-
uted according to the Zipf-Mandelbrot distribution, as expected. The next step
was the attempt to theoretically find the probability distribution of the length of
these length motifs. The corresponding considerations were as follows:
[1] In a given text, the mean sentence length, i.e. the estimate of the mathematical expectation of sentence length, can be interpreted as the sentence length intended by the text producer (speaker/writer).
and $b^{(x)} = b(b+1)\cdots(b+x-1)$. According to this derivation, the hyper-Poisson distribution, which plays a basic role in word length distributions (Best 1997), should also be a good model of L-motif length on the sentence level, although motifs on the word level, regardless of the property considered (length, polytextuality, frequency), follow the hyper-Pascal distribution (Köhler 2006; Köhler/Naumann 2008). Figure 2 shows one example of the fitting tests, which support the specific hypothesis derived above as well as the general hypothesis that motif distributions resemble the distributions of the basic units and properties. Similarly, in Köhler (2006) a theoretical model of the distribution of the length of L-motifs was derived, which yielded the hyper-Pascal distribution. Empirical tests confirmed this hypothesis.
Fig. 2: Fitting the hyper-Poisson distribution to the frequency distribution of the lengths of L-
motifs on the sentence level.
An example of the segmentation of a text fragment taken from one of the anno-
tated newspaper commentaries of the Potsdam corpus2 (represented as a se-
quence of argumentative relations) into R-motifs is the following:
The first R-motif consists of a single element because the following relation is a repetition of the first; the second one also ends where one of its elements occurs again, etc.
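The rule can be sketched in R as follows; the relation labels are invented stand-ins, not the actual Potsdam corpus annotations:

```r
r_motifs <- function(x) {
  out <- list()
  cur <- character(0)
  for (s in x) {
    if (s %in% cur) {              # a repeated element closes the current motif
      out[[length(out) + 1]] <- cur
      cur <- character(0)
    }
    cur <- c(cur, s)
  }
  c(out, list(cur))
}
r_motifs(c("elab", "elab", "cause", "elab", "contrast", "cause"))
# yields ("elab"), ("elab", "cause"), ("elab", "contrast", "cause")
```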
On this basis, the lengths of these R-motifs in the Potsdam commentary cor-
pus were determined. The distribution of the motif lengths turned out to abide by
the hyper-Binomial distribution (cf. Figure 3):
Fig. 3: The distribution of the length of R-motifs in a Corpus, which was annotated for argumen-
tation relations (cf. Beliankou, Köhler, Naumann 2013).
2 Stede (2004).
Each motif begins at the root of the tree and follows one of the possible paths
down until a terminal element is reached. The length of motifs determined in this
way displays a behaviour that differs considerably from that of the R-motifs. A
linguistically interpretable theoretical probability distribution which can be fit-
ted to the empirical frequency distribution is the mixed negative binomial distri-
bution (cf. Figure 4).
Fig. 4: Fitting the mixed negative binomial distribution to the D-motif data
This distribution was justified in the paper by the assumption that it is the result of a combination of two diversification processes, both of which result in the negative binomial distribution, but with different parameters.
We will apply the R-motif method to the Italian text we studied above but this
time with respect to the sequences of part-of-speech tags3. Replacing the words
in the text by the symbols for their resp. part-of-speech yields the sequence (for
the tags cf. Table 2):
3 Again, many thanks to Arjuna Tuzzi.
Tag Part-of-speech
N noun
PREP preposition
V verb
A adjective
DET article
CONG conjunction
PRON pronoun
AVV adverb
NM proper name
NUM number
ESC interjection
We determine the R-motifs and fit the Zipf-Mandelbrot distribution to the fre-
quency distribution of these motifs. The motifs and their frequency are shown in
Table 3. The result of the fitting is excellent. The probability of the Chi-square
value is given as 1.0, the parameters are a = 0.9378, b = 0.9553, and n = 602. The
number of degrees of freedom is 473 (caused by pooling classes with low fre-
quency). The graph of the distribution and the data is given in Figure 5.
Fig. 5: Fitting the Zipf-Mandelbrot distribution to the data in Table 3. Both axes are logarithmic.
It goes without saying that R-motifs, D-motifs, and possibly other variants of motifs formed from categorical data can also be used as the basis for forming F-, L- and other kinds of motifs. The procedure can be continued recursively until the point is reached where too few elements are left.
Thus, motifs provide a means to analyse texts in their sequential structure with respect to all kinds of linguistic units and properties; even categorical properties can be studied in this way. The granularity of an investigation can be adjusted by iterative application of motif formation, and proven statistical methods can be used for the evaluation. The full potential of this approach has not yet been explored.
References
Beliankou, Andrei, Reinhard Köhler & Sven Naumann. 2013. Quantitative properties of argumentation motifs. In Ivan Obradović, Emmerich Kelih & Reinhard Köhler (eds.), Methods and applications of quantitative linguistics, 33–43. Belgrade: Academic Mind.
Best, Karl-Heinz. 1997. Zum Stand der Untersuchungen zu Wort- und Satzlängen. In Third Inter-
national Conference on Quantitative Linguistics, 172–176. Helsinki.
Köhler, Reinhard. 2006. The frequency distribution of the lengths of length sequences. In Jozef
Genzor & Martina Bucková (eds.), Favete linguis. Studies in honour of Viktor Krupa, 145–
152. Bratislava: Slovak Academic Press.
Köhler, Reinhard. 2008a. Word length in text. A study in the syntagmatic dimension. In Sybila
Mislovičová (ed.), Jazyk a jazykoveda v pohybe, 416–421. Bratislava: Veda.
Köhler, Reinhard. 2008b. Sequences of linguistic quantities. Report on a new unit of investiga-
tion. Glottotheory 1(1). 115–119.
Köhler, Reinhard & Gabriel Altmann. 1996. “Language forces” and synergetic modelling of lan-
guage phenomena. In Peter Schmidt (ed.), Glottometrika 15, 63–76. Trier: WVT.
Köhler, Reinhard & Sven Naumann. 2008. Quantitative text analysis using L-, F- and T-seg-
ments. In Christine Preisach, Hans Burkhardt, Lars Schmidt-Thieme & Reinhold Decker
(eds.), Data Analysis, Machine Learning and Applications, 635–646. Berlin & Heidelberg:
Springer.
Köhler, Reinhard & Sven Naumann. 2009. A contribution to quantitative studies on the sen-
tence level. In Reinhard Köhler (ed.), Issues in Quantitative Linguistics, 34–57.
Lüdenscheid: RAM-Verlag.
Köhler, Reinhard & Sven Naumann. 2010. A syntagmatic approach to automatic text classifica-
tion. Statistical properties of F- and L-motifs as text characteristics. In Peter Grzybek, Em-
merich Kelih & Ján Mačutek (eds.), Text and Language. Structures, functions, interrela-
tions, quantitative perspectives, 81–89. Wien: Praesens.
Mačutek, Ján. 2009. Motif richness. In Reinhard Köhler (ed.), Issues in Quantitative Linguistics,
51–60. Lüdenscheid: RAM-Verlag.
Stede, Manfred. 2004. The Potsdam commentary corpus. In Bonnie Webber & Donna Byron
(eds.), Proceedings of the 2004 ACL workshop on discourse annotation, 96–102.
Wimmer, Gejza & Gabriel Altmann. 1999. Thesaurus of univariate discrete probability distributions. Essen: Stamm.
Reinhard Köhler and Arjuna Tuzzi
Linguistic Modelling of Sequential
Phenomena
The role of laws
1 Introduction
A number of textual aspects can be represented in the form of linear sequences of linguistic units and/or their properties. Some well-known examples of simple models of such phenomena are the dynamic variant of the type-token relation (TTR), representing the gradual development of the 'lexical richness' of a text, i.e. how the number of types increases as the number of tokens increases, and of other properties of a text; the linguistic motifs, which can be applied with any degree of granularity; and time-series models such as ANOVA (cf. the contributions about motifs and time series in this volume).
Before conclusions are drawn on the basis of the application of a mathe-
matical model, some important issues should be taken into account. One of the
most important ones is the validity of the model, i.e. the question whether a
model does really represent what we think it does. The simplest possible quanti-
tative model of a linguistic phenomenon is a single number which results from a
measurement. Combining two or more numbers by arithmetic operations yields
an index, e.g. a quotient. Another way to reflect more than just one property in
order to represent a complex property is forming a vector, which consists of as
many numbers as dimensions are relevant. The appropriate definition of the
measure of a property is fundamental to all the rest of the modelling procedure.
Every measure can be used to represent a whole text (such as the original TTR)
or to scrutinize the dynamic behaviour of the properties under study from text
position to text position, adding a temporal, or more generally, a sequential
dimension to the model. The validity of the result depends, of course, on the
validity of the basic measure(s) and on the validity of the mathematical function
which is set up as a model of its dynamic development. A function which, e.g.,
increases boundlessly with text position cannot serve as a valid model of any
text property, because there are no infinitely long texts and because there is no
linguistic property which would yield infinite values if measured in an ap-
propriate way. The same is true of a measure such as the entropy, which gives a
reliable estimation of the 'true' value of a property only for an infinitely long
sequence of symbols. In general, the domain of the function must agree with the
range and the possible values of the linguistic property. The appropriateness of
ANOVA models should also be checked carefully; it is not obvious that this kind
of model correctly reflects linguistic reality. Sometimes, categorical data are
represented by numbers e.g., when stressed and unstressed syllables are
mapped to the numbers "1" and "0" without any explication why not the other
way round or to "4" and "-14,000.22". And why it is correct to calculate with
these numbers, which still represent the non-numerical categories "stressed'
and "unstressed".
Besides validity, every model should be checked also for some other im-
portant properties: reliability, interpretability, and simplicity (cf. Altmann 1978,
1988). Moreover, appropriateness of the scale level of the measure should be
considered, i.e. whether the property to be modelled is measured on a nominal,
an ordinal, or one of the metrical scales. The choice of the scale level determines
the mathematical operations which are applicable to the measured data, and
hence also which kind of mathematical function would agree with the numbers
obtained as results of the measurement.
Another aspect which must not be neglected when a combination of proper-
ties is used to form an index or a vector is the independence of the component
variables. An example may illustrate this issue: the most commonly used index
of reading ease (the 'Flesch formula') for English texts consists of a linear combi-
nation of sentence length (SL) and word length (WL).
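The formula itself is not reproduced here. For orientation only, the standard formulation of the Flesch Reading Ease score (supplied from the common literature, not from the original source of this passage) reads

$$RE = 206.835 - 1.015 \cdot \mathrm{ASL} - 84.6 \cdot \mathrm{ASW},$$

where ASL is the average sentence length in words and ASW the average number of syllables per word.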
This may look like a good idea until it turns out that word length is an indi-
rect function of sentence length, as follows from the well-established Menzerath-
Altmann law (Altmann 1980, 1983b; Cramer 2005), which has been empirically con-
firmed over and over again. Thus, this index effectively combines sentence length
with sentence length, which is certainly at least redundant.
It is not too difficult to set up a measure of a linguistic or textual property;
however, many simple measures and also more complex indexes suffer from
problems of various kinds. Some of these problems are well known in the scien-
tific community, such as the dependence of the type-token relation on text
length. Therefore, attempts have been made (Wimmer & Altmann 1999, Köhler 1993,
Kubát & Milička 2013, Covington & McFall 2010) to find a method to circumvent
this problem, because this measure is easy to apply and is hoped to be useful for
differentiating text sorts or as a technique for authorship attribution and
for other stylistic tasks. Another good reason not to drop this measure is the fact
that not many measures of dynamic, sequential text phenomena are known.
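The moving-average type-token ratio of Covington & McFall (2010), one of the circumventions cited above, lends itself to a compact illustration. The following Python sketch is our illustration of the idea, not the authors' implementation; the function name and the window size of 500 tokens are our assumptions.

def mattr(tokens, window=500):
    """Moving-Average Type-Token Ratio (after Covington & McFall 2010):
    the TTR is computed in a window of fixed length slid through the text
    token by token, and the window TTRs are averaged. Unlike the plain TTR,
    the result does not depend on text length."""
    if len(tokens) < window:
        raise ValueError("text is shorter than the window")
    ttrs = [len(set(tokens[i:i + window])) / window
            for i in range(len(tokens) - window + 1)]
    return sum(ttrs) / len(ttrs)

An incremental update of the type counts would make the computation linear in text length; the naive version above is kept for readability.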
1949 to 2011. The main aims of these studies were to identify a specific temporal
pattern, i.e. a curve, for each word, and to cluster curves portraying similar
temporal patterns. The authors proposed a flexible wavelet-based model for
curve clustering in the frame of functional data analysis approaches. However,
clear specific patterns could not be observed, as each word possesses its own
irregular series of frequency values (cf. Fig. 1). In their conclusions the authors
highlighted some critical points in representing the temporal patterns of words
as functional objects, the weaknesses of an explorative approach, and the lack of
a linguistic theory to justify and interpret such a complex and extremely sophis-
ticated model.
Fig. 1: Temporal patterns of six words (taken from Trevisani & Tuzzi 2013)
In the present paper, we will emphasize the linguistic point of view: The
material under study is not just a corpus or a set of texts but a sequence of texts
representing an Italian political-institutional discourse. The temporal trajectory
of the frequency of a word can thus be considered as an indicator of the com-
municative relevance of a concept at a given point in time. We will not abandon
the original idea of characteristic patterns, but instead of expecting a general
time-dependent behaviour we will set up the specific hypothesis that the tem-
poral behaviour of the frequency of a word is discourse-specific. A ready-made
model of this kind of phenomenon is not available, but there is an established
law for a related phenomenon: the logistic function, in linguistics called the
Piotrowski (or Piotrowski-Altmann) law. It does not make any statements about
frequency but about the dispersal of units or properties over a community along
the time axis. We will argue in the following way: degree of dispersion and
frequency are, with respect to communication systems, two sides of the same coin.
The more frequently a unit is used, the higher the probability that the unit becomes
more familiar to other members of the community – and vice versa: the greater the
degree of dispersion, the higher the probability of occurrence. We will therefore
adopt this law for our purposes and assume that it is a good model of the
dynamics of the frequencies of words within a discourse.
The basic form of the Piotrowski-Altmann law (Altmann 1983a) is given as

$$p_t = \frac{C}{1 + a e^{-bt}} \qquad (1)$$

Altmann (1983a) also proposed a modified version for the case of develop-
ments where the parameter b is a function of time (2):

$$p_t = \frac{1}{1 + a e^{-bt + ct^2}} \qquad (2)$$

We adopted the law in the form of function (3), which has an additional
parameter C because we cannot expect that a word approaches probability 1:

$$p_t = \frac{C}{1 + a e^{-bt + ct^2}} \qquad (3)$$
Here, the dependent variable p_t represents the relative probability of a word
at time t. On the basis of a plausible, linguistically motivated model, we can
further assume that the data, which seem at first look to disagree with any
assumption of regularity, display a statistical spread around the actual trend.
Therefore, the data, i.e. the relative frequencies, are smoothed in order to ena-
ble us to detect the trends and to use them for testing our hypothesis. As a
smoothing technique, moving averages with window size 7 were applied: the win-
dow size was chosen in a way which yielded a sufficiently smooth sequence of
values while, on the other hand, keeping as many individual values as possible. The
smoothing method works as follows. The first window starts with the first pair
of (y, x) – or, in our case, (p_t, t) – values and ends with the 7th. The next
window starts with the second pair, i.e. (p_{t+1}, t+1), and ends with the pair
(p_{t+7}, t+7), etc., until the last window, which ends with the final pair. For each
window, the mean of the 7 p_t values is calculated. The means of all the windows
form the new data.
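A minimal sketch of this smoothing step in Python (our illustration; "values" stands for the chronologically ordered relative frequencies p_t):

def moving_average(values, window=7):
    """Replace a sequence by the means of all windows of the given size,
    each window shifted by one position against the previous one."""
    return [sum(values[i:i + window]) / window
            for i in range(len(values) - window + 1)]

# Example: the smoothed series is shorter than the original by window-1 values.
print(moving_average([1, 2, 3, 4, 5, 6, 7, 8, 9, 10], window=7))
# -> [4.0, 5.0, 6.0, 7.0]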
4 Results
Not surprisingly, many words do not display a specific trajectory over time.
Word usage depends to a great extent on grammatical and stylistic circumstances
– in particular for the function words – which should not change with the political
or social development of a country. There are, however, function words which,
or whose forms, change over time. One example was described by Best and
Kohlhase (1983): the German word form ward "became" was replaced by wurde
within the time period from 1445 to 1925, following function (1). In our data, a
similar process can be observed: the increase in the frequency of the adverb ci
(at the cost of vi) is shown in Fig. 2.
Fig. 2: The increase of the frequency of "ci". Moving averages with window size 20.
Content words, too, often show irregular behaviour with respect to their
frequency, because their usage depends on unpredictable thematic circum-
stances. On the other hand, we assume that there are also many words which
reflect to some extent the relevance of a concept within the dynamics of the
discourse. We selected some words to illustrate this fact. Figures 3 and 4 and the
corresponding data in Table 1 show examples of words which follow the typical
logistic growth function. The part-of-speech tags which are attached to the
words in the captions have the following meanings: "N" stands for noun and
"NM" for proper name.
Fig. 3 & 4: The increase of the frequency of "Europa" and "storia". Smoothing averages with
window size 7.
Fig. 5-7: Temporal behaviour of the frequencies of selected content words in the sequence of
the presidential speeches.
Fig. 8: Temporal behaviour of the frequencies of selected content words in the sequence of the
presidential speeches [continued]
Figures 5–8 show graphs of words where only the rapidly increasing branch or
only the decreasing branch of a reversible development of the curve occurs, while
Figures 9 and 10 show cases of fully reversible developments.
Fig. 9: Temporal behaviour of the frequencies of selected content words in the sequence of the
presidential speeches – fully reversible developments
Fig. 10: Temporal behaviour of the frequencies of selected content words in the sequence of
the presidential speeches – fully reversible developments
Table 1: The (smoothed) sequences of relative frequency values of some words and the results of fitting function (3) to the data. The numbers which
represent the years have been transformed ((year-1948)/10) in order to keep the numerical values small enough to be calculated as arguments of the
exponential function.
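Fitting function (3) to such a smoothed series can be sketched as follows. The code is a hypothetical illustration using synthetic data and scipy's general-purpose least-squares fitter; it is not the computation behind Table 1, whose software and starting values are not reported here.

import numpy as np
from scipy.optimize import curve_fit

def piotrowski(t, C, a, b, c):
    """Function (3): the modified Piotrowski-Altmann law."""
    return C / (1.0 + a * np.exp(-b * t + c * t**2))

years = np.arange(1949, 2012)
t = (years - 1948) / 10.0                 # the transformation used for Table 1
rng = np.random.default_rng(0)
p = (piotrowski(t, 0.002, 8.0, 1.2, 0.05)
     + rng.normal(0.0, 1e-4, size=t.size))  # synthetic data, for illustration only

params, _ = curve_fit(piotrowski, t, p, p0=[0.002, 1.0, 1.0, 0.0], maxfev=10000)
print(dict(zip(["C", "a", "b", "c"], params)))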
5 Conclusions
The presented study is an illustration of the fact that data alone do not give an
answer to a research question. Only a theoretically grounded hypothesis, tested
on appropriate data, produces new knowledge.
We assume that the individual kinds of dynamics in fact reflect the rele-
vance of the corresponding concepts in the political discourse but we are not
going to propose political interpretations of the findings. In a follow-up study,
the complete vocabulary of the presidential discourse will be analysed, and on
this basis, it will be possible to find out whether conceptually related words
follow similar temporal patterns.
Acknowledgments
The authors would like to thank IQLA for providing data for this study.
References
Altmann, Gabriel. 1978. Zur Verwendung der Quotienten in der Textanalyse. Glottometrika 1.
91–106.
Altmann, Gabriel. 1980. Prolegomena to Menzerath’s law. Glottometrika 2(2). 1-10.
Altmann, Gabriel. 1983a. Das Piotrowski-Gesetz und seine Verallgemeinerungen. In Karl-Heinz
Best & Jörg Kohlhase (eds.), Exakte Sprachwandelforschung, 59-90. Göttingen: edition
herodot.
Altmann, Gabriel. 1983b. H. Arens’ „Verborgene Ordnung“ und das Menzerathsche Gesetz. In
Manfred Faust, Roland Harweg, Werner Lehfeldt & Wienold Götz (eds.), Allgemeine
Sprachwissenschaft, Sprachtypologie und Textlinguistik, 31-39. Tübingen: Gustav Narr.
Altmann, Gabriel. 1988. Linguistische Meßverfahren. In Ulrich Ammon, Norbert Dittmar & Klaus
J. Mattheier (eds.), Sociolinguistics. Soziolinguistik, 1026-1039. Berlin, New York: Walter
de Gruyter.
Bunge, Mario. 1967. Scientific Research I, II. Berlin, Heidelberg, New York: Springer.
Bunge, Mario. 1998. Philosophy of science. From problem to theory. New Brunswick, London:
Transaction Publishers.
Bunge, Mario. 2007 [1998]. Philosophy of science. From explanation to justification, 4th edn.
New Brunswick, London: Transaction Publishers.
Covington, Michael A. & Joe D. McFall. 2010. Cutting the Gordian knot: The moving-average
type-token ratio (MATTR). Journal of Quantitative Linguistics 17(2). 94–100.
Cramer, Irene. 2005. Das Menzerathsche Gesetz. In Reinhard Köhler, Gabriel Altmann &
Rajmond G. Piotrowski (eds.), Quantitative Linguistik. Ein internationales Handbuch.
Quantitative Linguistics. An International Handbook, 659-688. Berlin, New York: Walter de
Gruyter.
Köhler, Reinhard & Matthias Galle. 1993. Dynamic aspects of text characteristics. In Ludĕk
Hřebíček & Gabriel Altmann (eds.), Quantitative text analysis (Quantitative Linguistics 52),
46-53. Trier: Wissenschaftlicher Verlag.
Kubát, Miroslav & Jiří Milička. 2013. Vocabulary Richness Measure in Genres. Journal of
Quantitative Linguistics 20(4). 339-349.
Trevisani, Matilda & Arjuna Tuzzi. 2012. Chronological analysis of textual data and curve
clustering: preliminary results based on wavelets. In Società Italiana di Statistica,
Proceedings of the XLVI Scientific Meeting. Padova: CLEUP.
Trevisani, Matilda & Arjuna Tuzzi. 2013. Shaping the history of words. In Ivan Obradović,
Emmerich Kelih & Reinhard Köhler (eds.), Methods and Applications of Quantitative
Linguistics: Selected papers of the VIIIth International Conference on Quantitative
Linguistics (QUALICO), Belgrade, Serbia, April 16-19, 2012, 84-95. Belgrade, Serbia:
Akademska Misao.
Wimmer, Gejza & Gabriel, Altmann. 1999. On Vocabulary Richness. Journal of Quantitative
Linguistics 6(1). 1-9.
Ján Mačutek and George K. Mikros
Menzerath-Altmann Law for Word Length
Motifs
1 Introduction
Motifs are relatively new linguistic units which make an in-depth investigation
of sequential properties of texts possible (for the general definition cf. Köhler, this
volume, pp. 89-90). They have been studied in a handful of papers (Köhler 2006,
2008a,b, this volume, pp. 89-108; Köhler and Naumann 2008, 2009, 2010; Maču-
tek 2009; Sanada 2010; Milička, this volume, pp. 133-145). Specifically, a word
length motif is a continuous series of equal or increasing word lengths (measured
here in the number of syllables, although there are also other options, like, e.g.,
morphemes).
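Operationally, the segmentation can be sketched in a few lines of Python (our illustration; the function name and the example sequence are ours):

def l_motifs(lengths):
    """Split a sequence of word lengths into maximal runs of equal or
    increasing values; a word shorter than its left neighbour starts
    a new motif."""
    motifs, current = [], []
    for x in lengths:
        if current and x < current[-1]:
            motifs.append(tuple(current))
            current = []
        current.append(x)
    if current:
        motifs.append(tuple(current))
    return motifs

# A made-up sequence of syllable counts:
print(l_motifs([2, 1, 2, 2, 1, 3, 2, 1, 2, 2, 2, 3]))
# -> [(2,), (1, 2, 2), (1, 3), (2,), (1, 2, 2, 2, 3)]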
In the papers cited above it is supposed that motifs should have properties
similar to those of their basic units, i.e., words in our case. Indeed, word fre-
quency and motif frequency, as well as word length (measured in the number of
syllables) and motif length (measured in the number of words), can be modelled
by the same distributions (power laws, like, e.g., the Zipf-Mandelbrot distribu-
tion, and Poisson-like distributions, respectively; cf. Wimmer and Altmann 1999).
Also the type-token relations for words and motifs display similar behaviour, dif-
fering only in parameter values, but not in models.
We enlarge the list of analogous properties of motifs and their basic units,
demonstrating (cf. Section 3.1) that the Menzerath-Altmann law (cf. Cramer 2005;
MA law henceforth) is valid also for word length motifs. The MA law describes
the relation between the size of a construct, e.g., a word, and that of its constitu-
ents, e.g., syllables. It states that the larger the construct (the whole), the smaller
its constituents (parts). In particular, for our data it holds that the longer the motif
(in the number of words), the shorter the mean length of the words (in the number
of syllables) which constitute it. In addition, in Section 3.2 we show that the MA
law is valid for randomly generated texts as well, but its parameters differ from
those obtained from real texts.
2 Data
In order to study the MA law for word length motifs we compiled a Modern
Greek literature corpus totaling 236,233 words. The corpus contains complete
versions of literary texts from the same time period and has been strictly con-
trolled for editorial normalization. It contains five novels by four widely
known Modern Greek writers, published by the same publishing house
(Kastaniotis Publishing House). All the novels were best-sellers in the Greek
market and belong to the "classics" of Modern Greek literature. More specif-
ically, the corpus consists of:
– The mother of the dog, 1990, by Matesis (47,852 words).
– Murders, 1991, by Michailidis (72,475 words).
– From the other side of the time, 1988, by Milliex [1] (77,692 words).
– Dreams, 1991, by Milliex [2] (9,761 words) - Test novel.
– The dead liqueur, 1992, by Xanthoulis (28,453 words).
3 Results
The results obtained confirmed our expectation that the MA law should be valid
also for word length motifs. The tendency of mean word length (measured in the
number of syllables) to decrease with the increasing motif length (measured in
the number of words) is obvious in all five texts investigated, cf. Table 2.
We modelled the relation by the function

$$y(x) = \alpha x^{b} \qquad (1)$$

where $y(x)$ is the mean length of words which occur in motifs consisting of $x$
words; $\alpha$ and $b$ are parameters. Given that $y(1) = \alpha$, we replaced $\alpha$ with the
mean length of words from motifs of length 1, i.e., motifs consisting of one word
only (cf. Kelih 2010, Mačutek and Rovenchak 2011). In order to avoid too strong
fluctuations, only motif lengths which appeared in particular texts at least 10
times were taken into account (cf. Kelih 2010). The appropriateness of the fit was
assessed in terms of the determination coefficient R² (values higher than 0.9 are
usually considered satisfactory, cf., e.g., Mačutek and Wimmer 2013). The numerical
results (the values of R² and the parameter values for which R² reaches its maximum)
are presented in Table 2.
Table 2: Fitting function (1) to the data. ML – motif length, MWL_o – observed mean word length,
MWL_t – theoretical mean word length resulting from (1).
The results presented in the previous section show that longer motifs contain
shorter words and vice versa. The relation between the lengths (i.e., the MA law)
can be modelled by a simple power function. One cannot, however, a priori ex-
clude the possibility that the observed regularities are necessary in the sense that
they are only a consequence of some other laws. In this particular case, it
seems reasonable to ask whether the MA law remains valid if the distribution of
word lengths is kept, but the sequential structure of word lengths is deliberately
forgotten.
Randomization (i.e., the random generation of texts – or of only some properties
of texts – by means of computer programs) is a useful tool for finding answers to
questions of this type. It is slowly finding its way into linguistic research (cf., e.g.,
Benešová and Čech, this volume, pp. 57-69, and Milička, this volume, pp. 133-145,
for other analyses of the MA law; Liu and Hu 2008 applied randomization to re-
fute claims that small-world and scale-free complex language networks automat-
ically give rise to syntax).
In order to get rid of the sequential structure of word lengths, while at the
same time preserving the word length distribution, we generated random num-
bers from the distribution of word length in each of the five texts under our inves-
tigation. The number of generated random word lengths is always equal to the
text length of the respective real text (e.g., we generated 47,852 random word
lengths for the text by Matesis, as it contains 47,852 words, cf. Table 1). Then, we
fitted function (1) to the randomly generated data. The outcomes of the fitting can
be found in Table 3. The generated data were truncated at the same points as their
counterparts from real texts, cf. Section 3.1.
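Under our reading of this procedure (sampling with replacement from the empirical word length distribution), the randomization step and the extraction of the Menzerath-Altmann data can be sketched as follows; l_motifs() is the segmentation function sketched in Section 1, and all names are our assumptions:

import random
from collections import Counter

def random_counterpart(word_lengths, seed=0):
    """Draw as many word lengths as the text has, sampling with replacement
    from its empirical word length distribution: the distribution is kept
    (in expectation) while any sequential structure is destroyed."""
    rng = random.Random(seed)
    return rng.choices(list(word_lengths), k=len(word_lengths))

def menzerath_data(word_lengths, min_motifs=10):
    """Mean word length (in syllables) for every motif length (in words)
    occurring at least min_motifs times, as in Section 3.1."""
    syllables, words, motifs = Counter(), Counter(), Counter()
    for m in l_motifs(word_lengths):
        syllables[len(m)] += sum(m)
        words[len(m)] += len(m)
        motifs[len(m)] += 1
    return {x: syllables[x] / words[x]
            for x in sorted(words) if motifs[x] >= min_motifs}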
Table 3: Fitting function (1) to the data. ML – motif length, MWL_r – mean word length from
randomly generated data, MWL_f – fitted values resulting from (1).
It can be seen that the MA law holds also in this case; however, the parameters b in
the random texts differ from those in the real texts. The parameters in the
random texts always have larger absolute values, which means that the respec-
tive curves are steeper, i.e., they decrease more quickly.
As an example, in Fig. 1, we present data from the first text (by Matesis, cf. Section
2) together with the mathematical model (1) fitted to the data.
Fig. 1: Data from the text by Matesis (circles) and from the respective random text (diamonds),
together with fitted models (dashed line – the model for the real text, solid line – the model for
the random text)
The comparison shows that the sequential structure of word lengths plays an
important role in the MA law for word length motifs.
4 Conclusions
The paper brings another confirmation that word length motifs behave in the
same way as other, more "traditional" linguistic units. In addition to the motif
frequency distribution and the distribution of motif length (which were shown to
follow the same patterns as words), also the MA law is valid for word length mo-
tifs; specifically, the more words a motif contains, the shorter the mean syllabic
length of the words in the motif.
The MA law can be observed also in random texts, if the word length distribu-
tions in a real text and in its random counterpart are the same. The validity of the
law in random texts can be explained deductively from the word length distribution.
However, the parameters in the exponents of the power function which serves as the
mathematical model of the MA law are different for real and random texts. The power
functions corresponding to random texts are steeper. The difference in parameter
values proves that not only the word length distribution, but also the sequential
structure of word lengths has an impact on word length motifs.
It remains an open question whether the parameters of the MA law can be used
as characteristics of languages, genres or authors. In the case of a positive answer,
they could possibly be applied to language classification, authorship attribution
and similar fields.
Acknowledgments
J. Mačutek was supported by VEGA grant 2/0038/12.
References
Benešová, Martina & Radek Čech. 2015. Menzerath-Altmann law versus random models. This
volume, pp. 57-69.
Cramer, Irene M. 2005. Das Menzerathsche Gesetz. In Reinhard Köhler, Gabriel Altmann &
Rajmund G. Piotrowski (eds.), Quantitative Linguistics. An international handbook, 659–
688. Berlin & New York: de Gruyter.
Kelih, Emmerich. 2010. Parameter interpretation of the Menzerath law: Evidence from Serbian.
In Peter Grzybek, Emmerich Kelih & Ján Mačutek (eds.), Text and language. Structures,
functions, interrelations, quantitative perspectives, 71–79. Wien: Praesens.
Köhler, Reinhard. 2006. The frequency distribution of the lengths of length sequences. In Jozef
Genzor & Martina Bucková (eds.), Favete linguis. Studies in honour of Viktor Krupa, 145–
152. Bratislava: Slovak Academic Press.
Köhler, Reinhard. 2008a. Word length in text. A study in the syntagmatic dimension. In Sybila
Mislovičová (ed.), Jazyk a jazykoveda v pohybe, 416–421. Bratislava: Veda.
Köhler, Reinhard. 2008b. Sequences of linguistic quantities. Report on a new unit of investiga-
tion. Glottotheory 1(1). 115–119.
Köhler, Reinhard. 2015. Linguistic motifs. This volume, pp. 89-108.
Köhler, Reinhard & Sven Naumann. 2008. Quantitative text analysis using L-, F- and T-seg-
ments. In Christine Preisach, Hans Burkhardt, Lars Schmidt-Thieme & Reinhold Decker
(eds.), Data analysis, machine learning and applications, 635–646. Berlin & Heidelberg:
Springer.
Köhler, Reinhard & Sven Naumann. 2009. A contribution to quantitative studies on the sen-
tence level. In Reinhard Köhler (ed.), Issues in quantitative linguistics, 34–57.
Lüdenscheid: RAM-Verlag.
Köhler, Reinhard & Sven Naumann. 2010. A syntagmatic approach to automatic text classifica-
tion. Statistical properties of F- and L-motifs as text characteristics. In Peter Grzybek, Em-
merich Kelih & Ján Mačutek (eds.), Text and language. Structures, functions, interrela-
tions, quantitative perspectives, 81–89. Wien: Praesens.
Liu, Haitao & Fengguo Hu. 2008. What role does syntax play in a language network? EPL 83.
18002.
Mačutek, Ján. 2009. Motif richness. In Reinhard Köhler (ed.), Issues in quantitative linguistics,
51–60. Lüdenscheid: RAM-Verlag.
Mačutek, Ján & Andrij Rovenchak. 2011. Canonical word forms: Menzerath-Altmann law, phone-
mic length and syllabic length. In Emmerich Kelih, Victor Levickij & Yuliya Matskulyak
(eds.), Issues in quantitative linguistics 2, 136–147. Lüdenscheid: RAM-Verlag.
Mačutek, Ján & Gejza Wimmer. 2013. Evaluating goodness-of-fit of discrete distribution models
in quantitatitive linguistics. Journal of Quantitative Linguistics 20(3). 227–240.
Milička, Jiří. 2015. Is the distribution of L-motifs inherited from the word lengths distribution?
This volume, pp. 133-145.
Sanada, Haruko. 2010. Distribution of motifs in Japanese texts. In Peter Grzybek, Emmerich
Kelih & Ján Mačutek (eds.), Text and Language. Structures, functions, interrelations, quan-
titative perspectives, 183–194. Wien: Praesens.
Wimmer, Gejza & Gabriel Altmann. 1999. Thesaurus of univariate discrete probability distribu-
tions. Essen: Stamm.
Jiří Milička
Is the Distribution of L-Motifs Inherited from
the Word Length Distribution?
1 Introduction
An increasing number of papers1 shows that word length sequences can be
successfully analyzed by means of L-motifs, which are a very promising attempt
to discover the syntagmatic relations of the word lengths in a text.
The L-motif2 has been defined by Reinhard Köhler (2006a) as:
(...) the text segment which, beginning with the first word of the given text, consists of word
lengths which are greater or equal to the left neighbour. As soon as a word is encountered
which is shorter than the previous one the end of the current L-Segment is reached. Thus,
the fragment (1) will be segmented as shown by the L-segment sequence (2):
Azon a tájon, ahol most Budapest fekszik, már nagyon régen laknak emberek.
(2) (1,2,2) (1,3) (2) (1,2,2,2,3)
The main advantage of such segmentation is that it can be applied iteratively, i.e.
L-motifs of the L-motifs can be obtained (so-called LL-motifs). Applying the
method several times results in not very intuitive sequences, which, however, fol-
low lawful patterns3 and are even practically useful, e.g. for automatic text
classification (Köhler – Naumann 2010).
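The iteration can be sketched as follows; here LL-motifs are read as L-motifs of the sequence of L-motif lengths, which is one natural operationalization of the construction rather than a definition taken from the papers cited, and the helper function is our illustration:

def l_motifs(seq):
    """Maximal non-decreasing runs of a numeric sequence."""
    out, cur = [], []
    for x in seq:
        if cur and x < cur[-1]:
            out.append(tuple(cur))
            cur = []
        cur.append(x)
    if cur:
        out.append(tuple(cur))
    return out

lengths = [2, 1, 2, 2, 1, 3, 2, 1, 2, 2, 2, 3]    # the example sequence above
l = l_motifs(lengths)              # [(2,), (1,2,2), (1,3), (2,), (1,2,2,2,3)]
ll = l_motifs([len(m) for m in l])  # L-motifs of the motif lengths 1,3,2,1,5
print(ll)                          # -> [(1, 3), (2,), (1, 5)]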
2 Hypothesis
However, it needs to be admitted that the fact that the L-motifs follow lawful
patterns does not imply that the L-motifs reflect a syntagmatic relation of the
word lengths, since these properties could be merely inherited from the word
1 See Köhler – Naumann (2010), Mačutek (2009), Sanada (2010).
2 The former term was L-segments, see Köhler (2006a).
3 E.g. the rank-frequency relation of the L-motifs distribution can be successfully described by
the Zipf-Mandelbrot distribution, which is a well-established law for the word types rank-fre-
quency relation.
length distribution in the text, which has not been tested yet. The paper focuses
on the most important property of L-motifs – the frequency distribution of their
types – and tests the following hypothesis:
The distribution of L-motifs measured on the text T differs from the
distribution of L-motifs measured on a pseudotext T’. The pseudotext T’ is created
by the random transposition of all tokens of the text T within the text T.4
3 Data
The hypothesis was tested on three Czech and six Arabic texts:
The graphical word segmentation was respected when determining the number
of syllables in the Arabic texts. In the Czech texts, zero-syllable words (e.g. s, z, v,
k) were merged with the following words according to the conclusion in Antić et
al. (2006), in order to maintain compatibility with other studies in this field (e.g. Köh-
ler 2006b).
4 The null hypothesis is: “The distribution of L-motifs measured on the text T is the same as
the distribution of L-motifs measured on a pseudotext T’. The pseudotext T’ is created by random
transposition of all tokens of the text T within the text T.”
4 Motivation
One of those texts [Kat] was randomized one million5 times and the rank–
frequency relation (RFR) of L-motifs was measured for every randomized
pseudotext. Then these RFRs were averaged. This average RFR can be seen in the
following chart, accompanied by the RFR of the L-motifs measured on the real
text:
Visually, the RFR of the L-motifs for the real text does not differ much from the
average pseudotext RFR of the L-motifs. This impression is supported by the
chi-squared discrepancy coefficient C = 0.0008.6 Also the fact that both the
real text's L-motif RFR and the randomized texts' L-motif RFR can be success-
fully fitted by the right-truncated Zipf-Alexeev distribution with similar parame-
ters7 encourages us to assume that the RFR of L-motifs is given by the word length
distribution in the text.
5 1 million has been arbitrarily chosen as a “sufficiently large number”, which makes it the
weakest point of our argumentation.
6 $C = X^2/N$, where N is the sample size (Mačutek – Wimmer 2013).
7 For the real data: a = 0.228; b = 0.1779; n = 651; α = 0.1089; C = 0.0066. For the
randomized pseudotexts: a = 0.244; b = 0.1761; n = 651; α = 0.1032; C = 0.0047. Altmann
Fitter was used.
Very similar results can be obtained for LL-motifs, LLL-motifs8, etc. (the Zipf-Man-
delbrot distribution fits the distributions of the higher-order L-motifs better than
the right-truncated Zipf-Alexeev distribution).
But these results do not answer the question asked. The next section pro-
ceeds to the testing of the hypothesis.
5 Methods
Not only the L-motifs as a whole, but every single L-motif type has a distribution of its
frequencies within those one million randomized pseudotexts. For example, the
number of pseudotexts (randomizations of [Bab]) in which the L-motif (1, 1, 2, 2, 2) oc-
curred exactly 72 times is 111. From this distribution we can obtain confidence intervals
(95%) as depicted in the following chart:
Fig. 2: Distribution of one of the L-motif types in one million pseudotexts (randomized [Bab])
vs. the frequency of the L-motif in the real text.
In this case, the frequency of the motif (1, 1, 2, 2, 2) measured on the real text [Bab]
is 145, which is above the upper confidence interval limit (in this case 125). But
the frequencies of many other L-motifs are within these intervals, such as the mo-
tif (1, 1, 1, 2, 2):
8 For the LL-motifs: C = 0.0009; for the LLL-motifs: C = 0.0011.
Fig. 3: Distribution of one of the L-motif types in one million pseudotexts (randomized [Bab])
vs. the frequency of the L-motif in the real text.
The fact that the frequencies are not independent of each other does not allow
us to test them separately as multiple hypotheses, and leads us to merge all val-
ues of the distribution into one number. The following method was chosen:
1. The text is randomized many times (in this case 1 million times) and for each
pseudotext the frequencies of L-motifs are measured. The average frequency of
every L-motif is calculated. The average frequency of the motif (indexed by
the variable $i$; $N$ is the maximal $i$) will be referred to as $\bar{m}_i$.
2. The total distance ($D$) between the frequencies of each motif ($m_i$) in the text
$T$ and their average frequencies in the randomized pseudotexts ($\bar{m}_i$) is cal-
culated:

$$D = \sum_{i=1}^{N} |\bar{m}_i - m_i|$$

3. All total distances ($D'$) between the frequencies of each motif ($m'_i$) in one mil-
lion pseudotexts $T'$ (these pseudotexts must be different from those that were
measured in step 1) and their average frequencies in the randomized
pseudotexts ($\bar{m}_i$; these must be the same as in the previous step) are calculated:

$$D' = \sum_{i=1}^{N} |\bar{m}_i - m'_i|$$
4. The upper confidence limit is set. A distance D significantly lower than the
distances D′ would mean that the real distribution is even closer to the distri-
bution generated by randomly transposing tokens than other distributions
measured on randomly transposed tokens; this would not reject the null hy-
pothesis. Considering this, the lower confidence limit is senseless and the
test can be assumed to be one-tailed.
5. D is compared with the upper confidence limit. If D is larger than the upper
confidence limit, then the null hypothesis is rejected.
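The whole procedure can be sketched in Python as follows (a simplified illustration: the number of randomizations is far below the one million used here, and all function names are ours):

import random
from collections import Counter

def l_motifs(seq):
    """Maximal non-decreasing runs (repeated here for self-containment)."""
    out, cur = [], []
    for x in seq:
        if cur and x < cur[-1]:
            out.append(tuple(cur))
            cur = []
        cur.append(x)
    if cur:
        out.append(tuple(cur))
    return out

def motif_counts(lengths):
    return Counter(l_motifs(lengths))

def total_distance(counts, means):
    """D = sum over motif types of |mean frequency - observed frequency|."""
    keys = set(counts) | set(means)
    return sum(abs(means.get(k, 0) - counts.get(k, 0)) for k in keys)

def randomization_test(lengths, n=1000, seed=0):
    rng = random.Random(seed)
    def pseudo_counts():
        s = list(lengths)
        rng.shuffle(s)                       # random transposition of tokens
        return motif_counts(s)
    batch = [pseudo_counts() for _ in range(n)]           # step 1: averages
    keys = set().union(*batch)
    means = {k: sum(c.get(k, 0) for c in batch) / n for k in keys}
    d = total_distance(motif_counts(list(lengths)), means)     # step 2
    d_prime = sorted(total_distance(pseudo_counts(), means)    # step 3
                     for _ in range(n))
    upper = d_prime[int(0.95 * n)]           # one-tailed 95% limit (steps 4-5)
    return d, upper, d > upper               # True -> null hypothesis rejected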
Fig. 4: The distribution of the variable D in one million pseudotexts (randomized [Bab]) vs. the
value of the D variable in the real text. Here, for example, 4371 out of one million randomized
texts have D’ equal to 1500.
As D is larger than the upper confidence limit, we shall assume that the dis-
tribution of the L-motifs measured on [Bab] is more distant from the average dis-
tribution of L-motifs measured on pseudotexts (derived by the random transpo-
sition of tokens in [Bab]) than are the distributions of L-motifs measured on other
pseudotexts (also derived by the random transposition of tokens in [Bab]).
6 Results
In the following charts, one column represents the D′ values compared to the
measured D value (as in Fig. 4, but in a more concise form) for 7 orders of
motifs. Confidence limits of the 95% confidence intervals are indicated by the
error bars.
Fig. 5: The D-value of the distribution of L-motifs (in the [Zer]) is significantly different from the
D’-value measured on randomly transposed tokens of the same text. Notice that the LL-motifs
distribution D-value is also close to the upper confidence limit.
Fig. 6: The D-values of the distributions of L-motifs and LL-motifs (in the [Kat]) are significantly
different from the D’-values measured on randomly transposed tokens of the same text. Notice
that the LL-motifs distribution D-value is very close to the upper confidence limit.
Fig. 7: The D-values of the distributions of L-motifs, LL-motifs and LLLL-motifs (in the [Bab]) are
significantly different from the D'-values measured on randomly transposed tokens of the
same text. Note that the LLLL-motifs distribution may be different just by chance.
Fig. 8: The D-values of the distributions of L-motifs, LL-motifs and LLL-motifs (in the [Ham]) are
significantly different from the D’-values. The ratios between the D-values and the upper confi-
dence limits are more noticeable than those measured on the Czech texts (the y-axis is log
scaled). As the size of these texts is comparable, it seems that the L-motif structure is more
substantial for the Arabic texts than for the Czech ones.
Fig. 9: The D-values of the distributions of L-motifs and LL-motifs (in the [Sal]) are significantly
different from the D’-values.
Fig. 10: The D-values of the distributions of L-motifs, LL-motifs and LLL-motifs (in the [Zam]) are
significantly different from the D'-values despite the fact that the text is relatively short.
Fig. 11: The D-values of the distributions of L-motifs, LL-motifs, LLL-motifs and LLLL-motifs (in
the [Maw]) are significantly different from the D'-values despite the fact that the text is rela-
tively large and incoherent.
Fig. 12: The D-values of the distributions of L-motifs, LL-motifs and LLL-motifs (in the [Baj]) are
significantly different from the D’-values.
Fig. 13: The D-values of the distributions of L-motifs, LL-motifs and LLL-motifs (in the [Bah]) are
significantly different from the D’-values.
Table 2: Exact figures presented in the charts. U stands for the upper confidence limit. All D
values with D > U are marked bold.
7 Conclusion
The null hypothesis was rejected for the L-motifs (all texts) and for the LL-motifs
(except [Zer]), and was not rejected for L-motifs of higher orders (LLL-motifs etc.)
in Czech, but it was also rejected for LLL-motifs in Arabic (except [Sal]). As the type-
token relation and the distribution of lengths are to some extent dependent on the
frequency distribution, similar results for these properties can be expected, but
proper tests are needed. Our methodology can also be used for testing F-motifs
and other types and definitions of motifs.
It needs to be said that failing to reject the null hypothesis does not mean that
the L-motifs of higher orders are senseless – even if their distribution were inher-
ited from the distribution of word lengths in the text (which is still not certain),
they could still be used as a tool for viewing the distribution of the word lengths
from another point of view. However, it turns out that if we wish to use the L-
motifs to examine the syntagmatic relations of the word lengths, the structure
inherited from the word length distribution must be taken into account.
Acknowledgements
The author is grateful to Reinhard Köhler for helpful comments and suggestions.
This work was supported by the project Lingvistická a lexikostatistická analýza ve
spolupráci lingvistiky, matematiky, biologie a psychologie, grant no.
CZ.1.07/2.3.00/20.0161 which is financed by the European Social Fund and the
National Budget of the Czech Republic.
References
Antić, Gordana, Emmerich Kelih & Peter Grzybek. 2006. Zero-syllable words in determining
word length. In Peter Grzybek (ed.), Contributions to the science of text and language.
Word length studies and related issues, 117–156. Dordrecht: Springer.
Köhler, Reinhard. 2006. The frequency distribution of the lengths of length sequences. In Jozef
Genzor & Martina Bucková (eds.), Favete linguis. Studies in honour of Viktor Krupa, 145–
152. Bratislava: Slovak Academic Press.
Köhler, Reinhard. 2008. Word length in text. A study in the syntagmatic dimension. In Sibyla
Mislovičová (ed.), Jazyk a jazykoveda v pohybe, 416–421. Bratislava: VEDA.
Köhler, Reinhard & Sven Naumann. 2009. A contribution to quantitative studies on the sen-
tence level. In Reinhard Köhler (ed.), Issues in quantitative linguistics, 34–57. Lüden-
scheid: RAM-Verlag.
Köhler, Reinhard & Sven Naumann. 2010. A syntagmatic approach to automatic text classifica-
tion. Statistical properties of F- and L-motifs as text characteristics. In Peter Grzybek, Em-
merich Kelih & Ján Mačutek (eds.). Text and language. Structures, functions, interrela-
tions, quantitative perspectives, 81–89. Wien: Praesens.
Mačutek, Ján. 2009. Motif richness. In Reinhard Köhler (ed.), Issues in Quantitative Linguistics,
51–60. Lüdenscheid: RAM-Verlag.
Mačutek, Ján & Gejza Wimmer. 2013. Evaluating goodness-of-fit of discrete distribution models
in quantitative linguistics. Journal of Quantitative Linguistics 20(3). 227–240.
Sanada, Haruko. 2010. Distribution of motifs in Japanese texts. In Peter Grzybek, Emmerich Kelih
& Ján Mačutek (eds.), Text and language. Structures, functions, interrelations, quantitative
perspectives, 183–193. Wien: Praesens.
Adam Pawłowski and Maciej Eder
Sequential Structures in “Dalimil’s
Chronicle”
Quantitative analysis of style variation
1 This explanation was communicated to Prof. Pawłowski in a conversation with Prof.
Woronczak and has so far not been published in print.
3 Hypothesis
It has been postulated that in the text of the Chronicle a gradual shift in
versification can be observed, consisting in a growing dissimilation in the
length of adjacent verses and chapters. Called "prosaisation" by Jerzy
Woronczak (1993[1963]: 70), this process is presumably related to
the evolution of the author's linguistic preferences and to his/her attitude
towards documented historical facts. It may also reflect the shift from the
ancient oral tradition, based on very regular verse structures that enhanced
memorisation, to the culture of script, based on literacy and on complex
versification rules that satisfied the readers' increasingly refined taste. Apart from
the above hypothesis, following directly from Woronczak’s considerations,
additional experiments in modelling longitudinal structures based on stress and
quantity were carried out. The main reason for this was that a similar method,
applied to Latin and ancient Greek texts, produced good results (cf. Pawłowski,
Eder 2001; Pawłowski, Krajewski, Eder 2010).
4 Method
Quantitative text analysis is a combination of applied mathematics and
linguistics. It relies on scientific methodology, but its epistemological
foundations are arguable, since quantitative approaches reduce the complexity of
language to a small number of parameters, simplify the question of its ontological
status and often avoid critical dialogue with non-quantitative approaches. Despite
these reservations, quantitative methods have proven to be efficient in resolving many
practical questions. They also support cognitive research, revealing new facets
of human verbal communication, especially statistical laws of language and
synergetic relations. As Reinhard Köhler says, “A number of linguistic laws has
been found during the last decades, some of which could successfully be
integrated into a general model, viz. synergetic linguistics. Thus, synergetic
linguistics may be considered as a first embryonic linguistic theory. [...]
According to the results of the philosophy of science, there is one widely
sampled from a large body of texts. Good examples of such series are successive
works of a given author, periodical press articles, or chronological samples of
public discourse (cf. Pawłowski 2006). As in such cases the process of
quantification primarily concerns lexical units, such sequences can be referred
to as lexical series.
In our study the method of time series analysis was applied. It comprises
several components, such as the autocorrelation function (ACF) and the partial
autocorrelation function (PACF), as well as several models of stochastic
processes, such as AR(p) (autoregressive model of order p) and MA(q) (moving
average model of order q). There are also complex models aggregating a trend
function, periodical oscillations and a random element in one formula.
Periodical oscillations are represented by AR and MA models, while complex
models such as ARMA or ARIMA include their combinations. From the linguistic
point of view the order of the model (p or q) corresponds to the depth of
statistically significant textual memory. This means that p or q elements in a
text statistically determine the subsequent element p + 1 or q + 1. As the time
series method has been exhaustively described in the existing literature
– cf. its applications in economics and engineering (cf. Box, Jenkins 1976; Cryer
1986), in psychology (Glass, Wilson, Gottman 1975; Gottman 1981; Gregson 1983;
Gottman 1990), in the social sciences (McCleary, Hay 1980) and in linguistics
(Pawłowski 1998, 2001, 2005; Pawłowski, Eder 2001; Pawłowski, Krajewski,
Eder 2010) – it seems unnecessary to discuss it here in detail (basic formulae are
given in the Appendix).
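For readers who prefer code to formulae, the sample autocorrelation function can be sketched as follows (our illustration, not the implementation used in this study):

import numpy as np

def acf(series, max_lag=20):
    """Sample autocorrelation: the correlation of the series with itself
    shifted by k positions, for k = 0 .. max_lag."""
    x = np.asarray(series, dtype=float)
    x = x - x.mean()
    var = np.dot(x, x)
    return np.array([np.dot(x[: len(x) - k], x[k:]) / var
                     for k in range(max_lag + 1)])

# Values outside roughly +-1.96 / sqrt(n) are conventionally treated as
# significantly different from zero.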
It is important, nonetheless, to keep in mind that a series of discrete
observations of a random variable is called a time series if it can be spanned over
a longitudinal axis (representing real time or any other sequential quantity). A
full time series can be decomposed into three components: the trend, the
periodical functions (including seasonal oscillations) and the random element
called white noise. In linguistics such a full time series is hardly imaginable
(albeit theoretically possible). In the analysis of a single text, trends usually do not
appear because the observed variables are stationary, i.e. their values do not
significantly outrange the limits imposed by the term's frequency.2 If a greater
number of texts is put in a sequence according to their chronology of
appearance, or according to some principle of internal organisation, trend
approximation, calculated for a given parameter, is possible. For example, in
2 Our previous research on time series analysis of Latin, Greek, Polish and English verse and
prose, proves that “text series” are always stationary (cf. Pawłowski 1998, 2001, 2003;
Pawłowski, Eder 2001; Pawłowski, Krajewski, Eder 2010).
3 Also called Wald-Wolfowitz test.
The weakness of the runs test here is the assumption that a versified text can be
considered a generator of randomness. Without any preliminary tests or
assumptions, one has to admit that texts are the fruit of a deliberate intellectual
activity aimed at creating a coherent and structured message. Versification
plays an aesthetic function here and cannot be regarded as random. Another
point that seems crucial is that one can easily imagine several series presenting
exactly the same parameters (N, na and nb) but at the same time different
orderings of the AA... and BB... runs. It is for these reasons that the runs test was
not used in our study. Instead, trend modelling and time-series tools were
applied.
Tests have shown that coding the text with graphical units such as letters
and words produced very similar results compared with syllable coding.
The increasing verse length was modelled with a linear trend of the form

$$y_i = \beta_1 + \beta_2 x_i + \varepsilon$$

where:
x_i, y_i – the chapter number and the mean verse length in that chapter
β1, β2 – coefficients of the model
ε – random noise.
Given the considerably clear picture of the increasing length of lines in
subsequent chapters, it is not surprising that this linear regression
model (for syllables) fits the data sufficiently well.
4 The trend lines on the plots are generated using the procedure introduced by Cleveland
(1979).
Although this model (its estimate marked with a dashed line in Fig. 1) seems to be
satisfactory, it does not capture the disturbance of the trend in the middle of the
Chronicle (roughly covering chapters 30–60). It is highly probable that we are
dealing here with the influence of an external factor on the otherwise stable
narration of the Chronicle.
Fig. 1: Mean verse length in subsequent chapters of the Chronicle (in syllables).
Fig. 2: Mean verse length in subsequent chapters of the Chronicle (in words).
Certainly, the above results alone do not indicate any shift from verse to prose
per se. However, when this observation is followed by a measure of dispersion,
such as the standard deviation of the line length in subsequent chapters or –
more accurately – a normalised standard deviation (i.e. coefficient of variation,
or standard deviation expressed in the units of the series mean value), the
overall picture becomes clearer, because the dispersion of line lengths (Fig. 3)
increases significantly as well. The lines not only become longer throughout the
whole Chronicle but also more irregular in terms of their syllabic justification,
thus resembling prose more than poetry. The results obtained for syllables are
corroborated by similar tests performed on letters and words (not shown): the
lines of the Chronicle display a decreasing trend of verse justification, whichever
units are being analysed.
One should also note that the syllabic model of the opening chapters,
namely eight-syllable lines with a few outliers, is a very characteristic pattern of
archaic oral poetry in various literatures and folkloric traditions of Europe. In
the case of Dalimil's Chronicle, it looks as if the author composed the earliest
passages using excerpts from oral epic poetry, followed by chapters inspired by
non-literary (historical) written sources. If this assumption were true, it
should be confirmed by a closer inspection of other prosodic features of the
Chronicle, such as rhymes.
Fig. 3: Coefficient of variation of line lengths in subsequent chapters of the Chronicle (in
syllables).
Fig. 4: Rhyme repertory, or the percentage of discrete rhyme pairs used in subsequent
chapters of the Chronicle.
Another series of tests addressed the question of rhyme homogeneity and rhyme
repertory. Provided that oral/archaic poetry tends to reduce alternations in
rhymes or to minimise the number of rhymes used (as opposed to elaborate
written poetry, usually boasting a rich inventory of rhymes), we examined the
complexity of rhyme types in the Chronicle.
Firstly, we computed the percentage of discrete rhyme pairs per chapter –
an equivalent of the classical type/token ratio – in order to get an insight into
the rhyme inventory, with the obvious caveat that this measure is quite sensitive
to sample size. Although the dispersion of results across subsequent chapters is
rather large (Fig. 4), the existence of a decreasing trend is evident.
This is rather counter-intuitive and contradicts the observations formulated
above, since it suggests a greater rhyme inventory at the beginning of the
Chronicle, as though the text were more "oral" in its final passages than in its
opening chapters.
The next measure is the index of rhyme homogeneity, aimed at capturing
longer sequences of matching rhyme patterns, or clusters of lines using the same
rhyme. It was computed as the number of instances where a rhyme followed the
preceding one without alternation, divided by the number of lines of a given
chapter. Thus, the value 0.5 means a regular alternation of rhyme pairs,
whilst a chapter of n verses using no rhyme changes (i.e. based entirely on one
rhyme pattern) will raise the homogeneity index to (n − 1)/n. On theoretical
grounds, both measures – the rhyme inventory index and the homogeneity index – are
correlated to some extent. However, the following example from chapter 38 will
explain the difference between these measures:
Tehdy sě sta, že kněz Ołdřich łovieše
a sám v pustém lesě błúdieše.
Když u velikých túhách bieše,
około sebe všady zřieše,
uzřě, nali-ť stojí dospěłý hrad.
Kněz k němu jíti chtieše rád.
Ale že cěsty neumějieše
a około hłožie husté bieše,
ssěd s koně, mečem cěstu proklesti,
i počě po ostrvách v hrad lézti;
neb sě nemože nikohého dovołati,
by v něm liudie byli, nemože znamenati.
Most u něho vzveden bieše,
a hrad zdi około sebe tvrdé jmějieše.
Když kněz s úsilím v hrad vnide
a všecky sklepy znide,
zetlełé rúcho vidieše,
však i čłověka na něm nebieše.
Sbožie veliké a vína mnoho naleze.
Ohledav hrad, kudyž był všeł, tudyž vyleze.
Pak kněz hrad ten da pánu tomu, jemuž Přiema diechu;
proto tomu hradu Přimda vzděchu.
In the above passage there are twelve lines that follow the preceding line's
rhyme; however, the number of unique rhyme patterns is seven, as the rhyme
-ieše returns several times. Hence, the two indexes differ significantly:
their values are 0.318 and 0.545 (inventory and homogeneity,
respectively).
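Both indexes can be computed mechanically once the rhyme of every line has been identified. The following Python sketch (our illustration; the representation of rhymes as strings is an assumption) reproduces the two figures of the worked example:

def rhyme_indexes(rhymes):
    """Inventory index: unique rhyme types per line.
    Homogeneity index: share of lines repeating the previous line's rhyme."""
    n = len(rhymes)
    inventory = len(set(rhymes)) / n
    homogeneity = sum(rhymes[i] == rhymes[i - 1] for i in range(1, n)) / n
    return inventory, homogeneity

# In the chapter 38 passage above: 22 lines, 7 unique rhymes, 12 repetitions,
# giving 7/22 = 0.318 and 12/22 = 0.545.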
Fig. 5 shows the homogeneity index for the whole Chronicle. Despite some
interesting outliers, a general increasing trend is again easily noticeable: the
opening chapters not only tend to use more rhyme types than the
final passages, but also avoid aggregating rhymes into longer clusters. To put it
simply, rhyme diversity and copiousness (traditionally associated with
elaborate written poetry) turn slowly into repetitiveness and formulaic
"dullness" (usually linked to archaic oral poetry).
Yet another simple test confirms these results: out of all the rhymes used in the
Chronicle we computed the relative frequencies of the most frequent ones,
dividing the number of a particular rhyme's occurrences by the total number of
verses in a chapter. The summarised frequencies of the five predominant groups of
rhymes are represented in Fig. 6, which shows very clearly that the variety of
rhyme patterns of the initial passages is systematically displaced by a limited
number of line endings. This also means a syntactic simplification of the final parts
of the Chronicle, since the overwhelming majority of rhymes are in fact a few
inflected morphemes of verbs (-echu|-ichu|-achu, -eše|-aše, -ati|-eti, -idu|-edu|
-adu) and a genitive singular morpheme for masculine nouns, adjectives and
pronouns (-eho|-oho).
Fig. 6: Relative frequency of the most frequent rhyme types in subsequent chapters of the
Chronicle.
Next come more sophisticated measures that may betray internal correlations
between subsequent lines of the Chronicle, namely the aforementioned time
series analysis techniques. The aim of the following tests is to examine whether
short lines are likely to be followed by long ones and vice versa (Hrabák 1959).
Certainly, the alternative hypothesis is that no statistically significant
correlations exist between the consecutive lines of the Chronicle. To test this
suggestion we first computed the autocorrelation function (ACF) for subsequent
line lengths, expressed in syllables. Next came the partial autocorrelation function
(PACF), followed by the stage of model estimation and evaluation (where applicable,
because most of the results proved insufficiently significant and thus
unsuitable for modelling). The whole procedure was replicated independently
for individual chapters.5
The results of the aforementioned procedures are interesting, yet
statistically not very significant. As evidenced by Fig. 7 (produced for chapter 1),
the autocorrelation function reveals no important values at lags higher than 1,
the only significant result appearing at lag 1. This means that a minor negative
correlation between any two neighbouring lines (i.e. at lag 1) can be observed in
the dataset. To interpret this fact in linguistic terms, any given verse is slightly
affected by the preceding one but does not depend at all on the broader
preceding context. The ACF applied to the remaining chapters
corroborates the above results. Still, the only value that turned out to be
significant is negative in some chapters (Fig. 7) and positive in others
(Fig. 8), and it generally shows substantial variance. It seemed feasible
therefore to extract the values for lag 1 throughout the consecutive chapters
and to represent them in one figure (Fig. 9). This procedure reveals a further
regularity; namely, the correlation between neighbouring lines slowly but
systematically increases. It means, once again, that the Chronicle becomes
somewhat more rhythmical in its last chapters despite – paradoxically – the
prose-like instability of the verse length.
In any attempt at measuring textual rhythmicity the most obvious type of
data that should be assessed is the sequence of marked and unmarked syllables
or other alternating linguistic units (such as metrical feet). In the case of
Dalimil’s Chronicle – or the Old Czech language in general – the features
responsible for rhythmicity are the sequence of stressed and unstressed syllables
on the one hand, and the sequence of long and short syllables on the other.
5 Since the ARIMA modelling is a rather time-consuming task, we decided to assess every
second chapter instead of computing the whole dataset. We believe, however, that this
approach gives us a good approximation of the (possible) regularities across chapters.
Fig. 9: Results of the autocorrelation function values (lag 1 only) for the consecutive chapters of
the Chronicle.
The results of the ACF applied to a sample chapter are shown in Fig. 10
(stress) and Fig. 11 (quantity). Both figures are rather straightforward to
interpret. The stress series contains one predominant and stable value at lag 1,
and no significant values at higher lags. This means a strong negative
correlation between any two adjacent syllables. This phenomenon obviously
reflects a predominant alternation of stressed and unstressed syllables in the
sample; it is not surprising to see the same pattern in all the remaining samples.
Needless to say, no stylistic shift could be observed across consecutive chapters.
These findings seem to indicate that stress alternation is a systemic language
phenomenon with no bearing on a stylistic analysis of the Chronicle, and that
the quantity alternation in a line of text does not generate any significant
information. Having said that, both measures should produce more interesting
results in a comparative analysis of poetic genres in modern Czech.
The quantity-based series reveals a fundamentally different behaviour of
the sequence of long and short syllables. The actual existence of quantity in the
Old Czech language seems to be totally irrelevant as a prosodic feature in
poetry. The values shown in Fig. 11 evidently suggest no autocorrelation at any
lag. It means that the time series is simply unpredictable; in linguistic terms,
there is no quantity-based rhythm in the dataset.
Fig. 10: Autocorrelation function (ACF) of a sequence of stressed and non-stressed syllables
from chapter 3.
Fig. 11: Autocorrelation function (ACF) of a sequence of long and short syllables from chapter 7.
Significantly enough, both prosodic features which one would associate with
the techniques of poetic composition – quantity and stress – proved to be
independent of any authorial choice and played no role in the gradual stylistic
shift of the Chronicle.
7 Conclusions
We have verified the presence of latent rhythmic patterns and have partially
corroborated the hypothesis advanced by Jerzy Woronczak in 1963. However,
the obtained results also bring to light further questions raised by our analysis
of the Chronicle. One of these questions concerns the role played by the underlying
oral culture in the process of text composition in the Middle Ages. Although in
Woronczak’s work there is no direct reference to the works by A. Lord, M. Parry,
W. Ong or E. Havelock, all of whom were involved in the study of orality in
human communication, the Polish scholar’s reasoning is largely consistent with
theirs. Woronczak agrees with the assumption that in oral literature subsequent
segments (to be transformed in future into verses of written texts) tend to keep a
repetitive and stable rhythm, cadenced by syllable stress, quantity, identical
metrical feet or verse length. The reason for this apparent monotony is that
illiterate singers of tales and their public appreciated rhythmicity. While the
former took advantage of its mnemonic properties that helped them remember
hundreds or even thousands of verses, the latter enjoyed its hypnotising power
during live performances. Variable and irregular segments marking ruptures of
rhythm occurred only in critical moments of the story.
However, the bare opposition orality vs. literacy does not suffice to explain
the quite complex (as it turns out) stylistic shift in the Chronicle. Without
rejecting the role of orality as one of the underlying factors here, we propose a
rather straightforward hypothesis that seems to explain the examined data
satisfactorily. Since the archaic eight-syllable verse type was used in the initial
chapters of the Dalimilova Kronika, it is plausible that this part of the Chronicle
was inspired by the oral poetry of the author’s predecessors (or even copied in
some passages). When the author took over some elements of the oral heritage
of his/her time, he probably wished to keep the original syllabic structure,
rhymes and line lengths. This was possible – however difficult – for a while, yet
the poetic form eventually proved too demanding. At some point, as we surmise,
the author realised that creating or supplementing history would require more
and more aesthetic compromises; thus, he put less effort into the poetic form of
the text’s final chapters.
Certainly, other explanations cannot be ruled out. The most convincing one
is the hypothesis of collaborative authorship of the Chronicle. However, had
there been two (or more) authors, a sudden stylistic break somewhere in the
Chronicle, rather than a gradual shift from poetry to prose, would be more
likely.
Hence, the results of our study go far beyond the medieval universum or routine
stylometric research: they partake in the debate concerning the questions
raised by the opposition between orality and literacy in culture, as well as the
opposition between fact and fiction in historical narrative, omnipresent in the
text-audio-visual space. Our argument is all the more compelling in that it relies
on empirical data and on a solid, quantitative methodology.
Source Texts
Jireček, Josef (ed.). 1882. Rýmovaná kronika česká tak řečeného Dalimila. In Prameny dějin
českých 3(1). Praha.
Daňhelka, Jiří, Karel Hádek, Bohuslav Havránek & Naděžda Kvítková (eds.). 1988. Staročeská
kronika tak řečeného Dalimila: Vydání textu a veškerého textového materiálu. Vol. 1–2.
Praha: Academia.
References
Box, George E.P. & Gwilym M. Jenkins. 1976. Time series analysis: Forecasting and control. San
Francisco: Holden-Day.
Cleveland, William S. 1979. Robust locally weighted regression and smoothing scatterplots.
Journal of the American Statistical Association 74(368). 829–836.
Cryer, Jonathan. 1986. Time series analysis. Boston: Duxbury Press.
Eder, Maciej. 2008. How rhythmical is hexameter: A statistical approach to ancient epic poetry.
Digital Humanities 2008: Book of Abstracts, 112–114. Oulu: University of Oulu.
Glass, Gene V., Victor L. Wilson & John M. Gottman. 1975. Design and analysis of time-series
experiments. Boulder (CO): Colorado Associated University Press.
Gottman, John M. 1981. Time-series analysis: A comprehensive introduction for social
scientists. Cambridge: Cambridge University Press.
Gottman, John M. 1990. Sequential analysis. Cambridge: Cambridge University Press.
Gregson, Robert A.M. 1983. Time series in psychology. Hillsdale (NJ): Lawrence Erlbaum
Associates.
Herdan, Gustav. 1966. The advanced theory of language as choice and chance. Berlin:
Springer.
Hrabák, Josef. 1959. Studie o českém verši. Praha: SPN.
Köhler, Reinhard. 2005. Synergetic linguistics. In Reinhard Köhler, Gabriel Altmann & Rajmund
G. Piotrowski (eds.), Quantitative linguistics. An international handbook, 760–774. Berlin
& New York: de Gruyter.
McCleary, Richard & Richard A. Hay. 1980. Applied time series analysis for the social sciences.
Beverly Hills: Sage Publications.
Pawłowski, Adam. 1998. Séries temporelles en linguistique. Avec application à l’attribution de
textes: Romain Gary et Émile Ajar. Paris & Genève: Champion-Slatkine.
Pawłowski, Adam. 2003. Sequential analysis of versified texts in fixed- and free-accent
languages: Example of Polish and Russian. In Lew N. Zybatow (ed.), Europa der Sprachen:
Sprachkompetenz – Mehrsprachigkeit – Translation. Teil II: Sprache und Kognition, 235–
246. Frankfurt/M.: Peter Lang.
Pawłowski, Adam. 2005. Modelling of the sequential structures in text. In Reinhard Köhler,
Gabriel Altmann & Rajmund G. Piotrowski (eds.), Quantitative linguistics. An international
handbook, 738–750. Berlin & New York: de Gruyter.
Pawłowski, Adam. 2006. Chronological analysis of textual data from the “Wrocław corpus of
Polish”. Poznań Studies in Contemporary Linguistics 41. 9–29.
Pawłowski, Adam & Maciej Eder. 2001. Quantity or stress? Sequential analysis of Latin
prosody. Journal of Quantitative Linguistics 8(1). 81–97.
Pawłowski, Adam, Marek Krajewski & Maciej Eder. 2010. Time series modelling in the analysis
of Homeric verse. Eos 47(1). 79–100.
Woronczak, Jerzy. 1963. Zasada budowy wiersza “Kroniki Dalimila” [Principle of verse building
in the “Kronika Dalimila”]. Pamiętnik Literacki 54(2). 469–478.
Appendix
A series of discrete observations x_i, representing the realisations of a random
variable X_t, is called a time series if it can be spanned over a longitudinal axis
(corresponding to time or any other sequential quantity). The mean of the
series is estimated by:
m_x = (1/N) Σ_{t=1}^{N} x_t (3)

The autocorrelation function at lag k is defined as

ρ_k = γ_k / γ_0 = γ_k / σ_x² (8)

where γ_k is the autocovariance function (for k = 0, γ_0 = σ_x²) and k is the lag.
ACF is estimated by the following formula:

r_k = c_k / c_0 = c_k / s_x² (9)

where c_k is the sample autocovariance at lag k and s_x² is the sample variance.
ACF is the most important parameter in time series analysis. If it is non-random,
the estimation of stochastic models is justified and purposeful. If this is not the
case, there is no deterministic component (metaphorically called “memory”) in
the data.
There are three basic models of time series, as well as many complex
models. As no complex models were discovered in the Chronicle of Dalimil, only
the simple ones will be presented.
A random process consists of independent realisations of a variable X_t =
{e_1, e_2, …}. It is characterised by zero autocovariance and zero autocorrelation.
By analogy to the spectrum of light, normally distributed values e_t are called
white noise.
An autoregressive process of order p models each value x_t as a linear
combination of the p preceding values plus a noise term e_t; together with a
moving-average component of order q over the preceding noise terms, the
subsequent x_t values are defined as:

x_t = a_1·x_{t−1} + a_2·x_{t−2} + … + a_p·x_{t−p} + e_t − b_1·e_{t−1} − b_2·e_{t−2} − … − b_q·e_{t−q} (12)
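As an illustration (our addition, not part of the original appendix), the estimator in (9) can be written out directly in a few lines of Python; applied to simulated white noise, it returns values close to zero for all k ≥ 1:

import numpy as np

def acf_hat(x, k):
    x = np.asarray(x, dtype=float)
    m = x.mean()                      # m_x, as in (3)
    c0 = ((x - m) ** 2).mean()        # c_0 = s_x^2
    if k == 0:
        return 1.0
    # sample autocovariance c_k (divided by N, one common convention)
    ck = ((x[:-k] - m) * (x[k:] - m)).sum() / len(x)
    return ck / c0                    # r_k = c_k / c_0, as in (9)

e = np.random.default_rng(0).normal(size=500)   # white noise
print(round(acf_hat(e, 1), 3))                  # close to 0: no "memory"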
The type of the model is determined by the behaviour of the ACF and PACF
(partial autocorrelation) functions, as their values may either decrease slowly or
cut off abruptly (cf. Pawłowski 2001: 71). It should be emphasized, however,
that in many cases of text analysis the ACF alone is quite sufficient to analyse
and linguistically interpret the data. This is precisely the case in the present
study of the text of Dalimil’s Chronicle.
Taraka Rama and Lars Borin
Comparative Evaluation of String Similarity
Measures for Automatic Language
Classification
1 Introduction
Historical linguistics, the oldest branch of modern linguistics, deals with lan-
guage-relatedness and language change across space and time. Historical lin-
guists apply the widely-tested comparative method (Durie and Ross, 1996) to
establish relationships between languages, to posit a language family, and to
reconstruct the proto-language for that family.1 Although historical
linguistics has parallel origins with biology (Atkinson and Gray, 2005), unlike the
biologists, mainstream historical linguists have seldom been enthusiastic about
using quantitative methods for the discovery of language relationships or
investigating the structure of a language family, except for Kroeber and Chrétien
(1937) and Ellegård (1959). A short period of enthusiastic application of
quantitative methods initiated by Swadesh (1950) ended with the heavy criticism
levelled against it by Bergsland and Vogt (1962). The field of computational
historical linguistics did not receive much attention again until the beginning of
the 1990s, with the exception of two noteworthy doctoral dissertations, by
Sankoff (1969) and Embleton (1986).
In traditional lexicostatistics, as introduced by Swadesh (1952), distances
between languages are based on human expert cognacy judgments of items in
standardized word lists, e.g., the Swadesh lists (Swadesh, 1955). In the termi-
nology of historical linguistics, cognates are related words across languages that
can be traced directly back to the proto-language. Cognates are identified
through regular sound correspondences. Sometimes cognates have similar sur-
face forms and related meanings. Examples of such revealing cognates are:
English ~ German night ~ Nacht ‘night’ and hound ~ Hund ‘dog’. If a word has
undergone many changes, then the relatedness is not obvious from visual in-
spection and one needs to look into the history of the word to understand exactly
1 The Indo-European family is a classical case of the successful application of the comparative
method, which establishes a tree relationship between some of the most widely spoken
languages in the world.
the sound changes which resulted in the synchronic form. For instance, English ~
Hindi wheel ~ chakra ‘wheel’ are cognates and can be traced back to the
Proto-Indo-European root *kʷekʷlo-.
Recently, some researchers have turned to approaches more amenable to
automation, hoping that large-scale lexicostatistical language classification will
thus become feasible. The ASJP (Automated Similarity Judgment Program) pro-
ject2 represents such an approach, where automatically estimated distances
between languages are provided as input to phylogenetic programs originally
developed in computational biology (Felsenstein, 2004), for the purpose of
inferring genetic relationships among organisms.
As noted above, traditional lexicostatistics assumes that the cognate judg-
ments for a group of languages have been supplied beforehand. Given a stand-
ardized word list, consisting of 40–100 items, the distance between a pair of
languages is defined as the percentage of shared cognates subtracted from 100%.
This procedure is applied to all pairs of languages under consideration, to
produce a pairwise inter-language distance matrix. This inter-language distance
matrix is then supplied to a tree-building algorithm such as Neighbor-Joining (NJ;
Saitou and Nei, 1987) or a clustering algorithm such as Unweighted Pair Group
Method with Arithmetic Mean (UPGMA; Sokal and Michener, 1958) to infer a tree
structure for the set of languages. Swadesh (1950) applies essentially this method
– although completely manually – to the Salishan languages. The resulting
“family tree” is reproduced in figure 1.
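As a toy illustration of this procedure (with invented language names and cognate counts; only the distance computation is sketched, not the tree building):

# Hypothetical shared-cognate counts for a 100-item standardized word list.
shared_cognates = {("A", "B"): 72, ("A", "C"): 35, ("B", "C"): 38}
LIST_LENGTH = 100

def lexicostatistical_distance(pair):
    # percentage of shared cognates subtracted from 100%
    return 100.0 - 100.0 * shared_cognates[pair] / LIST_LENGTH

matrix = {pair: lexicostatistical_distance(pair) for pair in shared_cognates}
print(matrix)   # this distance matrix would then be fed to NJ or UPGMA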
The crucial element in these automated approaches is the method used for
determining the overall similarity between two word lists.3 Often, this is some
variant of the popular edit distance or Levenshtein distance (LD; Levenshtein,
1966). LD for a pair of strings is defined as the minimum number of symbol
(character) additions, deletions and substitutions needed to transform one string
into the other. A modified LD (called LDND) is used by the ASJP consortium, as
reported in their publications (e.g., Bakker et al. 2009 and Holman et al. 2008).
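For concreteness, a textbook dynamic-programming implementation of LD is sketched below, together with the length-normalized variant LDN (normalized here by the longer word; the LDND used by ASJP additionally divides by the average distance of different-meaning word pairs, which is omitted in this sketch):

def levenshtein(a, b):
    # minimum number of additions, deletions and substitutions
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, 1):
        cur = [i]
        for j, cb in enumerate(b, 1):
            cur.append(min(prev[j] + 1,                 # deletion
                           cur[j - 1] + 1,              # addition
                           prev[j - 1] + (ca != cb)))   # substitution
        prev = cur
    return prev[-1]

def ldn(a, b):
    return levenshtein(a, b) / max(len(a), len(b))

print(levenshtein("night", "nacht"), ldn("hound", "hund"))   # 2, 0.2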
2 http://email.eva.mpg.de/~wichmann/ASJPHomePage.htm
3 At this point, we use “word list” and “language” interchangeably. Strictly speaking, a
language, as identified by its ISO 639-3 code, can have as many word lists as it has recognized
(described) varieties, i.e., doculects (Nordhoff and Hammarström, 2011).
2 Related Work
Cognate identification and tree inference are closely related tasks in historical
linguistics. Considering each task as a computational module would mean that
each cognate set identified across a set of tentatively related languages feeds into
the refinement of the tree inferred at each step. In a critical article, Nichols (1996)
points out that the historical linguistics enterprise, since its beginning, always
used a refinement procedure to posit relatedness and tree structure for a set of
tentatively related languages.4 The inter-language distance approach to tree-
building is incidentally straightforward and comparably accurate in comparison
to the computationally intensive Bayesian-based tree-inference approach of
Greenhill and Gray (2009).5
The inter-language distances are either an aggregate score of the pairwise
item distances or based on a distributional similarity score. The string similarity
measures used for the task of cognate identification can also be used for compu-
ting the similarity between two lexical items for a particular word sense.
4 This idea is quite similar to the well-known Expectation-Maximization paradigm in machine
learning. Kondrak (2002b) employs this paradigm for extracting sound correspondences by
pairwise comparisons of word lists for the task of cognate identification. A recent paper by
Bouchard-Côté et al. (2013) employs a feed-back procedure for the reconstruction of Proto-
Austronesian with great success.
5 For a comparison of these methods, see Wichmann and Rama, 2014.
6 SED is defined as the edit distance normalized by the average of the lengths of the pair of
strings.
European languages prepared by Dyen et al. (1992). The language distance matrix
is then given as input to the NJ algorithm – as implemented in the PHYLIP
package (Felsenstein, 2002) – to infer a tree for 87 Indo-European languages.
They make a qualitative evaluation of the inferred tree against the standard Indo-
European tree.
7 The dataset contains 200-word Swadesh lists for 95 language varieties. Available at
http://www.wordgumbo.com/ie/cmp/index.htm.
8 A tree building algorithm closely related to NJ.
9 One reason for this may be that the experiments are computationally demanding, requiring
several days for computing a single measure over the whole ASJP dataset.
10 A longer list of string similarity measures is available at
http://www.coli.uni-saarland.de/courses/LT1/2011/slides/stringmetrics.pdf
11 WALS does not provide a classification for all the languages of the world. The ASJP
consortium gives a WALS-like classification to all the languages present in their database.
4.1 Database
The ASJP database offers a readily available, if minimal, basis for massive cross-
linguistic investigations. The ASJP effort began with a small dataset of 100-word
lists for 245 languages. These languages belong to 69 language families. Since its
first version presented by Brown et al. (2008), the ASJP database has been going
through a continuous expansion, to include in the version used here (v. 14,
released in 2011)12 more than 5500 word lists representing close to half the
languages spoken in the world (Wichmann et al., 2011). Because of the findings
reported by Holman et al. (2008), the later versions of the database aimed to cover
only the most stable 40-item Swadesh sublist rather than the full 100-item list.
Each lexical item in an ASJP word list is transcribed in a broad phonetic
transcription known as ASJP Code (Brown et al., 2008). The ASJP code consists of
34 consonant symbols, 7 vowels, and four modifiers (∗, ”, ∼, $), all rendered by
characters available on the English version of the QWERTY keyboard. Tone,
stress, and vowel length are ignored in this transcription format. The three
modifiers combine symbols to form phonologically complex segments (e.g.,
aspirated, glottalized, or nasalized segments).
In order to ascertain that our results would be comparable to those published
by the ASJP group, we successfully replicated their experiments for LDN and
LDND measures using the ASJP program and the ASJP dataset version 12
(Wichmann et al., 2010b).13 This database comprises reduced (40-item) Swadesh
lists for 4169 linguistic varieties. All pidgins, creoles, mixed languages, artificial
languages, proto-languages, and languages extinct before 1700 CE were ex-
cluded from the experiment, as were language families represented by fewer than 10
word lists (Wichmann et al., 2010a),14 as well as word lists containing fewer than
28 words (70% of 40).
This leaves a dataset with 3730 word lists. It turned out that an additional 60
word lists did not have English glosses for the items, which meant that they could
12 The latest version is v. 16, released in 2013.
13 The original python program was created by Hagen Jung. We modified the program to handle
the ASJP modifiers.
14 The reason behind this decision is that correlations resulting from smaller samples (less than
40 language pairs) tend to be unreliable.
not be processed by the program, so these languages were also excluded from the
analysis.
All the experiments reported in this paper were performed on a subset of
version 14 of the ASJP database whose language distribution is shown in figure
2.15 The database has 5500 word lists. The same selection principles that were
used for version 12 (described above) were applied for choosing the languages to
be included in our experiments. The final dataset for our experiments has 4743
word lists for 50 language families. We use the family names of the WALS
(Haspelmath et al., 2011) classification.
15 Available for downloading at http://email.eva.mpg.de/~wichmann/listss14.zip.
An internal node in the tree is not necessarily binary. For instance, the
Dravidian language family has four branches emerging from the top node (see
figure 3 for the Ethnologue family tree of Dravidian languages).
Table 1: Distribution of language families in the ASJP database. WN and WLs stand for WALS
Name and Word Lists.
5 Similarity measures
For the experiments described below, we have considered both string similarity
measures and distributional measures for computing the distance between a pair
of languages. As mentioned earlier, string similarity measures work at the level
of word pairs and provide an aggregate score of the similarity between word pairs
whereas distributional measures compare the n-gram profiles between a
language pair to yield a distance score.
The different string similarity measures for a word pair that we have investigated
are the following (a short code sketch of several of them follows the list):
– IDENT returns 1 if the words are identical, otherwise it returns 0.
– PREFIX returns the length of the longest common prefix divided by the length
of the longer word.
– DICE is defined as the number of shared bigrams divided by the total number
of bigrams in both the words.
– LCS is defined as the length of the longest common subsequence divided by
the length of the longer word (Melamed, 1999).
– TRIGRAM is defined in the same way as DICE but uses trigrams for computing
the similarity between a word pair.
– XDICE is defined in the same way as DICE but uses “extended bigrams”,
which are trigrams without the middle letter (Brew and McKelvie, 1996).
– Jaccard’s index, JCD, is a set cardinality measure, defined as the ratio of the
number of shared bigrams between the two words to the size of the union of
their bigram sets.
– LDN, as defined above.
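The following minimal reference implementations (ours, using simple set-based bigrams) illustrate three of these measures; the corresponding distance variants are one minus the similarities:

def bigrams(w):
    return {w[i:i + 2] for i in range(len(w) - 1)}

def dice(a, b):
    # shared bigrams over the total bigrams in both words, as defined above
    # (some formulations double the shared count)
    return len(bigrams(a) & bigrams(b)) / (len(bigrams(a)) + len(bigrams(b)))

def prefix(a, b):
    # longest common prefix over the length of the longer word
    p = 0
    while p < min(len(a), len(b)) and a[p] == b[p]:
        p += 1
    return p / max(len(a), len(b))

def jcd(a, b):
    # Jaccard's index over the bigram sets
    return len(bigrams(a) & bigrams(b)) / len(bigrams(a) | bigrams(b))

print(dice("night", "nacht"), prefix("night", "nacht"), jcd("night", "nacht"))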
N-gram similarity measures are inspired by a line of work initially pursued in the
context of information retrieval, aiming at automatic language identification in a
multilingual document. Cavnar and Trenkle (1994) used character n-grams for
text categorization. They observed that different document categories –
including documents in different languages – have characteristic character n-
gram profiles. The rank of a character n-gram varies across different categories,
and documents belonging to the same category have similar character n-gram
Zipfian distributions.
16 Lin (1998) investigates three distance-to-similarity conversion techniques and motivates the
results from an information-theoretical point of view. In this article, we do not investigate the
effects of similarity-to-distance conversion. Rather, we stick to the traditional conversion
technique.
17 Thus, the resulting distance is not a true distance metric.
Building on this idea, Dunning (1994, 1998) postulates that each language has its
own signature character (or phoneme; depending on the level of transcription) n-
gram distribution. Comparing the character n-gram profiles of two languages can
yield a single point distance between the language pair. The comparison
procedure is usually accomplished through the use of one of the distance
measures given in Singh 2006. The following steps are followed for extracting the
phoneme n-gram profile for a language (a code sketch follows the list):
– An n-gram is defined as a sequence of consecutive phonemes in a window of
size N. The value of N usually ranges from 1 to 5.
– All n-grams are extracted for a lexical item. This step is repeated for all the
lexical items in a word list.
– All the extracted n-grams are mixed and sorted in the descending order of
their frequency. The relative frequency of the n-grams is computed.
– Only the top G n-grams are retained and the rest of them are discarded. The
value of G is determined empirically.
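These steps translate almost line by line into code; the sketch below (ours, with N and G as the tunable parameters) extracts such a profile from a word list given as a plain list of transcribed items:

from collections import Counter

def ngram_profile(word_list, n=2, top_g=50):
    counts = Counter()
    for item in word_list:
        # all n-grams of one lexical item, repeated for the whole list
        counts.update(item[i:i + n] for i in range(len(item) - n + 1))
    total = sum(counts.values())
    # keep only the top-G n-grams, with their relative frequencies
    return {g: c / total for g, c in counts.most_common(top_g)}

profile = ngram_profile(["nakt", "hund", "vasar"], n=2, top_g=10)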
For a language pair, the n-gram profiles can be compared using one of the
following distance measures (two of them are sketched in code after the list):
1. Out-of-Rank measure is defined as the aggregate sum of the absolute differ-
ences in the ranks of the shared n-grams between a pair of languages. If an
n-gram is not shared between the two profiles, its rank difference is assigned
a maximum out-of-place score.
2. Jaccard’s index is a set cardinality measure. It is defined as the ratio of the
cardinality of the intersection of the n-grams of the two languages to the
cardinality of their union.
3. Dice distance is related to Jaccard’s Index. It is defined as the ratio of twice
the number of shared n-grams to the total number of n-grams in both the
language profiles.
4. Manhattan distance is defined as the sum of the absolute difference between
the relative frequency of the shared n-grams.
5. Euclidean distance is defined in a similar fashion to Manhattan distance
where the individual terms are squared.
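Two of these comparisons, sketched over profiles in the dictionary form returned by the ngram_profile function above (shared keys playing the role of shared n-grams):

def manhattan(p, q):
    # sum of absolute differences of relative frequencies of shared n-grams
    return sum(abs(p[g] - q[g]) for g in p.keys() & q.keys())

def dice_profiles(p, q):
    # twice the shared n-grams over the total n-grams in both profiles
    return 2 * len(p.keys() & q.keys()) / (len(p) + len(q))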
While replicating the original ASJP experiments on the version 12 ASJP database,
we also tested whether the above distributional measures (1–5) perform as well as LDN.
Unfortunately, the results were not encouraging, and we did not repeat the
experiments on version 14 of the database. One main reason for this result is the
relatively small size of the ASJP concept list, which provides a poor estimate of
the true language signatures.
This factor speaks equally, or even more, against including another class of n-
gram-based measures, namely information-theoretic measures such as cross
entropy and KL-divergence. These measures have been well-studied in natural
language processing tasks such as machine translation, natural language
parsing, sentiment identification, and also in automatic language identification.
However, the probability distributions required for using these measures are
usually estimated through maximum likelihood estimation, which requires a fairly
large amount of data, and the short ASJP concept lists will hardly qualify in this
regard.
6 Evaluation measures
The measures which we have used for evaluating the performance of string
similarity measures given in section 5 are the following three:
1. dist was originally suggested by Wichmann et al. (2010a), and tests if LDND
is better than LDN at the task of distinguishing related languages from un-
related languages.
2. RW is a special case of Pearson’s r – called point-biserial correlation (Tate,
1954) – and computes the agreement between the intra-family pairwise distances
and the WALS classification for the family.
3. γ is related to Goodman and Kruskal’s Gamma (1954) and measures the
strength of association between two ordinal variables. In this paper, it is used
to compute the level of agreement between the pairwise intra-family
distances and the family’s Ethnologue classification.
The dist measure for a family consists of three components: the mean of the
pairwise distances inside a language family (din); the mean of the pairwise
distances from each language in the family to the languages of all other families
(dout); and sdout, the standard deviation of all the pairwise distances used to
compute dout. Finally, dist is defined as (din−dout)/sdout. The resistance of a string
similarity measure to chance similarities with other language families is reflected
by the value of sdout.
A comparatively higher dist value suggests that a string similarity measure is
particularly resistant to random similarities between unrelated languages and
performs well at distinguishing languages belonging to the same language family
from languages in other language families.
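Under these definitions, dist is straightforward to compute; a sketch (ours, assuming a precomputed table d mapping a frozenset of two language names to their distance):

import statistics

def dist_for_family(family, others, d):
    pairs_in = [d[frozenset((a, b))] for a in family for b in family if a < b]
    pairs_out = [d[frozenset((a, b))] for a in family for b in others]
    d_in = statistics.mean(pairs_in)      # mean intra-family distance
    d_out = statistics.mean(pairs_out)    # mean distance to other families
    sd_out = statistics.stdev(pairs_out)  # its standard deviation
    return (d_in - d_out) / sd_out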
The WALS database provides a three-level classification. The top level is the
language family, the second level is the genus and the lowest level is the
language itself. If two languages belong to different families, then the distance is
3. Two languages that belong to different genera in the same family have a
distance of 2. If the two languages fall in the same genus, they have a distance of
1. This allows us to define a distance matrix for each family based on WALS. The
WALS distance matrix can be compared to the distance matrices of any string
similarity measure using point biserial correlation – a special case of Pearson’s r.
If a family has a single genus in the WALS classification, RW is not computed
and the corresponding row for the family is left empty in table 7.
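This three-level scheme reduces to a small function; a minimal sketch with languages represented as hypothetical (family, genus) pairs:

def wals_distance(lang1, lang2):
    (fam1, gen1), (fam2, gen2) = lang1, lang2
    if fam1 != fam2:
        return 3    # different families
    if gen1 != gen2:
        return 2    # same family, different genera
    return 1        # same genus

print(wals_distance(("IE", "Germanic"), ("IE", "Slavic")))   # -> 2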
Given a distance-matrix d of order N × N, where each cell dij is the distance be-
tween two languages i and j; and an Ethnologue tree E, the computation of γ for
a language family is defined as follows:
1. Enumerate all the triplets for a language family of size N. A triplet t for a
language family is defined as {i, j, k}, where i ≠ j ≠ k are languages belonging
to the family. A language family of size N has N(N−1)(N−2)/6 triplets.
2. For the members of each such triplet t, there are three lexical distances dij ,
dik, and djk. The expert classification tree E can treat the three languages {i, j,
k} in four possible ways (| denotes a partition): {i, j | k}, {i, k | j}, {j, k | i} or can
have a tie where all languages emanate from the same node. All ties are
ignored in the computation of γ.18
3. A distance triplet dij , dik, and djk is said to agree completely with an Ethno-
logue partition {i, j | k} when dij < dik and dij < djk. A triplet that satisfies these
conditions is counted as a concordant comparison, C; else it is counted as a
discordant comparison, D.
18 We do not know what a tie in the gold standard indicates: uncertainty in the classification,
or a genuine multi-way branching? Whenever the Ethnologue tree of a family is completely
unresolved, this is shown by an empty row. For example, the family tree of the Bosavi languages
is a star structure. Hence, the corresponding row in table 6 is left empty.
4. Steps 2 and 3 are repeated for all the triplets to yield γ for a family, defined as
γ = (C−D)/(C+D). γ lies in the range [−1, 1], where a score of −1 indicates perfect
disagreement and a score of +1 indicates perfect agreement.
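A direct sketch of steps 1–4 (ours; the helper partition(i, j, k), returning the index pair grouped together by the Ethnologue tree or None for a tie, is hypothetical):

from itertools import combinations

def gamma(d, n_languages, partition):
    C = D = 0
    for i, j, k in combinations(range(n_languages), 3):
        grouped = partition(i, j, k)
        if grouped is None:               # ties are ignored
            continue
        a, b = grouped                    # the two languages grouped together
        c = ({i, j, k} - {a, b}).pop()    # the odd one out
        # concordant iff the grouped pair is strictly the closest
        if d[a][b] < d[a][c] and d[a][b] < d[b][c]:
            C += 1
        else:
            D += 1
    return (C - D) / (C + D)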
At this point, one might wonder about the decision for not using an off-the-
shelf tree-building algorithm to infer a tree and compare the resulting tree with
the Ethnologue classification. Although both Pompei et al. (2011) and Huff and
Lonsdale (2011) compare their inferred trees – based on Neighbor-Joining and
Minimum Evolution algorithms – to Ethnologue trees using cleverly crafted tree-
distance measures (GRF and GQD), they do not make the more intuitively useful
direct comparison of the distance matrices to the Ethnologue trees. The tree
inference algorithms use heuristics to find the best tree from the available tree
space. The number of possible rooted, non-binary and unlabeled trees is quite
large even for a language family of size 20 – about 256 × 10^6.
A tree inference algorithm uses heuristics to reduce the tree space to find the
best tree that explains the distance matrix. A tree inference algorithm can make
mistakes while searching for the best tree. Moreover, there are many variations
of Neighbor-Joining and Minimum Evolution algorithms.19 Ideally, one would
have to test the different tree inference algorithms and then decide the best one
for our task. However, the focus of this paper rests on the comparison of different
string similarity algorithms and not on tree inference algorithms. Hence, a direct
comparison of a family’s distance matrix to the family’s Ethnologue tree
circumvents the choice of the tree inference algorithm.
19 http://www.atgc-montpellier.fr/fastme/usersguide.php
adopted for deciding upon a measure? We address these questions through the
following procedure:
1. For a column i, sort the average scores s in descending order.
2. For each row index 1 ≤ r ≤ 16, test the significance of s_r ≥ s_{r+1} through a sign test
(Sheskin, 2003). This test yields a p-value.
Table 2: Average results for each string similarity measure across the 50 families. The rows are
sorted by the name of the measure.

        Dist    RW
γ       0.30    0.95
Dist            0.32
The above procedure ensures that the chance of incorrectly rejecting a null
hypothesis is 1 in 20 for α = 0.05 and 1 in 100 for α = 0.01. In this experimental
context, this suggests that we erroneously reject 0.75 true null hypotheses out of
15 hypotheses for α = 0.05 and 0.15 hypotheses for α = 0.01. We report Dist, γ,
and RW for each family in tables 5, 6, and 7. In each of these tables, only those
measures which are above the average scores from table 2 are reported.
The FDR procedure for γ suggests that no sign test is significant. This is in
agreement with the result of Wichmann et al. (2010a), who showed that the choice
of LDN or LDND is quite unimportant for the task of internal classification. The
FDR procedure for RW suggests that LDN > LCS, LCS > PREFIXD, DICE > JCD, and
JCD > JCDD. Here A > B denotes that A is significantly better than B. The FDR
procedure for Dist suggests that JCDD > JCD, JCD > TRID, DICED > IDENTD, LDND
> LCSD, and LCSD > LDN.
The results point towards an important direction in the task of building
computational systems for automatic language classification. The pipeline for
such a system consists of (1) distinguishing related languages from unrelated
ones; and (2) internal classification of the related languages. JCDD performs the best with
respect to Dist. Further, JCDD is derived from JCD and can be computed in O(m +
n), for two strings of length m and n. In comparison, LDN is in the order of O(mn).
In general, the computational complexity for computing distance between two
word lists for all the significant measures is given in table 4. Based on the
computational complexity and the significance scores, we propose that JCDD be
used for step 1 and a measure like LDN be used for internal classification.
Table 4: Computational complexity of the top performing measures for computing the distance
between two word lists, each of length l. m and n denote the lengths of a word pair w_a and
w_b, and C = l(l − 1)/2.

Measure    Complexity
JCDD       C · O(m + n + min(m − 1, n − 1))
JCD        l · O(m + n + min(m − 1, n − 1))
LDND       C · O(mn)
LDN        l · O(mn)
PREFIXD    C · O(max(m, n))
LCSD       C · O(mn)
LCS        l · O(mn)
DICED      C · O(m + n + min(m − 2, n − 2))
DICE       l · O(m + n + min(m − 2, n − 2))
8 Conclusion
In this article, we have presented the first known attempt to apply more than 20
different similarity (or distance) measures to the problem of genetic classification
of languages on the basis of Swadesh-style core vocabulary lists. The experiments
were performed on the wide-coverage ASJP database (about half the world’s
languages).
We have examined the various measures at two levels, namely: (1) their ca-
pability of distinguishing related and unrelated languages; and (2) their perfor-
mance as measures for internal classification of related languages. We find that
the choice of string similarity measure (among the tested pool of measures) is not
very important for the task of internal classification whereas the choice affects
the results of discriminating related languages from unrelated ones.
Acknowledgments
The authors thank Søren Wichmann, Eric W. Holman, Harald Hammarström, and
Roman Yangarber for useful comments which have helped us to improve the
presentation. The string similarity experiments have been made possible through
the use of ppss software20 recommended by Leif-Jöran Olsson. The first author
would like to thank Prasant Kolachina for the discussions on parallel
implementations in Python. The work presented here was funded in part by the
Swedish Research Council (the project Digital areal linguistics; contract no. 2009-
1448).
20 http://code.google.com/p/ppss/
References
Atkinson, Quentin D. & Russell D. Gray. 2005. Curious parallels and curious connections —
phylogenetic thinking in biology and historical linguistics. Systematic Biology 54(4). 513–
526.
Bakker, Dik, André Müller, Viveka Velupillai, Søren Wichmann, Cecil H. Brown, Pamela Brown,
Dmitry Egorov, Robert Mailhammer, Anthony Grant & Eric W. Holman. 2009. Adding
typology to lexicostatistics: A combined approach to language classification. Linguistic
Typology 13(1). 169–181.
Benjamini, Yoav & Yosef Hochberg. 1995. Controlling the false discovery rate: A practical and
powerful approach to multiple testing. Journal of the Royal Statistical Society B 57(1).
289–300.
Bergsland, Knut & Hans Vogt. 1962. On the validity of glottochronology. Current Anthropology
3(2). 115–153.
Bouchard-Côté, Alexandre, David Hall, Thomas L. Griffiths & Dan Klein. 2013. Automated
reconstruction of ancient languages using probabilistic models of sound change.
Proceedings of the National Academy of Sciences 110(11). 4224–4229.
Brew, Chris & David McKelvie. 1996. Word-pair extraction for lexicography. In Kemal Oflazer &
Harold Somers (eds.), Proceedings of the Second International Conference on New
Methods in Language Processing, 45–55. Ankara.
Brown, Cecil H., Eric W. Holman, Søren Wichmann, & Viveka Velupillai. 2008. Automated
classification of the world’s languages: A description of the method and preliminary
results. Sprachtypologie und Universalienforschung 61(4). 285–308.
Cavnar, William B. & John M. Trenkle. 1994. N-gram-based text categorization. In Proceedings
of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval,
161–175. Las Vegas (NV): UNLV Publications.
Christiansen, Chris, Thomas Mailund, Christian N. S. Pedersen, Martin Randers, & Martin S.
Stissing. 2006. Fast calculation of the quartet distance between trees of arbitrary
degrees. Algorithms for Molecular Biology 1. Article No. 16.
Dryer, Matthew S. 2000. Counting genera vs. counting languages. Linguistic Typology 4. 334–
350.
Dryer, Matthew S. & Martin Haspelmath. 2013. The world atlas of language structures online.
Leipzig: Max Planck Institute for Evolutionary Anthropology.
http://wals.info (accessed on 26 November 2014).
Dunning, Ted E. 1994. Statistical identification of language. Technical Report CRL MCCS-94-
273. Las Cruces (NM): New Mexico State University, Computing Research Lab.
Dunning, Ted E. 1998. Finding structure in text, genome and other symbolic sequences.
Sheffield: University of Sheffield.
Durie, Mark & Malcolm Ross (eds.). 1996. The comparative method reviewed: Regularity and
irregularity in language change. Oxford & New York: Oxford University Press.
Dyen, Isidore, Joseph B. Kruskal, & Paul Black. 1992. An Indo-European classification: A
lexicostatistical experiment. Transactions of the American Philosophical Society 82(5). 1–
132.
Ellegård, Alvar. 1959. Statistical measurement of linguistic relationship. Language 35(2). 131–
156.
Ellison, T. Mark & Simon Kirby. 2006. Measuring language divergence by intra-lexical
comparison. In Proceedings of the 21st International Conference on Computational
Linguistics and 44th Annual Meeting of the Association for Computational Linguistics,
pages 273–280. Association for Computational Linguistics.
http://www.aclweb.org/anthology/P06-1035 (accessed 27 November 2014).
Embleton, Sheila M. 1986. Statistics in historical linguistics (Quantitative Linguistics 30).
Bochum: Brockmeyer.
Felsenstein, Joseph. 2002. PHYLIP (phylogeny inference package) version 3.6 a3. Distributed
by the author. Seattle (WA): University of Washington, Department of Genome Sciences.
Felsenstein, Joseph. 2004. Inferring phylogenies. Sunderland (MA): Sinauer Associates.
Goodman, Leo A. & William H. Kruskal. 1954. Measures of association for cross classifications.
Journal of the American Statistical Association 49(268). 732–764.
Greenhill, Simon J. & Russell D. Gray. 2009. Austronesian language phylogenies: Myths and
misconceptions about Bayesian computational methods. In Alexander Adelaar & Andrew
Pawlye (eds.), Austronesian historical linguistics and culture history: A festschrift for
Robert Blust, 375–397. Canberra: Pacific Linguistics.
Greenhill, Simon J., Robert Blust & Russell D. Gray. 2008. The Austronesian basic vocabulary
database: From bioinformatics to lexomics. Evolutionary Bioinformatics 4. 271–283.
Haspelmath, Martin, Matthew S. Dryer, David Gil & Bernard Comrie. 2011. WALS online.
Munich: Max Planck Digital Library. http://wals.info.
Hauer, Bradley & Grzegorz Kondrak. 2011. Clustering semantically equivalent words into
cognate sets in multilingual lists. In Proceedings of 5th International Joint Conference on
Natural Language Processing, pages 865–873. Asian Federation of Natural Language
Processing. http://www.aclweb.org/anthology/I11-1097 (accessed 27 November 2014).
Holman, Eric W., Søren Wichmann, Cecil H. Brown, Viveka Velupillai, André Müller & Dik
Bakker. 2008. Advances in automated language classification. In Antti Arppe, Kaius
Sinnemäki & Urpu Nikanne (eds.), Quantitative investigations in theoretical linguistics,
40–43. Helsinki: University of Helsinki.
Huff, Paul & Deryle Lonsdale. 2011. Positing language relationships using ALINE. Language
Dynamics and Change 1(1). 128–162.
Huffman, Stephen M. 1998. The genetic classification of languages by n-gram analysis: A
computational technique. Washington (DC): Georgetown University dissertation.
Inkpen, Diana, Oana Frunza & Grzegorz Kondrak. 2005. Automatic identification of cognates
and false friends in French and English. In Proceedings of the International Conference
Recent Advances in Natural Language Processing, 251–257.
Jäger, Gerhard. 2013. Phylogenetic inference from word lists using weighted alignment with
empirically determined weights. Language Dynamics and Change 3(2). 245–291.
Kondrak, Grzegorz. 2000. A new algorithm for the alignment of phonetic sequences. In
Proceedings of the First Meeting of the North American Chapter of the Association for
Computational Linguistics, 288–295.
Kondrak, Grzegorz. 2002a. Algorithms for language reconstruction. Toronto: University of
Toronto dissertation.
Kondrak, Grzegorz. 2002b. Determining recurrent sound correspondences by inducing
translation models. In Proceedings of the 19th international conference on Computational
linguistics, Volume 1. Association for Computational Linguistics.
http://www.aclweb.org/anthology/C02-1016?CFID=458963627&CFTOKEN=39778386
(accessed 26 November 2014).
Kondrak, Grzegorz & Tarek Sherif. 2006. Evaluation of several phonetic similarity algorithms
on the task of cognate identification. In Proceedings of ACL Workshop on Linguistic
Distances, 43–50. Association for Computational Linguistics.
Kroeber, Alfred L. & C. Douglas Chrétien. 1937. Quantitative classification of Indo-European
languages. Language 13(2). 83–103.
Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and
reversals. Soviet Physics - Doklady 10(8). 707–710.
Lewis, Paul M. (ed.). 2009. Ethnologue: Languages of the world, 16th edn. Dallas (TX): SIL
International.
Lin, Dekang. 1998. An information-theoretic definition of similarity. In Proceedings of the 15th
International Conference on Machine Learning, Volume 1, 296–304.
List, Johann-Mattis. 2012. LexStat: Automatic detection of cognates in multilingual wordlists.
In Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH, 117–125. Association
for Computational Linguistics. http://www.aclweb.org/anthology/W12-0216 (accessed 27
November 2014).
Melamed, Dan I. 1999. Bitext maps and alignment via pattern recognition. Computational
Linguistics 25(1). 107–130.
Nichols, Johanna. 1996. The comparative method as heuristic. In Mark Durie & Malcolm Ross
(eds.), The comparative method reviewed: Regularity and irregularity in language change,
39–71. New York: Oxford University Press.
Nordhoff, Sebastian & Harald Hammarström. 2011. Glottolog/Langdoc: Defining dialects,
languages, and language families as collections of resources. In Proceedings of the First
International Workshop on Linked Science, Volume 783. http://ceur-ws.org/Vol-
783/paper7.pdf (accessed 27 November 2014).
Petroni, Filippo & Maurizio Serva. 2010. Measures of lexical distance between languages.
Physica A: Statistical Mechanics and its Applications 389(11). 2280–2283.
Pompei, Simone, Vittorio Loreto, & Francesca Tria. 2011. On the accuracy of language trees.
PloS ONE 6(6). e20109.
Rama, Taraka & Anil K. Singh. 2009. From bag of languages to family trees from noisy corpus.
In Proceedings of the International Conference RANLP-2009, 355–359. Association for
Computational Linguistics. http://www.aclweb.org/anthology/R09-1064 (accessed 27
November 2014).
Robinson, David F. & Leslie R. Foulds. 1981. Comparison of phylogenetic trees. Mathematical
Biosciences 53(1-2). 131–147.
Saitou, Naruya & Masatoshi Nei. 1987. The neighbor-joining method: A new method for
reconstructing phylogenetic trees. Molecular Biology and Evolution 4(4). 406–425.
Sankoff, David. 1969. Historical linguistics as stochastic process. Montreal: McGill University
dissertation.
Sheskin, David J. 2003. Handbook of parametric and nonparametric statistical procedures.
Boca Raton (FL): Chapman & Hall/CRC Press.
Singh, Anil K. 2006. Study of some distance measures for language and encoding
identification. In Proceedings of Workshop on Linguistic Distances, 63–72. Association for
Computational Linguistics. http://www.aclweb.org/anthology/W06-1109 (accessed 27
November 2014).
Singh, Anil K. & Harshit Surana. 2007. Can corpus based measures be used for comparative
study of languages? In Proceedings of the Ninth Meeting of the ACL Special Interest Group in
Computational Morphology and Phonology.
Table 5: Dist for families and measures above average.

Family JCDD JCD TRIGRAMD DICED IDENTD PREFIXD LDND LCSD LDN
Bos 15.0643 14.436 7.5983 10.9145 14.4357 10.391 8.6767 8.2226 4.8419
NDe 19.8309 19.2611 8.0567 13.1777 9.5648 9.6538 10.1522 9.364 5.2419
NC 1.7703 1.6102 0.6324 1.1998 0.5368 1.0685 1.3978 1.3064 0.5132
Pan 24.7828 22.4921 18.5575 17.2441 12.2144 13.7351 12.7579 11.4257 6.8728
Hok 10.2645 9.826 3.6634 7.3298 4.0392 3.6563 4.84 4.6638 2.7096
Chi 4.165 4.0759 0.9642 2.8152 1.6258 2.8052 2.7234 2.5116 1.7753
Tup 15.492 14.4571 9.2908 10.4479 6.6263 8.0475 8.569 7.8533 4.4553
WP 8.1028 7.6086 6.9894 5.5301 7.0905 4.0984 4.2265 3.9029 2.4883
AuA 7.3013 6.7514 3.0446 4.5166 3.4781 4.1228 4.7953 4.3497 2.648
An 7.667 7.2367 4.7296 5.3313 2.5288 4.3066 4.6268 4.3107 2.4143
Que 62.227 53.7259 33.479 29.7032 27.1896 25.9791 23.7586 21.7254 10.8472
Kho 6.4615 6.7371 3.3425 4.4202 4.0611 3.96 3.8014 3.3776 2.1531
Dra 18.5943 17.2609 11.6611 12.4115 7.3739 10.2461 9.8216 8.595 4.8771
Aus 2.8967 3.7314 1.5668 2.0659 0.7709 1.8204 1.635 1.5775 1.4495
Tuc 25.9289 24.232 14.0369 16.8078 11.6435 12.5345 12.0163 11.0698 5.8166
Ura 6.5405 6.1048 0.2392 1.6473 -0.0108 3.4905 3.5156 3.1847 2.1715
Arw 6.1898 6.0316 4.0542 4.4878 1.7509 2.9965 3.5505 3.3439 2.1828
May 40.1516 37.7678 17.3924 22.8213 17.5961 14.4431 15.37 13.4738 7.6795
LP 7.5669 7.6686 3.0591 5.3684 5.108 4.8677 4.3565 4.2503 2.8572
OM 4.635 4.5088 2.8218 3.3448 2.437 2.6701 2.7328 2.4757 1.3643
Car 15.4411 14.6063 9.7376 10.6387 5.1435 7.7896 9.1164 8.2592 5.0205
TNG 1.073 1.216 0.4854 0.8259 0.5177 0.8292 0.8225 0.8258 0.4629
MZ 43.3479 40.0136 37.9344 30.3553 36.874 20.4933 18.2746 16.0774 9.661
Bor 9.6352 9.5691 5.011 6.5316 4.1559 6.5507 6.3216 5.9014 3.8474
Pen 5.4103 5.252 3.6884 3.8325 2.3022 3.2193 3.1645 2.8137 1.5862
MGe 4.2719 4.0058 1.0069 2.5482 1.6691 2.0545 2.4147 2.3168 1.1219
ST 4.1094 3.8635 0.9103 2.7825 2.173 2.7807 2.8974 2.7502 1.3482
Tor 3.2466 3.1546 2.2187 2.3101 1.7462 2.1128 2.0321 1.9072 1.0739
TK 15.0085 13.4365 5.331 7.7664 7.5326 8.1249 7.6679 6.9855 2.8723
IE 7.3831 6.7064 1.6767 2.8031 1.6917 4.1028 4.0256 3.6679 1.4322
Alg 6.8582 6.737 4.5117 5.2475 1.2071 4.5916 5.2534 4.5017 2.775
NS 2.4402 2.3163 1.1485 1.6505 1.1456 1.321 1.3681 1.3392 0.6085
Sko 6.7676 6.3721 2.5992 4.6468 4.7931 5.182 4.7014 4.5975 2.5371
AA 1.8054 1.6807 0.7924 1.2557 0.4923 1.37 1.3757 1.3883 0.6411
LSR 4.0791 4.3844 2.2048 2.641 1.5778 2.1808 2.1713 2.0826 1.6308
Mar 10.9265 10.0795 8.5836 7.1801 6.4301 5.0488 4.7739 4.5115 2.8612
Alt 18.929 17.9969 6.182 9.1747 7.2628 9.4017 8.8272 7.9513 4.1239
Sep 6.875 6.5934 2.8591 4.5782 4.6793 4.3683 4.1124 3.8471 2.0261
Hui 21.0961 19.8025 18.4869 14.7131 16.1439 12.4005 10.2317 9.2171 4.9648
NDa 7.6449 7.3732 3.2895 4.8035 2.7922 5.7799 5.1604 4.8233 2.3671
Sio 13.8571 12.8415 4.2685 9.444 7.3326 7.8548 7.9906 7.1145 4.0156
Kad 42.0614 40.0526 27.8429 25.6201 21.678 17.0677 17.5982 15.9751 9.426
MUM 7.9936 7.8812 6.1084 4.7539 4.7774 3.8622 3.4663 3.4324 2.1726
WF 22.211 20.5567 27.2757 15.8329 22.4019 12.516 11.2823 10.4454 5.665
Sal 13.1512 12.2212 11.3222 9.7777 5.2612 7.4423 7.5338 6.7944 3.4597
Kiw 43.2272 39.5467 46.018 30.1911 46.9148 20.2353 18.8007 17.3091 10.3285
UA 21.6334 19.6366 10.4644 11.6944 4.363 9.6858 9.4791 8.9058 4.9122
Tot 60.4364 51.2138 39.4131 33.0995 26.7875 23.5405 22.6512 21.3586 11.7915
HM 8.782 8.5212 1.6133 4.9056 4.0467 5.7944 5.3761 4.9898 2.8084
EA 27.1726 25.2088 24.2372 18.8923 14.1948 14.2023 13.7316 12.1348 6.8154
Table 6: γ for families and measures above average.

Family LDND LCSD LDN LCS PREFIXD PREFIX JCDD DICED DICE JCD
WF
Tor 0.7638 0.734 0.7148 0.7177 0.7795 0.7458 0.7233 0.7193 0.7126 0.7216
Chi 0.7538 0.7387 0.7748 0.7508 0.6396 0.7057 0.7057 0.7057 0.7057 0.7477
HM 0.6131 0.6207 0.5799 0.5505 0.5359 0.5186 0.4576 0.429 0.4617 0.4384
Hok 0.5608 0.5763 0.5622 0.5378 0.5181 0.4922 0.5871 0.5712 0.5744 0.5782
Tot 1 1 1 1 0.9848 0.9899 0.9848 0.9899 0.9949 0.9848
Aus 0.4239 0.4003 0.4595 0.4619 0.4125 0.4668 0.4356 0.4232 0.398 0.4125
WP 0.7204 0.7274 0.7463 0.7467 0.6492 0.6643 0.6902 0.6946 0.7091 0.697
MUM 0.7003 0.6158 0.7493 0.7057 0.7302 0.6975 0.5477 0.5777 0.6594 0.6213
Sko 0.7708 0.816 0.7396 0.809 0.7847 0.7882 0.6632 0.6944 0.6458 0.6181
ST 0.6223 0.6274 0.6042 0.5991 0.5945 0.5789 0.5214 0.5213 0.5283 0.5114
Sio 0.8549 0.8221 0.81 0.7772 0.8359 0.8256 0.772 0.7599 0.7444 0.7668
Pan 0.3083 0.3167 0.2722 0.2639 0.275 0.2444 0.2361 0.2694 0.2611 0.2306
AuA 0.5625 0.5338 0.5875 0.548 0.476 0.4933 0.5311 0.5198 0.5054 0.5299
Mar 0.9553 0.9479 0.9337 0.9017 0.9256 0.9385 0.924 0.918 0.9024 0.9106
Kad
May 0.7883 0.7895 0.7813 0.7859 0.7402 0.7245 0.8131 0.8039 0.7988 0.8121
NC 0.4193 0.4048 0.3856 0.3964 0.2929 0.2529 0.3612 0.3639 0.2875 0.2755
Kiw
Hui 0.9435 0.9464 0.9435 0.9464 0.9464 0.9435 0.8958 0.9107 0.9137 0.8988
LSR 0.7984 0.7447 0.7234 0.6596 0.7144 0.692 0.7626 0.748 0.6484 0.6775
TK 0.7757 0.7698 0.7194 0.7158 0.7782 0.7239 0.6987 0.6991 0.6537 0.6705
LP 0.6878 0.6893 0.7237 0.7252 0.6746 0.7065 0.627 0.6594 0.6513 0.6235
Que 0.737 0.7319 0.758 0.7523 0.742 0.7535 0.7334 0.7335 0.7502 0.7347
NS 0.5264 0.4642 0.4859 0.4532 0.4365 0.3673 0.5216 0.5235 0.4882 0.4968
AA 0.6272 0.6053 0.517 0.459 0.6134 0.5254 0.5257 0.5175 0.4026 0.5162
Ura 0.598 0.5943 0.6763 0.6763 0.5392 0.6495 0.7155 0.479 0.6843 0.7003
MGe 0.6566 0.6659 0.6944 0.716 0.6011 0.662 0.7245 0.7099 0.7508 0.6983
Car 0.325 0.3092 0.3205 0.3108 0.2697 0.2677 0.313 0.3118 0.2952 0.316
Bor 0.7891 0.8027 0.7823 0.7914 0.7755 0.7619 0.7846 0.8005 0.7914 0.7823
Bos
EA 0.844 0.8532 0.8349 0.8349 0.8716 0.8899 0.8716 0.8716 0.8899 0.8899
TNG 0.6684 0.6692 0.6433 0.6403 0.643 0.6177 0.5977 0.5946 0.5925 0.5972
Dra 0.6431 0.6175 0.6434 0.6288 0.6786 0.6688 0.6181 0.6351 0.655 0.6112
IE 0.7391 0.7199 0.7135 0.6915 0.737 0.7295 0.5619 0.5823 0.6255 0.5248
OM 0.9863 0.989 0.9755 0.9725 0.9527 0.9513 0.9459 0.9472 0.9403 0.9406
Tuc 0.6335 0.623 0.6187 0.6089 0.6189 0.6153 0.5937 0.5983 0.5917 0.5919
Arw 0.5079 0.4825 0.4876 0.4749 0.4475 0.4472 0.4739 0.4773 0.4565 0.4727
NDa 0.9458 0.9578 0.9415 0.9407 0.9094 0.9121 0.8071 0.8246 0.8304 0.8009
Alg 0.5301 0.5246 0.5543 0.5641 0.4883 0.5147 0.4677 0.4762 0.5169 0.5106
Sep 0.8958 0.8731 0.9366 0.9388 0.8852 0.9048 0.8535 0.8724 0.892 0.8701
NDe 0.7252 0.7086 0.7131 0.7017 0.7002 0.6828 0.6654 0.6737 0.6715 0.6639
Pen 0.8011 0.7851 0.8402 0.831 0.8092 0.8092 0.7115 0.7218 0.7667 0.7437
An 0.2692 0.2754 0.214 0.1953 0.2373 0.1764 0.207 0.2106 0.1469 0.2036
Tup 0.9113 0.9118 0.9116 0.9114 0.8884 0.8921 0.9129 0.9127 0.9123 0.9119
Kho 0.8558 0.8502 0.8071 0.7903 0.8801 0.8333 0.8052 0.8146 0.736 0.7378
Alt 0.8384 0.8366 0.85 0.8473 0.8354 0.8484 0.8183 0.8255 0.8308 0.8164
UA 0.8018 0.818 0.7865 0.8002 0.7816 0.7691 0.8292 0.8223 0.8119 0.8197
Sal 0.8788 0.8664 0.8628 0.8336 0.8793 0.8708 0.7941 0.798 0.7865 0.7843
MZ 0.7548 0.7692 0.7476 0.7524 0.7356 0.7212 0.6707 0.6779 0.6731 0.6683
Table 7: RW for families and measures above average.
Family LDND LCSD LDN LCS PREFIXD PREFIX DICED DICE JCD JCDD TRIGRAMD
NDe 0.5761 0.5963 0.5556 0.5804 0.5006 0.4749 0.4417 0.4372 0.4089 0.412 0.2841
Bos
NC 0.4569 0.4437 0.4545 0.4398 0.3384 0.3349 0.3833 0.3893 0.3538 0.3485 0.2925
Hok 0.8054 0.8047 0.8048 0.8124 0.6834 0.6715 0.7987 0.8032 0.7629 0.7592 0.5457
Pan
Chi 0.5735 0.5775 0.555 0.5464 0.5659 0.5395 0.5616 0.5253 0.5593 0.5551 0.4752
Tup 0.7486 0.7462 0.7698 0.7608 0.6951 0.705 0.7381 0.7386 0.7136 0.7125 0.6818
WP 0.6317 0.6263 0.642 0.6291 0.5583 0.5543 0.5536 0.5535 0.5199 0.5198 0.5076
AuA 0.6385 0.6413 0.5763 0.5759 0.6056 0.538 0.5816 0.5176 0.5734 0.5732 0.5147
Que
An 0.1799 0.1869 0.1198 0.1003 0.1643 0.0996 0.1432 0.0842 0.1423 0.1492 0.1094
Kho 0.7333 0.7335 0.732 0.7327 0.6826 0.6821 0.6138 0.6176 0.5858 0.582 0.4757
Dra 0.5548 0.5448 0.589 0.5831 0.5699 0.6006 0.5585 0.589 0.5462 0.5457 0.5206
Aus 0.2971 0.2718 0.3092 0.3023 0.2926 0.3063 0.2867 0.257 0.2618 0.2672 0.2487
Tuc
Ura 0.4442 0.4356 0.6275 0.6184 0.4116 0.6104 0.2806 0.539 0.399 0.3951 0.1021
Arw
May
LP 0.41 0.4279 0.4492 0.4748 0.3864 0.4184 0.3323 0.336 0.3157 0.3093 0.1848
OM 0.8095 0.817 0.7996 0.7988 0.7857 0.7852 0.7261 0.7282 0.6941 0.6921 0.6033
Car
MZ
TNG 0.5264 0.5325 0.4633 0.4518 0.5 0.472 0.469 0.4579 0.4434 0.4493 0.3295
Bor
Pen 0.8747 0.8609 0.8662 0.8466 0.8549 0.8505 0.8531 0.8536 0.8321 0.8308 0.7625
MGe 0.6833 0.6976 0.6886 0.6874 0.6086 0.6346 0.6187 0.6449 0.6054 0.6052 0.4518
ST 0.5647 0.5596 0.5435 0.5261 0.5558 0.5412 0.4896 0.4878 0.4788 0.478 0.3116
IE 0.6996 0.6961 0.6462 0.6392 0.6917 0.6363 0.557 0.5294 0.5259 0.5285 0.4541
TK 0.588 0.58 0.5004 0.4959 0.5777 0.4948 0.5366 0.4302 0.5341 0.535 0.4942
Tor 0.4688 0.4699 0.4818 0.483 0.4515 0.4602 0.4071 0.4127 0.375 0.3704 0.3153
Alg 0.3663 0.3459 0.4193 0.4385 0.3456 0.3715 0.2965 0.3328 0.291 0.2626 0.1986
NS 0.6118 0.6072 0.5728 0.5803 0.5587 0.5118 0.578 0.5434 0.5466 0.5429 0.4565
Sko 0.8107 0.8075 0.806 0.7999 0.7842 0.7825 0.6798 0.6766 0.6641 0.6664 0.5636
AA 0.6136 0.6001 0.4681 0.431 0.6031 0.4584 0.5148 0.3291 0.4993 0.4986 0.4123
LSR 0.5995 0.5911 0.6179 0.6153 0.5695 0.5749 0.5763 0.5939 0.5653 0.5529 0.5049
Mar 0.654 0.6306 0.6741 0.6547 0.6192 0.6278 0.568 0.5773 0.5433 0.5366 0.4847
Alt 0.8719 0.8644 0.8632 0.8546 0.8634 0.8533 0.7745 0.7608 0.75 0.7503 0.6492
Hui 0.6821 0.68 0.6832 0.6775 0.6519 0.6593 0.5955 0.597 0.5741 0.5726 0.538
Sep 0.6613 0.656 0.6662 0.6603 0.6587 0.6615 0.6241 0.6252 0.6085 0.6079 0.5769
NDa 0.6342 0.6463 0.6215 0.6151 0.6077 0.5937 0.501 0.5067 0.4884 0.4929 0.4312
Sio
Kad
WF
MUM
Sal 0.6637 0.642 0.6681 0.6463 0.6364 0.6425 0.5423 0.5467 0.5067 0.5031 0.4637
Kiw
UA 0.9358 0.9332 0.9296 0.9261 0.9211 0.9135 0.9178 0.9148 0.8951 0.8945 0.8831
Tot
EA 0.6771 0.6605 0.6639 0.6504 0.6211 0.6037 0.5829 0.5899 0.5317 0.5264 0.4566
HM
Vassiliki Rentoumi, Anastasia Krithara, Nikos Tzanos
Predicting Sales Trends
Can sentiment analysis on social media help?
1 Introduction
Over the last few years, social media have gained a lot of attention, and many
users share their opinions and experiences on them. As a result, there is an ag-
gregation of personal wisdom and different viewpoints. If all this information is
extracted and analyzed properly, the data on social media can lead to useful pre-
dictions of several human-related events. Such prediction has great benefits in
many areas, such as finance and product marketing (Yu & Kak 2012). The latter
has attracted the attention of researchers in the social network analysis field,
and several approaches have been proposed.
Although a lot of research has been conducted on predicting the outcomes of
events in finance, the stock market and politics, so far there is no analogous
research focusing on the prediction of product sales. Most of the current ap-
proaches analyse the sentiment of tweets in order to predict different events.
The limitation of current approaches is that they use sentiment metrics in a
strictly quantitative way, taking into account the raw number of favourites or the
fraction of likes over dislikes, which are not always accurate or representative of
people's sentiment. In other words, the current approaches do not take into
account the sentiment trends or the longitudinal sentiment fluctuation expressed
by people over time about a product, which could help in estimating a future
trend.
To this end, we propose a computational approach which predicts the sales
trends of products based on the public sentiment (i.e. positive/negative/neutral
stance) expressed via Twitter. The sentiment expressed in a tweet is determined
through the context of each tweet, by taking into account the relations among
its words. The sentiment feature used for making predictions on sales trends is
not treated as an isolated parameter but is used in correlation and interaction
with other features extracted from sequential historical data.
The rest of the chapter is organized as follows: In the next section (2) the
state of the art in the prediction of various human-related events exploiting
sentiment analysis is discussed. Then, in section 3, the proposed approach is
presented.
The proposed approaches use different metrics of social media which have been
employed in prediction. The metrics used may be divided into two categories
(Yu & Kak 2012): message characteristics and social network characteristics.
The message characteristics focus on the messages themselves, such as sentiment
and time series metrics. The social network characteristics, on the other hand,
concern structural features.
The sentiment metrics are the static features of posts. With a qualitative
sentiment analysis system, each message can be assigned a positive, negative,
or neutral sentiment. Thus the numbers of positive, negative, neutral,
non-neutral, and total posts are five elementary content predictors.
These metrics may have different predictive power at different stages (Yu &
Kak 2012). Various other endeavours try to calculate the relative strength of
the computed sentiment. To this end, the prediction approach adopted by Zhang &
Skiena (2009) computes various ratios between different types of sentiment-
bearing posts: specifically, the ratio between the numbers of positive and
total posts, the ratio between the numbers of negative and total posts, and the
ratio between the numbers of neutral and total posts. Asur & Huberman (2010)
calculate the ratio between the numbers of neutral and non-neutral posts, and
the ratio between the numbers of positive and negative posts.
Further, with the use of time series metrics, researchers try to investigate
the posts more dynamically, including the speed and process of message
generation. In that case, different time window sizes can be taken into account
in order to calculate the generation rate of posts (hourly, daily or weekly).
The intuition behind the use of the post generation rate is that a higher rate
implies the engagement of more people with a topic, and thus the topic is
considered more attractive. For example, Asur & Huberman (2010) have shown that
the daily post generation rate before the release of a movie is a good
predictor of its box-office revenue.
As mentioned before, the social network characteristics measuring structural
features can be of great importance in prediction methodologies. For example,
centrality, which measures the relative importance of a node within a network,
the number of followers of a user, or the number of re-tweets on Twitter, when
combined with the message characteristics, can provide invaluable information
about the importance and the relative strength of the computed sentiment.
Concerning the prediction methods used, different approaches have been
proposed. For example, the prediction of future product sales using a
Probabilistic Latent Semantic Analysis (PLSA) model for sentiment analysis in
blogs is proposed by Liu et al. (2007). A Bayesian approach has also been
proposed in Liu et al. (2007); in particular, if the prediction result is
discrete, the Bayes classifier can be used.
3 Methodology
The proposed approach for sales prediction, which is based on sentiment
analysis, is articulated in two consecutive stages:
– the sentiment analysis stage;
– the prediction of products' sales trends based on sentiment.
The output of the first stage (i.e. tweets annotated with positive, negative or
neutral sentiment) provides the input for the second.
Examples (a) and (b) contain the same word information, but convey a different
polarity orientation. Example (a) implies a positive polarity since the author,
placing the adverb in the initial position, is trying to support their claim,
providing excuses for the behaviour of someone. In example (b), on the other
hand, the relocation of the adverb to the end of the sentence changes its
meaning and creates a negative connotation: here the author criticizes the
non-honest behaviour of someone, implying a negative polarity.
CRF is particularly appropriate for capturing such fine distinctions since it
is able to capture the relations among sentence constituents (word senses) as a
function of their sentential position.
A bag-of-words classifier, on the other hand, such as an SVM classifier, cannot
adequately exploit the structure of sentences and can prove misleading in
assigning the correct polarity to examples (a) and (b), since the bag-of-words
representation is the same for both sentences.
CRF is an undirected graphical model which specifies the joint probability of
possible label sequences given an observation. We used a linear-chain CRF
implementation1 with default settings. Figure 1 shows the abstract structure of
the model, where x indicates word senses and y states. A linear-chain CRF is
able to model arbitrary features from the input instead of just the information
concerning the previous state of the current observation (as in Hidden Markov
Models). Therefore it is able to take into account contextual information
concerning the constituents of a sentence which, as already discussed, is
valuable when it comes to evaluating the polarity orientation of a sentence.
1 http://crfpp.sourceforge.net/
Z(X) is a normalization factor that depends on X, and λ are the parameters
estimated in the training process. In order to train the classification model,
i.e. to label x_t depicted in Figure 3 (indicating a positive/negative/neutral
sentiment for the sentence), we exploit information from the previous (x_{t−1})
and the following (x_{t+1}) words through feature functions f_k. A feature
function f_k looks at a pair of adjacent states y_{t−1}, y_t, the whole input
sequence X and the position of the current word. In this way the CRF exploits
the structural information of a sentential sequence.
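The experiments below use the CRF++ implementation referenced in footnote 1; as
an illustration only, the following sketch reproduces an analogous setup with
the third-party sklearn-crfsuite package (an assumption, not the authors'
tooling), with features for the current word and its neighbours x_{t−1} and
x_{t+1}, and with every word labelled by the sentence-level polarity:

    # Sketch of a linear-chain CRF for sentence-level sentiment, assuming
    # the sklearn-crfsuite package; the training data are toy stand-ins
    # for the Sanders corpus.
    import sklearn_crfsuite

    def word_features(words, t):
        # Features for position t: the word plus its left/right neighbours
        # (the x_{t-1}, x_{t+1} context described above).
        return {"word": words[t].lower(),
                "prev": words[t - 1].lower() if t > 0 else "<BOS>",
                "next": words[t + 1].lower() if t < len(words) - 1 else "<EOS>"}

    def encode(tokens, polarity):
        X = [word_features(tokens, t) for t in range(len(tokens))]
        y = [polarity] * len(tokens)      # every word carries the sentence label
        return X, y

    tweets = [(["love", "my", "new", "ipad"], "positive"),
              (["battery", "died", "again"], "negative"),
              (["just", "bought", "a", "kindle"], "neutral")]
    X_train, y_train = zip(*(encode(toks, pol) for toks, pol in tweets))

    crf = sklearn_crfsuite.CRF(algorithm="lbfgs", max_iterations=50)
    crf.fit(list(X_train), list(y_train))
    test = [word_features(["love", "this", "kindle"], t) for t in range(3)]
    print(crf.predict([test]))            # one polarity label per word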
The CRF component assigns a sentence-level (i.e. tweet-level) polarity to every
word participating in a sentential sequence. This sentence-level polarity
corresponds to the polarity of a sentence in which this word would most
probably participate, according to the contextual information. Then a majority
voting process is used to determine the polarity class for a specific sentence.
Thus the majority voting does not derive the sentence-level polarity from
word-level polarities, but from the sentence-level polarities already assigned
to the words by the CRF component. In other words, the final polarity of the
sentence is the polarity of the sentence in which the majority of its words
would participate. For example, for a sentence of five words, CRF assigns a
sentence-level polarity to each word; if there are 3 negative and 2 positive
sentence-level polarities, then the dominant polarity, i.e. the negative one,
is the final result.
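This voting step amounts to a few lines of code (a sketch; the function name is
hypothetical):

    from collections import Counter

    def sentence_polarity(word_level_labels):
        # Majority vote over the sentence-level polarities that the CRF
        # assigned to the individual words of the tweet.
        return Counter(word_level_labels).most_common(1)[0][0]

    # Three "negative" and two "positive" labels -> "negative", as above.
    print(sentence_polarity(["negative", "positive", "negative",
                             "positive", "negative"]))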
In order to train the CRF we used tweets derived from the Sanders collection2.
This corpus consists of 2471 manually classified tweets. Each tweet was
classified with respect to one of four different topics into one of three
classes (positive, negative, and neutral). Every word in a sentence is
represented by a feature vector of the form (word, sentence_polarity), where
word is the word itself and sentence_polarity is the sentence-level polarity
class (positive/negative/neutral). Each sentence (i.e. tweet) is represented by
a sequence of such feature vectors. Within the test set, the last feature,
representing the class (pos/neg) for the sentence, is absent and is assessed.
The test set consists of a corpus of four subsets of tweets corresponding to
four different topics: ipad, sony experia, samsung galaxy, kindle fire.
2 http://www.sananalytics.com/lab/twitter-sentiment/
In this section, we describe the methodology adopted for the prediction of the
fluctuation of sales, in particular of electronic devices. The proposed
methodology is based on structural machine learning and on sentiment analysis
of historical data from Twitter. The rationale behind the adopted methodology
lies in the assumption that a prediction of the sales fluctuation of an
electronic device (e.g. the iPad) can be based on the sales fluctuation of
similar gadgets (e.g. Samsung Galaxy, Kindle Fire, etc.). As argued in
Section 2, the fluctuation of sales for a product is highly dependent on social
media; in particular, it has been argued that there exists a strong correlation
between spikes in sales rank and the number of related blog posts. Even though
there is a strong correlation between blog posts and sales trends, only few
researchers work on the prediction of sales based on social media. We rely on
historical data from Twitter and exploit sentiment as a strong prediction
indicator. The latter choice is based on the fact that on Twitter,
organizations or product brands are mentioned in about 19% of all tweets, and
the majority of these mentions comprise strongly expressed negative or positive
sentiments.
In the proposed methodology, the prediction of a product's sales is based on
the sentiment (positive/negative/neutral) expressed in tweets about similar
products whose sales trends (increase/decrease) are already known. We found
that there is a correlation among similar products, largely because they share
the same market competition. Therefore, historical data that adequately
simulate the sales trends for a number of similar products can presumably
reveal a generic pattern which, consequently, can lead to accurate prediction.
We therefore introduce a classification function which relates tweets'
sentiment and historical information (i.e. tweets' dates) concerning specific
electronic devices to their sales trends. In doing so, we intend to test the
extent to which these features can serve as a valuable indicator for the
correct prediction of sales trends (i.e. increase/decrease).
In our case, the sales trend prediction problem can be seen as a discriminative
machine learning classification problem, which is addressed with the assistance
of a CRF trained on tweets' data and sentiment values expressed on Twitter
about the Sony Experia, Kindle Fire and Samsung Galaxy devices in specific
early quarters for which we already knew their sales fluctuation. In general,
in order to make predictions for each device, the CRF is trained with the
tweets' sequential data of the remaining three devices. The tweets' data for
each device, as shown in Table 1, are taken from two yearly quarters for each
class (decrease/increase).

Table 1: Historical Twitter data for training and testing the CRF prediction approach
As a test set we used tweet sequences for the iPad taken from the quarter
immediately preceding the one for which we wish to make the prediction, that
is, the first quarter of 2011, as well as from the corresponding quarter of the
previous year, that is, the second quarter of 2010. The information about the
specific time intervals of falling or rising sales has been taken from
reputable financial and IT online media3. The training and test data integrated
in the prediction methodology are detailed in Table 1.
More formally, the CRF component assigns to a tweet sequence (defined as a
series of consecutive tweets taken from two yearly quarters) a sales trend
(i.e. increase/decrease) for every tweet in this sequence. This trend
corresponds to the trend of the tweet sequence in which this tweet would most
probably participate. Then a majority voting process is used to decide the
prediction class for a specific tweet sequence. For example, if 3 decrease and
2 increase sequence-based sales trends are given to the tweets comprising the
sequence, then the dominant trend, the class decrease, is chosen for this
sequence.
In formal terms, within the training process every tweet in a tweet data
sequence is represented by a feature vector of the form:
    (Date, SentimentProbability, Polarity, PredictionTrend),
where "𝐷𝐷𝐷𝐷𝑓𝑓𝑓𝑓𝑓𝑓𝑓𝑓𝑓𝑓𝑓𝑓" is the date of the corresponding tweet, probability is the output
“Sentiment Probablity” derived within the sentiment analysis stage (a) for that
tweet, polarity is the output sentiment value, positive, negative, neutral extracted
within the sentiment analysis stage (a) and the prediction trend is the trend (in-
crease/decrease) assigned to this tweet representing the whole tweets' sequence
sales trend; each tweets' sequence representing a class (increase or decrease) is
consisting of tweets corresponding to two yearly quartiles. Therefore, each
tweets' data sequence is represented by a sequence of such feature vectors put in
chronological order (according to their features values "𝐷𝐷𝐷𝐷𝑓𝑓𝑓𝑓𝑓𝑓𝑓𝑓𝑓𝑓𝑓𝑓"). Within the test
3 Among others: The Financial Times, Business Insider, Reuters, The Telegraph, CNET, ZDnet,
Mobile Statistics.
212 Vassiliki Rentoumi, Anastasia Krithara, Nikos Tzanos
set, the last feature representing the trend (increase/decrease) for a tweets' se-
quence representing a device is absent and is assessed.
4 Experimental Results
To verify our claim that the proposed prediction approach is well suited to the
particular task, we compare it with a system variant that uses Naive Bayes
instead of CRF within the last step of the methodology (stage b, which assigns
a prediction class for the sales trend on tweets' sequential data).
In order to evaluate this method of classifying tweets' sequential data
concerning four electronic devices, we used 4-fold cross-validation. As noted
in Section 2, data for four electronic devices were taken into consideration;
the algorithm was run 4 times, each time using a different combination of 3
subsets for training and the remaining one for testing. Thus each
classification fold predicts the sales trend for one of the four devices under
question, using its corresponding tweets' sequential data as a test set, while
the tweets' sequential data for the three remaining devices are used as a
training set (Table 1). The result of the cross-validation is the average
performance of the algorithm over the 4 runs. The available tweet data
sequences (data points for training and testing) are two for each product, each
one representing the product's trend to increase or to decrease.
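The leave-one-device-out protocol can be sketched as follows (an illustration;
train_and_predict is a placeholder for the CRF or Naive Bayes step, and the
data layout is assumed):

    # 4-fold cross-validation over devices: each fold trains on the tweet
    # sequences of three devices and tests on the held-out fourth.
    devices = ["ipad", "sony_experia", "samsung_galaxy", "kindle_fire"]

    def cross_validate(sequences, train_and_predict):
        # sequences: device -> (tweet_sequence_features, true_trend)
        hits = 0
        for held_out in devices:
            train = {d: sequences[d] for d in devices if d != held_out}
            predicted = train_and_predict(train, sequences[held_out][0])
            hits += (predicted == sequences[held_out][1])
        return hits / len(devices)        # average accuracy over the 4 runs

    dummy = {d: ([], "increase") for d in devices}
    print(cross_validate(dummy, lambda train, feats: "increase"))  # -> 1.0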
Table 2 illustrates, for the increase/decrease prediction task, the confusion
matrices and the accuracy scores of the proposed methodology in comparison with
the Naive Bayes approach. Each class (increase/decrease) includes four tweet
data sequences, one for each of the four devices. As Table 2 reports, our
method exhibits a higher level of accuracy than the NB variant (0.50)
(p < 0.05). Moreover, the results indicate that the structural information
exploited from the tweets' data series is the best fit for this prediction
task.
Table 2: System evaluation using CRF for the prediction of sales trends vs. Naive Bayes: error
threshold 5%, p-value: 0.06

In parallel, we aim to show that when our method integrates more linguistic
information into the CRF component, it is more representative of the structure
of the sentence and can therefore judge the polarity orientation of a sentence
more accurately than a feature representation that does not take enough
information from the sentential context into account.
References
Asur, Sitaram & Bernardo A. Huberman. 2010. Predicting the future with social media. In Pro-
ceedings of the IEEE/WIC/ACM international conference on web intelligence and intelligent
agent technology, 492–499.
Bollen, Johan, Huina Mao & Xiao-Jun Zeng. 2011. Twitter mood predicts the stock market. Jour-
nal of Computational Science 2(1). 1–8.
Choi, Yejin, Eric Breck & Claire Cardie. 2006. Joint extraction of entities and relations for opin-
ion recognition. In Proceedings of the conference on empirical methods in natural lan-
guage processing, 431–439. Association for Computational Linguistics.
Cootner, Paul H. 1964. The random character of stock market prices. Cambridge (MA): MIT
Press.
Dodds, Peter & Christopher Danforth. 2010. Measuring the happiness of large-scale written ex-
pression: Songs, blogs, and presidents. Journal of Happiness Studies 11(4). 441–456.
Eppen, Gary D. & Eugene F. Fama. 1969. Cash balance and simple dynamic portfolio problems
with proportional costs. International Economic Review 10(2). 119–133.
Fama, Eugene F. 1965. The behavior of stock-market prices. Journal of Business 38(1). 34–105.
Fama, Eugene F. 1991. Efficient capital markets: II. Journal of Finance 46(5). 1575–1617.
Gallagher, Liam A. & Mark P. Taylor. 2002. Permanent and temporary components of stock
prices: Evidence from assessing macroeconomic shocks. Southern Economic Journal
69(2). 345–362.
Gilbert, Eric & Karrie Karahalios. 2010. Widespread worry and the stock market. In Proceedings
of the fourth international AAAI conference on weblogs and social media, 58–65.
Gruhl, Daniel, Ramanthan Guha, Ravi Kumar, Jasmine Novak & Andrew Tomkins. 2005. The pre-
dictive power of online chatter. In Proceedings of the eleventh ACM SIGKDD international
conference on knowledge discovery in data mining, 78–87.
Gunawardana, Asela, Milind Mahajan, Alex Acero & John C. Platt. 2005. Hidden conditional ran-
dom fields for phone classification. In Ninth European conference on speech communica-
tion and technology, 1117–1120.
Kavussanos, Manolis & Everton Dockery. 2001. A multivariate test for stock market efficiency:
the case of ASE. Applied Financial Economics 11(5). 573–579.
Lafferty, John, Andrew McCallum & Fernardo Pereira. 2001. Conditional random fields: Proba-
bilistic models for segmenting and labeling sequence data. In Proceedings of the 18th in-
ternational conference on machine learning, 282–289.
Lerman, Kristina & Tad Hogg. 2010. Using a model of social dynamics to predict popularity of
news. In Proceedings of the 19th international conference on World Wide Web, 621–630.
Liu, Yang, Xiangji Huang, Aijun An & Xiaohui Yu. 2007. Arsa: A sentiment-aware model for pre-
dicting sales performance using blogs abstract. In Proceedings of the 30th annual inter-
national ACM SIGIR conference on research and development in information retrieval,
607–614.
Mao, Yi & Guy Lebanon. 2007. Isotonic conditional random fields and local sentiment flow. In
Advances in Neural Information Processing Systems 19. http://papers.nips.cc/pa-
per/3152-isotonic-conditional-random-fields-and-local-sentiment-flow.pdf (accessed 1
December 2014).
McDonald, Ryan, Kerry Hannan, Tyler Neylon, Mike Wells & Jeff Reynar. 2007. Structured mod-
els for fine-to-coarse sentiment analysis. In Proceedings of the 45th annual meeting of the
Association for Computational Linguistics, 432–439.
Mishne, Gilad & Natalie Glance. 2006. Predicting movie sales from blogger sentiment. In AAAI
symposium on computational approaches to analysing weblogs, 155–158.
Pak, Alexander & Patrick Paroubek. 2010. Twitter as a corpus for sentiment analysis and
opinion mining. In Proceedings of the 7th conference on international language resources
and evaluation, 1320–1326.
Qian, Bo & Khaled Rasheed. 2007. Stock market prediction with multiple classifiers. Applied
Intelligence 26(1). 25–33.
Quattoni, Ariadna, Michael Collins & Trevor Darrell. 2004. Conditional random fields for object
recognition. Proceedings of Neural Information Processing Systems, 1097–1104.
Rentoumi, Vassiliki, George Giannakopoulos, Vangelis Karkaletsis & George Vouros. 2009.
Sentiment analysis of figurative language using a word sense disambiguation approach.
In International conference on recent advances in natural language processing (RANLP
2009), 370–375.
Rentoumi, Vassiliki, George A. Vouros, Vangelis Karkaletsis & Amalia Moser. 2012. Investigat-
ing Metaphorical Language in Sentiment Analysis: a Sense-to-Sentiment Perspective.
ACM Transactions on Speech and Language Processing 9(3). Article no. 6.
Romero, Daniel M., Wojciech Galuba, Sitaram Asur & Bernardo A. Huberman. 2011. Influence
and passivity in social media. In Proceedings of the 20th international conference com-
panion on World Wide Web, 113–114.
Sadamitsu, Kugatsu, Satoshi Sekine & Mikio Yamamoto. 2008. Sentiment analysis based on
probabilistic models using inter-sentence information. In Proceedings of the 6th interna-
tional conference on language resources and evaluation, 2892–2896.
Sharda, Ramesh & Dursun Delen. 2006. Predicting box-office success of motion pictures with
neural networks. Expert Systems with Applications 30(2). 243–254.
Szabo, Gabor & Bernardo A. Huberman. 2010. Predicting the popularity of online content. Com-
munications of the ACM 53(8). 80–88.
Yu, Sheng & Subhash Kak. 2012. A survey of prediction using social media. The Computing Re-
search Repository (CoRR).
Zhang, Wenbin & Steven Skiena. 2009. Improving movie gross prediction through news analy-
sis. In Proceedings of the IEEE/WIC/ACM international joint conference on web intelligence
and intelligent agent technology - volume 1, 301–304.
Andrij Rovenchak
Where Alice Meets Little Prince
Another approach to study language relationships
1 Introduction
Quantitative methods, including those developed in physics, proved to be quite
successful for studies in various domains, including biology (Ogasawara et al.
2003), social sciences (Gulden 2002), and linguistics (Fontanari and Perlovsky
2004, Ferrer i Cancho 2006, Čech et al. 2011).
According to a recently suggested model (Rovenchak and Buk 2011a), which is
based on the analogy between the word frequency structure of a text and the
Bose-distribution in physics, a set of parameters can be obtained to describe
texts. These parameters are related to the grammar type or, more precisely, to
the analyticity level of a language (Rovenchak and Buk 2011b).
One of the newly defined parameters is an analog of the temperature in
physics. This term should not be confused with other definitions of the
“temperature of text” introduced by, e.g., Mandelbrot (1953), de Campos (1982),
Kosmidis et al. (2006), or Miyazima and Yamamoto (2008).
In the present work, two famous novels are analyzed, namely Alice’s Adventures
in Wonderland by Lewis Carroll (also known under the shortened title Alice in
Wonderland) and The Little Prince by Antoine de Saint-Exupéry. These texts have
been translated into numerous languages from different language families and
are thus seen as proper material for contrastive studies. The calculations are
made for about forty translations of The Little Prince (LP) and some twenty
translations of Alice in Wonderland (AW), with both texts available in thirteen
languages.
The following are the languages for LP:
Arabic, Armenian, Azerbaijani, Bamana, Belarusian, Bulgarian, Catalan,
Czech, German, Dutch, English, Esperanto*, Spanish, Estonian, Euskara
(Basque), Farsi, French, Georgian, Greek, Hebrew, Hindi, Croatian, Hungarian,
Italian, Lojban*, Korean, Latvian, Lithuanian, Mauritian Creole, Mongolian,
Polish, Portuguese, Romanian, Russian (2 texts), Serbian, Turkish, Ukrainian, Vi-
etnamese, Chinese, Thai.
The following are the languages for AW:
2 Method description
For a given text, the frequency list of words is compiled. In order to avoid issues
of lemmatization for inflectional languages, words are understood as orthograph-
ical words – i.e., alphanumeric sequences between spaces and/or punctuation
marks. Thus, for instance, ‘hand’ and ‘hands’ are treated as different words (dif-
ferent types).
Orthographies without “western” word separation can be studied within this
approach with some precautions. These will be discussed in Section 3 with re-
spect to texts in Chinese, Japanese, and Thai languages.
Let the types with the absolute frequency equal to j occupy the jth “energy
level”. The number of such types is the occupation number N_j. A somewhat
special role is assigned to hapax legomena (types, or words, occurring only
once in a given text); their number is N_1.
Evidently, there is no natural ordering of words within the level they occupy,
which is seen as an analog of the indistinguishability of particles in quantum
physics, and thus a quantum distribution can be applied to model the frequency
behavior of words. Since the number of types with a given frequency can be very
large, the Bose-distribution (Isihara 1971: 82; Huang 1987: 183) seems relevant
to this problem. The occupation number of the jth level in the
Bose-distribution equals

    N_j = 1 / (z^{-1} e^{ε_j/T} − 1),    (1)

where z is the fugacity analog and T is the temperature analog. The energy
levels are defined as

    ε_j = (j − 1)^α,    (2)

where the values of α appear to be within the domain between 1 and 2. The unity
is subtracted to ensure that the lowest energy is ε_1 = 0. Therefore, one can
easily obtain the relation between the number of hapaxes N_1 and the fugacity
analog z:

    z = N_1 / (N_1 + 1).    (3)
The parameters T and α are determined from the observed frequency data for
every text or its part. The procedure is as follows. First, the fugacity analog
z is calculated from the number of hapaxes using Eq. (3). Then, the observed
values of N_j with j = 2, …, jmax are fitted to

    N_j = 1 / (z^{-1} e^{(j−1)^α/T} − 1).    (4)
The upper limit jmax for the fitting can be defined empirically. Due to the
nature of word frequency distributions, the occupations of high levels decrease
rapidly; more precisely, N_j is either 1 or 0 for sufficiently large j, as this
is the number of words with a given large absolute frequency.
To define the value of jmax one can use some naturally occurring separators
linked to frequency distributions. In this work, the k-point is applied. The
definition of the k-point is similar to that of the h-point r_h in the
rank-frequency distribution, the latter being the solution of the equation
f(r_h) = r_h, where f(r) is the absolute frequency of the word with rank r. The
k-point corresponds to the so-called cumulative distribution (Popescu and
Altmann 2006); thus it is the solution of the equation N_j = j. An extension of
these definitions must be applied if the respective equations do not have
integer roots.
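The whole procedure (frequency spectrum, Eq. (3), k-point, and the fit of
Eq. (4)) can be sketched as follows; this is an illustration with SciPy under
the definitions above, not the author's original script:

    from collections import Counter
    import numpy as np
    from scipy.optimize import curve_fit

    def spectrum(tokens):
        freq = Counter(tokens)            # type -> absolute frequency
        return Counter(freq.values())     # j -> N_j (occupation numbers)

    def k_point(N):
        # Approximate the k-point as the largest j with N_j >= j.
        return max(j for j, n in N.items() if n >= j)

    def fit_T_alpha(tokens, jmax=None):
        N = spectrum(tokens)
        z = N[1] / (N[1] + 1)             # Eq. (3), from the hapax count
        jmax = jmax or k_point(N)
        js = np.arange(2, jmax + 1)
        Nj = np.array([N.get(int(j), 0) for j in js], dtype=float)
        def bose(j, T, alpha):            # Eq. (4)
            return 1.0 / (np.exp((j - 1) ** alpha / T) / z - 1.0)
        (T, alpha), _ = curve_fit(bose, js, Nj, p0=(100.0, 1.5))
        return T, alpha, z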
In practice, however, average values were defined based on the different
translations of each novel. Such an approach simplifies the automation of the
calculation process and, according to preliminary verifications, does not
influence the values of the fitting parameters significantly, due to the rapid
exponential decay of the fitting function (4) for large j.
The use of the k-point can be justified by the observation that the
low-frequency vocabulary is composed mostly of autosemantic (full-meaning)
words (Popescu et al. 2009: 37).
Note that the parameters of the model of the frequency distribution (T, z, and
α) are not given any in-depth interpretation based on the analogy with the
respective physical quantities. The possibility of such a parameter attribution
is yet to be studied.
Table 1: Observed values of Nj for different translations of The Little Prince. The nearest js to
the k-points are marked in bold italic and grey-shaded.
As Table 1 demonstrates, one can use the following value for The Little Prince:
jmax = 10, which appears to be close to the mean of the k-points for the whole set
of languages.
Table 2: Observed values of Nj for different translations of Alice in Wonderland. The nearest js
to the k-points are marked in bold italic and grey-shaded.
From Table 2, the estimate for Alice can be taken as jmax = 15 on the same
grounds as above.
After the fitting procedure is applied, a pair of parameters T and α is
obtained for each text. It was established previously (Rovenchak and Buk 2011a;
2011b) that the parameter τ = ln T / ln N has a weak dependence on the text
length N and thus can be used to compare texts in different languages, as they
can contain different numbers of tokens. This is related to the parameter
scaling discussed in Section 4.
Each text is thus represented by a point on the α–τ plane or, more precisely,
by a domain determined through the uncertainty of the fitting. The results are
shown in Figures 1 and 2.
Fig. 1: Positions of different translations of Alice in Wonderland on the α–τ plane (a). An
enlarged view is given in the bottom panel (b). Language codes are explained in the Appendix.
For clarity, not all the analyzed languages are shown.

Fig. 2: Positions of different translations of The Little Prince on the α–τ plane (a). An
enlarged view is given in the bottom panel (b). Language codes are explained in the Appendix.
For clarity, not all the analyzed languages are shown.
Interestingly, the texts are mostly located within a wide band stretching
between small-α–small-τ and large-α–large-τ values. The lower-left corner is
occupied by languages of a highly analytic nature, namely Chinese, Vietnamese,
Lojban, Hawaiian, Bamana, and Mauritian Creole. The top-right corner is, on the
contrary, occupied by highly synthetic languages (the Slavic group, Hungarian,
Latin, etc.). Intermediate positions correspond to languages with a medium
analyticity level.
4 Parameter scaling
The main focus of this study is the dependence of the defined parameters on
text length, i.e., the development of the parameters in the course of text
production. The parameters appear to be quite stable (their behavior being
easily predictable) for highly analytic languages (Chinese and Lojban, an
artificial language). On the other hand, even if a difference is observed
between the versions of the two novels in the same language, the divergence
between different translations is not significant. This means that translator
strategies do not influence the parameter values much; rather, the text type
(genre) is more important.
To study the scaling of the temperature parameter, the languages were
grouped into several classes depending on the coarsened mean value of the α ex-
ponent:
α = 1.15: Lojban;
α = 1.20: Chinese;
α = 1.40: English;
α = 1.45: French;
α = 1.50: Bulgarian, German, Esperanto, Italian, Romanian;
α = 1.60: Hungarian, Polish, Russian, Ukrainian.
With the value of α fixed as above, the temperature T was calculated
consecutively for the first 1000, 2000, 3000, etc. words of each text. The
results for T demonstrate a monotonically increasing dependence on the number
of words N.
The simplest model for the temperature scaling was tested, namely a power law:

    T = t N^β.    (5)

Obviously, the new parameters t and β are not independent of the previously
defined τ = ln T / ln N:

    β = τ − ln t / ln N,    (6)

therefore β coincides with τ in the limit of an infinitely long text.
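A sketch of this scaling fit, assuming a hypothetical helper fit_T_for_prefix
that refits T with α held fixed (a wrapper around the procedure of Section 2):

    import numpy as np
    from scipy.optimize import curve_fit

    def scaling_parameters(tokens, fit_T_for_prefix, step=1000):
        # Temperatures for the first 1000, 2000, ... tokens, then a fit
        # of the power law T = t * N**beta, Eq. (5).
        lengths = np.arange(step, len(tokens) + 1, step)
        T_values = np.array([fit_T_for_prefix(tokens[:n]) for n in lengths])
        power_law = lambda N, t, beta: t * N ** beta
        (t, beta), _ = curve_fit(power_law, lengths, T_values, p0=(1.0, 0.8))
        return t, beta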
Where Alice meets Little Prince 225
Fig. 3: “Temperature” T (vertical axis) versus text length N (horizontal axis) for several
laguages. Longer series of data correspond to Alice in Wonderland. Solid line is a fit for Alice in
Wonderland, dashed line is a fit for The Little Prince.
Table 3: Scaling parameters for the temperature T. The results of the fitting are within the
intervals t ± Δt and β ± Δβ. Language codes are explained in the Appendix.

Code  Text  t     Δt    β     Δβ
ukr   LP    0.33  0.12  0.82  0.04
ukr   AW    0.16  0.01  0.93  0.01
5 Text trajectories
Text trajectories are obtained for each translation. The values of the two
parameters correspond to a point, and their change as the text grows traces a
trajectory on the respective plane.
First of all, one can observe that stability in the parameter values is not
achieved immediately, at small text lengths. Generally, a number of tokens
greater than several thousand is preferable, and the translations of The Little
Prince are close to the lower edge with respect to text length (cf. Rovenchak
and Buk 2011b).

Fig. 4: Trajectories of Russian translations of The Little Prince (+ and × symbols) and Alice in
Wonderland (□, ■, and ✴). As in the scaling analysis, a step of 1000 tokens was used; thus the
first point corresponds to the first 1000 tokens, the second one to the first 2000 tokens, etc.
Error bars are not shown for clarity.
Results for the text trajectories are shown in Figure 4 for Russian, with two
translations of The Little Prince and three translations of Alice in
Wonderland. There is a qualitative similarity in the shapes of these
trajectories, while numerically different results are obtained for different
texts, which, as already mentioned in Section 4, suggests the influence of
genre on the parameter values (even within a single language).
6 Discussion
A method to analyze texts using an approach inspired by a statistical-
mechanical analogy has been described and applied to the study of translations
of two novels, Alice in Wonderland and The Little Prince. A result obtained
earlier is confirmed: there exists a correlation between the level of language
analyticity and the values of the parameters calculated in this approach.
The behavior of the parameters in the course of text production has been
studied. Within the same language, the dependence on the translator of a given
text appears much weaker than the dependence on the text genre. So far, it is
not yet possible to provide an exact attribution of a language with respect to
the parameter values, since the influence of genre has not been studied in
detail.
Further prospects of the presented approach lie in analyzing texts of different
genres which have been translated into a number of languages, preferably from
different families. These could be religious texts (though generally very
specific in language style) or some popular works of fiction (e.g., Pinocchio
by Carlo Collodi, Twenty Thousand Leagues under the Sea by Jules Verne,
Winnie-the-Pooh by Alan A. Milne, The Alchemist by Paulo Coelho, etc.). From
texts translated into now-extinct languages (e.g., Latin or Old Church
Slavonic), some hints about language evolution can be obtained as well
(Rovenchak 2014).
Acknowledgements
I am grateful to Michael Everson for providing electronic versions of Alice in Won-
derland translated into Cymraeg (Welsh), Gaelic, Hawaiian, Latin, and Swedish.
References
de Campos, Haroldo. 1982. The informational temperature of the text. Poetics Today 3(3). 177–
187.
Čech, Radek, Ján Mačutek & Zdeněk Žabokrtský. 2011. The role of syntax in complex networks:
Local and global importance of verbs in a syntactic dependency network. Physica A
390(20). 3614–3623.
Ferrer i Cancho, Ramon. 2006. Why do syntactic links not cross? Europhysics Letters 76(6).
1228–1234.
Fontanari, José F. & Leonid I. Perlovsky. 2004. Solvable null model for the distribution of word
frequencies. Physical Review E 70(4). 042901.
Gulden, Timothy R. 2002. Spatial and temporal patterns of civil violence in Guatemala, 1977–
1986. Politics and the Life Sciences 21(1). 26–36.
Huang, Kerson. 1987. Statistical mechanics. 2nd edn. New York: Wiley.
Isihara, Akira. 1971. Statistical physics. New York & London: Academic Press.
Kosmidis, Kosmas, Alkiviadis Kalampokis & Panos Argyrakis. 2006. Statistical mechanical ap-
proach to human language. Physica A 366. 495–502.
Mandelbrot, Benoit. 1953. An informational theory of the statistical structure of language. In
Willis Jackson (ed.), Communication theory, 486–504. London: Butterworths.
Miyazima, Sasuke & Keizo Yamamoto. 2008. Measuring the temperature of texts. Fractals
16(1). 25–32.
Ogasawara, Osamu, Shoko Kawamoto & Kousaku Okubo. 2003. Zipf's law and human tran-
scriptomes: An explanation with an evolutionary model. Comptes rendus biologies
326(10–11). 1097–1101.
Popescu, Ioan-Iovitz & Gabriel Altmann. 2006. Some aspects of word frequencies. Glottomet-
rics 13. 23–46.
Popescu, Ioan-Iovitz, Gabriel Altmann, Peter Grzybek, Bijapur D. Jayaram, Reinhard Köhler,
Viktor Krupa, Ján Mačutek, Regina Pustet, Ludmila Uhlířová & Matummal N. Vidya. 2009.
Word frequency studies (Quantitative Linguistics 64). Berlin & New York: Mouton de Gruy-
ter.
Rovenchak, Andrij. 2014. Trends in language evolution found from the frequency structure of
texts mapped against the Bose-distribution. Journal of Quantitative Linguistics 21(3). 281–
294.
Rovenchak, Andrij & Solomija Buk. 2011a. Application of a quantum ensemble model to lin-
guistic analysis. Physica A 390(7). 1326–1331.
Rovenchak, Andrij & Solomija Buk. 2011b. Defining thermodynamic parameters for texts from
word rank–frequency distributions. Journal of Physical Studies 15(1). 1005.
Appendix
Language Codes
Peter Zörnig
A Probabilistic Model for the Arc Length in Quantitative Linguistics
1 Introduction
Texts are written or spoken in the form of linear sequences of some entities.
From the qualitative point of view, they are sequences of phonic, lexical,
morphological, semantic and syntactic units. Qualitative properties are
secondary for quantitative linguistics; hence they are simply omitted from the
research. Thus a text may be presented as a sequence (x1,…,xn) whose elements
represent phonic, lexical, grammatical, morphological, syntactic or semantic
entities. Of course, the positions of the entities depend, among other things,
on the grammar, when a sentence is constructed, or on the historical succession
of events, when a story is written, and both aspects are known to the speaker.
What is unknown, and happens without conscious application, are the effects of
some background laws which cannot be learned or simply applied. The speaker
abides by them without knowing them.
A number of such laws – i.e. derived and sufficiently positively tested
statements – can be discovered using the sequential form of the text. If such a
sequence is left in its qualitative form, there appear distances between equal
entities, and every text may be characterized using some function of the given
distances. If the text is transformed into a quantitative sequence (e.g. in
terms of word lengths, sentence lengths, word frequencies, semantic
complexities, etc.), then we have a time series, which is the object of an
extensive statistical discipline.
Even if one does not attribute positions to the entities, one may find Markov
chains, transition probabilities, autocorrelations, etc. If one replaces the
entities by quantities, a number of other methods become available which may
reveal hidden structures of language that would not be accessible in any other
way. The original complete text shows only deliberate facts such as meanings
and grammatical rules, but rewriting it, e.g., in terms of measured
morphological complexities may reveal some background mechanisms. Knowledge of
these may serve as a basis for future research.
The text in its quantitative form may change to a fractal, it may display
regular or irregular oscillation, the distances between neighbors can be
measured, etc. Here we restrict ourselves to the study of the arc length,
expressing the sum of the lengths of the straight-line segments connecting the
consecutive points (i, x_i) and (i + 1, x_{i+1}):

    L = Σ_{i=1}^{n−1} √((x_{i+1} − x_i)² + 1).    (1)

In the probabilistic model, the sequence elements are regarded as independent
random variables X_1, …, X_n assuming the values of M = {1, …, m} with
probabilities p_1, …, p_m; the arc length then becomes the random variable

    L = Σ_{i=1}^{n−1} B_i,  where B_i = √((X_i − X_{i+1})² + 1).    (2)

Let Z = |X_1 − X_2| denote the absolute difference of two consecutive elements,
q_i = P(Z = i), and let B = √(Z² + 1) be the length of a single line segment.
The expectations of Z and B are then given by

Theorem 2:

    a) E(Z) = Σ_{i=1}^{m−1} i q_i,

    b) E(B) = Σ_{i=0}^{m−1} √(i² + 1) q_i.
Example 2:
(a) Assuming that the elements of M are chosen with equal probabilities,
i.e. p_1 = … = p_m = 1/m, the preceding theorems yield in particular

    P(Z = 0) = q_0 = 1/m;  P(Z = i) = q_i = 2(m − i)/m² for i = 1, …, m − 1,

    P(B = 1) = 1/m;  P(B = √(i² + 1)) = 2(m − i)/m² for i = 1, …, m − 1,

    E(Z) = Σ_{i=1}^{m−1} i q_i = Σ_{i=1}^{m−1} i · 2(m − i)/m² = (m² − 1)/(3m),

    E(B) = Σ_{i=0}^{m−1} √(i² + 1) q_i = 1/m + (2/m²) Σ_{i=1}^{m−1} √(i² + 1) (m − i).    (6)
The expectation (6) can be approximated by

    E(B) ≈ 0.31 m + 0.46 for 4 ≤ m ≤ 20,  and  E(B) ≈ m/3 for m > 20.    (7)

The approximation is good for 4 ≤ m ≤ 20, and for m > 20 the sequence of
relative errors is monotonically decreasing, starting with the value 0.021. The
underlying idea for obtaining the approximation (7) in the case m > 20 is as
follows. The sum in (6) is approximately equal to

    Σ_{i=1}^{m−1} i (m − i) = (m³ − m)/6.

Thus

    E(B) ≈ 1/m + (2/m²) · (m³ − m)/6 = 1/m + m/3 − 1/(3m) ≈ m/3.
(b) For m = 4, p_1 = 1/2, p_2 = 1/4, p_3 = 1/6, p_4 = 1/12 we obtain

    q_0 = 25/72, q_1 = 13/36, q_2 = 5/24, q_3 = 1/12,

    E(Z) = Σ_{i=1}^{3} i q_i = 13/36 + 2 · 5/24 + 3 · 1/12 = 37/36 ≈ 1.0278,

    E(B) = Σ_{i=0}^{3} √(i² + 1) q_i = 25/72 + √2 · 13/36 + √5 · 5/24 + √10 · 1/12 ≈ 1.5873.
By using the simple relation V(X) = E(X²) − (E(X))² we obtain the following
formulas for the variances:

Theorem 3:

    a) V(Z) = Σ_{i=1}^{m−1} i² q_i − (Σ_{i=1}^{m−1} i q_i)²,

    b) V(B) = Σ_{i=0}^{m−1} (i² + 1) q_i − (Σ_{i=0}^{m−1} √(i² + 1) q_i)².
Example 3:
(a) For the special case of equal probabilities p_i the theorem yields

    V(Z) = Σ_{i=1}^{m−1} i² · 2(m − i)/m² − ((m² − 1)/(3m))² = (m² − 1)/6 − ((m² − 1)/(3m))²
         = (m⁴ + m² − 2)/(18m²),

    V(B) = 1/m + (2/m²) Σ_{i=1}^{m−1} (i² + 1)(m − i) − (1/m + (2/m²) Σ_{i=1}^{m−1} √(i² + 1) (m − i))²
         = (m² + 5)/6 − (1/m + (2/m²) Σ_{i=1}^{m−1} √(i² + 1) (m − i))².
(b) For the probabilities of Example 2(b) we obtain

    V(Z) = q_1 + 4 q_2 + 9 q_3 − (E(Z))² = 13/36 + 4 · 5/24 + 9 · 1/12 − (37/36)² ≈ 0.8881,

    V(B) = q_0 + 2 q_1 + 5 q_2 + 10 q_3 − (E(B))²
         = 25/72 + 2 · 13/36 + 5 · 5/24 + 10 · 1/12 − 1.5873² ≈ 0.4249.
In order to obtain the variance of the arc length (2) we need the following two
theorems. Let B_i = √((X_i − X_{i+1})² + 1) denote the summands in (2) (see
also (5)). Since the X_i are assumed to be independent, two of the variables
B_i are also independent if they are not directly consecutive. This means,
e.g., that B_1 and B_j are independent for j ≥ 3, but the variables
B_1 = √((X_1 − X_2)² + 1) and B_2 = √((X_2 − X_3)² + 1) are not independent.
This can be made clear as follows. Assume for example that m = 5, i.e. the
values of the X_i are chosen from M = {1,…,5}. Then the maximal length of a
line segment of the arc is √((1 − 5)² + 1) = √17 (see Fig. 1). But if, e.g.,
the second element of the sequence (x_1,…,x_n) is x_2 = 3 (the end height of
the first segment), then the length of the second segment is at most
max_{i∈M} √((3 − i)² + 1) = √5. Hence the first line segment of the arc
prevents the second from assuming the maximal length, i.e. there is a
dependency between these segments.
Theorem 4: The expectation of the product B_1 B_2 is

    E(B_1 B_2) = Σ_{i=1}^{m} Σ_{j=1}^{m} Σ_{k=1}^{m} p_i p_j p_k √((i − j)² + 1) √((j − k)² + 1).

The proof follows from the definition of the expectation, since each element of
the triple sum represents a product of a value of the random variable B_1 B_2
and its probability (where the values of B_1 B_2 need not be distinct).
The last formula cannot be simplified, not even for equal p_i. The triple sum
consists of m³ summands, and for not too large values of m it can be calculated
by means of suitable software, as sketched below.
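A direct implementation of the triple sum (a sketch; applied to Example 4 below
it returns 1.4388):

    from math import sqrt

    def expected_B1B2(p):
        # Triple sum over the inventory M = {1, ..., m}; m**3 summands.
        m = len(p)
        return sum(p[i-1] * p[j-1] * p[k-1]
                   * sqrt((i - j)**2 + 1) * sqrt((j - k)**2 + 1)
                   for i in range(1, m + 1)
                   for j in range(1, m + 1)
                   for k in range(1, m + 1))

    print(round(expected_B1B2([3/5, 2/5]), 4))   # -> 1.4388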
Example 4:
For m = 2, p_1 = 3/5 and p_2 = 2/5 we get

    E(B_1 B_2) = p_1 p_1 p_1 · 1 · 1 + p_1 p_1 p_2 · 1 · √2 + p_1 p_2 p_1 · √2 · √2 + p_1 p_2 p_2 · √2 · 1
               + p_2 p_1 p_1 · √2 · 1 + p_2 p_1 p_2 · √2 · √2 + p_2 p_2 p_1 · 1 · √2 + p_2 p_2 p_2 · 1 · 1
               = (3/5)³ + (3/5)²(2/5) √2 + (3/5)²(2/5) · 2 + (3/5)(2/5)² √2
               + (3/5)²(2/5) √2 + (3/5)(2/5)² · 2 + (3/5)(2/5)² √2 + (2/5)³ = 1.4388.
By using Theorem 2(b) we now obtain the following statement.

Theorem 5: The covariance of B_1 and B_2 is

    Cov(B_1, B_2) = E(B_1 B_2) − E(B_1) E(B_2) = E(B_1 B_2) − (E(B))².

Theorem 6: The expectation of the arc length is

    E(L) = (n − 1) E(B).

This means in particular that E(L) increases linearly with n, and the relative
arc length E(L)/(n − 1) is an increasing function of the inventory size m.
For equal probabilities p_i we obtain (see Example 2(a))

    E(L) = (n − 1) E(B) = (n − 1) (1/m + (2/m²) Σ_{i=1}^{m−1} √(i² + 1) (m − i)),    (8)

or the approximate relation

    E(L) ≈ (n − 1)(0.31 m + 0.46) for 4 ≤ m ≤ 20,  E(L) ≈ (n − 1) m/3 for m > 20.    (9)
An additivity law corresponding to Theorem 6 holds for the variance only when a
sum of independent random variables is given. As we have seen above, this is
not the case for the arc length. Therefore we make use of the following fact to
determine V(L) (see e.g. Ross (2007, p. 54)).

Theorem 7: The variance of a sum of arbitrary random variables Y_1 + … + Y_n is
given by

    V(Y_1 + … + Y_n) = Σ_{i=1}^{n} V(Y_i) + 2 Σ_{i<j} Cov(Y_i, Y_j).

Applied to the arc length this yields

Theorem 8: V(L) = (n − 1) V(B) + 2(n − 2) Cov(B_1, B_2),

because all other covariances are zero. Since the last sum has (n − 2)
elements, the statement follows.
Theorem 9: The variance of the arc length can alternatively be computed as

    V(L) = (n − 1) E(B²) + 2(n − 2) E(B_1 B_2) + (5 − 3n) (E(B))²
         = (n − 1) Σ_{i=0}^{m−1} (i² + 1) q_i
           + 2(n − 2) Σ_{i=1}^{m} Σ_{j=1}^{m} Σ_{k=1}^{m} p_i p_j p_k √((i − j)² + 1) √((j − k)² + 1)
           + (5 − 3n) (Σ_{i=0}^{m−1} √(i² + 1) q_i)².

(see Popescu et al. (2013, p. 52)). The corresponding observed arc length is
L ≈ 50.63.
We consider the probabilistic model in which a random variable X_i assumes the
values 1,…,9 with probabilities corresponding to the observed relative
frequencies in (10), i.e. (p_1,…,p_9) = (0, 0, 0, 0, 2/32, 10/32, 9/32, 7/32,
4/32) and m = 9. From the definition of the numbers q_i we obtain
(q_0,…,q_8) = (125/512, 201/512, 31/128, 27/256, 1/64, 0, 0, 0, 0), from which
E(B) = Σ √(i² + 1) q_i ≈ 1.7388 and E(B²) = Σ (i² + 1) q_i ≈ 3.5605. Hence

    E(B_1 B_2) = Σ_{i=1}^{9} Σ_{j=1}^{9} Σ_{k=1}^{9} p_i p_j p_k √((i − j)² + 1) √((j − k)² + 1) ≈ 3.1115,

yielding

    E(L) = (n − 1) E(B) = 31 · 1.7388 ≈ 53.90,

    V(L) = (n − 1) E(B²) + 2(n − 2) E(B_1 B_2) + (5 − 3n) (E(B))²
         = 31 · 3.5605 + 60 · 3.1115 − 91 · 1.7388² ≈ 21.93.

Thus in this example the observed arc length is smaller than expected under the
probabilistic model; however, the absolute deviation between the observed and
the expected arc length is less than the standard deviation σ = √V(L) ≈ 4.68.
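These values can be cross-checked by simulation (a sketch): drawing random
sequences of length n = 32 under the given probabilities and averaging the arc
lengths should approximately reproduce E(L) ≈ 53.90 and V(L) ≈ 21.93.

    import random
    from math import sqrt

    p = [0, 0, 0, 0, 2/32, 10/32, 9/32, 7/32, 4/32]
    values = list(range(1, 10))

    def arc_length(x):                    # Eq. (1)
        return sum(sqrt((x[i+1] - x[i])**2 + 1) for i in range(len(x) - 1))

    random.seed(1)
    samples = [arc_length(random.choices(values, weights=p, k=32))
               for _ in range(100_000)]
    mean = sum(samples) / len(samples)
    var = sum((s - mean)**2 for s in samples) / (len(samples) - 1)
    print(round(mean, 2), round(var, 2))  # close to 53.90 and 21.93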
3 Applications
We study the sequences (x_1,…,x_n) determined for 32 texts from 14 different
languages (see Table 1), where x_i represents the length of the i-th word in
the text. The texts are the same as in Zörnig (2013, Section 4, Tables 1–9).
The first column of Table 1 indicates the text; e.g., number 3a indicates the
third text in Table 3 of the aforementioned article. For each sequence the
observed arc length L was determined according to (1) (see column 2 of
Table 1).
Columns 3 and 4 contain the minimal and the approximate maximal arc lengths
L_min and L_max that could be achieved by permuting the respective sequence
(see Popescu et al. (2013, Section 2)). Columns 5 and 6 contain the expectation
and the variance of the arc length when it is considered as a random variable
as in Section 2. In this context the probabilities are assumed to be equal to
the observed relative frequencies in the sequence. For example, the sequence
(x_1,…,x_n) of the first text in Table 1 has length n = 926 and consists of
m = 6 different elements chosen from M = {1,…,6}. The frequency of occurrence
of the element i in the sequence is k_i, where (k_1,…,k_6) = (336, 269, 213,
78, 27, 3), see Zörnig (2013, p. 57). Thus the probabilities in the
probabilistic model are assumed to be p_i = k_i/n for i = 1,…,6 (see also
Example 6).
Column 7 contains the values of the standardized arc length, defined by

    L_st = (L − E(L))/σ with σ = √V(L),

which is calculated in order to make the observed arc lengths
comparable (see e.g. Ross (2007, p. 38) for this concept). For example, for the
fourth text we found L_st ≈ 3.98. This means that the observed arc length
equals the expected one plus 3.98 times the standard deviation σ, i.e. the
observed arc length is significantly larger than in the random model. Columns 8
to 10 of Table 1 contain the sequence length n, the relative arc length
L_rel = L/(n − 1), and the inventory size m.
It can be observed that the standardized arc length varies between −0.7641
(text 5c: Bamana) and 6.8158 (text 7b: Tagalog). In particular, the latter text
is characterized by an extremely high arc length, which corresponds to a very
frequent change between different entities. The results, which should be
validated on further texts, can be used, e.g., for typological purposes. The
resulting quantities are not isolated entities; their relations to other text
properties must still be investigated in order to incorporate them into
synergetic control cycles.
Comparing the observed arc length L with the sequence length n, one can easily
detect the linear correlation illustrated graphically in Fig. 2. By means of a
simple linear regression we found the relation L = 254.58 + 1.5023 n,
corresponding to the straight line.
As a measure of the goodness of fit we calculate the coefficient of
determination (see Zörnig (2013, p. 56)), defined by

    R² = 1 − Σ (L(n) − (254.58 + 1.5023 n))² / Σ (L(n) − L_mean)²,

where L(n) is the observed arc length corresponding to the sequence length n
and L_mean is the mean of the 32 observed L-values. The summation runs over all
32 values of n in Table 1. The above formula yields R² = 0.9082. Since
R² > 0.9, the fit can be considered good.
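The computation can be sketched as follows (n_values and L_values stand for the
32 observed pairs of Table 1, which are not reproduced here):

    def r_squared(n_values, L_values):
        # Coefficient of determination for the line L = 254.58 + 1.5023*n.
        mean_L = sum(L_values) / len(L_values)
        ss_res = sum((L - (254.58 + 1.5023 * n))**2
                     for n, L in zip(n_values, L_values))
        ss_tot = sum((L - mean_L)**2 for L in L_values)
        return 1 - ss_res / ss_tot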
Finally, Fig. 3 graphically illustrates the relation between the inventory size
m and the observed relative arc length L_rel = L/(n − 1). Though different
values of L_rel can belong to the same inventory size m, there seems to be a
slight tendency for L_rel to increase with m. This observation is in accordance
with the relations following from the probabilistic model (see the remark after
Theorem 6). The simple linear regression yields L_rel = 1.0192 + 0.1012 m,
which corresponds to the straight line. The coefficient of determination is
R² = 0.5267.
5 Concluding remarks
It is possible to consider generalizations or variations of the concept of “arc
length”. By using the Minkowski distance we obtain a more general concept by
defining

    L^(p) = Σ_{i=1}^{n−1} (|x_{i+1} − x_i|^p + 1)^{1/p}.
References
Popescu, Ioan-Iovitz, Ján Mačutek & Gabriel Altmann. 2009. Aspects of word frequencies.
Lüdenscheid: RAM-Verlag.
Popescu, Ioan-Iovitz, Emmerich Kelih, Ján Mačutek, Radek Čech, Karl-Heinz Best & Gabriel
Altmann. 2010. Vectors and codes of text. Lüdenscheid: RAM-Verlag.
Popescu, Ioan-Iovitz, Radek Čech & Gabriel Altmann. 2011. The lambda-structure of texts.
Lüdenscheid: RAM-Verlag.
Popescu, Ioan-Iovitz, Peter Zörnig, Peter Grzybek, Sven Naumann & Gabriel Altmann. 2013.
Some statistics for sequential text properties. Glottometrics 26. 50–99.
Ross, Sheldon M. 2007. Introduction to probability models, 9th edn. Amsterdam: Elsevier.
Wälchli, Bernhard. 2011. Quantifying inner forms: A study in morphosemantics. Working paper
46. Bern: University of Bern, Institute for Linguistics. http://www.isw.unibe.ch (accessed
December 2013)
Zörnig, Peter. 2013. A continuous model for the distances between coextensive words in a text.
Glottometrics 25. 54–68.
Subject Index
accuracy, 188, 202, 206, 212 – generalized, 9, 10, 11, 12, 13, 14, 21, 22, 26,
adjectives, 4, 17, 160 29, 32
adverbs, 17, 42 – part of speech, 16, 17, 20, 25, 26, 27, 28
affix, 2 – partial, 9, 10, 11, 12, 13, 14, 18, 21, 23, 32
ANOVA, 109, 110 – rhymed, 17, 20, 23, 25, 30
arc, 20 – unrhymed, 18, 23, 24, 25, 28
– length, 4, 84, 231, 232, 236, 237, 238, 239, – weighted sum, 11
240, 241, 242, 243, 244, 245 – weights, 11, 12, 13, 18, 19, 21, 28
associative symmetric sums, 10 characters, 36, 39, 41, 91, 178, 202
author, 77, 78, 94, 149, 152, 157, 165, 205 – frequency, 220
– conscious control, 77 chi-square, 40, 41, 42, 43, 45, 46, 54, 94, 100
– subconscious control, 77 Chomsky, Noam, 67
authorship attribution, 28, 72, 110, 130 clauses, 58, 59, 60, 95
autocorrelation, 4, 35, 43, 231 – length, 58, 59, 60, 61
– function. See function coefficient of variation, 157
– index, 38, 39, 40, 41, 42, 45, 46, 51, 52 cognacy judgments, 171
– negative, 39, 44 cognates, 171, 172
– positive, 41, 44, 48 – automatic identification, 173, 174, 175
– semantic, 50 comparative method, 171
– textual, 35, 38, 39, 46, 47, 48, 49, 54 conditional random fields (CRF) model, 205,
automated similarity judgment program 206, 207, 208, 210, 211, 212, 213
(ASJP), 172, 176, 177, 178, 179, 180, 183, confidence intervals, 61, 65, 136, 139, 153
189, 193 confusion matrix, 212
autoregressive integrated moving average control cycle, 2, 3, 4
(ARIMA) model, 152, 161 corpus, 49, 112, 126, 176, 208
autoregressive–moving-average (ARMA) – Brown, 50
model, 152, 169 – Potsdam, 98
axioms, 9, 10, 14 – reference, 50
– commutativity, 11 correlation, 13, 30, 148, 161, 186, 209
Bayes classifier, 203, 204, 212 – coefficient, 13, 17
binary coding, 148, 162 – coefficient of determination, 61, 62, 63, 64,
Bonferroni correction, 80, 187 65, 127, 242
borderline conditions, 9 – negative, 161, 163
boundary conditions, 2, 5, 72 – Pearson’s r, 17, 185
centroid, 44 – point biserial, 185
chapters, 1, 8, 26, 27, 28, 29, 32, 75, 76, 147, – Spearman’s ρ, 186
149, 154, 155, 157, 158, 159, 161, 163, correspondence analysis, 46
164, 165 cross entropy, 184
– length, 72, 73, 74, 76, 148, 150 cross validation, 212
characteristics, 8, 9, 11, 13, 14, 17, 18, 21, 22, dependent variable, 204
23, 28, 32 dictionary, 2, 90, 91
– consistency, 8, 12, 13, 14, 16, 17, 18, 20, 21, dimensionality, 11
22, 23, 24, 25, 26, 28, 29, 32 dispersion, 157, 158
– genre, 71, 72, 79, 130, 163, 224, 228 unweighted pair group method with
– length, 71, 72, 73, 77, 78, 79, 83, 84, 85, arithmetic mean (UPGMA), 172
110, 111, 221, 224, 225, 227 u-test, 80
– production, 224 verbs, 4, 16, 17, 42, 44, 46, 50, 51, 53, 54, 160
– randomization, 59, 128, 129, 135, 136, 137, verse, 28, 29, 147, 149, 153, 157, 159, 165, 239
138 – length, 148, 150, 153, 154, 155, 156, 161, 165
– topic, 203, 208, 209 vertices, 20
– trajectories, 227 vocabulary "richness", 71, 109
textual neighbourhoods, 35 word length, 1, 90, 91, 110, 125, 126, 127, 129,
textual series, 151 133, 231
time series, 35, 109, 148, 151, 152, 154, 161, word lists, 171, 172, 173, 175, 176, 178, 179,
168, 169, 203, 213, 231 188, 191
– autocovariance, 168, 169 WordNet, 50, 51, 53
– variance, 168 words, 36, 155, 157
tokens, 49, 50, 109, 134, 138, 227 – autosemantic, 219
translations, 8, 13, 26, 27, 28, 29, 30, 32, 217, – content, 114, 116, 117, 118
219, 220, 221, 223, 224, 226, 227, 228 – frequency, 112, 113, 125, 218, 220, 231
translator, 224, 228 – function, 114
tree inference algorithm, 173, 186 – ordering, 205
trend analysis, 148, 154 – orthographical, 218
tweets, 201, 202, 204, 208, 209, 212 – repetition, 3, 77
types, 35, 36, 39, 40, 41, 45, 49, 54, 84, 109, – senses, 205, 207
133, 218 – unrhymed, 28, 33
type-token ratio, 71, 72, 109, 110, 158 – valence reversers, 206
undirected graph, 207 – zero syllabic, 134
– fuzzy, 20, 23 world atlas of language structures (WALS),
177, 179, 180, 184, 185
Authors Index
Albert, Réka, 59, 66
Altmann, Gabriel, 1, 5, 6, 28, 29, 33, 57, 66, 71, 72, 86, 87, 95, 96, 108, 110, 113, 122, 123, 125, 130, 131, 219, 229
Andreev, Sergey, 7, 28, 33
Andres, Jan, 57, 61, 66
Anselin, Luc, 35, 55
Antić, Gordana, 134, 145
Argamon, Shlomo, 28, 33
Asur, Sitaram, 202, 203, 213, 215
Atkinson, Quentin D., 171, 190
Baixeries, Jaume, 57, 66
Bakker, Dik, 172, 190, 191, 193
Barabási, Albert-László, 59, 66
Bavaud, François, 35, 39, 40, 41, 51, 55
Beliankou, Andrei, 97, 98, 108
Benešová, Martina, 57, 66, 128, 130
Benjamini, Yoav, 187, 190
Bergsland, Knut, 171, 190
Best, Karl-Heinz, 96, 108, 114
Bollen, Johan, 202, 213
Borisov, Vadim V., 7, 9, 26, 33
Boroda, Mojsej, 4, 6, 89, 108
Bouchard-Côté, Alexandre, 173, 190
Box, George E., 152, 166
Brew, Chris, 182, 190
Brown, Cecil H., 178, 190, 191, 193
Buk, Solomija, 217, 219, 221, 227, 229
Bunge, Mario, 111, 122
Bychkov, Igor A., 9, 33
Carroll, John B., 2, 6
Carroll, Lewis, 217
Cavnar, William B., 182, 190
Čech, Radek, 29, 33, 71, 72, 86, 128, 130, 217, 229
Choi, Yejin, 206, 214
Chomsky, Noam, 59, 67
Chrétien, C. Douglas, 171, 192
Christiansen, Chris, 176, 190
Cleveland, William S., 155, 166
Cliff, Andrew D., 39, 55
Cohen, Avner, 59, 66
Coleridge, Samuel T., 13, 16, 17, 18, 20, 26, 27, 29, 30, 31
Cootner, Paul H., 202, 214
Covington, Michael A., 71, 86, 110, 122
Cramer, Irene M., 61, 66, 110, 123, 125, 130
Cressie, Noel A.C., 35, 55
Cryer, Jonathan, 152, 166
Danforth, Christopher, 202, 214
Daňhelka, Jiří, 149, 166
de Campos, Haroldo, 217, 229
de Saint-Exupéry, Antoine, 217
Delen, Dursun, 204, 215
Dementyev, Andrey V., 9, 33
Dickens, Charles, 72, 74, 76
Dockery, Everton, 202, 214
Dodds, Peter, 202, 214
Dryer, Matthew S., 179, 190, 191
Dubois, Didier, 9, 10, 11, 12, 33
Dunning, Ted E., 183, 190
Durie, Mark, 171, 190, 192
Dyen, Isidore, 174, 175, 190
Eder, Maciej, 147, 148, 150, 152, 166, 167
Ellegård, Alvar, 171, 190
Ellison, T. Mark, 173, 191
Elvevåg, Brita, 59, 66
Embleton, Sheila M., 171, 191
Eppen, Gary D., 202, 214
Fama, Eugene F., 202, 214
Fedulov, Alexander S., 9, 26, 33
Felsenstein, Joseph, 172, 174, 191
Ferrer i Cancho, Ramon, 59, 66, 217, 229
Fontanari, José F., 217, 229
Foulds, Leslie R., 176, 192
Francis, Nelson W., 50, 55
Gallagher, Liam A., 202, 214
Gilbert, Eric, 202, 214
Glance, Natalie, 202, 214
Glass, Gene V., 152, 166
Goethe, 239
Goodman, Leo A., 184, 191
Gottman, John M., 152, 166
Gray, Russell D., 171, 173, 190, 191
Greenacre, Michael, 46, 55
Greenhill, Simon J., 173, 176, 191
Gregson, Robert A., 152, 166
Grinstead, Charles M., 47, 55
Gruhl, Daniel, 202, 214
Gulden, Timothy R., 217, 229
Gumilev, Nikolay, 13, 26, 27, 28, 29, 30, 31
Gunawardana, Asela, 206, 214
Hammarström, Harald, 172, 189, 192, 193
Hašek, Jaroslav, 72, 73, 75
Haspelmath, Martin, 177, 179, 190, 191
Hauer, Bradley, 173, 191
Havel, Václav, 60, 61
Hay, Richard A., 152, 166
Herdan, Gustav, 151, 166
Hess, Carla W., 71, 86
Hochberg, Yosef, 187, 190
Hogg, Tad, 204, 214
Holman, Eric W., 172, 178, 189, 190, 191, 193
Hoover, David L., 7, 33
Hrabák, Josef, 161, 166
Hřebíček, Luděk, 57, 66
Hu, Fengguo, 128, 131
Huang, Kerson, 218, 229
Huberman, Bernardo A., 202, 203, 204, 213, 215
Huff, Paul, 176, 186, 191
Huffman, Stephen M., 176, 191
Inkpen, Diana, 175, 191
Isihara, Akira, 218, 229
Jäger, Gerhard, 176, 191
Jenkins, Gwilym M., 152, 166
Jireček, Josef, 149, 166
Juola, Patrick, 7, 33
Kak, Subhash, 201, 203, 215
Karahalios, Karrie, 202, 214
Kavussanos, Manolis, 202, 214
Keeney, Ralph L., 12, 33
Kelih, Emmerich, 58, 66, 127, 131
Kirby, Simon, 173, 191
Köhler, Reinhard, 4, 6, 29, 33, 89, 91, 94, 95, 96, 97, 98, 108, 109, 110, 123, 125, 130, 131, 133, 134, 145, 150, 151, 166, 167
Kohlhase, Jörg, 114, 122
Kondrak, Grzegorz, 173, 174, 191, 192
Koppel, Moshe, 28, 33
Kosmidis, Kosmas, 217, 229
Krajewski, Marek, 148, 150, 152, 167
Kroeber, Alfred L., 171, 192
Kruskal, William H., 184, 190, 191
Kubát, Miroslav, 110, 123
Kučera, Henry, 50, 55
Lafferty, John, 205, 214
Le Roux, Brigitte, 40, 55
Lebanon, Guy, 206, 214
Lebart, Ludovic, 39, 55
Lee, Kwang H., 18, 33
Lerman, Kristina, 204, 214
Levenshtein, Vladimir I., 172, 192
Levik, Wilhelm, 13, 26, 27, 28, 29, 30, 31
Lewis, Paul M., 176, 192
Li, Wentian, 59, 67
Lin, Dekang, 182, 192
List, Johann-Mattis, 173, 192, 193
Liu, Haitao, 128, 131
Liu, Yang, 202, 203, 214
Lonsdale, Deryle, 176, 186, 191
Mačutek, Ján, 60, 66, 71, 86, 95, 108, 125, 127, 130, 131, 133, 135, 145
Mandelbrot, Benoit, 59, 67, 217, 229
Mao, Yi, 206, 213, 214
Mardia, Kanti V., 51, 55
Martynenko, Gregory, 71, 86
Matesis, Pavlos, 126, 127, 128, 129
McCleary, Richard, 152, 166
McDonald, Ryan, 206, 214
McFall, Joe D., 71, 86, 110, 122
McKelvie, David, 182, 190
Melamed, Dan I., 182, 192
Menzerath, Paul, 57, 67
Michailidis, George, 126, 127, 128
Michener, Charles D., 172, 193
Mikros, George K., 7, 33
Milička, Jiří, 110, 123, 125, 128, 131
Miller, George A., 50, 55, 59, 67
Miller, Rupert G., 80, 86
Milliex-Gritsi, Tatiana, 126, 127, 128
Mishne, Gilad, 202, 214
Mitzenmacher, Michael, 59, 67
Miyazima, Sasuke, 217, 229
Molière, 42
Moran, Patrick, 39, 55
Müller, Dieter, 71, 86
Naumann, Sven, 28, 33, 89, 91, 94, 95, 96, 97, 98, 108, 125, 131, 133, 145
Authors’ addresses
Altmann, Gabriel
Stüttinghauser Ringstrasse 44, DE-58515 Lüdenscheid, Germany
email: ram-verlag@t-online.de
Andreev, Sergey N.
Department of Foreign Languages
Smolensk State University
Przhevalskogo 4, RU-214000 Smolensk, Russia
email: smol.an@mail.ru
Bavaud, François
Department of Language and Information Sciences
University of Lausanne
CH-1015 Lausanne, Switzerland
email: francois.bavaud@unil.ch
Benešová, Martina
Department of General Linguistics
Faculty of Arts
Palacký University
Křížkovského 14, CZ-77147 Olomouc, Czech Republic
email: martina.benesova@upol.cz
Borin, Lars
Department of Swedish
University of Göteborg
Box 200, SE-40530 Göteborg, Sweden
email: lars.borin@svenska.gu.se
Borisov, Vadim V.
Department of Computer Engineering
Smolensk Branch of National Research University “Moscow Power
Engineering Institute”
Energeticheskiy proezd 1, RU-214013 Smolensk, Russia
email: vbor67@mail.ru
Čech, Radek
Department of Czech Language
University of Ostrava
Reální 5, CZ-70103 Ostrava, Czech Republic
email: cechradek@gmail.com
Cocco, Christelle
Department of Language and Information Sciences
University of Lausanne
CH-1015 Lausanne, Switzerland
email: christelle.cocco@alumnil.unil.ch
Eder, Maciej
Institute of Polish Studies
Pedagogical University of Kraków
ul. Podchorążych 2, PL-30084 Kraków, Poland
and
Köhler, Reinhard
Computerlinguistik und Digital Humanities
University of Trier
FB II / Computerlinguistik, DE-54286 Trier, Germany
email: koehler@uni-trier.de
Krithara, Anastasia
Institute of Informatics and Telecommunications
National Center for Scientific Research (NCSR) ‘Demokritos’
Terma Patriarchou Grigoriou, Aghia Paraskevi, GR-15310 Athens, Greece
email: akrithara@iit.demokritos.gr
Mačutek, Ján
Department of Applied Mathematics and Statistics
Comenius University
Mlynská dolina, SK-84248 Bratislava, Slovakia
email: jmacutek@yahoo.com
Mikros, George K.
Department of Italian Language and Literature
School of Philosophy
National and Kapodistrian University of Athens
Panepistimioupoli Zografou, GR-15784 Athens, Greece
email: gmikros@gmail.com
and
Milička, Jiří
Institute of Comparative Linguistics
Faculty of Arts
Charles University
Nám. Jana Palacha 2, CZ-11638 Praha 1, Czech Republic
email: jiri@milicka.cz
and
Pawłowski, Adam
Institute of Library and Information Science
University of Wrocław
pl. Uniwersytecki 9/13, 50-137 Wrocław, Poland
email: apawlow@uni.wroc.pl
Rentoumi, Vassiliki
SENTImedia Ltd.
86 Hazlewell Road, Putney, London, SW15 6UR, UK
email: vassiliki.rentoumi@gmail.com
Rovenchak, Andrij
Department for Theoretical Physics
Ivan Franko National University of Lviv
12 Drahomanov St., UA-79005 Lviv, Ukraine
email: andrij.rovenchak@gmail.com
Tuzzi, Arjuna
Department of Philosophy, Sociology, Education and Applied Psychology
University of Padova
via Cesarotti 10/12, IT-35123 Padova, Italy
email: arjuna.tuzzi@unipd.it
Tzanos, Nikos
SENTImedia Ltd.
86 Hazlewell Road, Putney, London, SW15 6UR, UK
email: corelon@gmail.com
Xanthos, Aris
Department of Language and Information Sciences
University of Lausanne
CH-1015 Lausanne, Switzerland
email: aris.xanthos@unil.ch
Zörnig, Peter
Department of Statistics
Institute of Exact Sciences
University of Brasília
70910-900 Brasília-DF, Brazil
email: peter@unb.br