
George K. Mikros and Ján Mačutek (Eds.)
Sequences in Language and Text
Quantitative Linguistics

Editors
Reinhard Köhler
Gabriel Altmann
Peter Grzybek

Advisory Editor
Relja Vulanović

Volume 69
Sequences in
Language and
Text
Edited by
George K. Mikros
Ján Mačutek

DE GRUYTER
MOUTON
ISBN 978-3-11-036273-2
e-ISBN (PDF) 978-3-11-036287-9
e-ISBN (EPUB) 978-3-11-039477-1
ISSN 0179-3616

Library of Congress Cataloging-in-Publication Data


A CIP catalog record for this book has been applied for at the Library of Congress.

Bibliographic information published by the Deutsche Nationalbibliothek


The Deutsche Nationalbibliothek lists this publication in the Deutsche Nationalbibliografie;
detailed bibliographic data are available on the Internet at http://dnb.dnb.de.

© 2015 Walter de Gruyter GmbH, Berlin/Boston


Printing and binding: CPI books GmbH, Leck
♾ Printed on acid-free paper
Printed in Germany

www.degruyter.com
Foreword
Gustav Herdan says in his book Language as Choice and Chance that “we may
deal with language in the mass, or with language in the line”. Within the frame-
work of quantitative linguistics, the language-in-the-mass approach dominates;
however, the other direction of research has enjoyed an increased attention in
the last few years.
This book documents some recent results in this area. It contains an intro-
duction and a collection of thirteen papers by altogether twenty-two authors. The
contributions, ordered alphabetically by the first author's surname, cover
quite a broad spectrum of topics. They can be, at least roughly, divided into two
groups (admittedly, the distinction between the groups is a bit fuzzy).
The first of them consists of theoretically oriented papers. One could distin-
guish three subgroups here – development of text characteristics, either within
a text or in time (Andreev and Borisov, Čech, Köhler and Tuzzi, Rovenchak, Zör-
nig), linguistic motifs (Köhler, Mačutek and Mikros, Milička) and, in the last
contribution from this group, Benešová and Čech demonstrate that, with respect
to the Menzerath-Altmann law, a randomized text behaves differently from a
"normal" one, thus confirming the role of the sequential structure of a text.
Four papers belonging to the other group focus (more or less) on appli-
cations, or at least they are inspired by real-world problems. Bavaud et al. apply
the apparatus of autocorrelation to different text properties. Pawłowski and Eder
investigate the development of rhythmic patterns in a particular Old Czech text. A
prediction of sales trends based on the analysis of short texts from Twitter is pre-
sented by Rentoumi et al. Finally, Rama and Borin discuss different measures of
string similarities and their appropriateness for automatic text classification.
We hope that this book will provide an impetus for the further development of
linguistic research in both the language-in-the-mass and the language-in-the-line
directions. It goes without saying that the two approaches complement rather
than exclude each other.
We would like to express our thanks to Gabriel Altmann, whose impulse stood
at the beginning of this book, and to Reinhard Köhler for his continuous help
and encouragement during the process of editing. Ján Mačutek would
also like to acknowledge VEGA grant 2/0038/12 which supported him during the
preparation of this volume.

George K. Mikros, Ján Mačutek


Contents
Foreword  V

Gabriel Altmann
Introduction  1

Sergey N. Andreev and Vadim V. Borisov


Linguistic Analysis Based on Fuzzy Similarity Models  7

François Bavaud, Christelle Cocco, Aris Xanthos


Textual navigation and autocorrelation  35

Martina Benešová and Radek Čech


Menzerath-Altmann law versus random model  57

Radek Čech
Text length and the lambda frequency structure of a text  71

Reinhard Köhler
Linguistic Motifs  89

Reinhard Köhler and Arjuna Tuzzi


Linguistic Modelling of Sequential Phenomena:
The role of laws  109

Ján Mačutek and George K. Mikros


Menzerath-Altmann Law for Word Length Motifs  125

Jiří Milička
Is the Distribution of L-Motifs Inherited from the Word Length
Distribution?  133

Adam Pawłowski and Maciej Eder


Sequential Structures in “Dalimil’s Chronicle”:
Quantitative analysis of style variation  147

Taraka Rama and Lars Borin


Comparative Evaluation of String Similarity Measures for Automatic Language
Classification  171

Vassiliki Rentoumi, Anastasia Krithara, Nikos Tzanos


Predicting Sales Trends
Can sentiment analysis on social media help?  201

Andrij Rovenchak
Where Alice Meets Little Prince:
Another approach to study language relationships  217

Peter Zörnig
A Probabilistic Model for the Arc Length in Quantitative Linguistics  231

Subject Index  247

Authors’ Index  253

Authors’ Addresses  257


Gabriel Altmann
Introduction

Sequences occur in texts, written or spoken, and not in language, which is con-
sidered a system. However, the historical development of some phenomenon in
language can be considered a sequence, too. In the first case we have a linear
sequence of units represented as classes or as properties represented by num-
bers resulting from measurements on the ordinal, interval or ratio scales. In the
second case we have a linear sequence of years or centuries and some change of
states. A textual sequence differs from a mathematical sequence; the latter can
usually be captured by a recurrence function. A textual sequence is rather a
repetition pattern displaying manifold regularities such as distances, lumping,
strengthening or weakening towards the end of sentence or chapter or text,
oscillations, cohesion, etc. This holds both for units and their conceptually
constructed and measured properties, as well as for combinations of these
properties which are abstractions of higher degree.
The sequential study of texts may begin with the scrutinizing of repetitions.
Altmann (1988: 4f.) showed several types of repetitions some of which can be
studied sequentially: (a) Runs representing uninterrupted sequences of ident-
ical elements, especially formal elements such as word-length, metric patterns,
sentence types. (b) Aggregative repetitions arising from Skinner’s formal rein-
forcement which can be expressed by the distribution of distances or by a de-
creasing function of the number of identities. (c) The relaxing of strict identity
may yield aggregative repetitions of similar entities. (d) Cyclic repetitions occur-
ring especially in irregular rhythm, in prose rhythm, in sentence structure, etc.
Different other forms of sequences are known from text analyses. They will
be studied thoroughly in the next chapters.
If we study the sequences of properties of text entities, we have an enor-
mous field of research in our view. An entity has as many properties as we are
able to discriminate at the given state of science. The number of properties in-
creases with time, as can be seen in 150 years of linguistic research. The proper-
ties are not “possessed” by the entities, they do not belong to the fictive essence
of entities; they are results of definitions and measurements ascribed to the
entities by us. In practice, we can freely devise new properties, but not all of
them need to turn out to be useful. In practice, a property is useful (reasonable)
if its behaviour can be captured by an a priori hypothesis. (Usually, we proceed
by trial and error, isolate conspicuous phenomena and search for a regularity
inductively.) A property must display some kind of regularity, even if the regu-
larity is a long-range one. Further, a property is reasonable if it is not isolated but
significantly correlated with at least some other property. This requirement makes it
incorporable into a theory. A property should be an element of a control cycle.
Of course, a property does not have the same sequential realization in all
texts. The difference may consist in parameter values, in form modifications,
resulting from the differences of text sorts, of languages, etc. But all these are
only boundary conditions if we are able to set up at least an elementary theory
from which we derive our models.
In order to illustrate the proliferation of properties let us consider those of
the concept of “word”. Whatever its definition - there are several dozens - one
can find a quantification and measure the following properties: (a) length
measured in terms of syllable, morpheme or phoneme numbers, or in that of
duration; (b) frequency in text; (c) polysemy in the dictionary; (d) polytexty
representing the number of different contexts in which it occurs; (e) morph-
ological status: simple, derived, compound, reduplicated; (f) number of parts-
of-speech classes to which it belongs (with or without an additional affix); (g)
productivity measured in terms of derivates, compounds or both which can be
formed with the word; (h) age given in number of years or centuries; (i) valency;
(j) the number of its grammatical categories (conjugation, declination, gender,
times, modes,…) or the number of different affixes it can obtain; (k) emotionali-
ty vs. notionality, e.g. mother vs. clip; (l) meaning abstractness vs. concreteness;
(m) generality vs. specificity; (n) degree of dogmatism; (o) number of assoc-
iations in an association dictionary (= connotative potency); (p) number of syn-
onyms (in a dictionary); (q) number of possible functions in a sentence; (r) di-
atopic variation (= in how many sites of a dialectological atlas it exists); (s)
number of dialectal competitors; (t) discourse properties; (u) state of standard-
ization, (v) originality (genuine, calque, borrowing); (w) the language of origin
(with calques and borrowings); (x) phrasal potentiality (there are dictionaries of
this property); (y) degree of verb activity; (z) subjective ratings proposed by
Carroll (1968). This list can be extended. An analogous list can be set up for
every linguistic entity; hence we do not know how many sequences of different
kinds can be created in text.
At this elementary scientific level there is no hindrance to using intuition,
Peircean abduction, inductive generalizations, qualitative comparisons, classi-
fications, etc. These all are simply the first steps on the way to the construction
of a theory. But before we shall be able to work theoretically, observed sequenc-
es can be used (a) to characterize texts using some numerical indicators; (b) to
state the mean of the property and compare the texts setting up text classes, (c)
to find tendencies depending (usually) on other properties, (d) to search for
idiosyncrasies; (e) to set up an empirical control cycle joining several proper-
ties, and (f) to use the sequences in practice, e.g. in psychology and psychiatry.
Though according to Bunge’s dictum “no laws, no theory” and “no theory,
no explanation”, we may try to find at least some “causes” or “supporting cir-
cumstances” or “motives” or “mechanisms” leading to the rise of sequences.
They are: (a) Restrictions of the inventories of units. The greater the restrictions
(= the smaller the inventory), the more frequently a unit must be repeated. A
certain sound has a stronger repetitiveness than a certain sentence because the
inventory of sentences is infinite. (b) The grammar of the given language not
allowing many repetitions of the same word or word class in immediate neigh-
bourhood. (c) Thematic restriction. This restriction forces the repetition of some
words, terms etc., but hinders the use of some constructions. (d) Stylistic and
aesthetic grounds which may evoke some regularities like rhyme, rhythm, but
avoiding repetition of words. (e) Perseveration reinforcing repetitions, support-
ing self-stimulation, etc. This phenomenon is well known in psychology and
psychiatry. (f) The flow of information in didactic works is ideal if words are
repeated in order to concentrate the text around the main theme. In press texts
the information flow is more rapid.
It is not always possible to find the ground for the rise of the given se-
quence. Speaking is a free activity, writing is full of pauses for thinking, correc-
tions, addition or omissions, it is not always a spontaneous creation. Hence the
study of texts is a very complex task and the discovery of sequences is merely
one of the steps towards theory formation.
Sequences are secondary units and their establishment is a step in concept
formation. As soon as they are defined, one strives for the next definitions, viz.
the quantification of their properties and behaviour. By means of analogy, ab-
duction, intuition, or inductive generalization one achieves a state in which it is
possible to set up hypotheses. They may concern any aspect of sequences in-
cluding form, repetition, distances, perseveration, dependences, history, role,
etc. Of course, hypotheses are a necessary but not a sufficient component of a
theory. They may (and should) be set up at every stage of investigation but the
work attains its goal if we are able to set up a theory from which the given hy-
potheses may be derived deductively. Usually, this theory has a mathematical
form and one must use the means of mathematics. The longest way is that of
testing the given hypotheses. One sole text does not suffice, and one sole language
is merely a good beginning. Often, a corroboration in an Indo-European lan-
guage seduces us to consider the hypothesis as sufficiently corroborated - but
this is merely the first step. Our recommendation is: do not begin with English.
Ask a linguist knowing a different language (usually, linguists do) for data and
test your hypothesis at the same time on a different text sort. If the hypothesis
can be sufficiently corroborated, try to find some links to other properties. Set
up a control cycle and expand it stepwise. The best example is Köhler’s linguis-
tic synergetics containing units, properties, dependencies, forces and hypothe-
ses. The image of such a system gets complex with every added entity, but this
is the only way to construct theories in linguistics. They furnish us with laws,
which are derived and corroborated hypotheses and thereby at the same time
explanations, because phenomena can be subsumed under a set of laws.
There are no overall theories of language, there are only restricted ones.
They expand in the course of time, just as in physics, but none of them embrac-
es the language as a whole, just as in physics.
In the sequel we shall list some types of sequences in text.
(A) Symbolic sequences. If one orders the text entities in nominal classes -
which can also be dichotomous - one obtains a sequence of symbols. Such clas-
ses are e.g. parts-of-speech, reduction to noun (N) and rest (R), in order to study
the nominal style; or to adjectives (A), verbs (V) and rest (R), in order to study
the ornamentality vs. activity of the text; or to accentuated and non-accentuated
syllable, in order to study rhythm, etc. Symbolic sequences can be studied as
runs, as a sequence of distances between equal symbols, as a devil’s staircase,
as thematic chains, etc.
(B) Numerical sequences may have different forms: there can be oscillation
in the values, there are distances between neighbours, one can compute the arc
length of the whole sequence, one can consider maxima and minima in the
parts of the sequence and compute Hurst’s coefficient; one can consider the
sequence a fractal and compute its different properties; numerical sequences
can be subdivided into monotone sub-sequences which have their own proper-
ties (e.g. length), probability distributions, etc. Numerical sequences represent a
time series which may display autocorrelation, “seasonal oscillation”, they may
represent a Fourier series, damped oscillation, climax, etc.
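
To make two of these indicators concrete, here is a minimal sketch (in Python, with invented toy data; the function names are ours, not taken from any chapter of this volume) computing the distances between repetitions of a value and the arc length of a numerical sequence such as a word-length series:

```python
# A minimal sketch: two elementary sequential indicators for a numerical
# text sequence, e.g. word lengths measured in syllables.
# Function names and data are illustrative only.

def distances_between_equal(seq, value):
    """Distances between consecutive occurrences of `value` in the sequence."""
    positions = [i for i, x in enumerate(seq) if x == value]
    return [b - a for a, b in zip(positions, positions[1:])]

def arc_length(seq):
    """Arc length of the sequence polygon: sum of Euclidean distances
    between neighbouring points (i, x_i) and (i+1, x_{i+1})."""
    return sum(((b - a) ** 2 + 1) ** 0.5 for a, b in zip(seq, seq[1:]))

word_lengths = [1, 2, 1, 3, 2, 2, 1, 4, 1, 2]    # invented toy data
print(distances_between_equal(word_lengths, 1))  # -> [2, 4, 2]
print(round(arc_length(word_lengths), 3))
```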
(C) Musical sequences are characteristic for styles and development stages
as has been shown already by Fucks (1968). Besides motifs defined formally and
studied by Boroda (1973, 1982, 1988) there is the possibility of studying sequen-
tially the complete full score obtaining multidimensional sequences. In music,
we have, as a matter of fact, symbolic sequences, but the symbols have ordinal
values, thus different types of computations can be performed.
Spoken text is a multidimensional sequence: there are accents, intonation,
tones, which are not marked in the written texts. Even a street can be consid-
ered a simple sequence if we restrict our attention to one sole property. Observa-
tions are always simplified in order to make our conceptual operations possible
and lucid. Nothing can be captured in its wholeness because we do not even
know what it is.
We conjecture that textual sequences abide by laws but there is a great
number of boundary conditions which will hinder theory construction. Most
probably we shall never discover all boundary conditions but we can help by in-
troducing un-interpreted parameters which may but need not change with the
change of the independent variable. In many laws derivable from the “unified
theory” (cf. Wimmer, Altmann 2005) there is usually a constant representing the
state of the language.
At the beginning of research, we set up elementary hypotheses and im-
prove, specify them step by step. Sometimes, we simply conjecture that there
may be some trend or there may be some relation between two variables and try
to find an empirical formula. In order to remove the "may be" formulation we
stand at the beginning of a long way that begins with quantification and contin-
ues with measurement, conjecturing, testing, systematization and explanation.
Even a unique hypothesis may keep many teams of researchers occupied be-
cause a well corroborated hypothesis means “corroborated in all languages”.
And since there is no linguist having at his disposal all languages, every linguis-
tic hypothesis is a challenge for many generations of researchers. Consider for
example the famous Zipf’s law represented by a power function used in many
scientific disciplines (http://www.nslij-genetics.org/wli/zipf). Originally there
was only a visual conjecture made by Zipf without any theoretical foundation.
Later on one began to search for modifications of the function, found a number
of exceptions, fundamental deviations from the empirical data depending on
the prevailing “type” realized in the language, changed the theoretical founda-
tions, etc. Hence, no hypothesis and none of its mathematical expressions
(models) are final. Nevertheless, the given state of the art helps us to get orien-
tation, and testing the hypothesis on new data helps us to accept or to modify it.
Sometimes a deviation means merely the taking of boundary conditions into
account. As an example, consider the distribution of word length in languages:
it has been tested in about 50 languages using ca 3000 texts (cf.
http://www.gwdg.de/~kbest/litlist.htm) but in Slavic languages we obtain dif-
ferent results if we consider the non-syllabic prepositions as prefixes/clitics or
as separate words. The same holds for Japanese postpositions, etc. Thus already
the first qualitative interpretation of facts may lead to different data and conse-
quently to different models.
In spite of the labile and ever changing (historically, geographically, idio-
lectally, sociolectally, etc.) linguistic facts from which we construct our data, we
cannot stay for ever at the level of rule descriptions but try to establish laws and
systems of laws, that is, we try to establish theories. But even theories may con-
cern different aspects of the reality. There is no overall theory. While Köhler
(1986, 2005) in his synergetic linguistics concentrated on the links among punc-
tual properties, here we look at the running text, try to capture its properties
and find links among these sequential properties. The task is enormous and
cannot be accomplished by an individual researcher. Even a team of researchers
will need years in order to arrive at the boundary of the first stage of a theory.
Nevertheless, one must begin.

References
Altmann, Gabriel. 1988. Wiederholungen in Texten. Bochum: Brockmeyer.
Boroda, Mojsej G. 1973. K voprosu o metroritmičeski elementarnoj edinice v muzyke. Bulletin
of the Academy of Sciences of the Georgian SSR 71(3). 745–748.
Boroda, Mojsej G. 1982. Die melodische Elementareinheit. In Jurij K. Orlov, Mojsej G. Boroda &
Isabella Š. Nadarejšvili (eds.), Sprache, Text, Kunst. Quantitative Analysen, 205–222. Bo-
chum: Brockmeyer.
Boroda, Mojsej G. 1988. Towards a problem of basic structural units of a musical texts.
Musikometrika 1. 11–69.
Carroll, John B. 1968. Vectors of prose style. In Thomas A. Sebeok (ed.), Style in language,
283–292. Cambridge, Mass.: MIT Press.
Fucks, Wilhelm. 1968. Nach allen Regeln der Kunst. Stuttgart: Dt. Verlags-Anstalt.
Köhler, Reinhard. 1986. Zur linguistischen Synergetik. Struktur und Dynamik der Lexik. Bo-
chum: Brockmeyer.
Köhler, Reinhard. 2005. Synergetic linguistics. In Reinhard Köhler, Gabriel Altmann &
Rajmund G. Piotrowski (eds.), Quantitative Linguistics. An International Handbook, 760–
774. Berlin-New York: de Gruyter.
Wimmer, Gejza, Gabriel Altmann. 2005. Unified derivation of some linguistic laws. In Reinhard
Köhler, Gabriel Altmann & Rajmund G. Piotrowski (eds.), Quantitative Linguistics. An In-
ternational Handbook, 791–807. Berlin-New York: de Gruyter.
Sergey N. Andreev and Vadim V. Borisov
Linguistic Analysis Based on Fuzzy
Similarity Models

1 Introduction
The modern state of research in linguistics is characterized by the development
and active use of various mathematical methods and models, as well as their
combinations, to solve such complex problems as text attribution, automatic
gender classification, classification of texts or individual styles, etc. (Juola 2006,
Hoover 2007, Mikros 2009, Rudman 2002).
Among the important and defining features of linguistic research is the neces-
sity to take into account uncertainty, heuristic representation and subjectivity of
estimation of the analyzed information, as well as heterogeneity and differing
measurement scales of the characteristics.
All this leads to the necessity to use data mining methods, primarily, meth-
ods of fuzzy analysis and modeling.
In empirical sciences the following possibilities of building fuzzy models
exist.
The first approach consists in the adaptation of fuzzy models which were
built to solve problems in other spheres (technical diagnostics, decision making
support models, complex systems analysis). The advantages of this approach
include the use of the obtained experience and scientific potential. At present in
empirical sciences this approach prevails.
The second approach presupposes introduction of fuzziness into already
existing models to take into account various types of uncertainty. Despite the
fact that this approach is rather effective, difficulties arise in the introduction of
fuzziness into the models.
The third approach implies creation of fuzzy models, aimed at solution of
the problems in the given branch of empirical science under conditions of un-
certainty. In our opinion this approach is preferable, though at the same time it
is the most difficult due to the necessity to consider both specific aspects of the
problems and characteristics of the developed fuzzy models (Borisov 2014).
This chapter deals with some issues of building fuzzy models within the
third approach, using as an example fuzzy models for the estimation of the
similarity of linguistic objects. The use of such models forms the basis for solving
several problems of linguistic analysis, including those considered in this chapter.
Solutions of the following problems of linguistic analysis using the suggested
fuzzy similarity models are considered.
I) Comparison (establishing the degree of similarity) of the original and its
translations. The essence of the approach consists in establishing estimated
values of the compared linguistic objects, e.g. parts (chapters) of the original
and of its translations. In this case the similarity model, built for the original
text, is used. The obtained results of the estimation allow us not only to rank the
compared linguistic objects, but also to determine the degree of their similarity.
II) Comparison of the structure of similarity models for the original text and its
translations. The suggested fuzzy similarity models can be built not only for the
original but also for its translations. The results of the comparison of the struc-
ture of these models are of interest as they contain information about the interre-
lations and consistency of the estimated characteristics.
III) Analysis of the dynamics of individual style development. In this case
comparison (establishing the degree of similarity) of consecutive fragments of
the same text can be used.

2 Approaches to building similarity models for linguistic analysis
As has been mentioned above many solutions of various linguistic problems are
based on different similarity models, which are aimed at:
firstly, direct estimation of the objects;
secondly, classification of objects on the basis of different estimation
schemes.
These tasks are interconnected, though not always interchangeable. A simi-
larity model aimed at solving problems of the first class may be used even for
separate linguistic objects (without comparison with other objects). In this case
the estimated objects may be ranked. On the other hand, methods of ranking are
often not applicable for the estimation of separate objects.
More generally, the problem of multivariate estimation is formulated as
follows. There is a number of estimation characteristics pi, i = 1, …, n. There is
also a set of estimated linguistic objects A = {a1, …, aj, …, am}. For every aj ∈ A it is
necessary to find the values of these characteristics p1(aj), …, pn(aj).
In the fuzzy approach the values of these characteristics are expressed as
numbers in the range [0, 1] and characterize the degree to which the corre-
sponding criterion is met.
As a rule, similarity models possess a complex structure and are represented as
a hierarchy of interconnected characteristics.
Consider as an example a two-level system of characteristics.
Let the characteristics pi, i = 1, …, n be partial characteristics of the lower
level of the hierarchy, and let the generalized characteristic be denoted by P.
Then the value of this generalized characteristic for each of the estimated
objects may be obtained by aggregating the values of the partial characteristics
(Borisov, Bychkov, Dementyev, Solovyev, Fedulov 2002):

∀aj ∈ A:  P(aj) = h(p1(aj), …, pn(aj)),

where h is a mapping satisfying the following axioms:
A1) h(0, …, 0) = 0, h(1, …, 1) = 1 – boundary conditions.
A2) For any pairs (pi(aj), p′i(aj)) ∈ [0, 1]², if pi(aj) ≥ p′i(aj) for all i, then
h(p1(aj), …, pn(aj)) ≥ h(p′1(aj), …, p′n(aj)) – non-diminishing axiom.
A3) h is a symmetric function of its arguments, i.e. its value is the same under any
permutation of its arguments.
A4) h is a continuous function (Dubois, Prade 1990).
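
A small numerical sketch (our own illustration, not part of the cited works) checking the boundary axiom A1 and, empirically, the non-diminishing axiom A2 for three candidate aggregation operators:

```python
# Sketch: checking axioms A1 (boundary conditions) and A2 (non-diminishing)
# for a few candidate aggregation operators h. Purely illustrative.
import random

def h_min(ps):  return min(ps)
def h_max(ps):  return max(ps)
def h_mean(ps): return sum(ps) / len(ps)

def satisfies_a1(h, n=4):
    return h([0.0] * n) == 0.0 and h([1.0] * n) == 1.0

def satisfies_a2(h, n=4, trials=1000):
    """Empirical check: raising any argument never lowers the value of h."""
    for _ in range(trials):
        ps = [random.random() for _ in range(n)]
        qs = [min(1.0, p + random.random() * (1 - p)) for p in ps]  # qs >= ps
        if h(qs) < h(ps) - 1e-12:
            return False
    return True

for name, h in [("min", h_min), ("max", h_max), ("mean", h_mean)]:
    print(name, satisfies_a1(h), satisfies_a2(h))
```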
Thus the task of linguistic objects estimation conventionally may be divided
into two stages:
firstly, getting the estimation of partial characteristics of linguistic objects;
secondly, getting a generalized estimation of linguistic objects.
The first stage, as a rule, is performed in a traditional way. The second stage
of getting generalized estimation is characterized by a number of specific pecu-
liarities.
The following main approaches to getting generalized estimation, which ac-
tually means building estimation models of different linguistic objects, exist:
– generalized characteristic is formed under conditions of equality of partial
characteristics;
– generalized characteristic is formed on the basis of recursive aggregation of
partial characteristics;
– generalized characteristic is formed under conditions of inequality of par-
tial characteristics;
– generalized characteristic is formed on the basis of utilizing fuzzy quantifi-
ers for the aggregation of partial characteristics.

Let us consider these approaches in detail.

Generalized characteristic is formed under conditions of equality of partial characteristics
In this case the following variants of estimation are possible:
i. the value of the generalized characteristic is determined by the lowest
value of all the partial characteristics;
ii. the value of the generalized characteristic is determined by the highest
value of all the partial characteristics;
iii. compromise strategies;
iv. hybrid strategies (Dubois, Prade 1988).

For variant (i), in addition to axioms A1)–A4), the following axiom must be
fulfilled:
A5) For all pi(aj): h(p1(aj), …, pn(aj)) ≤ min(p1(aj), …, pn(aj)), i.e. the value of
the generalized characteristic must not exceed the minimal value of any of the
partial characteristics.
For variant (ii), in addition to axioms A1)–A4), the following axiom must
be fulfilled:
A6) For all pi(aj): h(p1(aj), …, pn(aj)) ≥ max(p1(aj), …, pn(aj)), i.e. the value of
the generalized characteristic must not be less than the maximal value of any of
the partial characteristics.
For the compromise strategies of variant (iii), in addition to axioms A1)–A4), the
following axiom must be fulfilled:
A7) For all pi(aj): min(p1(aj), …, pn(aj)) < h(p1(aj), …, pn(aj)) < max(p1(aj), …, pn(aj)), i.e.
the value of the generalized characteristic lies between the values of the partial
characteristics.
For the hybrid strategy (variant (iv)) the value of the generalized characteristic
may be obtained by aggregating the values of the partial characteristics on
the basis of a symmetric sum such as the median med(p1(aj), p2(aj); a), a ∈ [0, 1].
Another variant of the hybrid strategy consists in using associative symmetric
sums (excluding the median):

if max(p1(aj), …, pn(aj)) < 1/2, then h(p1(aj), …, pn(aj)) ≤ min(p1(aj), …, pn(aj));
if min(p1(aj), …, pn(aj)) > 1/2, then h(p1(aj), …, pn(aj)) ≥ max(p1(aj), …, pn(aj)).
This approach makes it possible under condition of equality of partial
characteristics to identify adequate operations for their aggregation.
Establishing (identification of) adequate aggregation operations, which
embrace the whole set of hybrid strategies between the "opposite" variants (i)
and (ii), provides the parameterized family of operations of the following type:

h(p1(aj), …, pn(aj)) = I(p1(aj), …, pn(aj))^γ · U(p1(aj), …, pn(aj))^(1−γ),  γ ∈ [0, 1],

where I and U are some operations of intersection and union, correspondingly, and
γ is an index characterizing the degree of compromise between variants (i) and
(ii) (Zimmermann, Zysno 1980).
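
The following sketch illustrates this parameterized family, taking the product t-norm as I and the probabilistic sum as U; this concrete choice of I and U is an assumption made for illustration only:

```python
# Sketch of the parameterized compromise aggregation
# h(p1,...,pn) = I(p1,...,pn)**gamma * U(p1,...,pn)**(1 - gamma).
# Product t-norm as I and probabilistic sum as U are assumed for illustration.
from math import prod

def intersection(ps):          # product t-norm
    return prod(ps)

def union(ps):                 # probabilistic sum
    return 1 - prod(1 - p for p in ps)

def compromise(ps, gamma):
    """gamma = 1 gives the pure intersection, gamma = 0 the pure union."""
    return intersection(ps) ** gamma * union(ps) ** (1 - gamma)

ps = [0.4, 0.7, 0.9]                      # toy partial-characteristic values
for g in (0.0, 0.5, 1.0):
    print(g, round(compromise(ps, g), 3))
```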

The generalized characteristic is formed on the basis of recursive aggregation of partial characteristics
Often it is difficult to carry out the estimation of all partial characteristics
due to dimensionality of the estimation problem. In such cases the operation of
obtaining a generalized estimation can be realized in a recursive way for the
number of equivalent partial characteristics n > 2 with fulfilling commutativity
axiom A3 (Dubois, Prade 1988):

h⁽ⁿ⁾(p1(aj), …, pn(aj)) = h(h⁽ⁿ⁻¹⁾(p1(aj), …, pn−1(aj)), pn(aj)).
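
In code, such recursive aggregation is simply a fold of a two-place operation over the list of partial characteristic values; a minimal sketch (the operator choice and data are ours):

```python
# Sketch: recursive aggregation h(n)(p1,...,pn) = h(h(n-1)(p1,...,p_{n-1}), pn),
# realized as a left fold of a two-place operation h2 over the values.
from functools import reduce

def aggregate_recursively(values, h2):
    """Fold the two-place aggregation operation h2 over the values."""
    return reduce(h2, values)

values = [0.6, 0.8, 0.4, 0.7]              # toy values of p1..p4
print(aggregate_recursively(values, min))  # 0.4
print(aggregate_recursively(values, max))  # 0.8
```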

The generalized characteristic is formed under the condition of non-equality of partial characteristics
Within this approach the following variants of estimation are possible:
i. introduction of borderline values for partial characteristics;
ii. weighing partial characteristics;
iii. setting asymmetric generalized characteristic.

In case variant (i) is applied, borderline values for partial characteristics are set,
the fulfillment of which determines the acceptability of the generalized
estimation of the object under study.
If variant (ii) is chosen, every partial characteristic is assigned a weight, with
subsequent aggregation of the weighted partial characteristics using the operations
of aggregation. The most widespread is the weighted sum of partial characteristics:
P(aj) = w1·p1(aj) + w2·p2(aj) + … + wn·pn(aj),  with w1 + w2 + … + wn = 1.

However, the possibility of assigning weights wi is problematic if these weights


belong to characteristics of different nature.
In this case it is feasible to use relative values of these characteristics in the
latter equation.
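
A minimal sketch of this weighted-sum aggregation, first rescaling raw characteristic values of different nature to relative values in [0, 1] (the rescaling rule and the toy numbers are our assumptions):

```python
# Sketch: weighted-sum aggregation P(aj) = sum_i w_i * p_i(aj), sum_i w_i = 1,
# after rescaling raw characteristic values of different nature to [0, 1].

def rescale(raw, low, high):
    """Relative value of a raw characteristic within its observed range."""
    return (raw - low) / (high - low) if high > low else 0.0

def weighted_sum(ps, ws):
    assert abs(sum(ws) - 1.0) < 1e-9, "weights must sum to 1"
    return sum(w * p for w, p in zip(ws, ps))

# toy raw values with their (assumed) observed ranges
raw = [(120, 0, 200), (0.35, 0.0, 1.0), (14, 5, 30)]
ps  = [rescale(r, lo, hi) for r, lo, hi in raw]
ws  = [0.5, 0.3, 0.2]
print(round(weighted_sum(ps, ws), 3))
```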
The procedure of weighing partial characteristics can be also applied to other
types of aggregation operations.
Variant (iii) of setting asymmetric generalized characteristic is
characterized by a complicated structure and the necessity to aggregate partial
characteristics at the lowest level of hierarchy using, e.g. “AND-OR”-tree.

The generalized characteristic is formed on the basis of fuzzy quantifiers during aggregation of partial characteristics
These fuzzy quantifiers specify the estimation of achieving acceptable
results, such as “majority”, “few”, “not less than a half”.
Fuzzy quantifiers may also provide the estimation to what extent the
borderline values of partial characteristics (or most of partial characteristics)
which define the acceptability of the result are reached.
In this case besides a fuzzy quantifier, weights of these partial
characteristics are assigned (Dubois, Prade 1990).
Actually, the importance of choosing the best approach to the formation of
the generalized characteristic becomes more obvious when some generalized
estimation, in its turn, is combined with other generalized estimations (Keeney,
Raiffa 1981).
Despite the large number of existing approaches to building similarity
models, most of them do not take into account the consistency of partial
heterogeneous characteristics in the operation of aggregation. As a result, they
do not permit building adequate similarity models with a complex structure for
solving linguistic tasks.

3 Method and fuzzy similarity models for solving linguistic problems
Consider the suggested method and fuzzy similarity models for solving the
above mentioned linguistic problems:
– comparison (establishing the degree of similarity) of the original and its
translations;
– comparison of the structure of similarity models for the original text and its
translations;
– analysis of the dynamics of individual style development (establishing the
degree of similarity) of fragments of text.
We shall show the solution of these problems on the data-base of the poem by
Samuel Coleridge The Rime of the Ancient Mariner and its two translations by
Nikolay Gumilev and Wilhelm Levik.
The main requirements for similarity models for solving tasks of linguistic
analysis include the following:
– possibility to formulate a generalized characteristic on the basis of
changing sets of partial characteristics;
– possibility to aggregate heterogeneous characteristics (both quantitative
and qualitative) which differ in measurement scales, range of variation;
– taking into consideration different significance of partial characteristics in
the generalized characteristic;
– taking into consideration consistency of partial characteristics;
– possibility to adjust (adapt) the model in a flexible way to the changes in
the number of characteristics (addition, exclusion) and changes in the
characteristics (in consistency and significance).

The results of the analysis of the existing approaches to building similarity
models for linguistic analysis enable us to conclude that building
fuzzy similarity models is the most feasible method.
The suggested method and model allow defining the operation h(p1, …, pn) for
obtaining the generalized estimation P on the basis of operations over partial
characteristics hk,l(pk, pl), k, l ∈ {1, …, n}, k ≠ l (for two-place operations), and
weights of these partial characteristics wi (i = 1, …, n, w1 + … + wn = 1).
To identify the operations hk,l(pk, pl), k, l ∈ {1, …, n}, k ≠ l, the pairwise
degrees of consistency ck,l (k, l = 1, …, n) of the partial characteristics pk and pl
must be established.
Depending on the specifics of the estimation task in question, consistency may
be treated as correlation, interdependence of partial characteristics, or the
possibility of obtaining exact values of the compared characteristics.
As different methods may be used to establish the degrees of consistency
ck,l (k, l = 1, …, n) of partial characteristics, it is feasible to specify a set C̃
whose elements define criterion levels of consistency of these characteristics, in
ascending order:

C̃ = {NC – «No consistency», LC – «Low consistency», MC – «Medium consistency», HC – «High consistency», FC – «Full consistency»}.
For example, NC – «No consistency» can be treated as the minimum of
correlation coefficient values between partial characteristics, and FC – «Full
consistency» – as the maximum of correlation coefficient values between partial
characteristics. Other values LC, MC, HC – correspond to intermediate values of
correlation coefficient between partial characteristics.
Then the obtained pairwise consistency degrees for partial characteristics
ck,l (k, l = 1, …, n), regardless of the method with which they were obtained, are
compared with the criterion levels of consistency from the set C̃:

ck,l ⇔ c̃i ∈ C̃ = {NC, LC, MC, HC, FC}.

Obviously, as a result of such comparison the whole set of partial characteristics


will be divided into subsets, each of them corresponding to one criterion
consistency level.
In their turn, all criterion levels of consistency of partial characteristics from
the set C̃ = {NC, LC, MC, HC, FC} are matched with operations of aggregation of
these characteristics satisfying the above-mentioned axioms: normalization,
non-diminishing, continuity, bisymmetry (associativity).
The results of the study substantiated a set of operations for the aggregation
of partial characteristics which both satisfies the above-mentioned axioms and
corresponds to the given criterion levels of characteristics consistency (Table 1,
Fig. 1).

Table 1: Comparison of aggregation operations with criterion levels of characteristics consistency

№   Operation of aggregation of characteristics pk and pl   Criterion level of characteristics consistency   Description of criterion level of consistency
1   min(pk, pl)          NC   No consistency
2   med(pk, pl, 0.25)    LC   Low consistency
3   med(pk, pl, 0.5)     MC   Medium consistency
4   med(pk, pl, 0.75)    HC   High consistency
5   max(pk, pl)          FC   Full consistency
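
Read as executable operations, Table 1 maps each criterion level of consistency to a two-place aggregation; in the sketch below med(pk, pl, a) is interpreted as the median of the three values pk, pl and a, which is our reading of the fuzzy median operator:

```python
# Sketch: the aggregation operations of Table 1, keyed by criterion level
# of consistency. med(pk, pl, a) is read as the median of pk, pl and a
# (our interpretation of the fuzzy median).
from statistics import median

def med(pk, pl, a):
    return median([pk, pl, a])

AGGREGATION_BY_LEVEL = {
    "NC": lambda pk, pl: min(pk, pl),
    "LC": lambda pk, pl: med(pk, pl, 0.25),
    "MC": lambda pk, pl: med(pk, pl, 0.5),
    "HC": lambda pk, pl: med(pk, pl, 0.75),
    "FC": lambda pk, pl: max(pk, pl),
}

pk, pl = 0.2, 0.9
for level, op in AGGREGATION_BY_LEVEL.items():
    print(level, op(pk, pl))
```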

Till now we have spoken of a similarity model, supposing that it has a two-level
hierarchical structure consisting of the generalized characteristic and partial
characteristics. However, the system of characteristics may be organized into a
more complex structure, viz. partial characteristics may be separated into
groups which in their turn are divided into subgroups.1
Thus as a result of grouping characteristics the structure of the similarity
model may include three or even more levels.


1 The approach, suggested in this chapter, makes it possible to build similarity models both
with and without a priori grouping of the characteristics. In both cases all the provisions of the
method and similarity models remain valid.
Fig. 1: Operations of characteristics aggregation

For solving different problems of the linguistic analysis of verse, the following
groupings of characteristics may be used:
− according to linguistic levels:
− syntactic characteristics;
− characteristics of poetical syntax;
− morphological characteristics
− rhythmic characteristics;
− characteristics of rhyme;
− characteristics of stanza (strophic);
− grouping based on the localization in the verse line:
− localized:
− in the first metrically strong position (1st ictus) of the line:
− part of speech characteristics;
− rhythmic characteristics
− in the final metrically strong position (last ictus) of the line;
− part of speech characteristics;
− rhythmic characteristics
− unlocalized:
− syntactic characteristics;
− characteristics of poetical syntax;
− part of speech characteristics;
− strophic characteristics.

In case of grouping the estimation is found for each group (subgroup) of


characteristics taking into account the consistency of these characteristics
within the group (subgroup).
The result of estimation of the characteristics in each group (subgroup) can
have relevance of its own. But besides, the results of estimation in all groups of
characteristics can be aggregated to achieve generalized estimation (taking into
account the degree of consistency between these groups). The procedure of
getting such generalized estimation is analogous to that of obtaining estimation
for one group.
Consider as an example the building of a similarity model for a group of part of
speech characteristics of the poem The Rime of the Ancient Mariner by S.T.
Coleridge.
The group of these part-of-speech characteristics can be subdivided into
three subgroups:
(i) subgroup of parts of speech, localized at the end of the line and rhymed:
– p1 – number of rhymed nouns;
– p2 – number of rhymed verbs;
– p3 – number of rhymed adjectives;
– p4 – number of rhymed adverbs;
– p5 – number of rhymed pronouns;

(ii) subgroup of parts of speech, localized at the end of the line and not rhymed:
– p6 – number of unrhymed nouns;
– p7 – number of unrhymed verbs;
– p8 – number of unrhymed adjectives;
– p9 – number of unrhymed adverbs;
– p10 – number of unrhymed pronouns;

(iii) subgroup of parts of speech which are not localized at some position in the
line:
– p11 – number of nouns;
– p12 – number of verbs;
– p13 – number of adjectives;
– p14 – number of adverbs.

To establish the degree of consistency of the characteristics in the above-


mentioned subgroups we shall use correlation analysis (Pearson correlation
coefficient).
Table 2 contains the results of comparing the values of the correlation
coefficient of the part of speech localized rhymed characteristics from subgroup (i)
with the criterion levels of consistency from the set C̃ = {NC, LC, MC, HC, FC},
represented as a matrix of consistency.

Table 2: Consistency matrix of part of speech localized rhymed characteristics in the poem The
Rime of the Ancient Mariner by S.T. Coleridge.

p1 p2 p3 p4 p5
p1 – HC LC MC LC
p2 HC – LC MC NC
p3 LC LC – HC MC
p4 MC MC HC – MC
p5 LC NC MC MC –

In the same way consistency matrices for localized unrhymed (Table 3) and
unlocalized (Table 4) part of speech characteristics are formed.
Table 3: Consistency matrix of part of speech localized unrhymed characteristics in the poem
The Rime of the Ancient Mariner by S.T. Coleridge.

p6 p7 p8 p9 p10
p6 – MC LC LC HC
p7 MC – LC LC MC
p8 LC LC – MC MC
p9 LC LC MC – LC
p10 HC MC MC LC –

Table 4: Consistency matrix of part of speech unlocalized characteristics in the poem The Rime
of the Ancient Mariner by S.T. Coleridge.

p11 p12 p13 p14
p11 – MC MC LC
p12 MC – LC MC
p13 MC LC – LC
p14 LC MC LC –

It should be noted that consistency matrices in Tables 2–4 may be treated as
fuzzy relations, defined on the corresponding sets of characteristics (Lee 2004).
The requirement of using different significance of partial characteristics in
every group (or subgroup) is fulfilled by the introduction of a weight vector:

W = {w1, w2, …, wn},  wi ∈ [0, 1] for all i,  w1 + w2 + … + wn = 1.

Weights {w1, w2, …, wn} may be:


– defined during the experiment;
– assigned by an expert;
– obtained from the data of pairwise comparison of the characteristics
significance, also carried out by an expert.

The latter approach to establishing weights of characteristics is the most


convenient because it is focused on comparison of only one pair of objects at a
time with subsequent processing of the results of pairwise comparisons. This
radically simplifies the task in case of a large number of characteristics to be
compared.
Consider obtaining the weight vector based on the method of pairwise comparison
of characteristics significance (Saaty 1980).
At first, on the basis of pairwise comparison of the significance of the
characteristics from a certain group (subgroup), a positive, inversely symmetric
(reciprocal) matrix (Table 5) is formed, in which the element vi/vj denotes the degree
of significance of the i-th characteristic as compared to the j-th characteristic.

Table 5: Matrix of pairwise comparisons of the significance of the characteristics from a group.
Rows and columns are indexed by the characteristic number; the entry in row i and column j is vi/vj.

     1        2        …   i        …   n
1    1        v1/v2    …   v1/vi    …   v1/vn
2    v2/v1    1        …   v2/vi    …   v2/vn
…    …        …        …   …        …   …
i    vi/v1    vi/v2    …   1        …   vi/vn
…    …        …        …   …        …   …
n    vn/v1    vn/v2    …   vn/vi    …   1

To define the values of vi and vj the following estimations of degrees of significance are used in the table:
1 – equally significant;
3 – a little more significant;
5 – sufficiently more significant;
7 – much more significant;
9 – absolutely more significant;
2, 4, 6, 8 – intermediate significance.

Then on the basis of these pairwise comparisons the weights of the characteristics in
the group (subgroup) are calculated:

wi = (vi/v1 · vi/v2 · … · vi/vn)^(1/n) / Σk=1..n (vk/v1 · vk/v2 · … · vk/vn)^(1/n),

i.e. the geometric mean of the i-th row of the comparison matrix, normalized over
the sum of the geometric means of all rows.

For the analyzed example the following weights of characteristics were obtained
using the above mentioned method:
– for the subgroup of part of speech localized rhymed features: {w1 = 0.12, w2 =
0.29, w3 = 0.25, w4 = 0.15, w5 = 0.19};
– for the subgroup of part of speech localized unrhymed features: {w6 = 0.11,
w7 = 0.30, w8 = 0.23, w9 = 0.14, w10 = 0.22};
– for the subgroup of part of speech unlocalized features: {w11 = 0.25, w12 = 0.3,
w13 = 0.3, w14 = 0.15};
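
A sketch of the geometric-mean weight computation described above; the 3×3 comparison matrix is invented and only serves to show the mechanics:

```python
# Sketch: weights from a Saaty-style pairwise comparison matrix via the
# geometric-mean method: w_i = gm(row_i) / sum_k gm(row_k).
from math import prod

def weights_from_pairwise(matrix):
    """matrix[i][j] holds v_i / v_j; diagonal entries are 1."""
    n = len(matrix)
    geo_means = [prod(row) ** (1 / n) for row in matrix]
    total = sum(geo_means)
    return [g / total for g in geo_means]

# invented 3x3 comparison matrix (reciprocally symmetric)
M = [
    [1,   3,   1/2],
    [1/3, 1,   1/5],
    [2,   5,   1],
]
print([round(w, 3) for w in weights_from_pairwise(M)])
```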
The results of estimation of consistency and significance of characteristics
within each group (subgroup) may be structurally represented as a fuzzy
undirected graph G̃ with fuzzy nodes and fuzzy edges:

G̃ = ⟨P̃, R̃⟩,

where P̃ = {⟨pi / wi⟩}, i ∈ {1, …, n}, pi ∈ P, is the fuzzy set of vertices, each node
pi being weighted by the corresponding value wi ∈ [0, 1]; R̃ = {⟨(pk, pl) / c̃i⟩}, k, l =
1, …, n, c̃i ∈ C̃ = {NC, LC, MC, HC, FC}, is the fuzzy set of edges, each edge (pk, pl)
corresponding to a certain criterion consistency level c̃i.
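
One simple way to hold such a fuzzy graph in code is a node-weight map plus an edge-to-consistency-level map; the sketch below (the representation is ours) reproduces the weights given above and the consistency levels of Table 2 for the rhymed subgroup:

```python
# Sketch: a weighted fuzzy graph for the localized rhymed subgroup {p1,...,p5}.
# Node weights as listed in the text; edge consistency levels from Table 2.

node_weights = {"p1": 0.12, "p2": 0.29, "p3": 0.25, "p4": 0.15, "p5": 0.19}

edge_consistency = {
    frozenset({"p1", "p2"}): "HC",
    frozenset({"p1", "p3"}): "LC",
    frozenset({"p1", "p4"}): "MC",
    frozenset({"p1", "p5"}): "LC",
    frozenset({"p2", "p3"}): "LC",
    frozenset({"p2", "p4"}): "MC",
    frozenset({"p2", "p5"}): "NC",
    frozenset({"p3", "p4"}): "HC",
    frozenset({"p3", "p5"}): "MC",
    frozenset({"p4", "p5"}): "MC",
}

# find the edges carrying the highest consistency level present in the graph
order = ["NC", "LC", "MC", "HC", "FC"]
top = max(edge_consistency.values(), key=order.index)
print(top, [tuple(sorted(e)) for e, c in edge_consistency.items() if c == top])
```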
Figure 2 shows fuzzy undirected graph which corresponds to the results of
the estimation of consistency and significance in the subgroup of part of speech
localized rhymed characteristics in The Rime of the Ancient Mariner by
Coleridge.

Fig. 2: Fuzzy undirected graph which corresponds to the results of the estimation of
consistency and significance in the subgroup of part of speech localized rhymed
characteristics {p1, p2, p3, p4, p5} in The Rime of the Ancient Mariner by Coleridge.

In the same way, fuzzy graphs corresponding to the results of the estimation of
consistency and significance in the other groups (subgroups) can be constructed.
Consider the method of building a similarity model on the data of
the above-mentioned subgroup of localized rhymed part of speech
characteristics {p1, p2, p3, p4, p5}.
The suggested method is based on an alternate search in the fuzzy graph for
complete subgraphs G̃′ = ⟨P̃′, R̃′⟩ whose edges correspond to one of the criterion
levels of consistency c̃i ∈ C̃ = {NC, LC, MC, HC, FC}. The search order of such
subgraphs is determined by the order in which the levels of characteristics
consistency are considered.

There are two orders of proceeding to consider the levels of characteristics consistency:
– firstly, in the direction from the lowest to higher degrees of consistency;
– secondly, from the most consistent to less consistent characteristics.

The order of proceeding from the lowest to higher degrees of consistency allows
«not to lose» «good» estimations of badly consistent characteristics, because
the operation of aggregation of the characteristics is usually aimed at such
estimation when the value of generalized characteristic is defined by the worst
value of the partial characteristics.
The order of proceeding from the most consistent to less consistent
characteristics makes it possible to take into account «bad» estimations of
highly consistent characteristics whose aggregation was based on such type of
estimation when the value of the generalized characteristic is defined by the
highest value of partial characteristics (Zernov 2007).
Consider the alternate search for subgraphs G̃′ = ⟨P̃′, R̃′⟩ following the
procedure from the most consistent to less consistent characteristics in the
subgroup {p1, p2, p3, p4, p5}.
The procedure starts with finding the highest level of consistency among the
characteristics {p1, p2, p3, p4, p5} (Fig. 2).
Pairs of characteristics p1 and p2, and also p3 and p4 have HC «High level of
consistency», the largest in fuzzy graph in Figure 2. These pairs of nodes form
two complete subgraphs of maximum size.
In case there are several such subgraphs, these complete subgraphs of
maximum size are sorted in the order of decreasing the total weight of their
nodes (another method of analysis of subgraphs can also be applied). In Fig. 3a
only those edges of fuzzy graph were left which correspond to the given level of
consistency HC.
Considering in succession these subgraphs, we merge all the nodes of these
subgraphs into one on the basis of the operation of aggregation corresponding
to the given level of consistency.
If the operation of aggregation is not a multi-place operation h(p1, …, pq), q ∈ {1, …, n},
but a two-place nonassociative operation h(pk, pl), k, l ∈ {1, …, n}, k ≠ l, then
the order of enumeration of the characteristics pk and pl may be set, e.g. according
to decreasing weights.
In our case of a bisymmetric operation it is also possible to start with the nodes of
the largest weight. Then, supposing that wi ≤ wi+1, i ∈ {1, …, q−1}, we get:

h*(p1, …, pq) = h(h(… (h(p1, p2), …), pq−1), pq).

For aggregating the characteristics p1 and p2, and also p3 and p4, the operations
med(p1, p2; 0.75) and med(p3, p4; 0.75) from Table 1, corresponding to the
specified level of consistency HC, are chosen.
Figure 3b shows the fuzzy graph, made by merging nodes p1 and p2, and p3
and p4 using the above-mentioned operations. The weights of the new nodes of
the fuzzy graph are determined by summing weights of the merging nodes.

Fig. 3: Subgraphs of a fuzzy graph which correspond to the HC "High Consistency" level – a);
fuzzy graph with merged nodes (p1, p2) and (p3, p4), obtained by using the operations
med(p1, p2; 0.75) and med(p3, p4; 0.75), correspondingly – b).
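
The sketch below mirrors the merge steps just described symbolically: each merge wraps the two merged expressions in the aggregation operation that corresponds to the consistency level of the merged edge, so that the nested expression of the similarity model emerges step by step (the helper names and the string representation are ours):

```python
# Sketch: one greedy merge pass building the nested expression of the
# similarity model for the rhymed subgroup, following the merge order
# described in the text (HC pairs first, then MC, then NC).

OP_BY_LEVEL = {"NC": "min({}, {})", "LC": "med({}, {}; 0.25)",
               "MC": "med({}, {}; 0.5)", "HC": "med({}, {}; 0.75)",
               "FC": "max({}, {})"}

def merge(expr_a, expr_b, level):
    """Wrap two merged node expressions in the operation for the given level."""
    return OP_BY_LEVEL[level].format(expr_a, expr_b)

# reproduces the merges carried out step by step in the text
step1 = merge("p1", "p2", "HC")          # med(p1, p2; 0.75)
step2 = merge("p3", "p4", "HC")          # med(p3, p4; 0.75)
step3 = merge(step2, "p5", "MC")         # med(med(p3, p4; 0.75), p5; 0.5)
model = merge(step1, step3, "NC")        # the full model for the subgroup
print(model)
```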

Then the edges which are connected with the merged nodes are removed from
the set of edges of the fuzzy graph.
Afterwards, the levels of consistency of the edges adjacent to the merged nodes
are determined anew. These levels of consistency are defined according to the strategy of estimation
which was chosen (see the classification of approaches to building similarity
models for linguistic analysis). In the given example we choose the type of
estimation in which the value of the generalized characteristic is determined by
the worst values of the partial characteristics. The level of consistency,
established by this type, is shown in Figure 3b.
Then the procedure is repeated.
At the next step the next highest level of consistency of the characteristics (nodes of
the fuzzy graph) out of the set {(p1, p2), (p3, p4), p5} is found. It is obvious that the
pair formed by the characteristics (p3, p4) and p5 demonstrates the highest remaining level of
consistency, MC – «Medium consistency».
To aggregate these characteristics, according to Table 1 the operation
med(med(p3, p4; 0.75), p5; 0.5) is used.
Figure 4 shows the fuzzy graph obtained after another merging operation of nodes.
At the final stage of identification, the operation of aggregation of the
characteristics (merged nodes of the fuzzy graph) (p1, p2) and ((p3, p4), p5) is
carried out according to the level of consistency NC – «No consistency».

Fig. 4: Fuzzy graph after one of successive merging operations of nodes

Thus the similarity model of the subgroup of part of speech localized rhymed
characteristics for the poem The Rime of the Ancient Mariner can be expressed as
follows:

P_LR = min(med(p1, p2; 0.75), med(med(p3, p4; 0.75), p5; 0.5))

Similarity models for two other subgroups of characteristics are built in the
same way.
Figure 5 shows the fuzzy undirected graph which corresponds to the results of
estimation of consistency and significance in the subgroup of part of speech
localized unrhymed characteristics {p6, p7, p8, p9, p10} for the poem The Rime of
the Ancient Mariner.
Fig. 5: Fuzzy undirected graph which corresponds to the results of estimation of consistency
and significance in the subgroup of part of speech localized unrhymed characteristics
{p6, p7, p8, p9, p10} for the poem The Rime of the Ancient Mariner.

Having repeated all the steps of this method, we get the similarity model for the
subgroup of part of speech localized unrhymed characteristics as follows:

P_LUR = med(med(med(p6, p10; 0.75), p7; 0.5), med(p8, p9; 0.5); 0.25).

Fig. 6: Fuzzy undirected graph which corresponds to the results of estimation of consistency
and significance in the subgroup of part of speech unlocalized characteristics
{p11, p12, p13, p14} for the poem The Rime of the Ancient Mariner.
Figure 6 shows fuzzy undirected graph which corresponds to the results of
estimation of consistency and significance in the subgroup of part of speech
unlocalized characteristics {p11, p12, p13, p14} for the poem The Rime of the
Ancient Mariner.
The similarity model for the subgroup of part of speech unlocalized
characteristics may be presented as follows:

P_UL = med(med(med(p11, p12; 0.5), p13; 0.5), p14; 0.5).

The results of the estimation of linguistic objects using different subgroups of


characteristics may be useful. In this case the objects will be estimated on the
basis of vector of estimations, every element of which is the estimation obtained
with the help of similarity model using the corresponding subgroup of
characteristics.
For example, the vector of part of speech (morphological) characteristics of
some linguistic object aj, j = 1, …, m, includes the estimations obtained with the
help of the similarity models for the corresponding subgroups of part of speech
characteristics P_LR, P_LUR and P_UL:

(P_LR(aj), P_LUR(aj), P_UL(aj));

P_LR(aj) = min(med(p1(aj), p2(aj); 0.75), med(med(p3(aj), p4(aj); 0.75), p5(aj); 0.5));

P_LUR(aj) = med(med(med(p6(aj), p10(aj); 0.75), p7(aj); 0.5), med(p8(aj), p9(aj); 0.5); 0.25);

P_UL(aj) = med(med(med(p11(aj), p12(aj); 0.5), p13(aj); 0.5), p14(aj); 0.5),

where P_LR(aj) is the estimation of the linguistic object aj using the subgroup of
localized rhymed part of speech characteristics, obtained with the help of the
similarity model P_LR; P_LUR(aj) is the estimation of the object aj using the subgroup of
part of speech localized unrhymed characteristics, obtained with the help of the similarity
model P_LUR; and P_UL(aj) is the estimation of the object aj using the subgroup of part
of speech unlocalized characteristics, obtained with the help of the model P_UL.2
On the other hand, the estimation results obtained for the above-mentioned
subgroups of characteristics may be aggregated into a generalized estimation
for the whole group of part of speech characteristics.
In order to do this according to the suggested method it is necessary to
build a similarity model in which 𝑃𝑃𝐿𝐿𝐿𝐿 , 𝑃𝑃𝐿𝐿𝐿𝐿𝐿𝐿 and 𝑃𝑃𝑢𝑢𝑢𝑢 , in their turn, function as
“partial” characteristics, and 𝑃𝑃𝑃𝑃𝑃𝑃𝑃𝑃 is a generalized characteristic which is
formed taking into account the consistency and significance of these “partial”
characteristics:

P_PoS = H*(P_LR, P_LuR, P_uL),

where H* is the notation for an operation (set of operations) of aggregation, identified in accordance with the consistency levels of P_LR, P_LuR and P_uL.
Obviously, the estimation of consistency levels between P_LR, P_LuR and P_uL, as well as the setting of their significance coefficients, is carried out by experts. For the estimation of consistency between P_LR, P_LuR and P_uL it is possible to use a fuzzy cognitive model (Borisov, Fedulov 2004) and, in order to set their significance, the method of pairwise comparison (Saaty 1980).
Let us assume that P_LR, P_LuR and P_uL correspond to the consistency level "High consistency", and the following significance coefficients are assigned to them: w_LR = 0.35; w_LuR = 0.25; w_uL = 0.40.
Then the similarity model for the whole group of part of speech
characteristics of the poem by Coleridge The Rime of the Ancient Mariner may be
represented in the following form:

P_PoS = agg(agg(P_LR, P_uL; 0.75), P_LuR; 0.75).

Consider briefly some approaches to the analysis of linguistic objects using the developed models.

Comparison (establishing the degree of similarity) of the original and its translations

The essence of the approach consists in establishing estimated values of the compared linguistic objects, e.g. a_1, a_2, a_3 – parts (chapters) of the original poem The Rime of the Ancient Mariner by Coleridge and of its two translations made by Gumilev and Levik.
In this case the model P_PoS, which was built for the original text, is used.
Specific numerical values of characteristics are set:
– for the original text by Coleridge – {p1(a1), …, pn(a1)};
– for the translation by Gumilev – {p1(a2), …, pn(a2)};
– for the translation by Levik – {p1(a3), …, pn(a3)}.

Then the values of P_PoS(a_1), P_PoS(a_2) and P_PoS(a_3) are calculated using the
similarity model built for the group of part of speech characteristics in the
original text.

P_LR(a_j) = agg(agg(p_1(a_j), p_2(a_j); 0.75), agg(agg(p_3(a_j), p_4(a_j); 0.75), p_5(a_j); 0.5))

P_LuR(a_j) = agg(agg(agg(p_6(a_j), p_10(a_j); 0.75), p_7(a_j); 0.5), agg(p_8(a_j), p_9(a_j); 0.5); 0.25)

P_uL(a_j) = agg(agg(agg(p_11(a_j), p_12(a_j); 0.5), p_13(a_j); 0.5), p_14(a_j); 0.5)

P_PoS(a_i) = agg(agg(P_LR(a_i), P_uL(a_i); 0.75), P_LuR(a_i); 0.75).

Then, using the values P_PoS(a_2) and P_PoS(a_3) of the translations, their similarity with the original, i.e. proximity to the value of P_PoS(a_1), is established.
The obtained results of the estimation make it possible not only to rank the compared linguistic objects, but also to establish the degree of their similarity.
Table 6 contains the results of the estimation of 7 parts (chapters) of the
texts of the original poem by Coleridge The Rime of the Ancient Mariner and its
two translations by Gumilev and Levik, which were obtained using the
similarity model for part of speech characteristics of the original.

Table 6: Results of the estimation of 7 parts of the texts of the original poem by Coleridge The
Rime of the Ancient Mariner and its two translations by Gumilev and Levik, obtained using the
similarity model for part of speech characteristics of the original.

        Original (Coleridge)                      Translation (Gumilev)                     Translation (Levik)
Parts   P_LR(a1) P_LuR(a1) P_uL(a1) P_PoS(a1)    P_LR(a2) P_LuR(a2) P_uL(a2) P_PoS(a2)    P_LR(a3) P_LuR(a3) P_uL(a3) P_PoS(a3)
Part 1 0.061 0.250 0.203 0.250 0.049 0.250 0.296 0.296 0.036 0.250 0.323 0.323
Part 2 0.05 0.250 0.235 0.250 0.017 0.217 0.322 0.322 0.046 0.215 0.329 0.329
Part 3 0.037 0.173 0.232 0.232 0.099 0.222 0.288 0.288 0.060 0.060 0.316 0.316
Part 4 0.103 0.221 0.213 0.221 0.103 0.191 0.267 0.267 0.029 0.176 0.345 0.345
Part 5 0.051 0.245 0.193 0.245 0.059 0.220 0.279 0.279 0.059 0.250 0.314 0.314
Part 6 0.048 0.192 0.218 0.218 0.163 0.250 0.285 0.285 0.076 0.229 0.273 0.273
Part 7 0.080 0.196 0.193 0.193 0.035 0.204 0.273 0.273 0.092 0.227 0.272 0.272

These results show that on the whole the translation by Gumilev is closer to the original in Parts 1–5, in which the idea of crime and punishment prevails. On the other hand, Part 6 and to some extent Part 7, in which the themes of atonement and relief of the Mariner (and the writer himself, as is implied in the text) by storytelling are expressed, are translated by Levik with a higher degree of exactness, thus revealing a difference between the approaches demonstrated by the two translators.
Gumilev is much closer in his translation in the unrhymed characteristics and also in the unlocalized part of speech characteristics, whereas Levik pays more attention to the rhymed words. Rhyme is one of the important means of introducing vertical links into verse. Unrhymed words attract less attention, but together with the unlocalized characteristics they allow Gumilev to render the distribution of parts of speech in the line. In other words, Levik creates better similarity in the vertical organization, Gumilev – in the horizontal one.
The estimation was carried out with the following assumptions:
– For the given example the evaluation model is unchanged for all parts
(chapters) of the original poem. Obviously, it is possible to build different
similarity models for different chapters of the text, and then estimation can
be made for each chapter, using its “own” similarity model.
– The comparison of these linguistic objects in the given example was made
after normalizing the characteristics against the number of lines in the
corresponding chapter. It is reasonable also to examine the possibilities of
using this model with normalization of the values of characteristics against
their maximum values and taking into account their weights.
– While building the model in the given example we followed the strategy of examining the characteristics in the order "from the most consistent to the least consistent". Using the other strategy of building the model one may widen the possibilities of the classification.

Part of speech characteristics, being highly important for authorship attribution, classification of styles and estimation of translation exactness (Andreev 2012, Koppel, Argamon, Shimoni 2002, Naumann, Popescu, Altmann 2012, Popescu, Čech, Altmann 2013), may further on be combined with characteristics reflecting other linguistic layers: rhythm, rhyme, syntax and other features of verse (Köhler, Altmann 2014, 70-73). Such "parameter adjustment" is possible and will make it possible to focus on new aspects of classification.
Similarity models can be built not only for the original but also for its translations. The results of the comparison of the structure of these models are of interest as they contain information about the interrelations and consistency of the estimated characteristics.

Analysis of the dynamics of individual style development (establishing the degree of similarity of fragments of text)

In this case the compared linguistic objects are different fragments (chapters) of a text or texts (see Table 6). Figure 7 reflects the dynamics of the values of P_PoS throughout the original text and its two translations.

Fig. 7: Dynamics of the characteristic P_PoS over the original text The Rime of the Ancient Mariner by Coleridge and its translations by Gumilev and Levik.

According to the analysis of the dynamics of these characteristics it is possible to draw the following conclusions.
Firstly, in the original the generalized characteristic P_PoS is influenced by the characteristic P_LuR, whereas in both Gumilev's and Levik's translations this characteristic P_PoS is mostly influenced by P_uL. This can explain the higher values of P_PoS in both translations as compared with the values of this characteristic in the original.
Secondly, for the original P_PoS remains stable for practically the whole of the text. Some deviation – a tendency to decrease – takes place only in Part 6, reaching the minimum value of P_PoS in Part 7. The same tendency is observed in both translations.
Thirdly, there is a closer correlation of the changes of P_PoS in the text of the poem between the original and the translation by Gumilev than between the original and the translation by Levik.
The same type of comparison was made for the dynamics of P_LR, P_LuR and P_uL over the text, for the original and the translations (Fig. 8–10).
Variation of the values of the two characteristics P_uL and P_LuR in all the three texts is rather moderate. Some deviation from this tendency is observed in Levik's translation (Fig. 9) in Part 3, where the theme of punishment is introduced. On the other hand, the localized rhymed characteristics P_LR (Fig. 8) show a wide range of variation of their values in each text.

Fig. 8: Dynamics of the characteristic P_LR for the original text The Rime of the Ancient Mariner by Coleridge and its translations by Gumilev and Levik.

Fig. 9: Dynamics of the characteristic P_LuR for the original text The Rime of the Ancient Mariner by Coleridge and its translations by Gumilev and Levik.

Fig. 10: Dynamics of the characteristic P_uL for the original text The Rime of the Ancient Mariner by Coleridge and its translations by Gumilev and Levik.

4 Conclusion
In this chapter the relevance of building fuzzy similarity models for linguistic
analysis has been discussed. The models are aimed at solving a wide range of
tasks under conditions of uncertainty:

– estimation of the degree of similarity of the original text (poem) and its
translations using different groups of characteristics;
– estimation of the similarity of parts of the compared poems (chapters);
– classification of texts and their features according to certain rules.

Similarity models for complicated linguistic objects have been analyzed, and their classification depending on the type of characteristics aggregation has been suggested. The relevance of building similarity models based on the fuzzy approach and aimed at solving various linguistic problems is outlined.
The suggested fuzzy similarity models for linguistic analysis satisfy the following conditions:
– the generalized characteristic is formed on the basis of changing sets of partial
characteristics;
– possibility to aggregate heterogeneous characteristics (both quantitative
and qualitative) which differ in measurement scales, range of variation and
weight;
– possibility to build a similarity model both with and without a priori
grouping of the characteristics;
– possibility to create a complex hierarchical structure of the similarity
model, in which partial characteristics are divided into groups and
subgroups;
– using criterion levels of characteristics consistency for the identification of
aggregation operations;
– possibility to adjust (adapt) the model in a flexible way to the changes of
the number of characteristics (addition, exclusion) and changes of
characteristics (in consistency and weight).

Acknowledgment
Research is supported by the State Task of the Ministry of Education and
Science of the Russian Federation (the basic part, task # 2014/123, project #
2493) and Russian Foundation for Humanities (project # 14-04-00266).

References
Andreev, Sergey. 2012. Literal vs. liberal translation – formal estimation. Glottometrics 23. 62–
69.
Borisov, Vadim V. 2014. Hybridization of intellectual technologies for analytical tasks of decision-making support. Journal of Computer Engineering and Informatics 2(1). 11–19.
Borisov, Vadim V., Igor A. Bychkov, Andrey V. Dementyev, Alexander P. Solovyev & Alexander
S. Fedulov. 2002. Kompyuternaya podderzhka slozhnykh organizacionno-tekhnicheskikh
sistem [Computer support of complex organization-technical systems]. Moskva:
Goryachaya linia – Telekom.
Borisov, Vadim V. & Alexander S. Fedulov. 2004. Generalized rule-based fuzzy cognitive maps:
Structure and dynamics model. In Nikhil R. Pal, Nikola Kasabov, Rajani K. Mudi, Srimanta
Pal & Swapan K. Parui (eds.), Neural information processing (Lecture notes in computer
science 3316), 918–922. Berlin & Heidelberg: Springer.
Dubois, Didier & Henri Prade. 1988. Théorie des possibilités: Application à la représentation
des connaissances en informatique. Paris: Masson.
Dubois, Didier & Henri Prade. 1990. Teoriya vozmozhnostey. Prilozheniya k predstavleniyu v
informatike [Possibility theory. Applications to the representation of knowledge in
Informatics]. Moskva: Radio i svyaz.
Hoover, David L. 2007. Corpus stylistics, stylometry and the styles of Henry James. Style 41(2).
160–189.
Juola, Patrick. 2006. Authorship attribution. Foundations and Trends in Information Retrieval
1(3). 233–334.
Keeney, Ralph L. & Howard Raiffa. 1981. Prinyatye resheniy pri mnogikh kriteriyakh:
Predpochteniya i zameshcheniya [Decisions with multiple objectives: Preferences and
value tradeoffs]. Moskva: Radio i svyaz.
Köhler, Reinhard & Gabriel Altmann. 2014. Problems in Quantitative Linguistics 4.
Lüdenscheid: RAM-Verlag.
Koppel, Moshe, Shlomo Argamon & Anat R. Shimoni. 2002. Automatically categorizing written
texts by author gender. Literary and Linguistic Computing 17(4). 401–412.
Lee, Kwang H. 2004. First course on fuzzy theory and application. Berlin & Heidelberg:
Springer.
Mikros, George K. 2009. Content words in authorship attribution: An evaluation of stylometric
features in a literary corpus. In Reinhard Köhler (ed.), Issues in Quantitative Linguistics,
61–75. Lüdenscheid: RAM-Verlag.
Naumann, Sven, Ioan-Iovetz Popescu & Gabriel Altmann. 2012. Aspects of nominal style.
Glottometrics 23. 23–35.
Popescu, Ioan-Iovitz, Radek Čech & Gabriel Altmann. 2013. Descriptivity in Slovak lyrics.
Glottotheory 4(1). 92–104.
Rudman, Joseph. 2002. Non-traditional authorship attribution studies in eighteenth century
literature: Stylistics statistics and the computer. Jahrbuch für Computerphilologie 4. 151–
166.
Saaty, Thomas L. 1980. The analytic hierarchy process: Planning, priority setting, resource
allocation. New York: McGraw-Hill.
Zernov, Mikhail M. 2007. Sposob postroyeniya nechetkoy mnogokriteriyalnoy ocenochnoy
modeli [Method of building fuzzy multicriteria estimation model]. Neyrokompyutery:
Razrabotka i primeneniye 1, 40–49.
Zimmermann, Hans-Jürgen & Peter Zysno. 1980. Latent connectives in human decision making.
Fuzzy Sets and Systems 4(1), 37–51.
François Bavaud, Christelle Cocco, Aris Xanthos
Textual navigation and autocorrelation

1 Introduction
Much work in the field of quantitative text processing and analysis has adopted
one of two distinct symbolic representations: in the so-called bag-of-words
model, text is conceived as an urn from which units (typically words) are inde-
pendently drawn according to a specified probability distribution; alternatively,
the sequential model describes text as a categorical time series, where each unit
type occurs with a probability that depends, to some extent, on the context of
occurrence.
The present contribution pursues two related goals. First, it aims to general-
ise the standard sequential model of text by decoupling the order in which units
occur from the order in which they are read. The latter can be represented by a
Markov transition matrix between positions in the text, which makes it possible
to account for a variety of ways of navigating the text, including in particular non-
linear and non-deterministic ones. Markov transitions thus define textual neigh-
bourhoods as well as positional weights – the stationary distribution or promi-
nence of textual positions.
Building on the notion of textual neighbourhood, the second goal of this con-
tribution is to introduce a unified framework for textual autocorrelation, namely
the tendency for neighbouring positions to be more (or less) similar than ran-
domly chosen positions with respect to some observable property – for instance
whether the same unit types tend to occur, or units of similar length, or consisting
of similar sub-units, and so on. Inspired from spatial analysis (see e.g. Cressie
1991; Anselin 1995; Bavaud 2013), this approach relates the above mentioned
transition matrix (specifying neighbourhoodness) with a second matrix specifying
the dissimilarity between textual positions.
The remainder of this contribution is organised as follows. Section 2 intro-
duces the foundations of the proposed formalism and illustrates them with toy
examples. Section 3 presents several case studies intended to show how the for-
malism and some of its extensions apply to more realistic research problems in-
volving, among others, morphosyntactic and semantic dissimilarities computed
in literary or hypertextual documents. Conclusion briefly summarises the key
ideas introduced in this contribution.

2 Formalism

2.1 Textual navigation: positions, transitions and exchange matrix

A text consists of n positions i = 1, …, n. Each position contains an occurrence of a type (or term) a = 1, …, v. Types may refer to characters, words (possibly lemmatised), sentences, or units from any other level of analysis. The occurrence of an instance of type a at position i is denoted o(i) = a.

Fig. 1: Fixations and ocular saccades during reading, illustrating the skimming navigation
model. Source: Wikimedia Commons, from a study of speed reading made by Humanistlabora-
toriet, Lund University (2005).

A simple yet non-trivial model of textual navigation can be specified by means of a regular n × n Markov transition matrix T = (t_ij), obeying t_ij ≥ 0 and Σ_j t_ij = 1. Its stationary distribution f_i > 0, obeying Σ_i f_i t_ij = f_j and Σ_i f_i = 1, characterises the prominence of position i. For instance, neglecting boundary effects,¹


1 Boundary effects are, if necessary, dealt with the usual techniques (reflecting boundary, pe-
riodic conditions, addition of a "rest state", etc.), supporting the formal fiction of an ever-reading
agent embodied in stationary navigation.

Fig. 2: Word cloud (generated by http://www.wordle.net), aimed at illustrating the free or bag-
of-words navigation model with differing positional weights f.

– t_ij = 1(j = i + 1) corresponds to standard linear reading²
– t_ij = 1(i < j < i + r + 1)/r describes (directed) skimming with jumps of maximum length r ≥ 1, as suggested by Figure 1
– t_ij = f_j defines the free or bag-of-words navigation, consisting of independent random jumps towards weighted positions (Markov chain of order 0), as suggested by Figure 2.

Define the n × n exchange matrix E = (e_ij) as

e_ij := (1/2)(f_i t_ij + f_j t_ji)   (1)

By construction, E is symmetric, non-negative, with margins e_i• = e_•i = f_i, and total e_•• = 1.³ The exchange matrix constitutes a symmetrical measure of positional interaction, reading flow, or neighbourhoodness between textual positions i and j.
In particular, the following three exchange matrices correspond to the three transition matrices defined above (still neglecting boundary effects):
– e_ij = [1(j = i − 1) + 1(j = i + 1)]/(2n) for standard linear reading
– e_ij = 1(|i − j| ≤ r) 1(i ≠ j)/(2rn) for skimming with (undirected) jumps of maximum length r
– e_ij = f_i f_j for the free or bag-of-words exchange matrix.

² Here and in the sequel, 1(A) denotes the indicator function of event A, with value 1(A) = 1 if A is true, and 1(A) = 0 otherwise.
³ Here and in the sequel, the notation "•" denotes summation over the replaced index, as in a_•j := Σ_i a_ij, a_i• := Σ_j a_ij and a_•• := Σ_ij a_ij.

Toy example 1: consider a text consisting of four positions, either (periodically) linearly read as ...123412341... (A) or freely read (B). The previously introduced quantities are then

T^A = [[0,1,0,0], [0,0,1,0], [0,0,0,1], [1,0,0,0]]   f^A = (1/4, 1/4, 1/4, 1/4)   E^A = (1/8)·[[0,1,0,1], [1,0,1,0], [0,1,0,1], [1,0,1,0]]   (2)

T^B = (1/4)·[[1,1,1,1], [1,1,1,1], [1,1,1,1], [1,1,1,1]]   f^B = (1/4, 1/4, 1/4, 1/4)   E^B = (1/16)·[[1,1,1,1], [1,1,1,1], [1,1,1,1], [1,1,1,1]]   (3)

Note that (2) can also be conceived as an example of (periodic) skimming with jumps of maximum length r = 1, which is indeed equivalent to (periodic) linear reading. Similarly, free navigation is equivalent to skimming with jumps of maximum size n, with the single difference that the latter forbids jumping towards the currently occupied position.
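The quantities of toy example 1 are easy to reproduce numerically. The following sketch (our illustration, assuming uniform positional weights, which are indeed stationary for these periodic models) builds the transition and exchange matrices for linear reading, skimming with jumps of maximum length r, and free navigation.

```python
import numpy as np

def exchange_matrix(T, f):
    """Exchange matrix (1): e_ij = (f_i t_ij + f_j t_ji) / 2."""
    return (f[:, None] * T + (f[:, None] * T).T) / 2

n = 4
f = np.full(n, 1 / n)

# Periodic linear reading: from position i, jump to position i+1 (mod n).
T_linear = np.roll(np.eye(n), 1, axis=1)

# Periodic skimming with jumps of maximum length r: uniform over the next r positions.
r = 1
T_skim = np.zeros((n, n))
for i in range(n):
    for step in range(1, r + 1):
        T_skim[i, (i + step) % n] = 1 / r

# Free (bag-of-words) navigation: jump to position j with probability f_j.
T_free = np.tile(f, (n, 1))

print(exchange_matrix(T_linear, f) * 8)    # 0/1 pattern of E^A in (2)
print(exchange_matrix(T_free, f) * 16)     # all-ones pattern of E^B in (3)
```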

2.2 Autocorrelation index

Let us now consider a matrix of dissimilarities D_ij between pairs of positions (i, j). Here and in the sequel, we restrict ourselves to squared Euclidean dissimilarities, i.e. of the form D_ij = ∥x_i − x_j∥², where ∥·∥ denotes the Euclidean norm and x_i, x_j are p-dimensional vectors, for some p ≥ 1.
The average dissimilarity between a pair of randomly chosen positions defines the (global) inertia Δ, while the average dissimilarity between a pair of neighbouring positions defines the local inertia Δ_loc:

Δ := (1/2) Σ_{i,j} f_i f_j D_ij    Δ_loc := (1/2) Σ_{i,j} e_ij D_ij   (4)

A local inertia much smaller than the global inertia reflects the presence of textual autocorrelation: closer average similarity between neighbouring positions than between randomly chosen positions. Conversely, negative autocorrelation characterizes a situation where neighbouring positions are more dissimilar than randomly chosen ones⁴. Textual autocorrelation can be measured by the autocorrelation index

δ := (Δ − Δ_loc)/Δ

generalizing Moran's I index of spatial statistics (Moran 1950). The latter holds in the one-dimensional case D_ij = (x_i − x_j)², where x_i, x_j are scalars, Δ = var(x), and Δ_loc = var_loc(x) (e.g. Lebart 1969).
Under the null hypothesis H_0 of absence of textual autocorrelation, the expected value of the autocorrelation index is not zero in general, but instead

E_0(δ) = (trace(W) − 1)/(n − 1),   (5)

where W = (w_ij) is the transition matrix of a reversible Markov chain defined as w_ij := e_ij/f_i (so that E_0 reduces to −1/(n − 1) for off-diagonal exchange matrices). Similarly, under normal approximation, the variance reads

Var_0(δ) = (2/(n² − 1)) · (trace(W²) − 1 − (trace(W) − 1)²/(n − 1)),

(e.g. Cliff and Ord 1981; Bavaud 2013), thus making the autocorrelation index significant at level α if

|δ − E_0(δ)| / √Var_0(δ) ≥ u_{1−α/2},   (6)

where u_α denotes the α-th quantile of the standard normal distribution.
Toy example 1, continued: suppose that the types occurring at the four positions of example 1 are trigrams, with o(4) = o(1) and o(1), o(2), o(3) pairwise distinct. Define the (squared Euclidean) dissimilarity D_ij as the number of characters by which trigrams o(i) and o(j) differ:

D = [[0,2,3,0], [2,0,2,2], [3,2,0,3], [0,2,3,0]]


4 Up to the bias associated to the contribution of self-comparisons: see (6).

Under linear periodic navigation (2), one gets the values Δ^A = 3/4, Δ^A_loc = 7/8 and δ^A = −1/6, higher than E_0(δ^A) = −1/3: the dissimilarity between immediate neighbours under linear periodic navigation is on average (after subtracting the bias E_0(δ^A)) smaller than the dissimilarity between randomly chosen pairs – although not significantly by the above normal test.
By contrast, free navigation (3) yields Δ^B = Δ^B_loc = 3/4 and δ^B = 0, since local and global inertia here coincide by construction. Note that E_0(δ^B) = 0 and Var_0(δ^B) = 0 in case of free navigation, regardless of the values of D.
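Continuing the illustration, the autocorrelation index and its reference values under the position permutation test follow directly from E and D; the self-contained snippet below (ours) reproduces δ^A = −1/6 and E_0(δ^A) = −1/3.

```python
import numpy as np

n = 4
f = np.full(n, 1 / n)                             # uniform positional weights
T = np.roll(np.eye(n), 1, axis=1)                 # periodic linear reading
E = (f[:, None] * T + (f[:, None] * T).T) / 2     # exchange matrix, eq. (1)

D = np.array([[0, 2, 3, 0],
              [2, 0, 2, 2],
              [3, 2, 0, 3],
              [0, 2, 3, 0]], dtype=float)

delta_glob = 0.5 * np.sum(np.outer(f, f) * D)     # global inertia, eq. (4)
delta_loc = 0.5 * np.sum(E * D)                   # local inertia, eq. (4)
delta = 1 - delta_loc / delta_glob                # autocorrelation index

W = E / f[:, None]                                # w_ij = e_ij / f_i
E0 = (np.trace(W) - 1) / (n - 1)                  # expected value, eq. (5)
var0 = 2 / (n**2 - 1) * (np.trace(W @ W) - 1 - (np.trace(W) - 1)**2 / (n - 1))

print(round(delta, 4), round(E0, 4))              # -0.1667 -0.3333
```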

2.3 Type dissimilarities

In most cases, the dissimilarity 𝐷𝐷𝐷𝐷𝑖𝑖𝑖𝑖𝑖𝑖𝑖𝑖 between positions 𝑖𝑖𝑖𝑖 and 𝑗𝑗𝑗𝑗 depends only on the
types 𝑜𝑜𝑜𝑜(𝑖𝑖𝑖𝑖) = 𝑎𝑎𝑎𝑎 and 𝑜𝑜𝑜𝑜(𝑗𝑗𝑗𝑗) = 𝑏𝑏𝑏𝑏 found at these positions. Thus, the calculation of the
autocorrelation index can be based on the 𝑣𝑣𝑣𝑣 × 𝑣𝑣𝑣𝑣 type dissimilarity matrix 𝐷𝐷𝐷𝐷𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎 ra-
ther than on the 𝑛𝑛𝑛𝑛 × 𝑛𝑛𝑛𝑛 position dissimilarity matrix – which makes it possible to
simplify both computation (since 𝑣𝑣𝑣𝑣 < 𝑛𝑛𝑛𝑛 in general) and notation (cf. sections 3.2
and 3.3).
Here are some examples of squared Euclidean type dissimilarities, i.e. of the
form 𝐷𝐷𝐷𝐷𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎 = ∥ 𝑥𝑥𝑥𝑥𝑎𝑎𝑎𝑎 − 𝑥𝑥𝑥𝑥𝑎𝑎𝑎𝑎 ∥2 where 𝑥𝑥𝑥𝑥𝑎𝑎𝑎𝑎 ∈ ℝ𝑝𝑝𝑝𝑝 are the 𝑝𝑝𝑝𝑝-dimensional coordinates of type
𝑎𝑎𝑎𝑎, recoverable by multidimensional scaling (see section 3.4):

– 𝐷𝐷𝐷𝐷𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎 = (𝑥𝑥𝑥𝑥𝑎𝑎𝑎𝑎 − 𝑥𝑥𝑥𝑥𝑎𝑎𝑎𝑎 )2 where 𝑥𝑥𝑥𝑥𝑎𝑎𝑎𝑎 characterises type 𝑎𝑎𝑎𝑎, e.g. 𝑥𝑥𝑥𝑥𝑎𝑎𝑎𝑎 = "length of 𝑎𝑎𝑎𝑎" or 𝑥𝑥𝑥𝑥𝑎𝑎𝑎𝑎 =
1(𝑎𝑎𝑎𝑎 𝑎 𝑎𝑎𝑎𝑎) (presence-absence dissimilarity with respect to property 𝐴𝐴𝐴𝐴)
– 𝐷𝐷𝐷𝐷𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎 = 1(𝑎𝑎𝑎𝑎 𝑎 𝑎𝑎𝑎𝑎), the discrete metric
1 1
– 𝐷𝐷𝐷𝐷𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎 = � + � 1(a≠b), the weighted discrete metric, where 𝜋𝜋𝜋𝜋𝑎𝑎𝑎𝑎 > 0 is the rel-
πa πb
ative proportion of type 𝑎𝑎𝑎𝑎, with ∑𝑎𝑎𝑎𝑎 𝜋𝜋𝜋𝜋𝑎𝑎𝑎𝑎 = 1 (Le Roux and Rouanet 2004; Ba-
vaud and Xanthos 2005)
𝑝𝑝𝑝𝑝 𝑟𝑟𝑟𝑟•𝑘𝑘𝑘𝑘 𝑟𝑟𝑟𝑟𝑎𝑎𝑎𝑎𝑘𝑘𝑘𝑘 𝑟𝑟𝑟𝑟•• 𝑟𝑟𝑟𝑟𝑏𝑏𝑏𝑏𝑘𝑘𝑘𝑘 𝑟𝑟𝑟𝑟•• 2
– 𝐷𝐷𝐷𝐷𝑎𝑎𝑎𝑎𝑎𝑎𝑎𝑎 = ∑𝑘𝑘𝑘𝑘𝑘𝑘 ( − ) , the chi-square dissimilarity, used for compo-
𝑟𝑟𝑟𝑟•• 𝑟𝑟𝑟𝑟𝑎𝑎𝑎𝑎𝑎 𝑟𝑟𝑟𝑟•𝑘𝑘𝑘𝑘 𝑟𝑟𝑟𝑟𝑏𝑏𝑏𝑏𝑏 𝑟𝑟𝑟𝑟•𝑘𝑘𝑘𝑘
site types made of distinguishable features, where 𝑛𝑛𝑛𝑛𝑎𝑎𝑎𝑎𝑘𝑘𝑘𝑘 is the type-feature ma-
trix counting the occurrences of feature 𝑘𝑘𝑘𝑘 in type 𝑎𝑎𝑎𝑎.

In order to compute the autocorrelation index using a type dissimilarity matrix, a v × v type exchange matrix can be defined as

ε_ab = Σ_i Σ_j e_ij 1(o(i) = a) 1(o(j) = b),

whose margins specify the relative distribution of types: π_a = ε_a• = ε_•a = Σ_i f_i 1(o(i) = a). Global and local inertias (4) can then be calculated as

Δ = (1/2) Σ_{a,b} π_a π_b D_ab    Δ_loc = (1/2) Σ_{a,b} ε_ab D_ab   (7)

The Markov transition probability from term a to term b is now ε_ab/π_a. Following (5), the autocorrelation index δ = 1 − Δ_loc/Δ has to be compared with its expected value under independence

E_0(δ) = (Σ_{a=1}^v ε_aa/π_a − 1)/(v − 1)   (8)

which generally differs from (5). Indeed, the permutation-invariance implied by the null hypothesis H_0 of absence of textual autocorrelation relies on permutations of positions i = 1, …, n in (5), while it considers permutations of terms a = 1, …, v in (8) – two natural although not equivalent assumptions. In the following, the position permutation test will be adopted by default, unless explicitly stated otherwise.
Toy example 1, continued: recall that the text of example 1 contains three distinct types, namely the trigrams o(1) = o(4), o(2) and o(3). The type dissimilarity D_ab corresponding to the position dissimilarity D_ij previously used is defined as the number of characters by which trigrams a and b differ:

D = [[0,2,3], [2,0,2], [3,2,0]]

Under linear periodic navigation (2), the type exchange matrix and type proportions are

(ε_ab) = (1/8)·[[2,1,1], [1,0,1], [1,1,0]]   π = (1/4)·(2, 1, 1),

yielding inertias (7) Δ^A = 3/4 and Δ^A_loc = 7/8 with δ = −1/6, as already obtained.
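The type-level quantities can be checked in the same way by aggregating the position exchange matrix over types; the short sketch below (ours) reproduces (ε_ab), π and the inertias just given.

```python
import numpy as np

n, v = 4, 3
f = np.full(n, 1 / n)
T = np.roll(np.eye(n), 1, axis=1)                 # periodic linear reading
E = (f[:, None] * T + (f[:, None] * T).T) / 2     # position exchange matrix

o = np.array([0, 1, 2, 0])                        # types occupying positions 1..4
D_type = np.array([[0, 2, 3], [2, 0, 2], [3, 2, 0]], dtype=float)

# Type exchange matrix eps_ab = sum_ij e_ij 1(o(i)=a) 1(o(j)=b), margins pi_a.
eps = np.zeros((v, v))
for i in range(n):
    for j in range(n):
        eps[o[i], o[j]] += E[i, j]
pi = eps.sum(axis=1)

delta_glob = 0.5 * np.sum(np.outer(pi, pi) * D_type)   # = 3/4
delta_loc = 0.5 * np.sum(eps * D_type)                 # = 7/8
print(eps * 8, pi, round(1 - delta_loc / delta_glob, 4))   # delta = -1/6
```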

3 Case studies
The next sections present several case studies involving in particular chi-squared dissimilarities between composite types such as play lines, hypertext navigation, and semantic dissimilarities (further illustrations, including Markov iterations W → W^r, may be found in Bavaud et al. 2012). Unless otherwise stated, we use the "skimming" navigation model defined in section 2.1 (slightly adapted to handle border effects) and let the maximum length of jumps vary as r = 1, 2, 3, …, yielding autocorrelation indices δ[r] for neighbourhoods of size r, i.e. including r neighbours to the left and r neighbours to the right of each position. In particular, δ[1] constitutes a generalisation of the Durbin-Watson statistic.
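As an illustration of how such a δ[r] profile can be computed in practice, the following sketch (ours, with a simple renormalisation standing in for the adaptation of border effects) evaluates δ[r] for a numeric sequence and increasing neighbourhood sizes; in section 3.1 the sequence would be, for instance, the lengths of the successive lines of the play.

```python
import numpy as np

def delta_profile(x, r_max):
    """Autocorrelation index delta[r] of a numeric sequence x for r = 1..r_max,
    with uniform positional weights and skimming neighbourhoods of size r
    (border effects handled simply by renormalising each exchange matrix)."""
    x = np.asarray(x, dtype=float)
    n = len(x)
    f = np.full(n, 1 / n)
    D = (x[:, None] - x[None, :]) ** 2           # squared Euclidean dissimilarities
    glob = 0.5 * np.sum(np.outer(f, f) * D)      # global inertia
    out = []
    for r in range(1, r_max + 1):
        gap = np.abs(np.subtract.outer(np.arange(n), np.arange(n)))
        A = ((gap >= 1) & (gap <= r)).astype(float)
        E = A / A.sum()                          # symmetric exchange matrix, unit total
        loc = 0.5 * np.sum(E * D)                # local inertia
        out.append(1 - loc / glob)
    return out

print(delta_profile([3, 4, 5, 20, 22, 19, 2, 3], r_max=3))
```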

3.1 Autocorrelation between lines of a play

The play Sganarelle ou le Cocu imaginaire by Molière (1660) contains n = 207 lines declaimed by feminine or masculine characters (coded s_i = 1 or s_i = 0 respectively). Each line i is also characterised by the number of occurrences n_ik of each part-of-speech (POS) tag k = 1, …, p as assigned by TreeTagger (Schmid 1994); here p = 28. The first few rows and columns of the data are represented on Table 1.

Table 1: First rows and columns of the Sganarelle data

Position Gender # interjections # adverbs # verbs (present) …


1 1 1 2 1 …
2 0 1 11 20 …
3 1 1 0 0 …
4 0 3 15 15 …
… … … … … …

The following distances are considered:
– the length dissimilarity D_ij^length = (l_i − l_j)², where l_i := Σ_k n_ik is the total count of POS tags for line i
– the gender dissimilarity D_ij^gender = (s_i − s_j)² = 1(s_i ≠ s_j)
– the chi-square dissimilarity D_ij^χ associated to the 207 × 28 contingency table n_ik (see section 2.3 d for the corresponding type dissimilarity)
– the reduced chi-square dissimilarity D_ij^Rχ obtained after aggregating all POS tag counts into two supercategories, namely verbs and non-verbs.

The length autocorrelation index (Figure 3 left) reveals that the length of lines
tends to be strongly autocorrelated over neighborhoods of size up to 5: long
(short) lines tend to be surrounded at short range by long (short) lines. This might
reflect the play structure, which comprises long passages declaiming general
considerations on human condition, and more action-oriented passages, made of
shorter lines.

Fig. 3: Autocorrelation index δ[r] (circles), together with expected value (5) (solid line), and 95% confidence interval (6) (dashed lines). Left: length dissimilarity D^length. Right: gender dissimilarity D^gender.

Fig. 4: Autocorrelation index δ[r] (same setting as figure 3) for the chi-square dissimilarity D^χ (left) and the reduced chi-square dissimilarity D^Rχ (right).

The strong negative gender autocorrelation observed on figure 3 (right) for a


neighbourhood size of 1 shows that lines declaimed by characters of a given gen-
der have a clear tendency to be immediately followed by lines declaimed by char-
acters of the other gender, and vice-versa. The significant positive autocorrela-
tion for neighbourhoods of size 2 seems to be a logical consequence of this, as
well as the globally alternating pattern of the curve. Interestingly, the autocorre-
lation is always positive for larger neighbourhood sizes, which can be explained
by two observations: (i) overall, masculine lines are considerably more frequent
than feminine lines (64.7% vs. 35.3%); (ii) the probability of being followed by a
line of differing gender is much lower for masculine lines than for feminine ones
(44.4% vs. 82.2%). These factors concur to dominate the short-range preference
for alternation.
The POS tag profile of lines tends to exhibit no autocorrelation, although the
alignments observed in Figure 4 (left) are intriguing. The proportion of verbs (Fig-
ure 4 right) tends to be positively (but not significantly, presumably due to the
small size n = 207 of the sample) autocorrelated up to a range of 10, and nega-
tively autocorrelated for a range between 20 and 30 – an observation whose in-
terpretation requires further investigation.

3.2 Free navigation within documents

Let the text be partitioned into documents g = 1, …, m; i ∈ g denotes the membership of position i to document g, and ρ_g := Σ_{i∈g} f_i is the relative weight of document g. Consider now the free textual navigation confined within each document

t_ij = f_j^{g[i]} := 1(j ∈ g[i]) f_j / ρ_{g[i]},   (9)

where g[i] denotes the document to which position i belongs. The associated exchange matrix e_ij := Σ_{g=1}^m ρ_g f_i^g f_j^g is reducible, i.e. made out of m disconnected submatrices. Note that f obtains here as the margin of E, rather than as the stationary distribution of T, which is reducible and hence not regular. In this setup, the local inertia is nothing but the within-groups inertia

Δ_loc = Σ_g ρ_g Δ_g =: Δ_W    Δ_g := (1/2) Σ_{i,j} f_i^g f_j^g D_ij

and hence δ = Δ_B/Δ ≥ 0, where Δ_B = Δ − Δ_W = Σ_g ρ_g D_{g0} is the between-groups inertia, and D_{g0} is the dissimilarity between the centroid of group g and the overall centroid 0. Here δ, always non-negative, behaves as a kind of generalised F-ratio.
In practical applications, textual positional weights are uniform, and the free navigation within documents involves the familiar term-document matrix

n_ag := n_•• Σ_i f_i 1(o(i) = a) 1(i ∈ g)   (10)

with

f_i = 1/n_••,   f_i^g = 1(i ∈ g)/n_•g,   ρ_g = n_•g/n_••.   (11)

In particular, Δ, Δ_loc and δ can be computed from (7), where

ε_ab = Σ_g n_ag n_bg/(n_•• n_•g),   π_a = n_a•/n_••   (free within-documents navigation).   (12)

The significance of δ = (Δ − Δ_loc)/Δ can be tested by (6), where trace(W²) = trace(W) = m for the free within-documents navigation under the position permutation test (5).
When D_ab = (n_••/n_a• + n_••/n_b•) 1(a ≠ b) is the weighted discrete metric (section 2.3 c), the autocorrelation index turns out to be

δ = χ²/(n_•• (v − 1))   (13)

where v is the number of types and χ² the term-document chi-square.
Toy example 2: consider v = 7 types represented by Greek letters, whose n = n_•• = 20 occurrences possess the same weight f_i = 1/20, and are distributed among m = 4 documents as "ββγδ", "ααγε", "ααββ" and "ααααεζζη" (Figure 5). The term-document matrix, term weights and document weights read

(n_ag):
        g=1  g=2  g=3  g=4
   α     0    2    2    4
   β     2    0    2    0
   γ     1    1    0    0
   δ     1    0    0    0
   ε     0    1    0    1
   ζ     0    0    0    2
   η     0    0    0    1

π = (1/20)·(8, 4, 2, 1, 2, 2, 1),   ρ = (1/5, 1/5, 1/5, 2/5)   (14)

Consider three type dissimilarities, namely the "vowels" presence-absence dissimilarity D^A (section 2.3 a, with A = {α, ε} and A^c = {β, γ, δ, ζ, η}):

D^A = [[0,1,1,1,0,1,1], [1,0,0,0,1,0,0], [1,0,0,0,1,0,0], [1,0,0,0,1,0,0], [0,1,1,1,0,1,1], [1,0,0,0,1,0,0], [1,0,0,0,1,0,0]],
the discrete metric D^B (section 2.3 b):

D^B = [[0,1,1,1,1,1,1], [1,0,1,1,1,1,1], [1,1,0,1,1,1,1], [1,1,1,0,1,1,1], [1,1,1,1,0,1,1], [1,1,1,1,1,0,1], [1,1,1,1,1,1,0]],

and the weighted discrete metric D^C (section 2.3 c):

D^C = [[0, 7.5, 12.5, 22.5, 12.5, 12.5, 22.5],
       [7.5, 0, 15, 25, 15, 15, 25],
       [12.5, 15, 0, 30, 20, 20, 30],
       [22.5, 25, 30, 0, 30, 30, 40],
       [12.5, 15, 20, 30, 0, 20, 30],
       [12.5, 15, 20, 30, 20, 0, 30],
       [22.5, 25, 30, 40, 30, 30, 0]].

The corresponding values of global inertias (7), local inertias (12) and textual au-
tocorrelation δ are given in Table 2 below.
Sganarelle, continued: consider the distribution of the 961 nouns and 1'204 verbs of the play Sganarelle among the m = 24 scenes of the play, treated here as documents (section 3.1). The autocorrelation index (13) for nouns associated to the weighted discrete metric takes on the value δ^nouns = 0.0238, lower than the expected value (5) E_0(δ^nouns) = 0.0240. For verbs, one gets δ^verbs = 0.0198 > E_0(δ^verbs) = 0.0191. Although not statistically significant, the sign of the differences reveals a lexical content within scenes more homogeneous for verbs than for nouns. Finer analysis can be obtained from Correspondence Analysis (see e.g. Greenacre 2007), performing a spectral decomposition of the chi-square in (13).
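Relation (13) makes δ under free within-document navigation and the weighted discrete metric a direct function of the term-document chi-square; the sketch below (our illustration) applies it to the term-document matrix (14) of toy example 2 and recovers the value 0.18 reported in Table 2.

```python
import numpy as np

# Term-document matrix (14) of toy example 2 (rows: alpha..eta, columns: documents).
N = np.array([[0, 2, 2, 4],
              [2, 0, 2, 0],
              [1, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 1, 0, 1],
              [0, 0, 0, 2],
              [0, 0, 0, 1]], dtype=float)

n_tot = N.sum()
expected = np.outer(N.sum(axis=1), N.sum(axis=0)) / n_tot
chi2 = np.sum((N - expected) ** 2 / expected)     # term-document chi-square

v = N.shape[0]
delta = chi2 / (n_tot * (v - 1))                  # relation (13)
print(round(delta, 3))                            # approx. 0.18, cf. Table 2
```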

3.3 Hypertext navigation

Consider a set G of electronic documents g = 1, …, m containing hyperlinks attached to a set A of active terms, each active term a ∈ A being associated with a target document in G. A simple model of hypertext navigation consists in clicking at each position occupied by an active term, thus jumping to the target document, while staying in the same document when meeting an inactive term; in both cases, the next position j is selected as f_j^g in (11). This dynamics generates a document to document transition matrix Φ = (φ_gh), involving the term-document matrix n_ag (10), as

φ_gh := Σ_a (n_ag/n_•g) τ_(ag)h   (15)

where τ_(ag)h is the probability to jump from term a in document g to document h and obeys τ_(ag)• = 1. In the present setup, τ_(ag)h = 1 if h is the target document of a ∈ A, and τ_(ag)h = 1(h = g) for a ∉ A. Alternative specifications taking into account clicking probabilities, or contextual effects (of term a relatively to its background g), could also be cast within this formalism. The document-to-document transition matrix obeys φ_gh ≥ 0 and φ_g• = 1, and the broad Markovian family of hypertext navigations (15) generalizes specific proposals such as the free within-documents setup, or the Markov chain associated to the PageRank algorithm (Page 2001).
By standard Markovian theory (e.g. Grinstead and Snell 1998), each docu-
ment belongs to a single "communication-based" equivalence class, which is ei-
ther transient, i.e. consisting of documents eventually unattainable by lack of in-
coming hyperlinks, or recurrent, i.e. consisting of documents visited again and
again once the chain has entered into the class. The chain is regular iff it is ape-
riodic and consists of a single recurrent class, in which case its evolution con-
verges to the stationary distribution s of Φ obeying Σ_g s_g φ_gh = s_h, which differs in general from the document weights ρ_g = n_•g/n_••.
In the regular case, textual autocorrelation for type dissimilarities (section 2.3) can be computed by means of (7), where (compare with (12))

ε_ab = ε_(a•)(b•),   ε_(ag)(bh) := (n_ag n_bh/(n_•g n_•h)) · (1/2)[τ_(ag)h s_g + τ_(bh)g s_h]

and

π_a = ε_(a•)(••) = Σ_g (n_ag/n_•g) s_g ≠ n_a•/n_••   (hypertextual navigation).

Toy example 2, continued: let the active terms be A = {α, β, γ, δ}, with α, β, γ and δ pointing towards documents 1, 2, 3 and 4, respectively (Figure 5). The transition matrix (15) turns out to be regular. From (14), the document-document transition probability, its stationary distribution and the document weights are

Φ = [[0, 1/2, 1/4, 1/4], [1/2, 1/4, 1/4, 0], [1/2, 1/2, 0, 0], [1/2, 0, 0, 1/2]]   s = (1/3, 1/3, 1/6, 1/6)   ρ = (1/5, 1/5, 1/5, 2/5).

In a certain sense, hyperlink navigation magnifies the importance of each document g by a factor s_g/ρ_g, respectively equal to 5/3, 5/3, 5/6 and 5/12 for the m =

4 documents of toy example 2. Similarly, the term magnification factor n_•• π_a/n_a• is 1.04 for α, 0.83 for β, 1.67 for γ, 1.67 for δ, 1.04 for ε, 0.42 for ζ and 0.42 for η.
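The document-level dynamics of this toy example can be verified numerically; the following sketch (ours) assembles Φ from the term-document matrix (14) and the hyperlink targets, and recovers the stationary distribution s = (1/3, 1/3, 1/6, 1/6).

```python
import numpy as np

# Term-document matrix (14); rows: alpha, beta, gamma, delta, epsilon, zeta, eta.
N = np.array([[0, 2, 2, 4],
              [2, 0, 2, 0],
              [1, 1, 0, 0],
              [1, 0, 0, 0],
              [0, 1, 0, 1],
              [0, 0, 0, 2],
              [0, 0, 0, 1]], dtype=float)
m = N.shape[1]
target = {0: 0, 1: 1, 2: 2, 3: 3}    # active terms alpha..delta -> documents 1..4 (0-indexed)

# Document-to-document transition matrix (15).
Phi = np.zeros((m, m))
for g in range(m):
    for a in range(N.shape[0]):
        share = N[a, g] / N[:, g].sum()
        h = target.get(a, g)          # inactive terms keep the reader in document g
        Phi[g, h] += share

# Stationary distribution s of Phi (left eigenvector for eigenvalue 1).
eigvals, eigvecs = np.linalg.eig(Phi.T)
s = np.real(eigvecs[:, np.argmin(np.abs(eigvals - 1))])
s = s / s.sum()
print(Phi)
print(np.round(s, 4))                 # [0.3333 0.3333 0.1667 0.1667]
```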

Fig. 5: Hypertextual navigation (toy example 2) between m = 4 documents containing |A| = 4 active terms α, β, γ, and δ.

Table 2: (Toy example 2) terms autocorrelation is positive under free within-document navigation, but negative under hypertextual navigation. Here E_0(δ) refers to the term permutation test (8).

        free within-document navigation           hypertextual navigation
        Δ       Δ_loc    δ       E_0(δ)            Δ       Δ_loc    δ       E_0(δ)
D^A     0.25    0.18     0.28    0.16              0.25    0.36    -0.47   -0.10
D^B     0.38    0.31     0.20    0.16              0.39    0.48    -0.24   -0.10
D^C     6.00    4.94     0.18    0.16              6.15    6.90    -0.12   -0.10

Table 2 summarizes the resulting textual autocorrelation for the three term dis-
similarities already investigated: systematically, hyperlink navigation strongly
increases the terms heterogeneity as measured by the local inertia, since each of
the active terms α, β, γ and δ points towards documents not containing them.
Textual autocorrelation in "WikiTractatus": strict application of the above formalism to real data requires full knowledge of a finite, isolated network of m hyperlinked documents.

Fig. 6: Entry page of "WikiTractatus", constituting the document "les mots" (left). Free-within navigation weights ρ_g versus hypertextual weights s_g, logarithmic scale (right).

The site "WikiTractatus" is a "network of aphorisms" (in French) created by André


Ourednik (2010), containing 𝑛𝑛𝑛𝑛 = 27′ 172 words (tokens) distributed among 𝑣𝑣𝑣𝑣 =
5′ 487 terms and 𝑚𝑚𝑚𝑚 = 380 documents (Figure 6). Each document is identified by
a distinct active term, as in a dictionary notice, and hyperlinks connect each ac-
tive term in the corpus to the corresponding document. Also, each document con-
tains hyperlinks pointing towards distinct documents, with the exception of the
document "clavier" which contains no active terms. As a consequence, "clavier"
acts as an absorbing state of the Markov chain (15), and all remaining documents
are transient – as attested by the study of Φ^r, converging for r large towards a
null matrix except for a unit column vector associated to the document "clavier".
Suppressing document "clavier" together with its incoming hyperlinks makes the
Markov chain regular.
In contrast to Table 2, terms are positively autocorrelated under hypertextual
navigation on "WikiTractatus": with the discrete metric D and the term permutation test (8), one finds Δ = 0.495, Δ_loc = 0.484, δ = 0.023 and E_0(δ) = 0.014. In the same setup, the free within-documents navigation yields very close results, namely Δ = 0.496, Δ_loc = 0.484, δ = 0.024 and E_0(δ) = 0.015. Here both types
of navigation have identical effects on textual autocorrelation per se, but quite
different effects on the document (and type) relative weights (Figure 6 right).

3.4 Semantic autocorrelation

Semantic similarities have been systematically investigated in the last two dec-
ades, using in particular reference word taxonomies expressing "ontological" re-
lationships (e.g. Resnik 1999). In WordNet (Miller et al. 1990), words, and in par-
ticular nouns and verbs, are grouped into synsets, i.e. cognitive synonyms, and
each synset represents a different concept. Hyponymy expresses inclusion be-
tween concepts: the relation "concept c_1 is an instance of concept c_2" is denoted c_1 ≤ c_2, and c_1 ∨ c_2 represents the least general concept subsuming both c_1 and c_2. For instance, in the toy ontology of Figure 7, cat ≤ animal and cat ∨ dog = animal.

Fig. 7: Toy noun ontology made up of 7 concepts: numbers in bold are probabilities (16), num-
bers in italic are similarities (17), and the underlined number is the dissimilarity between bicy-
cle and car according to (18).

Based on a reference corpus (hereafter the Brown corpus, Kučera and Francis
1967), the probability p(c) of concept c can be estimated as the proportion of word tokens whose sense C(w) is an instance of concept c. Thus, representing the number of occurrences of word w by n(w),

p(c) := Σ_w n(w) 1(C(w) ≤ c) / Σ_w n(w)   (16)

Following Resnik (1999), a measure of similarity between concepts can then be defined as:

s(c_1, c_2) := − log p(c_1 ∨ c_2) ≥ 0   (17)

from which a squared Euclidean dissimilarity between concepts can be derived as (Bavaud et al. 2012):

D(c_1, c_2) := s(c_1, c_1) + s(c_2, c_2) − 2 s(c_1, c_2)   (18)

For instance, based on the probabilities given in Figure 7,

D(bicycle, car) = s(bicycle, bicycle) + s(car, car) − 2 s(bicycle, car)
= − log(0.2) − log(0.4) + 2 log(0.7) = 1.81
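The same computation written out as a minimal sketch; the concept probabilities are those read off Figure 7, and the name "vehicle" for the least general subsumer of bicycle and car is our assumption, only its probability 0.7 being given.

```python
import math

# Concept probabilities from the toy ontology of Figure 7.
# "vehicle" is an assumed name for bicycle v car; only p = 0.7 is given.
p = {"bicycle": 0.2, "car": 0.4, "vehicle": 0.7}

def resnik_sim(c):
    """Resnik similarity (17) of a concept with itself: -log p(c)."""
    return -math.log(p[c])

def dissim(c1, c2, lca):
    """Squared Euclidean dissimilarity (18), lca = least general subsumer c1 v c2."""
    return resnik_sim(c1) + resnik_sim(c2) - 2 * resnik_sim(lca)

print(round(dissim("bicycle", "car", "vehicle"), 2))   # 1.81
```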

According to TreeTagger (Schmid 1994), the short story The Masque of the
Red Death by Edgar Allan Poe (1842) contains 497 positions occupied by nouns
and 379 positions occupied by verbs. Similarities between nouns and between
verbs can be obtained using the WordNet::Similarity interface (Pedersen et al. 2004) – systematically using, in this case study, the most frequent sense of ambiguous concepts. Autocorrelation indices (for neighbourhoods of size r) calcu-
lated using the corresponding dissimilarities exhibit no noticeable pattern (Fig-
ure 8).

Fig. 8: Autocorrelation index δ[r] (same setting as Figure 3) in "The Masque of the Red Death"
for the semantic dissimilarity (18) for nouns (left) and for verbs (right).

This being said, the p-dimensional coordinates x_a entering in any squared Euclidean distance D_ab = ∥x_a − x_b∥² can be recovered by (weighted) multidimensional scaling (MDS) (e.g. Torgeson 1958; Mardia et al. 1979), yielding orthogonal factorial coordinates x_aα (for α = 1, …, v − 1) whose low-dimensional projections express a maximum proportion of (global) inertia.
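A minimal sketch of this recovery step (unweighted classical Torgerson scaling; the chapter uses the weighted variant): from a squared Euclidean dissimilarity matrix it returns coordinates whose pairwise squared distances reproduce the input, the leading columns expressing the largest share of inertia.

```python
import numpy as np

def classical_mds(D):
    """Factorial coordinates from a squared Euclidean dissimilarity matrix D
    (unweighted classical scaling)."""
    n = D.shape[0]
    J = np.eye(n) - np.ones((n, n)) / n          # centring matrix
    B = -0.5 * J @ D @ J                         # double-centred scalar products
    eigvals, eigvecs = np.linalg.eigh(B)
    order = np.argsort(eigvals)[::-1]            # decreasing eigenvalues
    eigvals, eigvecs = eigvals[order], eigvecs[:, order]
    keep = eigvals > 1e-12
    return eigvecs[:, keep] * np.sqrt(eigvals[keep])

# Example: the type dissimilarity matrix of toy example 1.
D = np.array([[0, 2, 3], [2, 0, 2], [3, 2, 0]], dtype=float)
X = classical_mds(D)
print(np.round(((X[:, None, :] - X[None, :, :]) ** 2).sum(-1), 3))   # reproduces D
```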

Fig. 9: Screeplot for the MDS on semantic dissimilarities for nouns (left). Factorial coordinates
xaα and proportion of explained inertia for α = 1,2 (right).

Fig. 10: Autocorrelation index δ[r] for nouns in the first semantic dimension (left) and in the sec-
ond semantic dimension (right).

The first semantic coordinate for nouns in The Masque of the Red Death (Figure 9)
clearly contrasts abstract entities such as horror, pestilence, disease, hour, mean,
night, vision, or precaution, on the left, with physical entities such as window, roof,
wall, body, victim, glass, or visage, on the right, respectively defined in WordNet
as "a general concept formed by extracting common features from specific exam-
ples" and "an entity that has physical existence". Figure 10 (left) shows this first
coordinate to be strongly autocorrelated, echoing long-range semantic persis-
tence, in contrast to the second coordinate (Figure 10 right), whose interpretation
is more difficult.

Fig. 11: Screeplot for the MDS on semantic dissimilarities for verbs (left). Factorial coordinates
xaα and proportion of explained inertia for α = 1,2 (right).

Fig. 12: Autocorrelation index δ[r] for verbs in the first semantic dimension (left) and in the sec-
ond semantic dimension (right)

For verbs (Figure 11), the first semantic coordinate differentiates stative verbs,
such as be, seem, or sound, from all other verbs, while the second semantic coor-
dinate differentiates the verb have from all other verbs. Figure 12 reveals that the
first coordinate is strongly autocorrelated, while the second coordinate is nega-
tively autocorrelated for neighbourhood ranges up to 2. Although the latter result
is not significant for α = 0.05 according to (6), it is likely due to the use of have
as an auxiliary verb in past perfect and other compound verb tenses.

4 Conclusions
In this contribution, we have introduced a unified formalism for textual autocor-
relation, i.e. the tendency for neighbouring textual positions to be more (or less)
similar than randomly chosen positions. This approach to sequence and text
analysis is based on two primitives: (i) neighbourhoodness between textual posi-
tions, as determined by a Markov model of navigation, and formally represented
by the exchange matrix 𝐸𝐸𝐸𝐸; and (ii) (dis-)similarity between positions, as encoded
in the (typically squared Euclidean) dissimilarity matrix 𝐷𝐷𝐷𝐷.
By varying 𝐸𝐸𝐸𝐸 and or 𝐷𝐷𝐷𝐷, the proposed formalism recovers and revisits well-
known statistical objects and concepts, such as the 𝐹𝐹𝐹𝐹-ratio, the chi-square and
Correspondence Analysis. It also gives a unified account of various representa-
tions commonly used for textual data analysis, in particular the sequential and
bag-of-words models, as well as the term-document matrix. It can also be ex-
tended to provide a model of hypertext navigation, where hyperlinks act as mag-
nifying (or reducing) glasses, modifying the relative weights of documents, and
altering (or not) textual autocorrelation.
This approach is applicable to any form of sequence and text analysis that
can be expressed in terms of dissimilarity between positions (or between types).
The presented case studies have aimed at illustrating this versatility by address-
ing lexical, morphosyntactic, and semantic properties of texts. As shown in the
latter case, squared Euclidean dissimilarities can be visualised and decomposed
into factorial components by multidimensional scaling; the textual autocorrela-
tion of each component can in turn be analysed and interpreted – yielding in par-
ticular a new means of dealing with semantically related problems.

References
Anselin, Luc. 1995. Local indicators of spatial association. Geographical Analysis 27(2). 93–
115.
Bavaud, François. 2013. Testing spatial autocorrelation in weighted networks: The modes per-
mutation test. Journal of Geographical Systems 15(3). 233–247.
Bavaud, François, Christelle Cocco & Aris Xanthos. 2012 Textual autocorrelation: Formalism
and illustrations. In Anne Dister, Dominique Longrée & Gérald Purnelle (eds.), 11èmes
journées internationales d'analyse statistique des données textuelles, 109–120. Liège :
Université de Liège.
Bavaud, François & Aris Xanthos. 2005. Markov associativities. Journal of Quantitative Linguis-
tics 12(2-3). 123–137.
Cliff, Andrew D. & John K. Ord. 1981. Spatial processes: Models & applications. London: Pion.
Cressie, Noel A.C. 1991. Statistics for spatial data. New York: Wiley.
Greenacre, Michael. 2007. Correspondence analysis in practice, 2nd edn. London: Chapman
and Hall/CRC Press.
Grinstead, Charles M. & J. Laurie Snell. 1998. Introduction to probability. American Mathemati-
cal Society.
Kučera, Henry & W. Nelson Francis. 1967. Computational analysis of present-day American
English. Providence: Brown University Press.
Lebart, Ludovic. 1969. Analyse statistique de la contigüité. Publication de l'Institut de Statis-
tiques de l'Université de Paris 18. 81–112.
Le Roux, Brigitte & Henry Rouanet. 2004. Geometric data analysis. Kluwer: Dordrecht.
Mardia, Kanti V., John T. Kent & John M. Bibby. 1979. Multivariate analysis. New York: Aca-
demic Press.
Miller, George A., Richard Beckwith, Christiane Fellbaum, Derek Gross & Katherine Miller.
1990. WordNet: An on-line lexical database. International Journal of Lexicography 3(4).
235–244.
Moran, Patrick A. P. 1950. Notes on continuous stochastic phenomena. Biometrika
37(1-2). 17–23.
Ourednik, André. 2010. Wikitractatus. http://wikitractatus.ourednik.info/ (accessed December
2012).
Page, Lawrence. 2001. Method for node ranking in a linked database. U.S. Patent No
6,285,999.
Pedersen, Ted, Siddharth Patwardhan & Jason Michelizzi. 2004. WordNet:Similarity – Measur-
ing the relatedness of concepts. In Susan Dumais, Daniel Marcu & Salim Roukos (eds.),
Proceedings of HLT-NAACL 2004: Demonstration Papers, 38–41. Boston: Association for
Computational Linguistics.
Resnik, Philip. 1999. Semantic similarity in a taxonomy: An information-based measure and its
application to problems of ambiguity in natural language. Journal of Artificial Intelligence
Research 11. 95–130.
Schmid, Helmut. 1994. Probabilistic part-of-speech tagging using decision trees. Proceedings
of international conference on new methods in language processing. Manchester, UK.
Torgeson, Warren S. 1958. Theory and methods of scaling. New York: Wiley.
Martina Benešová and Radek Čech
Menzerath-Altmann law versus random model

1 Introduction
The Menzerath-Altmann law belongs among the most frequently used and best empirically corroborated linguistic laws. It was first enunciated by Menzerath (1928), later given its mathematical formulation by Altmann (1980), and recently the work of Hřebíček (1995), Andres (2010) and Andres et al. (2012) has shown it to be closely related to the fractal quality which can be observed in texts. On the basis of experiments already carried out, in which we tried to examine the fractality of a text, we observed (Andres – Benešová 2011, Benešová 2011) that the obtained results differ significantly depending on the manner in which the units of the sequence segmentation had been chosen and set; for example, the word was grasped graphically as a sequence of graphemes between two blanks in a sentence, or it was regarded as a sound unit, semantic unit and so on. Importantly, although all the used manners of segmentation are substantiated linguistically, the differences in the results (e.g., the value of the parameter b in formula (1)) are so striking that an analysis of the relationship between the segmentation and the character of the model representing the Menzerath-Altmann law is strongly needed, in our opinion.
As the first step, we decided to scrutinize this relationship by using a random
model of data building based on an original text sample. Surprisingly enough,
despite the fact that testing models representing real systems of any kind by ran-
dom models is quite usual in empirical science, the Menzerath-Altmann law has
been tested by a random model only once in linguistics (Hřebíček, 2007), to our knowledge1. However, this test is not supported by a sufficient number of experiments. In sum, we pursue two aims. Firstly, the validity (or non-validity) of
the Menzerath-Altmann law in the text segmented randomly will be tested; sec-
ondly, because three random models are used (cf. Section 3) the character of re-
sults of all models with regard to the real text will be analyzed.


1 In biology, Baixeries et al. (2012) tested the Menzerath-Altmann law by a random model.

2 The Menzerath-Altmann law


The Menzerath-Altmann law (MAL) enunciates that “the longer the language con-
struct is, the shorter its constituents are”. The construct is considered to be a unit
of language (for the purpose of our experiment we chose the sentence) which is
constituted by its constituents (here the clauses); the constituent has to be meas-
urable in terms of the units on the immediately lower linguistic level which con-
stitute it (here the words). Specifically, the MAL applied in our experiment pre-
dicts that the longer the sentence is, the shorter its clauses are; the length of the
clause is computed in the number of words. It should be emphasized that the MAL
model is stochastic.
The MAL can be mathematically formulated in a truncated formula of a
power model as follows:

y = A · x^(−b)   (1)

where x is the length of the construct measured in the number of its constituents (in our experiments it is the length of sentences measured in the number of their clauses; x ∈ N), y is the average length of the constituents measured in the units of the immediately lower language level, i.e. in the constituent's own constituents (in our experiment it is the average length of clauses measured in the number of words they consist of; y ∈ Q), and A, b are real parameters (with A > 0). Graphically, parameter A determines how far from the x-axis the graph representing the particular MAL realization is positioned. Parameter b is responsible, firstly, for the steepness of the curve and, secondly, and more essentially, for its falling or rising tendency (b > 0 or b < 0, respectively). As the MAL expresses an inversely proportional relation between the length of the construct and the length of its constituents, parameter b is required to be a positive real number so that it reflects the substance of the MAL. According to Kelih (2010), parameter b of the MAL can clearly depend on the text type.

3 Random Models
Despite the fact that the MAL is considered to be one of the best corroborated
linguistic laws, one can ask whether the relationships between units engaged in
the MAL are not a result of a random process. If so, the law would be put into
question and, obviously, the observed relationships between constructs and con-
stituents would be explained differently. The most striking example of using the

random model and its impact on the explanation of language properties is Man-
delbrot’s (1953) and Miller’s (1957) interpretations of Zipf’s law which triggered a
debate (still unending) about validity/non-validity of Zipf’s law, on one hand,
and about the character of the random model (e.g., Miller, Chomsky 1963; Li 1992;
Cohen et al. 1997; Ferrer i Cancho, Solé 2002; Mitzenmacher 2003; Ferrer i Can-
cho, Elvevåg 2010). Further, random models have played an important role also
in the theory of complex networks (c.f., Barabási, Albert 1999, Newman 2011)
which has been used for analysis of language, too (c.f., Ferrer i Cancho 2010; Bib-
liography on Linguistic and Cognitive Networks2).
The main problem with the use of random models lies in the determination of randomness. Like any property of reality, randomness is not simply "given"; it is a concept which is defined with regard to the goal of the analysis under consideration. Consequently, there is no single "right" definition of randomness; it is always a matter of critical discussion and, therefore, any random model must be properly justified and described in detail. In our study, three random models are used; their characteristics are presented in the next section.

4 Methodology and language material


All random models presented in this study respect some characteristics of the real text under consideration. We used three different models because each model takes into account different text characteristics and defines randomness differently. In all the models, both the number of constructs and the number of constituents correspond with the real text. Further, the constructs are represented by sentences and the constituents by clauses; the length of a sentence is counted in the number of clauses of which it is composed, and the length of a clause is counted in the number of words which constitute it. The sentence is defined as a sequence of words which ends with a full stop, the clause as a unit containing a predicate represented by a finite verb, and the word is defined graphically as a sequence of letters between spaces. In all analyses, the properties of the constructs, i.e. their total number and the number of clauses which a particular sentence contains (i.e. its length), correspond with the real text. Only the constituents, namely their lengths, are subject to randomization.
In the first model (M1), the relative frequency of each constituent length is
computed first. (For illustration, let us suppose that we have a text containing ten


2 http://www.lsi.upc.edu/~rferrericancho/linguistic_and_cognitive_networks.html

clauses; the length of five clauses equals 6, the length of three clauses equals 5, and the length of two clauses equals 3. The relative frequency of clauses with length 6 is then 0.5, with length 5 it is 0.3, and with length 3 it is 0.2.) Then the length of each clause is generated randomly as follows: for each clause, the number expressing its length is drawn randomly with the probability given by the observed relative frequencies.
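As an illustration only (a sketch with our own variable names, not the authors' original script), M1 might be implemented in R along the following lines, using the ten-clause toy example from the parenthesis above:

    # Sketch of M1: each clause length is redrawn with probabilities equal to the
    # observed relative frequencies of clause lengths in the real text.
    set.seed(1)                                   # only to make the sketch reproducible
    observed <- c(6, 6, 6, 6, 6, 5, 5, 5, 3, 3)   # toy clause lengths (relative frequencies 0.5, 0.3, 0.2)
    probs <- table(observed) / length(observed)   # observed relative frequencies of each length
    m1 <- sample(as.numeric(names(probs)), size = length(observed),
                 replace = TRUE, prob = probs)    # one randomly generated length per clause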
The second model (M2) is a slight modification of the M1. Although the prob-
abilities of individual lengths of clauses correspond with those in the real text in
the M1, the total number of words generated in the M1 does not, i.e., the sum of
lengths of clauses in the M1 is not equal to the total number of words in the real
text. Therefore, the algorithm of the M2 follows that of the M1 and it is extended
by a condition which requires that the sum of lengths of clauses in the M2 has to
correspond to the total number of words in the real text (technically, a loop is
added to the program which repeats the generation of random numbers until the
condition is fulfilled).
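Continuing the sketch above, the additional condition of M2 can be realized, for instance, as a simple rejection loop (again only an illustrative assumption; such a loop may be slow for long texts):

    # Sketch of M2: repeat the M1 generation until the sum of the generated clause
    # lengths equals the total number of words in the real text.
    target_total <- sum(observed)
    repeat {
      m2 <- sample(as.numeric(names(probs)), size = length(observed),
                   replace = TRUE, prob = probs)
      if (sum(m2) == target_total) break          # keep only a sample matching the word total
    }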
The third model (M3) is created by the following procedure3: firstly, to each
clause one word is assigned, i.e., all clauses have the length equal to one; fur-
thermore, in each step another word is added to a randomly chosen clause, i.e.,
its length is increased by one. This procedure ends when the total number of
words in the real text is used in the model. For the generation of random numbers
the free statistical software R (http://www.r-project.org/) was used.
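A minimal sketch of the M3 procedure, reusing the toy vector from the M1 sketch (our own illustration, not the original program):

    # Sketch of M3: every clause starts with one word; the remaining words are then
    # added one by one to randomly chosen clauses until the real total is reached.
    n_clauses <- length(observed)
    m3 <- rep(1, n_clauses)
    for (i in seq_len(sum(observed) - n_clauses)) {
      j <- sample.int(n_clauses, 1)               # choose a clause at random
      m3[j] <- m3[j] + 1                          # increase its length by one word
    }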
As for the language material, Václav Havel’s essay Moc bezmocných (The
Power of the Powerless)4 was analyzed. The text was segmented into sentences
and clauses manually.

5 Results
As mentioned above, the input data, which were later processed in a random way, were obtained from Václav Havel's essay Moc bezmocných (The Power of the Powerless). Table 1 shows the figures which result from quantifying the original sample text.


3 This model was proposed by Ján Mačutek; we appreciate that he allowed us to use it in our
study and we thank him for a fruitful discussion and help with programming.
4 Available at: http://vaclavhavel.cz/showtrans.php?cat=eseje&val=2_eseje.html&typ=HTML

Table 1: The original sample text (Václav Havel’s essay The Power of the Powerless): x the
length of sentences (in clauses) – z their frequency – y the mean length of clauses (in words).

x z y

1 408 12.6373
2 260 9.5481
3 176 8.7992
4 94 8.8005
5 57 9.0140
6 20 7.6583
7 14 7.9388
8 4 10.2813
9 7 6.6984
10 4 8.1250
11 2 7.0000
14 1 9.0000

The data shown in Table 1 (see footnote 5) were processed mathematically and statistically, as demonstrated e.g. in Andres et al. (2012). We obtained the parameters of the MAL as A = 11.9311 and b = 0.2235 (see footnote 6); for more information on the MAL parameters cf. e.g. Cramer (2005). The value of parameter b shows that the empirically obtained data of this particular sample behave in accordance with the original assumption of the MAL, i.e. the relation between the length of constructs (sentences) and the length of their constituents is inversely proportional. The graph in Figure 1 is presented as evidence.

The coefficient of determination, which shows the degree of the goodness-of-fit of the mathematically obtained model to the empirically collected data, is R2=0.8737, which is satisfactory for our purposes. The 95% confidence interval of parameter b, i.e. the interval in which we can expect to find parameter b with 95% certainty, is 〈0.1258; 0.3212〉 (see footnote 7). As can be seen, the whole confidence interval


5 The data points with x≥8 were omitted in further data processing due to their low frequency
compared to the frequencies of other data points.
6 For estimating all the values of the MAL parameters A and b in this paper, linear regression and the method of least squares were used. The data were processed and the outcomes obtained by means of the free statistical software R.
7 All the confidence intervals in this paper were calculated by means of the free statistical soft-
ware R.

is a subset of the set of positive real numbers, so the fundamental MAL assumption is satisfied with 95% certainty.

Fig. 1: Plotted data points of the observations presented in Table 1 and the graphical display of
the goodness-of-fit of the MAL model (R2=0.8737).
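For illustration, the estimation described in footnotes 5 and 6 can be sketched in R as follows; the snippet is our own reconstruction under the stated assumptions (unweighted least squares on log-transformed data, data points with x ≥ 8 omitted), not the authors' original script:

    # Sketch: least-squares estimation of the MAL parameters on log-transformed data.
    # x = sentence lengths 1..7, y = mean clause lengths from Table 1 (x >= 8 omitted, cf. footnote 5).
    x <- 1:7
    y <- c(12.6373, 9.5481, 8.7992, 8.8005, 9.0140, 7.6583, 7.9388)
    fit <- lm(log(y) ~ log(x))       # log(y) = log(A) - b * log(x)
    A <- exp(coef(fit)[1])           # approximately 11.93
    b <- -coef(fit)[2]               # approximately 0.224
    summary(fit)$r.squared           # approximately 0.87
    confint(fit)                     # 95% confidence interval of the slope (i.e. of -b)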

As is obvious from the above, the empirically obtained input data follow the MAL in the outlined manner. Additionally, the data in Table 1 were randomized to obtain models M1, M2 and M3. To get a greater variety of data, we constructed five different random samples for each model, cf. Tables 4, 5 and 6 in the Appendix. The results (parameters A, b and the coefficients of determination R2) are presented in Tables 2a, 2b and 2c, respectively.

Table 2a: Parameters A, b and the coefficients of determination R2 extracted from the random
samples constructed in terms of Model 1.

random sample parameter A parameter b coefficient of determination R2

1 9.7072 -0.0094 0.3840


2 9.6326 -0.0186 0.1126
3 9.7184 0.0119 0.4340
4 9.1361 -0.0304 0.3039
5 9.6514 0.0484 0.1539

Table 2b: Parameters A, b and the coefficients of determination R2 extracted from the random
samples constructed in terms of Model 2.

random sample parameter A parameter b coefficient of determination R2

1 9.4904 -0.0051 0.0420


2 10.0553 0.0456 0.5196
3 9.6084 0.0131 0.1500
4 9.7572 0.0212 0.4771
5 9.5234 0.0011 0.0050

Table 2c: Parameters A, b and the coefficients of determination R2 extracted from the random
samples constructed in terms of Model 3.

random sample parameter A parameter b coefficient of determination R2

1 9.4432 -0.0162 0.3492


2 9.5629 -0.0038 0.1160
3 9.5145 -0.0072 0.1705
4 9.6773 0.0107 0.1268
5 9.4149 -0.0190 0.2218

It is obvious that the MAL assumption is satisfied, due to a positive parameter b, only in the following cases: M1 – random sample 3; M2 – random samples 2, 3, 4, 5; M3 – random sample 4. Their parameters b are all positive real numbers and, in addition, the tendency of the relation between the construct length and the length of its constituents is inversely proportional. However, only in two random samples do the respective coefficients of determination approach or exceed fifty percent, namely M2 – random samples 2 and 4 with R2=0.5196 and R2=0.4771; even these values are, nevertheless, still lower than in the original real text sample. Figures 2a and 2b illustrate the two random samples with the highest and lowest degrees of goodness-of-fit.

Fig. 2a: Model 2 – random sample 2 (the coefficient of determination R2=0.5196).

Fig. 2b: Model 2 – random sample 5 (the coefficient of determination R2=0.0050).



We can also compute the confidence intervals for all of these random samples and check in this way whether they really satisfy the MAL assumption with the given 95% certainty. The confidence intervals for these cases are displayed in Table 3. Only those cases whose parameters b are positive real numbers are listed. The confidence intervals prove that the MAL assumption is not satisfied with 95% certainty, because the lower bound in all the random samples extends to negative real numbers.

Table 3: The random samples with b>0 supplied with the respective coefficients of determina-
tion and confidence intervals.

Model random sample coefficient of determination R2 confidence interval
1 3 0.4340 〈−0.0521; 0.0758〉

2 2 0.5196 〈−0.0048; 0.0961〉


2 3 0.1500 〈−0.1093; 0.1355〉
2 4 0.4771 〈−0.0043; 0.0468〉
2 5 0.0050 〈−0.0514; 0.0536〉
3 4 0.1268 〈−0.0216; 0.0430〉

6 Conclusion
The results of the experiment reveal that the data generated by the random models do not fulfil the MAL. Consequently, the results can be viewed as another argument supporting the assumption that the MAL expresses one of the important mechanisms controlling human language behaviour.
Secondly, we wanted to explore which method of random modelling constructs data that come closest to the MAL. We found that the largest number of models with properties close to the original text sample in terms of the MAL is produced by M2; this result is not very surprising, because M2 shares more characteristics with the original text than the other two random models.

Acknowledgments
Martina Benešová’s and Radek Čech's contribution is supported by the project
CZ.1.07/2.3.00/30.0004 POSTUP and the project Linguistic and lexicostatistic
analysis in cooperation of linguistics, mathematics, biology and psychology,
grant no. CZ.1.07/2.3.00/20.0161, which is financed by the European Social Fund
and the National Budget of the Czech Republic, respectively.

References
Altmann, Gabriel. 1980. Prolegomena to Menzerath’s law. In Rüdiger Grotjahn (ed.), Glot-
tometrika 2, 1–10. Bochum: Brockmeyer.
Andres, Jan. 2010. On a conjecture about the fractal structure of language. Journal of Quantita-
tive Linguistics 17(2). 101–122.
Andres, Jan & Martina Benešová. 2011. Fractal analysis of Poe’s Raven. Glottometrics 21. 73–
100.
Andres, Jan, Martina Benešová, Lubomír Kubáček & Jana Vrbková. 2012. Methodological note on the fractal analysis of texts. Journal of Quantitative Linguistics 19(1). 1–31.
Baixeries, Jaume, Antoni Hernández-Fernández & Ramon Ferrer-i-Cancho. 2012. Random models of Menzerath-Altmann law in genomes. BioSystems 107(3). 167–173.
Barabási, Albert-László & Réka Albert. 1999. Emergence of Scaling in Random Networks. Sci-
ence 286(5439). 509-512.
Benešová, Martina. 2011. Kvantitativní analýza textu se zvláštním zřetelem k analýze fraktální
[Quantitative analysis of text with special respect to fractal analysis]. Olomouc: Palacký
University dissertation.
Cohen, Avner, Rosario N. Mantegna & Shlomo Havlin. 1997. Numerical analysis of word fre-
quencies in artificial and natural language texts. Fractals 5(1). 95–104.
Cramer, Irene M. 2005. The parameters of the Altmann-Menzerath law. Journal of Quantitative
Linguistics 12(1). 41–52.
Ferrer i Cancho, Ramon. 2010. Network theory. In Patrick Colm Hogan (ed.), The Cambridge en-
cyclopaedia of the language sciences, 555–557. Cambridge: Cambridge University Press.
Ferrer i Cancho, Ramon & Ricard V. Solé. 2002. Zipf’s law and random texts. Advances in Com-
plex Systems 5(1). 1–6.
Ferrer i Cancho, Ramon & Brita Elvevåg. 2010. Random texts do not exhibit the real Zipf’s law-
like rank distribution. PLoS ONE 5(3). e9411.
Hřebíček, Luděk. 1995. Text levels. Language constructs, constituents and the Menzerath-Alt-
mann law. Trier: WVT.
Hřebíček, Luděk. 2007. Text in semantics. The principle of compositeness. Praha: Oriental Insti-
tute of the Academy of Sciences of the Czech Republic.
Kelih, Emmerich. 2010. Parameter interpretation of the Menzerath law: Evidence from Serbian.
In Peter Grzybek, Emmerich Kelih & Ján Mačutek (eds.), Text and Language. Structures,
functions, interrelations, quantitative perspectives, 71–79. Wien: Praesens.

Li, Wentian. 1992. Random texts exhibit Zipf’s-law-like word frequency distribution. IEEE
Transactions on Information Theory 38(6). 1842–1845.
Mandelbrot, Benoit. 1953. An informational theory of the statistical structure of language. In
Willis Jackson (ed.), Communication theory, 486–504. London: Butterworths.
Menzerath, Paul. 1928. Über einige phonetische Probleme. In Actes du premier congrés inter-
national de linguistes, 104–105. Leiden: Sijthoff.
Miller, George A. 1957. Some effects of intermittent silence. The American Journal of Psychol-
ogy 70(2). 311–314.
Miller, George A. & Noam Chomsky. 1963. Finitary models of language users. In R. Duncan
Luce, Robert R. Bush & Eugene Galanter (eds.), Handbook of mathematical psychology,
419–491. New York: Wiley.
Mitzenmacher, Michael. 2003. A brief history of generative models for power law and lognor-
mal distributions. Internet Mathematics 1(2). 226–251.

Appendix
Table 4: Model 1 – random samples 1–5: x the length of sentences (in clauses) – z their frequency – y the mean length of clauses (in words).

Random sample 1 Random sample 2 Random sample 3

x z y x z y x z y


1 408 9.9975 1 408 9.8701 1 408 9.8529
2 260 9.5000 2 260 9.6115 2 260 9.5135
3 176 9.3977 3 176 9.3598 3 176 9.7973
4 94 10.2154 4 94 10.2872 4 94 9.4601
5 57 9.5754 5 57 9.5193 5 57 8.8632
6 20 10.0417 6 20 10.2750 6 20 9.6500
7 14 10.0408 7 14 10.0918 7 14 9.9592

Random sample 4 Random sample 5

x z y x z y


1 408 9.1152 1 408 9.0613
2 260 9.1885 2 260 9.6077
3 176 9.5663 3 176 9.5000
4 94 9.5319 4 94 9.7261
5 57 9.7649 5 57 9.0526
6 20 10.0833 6 20 9.3417
7 14 9.1531 7 14 7.5918

Table 5: Model 2 – random samples 1–5: x the length of sentences (in clauses) – z their frequency – y the mean length of clauses (in words).

Random sample 1 Random sample 2 Random sample 3

x z y x z y x z y


1 408 9.3676 1 408 10.1838 1 408 9.4853
2 260 9.7327 2 260 9.7462 2 260 9.4635
3 176 9.6705 3 176 9.2405 3 176 10.1477
4 94 9.9548 4 94 9.6649 4 94 9.2872
5 57 8.8035 5 57 8.9298 5 57 9.5228
6 20 9.1000 6 20 9.6250 6 20 8.1917
7 14 10.2959 7 14 9.2449 7 14 10.2449

Random sample 4 Random sample 5

x z y x z y


1 408 9.7206 1 408 9.2696
2 260 9.5808 2 260 9.8904
3 176 9.4678 3 176 9.5473
4 94 9.6702 4 94 9.7128
5 57 9.6035 5 57 9.4281
6 20 9.3833 6 20 9.0333
7 14 9.1429 7 14 9.7245

Table 6: Model 3 – random samples 1–5: x the length of sentences (in clauses) – z their frequency – y the mean length of clauses (in words).

Random sample 1 Random sample 2 Random sample 3

x z y x z y x z y


1 408 9.5490 1 408 9.7010 1 408 9.5588
2 260 9.4923 2 260 9.4038 2 260 9.5731
3 176 9.5303 3 176 9.4091 3 176 9.6174
4 94 9.4229 4 94 9.7473 4 94 9.4309
5 57 9.8000 5 57 9.6175 5 57 9.5404
6 20 9.9167 6 20 10.0083 6 20 9.6750
7 14 9.7143 7 14 9.3776 7 14 9.7959

Random sample 4 Random sample 5


x z y x z y
1 408 9.6471 1 408 9.4314
2 260 9.7308 2 260 9.8212
3 176 9.5341 3 176 9.3277
4 94 9.2739 4 94 9.5239
5 57 9.6175 5 57 9.5754
6 20 9.7667 6 20 9.6583
7 14 9.3061 7 14 10.1327
Radek Čech
Text length and the lambda frequency
structure of a text

1 Frequency structure of a text


The frequency structure of a text is usually considered to be one of the im-
portant aspects for the evaluation of language usage, and at first sight it does
not seem to be a very problematic concept. The type-token ratio or several other
methods for the measurement of vocabulary “richness” are usually used for
analyses of this kind (e.g. Weizman 1971; Tešitelová 1972; Ratkowsky et al. 1980;
Hess et al. 1986, 1989; Richards 1987; Tuldava 1995; Wimmer, Altmann 1999;
Müller 2002; Wimmer 2005; Popescu 2007; Popescu et al. 2009; Martynenko
2010; Covington, McFall 2010). However, all approaches face the problem of the
undesirable impact of text length. Obviously, if one’s aim is to compare two or
more texts with regard to their frequency structure (e.g. for a comparison of
authors or genres), one has to somehow eliminate the role of text length. The
simplest method is to use only a part of the text (e.g. the first 100 or 1000 words)
for the comparison. However, this method neglects the “integrity” of the text;
for example, if the author has the intention of writing a longer text, (s)he can
consciously use many new words at the end of the text which can radically
change the frequency structure. Consequently, the cutting of the text cannot be
considered a suitable method for the measurement of frequency structure and
only entire texts should be analyzed.
Another way to eliminate the impact of text length consists in some kind of
normalization. One of the most recent attempts at this approach was proposed
by Popescu, Mačutek, Altmann (2010), and in more detail by Popescu, Čech,
Altmann (2011). The authors introduced a method based on the properties of the
arc length, defined as
L = \sum_{r=1}^{V-1} \left[ (f_r - f_{r+1})^2 + 1 \right]^{1/2}    (1)

where f_r are the ordered absolute word frequencies (r = 1, 2, ..., V − 1) and
V is the highest rank (= vocabulary size), i.e. L consists of the sum of Euclidean
distances between ranked frequencies. The frequency structure is measured by
the so-called lambda indicator

\Lambda = \frac{L \cdot \log_{10} N}{N}    (2)
The authors state that this indicator successfully eliminates the impact of
text length, meaning that it can be used for a comparison of genre, authorship,
authorial development, language typology and so on; an analysis of 1185 texts
in 35 languages appears to prove the independence between lambda and text
length (cf. Popescu, Čech, Altmann 2011, p. 10 ff.).
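For concreteness, a minimal R sketch of how L and Λ can be computed from a vector of word tokens (tokenization, variable names and the toy data are our own illustrative assumptions, not the authors' code):

    # Sketch: arc length L (formula (1)) and the lambda indicator (formula (2)).
    tokens <- c("the", "longer", "the", "text", "the", "more", "it", "develops")  # placeholder tokens
    f <- as.numeric(sort(table(tokens), decreasing = TRUE))  # ranked absolute frequencies f_1 >= f_2 >= ...
    N <- length(tokens)                                      # text length in tokens
    L <- sum(sqrt(diff(f)^2 + 1))                            # Euclidean distances between neighbouring ranks
    lambda <- L * log10(N) / N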
However, if particular languages are analyzed separately, a dependence of
lambda on text length emerges, as presented in this study. Moreover, this study
reveals that the relationship between lambda and text length is not straightfor-
ward, unlike the relationship between text length and type-token ratio. Further,
the study presents a method for the empirical determination of the interval in
which lambda should be independent of text length; within this interval lambda
could be used for its original purpose, i.e. comparison of genre, authorship,
authorial development etc. (cf. Popescu et al. 2011).

2 The analysis of an individual novel


The analysis of the relationship between lambda and text length within an indi-
vidual novel has the following advantages; first, a maximum of boundary condi-
tions (e.g. authorship, genre, year) which can influence the relationship be-
tween observed characteristics is eliminated; second, the novel can be viewed
to a certain extent as a homogeneous entity with regard to its theme and style.
Consequently, one can expect that the novel represents one of the best types of
material for the observation of the relationship between frequency characteris-
tics and text length. Since the majority of novels are divided into chapters, the
length of the chapter and lambda can be taken as parameters. However, for an
analysis of this kind one has to use a very long novel with many chapters of
different lengths.
This analysis uses the Czech novel The Fateful Adventures of the Good Sol-
dier Švejk During the World War, written by Jaroslav Hašek, and Oliver Twist,
written by Charles Dickens. The novels are very long (N(Hašek) = 199,120;
N(Dickens) = 159,017) and contain 27 (for Hašek) and 53 (for Dickens) chapters
of different lengths; the length of the chapters lies in the interval N(Hašek) ∈
<1289, 18,550>, N(Dickens) ∈ <998, 5276>.
Figure 1 and Figure 2, and Table 1 and Table 2, present the relationships be-
tween the lambda of individual chapters and their length. Obviously, lambda

decreases depending on text length. Moreover, if one observes the relationship between lambda and cumulative N (cumulative values of N are added chapter by chapter), the tendency of lambda to decrease with text length is even more evident, as is shown in Figure 3 and Figure 4 and Table 3 and Table 4.

Fig. 1: The relationship between lambda of individual chapters and their length in the novel The
Fateful Adventures of the Good Soldier Švejk During the World War written by Jaroslav Hašek.

Table 1: The length and lambda of individual chapters in the novel The Fateful Adventures of the
Good Soldier Švejk During the World War written by Jaroslav Hašek

Chapter N Lambda Chapter N Lambda Chapter N Lambda


1 2810 1.6264 10 6191 1.6246 19 7092 1.7107
2 2229 1.7761 11 2530 1.8599 20 12183 1.4956
3 1725 1.8092 12 2007 1.7584 21 16426 1.5733
4 1289 1.7880 13 3914 1.7578 22 15195 1.5524
5 2103 1.8084 14 12045 1.6473 23 14986 1.6202
6 2881 1.7484 15 3342 1.7310 24 14852 1.5521
7 1458 1.8587 16 5500 1.6894 25 7504 1.7106
8 4610 1.7812 17 18550 1.5772 26 2734 1.8112
9 6095 1.6935 18 17009 1.5685 27 11860 1.6319

Fig. 2: The relationship between lambda of individual chapters and their length in the novel
Oliver Twist written by Charles Dickens.

Table 2: The length and lambda of individual chapters in the novel Oliver Twist written by
Charles Dickens.

Chapter N Lambda Chapter N Lambda Chapter N Lambda


1 1124 1.5275 19 3423 1.2748 37 3617 1.3137
2 3976 1.2629 20 3000 1.2772 38 3601 1.2882
3 3105 1.3246 21 2197 1.4481 39 5276 1.2553
4 2589 1.3688 22 2493 1.3979 40 2577 1.1221
5 4073 1.3375 23 2740 1.3156 41 3595 1.2040
6 1747 1.4540 24 1998 1.4026 42 3720 1.1990
7 2341 1.3413 25 2259 1.3994 43 3784 1.2151
8 3282 1.3365 26 4526 1.3238 44 2383 1.2170
9 2348 1.2879 27 2518 1.3819 45 1228 1.3474
10 1818 1.4883 28 3438 1.3369 46 3588 1.2435
11 2627 1.3294 29 1394 1.4498 47 2530 1.2529
12 3433 1.3256 30 2360 1.3538 48 3488 1.4025
13 2841 1.4036 31 3989 1.1882 49 3579 1.2004
14 4002 1.1101 32 3377 1.2650 50 4275 1.3574
15 2378 1.3504 33 3295 1.2828 51 4858 1.1045
16 3538 1.3210 34 3729 1.2423 52 3322 1.2684
17 3232 1.2858 35 2823 1.2438 53 1562 1.5089
18 3013 1.4018 36 998 1.4436

Fig. 3: The relationship between lambda of individual chapters and cumulative length in the
novel The Fateful Adventures of the Good Soldier Švejk During the World War written by Jaroslav
Hašek.

Table 3: The cumulative length and lambda of individual chapters in the novel The Fateful
Adventures of the Good Soldier Švejk During the World War written by Jaroslav Hašek.

Chapter N (cum.) Lambda Chapter N (cum.) Lambda Chapter N (cum.) Lambda


1 2810 1.6264 1-10 31391 1.4267 1-19 103380 1.2208
1-2 5039 1.5838 1-11 33921 1.4251 1-20 115563 1.1878
1-3 6764 1.5727 1-12 35928 1.4140 1-21 131989 1.1609
1-4 8053 1.5668 1-13 39842 1.3966 1-22 147184 1.1350
1-5 10156 1.5408 1-14 51887 1.3557 1-23 162170 1.1178
1-6 13037 1.5109 1-15 55229 1.3378 1-24 177022 1.0975
1-7 14495 1.5170 1-16 60729 1.3174 1-25 184526 1.0887
1-8 19105 1.5063 1-17 79279 1.2731 1-26 187260 1.0863
1-9 25200 1.4744 1-18 96288 1.2344 1-27 199120 1.0718

Fig. 4: The relationship between lambda of individual chapters and cumulative length in the
novel Oliver Twist written by Charles Dickens.

Table 4: The cumulative length and lambda of individual chapters in the novel Oliver Twist
written by Charles Dickens.

Chapter N (cum.) Lambda Chapter N (cum.) Lambda Chapter N (cum.) Lambda


1 1134 1.5321 1-19 54900 0.8236 1-37 105651 0.7148
1-2 5111 1.2489 1-20 57900 0.8104 1-38 109252 0.7099
1-3 8216 1.1712 1-21 60097 0.8090 1-39 109252 0.7099
1-4 10804 1.1287 1-22 62590 0.8014 1-40 117105 0.6960
1-5 14877 1.0882 1-23 65330 0.7943 1-41 120700 0.6880
1-6 16624 1.0703 1-24 67328 0.7909 1-42 124420 0.6816
1-7 18965 1.0327 1-25 69587 0.7851 1-43 128204 0.6765
1-8 22247 1.0064 1-26 74113 0.7784 1-44 130587 0.6725
1-9 24595 0.9805 1-27 76631 0.7732 1-45 131815 0.6706
1-10 26413 0.9758 1-28 80069 0.7663 1-46 135403 0.6675
1-11 29040 0.9619 1-29 81463 0.7639 1-47 137933 0.6633
1-12 32473 0.9378 1-30 83823 0.7581 1-48 141421 0.6628
1-13 35314 0.9258 1-31 87812 0.7490 1-49 145000 0.6581
1-14 39316 0.8888 1-32 91189 0.7436 1-50 149275 0.6576
1-15 41694 0.8740 1-33 94484 0.7370 1-51 154133 0.6504
1-16 45232 0.8566 1-34 98213 0.7285 1-52 157455 0.6472
1-17 48464 0.8432 1-35 101036 0.7216 1-53 159017 0.6465
1-18 51477 0.8352 1-36 102034 0.7193

3 Analysis of individual languages (Czech and English)
The findings presented in Section 2 seriously undermine the assumption of the independence of lambda from text length, and they open up questions about the relationship between these two text parameters (i.e. N and Λ) in general. For a more detailed analysis, texts of different lengths in the same language were used and the relationship between N and Λ was observed. For Czech, I analyzed 610 texts whose lengths lie in the interval N ∈ <37, 211405>, and for English, 218 texts whose lengths lie in the interval N ∈ <14, 360922>. The results are presented in Figure 5 and Figure 6. In both cases lambda first increases along with N, and then it decreases in the form of a concave bow.
The question is whether there is some interval of N in which lambda is in-
dependent of text length. Theoretically, one can assume that in a very short text
the author cannot control the frequency structure because of the lack of
“space”. In other words, the author cannot consciously manage the proportion
of particular words in short texts because the length of the text simply does not
make it possible. With increasing N the possibilities to control the frequency
structure increase (for instance, one can try to avoid repetition of words).
On the other hand, there should probably be some maximum of N where the
capability of the author to influence the frequency structure diminishes (which
could be caused by some limits of human mental ability) and the frequency
structure is ruled by a self-regulating mechanism; Popescu et al. (2012, pp. 126–
127) describe this mechanism as follows:

The longer the text, the more the writer loses his subconscious control over some propor-
tions and keeps only the conscious control over contents, grammar, his aim, etc. But as
soon as parts of control disappear, the text develops its own dynamics and begins to abide
by some laws which are not known to the writer but work steadily in the background. The
process is analogous to that in physics: if we walk, we consider our activity as something
normal; but if we stumble, i.e. lose the control, gravitation manifests its presence and we
fall. That means, gravitation does not work ad hoc in order to worry us maliciously, but it
is always present, even if we do not realize it consciously. In writing, laws are present,
too, and they work at a level which is only partially accessible. One can overcome their
working, but one cannot eliminate them. On the other hand, if the writer slowly loses his
control of frequency structuring, a new order begins to arise by self-organization or by
some not perceivable background mechanism.

Furthermore, a long text (such as a novel) is written with many breaks, which also leads to the author’s loss of control of frequency characteristics. In
other words, if one interrupts the process of writing (for any reason, e.g. sleep-

ing, eating, traveling), one’s mental state is probably changed; even if the au-
thor repeatedly reads the text, this change influences the character of frequency
structuring. In short, I would like to emphasize the difference between homoge-
neous text types (e.g. a personal letter, e-mail, poem, or short story) which are
written relatively continuously, on the one hand, and long texts written under
different circumstances, on the other.

Fig. 5: The relationship between lambda and text length in Czech (610 texts).

Fig. 6: The relationship between lambda and text length in English (218 texts).

In sum, it can be assumed that in the case of texts that are too short or too long,
the author cannot control the frequency structure. So, the task is to find the

interval in which lambda is independent of N (if it exists at all). It can be supposed that the differences among the lambda values of individual texts in this interval are caused by pragmatic factors, such as authorship, genre and so on (cf. Popescu et al. 2011). The next section presents a method for deriving the interval empirically.

4 The interval of the relative independence of lambda from text length
For the empirical determination of the interval in which lambda should be in-
dependent of text length, the following method was used. First, the data were
pooled into intervals and for each interval the mean lambda was computed (see
Tables 5 and 6).

Table 5: The mean lambdas of particular intervals of N in Czech; 615 texts were used for the
analysis.

Interval n mean(Λ) s^2(Λ) s^2(Λ)/n


<37, 100> 113 1.5337 0.023246 0.000206
<101, 200> 140 1.6950 0.018871 0.000135
<201, 300> 39 1.7999 0.012658 0.000325
<301, 400> 28 1.7970 0.011724 0.000419
<401, 500> 27 1.8602 0.009719 0.000360
<501, 1000> 69 1.8543 0.012765 0.000185
<1001, 1500> 47 1.8351 0.015120 0.000322
<1501, 2000> 35 1.8180 0.024705 0.000706
<2001, 2500> 26 1.8210 0.027314 0.001051
<2501, 3000> 20 1.7184 0.013041 0.000652
<3001, 4000> 18 1.7127 0.026641 0.001480
<4000, 6500> 13 1.7142 0.028925 0.002225
<9000, 40000> 17 1.5250 0.020573 0.001210
<40001, 212000> 20 1.2233 0.014912 0.000746

Table 6: The mean lambdas of particular intervals of N in English; 218 texts were used for the
analysis.

Interval n mean(Λ) s^2(Λ) s^2(Λ)/n


<14, 100> 40 1.2107 0.036552 0.000914
<101, 200> 15 1.4911 0.021244 0.001416
<201, 500> 12 1.4889 0.028356 0.002363
<501, 1000> 14 1.4659 0.011208 0.000801
<1001, 1500> 9 1.4437 0.017859 0.001984
<1501, 2000> 12 1.4174 0.026552 0.002213
<2001, 2500> 18 1.3281 0.019375 0.001076
<2501, 3000> 14 1.3088 0.007136 0.000510
<3001, 4000> 29 1.3058 0.008702 0.000300
<4000, 5000> 14 1.2821 0.014297 0.001021
<5001, 7000> 16 1.2238 0.020234 0.001265
<7001, 20000> 11 1.0022 0.009905 0.000900
<20001, 100000> 8 0.8297 0.019885 0.002486
<100000, 361000> 6 0.5432 0.006244 0.001041

Then, the differences of lambdas between subsequent intervals were tested by the asymptotic u-test

u = \frac{\Lambda_1 - \Lambda_2}{\sqrt{\frac{s_1^2}{n_1} + \frac{s_2^2}{n_2}}}    (3)
Specifically, for the comparison of the first and second intervals from Table 5 we obtain

u = \frac{1.6950 - 1.5337}{\sqrt{0.000206 + 0.000135}} = 8.73
Because multiple u-tests in subsequent intervals are performed (which in-
flates probability of Type I error), the Bonferroni correction is used for an ap-
propriate adjustment (cf. Miller 1981). Specifically, the critical value for a rejec-
tion of null hypothesis is determined as follows
p_i = \frac{\alpha/2}{n}    (4)
where α is the significance level and n is the number of performed tests. For the significance level α = 0.05 and n = 13 (we perform 13 tests, cf. Tables 7 and 8) we get ucorrected = 2.89 as the critical value. Thus we can state that there is a significant difference between the first and second intervals (cf. Table 5) at the significance level α = 0.05. All results are presented in Tables 7 and 8 and graphically in Figures 7 and 8, where subsequent intervals with non-significant differences are linked.
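The procedure can be sketched in R as follows (our own illustration; the means, variances and sample sizes of the first two intervals are taken from Table 5):

    # Sketch: asymptotic u-test between two subsequent intervals with a
    # Bonferroni-adjusted critical value (alpha = 0.05, n = 13 tests).
    u_test <- function(m1, m2, v1, v2, n1, n2) abs(m1 - m2) / sqrt(v1 / n1 + v2 / n2)
    u <- u_test(1.5337, 1.6950, 0.023246, 0.018871, 113, 140)  # first two Czech intervals, ca. 8.7
    u_critical <- qnorm(1 - (0.05 / 2) / 13)                   # ca. 2.89
    u > u_critical                                             # TRUE: the difference is significant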

Table 7: The differences between subsequent intervals in Czech.

Interval mean(Λ) u Interval mean(Λ)


<30, 100> 1.5337 8.73 <101, 200> 1.6950
<101, 200> 1.6950 4.89 <201, 300> 1.7999
<201, 300> 1.7999 0.11 <301, 400> 1.7970
<301, 400> 1.7970 2.26 <401, 500> 1.8602
<401, 500> 1.8602 0.25 <501, 1000> 1.8543
<501, 1000> 1.8543 0.85 <1001, 1500> 1.8351
<1001, 1500> 1.8351 0.53 <1501, 2000> 1.8180
<1501, 2000> 1.8180 0.07 <2001, 2500> 1.8210
<2001, 2500> 1.8210 2.49 <2501, 3000> 1.7184
<2501, 3000> 1.7184 0.12 <3001, 4000> 1.7127
<3001, 4000> 1.7127 0.02 <4000, 6500> 1.7142
<4000, 6500> 1.7142 3.23 <9000, 40000> 1.5250
<9000, 40000> 1.5250 6.82 <40001, 212000> 1.2233

Table 8: The differences between subsequent intervals in English. The values of significant differences (at the significance level α = 0.05, critical value ucorrected = 2.89) are boldfaced.

Interval mean(Λ) u Interval mean(Λ)


<14, 100> 1.2107 5.81 <101, 200> 1.4911
<101, 200> 1.4911 0.04 <201, 500> 1.4889
<201, 500> 1.4889 0.41 <501, 1000> 1.4659
<501, 1000> 1.4659 0.42 <1001, 1500> 1.4437
<1001, 1500> 1.4437 0.41 <1501, 2000> 1.4174
<1501, 2000> 1.4174 1.56 <2001, 2500> 1.3281
<2001, 2500> 1.3281 0.48 <2501, 3000> 1.3088
<2501, 3000> 1.3088 0.11 <3001, 4000> 1.3058
<3001, 4000> 1.3058 0.65 <4000, 5000> 1.2821
<4000, 5000> 1.2821 1.22 <5001, 7000> 1.2238
<5001, 7000> 1.2238 4.76 <7001, 20000> 1.0022
<7001, 20000> 1.0022 2.96 <20001, 100000> 0.8297
<20001, 100000> 0.8297 4.82 <100000, 361000> 0.5432

Fig. 7: The differences between subsequent intervals in Czech. Subsequent intervals with non-
significant differences are connected by the lines. The line does not mean that all the intervals
inside the line are not different; the line expresses non-significant differences only between
pairs of subsequent intervals.

As can be seen in Figures 7 and 8, there is a relatively wide interval (the black points) in which no significant differences between particular subsequent intervals appear; for Czech N ∈ <201, 6500> and for English N ∈ <101, 7000>. As for
the lower endpoint of these intervals, it can be assumed that this is related to
the synthetic/analytic character of the two languages. Specifically, the more
analytic the language, the lower the endpoint of this interval, and vice versa. As
for the upper endpoint, it can be assumed that it should be relatively independ-
ent of the type of the language because this endpoint should be a consequence
of human mental ability (cf. Section 3). The difference between this endpoint in
Czech (N = 6500) and English (N = 7000) is rather because of the different char-
acter of the sample (cf. Table 5 and 6). Obviously, only further research based
on an analysis of more texts and more languages could bring a more reliable
result.

Fig. 8: The differences between subsequent intervals in English. Subsequent intervals with
non-significant differences are connected by the lines. The line does not mean that all the
intervals inside the line are not different; the line expresses non-significant differences only
between pairs of subsequent intervals.

5 Maximum of lambda (theoretical) and empirical findings
Another way to observe the properties of the relationship between the text
length and lambda consists in a comparison of the maximum theoretical value
of lambda and the empirical findings. Because lambda is based on the property

of the arc length L (cf. Section 1), it is first necessary to derive the theoretical
maximum of the arc length, which is given as

L_{max} = V - 1 + f(1) - 1 = N - 1    (5)

where V is the number of word types (i.e. maximum rank) and f(1) is the
maximum frequency. The theoretical maximum of lambda is
\Lambda_{max} = \frac{L_{max}}{N}\log_{10}(N) = \left(\frac{N-1}{N}\right)\log_{10}(N) = \left(1 - \frac{1}{N}\right)\log_{10}(N)    (6)
practically N >> 1, hence

\Lambda_{max} = \log_{10}(N)    (7)

The comparison of the maximum theoretical value of lambda and the em-
pirical findings is presented in Figures 9 and 10. It is clear that shorter texts are
closer to the maximum than longer ones. Moreover, with increasing text length
the difference between Λmax and Λ grows. Obviously, this result reveals that with
increasing text length the language user's ability to control the frequency struc-
ture of the text decreases, and consequently it is governed by some self-
regulating mechanism (cf. Popescu et al. 2012).

Fig. 9: Comparison of the maximum theoretical value of lambda (line) and the empirical find-
ings (dots) in Czech.

Fig. 10: Comparison of the maximum theoretical value of lambda (line) and the empirical find-
ings (dots) in English.

6 Conclusion
The study revealed the evident dependence of the lambda indicator on text
length. Consequently, this finding undermines the main methodological ad-
vantage of the lambda measurement and casts doubt upon many of the results
presented by Popescu et al. (2011). However, the specific relationship between
lambda and the text length, which is expressed graphically by a concave bow
(cf. Section 3), allows us to determine the interval of the text length in which the
lambda measurement is meaningful and can be used for original purposes.
Moreover, this determination is connected to theoretical reasoning which may
be enhanced by psycholinguistic or cognitive explanations.
To summarize, the lambda indicator can on the one hand be ranked among
all previous attempts which have tried unsuccessfully to eliminate the impact of
text length; on the other hand, however, its specificity means that the method
offers a potential use in comparisons of texts. Of course, only further analyses
can reveal the potential meaningfulness or meaninglessness of this method.

Acknowledgments
This study was supported by the project Linguistic and lexicostatistic analysis
in cooperation of linguistics, mathematics, biology and psychology, grant no.
CZ.1.07/2.3.00/20.0161, which is financed by the European Social Fund and the
National Budget of the Czech Republic.

References
Covington, Michael A. & Joe D. McFall. 2010. Cutting the Gordian Knot: The Moving Average
Type-Token Ratio (MATTR). Journal of Quantitative Linguistics 17(2). 94–100.
Hess, Carla W., Karen M. Sefton & Richard G. Landry. 1986. Sample size and type-token ratios
for oral language of preschool children. Journal of Speech and Hearing Research, 29. 129–
134.
Hess, Carla W., Karen M. Sefton & Richard G. Landry. 1989. The reliability of type-token ratios
for the oral language of school age children. Journal of Speech and Hearing Research,
32(3). 536–540
Martynenko, Gregory. 2010. Measuring lexical richness and its harmony. In Peter Grzybek,
Emmerich Kelih & Ján Mačutek (eds.), Text and language. Structures, Functions, In-
terrelations, 125-132. Wien: Praesens Verlag.
Miller, Rupert G. 1981. Simultaneous Statistical Inference. Berlin, Heidelberg: Springer.
Müller, Dieter. 2002. Computing the type token relation from the a priori distribution of types.
Journal of Quantitative Linguistics 9(3). 193-214.
Popescu, Ioan-Iovitz. 2007. Text ranking by the weight of highly frequent words. In Peter
Grzybek & Reinhard Köhler (eds.), Exact methods in the study of language and text, 555-
565. Berlin-New York: de Gruyter.
Popescu, Ioan-Iovitz, Gabriel Altmann, Peter Grzybek, Bijapur Dayaloo Jayaram, Reinhard
Köhler, Viktor Krupa, Ján Mačutek, Regina Pustet, Ludmila Uhlířová & Matumnal N. Vidya.
2009. Word frequency studies. Berlin-New York: Mouton de Gruyter.
Popescu, Ioan-Iovitz, Ján Mačutek & Gabriel Altmann. 2010. Word forms, style and typology.
Glottotheory, 3. 89-96.
Popescu, Ioan-Iovitz, Radek Čech & Gabriel Altmann. 2011. The lambda-structure of texts.
Lüdenscheid: RAM-Verlag.
Popescu, Ioan-Iovitz, Radek Čech & Gabriel Altmann. 2012. Some geometric properties of
Slovak poetry. Journal of Quantitative Linguistics 19(2). 121-131.
Ratkowsky, David A., Maurice H. Halstead & Linda Hantrais. 1980. Measuring vocabulary rich-
ness in literary works. A new proposal and a re-assessment of some earlier measures.
Glottometrika 2. 125-147.
Richards, Brian. 1987. Type/token ratios: what do they really tell us? Journal of Child Lan-
guage 14(2). 201-209.
Tešitelová, Marie. 1972. On the so-called vocabulary richness. Prague Studies in Mathematical
Linguistics 3. 103-120.

Tuldava, Juhan. 1995. On the relation between text length and vocabulary size. In Juhan Tulda-
va (ed.), Methods in quantitative linguistics, 131-150. Trier: WVT.
Weizman, Michael. 1971. How useful is the logarithmic type-token ratio? Journal of Linguistics
7(2). 237-243.
Wimmer, Gejza. 2005. Type-token relation. In Reinhard Köhler, Gabriel Altmann & Rajmund G.
Piotrowski (eds.), Handbook of Quantitative Linguistics, 361-368. Berlin: de Gruyter.
Wimmer, Gejza & Gabriel Altmann. 1999. On vocabulary richness. Journal of Quantitative Lin-
guistics 6(1). 1-9.
Reinhard Köhler
Linguistic Motifs

1 Introduction
Quantitative linguistics has been concerned with units, properties, and their re-
lations mostly in a way in which the syntagmatic or sequential behaviour of the ob-
jects under study was ignored. The employed mathematical means and models
reflect, in their majority, mass phenomena treated as samples taken from some
populations - even if texts and corpora do not possess the statistical properties
which are needed for many of the common methods. Nevertheless, with some
caution, good and valid results can be obtained using probability distributions,
functions, differential and difference equations, etc.
The present volume gives an overview of alternative methods which can be
applied if the sequential structure of linguistic expressions, in general texts, is in the focus of an investigation. Here, a recently presented new unit will be intro-
duced in order to find a method which can give information about the sequential
organisation of a text with respect to any linguistic unit and to any of its proper-
ties – without relying on a specific linguistic approach or grammar. Moreover, the
method brings with it several advantages, which will be described below.
The construction of this unit, the motif (originally called segment or sequence,
cf. Köhler 2006, 2008a,b; Köhler/Naumann 2008, 2009, 2010) was inspired by the so-called F-motiv for musical “texts” (Boroda 1982). Boroda was in search of
a unit which could replace the word as used in linguistics for frequency studies
in musical pieces. Units common in musicology were not usable for his purpose,
and so he defined the "F-Motiv" with respect to the duration of the notes of a mu-
sical piece.

2 The new unit 'motif'


As a much more general approach is needed, the linguistic motif is defined as

the longest continuous sequence of equal or increasing values representing a quantitative


property of a linguistic unit.

Thus,

An L-motif is a continuous series of equal or increasing length values (e.g. of morphs, words
or sentences).

An F-motif is a continuous series of equal or increasing frequency values (e.g. of morphs,


words or syntactic construction types).

A P-motif is a continuous series of equal or increasing polysemy values (e.g. of morphs or


words).

A T-motif is a continuous series of equal or increasing polytextuality values (e.g. of morphs,


words or syntactic construction types).

An example of a L-motif segmentation is the following. The sentence

“Word length studies are almost exclusively devoted to the problem of distributions.”

is, according to the above-given definition, represented by a sequence of 5 L-mo-


tifs:

(1-1-2) (1-2-4) (3) (1-1-2) (1-4)

if the definition is applied to word length measured in the number of syllables.
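The segmentation rule can also be stated as a small program; the following R sketch (function name and implementation are ours, not part of the cited work) reproduces the segmentation of the example sentence:

    # Sketch: segment a sequence of property values into motifs, i.e. into the
    # longest continuous runs of equal or increasing values.
    segment_motifs <- function(values) {
      motifs <- list(values[1])                     # the first value starts the first motif
      for (v in values[-1]) {
        last <- motifs[[length(motifs)]]
        if (v >= tail(last, 1)) {
          motifs[[length(motifs)]] <- c(last, v)    # equal or larger value continues the motif
        } else {
          motifs[[length(motifs) + 1]] <- v         # smaller value starts a new motif
        }
      }
      motifs
    }
    segment_motifs(c(1, 1, 2, 1, 2, 4, 3, 1, 1, 2, 1, 4))
    # yields (1-1-2) (1-2-4) (3) (1-1-2) (1-4), as in the example above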


Similarly, motifs can be defined for any linguistic unit (phone, phrase [type],
clause [type], etc.) and for any linguistic property (poly-functionality, complex-
ity, familiarity etc.). Variants of investigations based on motifs can be generated
by changing the direction in which these units are segmented, i.e. beginning from
the first unit in the text/discourse and proceeding forward or beginning from the
last item and applying the definition in the opposite direction and by replacing
“increasing” by “decreasing” values of the given property in the definition of the
motif1. We do not expect statistically significant differences in the results. In con-
trast, different operationalisations of properties will affect the results in many
cases, e.g. if word length is measured in the number of letters or in the average
duration in ms in speech.
Some of the advantageous properties of the new units are the following:
1. Following the definition, any text or discourse can be segmented in an un-
ambiguous way. In many cases, the segmentation can be done automatically
using simple computer programs. Frequency motifs can be determined in
two ways: (1) the frequencies of the units under study are counted in the stud-
ied text or corpus itself or (2) taken from a frequency dictionary. The same


1 It may be, e.g. appropriate to go from right to left when a language with syntactic left branch-
ing preference is analyzed.

holds for polytextuality motifs, only that a single text does not suffice, of
course, whereas polysemy must be looked up in a dictionary. Lengths motifs
can be determined automatically to the extent to which the writing system
reflects the units in which length is to be counted. Alphabetic scripts provide good conditions for character counts whereas syllabic scripts favour syllable counts.
Special circumstances such as with Chinese are also helpful if syllables are
to be counted. Syntactic complexity, depth of embedding and other more
complicated properties can also be used to form motifs but determining them
automatically presupposes previously annotated text material. Even if seg-
mentation into motifs cannot be done automatically, the important ad-
vantage remains that the result does not depend on any interpretation but is
objective and unambiguous.
2. Segmentation into motifs is always exhaustive, i.e. no remainder is left. The suc-
cessor of a numerical value in a sequence is always (1) larger than or equal to
the given value – or (2) smaller. In the first case, the successor belongs to the
current motif, in the second case, it starts a new motif. The last value in a
text does not cause any problem; for the first one, we have to add the only
additional rule: It starts the first motif.
3. Motifs have an appropriate granularity. They can always be operationalised
in a way that segmentation takes place in the same order of magnitude as the
phenomena under analysis.
4. Motifs are scalable with respect to granularity. One and the same definition
can be iteratively applied: It is possible to form motifs on the basis of length
or frequency values etc. of motifs. A closer look unveils that there are two
different modes in a scaled analysis. The first level of motifs is based on a
property of genuine linguistic units, e.g. word length, which is counted in
terms of e.g., syllables, phones, characters, morphs, or even given as dura-
tion in milliseconds. On the second level, on which the length of length mo-
tifs is determined, only numbers exist. Thus, the concept of length is a differ-
ent one than that on the first level. The length motifs of length motifs of word
length or higher levels, however, do not add new aspects. It goes without
saying that corresponding differences exist if other properties or units are
used to form motifs.
The scaling mechanism can be used to generate infinitely many new kinds of
motifs. Thus, frequency motifs of word lengths motifs can be formed as well
as length motifs of frequency motifs of morph polysemy motifs. This may
sound confusing and arbitrary; nevertheless, a number of such cross-cate-
gory motifs was used and proved to be useful for text classification purposes
(cf. Köhler, Naumann 2010).

In the cited work, a notation was introduced to avoid long-winded expressions. For length motifs of frequency motifs of word lengths, the symbol LFL
was used etc. In a more general context, a symbolic representation of the
basic unit should be added, e.g. the symbol LFP(m) could be appropriate for
length motifs of frequency motifs of morph polysemy.
5. Motifs display a rank-frequency distribution of the Zipf-Mandelbrot type, i.e.
they behave in this respect in a way similar to other, more intuitive units of
linguistic analysis. The empirical parameters of this law, e.g. applied to the
distribution of frequency motifs of word frequencies, were also used for text
classification (l.c.).

3 Application
Let us consider a text, e.g. one of the end-of-year speeches of Italian presidents,
in which the words have been replaced by their lengths measured in syllables.
The beginning of the corresponding sequence looks as follows:

2 2 1 1 2 1 3 2 3 2 2 1 1 1 4 1 5 1 1 2 1 2 2 1 2 1 4 3 2 1 1 1 4 1 3 4 2 1 1 4 2 2 2 2 4 2 1 4 ...

Forming L-motifs according to the definition given above yields (we do not need
the parentheses):
2-2
1-1-2
1-3
2-3
2-2
1-1-1-4
1-5
1-1-2
1-2-2
1-2
1-4
3
2
1-1-1-4
1-3-4
2
1-1-4
...

The rank-frequency distribution of the motifs in the complete text is given in Ta-
ble 1.

Table 1: Rank-frequency distribution of the L-motifs (syllables) in an Italian text.

Rank Motif Frequency Rank Motif Frequency


1 1-3 297 71 2-3-4 2
2 1-2 218 72 1-1-3-3-5 2
3 1-4 151 73 3-3-4 2
4 2 141 74 2-6 2
5 1-1-3 86 75 1-1-2-2-3-4 2
6 1-2-3 74 76 2-2-2-4 2
7 1-1-2 62 77 1-3-4-4 2
8 1-2-2 53 78 2-4-4 2
9 2-3 52 79 1-1-1-2-2-2 2
10 1-1-4 50 80 1-1-2-2-5 2
11 3 49 81 3-4 2
12 2-2 47 82 1-2-4-4 2
13 1-5 35 83 1-1-1-1-2-2 2
14 1-2-4 31 84 2-2-2-2-4 1
15 1-3-4 26 85 1-1-2-2-2-2-3 1
16 1-3-3 25 86 2-2-2-2-3-3 1
17 2-4 24 87 2-2-2-2-3 1
18 1-1-1-3 24 88 1-2-2-2-2-2-2-4 1
19 1-1-5 21 89 1-1-1-1-1-1-2-2 1
20 1-2-5 21 90 1-1-1-1-1-2 1
21 1-2-2-3 17 91 1-1-1-1-1-5 1
22 2-2-2 17 92 1-1-2-2-3-3 1
23 1-2-2-2 17 93 2-2-3-3 1
24 1-1-3-3 16 94 4-4-4 1
25 1-1-2-3 16 95 2-2-2-2-2 1
26 1-1-2-2 15 96 1-1-2-2-2-3 1
27 2-5 15 97 1-1-2-2-2-2 1
28 1-6 14 98 4-4 1
29 2-2-3 13 99 1-2-2-2-2-2-2-3 1
30 1-2-2-4 12 100 1-1-3-6 1
31 1-1-1-2 12 101 1-1-2-2-5-5 1
32 1-1-1-4 10 102 1-1-1-1-2-2-2 1
33 4 10 103 1-2-3-3-4 1
34 1-3-5 9 104 1-2-2-6 1
35 2-3-3 9 105 2-2-2-5 1
36 1-1-2-4 8 106 1-1-1-1-1-3 1
37 1-1-3-4 8 107 1-1-1-1-1-3-3-3 1
38 1-1-2-2-2 8 108 1-3-3-3-4 1
39 1-4-4 7 109 1-4-7 1
40 1-1-2-2-3 7 110 5 1
41 3-3 7 111 1-1-2-3-4 1
42 1-2-2-2-2 6 112 1-1-1-2-2 1
43 1-2-3-3 6 113 1-1-1-2-2-2-3 1
44 1-2-2-2-3 6 114 1-1-1-1-6 1
45 1-1-1-1-4 5 115 1-2-2-3-4 1
46 2-2-2-2 5 116 2-2-2-3-3 1
47 1-2-6 5 117 1-4-4-5 1
48 1-1-4-4 5 118 1-1-1-3-4 1
49 1-1-1-1-3 5 119 1-4-5 1
50 2-3-3-3 4 120 1-5-5 1
51 1-3-3-4 4 121 1-1-2-7 1
52 1-3-3-3 4 122 2-2-2-6 1
53 1-2-2-2-4 4 123 1-2-2-2-3-4 1
54 1-1-2-5 4 124 2-4-5 1
55 1-1-1-5 4 125 2-2-2-3-4 1
56 1-2-3-4 3 126 1-2-3-7 1
57 1-1-2-6 3 127 1-1-4-4-4 1
58 1-2-2-5 3 128 1-1-4-4-4-4 1
59 1-1-2-3-3 3 129 1-1-1-1-3-4 1
60 1-1-1-3-3 3 130 1-2-2-3-5 1
61 1-1-1-2-4 3 131 2-2-4-4 1
62 2-2-2-3 3 132 1-1-2-2-2-4 1
63 1-1-1-2-3 3 133 1-2-2-2-2-2 1
64 1-1-3-3-3 3 134 1-2-4-5 1
65 2-2-4 3 135 1-2-2-2-5 1
66 1-1-6 3 136 1-3-3-3-5 1
67 1-1-1-1-2-4 2 137 1-1-1-2-5 1
68 1-2-2-2-2-2-2 2 138 1-1-3-5 1
69 1-3-6 2 139 1-2-2-3-3-3 1
70 2-2-5 2 140 1-1-2-3-3-4 1

Fitting the Zipf-Mandelbrot distribution to the data yields an excellent result: The
value of the Chi-square statistic is 30.7651 with 131 degrees of freedom; the prob-
ability cannot be distinguished from 1.0. The parameters are estimated as a =
1.6468, b = 3.1218. Figure 1 gives an impression of the fit.
So far, the empirical description of one of the aspects of the statistical struc-
ture of motif sequences in texts was shown. Such results can be used for various
purposes, because the differences of the parameters of two empirical distribu-
tions can be tested for significance, e.g. for the comparison of authors, texts, and
text sorts and for classification purposes (cf. Köhler, Naumann 2010). As to the
question, which theoretical probability distribution should be expected for mo-
tifs, a general hypothesis was set up in Köhler (2006). It states that the frequency

distributions of motifs are similar to the distributions of the basic units. This hy-
pothesis was successfully tested in the cited work, in the cited papers by Köhler
and Naumann, and in Mačutek (2009).

Fig. 1: Bi-logarithmic graph of the Zipf-Mandelbrot distribution as fitted to the data from Table
1.

The following is an example, which may illustrate that the underlying probability
distribution of the basic units resp. their properties and of the corresponding mo-
tifs can be theoretically derived and thus explained. In Köhler, Naumann (2009),
sentence length was studied. As opposed to most studies on sentence length, this
quantity was measured in terms of the number of clauses. There are two main
reasons to do so. First, the word, the commonly used unit for this purpose, is not
the immediate constituent of the sentence (in the sense of the Menzerath-Alt-
mann Law). Second, frequency counts of sentence length based on the number
of words display ragged distribution shapes with notoriously under-populated
classes. The data are usually pooled into intervals of at least ten but nevertheless
do not display smooth distributions. The length motifs formed on the basis of
clause counts yielded an inventory of motif types, which turned out to be distrib-
uted according to the Zipf-Mandelbrot distribution, as expected. The next step
was the attempt to theoretically find the probability distribution of the length of
these length motifs. The corresponding considerations were as follows:
[1] In a given text, the mean sentence length, the estimation of the math-
ematical expectation of sentence length, can be interpreted as the
sentence length intended by the text expedient (speaker/writer).
[2] Shorter sentences are formed in order to decrease decoding/pro-
cessing effort (the requirement minD in synergetic linguistics) within
the sentence. This tendency will be represented by the quantity D.
[3] Longer sentences are formed where they help to compactify what oth-
erwise would be expressed by two or more sentences and where the
more compact form decreases processing effort with respect to the
next higher (inter-sentence) level. This will be represented by H.
[2] and [3] are causes for deviations from the mean length value while they, at the
same time, compete with each other. This interdependence can be expressed us-
ing Altmann’s approach (Köhler, Altmann 1996): The probability of a sentence
length x is proportional to the probability of sentence length x-1. The function
P_x = \frac{D}{x + H - 1}\, P_{x-1}
represents the above-sketched relations: D has an increasing influence on this
relation whereas H has a decreasing one. The probability class x itself has also a
decreasing influence, which reflects the fact that the probability of long sen-
tences decreases with the length. This equation leads to the hyper-Poisson distri-
bution (Wimmer/Altmann 1999, 281):
P_x = \frac{a^x}{{}_1F_1(1; b; a)\, b^{(x)}}, \qquad x = 0, 1, 2, \ldots, \quad a > 0, \; b > 0,

where {}_1F_1(1; b; a) is the confluent hypergeometric function

{}_1F_1(1; b; a) = \sum_{j=0}^{\infty} \frac{a^j}{b^{(j)}}

and b^{(x)} = b(b + 1) \cdots (b + x - 1) is the ascending factorial. According to this derivation, the hyper-Pois-
son distribution, which plays a basic role with word length distributions (Best
1997), should therefore also be a good model of L-motif length on the sentence
level although motifs on the word level, regardless of the property considered
(length, polytextuality, frequency), follow the hyper-Pascal distribution (Köhler
2006; Köhler/Naumann 2008). Figure 2 shows one example of the fitting tests,
which support the specific hypothesis derived above as well as the general hy-
pothesis that motif distributions resemble the distributions of the basic units
resp. properties. Similarly, in Köhler (2006) a theoretical model of the distribution
of the length of L-motifs was derived, which yielded the hyper-Pascal distribu-
tion. Empirical tests confirmed this hypothesis.
Fig. 2: Fitting the hyper-Poisson distribution to the frequency distribution of the lengths of L-
motifs on the sentence level.

4 Motifs on the basis of categorical sequences


Linguistic data are often categorical. The definition of motifs given in the intro-
duction prevents such data from forming motifs: metrical or ordinal comparisons
are not defined for variables on a nominal/categorical scale. An example of cate-
gorical data was discussed in Beliankou, Köhler, Naumann (2013), where argu-
mentation structures were in the focus. The argumentation relations such as cir-
cumstance, condition, concession, evidence, elaboration, contrast, evaluation, are
obviously categorical values. For the purpose of the cited study, a definition of
motifs was given which allowed for the formation of two kinds of motifs. The first
one is based on the repetition of a value, i.e. the current motif ends as soon as one
of the values in the sequence is a repetition of a previous one. This kind of motif
was called R-motif:

An R-motif is an uninterrupted sequence of unrepeated elements.


An example of the segmentation of a text fragment taken from one of the anno-
tated newspaper commentaries of the Potsdam corpus2 (represented as a se-
quence of argumentative relations) into R-motifs is the following:

["elaboration"], ["elaboration”, "concession"], ["elaboration", "evidence", "list", "prepara-


tion", "evaluation", "concession"], ["evidence", "elaboration", "evaluation].

The first R-motif consists of a single element because the following relation is a
repetition of the first; the second one ends also where one of its elements occurs
again etc.
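The segmentation into R-motifs is easy to automate. The following minimal sketch (an illustration, not the tool used in the cited study; the function name is invented) cuts a sequence of categorical values at the first repetition and reproduces the segmentation of the fragment above:

def r_motifs(sequence):
    # An R-motif ends as soon as the next value already occurs in the current motif.
    motifs, current = [], []
    for value in sequence:
        if value in current:          # repetition -> close the current motif
            motifs.append(current)
            current = []
        current.append(value)
    if current:                       # flush the last, possibly still open, motif
        motifs.append(current)
    return motifs

relations = ["elaboration", "elaboration", "concession", "elaboration",
             "evidence", "list", "preparation", "evaluation", "concession",
             "evidence", "elaboration", "evaluation"]
print(r_motifs(relations))
# [['elaboration'], ['elaboration', 'concession'],
#  ['elaboration', 'evidence', 'list', 'preparation', 'evaluation', 'concession'],
#  ['evidence', 'elaboration', 'evaluation']]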
On this basis, the lengths of these R-motifs in the Potsdam commentary cor-
pus were determined. The distribution of the motif lengths turned out to abide by
the hyper-Binomial distribution (cf. Figure 3):

Fig. 3: The distribution of the length of R-motifs in a corpus which was annotated for argumen-
tation relations (cf. Beliankou, Köhler, Naumann 2013).

Another variant of categorical motifs was introduced because the argumentation
data are organised not only in a sequence but also in a tree. Therefore, the D-motif
was defined as

A D-motif is an uninterrupted depth-first path of elements in a tree structure.


2 Stede (2004).
Each motif begins at the root of the tree and follows one of the possible paths
down until a terminal element is reached. The length of motifs determined in this
way displays a behaviour that differs considerably from that of the R-motifs. A
linguistically interpretable theoretical probability distribution which can be fit-
ted to the empirical frequency distribution is the mixed negative binomial distri-
bution (cf. Figure 4).
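D-motifs can be collected by a simple depth-first traversal of the tree. The sketch below is again only an illustration; it assumes a nested (label, children) representation of the tree, which is not the format of the annotated corpus, and the labels are invented:

def d_motifs(node):
    # A D-motif is a root-to-leaf path; a node is a pair (label, list of children).
    label, children = node
    if not children:
        return [[label]]
    paths = []
    for child in children:
        for path in d_motifs(child):
            paths.append([label] + path)
    return paths

tree = ("evaluation",
        [("elaboration", [("evidence", []), ("concession", [])]),
         ("contrast", [])])
print(d_motifs(tree))
# [['evaluation', 'elaboration', 'evidence'],
#  ['evaluation', 'elaboration', 'concession'],
#  ['evaluation', 'contrast']]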

Fig. 4: Fitting the mixed negative binomial distribution to the D-motif data

This distribution was justified in the paper by the assumption that it is the result
of a combination of two processes, viz. the combination of two diversifications,
which both result in the negative binomial distribution but with different param-
eters.
We will apply the R-motif method to the Italian text we studied above but this
time with respect to the sequences of part-of-speech tags3. Replacing the words
in the text by the symbols for their resp. part-of-speech yields the sequence (for
the tags cf. Table 2):

A N CONG A N DET N CONG V A N V PREP DET N PREP N PRON V V PREP V...


3 Again, many thanks to Arjuna Tuzzi.
Table 2: Tags as used in the annotations to the Italian text.

Tag Part-of-speech
N noun
PREP preposition
V verb
A adjective
DET article
CONG conjunction
PRON pronoun
AVV adverb
NM proper name
NUM number
ESC interjection

We determine the R-motifs and fit the Zipf-Mandelbrot distribution to the fre-
quency distribution of these motifs. The motifs and their frequency are shown in
Table 3. The result of the fitting is excellent. The probability of the Chi-square
value is given as 1.0, the parameters are a = 0.9378, b = 0.9553, and n = 602. The
number of degrees of freedom is 473 (caused by pooling classes with low fre-
quency). The graph of the distribution and the data is given in Figure 5.

Table 3: Empirical rank-frequency distribution of part-of-speech based R-motifs from an Italian text.

Rank  Part-of-speech based R-motifs  Freq.    Rank  Part-of-speech based R-motifs  Freq.
1 V 155 302 A-N-PREP-NM 1
2 N 86 303 AVV-V-N-CONG-PREP 1
3 N-PREP 84 304 AVV- 1
4 PREP 39 305 AVV-PREP-PRON 1
5 PREP-N 36 306 V-PREP-N-CONG-AVV 1
6 N-V 34 307 V-A-DET-N-CONG 1
7 PRON 34 308 CONG-A-PRON-AVV-V-N-DET 1
8 V-PREP 31 309 CONG-A-AVV-PRON-V-N 1
9 V-CONG 29 310 A-PREP-DET-N 1
10 PRON-V 28 311 A-N-ESC-AVV 1
11 AVV 27 312 AVV-DET-A-N-CONG-PRON-V 1
12 N-A 26 313 A-N-AVV-PREP 1
13 A 24 314 N-PREP-A-V-DET 1
14 N-PRON-V 19 315 CONG-PRON-V-A-N 1
15 V-AVV 19 316 N-A-AVV 1
16 V-A 19 317 V-PREP-A-N-CONG-DET-NM 1
17 N-DET 18 318 N-AVV-V-A 1
18 N-PRON 18 319 AVV-PRON-V-N-PREP 1
19 N-AVV-V 16 320 V-A-N-AVV-DET 1
20 V-DET-N 16 321 N-AVV-A 1
21 V-DET-N-PREP 15 322 A-N-PRON-V-AVV 1
22 V-DET 14 323 V-PREP-DET-N-A 1
23 N-PREP-A 14 324 DET-N-A-PREP-V 1
24 N-CONG 14 325 N-DET-NM-V 1
25 A-N 13 326 DET-N-PRON 1
26 DET 13 327 A-PRON-V-CONG-DET-N-PREP 1
27 N-V-DET 11 328 V-N-A-CONG 1
28 AVV-V 10 329 CONG-V-DET-N-ESC 1
29 V-PREP-N 10 330 V-CONG-DET-PRON 1
30 A-N-V 10 331 PRON-V-PREP-DET-N 1
31 PREP-A 10 332 V-DET-N-A-ESC-PREP 1
32 V-PRON 10 333 DET-N-PREP-PRON-V-A 1
33 A-N-PREP 10 334 V-A--CONG-N 1
34 N-A-PREP 10 335 N-PREP-DET-A 1
35 V-PREP-DET-N 8 336 PREP-N-PRON 1
36 DET-N-PREP 8 337 PRON-V-DET-N-PREP-AVV 1
37 V-CONG-PRON 8 338 N-AVV-V-PREP-A 1
38 V-A-CONG 8 339 V-AVV-PREP-NM-CONG 1
39 PREP-V 8 340 PREP-A-N-AVV-V 1
40 A-V 8 341 AVV-DET 1
41 PREP-DET-N 7 342 PREP-NM-V-AVV-DET-N-CONG 1
42 PRON-V-CONG 7 343 NM-PREP-DET 1
43 A-PREP 7 344 NM-V-A 1
44 CONG-PRON-V 7 345 V-CONG-DET-N-PREP-NM 1
45 PREP-PRON 7 346 V-AVV-DET 1
46 PRON-V-AVV 6 347 N-NM 1
47 PREP-NM 6 348 N-PREP-NM-PRON-V-CONG 1
48 N-CONG-DET 6 349 N-NM-A-CONG-PRON-V-PREP 1
49 CONG-V 6 350 NM-PRON-V-PREP 1
50 V-N 6 351 V-N-A-AVV 1
51 PREP-A-N 6 352 V-PRON-DET-N-A 1
52 N-PREP-PRON 5 353 V-AVV-PRON-PREP-N-DET 1
53 N-AVV 5 354 PRON-V-PREP-N-CONG 1
54 V-DET-N-PRON 5 355 DET-N-A-V-PREP 1
55 N-CONG-AVV 5 356 PREP-N-A-CONG-PRON 1
56 V-A-N-PREP 5 357 PREP-V-A-N-DET 1
57 CONG 5 358 V-N-CONG-A 1
58 V-DET-N-CONG 5 359 AVV-A-N-CONG 1
59 DET-A 5 360 V-DET-N-CONG-ESC-PREP-PRON 1
60 PREP-N-V 4 361 AVV-PREP-A-N 1
61 V-A-PREP-N 4 362 N-ESC-DET 1
62 V-A-AVV 4 363 V-N-A-DET 1
63 A-N-DET 4 364 A-DET-N 1
64 V-CONG-AVV 4 365 N-AVV-PRON 1
65 N-CONG-V 4 366 PRON-V-PREP-N-DET 1
66 A-CONG 4 367 N-CONG-AVV-DET 1
67 DET-N 4 368 N--DET 1
68 PREP-DET-A-N 4 369 N-V-AVV-CONG-PREP 1
69 V-PREP-PRON 4 370 AVV-N 1
70 PRON-V-PREP 4 371 V-DET-N-AVV-A-CONG 1
71 N-PREP-DET 4 372 V-CONG-PREP-N-PRON-A 1
72 V-N-CONG-DET 4 373 N-AVV-V-CONG 1
73 N-A-V 4 374 PREP-N-A-DET 1
74 V-AVV-PRON 4 375 V-PREP-A-N 1
75 N-PREP-V 4 376 PRON-CONG-AVV-V-DET-N-A 1
76 N-PREP-NM 4 377 N-AVV-A-PREP 1
77 DET-N-A 3 378 V-DET-N-A-PRON-PREP 1
78 N-AVV-V-DET 3 379 N-NM-DET-PRON-V 1
79 N-ESC 3 380 PREP-N-CONG-AVV-V-A 1
80 PREP-N-DET 3 381 PREP-N-M 1
81 N-CONG-PRON-V 3 382 V-AVV-PREP-DET-N-PRON 1
82 V-A-PRON 3 383 DET-N-V-PRON 1
83 V-A-N 3 384 N-PREP-V-CONG 1
84 N-A-V-DET 3 385 V-CONG-DET 1
85 N-AVV-PREP 3 386 N-AVV-CONG-PRON 1
86 PREP-N-A 3 387 AVV-V-PRON 1
87 V-N-PRON 3 388 V-PRON-AVV 1
88 V-DET-N-PREP-A 3 389 V-DET-N-CONG-AVV 1
89 AVV-PREP-N 3 390 V-CONG-AVV-N-M-N 1
90 A-CONG-PRON-V 3 391 AVV-V-PREP-NM 1
91 DET-N-V 3 392 PREP-N-AVV 1
92 DET-NM-V 3 393 PREP-DET-N-A 1
93 DET-V 3 394 N-NM-PRON-V-CONG 1
94 A-PREP-N 3 395 N-A-CONG-AVV-V 1
95 PREP-DET 3 396 AVV-DET-N-PREP 1
96 CONG-PRON 3 397 N-NM-V-CONG-PRON 1
97 V-PREP-N-DET 3 398 N-A-PRON-AVV-V 1
98 PRON-DET-V 3 399 N--V 1
99 N-AVV-DET 3 400 A-CONG-PRON-V-AVV 1
100 AVV-CONG-V 3 401 DET-A-N-V 1
101 V-A-N-DET 3 402 PRON-DET-N-V 1
102 DET-N-PREP-NM 3 403 V-CONG-DET-N-A-AVV 1
103 PREP-A-N-PRON-V 3 404 CONG-PREP-N-AVV-V 1
104 DET-N-V-PREP 3 405 CONG-V-AVV 1
105 V-PREP-N-PRON 3 406 AVV-V-N-PREP 1
106 N-V-PREP 3 407 N-AVV-PREP-DET 1
107 A-AVV-V 3 408 N-PRON-AVV 1
108 AVV-V-A 3 409 N-PREP-DET-PRON 1
109 V-CONG-DET-N 3 410 V-PREP-N-CONG-DET 1
110 CONG-DET-N-V 3 411 PREP-N-A-V 1
111 PREP-PRON-V 3 412 V-AVV-PREP-A-N 1
112 V-DET-N-A 3 413 AVV-PREP-N-M 1
113 AVV-V-DET-N-PREP 3 414 N-M-N-CONG-V-DET 1
114 V-PREP-N-CONG 3 415 N-A-PRON-V-CONG 1
115 V-A-N-PRON 3 416 A-CONG-PRON-V-PREP 1
116 N-PREP-PRON-V 3 417 V-N-CONG-AVV-PRON 1
117 N-CONG-V-A 2 418 V-ESC-PREP-N-CONG 1
118 N-V-PREP-DET 2 419 PREP-V-CONG-A-N 1
119 N-CONG-PREP 2 420 V-N-PRON-AVV 1
120 PRON-CONG 2 421 V-PREP-ESC-PRON 1
121 N-PREP-DET-NM 2 422 V-AVV-DET-N-PREP 1
122 N-A-PRON 2 423 V-PREP-ESC 1
123 CONG-V-N 2 424 PREP-N-PRON-AVV-V 1
124 N-PRON-PREP 2 425 CONG-V-A 1
125 A-N-V-PREP 2 426 CONG-N-M-V 1
126 PRON-N 2 427 DET-N-CONG 1
127 PREP-A-N-CONG 2 428 PREP-PRON-N-M-N-AVV 1
128 PRON-V-AVV-CONG 2 429 CONG-AVV-V 1
129 DET-N-PRON-V 2 430 V-A-CONG-DET-N 1
130 V-A-N-CONG 2 431 V-AVV-A-CONG-DET-N-PREP 1
131 CONG-DET-N-PREP 2 432 PREP-PRON-DET 1
132 N-PRON-V-DET 2 433 DET-N-CONG-PREP-V 1
133 V-DET-N-CONG-A 2 434 V-DET-A 1
134 CONG-DET 2 435 N-A-CONG 1
135 N-V-A 2 436 PREP-A-N-DET 1
136 A-N-V-DET 2 437 V-PREP-PRON-AVV 1
137 V-DET-N-PRON-AVV 2 438 PREP-V-A-AVV 1
138 PRON-V-N-CONG 2 439 A-N--V 1
139 V-N-PREP 2 440 CONG-AVV-PRON-V 1
140 A-AVV 2 441 V-PREP-DET-A-PRON 1
141 A-CONG-DET-N-PREP 2 442 A-AVV-DET-N-M-N-PREP 1
142 N-V-AVV 2 443 N-A-V-AVV-PREP 1
143 PREP-N-CONG 2 444 CONG-V-PREP 1
144 AVV-V-A-N 2 445 N-V-CONG-AVV-DET 1
145 PREP-A-N-V 2 446 A-PRON-V-DET-N 1
146 N-V-DET-A 2 447 V-PREP-N-A-CONG-PRON 1
147 V-AVV-PREP-N 2 448 PREP-N-NM-AVV-DET 1
148 PREP-N-A-AVV 2 449 PREP-N-AVV-A-CONG 1
149 CONG-PRON-V-PREP 2 450 PRON-PREP-N 1
150 PRON-V-PREP-A-N 2 451 PREP-N-PRON-V-DET 1
151 PREP-V-DET-N 2 452 PREP-PRON-A-AVV-V 1
152 V-AVV-A-N 2 453 PREP-N-CONG-V 1
153 N-PREP-V-DET 2 454 A-N-PRON-V 1
154 V-CONG-A 2 455 PRON-V-AVV-A 1
155 V-N-CONG-AVV 2 456 CONG-PRON-V-DET-N 1
156 PRON-V-DET-N-A-CONG 2 457 CONG-V-N-M-N-DET 1
157 PREP-N-A-CONG 2 458 AVV-CONG-PRON-V 1
158 A-N-PRON 2 459 CONG-AVV-V-DET-N 1
159 A-N-AVV 2 460 V-DET-N-PREP-N-M 1
160 PRON-V-PREP-N 2 461 N-A-AVV-V 1
161 V-PRON-PREP 2 462 AVV-V-PREP-DET-N-A 1
162 CONG-PREP 2 463 N-PRON-AVV-V-A 1
163 N-CONG-A 2 464 N-A-PREP-V 1
164 N-A-CONG-DET 2 465 N--PRON-V-CONG 1
165 N-PREP-V-A 2 466 PRON-V-DET-A-N-CONG 1
166 N-V-AVV-A 2 467 DET-V-CONG-N-A 1
167 AVV-CONG 2 468 N-V-CONG-DET 1
168 N-PRON-V-PREP 2 469 CONG-V-N-A-PREP 1
169 V-DET-N-AVV-PRON 2 470 N-CONG-PREP-PRON-V 1
170 AVV-PRON 2 471 V-PREP-NM-CONG-AVV 1
171 N-A-V-PREP 2 472 V-AVV-N-M-N-A-PRON 1
172 V-DET-A-PREP 2 473 PREP-N-M-N 1
173 DET-N-V-A 2 474 V-AVV-N-M 1
174 V-N-CONG 2 475 V-PREP-N-A-PRON 1
175 PRON-V-AVV-PREP-N 2 476 A-CONG-PREP-NM-V-DET-N 1
176 PREP-N-CONG-PRON-V 2 477 NM-V-DET-N-PREP 1
177 CONG-AVV 2 478 NM-DET-N-V-A 1
178 V-DET-A-N 2 479 A-PRON-V 1
179 N-AVV-CONG-PREP 2 480 V-A-PREP-PRON 1
180 V-PREP-A-N-DET 2 481 PREP-PRON-DET-N 1
181 PRON-V-DET-N-CONG 2 482 V-AVV-PRON-PREP 1
182 V-A-PREP 2 483 V-PREP-N-AVV-A-CONG 1
183 AVV-A-N-PREP 2 484 V-AVV-PREP 1
184 AVV-PREP 2 485 AVV-V-DET-N-CONG-A-PREP-PRON 1
185 V-AVV-CONG 2 486 V-AVV-PREP-NM 1
186 CONG-N-PREP 2 487 CONG-PREP-NM-V 1
187 V-CONG-PREP 2 488 A-N-CONG-PREP 1
188 N-A-CONG-PREP 2 489 A-PREP-N-M-N 1
189 PREP-NM-CONG 2 490 PREP-DET-N-AVV-A 1
190 V-DET-N-A-PREP 2 491 V-PRON-DET-N 1
191 PREP-A-DET 2 492 N-PRON-V-A-AVV-PREP 1
192 PRON-PREP 2 493 V-N-M 1
193 N-CONG-V-DET 2 494 DET-N-PREP-NM-V 1
194 N-AVV-CONG 2 495 V-PREP-NM-CONG 1
195 A-N-CONG 1 496 PREP-NM-DET-N 1
196 V-AVV-DET-N 1 497 NM-A-PREP 1
197 AVV-A 1 498 PREP-NM- 1
198 AVV-PRON-V-DET-A-N 1 499 NM-N-M-PREP 1
199 AVV-A-CONG 1 500 NM 1
200 A-DET-N-V 1 501 N-M-N-DET 1
201 V-CONG-PRON-AVV-PREP 1 502 N-M-PREP 1
202 N-V-PREP-A 1 503 N-M-V-A-CONG 1
203 ESC-V-DET-N 1 504 V-DET-N-NM 1
204 PRON-V-A 1 505 PREP-NM-V 1
205 A-PRON-V-AVV 1 506 AVV-V-N-CONG-DET 1
206 PREP-PRON-N-V-DET 1 507 A-CONG-PREP-PRON-V 1
207 V-ESC-CONG-PREP-N-DET 1 508 V-CONG-PREP-NM 1
208 N-V-AVV-A-CONG 1 509 PREP-V-DET-N-A-PRON 1
209 V-CONG-N 1 510 PRON-V-AVV-A-N 1
210 PRON-V-CONG-DET-N-PREP-A 1 511 V-PREP-DET-N-CONG-AVV 1
211 N-CONG-V-PREP 1 512 AVV-V-PREP-PRON-A 1
212 N-CONG-AVV-V-PREP 1 513 PREP-DET-N-V 1
213 V-N-A-AVV-CONG 1 514 AVV-PRON-V-N-CONG-A 1
214 A-NM-V-DET-N 1 515 V-PREP-DET 1
215 PREP-A-N-PRON-V-DET 1 516 V-N-CONG-PREP-NM-PRON-AVV 1
216 NM-PREP 1 517 V-PREP-NM 1
217 N-V-CONG-PREP-DET 1 518 NM-V 1
218 PREP-NM-PRON-A-N 1 519 V-N-AVV-A 1
219 PRON-V-DET 1 520 V-A-AVV-PREP-PRON-DET 1
220 N-AVV-V-DET-A 1 521 CONG-PREP-N 1
221 N-DET-AVV-A 1 522 PREP-V-DET 1
222 V-A-PRON-AVV 1 523 NM-A-N-AVV 1
223 AVV-A-N-V 1 524 A-N-PREP-DET-N-M 1
224 DET-N--CONG 1 525 N-CONG-AVV-N-M-DET 1
225 CONG-V-PREP-N-PRON 1 526 N-PREP-PRON-AVV 1
226 A-CONG-AVV-V-DET-N-PREP 1 527 N-V-CONG-NM 1
227 N-PRON-V-A 1 528 V-NM-AVV 1
228 V-DET-A-N-PRON 1 529 AVV-V-PREP 1
229 V-ESC-PREP-A 1 530 PREP-DET-N-V-A 1
230 N-A-AVV-V-DET 1 531 N-PREP-NM-PRON-V 1
231 V-PREP-N-M-N 1 532 PREP-DET-V-AVV 1
232 PREP-N-PRON-V-AVV-CONG 1 533 PREP-V-N 1
233 AVV-PRON-V-DET 1 534 N-PREP-AVV-V 1
234 V-AVV-CONG-PRON 1 535 AVV-V-CONG 1
235 V-CONG-PRON-N 1 536 PREP-NM-A-AVV-PRON-V 1
236 N-PREP-DET-NM-CONG 1 537 V-DET-NM-PREP 1
237 A-CONG-AVV-V-PREP 1 538 NM-PREP-PRON 1
238 PREP-DET-N-CONG 1 539 PRON-V-DET-N-PREP 1
239 PREP-N-DET-A 1 540 PREP-A-N-PRON-DET 1
240 A-N-ESC-PREP-PRON 1 541 N-PRON-DET-V 1
241 ESC-PREP-PRON 1 542 V-DET-A-N-PRON-AVV 1
242 CONG-AVV-PRON 1 543 N-DET-A 1
243 V-CONG-DET-N-PREP-A 1 544 V--NM-AVV 1
244 PRON-PREP-N-A-V 1 545 DET-N-A-PREP 1
245 PRON-V-PREP-AVV 1 546 N-NM-A 1
246 V-CONG-DET-N-PREP 1 547 PREP-NM-N 1
247 V-DET-N-PREP-N-M-CONG 1 548 NM-A-CONG-DET 1
248 A-N-CONG-PRON-V 1 549 V-A-DET-N 1
249 A-N-AVV-DET 1 550 V-N-PREP-PRON 1
250 CONG-DET-NM 1 551 PRON-V-AVV-CONG-PREP-N 1
251 DET-A-N-CONG 1 552 CONG-PREP-N-PRON-V-DET 1
252 DET-N-PREP-A 1 553 A-V-AVV-PREP 1
253 N-CONG-DET-V 1 554 PREP-V-AVV 1
254 CONG-AVV-A-N-DET 1 555 PREP-PRON-N 1
255 AVV-A-V-CONG 1 556 PRON-AVV-V-N-M-N-A-PREP-CONG 1
256 PRON-AVV-V-A-N 1 557 A-N-CONG-DET 1
257 N-AVV-V-A-N-M 1 558 A-PREP-N-M 1
258 V-DET-N-AVV-A 1 559 N-CONG-AVV-V-PREP-PRON 1
259 N-AVV-A-CONG 1 560 N-PREP-NM-ESC-DET 1
260 V-DET-N-PRON-CONG-A 1 561 PREP-N-PRON-V 1
261 AVV-V-A-CONG 1 562 PRON-V-DET-A-N 1
262 PREP-DET-NM-V 1 563 A-PREP-V 1
263 N-CONG-PRON 1 564 A-ESC-N-PREP-NM-CONG 1
264 V-A--AVV 1 565 ESC-PRON 1
265 V-AVV-CONG-PREP-N 1 566 PREP-A-N-CONG-V 1
266 PREP-N-V-DET 1 567 A-CONG-AVV 1
267 A-N-PRON-V-PREP-DET 1 568 N-A-CONG-V 1
268 PRON-N-PREP 1 569 DET-V-A 1
269 N-A-V-CONG 1 570 CONG-A 1
270 A-N-AVV-V 1 571 CONG-PREP-DET-N-A 1
271 CONG-A-N 1 572 N-PRON-V-CONG 1
272 V-DET-N-PREP-PRON 1 573 A-CONG-DET-N-PRON-V 1
273 V-PRON-DET 1 574 N-V-PREP-NM-A 1
274 AVV-V-N 1 575 PREP-DET-N-A-CONG 1
275 A-AVV-PREP-DET-N-CONG 1 576 A-V-CONG-PREP 1
276 DET-N-A-CONG-PRON-V 1 577 PREP-V-PRON-A-NM 1
277 AVV-A-PREP-N 1 578 A-CONG-V-DET 1
278 N-CONG-V-A-PREP 1 579 PREP-PRON-CONG-V-A-N 1
279 V-PREP-DET-N-A-AVV 1 580 V-DET-N-A-PRON 1
280 A-CONG-PREP 1 581 PRON-V-CONG-PREP-DET 1
281 PRON-V-PREP-DET-N-A 1 582 N--AVV-V-A-PREP 1
282 PRON-V-DET-N-A-AVV 1 583 AVV-DET-N-M-PREP 1
283 N-A-PREP-PRON-V 1 584 N-M-N-AVV 1
284 A-CONG-V 1 585 AVV-N-M-N 1
285 N-PRON-AVV-V 1 586 N-V-N-M 1
286 V-CONG-N-PREP 1 587 N-PREP-CONG-DET 1
287 N-PREP-NM-CONG-DET 1 588 PRON-A-V-AVV-DET-N 1
288 DET-A-N-PREP 1 589 PREP-V-DET-N-CONG-PRON 1
289 A-PRON-V-CONG-PREP 1 590 V-N-M-N 1
290 PRON-A-N 1 591 N-M-N 1
291 A-N-PRON-V-DET 1 592 N-M-PRON-V-A 1
292 PRON-CONG-A-N-V 1 593 PRON-V-A-PREP 1
293 A-DET-NM-V 1 594 A-PREP-PRON 1
294 A-AVV-PREP-N-PRON 1 595 N-AVV-PRON-V-DET 1
295 PREP-AVV 1 596 V-CONG-PREP-A-N-M-N 1
296 PREP-A-N-CONG-V-DET 1 597 V-PREP-AVV 1
297 PREP-N-PRON-DET 1 598 CONG-N-PRON-V-PREP 1
298 AVV-N-PREP-NM-DET 1 599 PRON-V-AVV-A-N-PREP 1
299 PREP-N-A-CONG-AVV 1 600 V-DET-A-N-PREP 1
300 A-PRON-V-PREP 1 601 PREP-PRON-CONG-A-N 1
301 N-ESC-A 1 602 N-PREP-V-AVV-A 1

Fig. 5: Fitting the Zipf-Mandelbrot distribution to the data in Table 3. Both axes are logarithmic.

It goes without saying that also R-motifs, D-motifs, and maybe other variants of
motifs which are formed from categorical data can be used as the basis for form-
ing F-, L- and other kinds of motifs. The procedure can be recursively continued
until the point is reached where too few elements are left.
Thus, motifs provide a means to analyse texts with respect to their sequential
structure for all kinds of linguistic units and properties; even categor-
ical properties can be studied in this way. The granularity of an investigation can
be adjusted by iterative application of motif-formation, and proven statistical
methods can be used for the evaluation. The full potential of this approach has
not yet been explored.
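As a small illustration of this recursive procedure, the following sketch (with invented function names and toy data) first forms L-motifs, here taken as maximal runs of equal or increasing values, from a sequence of word lengths, and then forms motifs of the resulting motif lengths; the same step could be repeated as long as enough elements remain:

def l_motifs(values):
    # L-motifs: maximal runs of equal or increasing values; a decrease closes a motif.
    motifs, current = [], []
    for v in values:
        if current and v < current[-1]:
            motifs.append(current)
            current = []
        current.append(v)
    if current:
        motifs.append(current)
    return motifs

word_lengths = [1, 2, 2, 3, 1, 1, 4, 2, 2, 5, 1, 3]     # toy syllable counts
level1 = l_motifs(word_lengths)                          # motifs of word lengths
level2 = l_motifs([len(m) for m in level1])              # motifs of motif lengths
print(level1)   # [[1, 2, 2, 3], [1, 1, 4], [2, 2, 5], [1, 3]]
print(level2)   # [[4], [3, 3], [2]]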
References
Beliankou, Andrei, Reinhard Köhler & Sven Naumann. 2013. Quantitative properties of argu-
mentation motifs. In Ivan Obradović, Emmerich Kelih & Reinhard Köhler (eds.), Methods
and applications of quantitative linguistics, 33–43. Belgrade: Academic Mind.
Best, Karl-Heinz. 1997. Zum Stand der Untersuchungen zu Wort- und Satzlängen. In Third Inter-
national Conference on Quantitative Linguistics, 172–176. Helsinki.
Köhler, Reinhard. 2006. The frequency distribution of the lengths of length sequences. In Jozef
Genzor & Martina Bucková (eds.), Favete linguis. Studies in honour of Viktor Krupa, 145–
152. Bratislava: Slovak Academic Press.
Köhler, Reinhard. 2008a. Word length in text. A study in the syntagmatic dimension. In Sybila
Mislovičová (ed.), Jazyk a jazykoveda v pohybe, 416–421. Bratislava: Veda.
Köhler, Reinhard. 2008b. Sequences of linguistic quantities. Report on a new unit of investiga-
tion. Glottotheory 1(1). 115–119.
Köhler, Reinhard & Gabriel Altmann. 1996. “Language forces” and synergetic modelling of lan-
guage phenomena. In Peter Schmidt (ed.), Glottometrika 15, 63–76. Trier: WVT.
Köhler, Reinhard & Sven Naumann. 2008. Quantitative text analysis using L-, F- and T-seg-
ments. In Christine Preisach, Hans Burkhardt, Lars Schmidt-Thieme & Reinhold Decker
(eds.), Data Analysis, Machine Learning and Applications, 635–646. Berlin & Heidelberg:
Springer.
Köhler, Reinhard & Sven Naumann. 2009. A contribution to quantitative studies on the sen-
tence level. In Reinhard Köhler (ed.), Issues in Quantitative Linguistics, 34–57.
Lüdenscheid: RAM-Verlag.
Köhler, Reinhard & Sven Naumann. 2010. A syntagmatic approach to automatic text classifica-
tion. Statistical properties of F- and L-motifs as text characteristics. In Peter Grzybek, Em-
merich Kelih & Ján Mačutek (eds.), Text and Language. Structures, functions, interrela-
tions, quantitative perspectives, 81–89. Wien: Praesens.
Mačutek, Ján. 2009. Motif richness. In Reinhard Köhler (ed.), Issues in Quantitative Linguistics,
51–60. Lüdenscheid: RAM-Verlag.
Stede, Manfred. 2004. The Potsdam commentary corpus. In Bonnie Webber & Donna Byron
(eds.), Proceedings of the 2004 ACL workshop on discourse annotation, 96–102.
Wimmer, Gejza & Gabriel Altmann. 1999. Thesaurus of univariate discrete probability distribu-
tions. Essen: Stamm.
Reinhard Köhler and Arjuna Tuzzi
Linguistic Modelling of Sequential
Phenomena
The role of laws

1 Introduction
A number of textual aspects can be represented in the form of linear sequences of
linguistic units and/or their properties. Some well-known examples of simple
models of such phenomena are the dynamic variant of the type-token relation
(TTR), which represents the gradual development of the 'lexical richness' of a text,
i.e. how the number of types increases as the number of tokens increases; the
linguistic motifs, which can be applied to this and other properties of a text with
any degree of granularity; and time-series models such as ANOVA (cf. the contribu-
tions about motifs and time series in this volume).
Before conclusions are drawn on the basis of the application of a mathe-
matical model, some important issues should be taken into account. One of the
most important ones is the validity of the model, i.e. the question whether a
model does really represent what we think it does. The simplest possible quanti-
tative model of a linguistic phenomenon is a single number which results from a
measurement. Combining two or more numbers by arithmetic operations yields
an index, e.g. a quotient. Another way to reflect more than just one property in
order to represent a complex property is forming a vector, which consists of as
many numbers as dimensions are relevant. The appropriate definition of the
measure of a property is fundamental to all the rest of the modelling procedure.
Every measure can be used to represent a whole text (such as the original TTR)
or to scrutinize the dynamic behaviour of the properties under study from text
position to text position, adding a temporal, or more generally, a sequential
dimension to the model. The validity of the result depends, of course, on the
validity of the basic measure(s) and on the validity of the mathematical function
which is set up as a model of its dynamic development. A function which, e.g.
increases boundlessly with text position cannot serve as a valid model of any
text property because there are no infinitely long texts and because there is no
linguistic property which would yield infinite values if measured in an ap-
propriate way. The same is true of a measure such as the entropy, which gives a
reliable estimation of the 'true' value of a property only for an infinitely long
sequence of symbols. In general, the domain of the function must agree with the
range and the possible values of the linguistic property. The appropriateness of
ANOVA models should also be checked carefully; it is not obvious that this kind
of model correctly reflects linguistic reality. Sometimes, categorical data are
represented by numbers, e.g., when stressed and unstressed syllables are
mapped to the numbers "1" and "0" without any explication of why they are not
mapped the other way round, or to "4" and "-14,000.22", and of why it is correct
to calculate with these numbers, which still represent the non-numerical
categories "stressed" and "unstressed".
Besides validity, every model should be checked also for some other im-
portant properties: reliability, interpretability, and simplicity (cf. Altmann 1978,
1988). Moreover, appropriateness of the scale level of the measure should be
considered, i.e. whether the property to be modelled is measured on a nominal,
an ordinal, or one of the metrical scales. The choice of the scale level decides on
the mathematical operations which are applicable on the measured data and
hence also which kind of mathematical function would agree with the numbers
obtained as results of the measurement.
Another aspect which must not be neglected when a combination of proper-
ties is used to form an index or a vector is the independence of the component
variables. An example may illustrate this issue: The most commonly used index
of reading ease (the 'Flesch formula') for English texts consists of a linear combi-
nation of sentence length (SL) and word length (WL)

F = (0.39 \cdot SL) + (11.8 \cdot WL) - 15.59

This may look like a good idea until it turns out that word length is an indi-
rect function of sentence length, as follows from the well-established Menzerath-
Altmann law (Altmann 1980, 1983b; Cramer 2005), which has been empirically con-
firmed over and over again. Thus, this index combines sentence length with
sentence length, which is certainly at least redundant.
It is not too difficult to set up a measure of a linguistic or textual property,
however, many simple measures and also more complex indexes suffer from
problems of various kinds. Some of these problems are well-known in the scien-
tific community such as the dependence of the type-token relation on text
length. Therefore, attempts are made (Wimmer & Altmann 1999, Köhler & Galle 1993,
Kubát & Milička 2013, Covington & McFall 2010) to find a method to circumvent
this problem because this measure is easy to apply and hoped to be useful to
differentiate text sorts or to serve as a technique for authorship attribution and
for other stylistic tasks. Another good reason not to drop this measure is the fact
that not many measures of dynamic, sequential text phenomena are known.
2 The role of laws for text analysis


What frequently remains problematic even if problems such as the above-men-
tioned one could be solved is the lack of knowledge of the statistical properties
of a measure. An objective evaluation of a measurement, of the difference be-
tween two numbers, of the agreement between a measurement and previous
expectations or of the parameters of a model, however, cannot be obtained
without the knowledge of the theoretical probability distribution of the meas-
ure. Otherwise, the researcher cannot be sure whether e.g., a number is signifi-
cant or could be the result of a random effect. Valid and reliable, theoretically
justified models are indispensable also when linguistic objects are to be charac-
terised without any significance requirements. Researchers who are interested
in quantitative text analysis but without much linguistic background tend to
use methods and measures they are familiar with, e.g. from information theory,
mechanics, statistics or quantum physics. These methods may sometimes work
for practical purposes even if one cannot be sure whether they really represent
the properties the researcher aims at. However, very often, they are inappropri-
ate when applied to language and make unrealistic assumptions: infinite length
of texts, infinite size of lexicons, normal distribution of the data etc. Moreover,
they are rarely interpretable in linguistic terms. For a linguistically meaningful
and valid analysis of linguistic objects, linguistic models are required. The most
reliable linguistic models are, of course, laws of language and text.
Laws are the principal components of a theory – without laws, there is no
theory. Furthermore, laws are the only statements which can explain and pre-
dict facts, and new phenomena and interrelations can be discovered by logical
deduction of consequences from laws, whereas rules and classifications are
descriptive statements, which do not entail anything. All this is well-known
from the philosophy of science (cf. e.g., Bunge 1967, 2005). Nevertheless, when
practitioners in computational or corpus linguistics learn about the existence of
laws not only in the natural sciences but also in language and text, they often ask
whether there is any benefit or application affecting their work. We will illus-
trate the usefulness of linguistic laws on a practical example.

3 Hypothesis and method


The application presented here is inspired by previous work (Trevisani & Tuzzi
2012, 2013). The corpus examined in previous studies included 63 end-of-year
messages delivered by all Presidents of the Italian Republic over the period from
1949 to 2011. Main aims of these studies were identifying a specific temporal
pattern, i.e. a curve, for each word and clustering curves portraying similar
temporal patterns. The authors proposed a flexible wavelet-based model for
curve clustering in the frame of functional data analysis approaches. However,
clear specific patterns could not be observed as each word possesses its own
irregular series of frequency values (cf. Fig. 1). In their conclusions the authors
highlighted: some critical points in representing the temporal pattern of words
as functional objects, the weaknesses of an explorative approach and the lack of
a linguistic theory to justify and interpret such a complex and extremely sophis-
ticated model.

Fig. 1: Temporal patterns of six words (taken from Trevisani & Tuzzi 2013)

In the present paper, we will emphasize the linguistic point of view: The
material under study is not just a corpus or a set of texts but a sequence of texts
representing an Italian political-institutional discourse. The temporal trajectory
of the frequency of a word can thus be considered as an indicator of the com-
municative relevance of a concept at a given point in time. We will not abandon
the original idea of characteristic patterns but instead of expecting a general
time-depending behaviour we will set up the specific hypothesis that the tem-
poral behaviour of the frequency of a word is discourse-specific. A ready-made
model of this kind of phenomenon is not available but there is an established
law of a related phenomenon: the logistic function, in linguistics called the
Piotrowski (or Piotrowski-Altmann) law. It does not make any statements about
frequency but about the dispersal of units or properties over a community along
the time axis. We will argue in the following way: degree of dispersion and
frequency are, with respect to communication systems, two sides of the same coin. The
more frequently a unit is used the higher the probability to make the unit more
familiar to other members of the community – and vice versa: the greater the
degree of dispersion the higher the probability of occurrence. We will therefore
adopt this law for our purposes and assume that it is a good model of the
dynamics of the frequencies of words within a discourse.
The basic form of the Piotrowski-Altmann law (Altmann 1983a) is given as
p_t = \frac{C}{1 + a e^{-bt}} \qquad (1)
Altmann (1983a) also proposed a modified version for the case of develop-
ments where the parameter b is a function of time (2):
p_t = \frac{1}{1 + a e^{-bt + ct^2}} \qquad (2)
We adopted the law in form of function (3), which has an additional
parameter because we cannot expect that a word approaches probability 1:
p_t = \frac{C}{1 + a e^{-bt + ct^2}} \qquad (3)
Here, the dependent variable pt represents the relative probability of a word
at time t. On the basis of a plausible, linguistically motivated model, we can
further assume that the data, which seem, at first glance, to disagree with any
assumption of regularity, display a statistical spread around the actual trend.
Therefore, the data, i.e. the relative frequencies, are smoothed in order to ena-
ble us to detect the trends and to use them for testing our hypothesis. As
smoothing technique, moving averages of interval size 7 were applied: The inter-
val size was chosen in a way which yielded a sufficiently smooth sequence of
values and, on the other hand, kept as many individual values as possible. The
smoothing method works as follows. The first window starts with the first pair
of (y, x) – or, in our case, of (p_t, t) – values and ends with the 7th. The next
window starts with the second pair, i.e. (p_{t+1}, t+1), and ends with the pair
(p_{t+7}, t+7), etc., until the last window, which ends with the pair (p_{t+n}, t+n),
is reached. For each window, the mean of the 7 values p_{t+i} is calculated. The
means of all the windows form the new data.
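A minimal sketch of both steps, the moving-average smoothing and a least-squares fit of function (3), might look as follows. The numpy/scipy routines and the invented toy series are assumptions of this illustration, not the software or the data actually used in the study:

import numpy as np
from scipy.optimize import curve_fit

def moving_average(values, window=7):
    # means of all overlapping windows of the given size, one value per window
    values = np.asarray(values, dtype=float)
    return np.array([values[i:i + window].mean()
                     for i in range(len(values) - window + 1)])

def piotrowski(t, a, b, c, C):
    # function (3): p_t = C / (1 + a * exp(-b*t + c*t**2))
    return C / (1.0 + a * np.exp(-b * t + c * t ** 2))

# toy series: 63 yearly relative frequencies generated from the model plus noise
rng = np.random.default_rng(0)
t_full = np.arange(63) / 10                       # rescaled time axis (cf. Table 1)
p_full = piotrowski(t_full, a=4.0, b=1.0, c=0.05, C=3.0)
p_full = p_full + rng.normal(0.0, 0.2, size=p_full.size)

p_smooth = moving_average(p_full, window=7)       # 57 smoothed values
t_smooth = t_full[:p_smooth.size]
params, _ = curve_fit(piotrowski, t_smooth, p_smooth,
                      p0=[1.0, 1.0, 0.1, p_smooth.max()], maxfev=20000)
print(dict(zip(["a", "b", "c", "C"], params)))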
4 Results
Not surprisingly, many words do not display a specific trajectory over time.
Word usage depends to a great extent on grammatical and stylistic circumstances
– in particular the function words – which should not change along the political
or social development in a country. There are, however, function words which,
or whose forms, change over time. One example was described by Best and
Kohlhase (1983): The German word-form ward "became" was replaced by wurde
within the time period from 1445 to 1925 following the function (1). In our data, a
similar process can be observed: The frequency of the adverb ci (at the cost of vi)
is shown in Fig. 2.

Fig. 2: The increase of the frequency of "ci". Smoothing averages with window size 20.

Also content words often show an irregular behaviour with respect to their
frequency because their usage depends on unpredictable thematic circum-
stances. On the other hand, we assume that there are also many words which
reflect to some extent the relevance of a concept within the dynamics of the
discourse. We selected some words to illustrate this fact. Figures 3 and 4 and the
corresponding data in Table 1 show examples of words which follow the typical
logistic growth function. The part-of-speech tags which are attached to the
words in the captions have the following meanings: "N" stands for noun and
"NM" for proper name.
Fig. 3 & 4: The increase of the frequency of "Europa" and "storia". Smoothing averages with
window size 7.
Fig. 5-7: Temporal behaviour of the frequencies of selected content words in the sequence of
the presidential speeches.
Fig. 8: Temporal behaviour of the frequencies of selected content words in the sequence of the
presidential speeches [continued]

Figures 5-8 show graphs of words where only the rapidly increasing branch or
the decreasing branch of a reversible development of the curve occurs while
Figures 9 and 10 are cases of fully reversible developments.

Fig. 9: Temporal behaviour of the frequencies of selected content words in the sequence of the
presidential speeches – fully reversible developments
Fig. 10: Temporal behaviour of the frequencies of selected content words in the sequence of
the presidential speeches – fully reversible developments
Table 1: The (smoothed) sequences of relative frequency values of some words and the results of fitting function (3) to the data. The numbers which
represent the years have been transformed ((year-1948)/10) in order to keep the numerical values small enough to be calculated as arguments of the
exponential function.

t Europa storia valore futuro progresso problema terrorismo violenza


pt pt pt pt pt pt pt pt
0 0.4296455 0 0.2148228 0.2148228 2.4899569 1.9641072 0 0.4296455
0.1 0.4296455 0 0.4676673 0.341245 1.8800021 1.9862636 0 0.4296455
0.2 0.4296455 0.1612383 0.7901439 0.341245 2.5249553 2.147502 0 0.4296455
0.3 0.4296455 0.1612383 1.200064 0.341245 2.3137574 2.9673421 0 0.6346056
0.4 0.4296455 0.1612383 1.200064 0.341245 2.4914404 3.2353556 0 0.6346056
0.5 0.4296455 0.2753415 1.200064 0.341245 2.947853 4.0340776 0 0.6346056
0.6 0.4296455 0.2753415 1.200064 0.5348184 2.9791225 4.4212243 0 0.6346056
0.7 0.4296455 0.2753415 1.3352174 0.5348184 2.8812408 4.4212243 0 0.6346056
0.8 0 0.2753415 1.1203946 0.3199956 2.6664181 4.2987985 0 0.20496
0.9 0.1356668 0.2753415 0.8675501 0.1935734 2.6756626 3.540265 0 0.20496
1 0.1356668 0.2332501 0.5450735 0.1935734 2.2690032 3.6173205 0 0.20496
1.1 0.6767923 0.2332501 0.1351534 0.1935734 2.2649273 2.7974803 0 0
1.2 0.6767923 0.3549342 0.1351534 0.1935734 2.4522966 2.2181187 0 0
1.3 0.8571675 0.240831 0.225341 0.283761 2.2664468 2.231085 0 0.450938
1.4 0.9312251 0.3148886 0.4475138 0.1642452 1.8338419 2.658572 0 0.7471684
1.5 0.9312251 0.3148886 0.3123604 0.1642452 2.1087914 3.2038283 0 0.7471684
1.6 0.9312251 0.3148886 0.3123604 0.1642452 2.6675546 3.4553719 0 0.7471684
1.7 0.9098441 0.3148886 0.4266462 0.1642452 2.5318878 3.6839433 0 0.9757398
1.9 0.3687185 0.3033149 0.4266462 0.1642452 1.8877499 3.9467174 0 1.2616615
2 0.3687185 0.1816308 0.4266462 0.2688258 1.6272782 4.4012067 0.1045806 1.4708228
2.1 0.3664694 0.2706938 0.5145847 0.1786382 1.6239046 4.2129598 0.1936437 1.1980109
2.2 0.4837811 0.1966362 0.2924118 0.1045806 1.4757893 3.4940106 0.5763822 1.3802036
2.3 0.6699546 0.258694 0.2924118 0.1045806 0.9305331 2.9487543 1.3210762 1.5043192
2.4 0.6699546 0.258694 0.2924118 0.1045806 0.4768118 2.4950331 2.0563704 1.6093613
2.5 0.6063634 0.3600831 0.1781261 0.1045806 0.5275063 2.3171562 2.715399 1.5328734
2.6 0.663828 0.4175477 0.1781261 0.1045806 0.5275063 2.8283836 3.8072276 1.3545249
2.7 0.8162903 0.3099746 0.1781261 0.1045806 0.5275063 2.5056641 4.0740365 1.2469517
2.8 1.2427295 0.3099746 0.1781261 0 0.4229257 1.5644385 3.9694559 1.0377905
2.9 1.1251618 0.2814699 0.0605583 0.0605583 0.3979699 1.0621138 4.1226262 1.2230144
3 0.9337925 0.2814699 0.1664569 0.0605583 0.5038685 2.0254146 3.7398877 0.7445913
3.1 0.815939 0.287732 0.2347769 0.1288784 0.6405085 2.3670147 3.0635137 0.6204756
3.2 0.8758372 0.3476302 0.3545733 0.1288784 0.715161 2.9208526 2.3282196 0.5753318


3.3 1.9458839 0.5451055 0.9523019 0.2035944 1.0380469 2.8701581 1.6691909 0.4979643
3.4 2.4434529 0.7011154 0.9523019 0.2035944 1.0807418 2.4367516 0.5773624 0.5406592
3.5 2.2909907 0.7011154 0.9523019 0.2035944 1.0807418 2.4367516 0.3105534 0.5406592
3.6 2.5860522 0.8041869 1.0553734 0.2035944 1.0807418 2.5398231 0.3620892 0.6952665
3.7 2.6712168 0.8407772 1.1405381 0.1916104 0.8385084 2.4187064 0.1684301 0.3804908
3.8 2.7108443 0.8804047 1.1535221 0.1916104 0.7722373 1.7956234 0.1684301 0.4201183
3.9 2.8790428 0.9134498 1.1189904 0.1232904 0.6693857 1.4878117 0.1001101 0.4201183
4 3.0246945 1.1276181 1.2047439 0.2603237 0.4896911 1.1029983 0.1001101 0.56577
4.1 2.1604293 0.8287537 0.6640099 0.1856076 0.1161108 1.3594743 0.1001101 0.5195512
4.2 1.8557085 0.7225562 0.6640099 0.1856076 0.0734159 1.1390639 0.1001101 0.4768563
4.3 2.5187896 0.8699075 0.885037 0.3329589 0.2944429 1.4337666 0.1001101 0.4768563
4.4 2.4170595 0.9217787 1.0918508 0.3329589 0.4493856 1.330695 0.1260457 0.322249
4.5 2.6806693 1.0975185 0.9461278 0.3526067 0.654052 1.3989172 0.1456935 0.3418968
4.6 3.1786039 1.3266721 0.8272452 0.5541925 0.6144245 1.0302099 0.280084 0.4366599
4.7 3.3993747 1.225307 1.0678304 0.7371082 0.8550096 1.2707951 0.4629997 0.4366599
4.8 3.6684333 1.1885448 0.8622806 0.6791764 0.934111 1.1549315 0.6212025 0.23111
4.9 3.5319045 1.1885448 1.2850748 0.7991236 1.0540582 1.0184027 0.6212025 0.2026127
5 3.7355181 1.1461145 1.4796147 0.9288169 1.1189049 1.1123369 0.8157424 0.2674593
5.1 3.3911367 1.0784381 1.3382626 0.7814655 0.8978778 1.056659 0.8954173 0.2674593
5.2 3.0216998 1.00694 1.4456002 1.0317993 0.7429352 1.2235482 0.9848352 0.3509039
5.3 2.7369698 0.7340515 1.7571073 1.2127828 0.6005702 1.4045317 0.9789145 0.3449832
5.4 2.5425393 0.522459 1.9286731 1.5258945 0.6577588 1.5660909 0.8445239 0.2677813
5.5 2.3872737 0.5828637 1.7147043 1.5845977 0.5041947 1.4125268 0.722013 0.2677813
5.6 2.07488 0.3996311 1.8228475 1.6136394 0.4250933 1.4165387 0.5638102 0.3759244
a 4.069 3.0989 1.0116 8.4092 57069.8938 43788.8045 62452861.8 2222666.92
b -0.9838 -0.8346 -1.1818 -1.2552 -0.2303 0.1958 11.3231 1.6940
c -0.4588 -0.3999 -0.2691 -0.2775 0.0338 0.0802 2.0271 0.3525
C 2.9841 0.96890 2.0321 4.4388 163948.013 128716.588 31.1565 337354.494
R2 0.8729 0.7581 0.6155 0.9363 0.7419 0.6253 0.7908 0.5609
5 Conclusions
The presented study is an illustration of the fact that data alone do not give an
answer to a research question. Only a theoretically grounded hypothesis, tested
on appropriate data, produces new knowledge.
We assume that the individual kinds of dynamics in fact reflect the rele-
vance of the corresponding concepts in the political discourse but we are not
going to propose political interpretations of the findings. In a follow-up study,
the complete vocabulary of the presidential discourse will be analysed, and on
this basis, it will be possible to find out whether conceptually related words
follow similar temporal patterns.

Acknowledgments
The authors would like to thank IQLA for providing data for this study.

References
Altmann, Gabriel. 1978. Zur Verwendung der Quotienten in der Textanalyse. Glottometrika 1. 91-
106.
Altmann, Gabriel. 1980. Prolegomena to Menzerath’s law. Glottometrika 2(2). 1-10.
Altmann, Gabriel. 1983a. Das Piotrowski-Gesetz und seine Verallgemeinerungen. In Karl-Heinz
Best & Jörg Kohlhase (eds.), Exakte Sprachwandelforschung, 59-90. Göttingen: edition
herodot.
Altmann, Gabriel. 1983b. H. Arens’ „Verborgene Ordnung“ und das Menzerathsche Gesetz. In
Manfred Faust, Roland Harweg, Werner Lehfeldt & Wienold Götz (eds.), Allgemeine
Sprachwissenschaft, Sprachtypologie und Textlinguistik, 31-39. Tübingen: Gustav Narr.
Altmann, Gabriel. 1988. Linguistische Meßverfahren. In Ulrich Ammon, Norbert Dittmar & Klaus
J. Mattheier (eds.), Sociolinguistics. Soziolinguistik, 1026-1039. Berlin, New York: Walter
de Gruyter.
Bunge, Mario. 1967. Scientific Research I, II. Berlin, Heidelberg, New York: Springer.
Bunge, Mario. 1998. Philosophy of science. From problem to theory. New Brunswick, London:
Transaction Publishers.
Bunge, Mario. 2007 [1998]. Philosophy of science. From explanation to justification, 4th edn.
New Brunswick, London: Transaction Publishers.
Covington, Michael A. & Joe D. McFall. 2010. Cutting the Gordian Knot: The Moving-Average
Type-Token Ratio (MATTR). Journal of Quantitative Linguistics 17(2). 94-100.
Cramer, Irene. 2005. Das Menzerathsche Gesetz. In Reinhard Köhler, Gabriel Altmann &
Rajmond G. Piotrowski (eds.), Quantitative Linguistik. Ein internationales Handbuch.
Quantitative Linguistics. An International Handbook, 659-688. Berlin, New York: Walter de
Gruyter.
Köhler, Reinhard & Matthias Galle. 1993. Dynamic aspects of text characteristics. In Ludĕk
Hřebíček & Gabriel Altmann (eds.), Quantitative text analysis (Quantitative Linguistics 52),
46-53. Trier: Wissenschaftlicher Verlag.
Kubát, Miroslav & Jiří Milička. 2013. Vocabulary Richness Measure in Genres. Journal of
Quantitative Linguistics 20(4). 339-349.
Trevisani, Matilda & Arjuna Tuzzi. 2012. Chronological analysis of textual data and curve
clustering: preliminary results based on wavelets. In Società Italiana di Statistica,
Proceedings of the XLVI Scientific Meeting. Padova: CLEUP.
Trevisani, Matilda & Arjuna Tuzzi. 2013. Shaping the history of words. In Ivan Obradović,
Emmerich Kelih & Reinhard Köhler (eds.), Methods and Applications of Quantitative
Linguistics: Selected papers of the VIIIth International Conference on Quantitative
Linguistics (QUALICO), Belgrade, Serbia, April 16-19, 2012, 84-95. Belgrade, Serbia:
Akademska Misao.
Wimmer, Gejza & Gabriel Altmann. 1999. On Vocabulary Richness. Journal of Quantitative
Linguistics 6(1). 1-9.
Ján Mačutek and George K. Mikros
Menzerath-Altmann Law for Word Length
Motifs

1 Introduction
Motifs are relatively new linguistic units which make possible an in-depth inves-
tigation of sequential properties of texts (for the general definition cf. Köhler, this
volume, pp. 89-90). They were studied in a handful of papers (Köhler 2006,
2008a,b, this volume, pp. 89-108; Köhler and Naumann 2008, 2009, 2010, Maču-
tek 2009, Sanada 2010, Milička, this volume, pp. 133-145). Specifically, a word
length motif is a continuous series of equal or increasing word lengths (measured
here in the number of syllables, although there are also other options, like, e.g.,
morphemes).
In the papers cited above it is supposed that motifs should have properties
similar to those of their basic units, i.e., words in our case. Indeed, word fre-
quency and motif frequency, as well as word length (measured in the number of
syllables) and motif length (measured in the number of words) can be modelled
by the same distributions (power laws, like, e.g., the Zipf-Mandelbrot distribu-
tion, and Poisson-like distributions, respectively; cf. Wimmer and Altmann 1999).
Also the type-token relations for words and motifs display similar behaviour, dif-
fering only in parameter values, but not in models.
We enlarge the list of analogous properties of motifs and their basic units,
demonstrating (cf. Section 3.1) that for word length motifs also the Menzerath-
Altmann law (cf. Cramer 2005; MA law henceforth) is valid. The MA law describes
the relation between sizes of the construct, e.g., a word, and its constituents, e.g.,
syllables. It states that the larger the construct (the whole), the smaller its con-
stituents (parts). In particular, for our data it holds that the longer the motif (in the
number of words), the shorter the mean length of words (in the number of sylla-
bles) which constitute the motif. In addition, in Section 3.2 we show that for ran-
domly generated texts the MA law is valid as well, but its parameters differ from
those obtained from real texts.
2 Data
In order to study the MA law for word length motifs we compiled a Modern
Greek literature corpus totaling 236,233 words. The corpus contains complete
versions of literary texts from the same time period and has been strictly con-
trolled for editorial normalization. It contains five novels from four widely
known Modern Greek writers published from the same publishing house
(Kastaniotis Publishing House). All the novels were best-sellers in the Greek
market and belong to the “classics” of the Modern Greek literature. More specif-
ically, the corpus consists of:
– The mother of the dog, 1990, by Matesis (47,852 words).
– Murders, 1991, by Michailidis (72,475 words).
– From the other side of the time, 1988, by Milliex [1] (77,692 words).
– Dreams, 1991, by Milliex [2] (9,761 words) - Test novel.
– The dead liqueur, 1992, by Xanthoulis (28,453 words).

The basic descriptive statistics of the corpus appear in Table 1:

Table 1: Basic descriptive statistics of the data.

Authors Matesis Michailidis Milliex [1] Milliex [2] Xanthoulis


number of words 47,852 72,475 77,692 9,761 28,453
number of motifs 19,283 29,144 32,034 4,022 11,236
different motifs 316 381 402 192 289
mean word length in syllables 2.09 2.03 2.10 2.13 2.07
mean motif length in words 2.48 2.49 2.43 2.43 2.53
mean motif length in syllables 5.20 5.05 5.09 5.17 5.23

3 Results

3.1 MA law for word length motifs in Modern Greek texts

The results obtained confirmed our expectation that the MA law should be valid
also for word length motifs. The tendency of mean word length (measured in the
number of syllables) to decrease with the increasing motif length (measured in
the number of words) is obvious in all five texts investigated, cf. Table 2.
We modelled the relation by the function
y(x) = \alpha x^b \qquad (1)

where y(x) is the mean length of words which occur in motifs consisting of x
words; \alpha and b are parameters. Given that

y(1) = \alpha,

we replaced \alpha with the mean length of words from motifs of length 1, i.e., motifs
consisting of one word only (cf. Kelih 2010, Mačutek and Rovenchak 2011). In or-
der to avoid too strong fluctuations, only motif lengths which appeared in partic-
ular texts at least 10 times were taken into account (cf. Kelih 2010). The appropri-
ateness of the fit was assessed in terms of the determination coefficient R2 (values
higher than 0.9 are usually considered satisfying, cf., e.g., Mačutek and Wimmer
2013). The numerical results (values of R2 and parameter values for which R2 reaches
its maximum) are presented in Table 2.
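For readers who wish to redo this kind of fit, a minimal sketch follows. It assumes that the motifs have already been extracted as lists of syllabic word lengths (a motif boundary lying wherever word length decreases) and that b is negative, as in Table 2; all names are illustrative:

import numpy as np
from scipy.optimize import minimize_scalar

def ma_law_fit(motifs, min_count=10):
    # Fit y(x) = alpha * x**b with alpha fixed to the mean word length of
    # one-word motifs; return alpha, b and the determination coefficient R^2.
    by_length = {}
    for m in motifs:                              # m = list of syllable counts
        by_length.setdefault(len(m), []).append(np.mean(m))
    pairs = [(x, np.mean(v)) for x, v in sorted(by_length.items())
             if len(v) >= min_count]              # discard rare motif lengths
    x = np.array([p[0] for p in pairs], dtype=float)
    y = np.array([p[1] for p in pairs])
    alpha = dict(pairs)[1]                        # alpha = y(1)
    sse = lambda b: np.sum((y - alpha * x ** b) ** 2)
    b = minimize_scalar(sse, bounds=(-2.0, 0.0), method="bounded").x
    r2 = 1 - sse(b) / np.sum((y - y.mean()) ** 2)
    return alpha, b, r2

# usage (toy): ma_law_fit([[1, 2, 3], [2], [1, 1, 4], ...])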

Table 2: Fitting function (1) to the data. ML – motif length, MWL0 – observed mean word length,
MWLt – theoretical mean word length resulting from (1).

Matesis Michailidis Milliex 1 Milliex 2 Xanthoulis


ML MWL0 MWLt MWL0 MWLt MWL0 MWLt MWL0 MWLt MWL0 MWLt
1 2.30 2.30 2.32 2.32 2.34 2.34 2.39 2.39 2.42 2.42
2 2.13 2.15 2.08 2.13 2.13 2.18 2.16 2.22 2.13 2.19
3 2.08 2.07 2.00 2.03 2.08 2.09 2.11 2.12 2.02 2.06
4 2.04 2.01 1.97 1.96 2.04 2.02 2.09 2.05 2.01 1.97
5 1.96 1.97 1.90 1.90 1.99 1.98 2.01 2.01 1.96 1.91
6 1.94 1.94 1.88 1.86 1.94 1.94 1.94 1.97 1.92 1.86
7 1.98 1.91 1.87 1.83 1.94 1.91 1.96 1.88 1.74 1.82
8 1.81 1.88 1.69 1.80 1.84 1.88
9 1.84 1.77

b=−0.096 b=−0.123 b=−0.105 b=−0.109 b=−0.147


R2=0.9195 R2=0.9126 R2=0.9675 R2=0.9582 R2=0.9312

3.2 MA law for word length motifs in random texts

The results presented in the previous section show that longer motifs contain
shorter words and vice versa. The relation between the lengths (i.e., the MA law)
can be modelled by a simple power function. One cannot, however, a priori ex-
clude the possibility that the observed regularities are necessary in the sense that
they could be only a consequence of some other laws. In this particular case, it
seems reasonable to ask whether MA law remains valid if the distribution of word
length is kept, but the sequential structure of word length is deliberately forgot-
ten.
Randomization (i.e., random generating of texts – or, only some properties
of texts – by means of computer programs) is a useful tool for finding answers to
questions of this type. It is slowly finding its way to linguistic research (cf., e.g.,
Benešová and Čech, this volume, pp. 57-69, and Milička, this volume, pp. 133-145
for other analyses of the MA law; Liu and Hu 2008 applied randomization to re-
fute claims that small-world and scale-free complex language networks automat-
ically give rise to syntax).
In order to get rid of the sequential structure of word lengths, while at the
same time preserving the word length distribution, we generated random num-
bers from the distribution of word length in each of the five texts under our inves-
tigation. The number of generated random word lengths is always equal to the
text length of the respective real text (e.g., we generated 47,852 random word
lengths for the text by Matesis, as it contains 47,852 words, cf. Table 1). Then, we
fitted function (1) to the randomly generated data. The outcomes of the fitting can
be found in Table 3. The generated data were truncated at the same points as their
counterparts from real texts, cf. Section 3.1.
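The randomization step itself takes only a few lines. The sketch below (illustrative names, numpy assumed) draws as many word lengths as the original text contains, with probabilities equal to the observed proportions, thereby destroying the sequential structure while preserving the word length distribution:

import numpy as np

def randomize_word_lengths(word_lengths, seed=0):
    # sample from the empirical word length distribution of the text
    lengths, counts = np.unique(word_lengths, return_counts=True)
    probs = counts / counts.sum()
    rng = np.random.default_rng(seed)
    return rng.choice(lengths, size=len(word_lengths), p=probs)

The generated sequence can then be segmented into motifs and fitted in exactly the same way as the real text.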

Table 3: Fitting function (1) to the data. ML – motif length, MWLr – mean word length from ran-
domly generated data, MWLf – fitted values resulting from (1).

Matesis Michailidis Milliex 1 Milliex 2 Xanthoulis


ML MWLr MWLf MWLr MWLf MWLr MWLf MWLr MWLf MWLr MWLf
1 2.41 2.41 2.36 2.36 2.44 2.44 2.42 2.42 2.42 2.42
2 2.27 2.19 2.20 2.13 2.28 2.20 2.29 2.20 2.26 2.18
3 2.11 2.07 2.06 2.01 2.15 2.07 2.15 2.08 2.12 2.05
4 2.02 1.99 1.96 1.93 2.02 1.99 2.07 2.00 1.98 1.97
5 1.94 1.93 1.87 1.87 1.94 1.92 1.97 1.94 1.89 1.90
6 1.86 1.88 1.84 1.82 1.83 1.87 1.86 1.89 1.82 1.85
7 1.81 1.84 1.77 1.78 1.82 1.83 1.72 1.85 1.76 1.81
8 1.78 1.81 1.69 1.74 1.73 1.79
9 1.66 1.71

b=−0.138 b=−0.146 b=−0.148 b=−0.137 b=−0.150


R2=0.9685 R2=0.9682 R2=0.9553 R2=0.8951 R2=0.9595

It can be seen that the MA law holds also in this case; however, parameters b in
random texts are different from the ones from real texts. The parameters in the
random texts have always larger absolute values, which means that the respec-
tive curves are steeper, i.e., they decrease more quickly.
As an example, in Fig. 1, we present data from the first text (by Matesis, cf. Section
2) together with the mathematical model (1) fitted to the data.

Fig. 1: Data from the text by Matesis (circles) and from the respective random text (diamonds),
together with fitted models (dashed line – the model for the real text, solid line – the model for
the random text)

The other texts behave similarly.


The validity of the law in random texts (created under the condition that their
word length distribution is the same as in the real ones) can be deductively ex-
plained directly from the definition of word length motifs (cf. Section 1). A word
length motif is ended at the place where the sequence of word lengths decreases.
Hence, if a long word appears in a motif, it is likely to be the last element of the
motif (the probability that the next word would be of the same length – or even
longer – is small, because long words occur relatively seldom). Consequently, if
a long word appears soon, the motif tends to be short in terms of words, but the
mean word length in such a motif will be high (because the length of one long
word will have a higher weight in a short motif than in a long one).
Differences in values of parameters b in real and randomized texts indicate
that, regardless of the obvious impact of the word length distribution, also the
sequential structure of word lengths plays an important role in the MA law for
word length motifs.

4 Conclusions
The paper brings another confirmation that word length motifs behave in the
same way as other, more “traditional” linguistic units. In addition to the motif
frequency distribution and the distribution of motif length (which were shown to
follow the same patterns as words), also the MA law is valid for word length mo-
tifs; specifically, the more words a motif contains, the shorter the mean syllabic
length of the words in the motif.
The MA law can be observed also in random texts, if the word length distribu-
tions in a real text and in its random counterpart are the same. The validity of the
law in random texts can be explained deductively from word length distribution.
However, parameters in the exponents of the power function which is a mathe-
matical model of the MA law are different for real and random texts. The power
functions corresponding to random texts are steeper. The difference in parameter
values proves that not only word length distribution, but also the sequential
structure of word lengths has an impact on word length motifs.
It remains an open question whether parameters of the MA law can be used
as characteristics of languages, genres or authors. If the answer is positive,
they could possibly be applied to language classification, authorship attribution
and similar fields.

Acknowledgments
J. Mačutek was supported by VEGA grant 2/0038/12.

References
Benešová, Martina & Radek Čech. 2015. Menzerath-Altmann law versus random models. This
volume, pp. 57-69.
Cramer, Irene M. 2005. Das Menzerathsche Gesetz. In Reinhard Köhler, Gabriel Altmann &
Rajmund G. Piotrowski (eds.), Quantitative Linguistics. An international handbook, 659–
688. Berlin & New York: de Gruyter.

Liu, Haitao & Fengguo Hu. 2008. What role does syntax play in a language network? EPL 83.
18002.
Kelih, Emmerich. 2010. Parameter interpretation of the Menzerath law: Evidence from Serbian.
In Peter Grzybek, Emmerich Kelih & Ján Mačutek (eds.), Text and language. Structures,
functions, interrelations, quantitative perspectives, 71–79. Wien: Praesens.
Köhler, Reinhard. 2006. The frequency distribution of the lengths of length sequences. In Jozef
Genzor & Martina Bucková (eds.), Favete linguis. Studies in honour of Viktor Krupa, 145–
152. Bratislava: Slovak Academic Press.
Köhler, Reinhard. 2008a. Word length in text. A study in the syntagmatic dimension. In Sybila
Mislovičová (ed.), Jazyk a jazykoveda v pohybe, 416–421. Bratislava: Veda.
Köhler, Reinhard. 2008b. Sequences of linguistic quantities. Report on a new unit of investiga-
tion. Glottotheory 1(1). 115–119.
Köhler, Reinhard. 2015. Linguistic motifs. This volume, pp. 89-108.
Köhler, Reinhard & Sven Naumann. 2008. Quantitative text analysis using L-, F- and T-seg-
ments. In Christine Preisach, Hans Burkhardt, Lars Schmidt-Thieme & Reinhold Decker
(eds.), Data analysis, machine learning and applications, 635–646. Berlin & Heidelberg:
Springer.
Köhler, Reinhard & Sven Naumann. 2009. A contribution to quantitative studies on the sen-
tence level. In Reinhard Köhler (ed.), Issues in quantitative linguistics, 34–57.
Lüdenscheid: RAM-Verlag.
Köhler, Reinhard & Sven Naumann. 2010. A syntagmatic approach to automatic text classifica-
tion. Statistical properties of F- and L-motifs as text characteristics. In Peter Grzybek, Em-
merich Kelih & Ján Mačutek (eds.), Text and language. Structures, functions, interrela-
tions, quantitative perspectives, 81–89. Wien: Praesens.
Mačutek, Ján. 2009. Motif richness. In Reinhard Köhler (ed.), Issues in quantitative linguistics,
51–60. Lüdenscheid: RAM-Verlag.
Mačutek, Ján & Andrij Rovenchak. 2011. Canonical word forms: Menzerath-Altmann law, phone-
mic length and syllabic length. In Emmerich Kelih, Victor Levickij & Yuliya Matskulyak
(eds.), Issues in quantitative linguistics 2, 136–147. Lüdenscheid: RAM-Verlag.
Mačutek, Ján & Gejza Wimmer. 2013. Evaluating goodness-of-fit of discrete distribution models
in quantitative linguistics. Journal of Quantitative Linguistics 20(3). 227–240.
Milička, Jiří. 2015. Is the distribution of L-motifs inherited from the word lengths distribution?
This volume, pp. 133-145.
Sanada, Haruko. 2010. Distribution of motifs in Japanese texts. In Peter Grzybek, Emmerich
Kelih & Ján Mačutek (eds.), Text and Language. Structures, functions, interrelations, quan-
titative perspectives, 183–194. Wien: Praesens.
Wimmer, Gejza & Gabriel Altmann. 1999. Thesaurus of univariate discrete probability distribu-
tions. Essen: Stamm.
Jiří Milička
Is the Distribution of L-Motifs Inherited from
the Word Length Distribution?

1 Introduction
An increasing number of papers1 shows that word length sequences can be
successfully analyzed by means of L-motifs, which are a very promising attempt
to discover the syntagmatic relations of the word lengths in a text.
The L-motif2 has been defined by Reinhard Köhler (2006a) as:

(...) the text segment which, beginning with the first word of the given text, consists of word
lengths which are greater or equal to the left neighbour. As soon as a word is encountered
which is shorter than the previous one the end of the current L-Segment is reached. Thus,
the fragment (1) will be segmented as shown by the L-segment sequence (2):

Azon a tájon, ahol most Budapest fekszik, már nagyon régen laknak emberek.

(2) (1,2,2) (1,3) (2) (1,2,2,2,3)

The main advantage of such segmentation is that it can be applied iteratively, i.e. L-motifs of the L-motifs can be obtained (so-called LL-motifs). Applying the method several times results in not very intuitive sequences, which, however, follow lawful patterns3 and are even practically useful, e.g. for automatic text classification (Köhler – Naumann 2010).
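As a concrete illustration of the definition quoted above, the following short Python sketch (an illustration only; the function name l_motifs is ours) segments the word-length sequence of the Hungarian example sentence into L-motifs and then applies the same segmentation once more to the motif lengths, yielding LL-motifs:

def l_motifs(seq):
    """Segment a sequence of numbers into L-motifs: maximal non-decreasing runs."""
    segments, current = [], [seq[0]]
    for x in seq[1:]:
        if x >= current[-1]:
            current.append(x)
        else:
            segments.append(tuple(current))
            current = [x]
    segments.append(tuple(current))
    return segments

# word lengths (in syllables) of the Hungarian sentence quoted above
lengths = [2, 1, 2, 2, 1, 3, 2, 1, 2, 2, 2, 3]
l1 = l_motifs(lengths)               # [(2,), (1, 2, 2), (1, 3), (2,), (1, 2, 2, 2, 3)]
l2 = l_motifs([len(m) for m in l1])  # LL-motifs: L-motifs of the L-motif lengths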

2 Hypothesis
However, it needs to be admitted that the fact that the L-motifs follow lawful
patterns does not imply that the L-motifs reflect a syntagmatic relation of the
word lengths, since these properties could be merely inherited from the word


1 See Köhler – Naumann (2010), Mačutek (2009), Sanada (2010).
2 Former term was L-segments, see Köhler (2006a).
3 E.g. the rank-frequency relation of the L-motifs distribution can be successfully described by
the Zipf-Mandelbrot distribution, which is a well-established law for the word types rank-fre-
quency relation.

length distribution in the text, which has not been tested yet. The paper focuses
on the most important property of L-motifs – the frequency distribution of their types – and tests the following hypothesis:
The distribution of L-motifs measured on the text T differs from the
distribution of L-motifs measured on a pseudotext T’. The pseudotext T’ is created
by the random transposition of all tokens of the text T within the text T.4

3 Data
The hypothesis was tested on three Czech and six Arabic texts:

Table 1: The list of texts

Tag Author Title Cent. Lang. Tokens


[Zer] Milan Kundera Žert 20 Czech 88435
[Kat] Kohout Katyně 20 Czech 99808
[Bab] Božena Němcová Babička 19 Czech 70140
[Ham] al-Ḥāzimī al-Hamadānī Al-ʾIʿtibār fi ʼn-nāsiḫ wa-ʼl-mansūḫ 15 Arabic 71482
[Sal] ibn aṣ-Ṣallāḥ Maʿrifatu ʾanwāʿi ʿulūmi ʼl-ḥadīṯ 13 Arabic 54915
[Zam] ibn abī Zamanīn Uṣūlu ʼs-sunna 11 Arabic 18607
[Maw] al-Mawwāq Tāǧ wa-l-ʿiklīl 2 15 Arabic 274840
[Baj] al-Bāǧī al-ʿAndalūsī Al-Muntaqī 2 11 Arabic 301232
[Bah] Manṣūr al-Bahūtī Šarḥ muntahīyu ʼl-irādāt 2 17 Arabic 263175

The graphical word segmentation was respected when determining the number
of syllables in the Arabic texts. In the Czech texts zero syllabic words (e.g. s, z, v,
k) were merged with the following words according to the conclusion in Antić et
al. (2006), to maintain the compatibility with other studies in this field (e.g. Köh-
ler 2006b).


4 The null hypothesis is: “The distribution of L-motifs measured on the text T is the same as
the distribution of L-motifs measured on a pseudotext T’. The pseudotext T’ is created by random
transposition of all tokens of the text T within the text T.”

4 Motivation
One of those texts, [Kat], was randomized one million5 times, and the rank–frequency relation (RFR) of L-motifs was measured for every randomized pseudotext. Then these RFRs were averaged. This average RFR can be seen in the following chart, accompanied by the RFR of the L-motifs measured on the real text:

Fig. 1: RFR of the L-motifs, [Kat].

Visually, the RFR of the L-motifs for the real text does not differ much from the average pseudotext RFR of the L-motifs. This impression is supported by the Chi-squared discrepancy coefficient C = 0.0008.6 Also the fact that both the real text’s L-motifs RFR and the randomized texts’ L-motifs RFR can be successfully fitted by the right truncated Zipf-Alexeev distribution with similar parameters7 encourages us to assume that the RFR of L-motifs is given by the word length distribution in the text.


5 1 million has been arbitrarily chosen as a “sufficiently large number”, which makes it the
weakest point of our argumentation.
6 C = χ²/N, where N is the sample size (Mačutek – Wimmer 2013).
7 For the real data: a = 0.228; b = 0.1779; n = 651; α = 0.1089; C = 0.0066. For the randomized pseudotexts: a = 0.244; b = 0.1761; n = 651; α = 0.1032; C = 0.0047. Altmann Fitter was used.

Very similar results can be obtained for LL-motifs, LLL-motifs8 etc. (the Zipf-Mandelbrot distribution fits the distribution of the higher-order L-motifs better than the right truncated Zipf-Alexeev distribution).
But these results do not answer the question asked. The next section pro-
ceeds to the testing of the hypothesis.

5 Methods
Not only the L-motifs as a whole, but every single L-motif has a distribution of its frequencies within those one million randomized pseudotexts. For example, the number of pseudotexts (randomizations of [Bab]) in which the L-motif (1, 1, 2, 2, 2) occurred 72 times is 111. From this distribution we can obtain 95% confidence intervals, as depicted in the following chart:

Fig. 2: Distribution of one of the L-motif types in one million pseudotexts (randomized [Bab])
vs. the frequency of the L-motif in the real text.

In this case, the frequency of the motif (1, 1, 2, 2, 2) measured on the real text [Bab]
is 145, which is above the upper confidence interval limit (in this case 125). But
the frequencies of many other L-motifs are within these intervals, such as the mo-
tif (1, 1, 1, 2, 2):


8 For the LL-motifs: 𝐶𝐶𝐶𝐶 = 0.0009; for the LLL-motifs: 𝐶𝐶𝐶𝐶 = 0.0011.

Fig. 3: Distribution of one of the L-motif types in one million pseudotexts (randomized [Bab])
vs. the frequency of the L-motif in the real text.

The fact that the frequencies are not independent of each other does not allow us to test them separately as multiple hypotheses, and leads us to merge all values of the distribution into a single number. The following method was chosen:
1. The text is randomized many times (in this case 1 million times) and for each pseudotext the frequencies of L-motifs are measured. The average frequency of every L-motif is calculated. The average frequency of the motif (indexed by the variable i, N being the maximal i) will be referred to as m̄_i.
2. The total distance (D) between the frequencies of each motif (m_i) in the text T and their average frequencies in the randomized pseudotexts (m̄_i) is calculated:

   D = Σ_{i=1}^{N} |m̄_i − m_i|

3. All total distances (D′) between the frequencies of each motif (m′_i) in one million pseudotexts T′ (these pseudotexts must be different from those that were measured in step 1) and their average frequencies in the randomized pseudotexts (m̄_i) (the same as in the previous step) are calculated:

   D′ = Σ_{i=1}^{N} |m̄_i − m′_i|

4. The distribution of the D′ distances is obtained.
5. The upper confidence limit is set. A distance D significantly lower than the distances D′ would mean that the real distribution is even closer to the distribution generated by randomly transposing tokens than other distributions measured on randomly transposed tokens. This would not reject the null hypothesis. Considering this, the lower confidence limit is senseless and the test can be assumed to be one-tailed.
6. D is compared with the upper confidence limit. If D is larger than the upper confidence limit, then the null hypothesis is rejected.
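A minimal Python sketch of steps 1–6 might look as follows (an illustration only, not the author's original implementation; for tractability it uses 1,000 randomizations instead of one million, and the function names are ours):

import random
from collections import Counter

def l_motifs(seq):
    """Maximal non-decreasing runs of a word-length sequence."""
    out, cur = [], [seq[0]]
    for x in seq[1:]:
        if x >= cur[-1]:
            cur.append(x)
        else:
            out.append(tuple(cur))
            cur = [x]
    out.append(tuple(cur))
    return out

def d_test(lengths, n_reference=1000, n_test=1000, alpha=0.05):
    """Steps 1-6 above: compare the motif frequencies of the real text with
    those of randomly transposed pseudotexts via the total distance D."""
    # step 1: average motif frequencies over a set of reference pseudotexts
    totals = Counter()
    for _ in range(n_reference):
        totals.update(Counter(l_motifs(random.sample(lengths, len(lengths)))))
    mean_freq = {m: c / n_reference for m, c in totals.items()}

    def distance(freqs):          # steps 2 and 3: sum of |mean_i - m_i| over all motifs
        keys = set(mean_freq) | set(freqs)
        return sum(abs(mean_freq.get(k, 0.0) - freqs.get(k, 0)) for k in keys)

    d_real = distance(Counter(l_motifs(lengths)))            # step 2
    d_random = sorted(                                        # steps 3 and 4
        distance(Counter(l_motifs(random.sample(lengths, len(lengths)))))
        for _ in range(n_test))
    upper = d_random[int((1 - alpha) * n_test) - 1]           # step 5: upper 95% limit
    return d_real, upper, d_real > upper                      # step 6: True = H0 rejected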

An example result of this method follows (applied on L-motifs of [Bab]):

Fig. 4: The distribution of the variable D in one million pseudotexts (randomized [Bab]) vs. the
value of the D variable in the real text. Here, for example, 4371 out of one million randomized
texts have D’ equal to 1500.

As the D is larger than the upper confidence limit, we shall assume that the dis-
tribution of the L-motifs measured on [Bab] is more distant from the average dis-
tribution of L-motifs measured on pseudotexts (derived by the random transpo-
sition of tokens in [Bab]) than the distributions of L-motifs measured on other pseudotexts (also derived by the random transposition of tokens in [Bab]).

6 Results
In the following charts, one column represents the D’ values compared to the measured D value (as in Fig. 4, but in a more concise form) for 7 orders of motifs. The limits of the 95% confidence intervals are indicated by the error bars.

Fig. 5: The D-value of the distribution of L-motifs (in the [Zer]) is significantly different from the
D’-value measured on randomly transposed tokens of the same text. Notice that the LL-motifs
distribution D-value is also close to the upper confidence limit.

Fig. 6: The D-values of the distributions of L-motifs and LL-motifs (in the [Kat]) are significantly
different from the D’-values measured on randomly transposed tokens of the same text. Notice
that the LL-motifs distribution D-value is very close to the upper confidence limit.

Fig. 7: The D-values of the distributions of L-motifs, LL-motifs and LLLL-motifs (in the [Bab]) are
significantly different from the D’-values measured on randomly transposed tokens of the
same text. Consider that the LLLL-motifs distribution can be different just by chance.

Fig. 8: The D-values of the distributions of L-motifs, LL-motifs and LLL-motifs (in the [Ham]) are
significantly different from the D’-values. The ratios between the D-values and the upper confi-
dence limits are more noticeable than those measured on the Czech texts (the y-axis is log
scaled). As the size of these texts is comparable, it seems that the L-motif structure is more
substantial for the Arabic texts than for the Czech ones.

Fig. 9: The D-values of the distributions of L-motifs and LL-motifs (in the [Sal]) are significantly
different from the D’-values.

Fig. 10: The D-values of the distributions of L-motifs, LL-motifs and LLL-motifs (in the [Zam]) are
significantly different from the D’-values despite the fact that the text is relatively short.

Fig. 11: The D-values of the distributions of L-motifs, LL-motifs, LLL-motifs and LLLL-motifs (in
the [Maw]) are significantly different from the D’-values despite the fact that the text is relatively large and incoherent.

Fig. 12: The D-values of the distributions of L-motifs, LL-motifs and LLL-motifs (in the [Baj]) are
significantly different from the D’-values.

Fig. 13: The D-values of the distributions of L-motifs, LL-motifs and LLL-motifs (in the [Bah]) are
significantly different from the D’-values.

Table 2: Exact figures presented in the charts. U stands for the upper confidence limit. All values of D with D > U are marked bold.

                 [Zer]   [Kat]   [Bab]   [Ham]   [Sal]   [Zam]   [Maw]    [Baj]    [Bah]

L-motifs       D  2930    2943    2331    7712    3909    2673    21899    25437    18070
               D′ 1860    2084    1512    1659    1547    867.6   4065.1   4108.6   4094.7
               U  2072    2312    1699    1851    1716    967     4448     4506     4472
LL-motifs      D  1705    1825    1649    2207    1394    1057    6133.9   6798.9   4245.5
               D′ 1584    1635    1391    1364    1171    603.2   2779.0   2980.9   2654.6
               U  1719    1780    1510    1486    1278    666     3023     3234     2895
LLL-motifs     D  839.1   874.5   721.5   1035    631.0   487.2   2444.2   3279.7   1909.4
               D′ 792.8   869.4   690.6   716.9   620.5   339.2   1568.0   1627.4   1551.8
               U  876     959     764     792     687     378     1719     1785     1701
LLLL-motifs    D  474.3   539.3   483.5   441.1   352.1   201.9   1085.3   1072.0   911.3
               D′ 481.3   525.0   417.7   430.9   370.5   195.6   957.3    1000.2   942.0
               U  532     580     463     477     411     219     1051     1097     1034
LLLLL-motifs   D  260.3   297.2   225.6   230.5   209.0   107.3   504.3    572.2    527.3
               D′ 269.4   293.6   232.3   239.4   204.8   105.7   544.0    569.4    534.7
               U  300     327     260     267     229     119     601      629      591
LLLLLL-motifs  D  156.1   156.6   118.7   117.7   113.1   56.6    310.2    332.8    316.8
               D′ 148.5   163.0   127.4   131.6   112.2   56.1    308.4    322.8    303.1
               U  167     183     144     148     127     64      343      359      337
LLLLLLL-motifs D  82.7    85.9    73.0    75.6    63.3    28.9    166.2    186.5    177.1
               D′ 80.1    88.4    68.5    70.8    59.6    29.0    171.7    180.1    168.6
               U  91      100     78      80      68      33      193      202      189

7 Conclusion
The null hypothesis was rejected for the L-motifs (all texts) and for LL-motifs
(except [Zer]) and was not rejected for L-motifs of higher orders (LLL-motifs etc.)
in Czech, but was rejected also for LLL-motifs in Arabic (except [Sal]). As type-
token relation and distribution of lengths are to some extent dependent on the
frequency distribution, similar results for these properties can be expected, but
proper tests are needed. Our methodology can also be used for testing F-motifs
and other types and definitions of motifs.
It needs to be said that not rejecting the null hypothesis does not mean that the L-motifs of higher orders are senseless – even if their distribution were inherited from the distribution of word lengths in the text (which is still not certain), they could still be used as a tool for viewing the distribution of the word lengths from another point of view. However, it turns out that if we wish to use the L-motifs to examine the syntagmatic relations of the word lengths, the structure inherited from the word length distribution must be taken into account.

Acknowledgements
The author is grateful to Reinhard Köhler for helpful comments and suggestions.
This work was supported by the project Lingvistická a lexikostatistická analýza ve
spolupráci lingvistiky, matematiky, biologie a psychologie, grant no.
CZ.1.07/2.3.00/20.0161 which is financed by the European Social Fund and the
National Budget of the Czech Republic.

References
Antić, Gordana, Emmerich Kelih & Peter Grzybek. 2006. Zero-syllable words in determining
word length. In Peter Grzybek (ed.), Contributions to the science of text and language.
Word length studies and related issues, 117–156. Dordrecht: Springer.
Köhler, Reinhard. 2006. The frequency distribution of the lengths of length sequences. In Jozef
Genzor & Martina Bucková (eds.), Favete linguis. Studies in honour of Victor Krupa, 145–
152. Bratislava: Slovak Academic Press.
Köhler, Reinhard. 2008. Word length in text. A study in the syntagmatic dimension. In Sibyla
Mislovičová (ed.), Jazyk a jazykoveda v pohybe, 416–421. Bratislava: VEDA.
Köhler, Reinhard & Sven Naumann. 2009. A contribution to quantitative studies on the sen-
tence level. In Reinhard Köhler (ed.), Issues in quantitative linguistics, 34–57. Lüden-
scheid: RAM-Verlag.
Köhler, Reinhard & Sven Naumann. 2010. A syntagmatic approach to automatic text classifica-
tion. Statistical properties of F- and L-motifs as text characteristics. In Peter Grzybek, Em-
merich Kelih & Ján Mačutek (eds.). Text and language. Structures, functions, interrela-
tions, quantitative perspectives, 81–89. Wien: Praesens.
Mačutek, Ján. 2009. Motif richness. In Reinhard Köhler (ed.), Issues in Quantitative Linguistics,
51–60. Lüdenscheid: RAM-Verlag.
Mačutek, Ján & Gejza Wimmer. 2013. Evaluating goodness-of-fit of discrete distribution models
in quantitative linguistics. Journal of Quantitative Linguistics 20(3). 227–240.
Sanada, Haruko. 2010. Distribution of motifs in Japanese texts. In Peter Grzybek, Emmerich Kelih
& Ján Mačutek (eds.), Text and language. Structures, functions, interrelations, quantitative
perspectives, 183–193. Wien: Praesens.
Adam Pawłowski and Maciej Eder
Sequential Structures in “Dalimil’s
Chronicle”
Quantitative analysis of style variation

1 The problem and the aim of the study


When in the 1960s Jerzy Woronczak analysed the text of the Old-Czech Chronicle
of So-called Dalimil (Staročeská Kronika tak řečeného Dalimila, henceforth
Dalimilova Kronika or Dalimil’s Chronicle), he claimed to have observed the
process of gradual disintegration of its regular verse texture in subsequent
chapters (Woronczak 1993[1963]). He called this phenomenon “prosaisation”.
To elucidate his findings, he formulated a hypothesis that it is easier to compose
a versified text if the historical events that are the subject of the narrative lay
beyond the memory of the author’s contemporaries (e.g. in the distant mythical
past and/or in pagan times). Indeed, as the gap between the years being de-
scribed in Dalimilova Kronika and the times of the author’s life decreases, thus
moving from the beginning towards the end of the Chronicle, its style seems
increasingly prose-like. Woronczak offers a convincing explanation of this
gradual shift in versification. He claims that the contemporaneous events de-
scribed in the Chronicle were known to the public and for this reason needed to
be reported faithfully. In other words, historical facts could not be simply ad-
justed to the constraints imposed by versification and the annalist had less
freedom in selecting the appropriate metrical and/or prosodic structures when
composing his text.1 One could put forward another argument in support of this
hypothesis: the initial parts of the Chronicle probably echoed traces of oral
tradition involving the use of repetitive paratactic segments (referred to as for-
mulae) and equivalent rhythmical structures facilitating memorisation. At the
eve of the Middle Ages, when the culture of scripture (and a fortiori of print)
began to dominate the entire realm of social communication in Europe, formu-
laic elements started to disappear from literature and historiography. However,


1 This explanation was transmitted to prof. Pawłowski during a conversation with prof.
Woronczak and so far has not been publicised in print.
they persisted as a living heritage of the oral tradition in a form of intertextual borrowings in literary and paraliterary works.
Woronczak’s idea was simple and – in a sense – universal since the inter-
play between historical facts and imagination has been present in historiog-
raphy and historical fiction since time immemorial. His methodological ap-
proach was appropriate and convincing as it relied on empirical evidence and
statistical tools. However, Woronczak, who was conducting his research in the
middle of the twentieth century, could not efficiently work on a digitalised text
or take full advantage of computerised numerical methods. In particular, he did
not use trend analysis or time series methods to verify the existence of repetitive
patterns in the text of the Chronicle. Instead, he analysed the distribution of
verse lengths by means of the runs test method, which is useful for testing the
randomness of a series of observations. The advantage of this method – quite
suitable in this case – is that it does not require the processing of a digitalised
text or extensive computations.
The objective of this study is thus to scrutinise some of Woronczak’s as-
sumptions by means of tests performed on a variety of sequential text structures
in the Chronicle. The following data will be analysed: (1) series of chapter
lengths (in letters, syllables and words); (2) series of verse lengths (in letters,
syllables letters and words); (3) alternations and correlations of rhyme pairs; (4)
quantity-based series of syllables (binary coding); (4) stress-based series of
syllables (binary coding). The method we use had been successfully applied in
our earlier studies concerning text rhythm, carried out on texts in both modern
(French, Polish, Russian) and classical languages (Latin, and Greek) (cf.
Pawłowski 1998, 2003; Pawłowski, Eder 2001; Pawłowski, Krajewski, Eder 2010;
Eder 2008).
We intend to explore the rhythm patterns in the Chronicle’s text and thus
either corroborate – or indeed challenge – Woronczak’s hypothesis. The results
we expect to obtain apply, first of all, to Dalimil’s Chronicle and medieval
historiography. But the debate concerning the questions raised by the
characteristics of and relations between orality and literacy – historically the
two most important forms of verbal interaction – goes far beyond the medieval universum. Our argument is all the more momentous in that it relies on solid, quantitative research.

2 Description of the “Chronicle”


The Chronicle of [So-called] Dalimil (Staročeská Kronika tak řečeného Dalimila) is
one of the oldest and certainly most important monuments of medieval Czech
and is often referred to as the foundation myth of the Czech Kingdom. It was
created at the beginning of the fourteenth century by an anonymous author
(individual or collective), subsequently named Dalimil. The Chronicle includes
original parts, some of them being transcriptions of older Latin texts (e.g.
Chronica bohemica), and later additions. It is written in irregular rhymed verse.
Since its creation the Chronicle has had two major critical editions. In this study
both the one from 1882 (Jireček 1882) and the one from 1988 (Daňhelka et al. 1988) served as data sources. In its definitive version the Chronicle consists of one hundred and six chapters of length varying from sixteen to ninety-four verses.
The Chronicle describes the history of the Kingdom of Bohemia from its
mythical origins (the construction of the Tower of Babel) to the year 1314. The
order of chapters follows the chronological development of the narrated story.
There is no evidence, however, that the subsequent chapters or passages were
composed according to the timeline of the events described. Although it is
hardly imaginable that any historical prose, and especially a medieval
chronicle, could be written from its final passages back to the beginning (or, for
that matter, obey another sequence of episodes), it is still possible, or perhaps
even likely, that some chapters of the Chronicle were composed in non-linear
order or that passages were rearranged after their composition.
Another fact that one ought to bear in mind is the uncertain authorship of
the Chronicle. A mixed authorship cannot be reliably ruled out and even if one
assumes the existence of a single author named Dalimil one still does not know
how many passages were actually written by the author and how many were
adopted from earlier (oral and written, literary and non-literary) texts. The only
element that seems beyond all doubt is that the times presented in the opening
chapters are located far beyond the memory of the author’s contemporaries,
gradually approaching the events that could have been experienced firsthand or
secondhand by the Chronicle’s first readers. Again, it is highly probable that the
author’s knowledge of the past was founded on texts composed by his/her
predecessors (including Latin sources) as well as on the popular oral literature
of his/her times. These traces of the oral tradition include, inter alia,
metatextual intrusions, and a characteristic narrative redundancy.
Yet, the fact that the text may contain borrowings from earlier sources is in
no way a critical obstacle to the undertaking of an analysis of sequential
structures in Dalimilova Kronika as a whole. Even if some passages were rewritten or adopted from other authors, the Chronicle can be viewed as a more or
less coherent “story” that develops both in historical time and in longitudinal
textual representation.

3 Hypothesis
It has been postulated that in the text of the Chronicle a gradual shift in
versification can be observed, consisting in the growing dissimilation in the
length of adjacent verses and chapters. Called by Jerzy Woronczak
“prosaisation” (Woronczak 1993[1963]: 70), this process is presumably related to
the evolution of the author’s linguistic preferences and to his/her attitude
towards documented historical facts. It may also reflect the shift from the
ancient oral tradition, based on very regular verse structures that enhanced
memorisation, to the culture of script, based on literacy and its complex
versification rules that satisfied the readers’ increasingly refined taste. Apart from
the above hypothesis, following directly from Woronczak’s considerations,
additional experiments in modelling longitudinal structures based on stress and
quantity were carried out. The main reason for this was that a similar method,
applied to Latin and ancient Greek texts, produced good results (cf. Pawłowski,
Eder 2001; Pawłowski, Krajewski, Eder 2010).

4 Method
Quantitative text analysis is a combination of applied mathematics and
linguistics. It relies on scientific methodology but its epistemological
foundations are arguable, since they reduce the complexity of language to a
small number of parameters, simplify the question of its ontological status and
often avoid critical dialogue with non-quantitative approaches. Despite these
reservations quantitative methods have proven to be efficient in resolving many
practical questions. They also support cognitive research, revealing new facets
of human verbal communication, especially statistical laws of language and
synergetic relations. As Reinhard Köhler says, “A number of linguistic laws has
been found during the last decades, some of which could successfully be
integrated into a general model, viz. synergetic linguistics. Thus, synergetic
linguistics may be considered as a first embryonic linguistic theory. [...]
According to the results of the philosophy of science, there is one widely
accepted type of explanation, the deductive-nomologic one.” (Köhler 2005: 764).
The present study of Dalimil’s Chronicle is a typical example of the approach
combining philological apparatus and quantitative techniques of sequential
text modelling. On the one hand, our reflections and conclusions rely on an
empirical analysis of the medieval versification of a given text, carried out with
the help of quantitative techniques. On the other hand, they touch upon the
relations between facts and fiction in ancient historiography, as well as on the
passage from orality to literacy in European culture.
The nature of a versified text requires a combination of conventional
statistics and tools appropriate in the analysis of sequential data. Since the
publication of the monumental study The Advanced Theory of Language as
Choice and Chance by the British physicist and linguist Gustav Herdan (Herdan
1966), the first approach is referred to as the “analysis of language in the line”
and the other one as the “analysis of language in the mass” (ibid. 423). The
principal tools of sequential analysis are the time series analysis (conducted in
the time domain and/or in the frequency domain), information theory and the
theory of stochastic processes. All these methods have been used so far, but the
time series analysis in the time domain has turned out to be the most efficient
and the most appropriate in text analysis (Pawłowski 1998, 2001).
The idea of sequential text analysis can be explained as follows. Let us
assume that in a hypothetical language L there are two types of lexical units,
noted as A and B that form sequential structures called texts according to the
rules of syntax. Given is a corpus of texts:

(1) AAAAAABBBBBB, (2) AAABBBAAABBB, (3) AABBAABBAABB, (4) ABABABABABAB,


(5) AABBABABABAB, (6) BABBAABBBAAA

From the linguistic point of view these sequences should be qualified as different syntagmatic structures, assuming specific meanings in a normal
communicative interaction. Nevertheless, conventional statistics such as
positional parameters and distributions will not reveal any differences between
the sequences because the frequencies of As and Bs remain the same in every
piece of “text”. In such cases, sequential methods taking into account the order
of units are more effective than conventional ones (unless frequencies of n-
grams are processed).
It must be emphasised that syntagmatic dependencies in a line of text,
referred to as textual series spanned over syntagmatic time, are not the only
longitudinal structures modelled with sequential methods. There also exists a
possibility to treat as a time series a sequence of separate texts or long sections,
sampled from a large body of texts. Good examples of such series are successive
works of a given author, periodical press articles or chronological samples of
public discourse (cf. Pawłowski 2006). As in such cases the process of
quantification concerns foremostly lexical units, such sequences can be referred
to as lexical series.
In our study the method of time series analysis was applied. It comprises
several components, such as the autocorrelation function (ACF) and the partial
autocorrelation function (PACF), as well as several models of stochastic
processes, such as AR(p) (autoregressive model of order p) and MA(q) (moving
average model of order q). There are also complex models aggregating a trend
function, periodical oscillations and a random element in one formula.
Periodical oscillations are represented by AR and MA models, while complex
models such as ARMA or ARIMA include their combinations. From the linguistic
point of view the order of the model (p or q) corresponds to the depth of
statistically significant textual memory. This means that p or q elements in a
text statistically determine the subsequent element p + 1 or q + 1. As the time
series method has been exhaustively, copiously described in existing literature
– cf. its applications in economy and engineering (cf. Box, Jenkins 1976; Cryer
1986), in psychology (Glass, Wilson, Gottman 1975; Gottman 1981, Gregson 1983,
Gottman 1990), in social sciences (McCleary, Hay 1980) and in linguistics
(Pawłowski 1998, 2001, 2005; Pawłowski, Eder 2001; Pawłowski, Krajewski,
Eder 2010) – it seems unnecessary to discuss it here in detail (basic formulae are
given in the Appendix).
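As an illustration of the central tool, the following minimal Python sketch (ours, not the authors' code; the standard sample estimator of the ACF is assumed) computes the autocorrelation function and applies it to binary codings of two of the toy sequences from the A/B corpus above. Although both sequences have identical frequencies of As and Bs, their ACF values differ markedly, which is exactly what makes the sequential approach informative where "language in the mass" statistics are blind.

import numpy as np

def acf(series, max_lag=10):
    """Sample autocorrelation function r_k = c_k / c_0 for lags 1..max_lag."""
    x = np.asarray(series, dtype=float) - np.mean(series)
    c0 = np.sum(x * x)
    return [float(np.sum(x[:-k] * x[k:]) / c0) for k in range(1, max_lag + 1)]

# binary coding (A = 0, B = 1) of two of the toy sequences quoted above
alternating = [0, 1] * 6        # ABABABABABAB: lag-1 value strongly negative
blocks = [0, 0, 1, 1] * 3       # AABBAABBAABB: lag 1 near zero, lag 2 strongly negative
print(acf(alternating, 4))
print(acf(blocks, 4))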
It is important, nonetheless, to keep in mind that a series of discrete
observations of a random variable is called time series, if it can be spanned over
a longitudinal axis (representing real time or any other sequential quantity). A
full time series can be decomposed into three components: the trend, the
periodical functions (including seasonal oscillations) and the random element
called white noise. In linguistics such full time series is hardly imaginable
(albeit theoretically possible). In a single text analysis trends usually do not
appear because the observed variables are stationary, i.e. their values do not
significantly outrange the limits imposed by the term’s frequency.2 If a greater
number of texts is put in a sequence according to their chronology of
appearance, or according to some principle of internal organisation, trend
approximation, calculated for a given parameter, is possible. For example, in


2 Our previous research on time series analysis of Latin, Greek, Polish and English verse and
prose, proves that “text series” are always stationary (cf. Pawłowski 1998, 2001, 2003;
Pawłowski, Eder 2001; Pawłowski, Krajewski, Eder 2010).
the case of Dalimil’s Chronicle subsequent passages are represented by their average verse length.
Out of all the parameters possibly calculable for longitudinal data, with an
exception made for trend function, which – if it exists – should be estimated
separately, the most important one is the autocorrelation of the series (ACF). If it
is too small to be considered as significant, there is no need to proceed with
model estimation or with further steps of the analysis as it will certainly not
yield any noteworthy results.
In his study Jerzy Woronczak applied the runs test,3 which is a non-
parametrical test used in applied research to evaluate the randomness of binary
sequences, usually produced as results of experiments or other empirical
observations. The advantages and disadvantages of this approach will now be
examined.
Given is a series of symbols A and B: {AAABBAAAABBABBBAAABBAABBB},
where na is the number of As, nb is the number of Bs and N is the series length
(na + nb = N). If the probabilities of occurrence of an A and a B are known, it is
possible to estimate the cumulative distribution function of the number of the
runs, i.e. the successive appearances of As or Bs, and to estimate its confidence
interval. Woronczak’s idea was to consider a versified text as a “generator” of a
random variable related to the structure of subsequent verses. The value of this
variable was the verse length expressed in syllables. He considered two or more
adjacent verses of the same length as a series corresponding to an AAA...
sequence, while clusters composed of varying verse lengths were in his study
considered as dissimilated units corresponding to the BBB... sequences.
Woronczak called this phenomenon the tendency for verse assimilation or
dissimilation respectively. An example of this type of coding (twelve first lines
from chapter 7) can be illustrated as follows:
Když sobě ten div ukázachu, 9 B sequence
na Přěmysłu potázachu, 8
A sequence
které by było znamenie 8
té suché otky vzektvěnie. 7 B sequence
Jim tak Přěmysł otpovědě, 8
řka: „To jáz vám všě povědě. 8 A sequence
Otka suchá jesť znamenie 8
mého chłapieho urozenie. 9
Ale že-ť jesť brzo vzkvetła, 7 B sequence
jakž vem Liubušě była řekła, 9
mój rod z chłapieho pořáda 8 A sequence


3 Also called Wald-Wolfowitz test.
dojde králového řáda; 8
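Before turning to its weaknesses, the runs test itself can be sketched in a few lines of Python (a hedged illustration using the standard Wald–Wolfowitz formulas for the expected number of runs and its variance; it is applied to the illustrative series of As and Bs quoted earlier in this section and is not Woronczak's original computation):

from math import sqrt

def runs_test(series):
    """Wald-Wolfowitz runs test for a binary A/B series: returns the observed
    number of runs and the z-score against the expected number of runs."""
    n_a, n_b = series.count("A"), series.count("B")
    n = n_a + n_b
    runs = 1 + sum(1 for i in range(1, n) if series[i] != series[i - 1])
    mu = 1 + 2 * n_a * n_b / n                                # expected number of runs
    var = 2 * n_a * n_b * (2 * n_a * n_b - n) / (n ** 2 * (n - 1))
    return runs, (runs - mu) / sqrt(var)

# the illustrative series of As and Bs given earlier in this section
series = "AAABBAAAABBABBBAAABBAABBB"
print(runs_test(series))   # 10 runs, z of roughly -1.4: no significant departure from randomness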

The weakness of the runs test here is the assumption that a versified text is
considered as a generator of randomness. Without any preliminary tests or
assumptions, one has to admit that texts are a fruit of a deliberate intellectual
activity aimed at creating a coherent and structured message. Versification
plays here an aesthetic function and cannot be regarded as random. Another
point that seems crucial is that one can easily imagine several series presenting
exactly the same parameters (N, na and nb) but at the same time different
orderings of AA... and BB... series. It is for these reasons that the runs test was
not used in our study. Instead, trend modelling and time-series tools were
applied.

5 Sampling and quantification of data


As the Chronicle is composed of one hundred and six chapters or sections, two
basic approaches to the question of sampling have been adopted. Respecting
the formal divisions of the text, individual chapters were considered as units
and were then quantified. In this case entire verses were coded according to
their length expressed in the number of syllables (coding in the number of
letters and words was carried out additionally for the sake of comparison). The
average verse length was then calculated for all the chapters so that the entire
text of the Chronicle could be processed as a time series. As for the research into
rhythmical patterns in a line of text, two basic prosodic features, i.e. stress and
quantity, were considered as potentially significant. Consequently, syllables in
randomly selected samples of continuous text were coded in two ways: as
stressed (1) or unstressed (0), and as long (1) or short (0). Every sample was
composed of ca one hundred and fifty syllables. In this way sixty binary textual
series were generated and processed with the time series method.
A brief comment on the way of measuring verse length in a medieval
chronicle would be of interest here since an inappropriate unit of measure
might distort the results despite the sophisticated mathematical apparatus
being used for the treatment of data. Although it seems impossible to reproduce
the way the text of the Chronicle might have sounded more than seven hundred
years ago, some of its metrical features could be reconstructed. The most time-
resistant and convincing unit in this case is the syllable, as it is the carrier of
prosodic features of rhythm, i.e. stress or quantity. Interestingly enough, our
tests have shown that the coding of the text with graphical units such as letters
and words produced very similar results as compared with syllable coding.

6 Results and interpretation


In order to test the hypothesis of the gradual “prosaisation” of the Chronicle
basic statistical measures were calculated. These were meant to provide a
preliminary insight into the data and included arithmetic means and standard
deviations applied to a sequence of consecutive chapters of the Chronicle. This
first step of the procedure was intended to let the data “speak for itself”.
Surprisingly, even these most elementary techniques have clearly shown some
stylistic changes, or rather a development of style throughout the text of the
Chronicle.
Our analysis ought to start with the most intuitive measure of the
versification structure, namely the constancy of the verse length. Even if the
whole Chronicle was composed using irregular metre (bezrozměrný verš), one
could expect the line lengths to be roughly justified in individual chapters as
well as across the chapters. As evidenced by Fig. 1, the mean number of
syllables per line increases significantly, starting with roughly eight in the
opening chapters and reaching ca twelve syllables in the final ones, with a very
characteristic break in the middle of the Chronicle.4 The same process could be
observed when verse lengths were measured in words (Fig. 2) and letters (not
shown). The above observations can be represented in terms of the following
linear regression model:

y_i = β₁x_i + β₂ + ε,

where:
β₁, β₂ – coefficients of the model
ε – random noise.
Given the considerably clear picture of the increasing length of lines in
subsequent chapters, it is not surprising that the following linear regression
model (for syllables) fits the data sufficiently well:

ŷ_i = 0.017757x_i + 9.28473 + ε
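The fit itself is an ordinary least-squares estimation; a minimal Python sketch (with simulated chapter means standing in for the measured data, which we do not reproduce here) is:

import numpy as np

# mean verse length (in syllables) per chapter; simulated values standing in
# for the 106 chapter means measured on the Chronicle
rng = np.random.default_rng(0)
chapters = np.arange(1, 107)
mean_len = 9.3 + 0.018 * chapters + rng.normal(0, 0.8, size=chapters.size)

# ordinary least-squares estimate of y_i = beta1 * x_i + beta2
beta1, beta2 = np.polyfit(chapters, mean_len, 1)
print(f"slope = {beta1:.6f}, intercept = {beta2:.5f}")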


4 The trend lines on the plots are generated using the procedure introduced by Cleveland
(1979).

Although this model (its estimate marked with a dashed line in Fig. 1) seems to be satisfactory, it does not capture the disturbance of the trend in the middle of the Chronicle (roughly covering chapters 30–60). It is highly probable that we are dealing here with the influence of an external factor on the stable narration of the Chronicle.

Fig. 1: Mean verse length in subsequent chapters of the Chronicle (in syllables).

Fig. 2: Mean verse length in subsequent chapters of the Chronicle (in words).

Certainly, the above results alone do not indicate any shift from verse to prose
per se. However, when this observation is followed by a measure of dispersion,
such as the standard deviation of the line length in subsequent chapters or –
more accurately – a normalised standard deviation (i.e. coefficient of variation,
or standard deviation expressed in the units of the series mean value), the
overall picture becomes clearer, because the dispersion of line lengths (Fig. 3)
increases significantly as well. The lines not only become longer throughout the
whole Chronicle but also more irregular in terms of their syllabic justification,
thus resembling prose more than poetry. The results obtained for syllables are
corroborated by similar tests performed on letters and words (not shown): the
lines of the Chronicle display a decreasing trend of verse justification, whichever
units are being analysed.
One should also note that the syllabic model of the opening chapters,
namely eight syllable lines with a few outliers, is a very characteristic pattern of
archaic oral poetry in various literatures and folkloric traditions of Europe. In
the case of Dalimil’s Chronicle, it looks as if the author composed the earliest
passages using excerpts from oral epic poetry followed by chapters inspired by
non-literary (historical) written sources. If this assumption were to be true, it
should be confirmed by a closer inspection of other prosodic factors of the
Chronicle such as rhymes.

Fig. 3: Coefficient of variation of line lengths in subsequent chapters of the Chronicle (in
syllables).

Fig. 4: Rhyme repertory, or the percentage of discrete rhyme pairs used in subsequent
chapters of the Chronicle.

Another series of tests addressed the question of rhyme homogeneity and rhyme
repertory. Provided that oral/archaic poetry tends to reduce alternations in
rhymes or to minimise the number of rhymes used (as opposed to elaborate
written poetry, usually boasting a rich inventory of rhymes), we examined the
complexity of rhyme types in the Chronicle.
Firstly, we computed the percentage of discrete rhyme pairs per chapter –
an equivalent to the classical type/token ratio – in order to get an insight into
the rhyme inventory, with an obvious caveat that this measure is quite sensitive
to the sample size. Although the dispersion of results is rather large across
subsequent chapters (Fig. 4), the existence of the decreasing trend is evident.
This is rather counter-intuitive and contradicts the above-formulated
observations since it suggests a greater rhyme inventory at the beginning of the
Chronicle, as though the text were more “oral” in its final passages than in its
opening chapters.
The next measure is the index of rhyme homogeneity, aimed to capture
longer sequences of matching rhyme patterns or clusters of lines using the same
rhyme. It was computed as the number of instances where a rhyme followed the
preceding one without alternation, divided by the number of lines of a given
chapter. Thus, the value of 0.5 means a regular alternation of rhyme pairs,
whilst a chapter of n verses using no rhyme changes (i.e. based entirely on one
rhyme pattern) will raise the homogeneity index to (n – 1)/n. On theoretical
grounds, both measures – rhyme inventory index and homogeneity index – are
correlated to some extent. However, the following example from chapter 38 will
explain the difference between these measures:
Tehdy sě sta, že kněz Ołdřich łovieše
a sám v pustém lesě błúdieše.
Když u velikých túhách bieše,
około sebe všady zřieše,
uzřě, nali-ť stojí dospěłý hrad.
Kněz k němu jíti chtieše rád.
Ale že cěsty neumějieše
a około hłožie husté bieše,
ssěd s koně, mečem cěstu proklesti,
i počě po ostrvách v hrad lézti;
neb sě nemože nikohého dovołati,
by v něm liudie byli, nemože znamenati.
Most u něho vzveden bieše,
a hrad zdi około sebe tvrdé jmějieše.
Když kněz s úsilím v hrad vnide
a všecky sklepy znide,
zetlełé rúcho vidieše,
však i čłověka na něm nebieše.
Sbožie veliké a vína mnoho naleze.
Ohledav hrad, kudyž był všeł, tudyž vyleze.
Pak kněz hrad ten da pánu tomu, jemuž Přiema diechu;
proto tomu hradu Přimda vzděchu.

In the above passage there are twelve lines that follow the preceding line’s
rhyme; however, the number of unique rhyme patterns is seven as the rhyme
-ieše returns several times. Hence, both indexes differ significantly, and their values are 0.318 and 0.545 (inventory and homogeneity respectively).
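The two indexes can be computed with a few lines of Python. The sketch below is our illustration, not the authors' script: the simplified rhyme strings are only a schematic rendering of the quoted passage, the function name rhyme_indexes is ours, and, following the worked example (seven distinct rhyme patterns and twelve repeating lines out of twenty-two lines), the inventory index is computed here as distinct rhyme patterns divided by the number of lines.

def rhyme_indexes(rhymes):
    """rhymes: one rhyme pattern (line ending) per verse of a chapter.
    inventory   = distinct rhyme patterns / number of lines (type/token-like measure)
    homogeneity = lines repeating the preceding line's rhyme / number of lines"""
    inventory = len(set(rhymes)) / len(rhymes)
    repeats = sum(1 for i in range(1, len(rhymes)) if rhymes[i] == rhymes[i - 1])
    return inventory, repeats / len(rhymes)

# schematic rhyme endings of the chapter-38 passage quoted above
chapter38 = ["ieše", "ieše", "ieše", "ieše", "ad", "ad", "ieše", "ieše",
             "esti", "esti", "ati", "ati", "ieše", "ieše", "ide", "ide",
             "ieše", "ieše", "eze", "eze", "echu", "echu"]
print(rhyme_indexes(chapter38))   # approximately (0.318, 0.545)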
Fig. 5 shows the homogeneity index for the whole Chronicle. Despite some
interesting outliers a general increasing trend is again easily noticeable: the
opening chapters not only tend to use more rhyme types as compared with the
final passages, but also avoid aggregating rhymes into longer clusters. To keep
things simple, one might say that rhyme diversity and copiousness
(traditionally associated with elaborate written poetry) turns slowly into
repetitiveness and formulaic “dullness” (usually linked to archaic oral poetry).
Yet, another simple test confirms these results: out of all the rhymes used in the
Chronicle we computed the relative frequencies of the most frequent ones,
dividing the number of a particular rhyme’s occurrence by the total number of
verses in a chapter. The summarised frequencies of five predominant groups of
rhymes are represented in Fig. 6, which shows very clearly that a variety of
rhyme patterns of the initial passages is systematically displaced by a limited
number of line endings. It also means a syntactic simplification of the final parts
of the Chronicle, since the overwhelming majority of rhymes are in fact a few
inflected morphemes of verbs (-echu|-ichu|-achu, -eše|-aše, -ati|-eti, -idu|-edu|
-adu) and a genitive singular morpheme for masculine nouns, adjectives and
pronouns (-eho|-oho).

Fig. 5: Clustering of homogeneous rhymes in subsequent chapters of the Chronicle.

Fig. 6: Relative frequency of the most frequent rhyme types in subsequent chapters of the
Chronicle.

Next come more sophisticated measures that may betray internal correlations
between subsequent lines of the Chronicle, namely the aforementioned time
series analysis techniques. The aim of the following tests is to examine whether
short lines are likely to be followed by long ones and vice versa (Hrabák 1959).
Certainly, the alternative hypothesis is that no statistically significant
correlations exist between the consecutive lines of the Chronicle. To test this
suggestion we first computed the autocorrelation function (ACF) for subsequent
line lengths, expressed in syllables. Next came partial autocorrelation function
(PACF), followed by the stage of model estimation and evaluation (if applicable,
because most of the results proved insufficiently significant and thus
unsuitable for modelling). The whole procedure was replicated independently
for individual chapters.5
The results of the aforementioned procedures are interesting, yet
statistically not very significant. As evidenced by Fig. 7 (produced for chapter 1),
the autocorrelation function reveals no important values at lags higher than 1,
the only significant result appearing at lag 1. It means that a minor negative
correlation between any two neighbouring lines (i.e. at lag 1) can be observed in
the dataset. To interpret this fact in linguistic terms, any given verse is slightly
affected by the preceding one but does not depend at all on the broader
preceding context. The ACF function applied to the remaining chapters
corroborates the above results. Still, the only value that turned out to be
significant is negative in some chapters (Fig. 7) and positive in the other ones
(Fig. 8), and it generally yields a substantial variance. It seemed feasible
therefore to extract throughout the consecutive chapters all the values for lag 1
and to represent them in one figure (Fig. 9). This procedure reveals some further
regularities; namely the correlation between neighbouring lines slowly but
systematically increases. It means, once again, that the Chronicle becomes
somewhat more rhythmical in its last chapters despite – paradoxically – the
prose-like instability of the verse length.
In any attempt at measuring textual rhythmicity the most obvious type of
data that should be assessed is the sequence of marked and unmarked syllables
or other alternating linguistic units (such as metrical feet). In the case of
Dalimil’s Chronicle – or the Old Czech language in general – the feature
responsible for rhythmicity is the sequence of stressed and unstressed syllables
on the one hand, and the sequence of long and short syllables on the other


5 Since the ARIMA modelling is a rather time-consuming task, we decided to assess every
second chapter instead of computing the whole dataset. We believe, however, that this
approach gives us a good approximation of the (possible) regularities across chapters.
hand. To test possible rhythmical patterns in the Chronicle, we prepared two independent datasets containing a binary coding – one for stress, the other for quantity.

Fig. 7: Autocorrelation function (ACF) of lengths of subsequent lines of chapter 1.

Fig. 8: Autocorrelation function (ACF) of lengths of subsequent lines of chapter 41.



Fig. 9: Results of the autocorrelation function values (lag 1 only) for the consecutive chapters of
the Chronicle.

The results of ACF function applied to a sample chapter are shown in Fig. 10
(stress) and Fig. 11 (quantity). Both figures are rather straightforward to
interpret. The stress series contains one predominant and stable value at lag 1,
and no significant values at higher lags. This means a strong negative
correlation between any two adjacent syllables. This phenomenon obviously
reflects a predominant alternation of stressed and unstressed syllables in the
sample; it is not surprising to see the same pattern in all the remaining samples.
Needless to say, no stylistic shift could be observed across consecutive chapters.
These findings seem to indicate that stress alternation is a systemic language
phenomenon with no bearing on a stylistic analysis of the Chronicle, and the
quantity alternation in a line of text does not generate any significant
information. Having said that, both measures should produce more interesting
results in a comparative analysis of poetic genres in modern Czech.
The quantity-based series reveals a fundamentally different behaviour of
the sequence of long and short syllables. The actual existence of quantity in the
Old Czech language seems to be totally irrelevant as a prosodic feature in
poetry. The values shown in Fig. 11 evidently suggest no autocorrelation at any
lag. It means that the time series is simply unpredictable; in linguistic terms, it
means no rhythm in the dataset.

Fig. 10: Autocorrelation function (ACF) of a sequence of stressed and non-stressed syllables
from chapter 3.

Fig. 11: Autocorrelation function (ACF) of a sequence of long and short syllables from chapter 7.

Significantly enough, both prosodic features which one would associate with the techniques of poetic composition, namely quantity and stress, proved
to be independent of any authorial choice and did not play any role in the
gradual stylistic shift in the Chronicle.

7 Conclusions
We have verified the presence of latent rhythmic patterns and have partially
corroborated the hypothesis advanced by Jerzy Woronczak in 1963. However,
the obtained results elucidate further questions raised by our analysis of the
Chronicle. One of these questions concerns the role played by the underlying
oral culture in the process of text composition in the Middle Ages. Although in
Woronczak’s work there is no direct reference to the works by J. Lord, M. Parry,
W. Ong or E. Havelock, all of whom were involved in the study of orality in
human communication, the Polish scholar’s reasoning is largely consistent with
theirs. Woronczak agrees with the assumption that in oral literature subsequent
segments (to be transformed in future into verses of written texts) tend to keep a
repetitive and stable rhythm, cadenced by syllable stress, quantity, identical
metrical feet or verse length. The reason for this apparent monotony is that
illiterate singers of tales and their public appreciated rhythmicity. While the
former took advantage of its mnemonic properties that helped them remember
hundreds or even thousands of verses, the latter enjoyed its hypnotising power
during live performances. Variable and irregular segments marking ruptures of
rhythm occurred only in critical moments of the story.
However, the bare opposition orality vs. literacy does not suffice to explain
the quite complex, as it turns out, stylistic shift in the Chronicle. Without
rejecting the role of orality as one of the underlying factors here, we propose a
considerably straightforward hypothesis that seems to explain the examined
data satisfactorily. Since in the initial chapters of the Dalimilova Kronika the
archaic eight-syllable verse type was used, it is feasible that this part of the
Chronicle had been inspired by oral poetry of the author’s predecessors (or even
copied in some passages). Probably, when the author took on some elements of
the oral heritage of his/her times he wished to keep the original syllabic
structure, rhymes and line lengths. This was possible – however difficult – for a
while, yet the poetic form eventually proved to be too demanding. At some
point, as we hypothesise, the author realised that creating or supplementing
history would require more and more aesthetic compromises; thus, he put less
effort into the poetic form of the text’s final chapters.
Certainly, other explanations cannot be ruled out. The most convincing one
is the hypothesis of collaborative authorship of the Chronicle. However, had
there been two (or more) authors, a sudden stylistic break somewhere in the
Chronicle, rather than the gradual shift from poetry to prose, would be more
likely.

Hence, the results of our study go far beyond the medieval universum and routine
stylometric research; they contribute to the debate concerning the questions
raised by the opposition between orality and literacy in culture, as well as the
opposition between fact and fiction in historical narrative, omnipresent in the
text-audio-visual space. Our argument is all the more compelling in that it relies
on empirical data and on a solid, quantitative methodology.

Source Texts
Jireček, Josef (ed.). 1882. Rýmovaná kronika česká tak řečeného Dalimila. In Prameny dějin
českých 3(1). Praha.
Daňhelka, Jiří, Karel Hádek, Bohuslav Havránek & Naděžda Kvítková (eds.). 1988. Staročeská
kronika tak řečeného Dalimila: Vydání textu a veškerého textového materiálu. Vol. 1–2.
Praha: Academia.

References
Box, George E.P. & Gwilym M. Jenkins. 1976. Time series analysis: Forecasting and control. San
Francisco: Holden-Day.
Cleveland, William S. 1979. Robust locally weighted regression and smoothing scatterplots.
Journal of the American Statistical Association 74(368). 829–836.
Cryer, Jonathan. 1986. Time series analysis. Boston: Duxbury Press.
Eder, Maciej. 2008. How rhythmical is hexameter: A statistical approach to ancient epic poetry.
Digital Humanities 2008: Book of Abstracts, 112–114. Oulu: University of Oulu.
Glass, Gene V., Victor L. Wilson & John M. Gottman. 1975. Design and analysis of time-series
experiments. Boulder (CO): Colorado Associated University Press.
Gottman, John M. 1981. Time-series analysis: A comprehensive introduction for social
scientists. Cambridge: Cambridge University Press.
Gottman, John M. 1990. Sequential analysis. Cambridge: Cambridge University Press.
Gregson, Robert A.M. 1983. Time series in psychology. Hillsdale (NJ): Lawrence Erlbaum
Associates.
Herdan, Gustav. 1966. The advanced theory of language as choice and chance. Berlin:
Springer.
Hrabák, Josef. 1959. Studie o českém verši. Praha: SPN.
Köhler, Reinhard. 2005. Synergetic linguistics. In Reinhard Köhler, Gabriel Altmann & Rajmund
G. Piotrowski (eds.), Quantitative linguistics. An international handbook, 760–774. Berlin
& New York: de Gruyter.
McCleary, Richard & Richard A. Hay. 1980. Applied time series analysis for the social sciences.
Beverly Hills: Sage Publications.
Pawłowski, Adam. 1998. Séries temporelles en linguistique. Avec application à l’attribution de
textes: Romain Gary et Émile Ajar. Paris & Genève: Champion-Slatkine.
Pawłowski, Adam. 2003. Sequential analysis of versified texts in fixed- and free-accent
languages: Example of Polish and Russian. In Lew N. Zybatow (ed.), Europa der Sprachen:
Sprachkompetenz – Mehrsprachigkeit – Translation. Teil II: Sprache und Kognition, 235–
246. Frankfurt/M.: Peter Lang.
Pawłowski, Adam. 2005. Modelling of the sequential structures in text. In Reinhard Köhler,
Gabriel Altmann & Rajmund G. Piotrowski (eds.), Quantitative linguistics. An international
handbook, 738–750. Berlin & New York: de Gruyter.
Pawłowski, Adam. 2006. Chronological analysis of textual data from the “Wrocław corpus of
Polish”. Poznań Studies in Contemporary Linguistics 41. 9–29.
Pawłowski, Adam & Maciej Eder. 2001. Quantity or stress? Sequential analysis of Latin
prosody. Journal of Quantitative Linguistics 8(1). 81–97.
Pawłowski, Adam, Marek Krajewski & Maciej Eder. 2010. Time series modelling in the analysis
of Homeric verse. Eos 47(1). 79–100.
Woronczak, Jerzy. 1963. Zasada budowy wiersza “Kroniki Dalimila” [Principle of verse building
in the “Kronika Dalimila”]. Pamiętnik Literacki 54(2). 469–478.

Appendix
A series of discrete observations x_i, representing the realisations of a random
variable X_t, is called a time series if it can be spanned over a longitudinal axis
(corresponding to time or any other sequential quantity):

X_t = \{x_1, x_2, \ldots, x_i\}   (1)

The mean \mu_x of a time series X_t is defined as:

\mu_x = E(X_t)   (2)

estimated by:

m_x = \frac{1}{N} \sum_{t=1}^{N} x_t   (3)

where N – series length
x_t – value of a series at the moment or position t
Variance of a time series is defined as:

\sigma_x^2 = E(X_t - \mu_x)^2   (4)

estimated by (using the same notation):

s_x^2 = \frac{1}{N} \sum_{t=1}^{N} (x_t - m_x)^2   (5)

Autocovariance \gamma_k of a time series X_t is defined as:

\gamma_k = E\{(X_t - \mu_x)(X_{t+k} - \mu_x)\}   (6)

where \mu_x – series theoretical mean
k – lag separating the values x_t

Autocovariance is estimated by:

c_k = \frac{1}{N-k} \sum_{t=1}^{N-k} (x_t - m_x)(x_{t+k} - m_x)   (7)

where N – series length
x_t – value of a series at the moment or position t
k – lag separating the values x_t
Autocorrelation function of order k of a series is defined as:

\rho_k = \frac{\gamma_k}{\gamma_0} = \frac{\gamma_k}{\sigma_x^2}   (8)

where \gamma_k – autocovariance function (if k = 0 then \gamma_k = \sigma_x^2)
k – lag

ACF is estimated by the following formula:

r_k = \frac{c_k}{c_0} = \frac{c_k}{s_x^2}   (9)
ACF is the most important parameter in time series analysis. If it is non-
random, the estimation of stochastic models is justified and purposeful. If that is
not the case, it indicates that there is no deterministic component (metaphorically
called “memory”) in the data.
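For illustration, the estimators in equations (3), (5), (7) and (9) can be sketched in Python (a minimal sketch, not the code actually used in the present study); a 0/1 coding of stressed/non-stressed or long/short syllables can be fed directly into such a function:

import numpy as np

def sample_acf(x, max_lag):
    # estimated autocorrelation r_k for k = 0 .. max_lag, cf. equations (3), (5), (7), (9)
    x = np.asarray(x, dtype=float)
    n = len(x)
    m_x = x.mean()                      # sample mean, equation (3)
    s2_x = ((x - m_x) ** 2).sum() / n   # sample variance, equation (5)
    r = []
    for k in range(max_lag + 1):
        # sample autocovariance at lag k, equation (7)
        c_k = ((x[:n - k] - m_x) * (x[k:] - m_x)).sum() / (n - k)
        r.append(c_k / s2_x)            # equation (9): r_k = c_k / c_0
    return r

# a strictly alternating 0/1 series gives r_1 = -1 and r_2 = +1
print(sample_acf([1, 0] * 50, max_lag=4))
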
There are three basic models of time series, as well as many complex
models. As no complex models were discovered in the Chronicle of Dalimil, only
the simple ones will be presented.
A random process consists of independent realisations of a variable X_t =
\{e_1, e_2, \ldots\}. It is characterised by zero autocovariance and zero
autocorrelation. By analogy to the spectrum of light, the values e_t which have
normal distribution are called white noise.
An autoregressive process of order p consists of the subsequent x_t values
defined as:

x_t = a_1 x_{t-1} + a_2 x_{t-2} + \cdots + a_p x_{t-p} + e_t   (10)

where a_i – model coefficients
e_t – normally distributed random values
p – order of the AR process
A moving average process of order q consists of the subsequent x_t values
defined as:

x_t = e_t - b_1 e_{t-1} - b_2 e_{t-2} - \cdots - b_q e_{t-q}   (11)

where b_i – model coefficients
e_t – normally distributed random values
q – order of the MA process
Random, autoregressive and moving average processes (without trend) can
be aggregated in one complex model, called ARMA (the same notation as above):

x_t = a_1 x_{t-1} + a_2 x_{t-2} + \cdots + a_p x_{t-p} + e_t - b_1 e_{t-1} - b_2 e_{t-2} - \cdots - b_q e_{t-q}   (12)
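For illustration, the three process types of equations (10)–(12) can be simulated with the following Python sketch (the coefficient values are arbitrary and serve only as an example):

import numpy as np

rng = np.random.default_rng(0)

def simulate_arma(a, b, n=500):
    # x_t = a_1 x_{t-1} + ... + a_p x_{t-p} + e_t - b_1 e_{t-1} - ... - b_q e_{t-q}
    e = rng.normal(size=n)            # white noise e_t
    x = np.zeros(n)
    for t in range(n):
        ar = sum(a[i] * x[t - 1 - i] for i in range(len(a)) if t - 1 - i >= 0)
        ma = sum(b[j] * e[t - 1 - j] for j in range(len(b)) if t - 1 - j >= 0)
        x[t] = ar + e[t] - ma
    return x

white_noise = simulate_arma(a=[], b=[])        # random process
ar1 = simulate_arma(a=[0.6], b=[])             # AR(1), equation (10)
ma1 = simulate_arma(a=[], b=[0.6])             # MA(1), equation (11)
arma11 = simulate_arma(a=[0.6], b=[0.3])       # ARMA(1,1), equation (12)
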

The type of the model is determined by the behaviour of the ACF and PACF
(partial autocorrelation) functions, as their values may either slowly decrease or
abruptly truncate (cf. Pawłowski 2001: 71). It should be emphasized, however,
that in many cases of text analysis the ACF alone is quite sufficient to analyse
and linguistically interpret the data. This is precisely the case in the present
study of the text of Dalimil’s Chronicle.
Taraka Rama and Lars Borin
Comparative Evaluation of String Similarity
Measures for Automatic Language
Classification

1 Introduction
Historical linguistics, the oldest branch of modern linguistics, deals with lan-
guage-relatedness and language change across space and time. Historical lin-
guists apply the widely-tested comparative method (Durie and Ross, 1996) to
establish relationships between languages to posit a language family and to
reconstruct the proto-language for a language family.1 Although historical
linguistics has parallel origins with biology (Atkinson and Gray, 2005), unlike the
biologists, mainstream historical linguists have seldom been enthusiastic about
using quantitative methods for the discovery of language relationships or
investigating the structure of a language family, except for Kroeber and Chrétien
(1937) and Ellegård (1959). A short period of enthusiastic application of
quantitative methods initiated by Swadesh (1950) ended with the heavy criticism
levelled against it by Bergsland and Vogt (1962). The field of computational
historical linguistics did not receive much attention again until the beginning of
the 1990s, with the exception of two noteworthy doctoral dissertations, by
Sankoff (1969) and Embleton (1986).
In traditional lexicostatistics, as introduced by Swadesh (1952), distances
between languages are based on human expert cognacy judgments of items in
standardized word lists, e.g., the Swadesh lists (Swadesh, 1955). In the termi-
nology of historical linguistics, cognates are related words across languages that
can be traced directly back to the proto-language. Cognates are identified
through regular sound correspondences. Sometimes cognates have similar sur-
face form and related meanings. Examples of such revealing cognates are:
English ~ German night ~ Nacht ‘night’ and hound ~ Hund ‘dog’. If a word has
undergone many changes then the relatedness is not obvious from visual in-
spection and one needs to look into the history of the word to exactly understand


1 The Indo-European family is a classical case of the successful application of comparative
method which establishes a tree relationship between some of the most widely spoken
languages in the world.

the sound changes which resulted in the synchronic form. For instance, English
wheel and Hindi chakra ‘wheel’ are cognates and can be traced back to the
Proto-Indo-European root *kʷekʷlo-.
Recently, some researchers have turned to approaches more amenable to
automation, hoping that large-scale lexicostatistical language classification will
thus become feasible. The ASJP (Automated Similarity Judgment Program) pro-
ject2 represents such an approach, where automatically estimated distances
between languages are provided as input to phylogenetic programs originally
developed in computational biology (Felsenstein, 2004), for the purpose of
inferring genetic relationships among organisms.
As noted above, traditional lexicostatistics assumes that the cognate judg-
ments for a group of languages have been supplied beforehand. Given a stand-
ardized word list, consisting of 40–100 items, the distance between a pair of
languages is defined as the percentage of shared cognates subtracted from 100%.
This procedure is applied to all pairs of languages under consideration, to
produce a pairwise inter-language distance matrix. This inter-language distance
matrix is then supplied to a tree-building algorithm such as Neighbor-Joining (NJ;
Saitou and Nei, 1987) or a clustering algorithm such as Unweighted Pair Group
Method with Arithmetic Mean (UPGMA; Sokal and Michener, 1958) to infer a tree
structure for the set of languages. Swadesh (1950) applies essentially this method
– although completely manually – to the Salishan languages. The resulting
“family tree” is reproduced in figure 1.
The crucial element in these automated approaches is the method used for
determining the overall similarity between two word lists.3 Often, this is some
variant of the popular edit distance or Levenshtein distance (LD; Levenshtein,
1966). LD for a pair of strings is defined as the minimum number of symbol
(character) additions, deletions and substitutions needed to transform one string
into the other. A modified LD (called LDND) is used by the ASJP consortium, as
reported in their publications (e.g., Bakker et al. 2009 and Holman et al. 2008).
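For illustration, LD can be computed with the textbook dynamic-programming algorithm, sketched here in Python (a minimal sketch, not the ASJP program itself):

def levenshtein(a, b):
    # minimum number of insertions, deletions and substitutions turning a into b
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1,                    # deletion
                           cur[j - 1] + 1,                 # insertion
                           prev[j - 1] + (ca != cb)))      # substitution (or match)
        prev = cur
    return prev[-1]

print(levenshtein("hound", "hund"))   # 1
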


2 http://email.eva.mpg.de/~wichmann/ASJPHomePage.htm
3 At this point, we use “word list” and “language” interchangeably. Strictly speaking, a
language, as identified by its ISO 639-3 code, can have as many word lists as it has recognized
(described) varieties, i.e., doculects (Nordhoff and Hammarström, 2011).

2 Related Work
Cognate identification and tree inference are closely related tasks in historical
linguistics. Considering each task as a computational module would mean that
each cognate set identified across a set of tentatively related languages feeds into
the refinement of the tree inferred at each step. In a critical article, Nichols (1996)
points out that the historical linguistics enterprise, since its beginning, always
used a refinement procedure to posit relatedness and tree structure for a set of
tentatively related languages.4 The inter-language distance approach to tree-
building is, incidentally, straightforward and comparably accurate in comparison
to the computationally intensive Bayesian-based tree-inference approach of
Greenhill and Gray (2009).5
The inter-language distances are either an aggregate score of the pairwise
item distances or based on a distributional similarity score. The string similarity
measures used for the task of cognate identification can also be used for compu-
ting the similarity between two lexical items for a particular word sense.

2.1 Cognate identification

The task of automatic cognate identification has received a lot of attention in
language technology. Kondrak (2002a) compares a number of algorithms based
on phonetic and orthographical similarity for judging the cognateness of a word
pair. His work surveys string similarity/distance measures such as edit distance,
dice coefficient, and longest common subsequence ratio (LCSR) for the task of
cognate identification. It has to be noted that, until recently (Hauer and Kondrak,
2011, List, 2012), most of the work in cognate identification focused on
determining the cognateness between a word pair and not among a set of words
sharing the same meaning.
Ellison and Kirby (2006) use Scaled Edit Distance (SED)6 for computing intra-
lexical similarity for estimating language distances based on the dataset of Indo-


4 This idea is quite similar to the well-known Expectation-Maximization paradigm in machine
learning. Kondrak (2002b) employs this paradigm for extracting sound correspondences by
pairwise comparisons of word lists for the task of cognate identification. A recent paper by
Bouchard-Côté et al. (2013) employs a feed-back procedure for the reconstruction of Proto-
Austronesian with a great success.
5 For a comparison of these methods, see Wichmann and Rama, 2014.
6 SED is defined as the edit distance normalized by the average of the lengths of the pair of
strings.

European languages prepared by Dyen et al. (1992). The language distance matrix
is then given as input to the NJ algorithm – as implemented in the PHYLIP
package (Felsenstein, 2002) – to infer a tree for 87 Indo-European languages.
They make a qualitative evaluation of the inferred tree against the standard Indo-
European tree.

Fig. 1: Salishan language family box-diagram from Swadesh 1950.

Kondrak (2000) developed a string matching algorithm based on articulatory
features (called ALINE) for computing the similarity between a word pair. ALINE
was evaluated for the task of cognate identification against machine learning
algorithms such as Dynamic Bayesian Networks and Pairwise HMMs for
automatic cognate identification (Kondrak and Sherif, 2006). Even though the
approach is technically sound, it suffers due to the very coarse phonetic
transcription used in Dyen et al.’s Indo-European dataset.7
Inkpen et al. (2005) compared various string similarity measures for the task
of automatic cognate identification for two closely related languages: English
and French. The paper shows an impressive array of string similarity measures.
However, the results are very language-specific, and it is not clear that they can
be generalized even to the rest of the Indo-European family.
Petroni and Serva (2010) use a modified version of Levenshtein distance for
inferring the trees of the Indo-European and Austronesian language families. LD
is usually normalized by the maximum of the lengths of the two words to account
for length bias. The length normalized LD can then be used in computing
distances between a pair of word lists in at least two ways: LDN and LDND (Le-
venshtein Distance Normalized Divided). LDN is computed as the sum of the
length normalized Levenshtein distance between the words occupying the same
meaning slot divided by the number of word pairs. Similarity between phoneme
inventories and chance similarity might cause a pair of not-so-related languages
to show up as related languages. This is compensated for by computing the
length-normalized Levenshtein distance between all the pairs of words occu-
pying different meaning slots and summing the different word-pair distances.
The summed Levenshtein distance between the words occupying the same
meaning slots is divided by the sum of Levenshtein distances between different
meaning slots. The intuition behind this idea is that if two languages are shown
to be similar (small distance) due to accidental chance similarity then the de-
nominator would also be small and the ratio would be high.
If the languages are not related and also share no accidental chance simi-
larity, then the distance as computed in the numerator would be unaffected by
the denominator. If the languages are related then the distance as computed in
the numerator is small anyway, whereas the denominator would be large since
the languages are similar due to genetic relationship and not from chance sim-
ilarity. Hence, the final ratio would be smaller than the original distance given in
the numerator.
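For illustration, the LDN and LDND computations can be sketched as follows (a minimal sketch based on one reading of the definitions above, in which both the same-slot and the cross-slot distances are averaged over the respective number of word pairs; it is not the ASJP implementation itself):

def normalised_ld(a, b):
    # length-normalised Levenshtein distance (same dynamic programme as sketched in section 1)
    prev = list(range(len(b) + 1))
    for i, ca in enumerate(a, start=1):
        cur = [i]
        for j, cb in enumerate(b, start=1):
            cur.append(min(prev[j] + 1, cur[j - 1] + 1, prev[j - 1] + (ca != cb)))
        prev = cur
    return prev[-1] / max(len(a), len(b))

def ldn_ldnd(list_a, list_b):
    # list_a and list_b are word lists aligned by meaning slot
    same = [normalised_ld(a, b) for a, b in zip(list_a, list_b)]
    ldn = sum(same) / len(same)
    # distances between words occupying *different* meaning slots (chance similarity)
    diff = [normalised_ld(a, b)
            for i, a in enumerate(list_a)
            for j, b in enumerate(list_b) if i != j]
    ldnd = ldn / (sum(diff) / len(diff))
    return ldn, ldnd

# toy example with invented forms, not real ASJP data
print(ldn_ldnd(["kita", "namo", "sol"], ["kite", "nama", "sul"]))
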
Petroni and Serva (2010) claim that LDN is more suitable than LDND for
measuring linguistic distances. In reply, Wichmann et al. (2010a) empirically
show that LDND performs better than LDN for distinguishing pairs of languages
belonging to the same family from pairs of languages belonging to different
families.


7 The dataset contains 200-word Swadesh lists for 95 language varieties. Available on
http://www. wordgumbo.com/ie/cmp/index.htm.

As noted by Jäger (2014), Levenshtein distance only matches strings based on
symbol identity whereas a graded notion of sound similarity would be a closer
approximation to historical linguistics as well as achieving better results at the
task of phylogenetic inference. Jäger (2014) uses empirically determined weights
between symbol pairs (from computational dialectometry; Wieling et al. 2009) to
compute distances between ASJP word lists and finds that there is an
improvement over LDND at the task of internal classification of languages.

2.2 Distributional similarity measures

Huffman (1998) computes pairwise language distances based on character n-
grams extracted from Bible texts in European and American Indian languages
(mostly from the Mayan language family). Singh and Surana (2007) use character
n-grams extracted from raw comparable corpora of ten languages from the Indian
subcontinent for computing the pairwise language distances between languages
belonging to two different language families (Indo-Aryan and Dravidian). Rama
and Singh (2009) introduce a factored language model based on articulatory
features to induce an articulatory feature level n-gram model from the dataset of
Singh and Surana, 2007. The feature n-grams of each language pair are compared
using a distributional similarity measure called cross-entropy to yield a single
point distance between the language pair. These scholars find that the
distributional distances agree with the standard classification to a large extent.
Inspired by the development of tree similarity measures in computational
biology, Pompei et al. (2011) evaluate the performance of LDN vs. LDND on the
ASJP and Austronesian Basic Vocabulary databases (Greenhill et al., 2008). They
compute NJ and Minimum Evolution trees8 for LDN as well as LDND distance
matrices. They compare the inferred trees to the classification given in the
Ethnologue (Lewis, 2009) using two different tree similarity measures: General-
ized Robinson-Foulds distance (GRF; A generalized version of Robinson-Foulds
[RF] distance; Robinson and Foulds 1979) and Generalized Quartet distance
(GQD; Christiansen et al. 2006). GRF and GQD are specifically designed to ac-
count for the polytomous nature – a node having more than two children – of the
Ethnologue trees. For example, the Dravidian family tree shown in figure 3
exhibits four branches radiating from the top node. Finally, Huff and Lonsdale
(2011) compare the NJ trees from ALINE and LDND distance metrics to Ethnologue
trees using RF distance. The authors did not find any significant improvement by


8 A tree building algorithm closely related to NJ.
using a linguistically well-informed similarity measure such as ALINE over
LDND.

3 Is LD the best string similarity measure for language classification?
LD is only one of a number of string similarity measures used in fields such as
language technology, information retrieval, and bio-informatics. Beyond the
works cited above, to the best of our knowledge, there has been no study to
compare different string similarity measures on something like the ASJP dataset
in order to determine their relative suitability for genealogical classification9. In
this paper we compare various string similarity measures10 for the task of auto-
matic language classification. We evaluate their effectiveness in language dis-
crimination through a distinctiveness measure; and in genealogical classification
by comparing the distance matrices to the language classifications provided by
WALS (World Atlas of Language Structures; Haspelmath et al., 2011)11 and
Ethnologue.
Consequently, in this article we attempt to provide answers to the following
questions:
Out of the numerous string similarity measures listed below in section 5:
– Which measure is best suited for the task of distinguishing related lan-
guages from unrelated languages?
– Which measure is best suited for the task of internal language classifica-
tion?
– Is there a procedure for determining the best string similarity measure?


9 One reason for this may be that the experiments are computationally demanding, requiring
several days for computing a single measure over the whole ASJP dataset.
10 A longer list of string similarity measures is available on: http://www.coli.uni-saarland.de/
courses/LT1/2011/slides/stringmetrics.pdf
11 WALS does not provide a classification to all the languages of the world. The ASJP
consortium gives a WALS-like classification to all the languages present in their database.

4 Database and language classifications

4.1 Database

The ASJP database offers a readily available, if minimal, basis for massive cross-
linguistic investigations. The ASJP effort began with a small dataset of 100-word
lists for 245 languages. These languages belong to 69 language families. Since its
first version presented by Brown et al. (2008), the ASJP database has been going
through a continuous expansion, to include in the version used here (v. 14,
released in 2011)12 more than 5500 word lists representing close to half the
languages spoken in the world (Wichmann et al., 2011). Because of the findings
reported by Holman et al. (2008), the later versions of the database aimed to cover
only the 40-item most stable Swadesh sublist, and not the 100-item list.
Each lexical item in an ASJP word list is transcribed in a broad phonetic
transcription known as ASJP Code (Brown et al., 2008). The ASJP code consists of
34 consonant symbols, 7 vowels, and four modifiers (∗, ”, ∼, $), all rendered by
characters available on the English version of the QWERTY keyboard. Tone,
stress, and vowel length are ignored in this transcription format. The three
modifiers combine symbols to form phonologically complex segments (e.g.,
aspirated, glottalized, or nasalized segments).
In order to ascertain that our results would be comparable to those published
by the ASJP group, we successfully replicated their experiments for LDN and
LDND measures using the ASJP program and the ASJP dataset version 12
(Wichmann et al., 2010b).13 This database comprises reduced (40-item) Swadesh
lists for 4169 linguistic varieties. All pidgins, creoles, mixed languages, artificial
languages, proto-languages, and languages extinct before 1700 CE were ex-
cluded for the experiment, as were language families represented by less than 10
word lists (Wichmann et al., 2010a),14 as well as word lists containing less than
28 words (70% of 40).
This leaves a dataset with 3730 word lists. It turned out that an additional 60
word lists did not have English glosses for the items, which meant that they could


12 The latest version is v. 16, released in 2013.
13 The original python program was created by Hagen Jung. We modified the program to handle
the ASJP modifiers.
14 The reason behind this decision is that correlations resulting from smaller samples (less than
40 language pairs) tend to be unreliable.
not be processed by the program, so these languages were also excluded from the
analysis.
All the experiments reported in this paper were performed on a subset of
version 14 of the ASJP database whose language distribution is shown in figure
2.15 The database has 5500 word lists. The same selection principles that were
used for version 12 (described above) were applied for choosing the languages to
be included in our experiments. The final dataset for our experiments has 4743
word lists for 50 language families. We use the family names of the WALS
(Haspelmath et al., 2011) classification.

Fig. 2: Distribution of languages in ASJP database (version 14).

The WALS classification is a two-level classification where each language belongs
to a genus and a family. A genus is a genetic classification unit given by Dryer
(2000) and consists of a set of languages supposedly descended from a common
ancestor which is 3000 to 3500 years old. For instance, Indo-Aryan languages are
classified as a separate genus from Iranian languages, although it is quite well
known that both Indo-Aryan and Iranian languages are descended from a
common proto-Indo-Iranian ancestor.
The Ethnologue classification is a multi-level tree classification for a language
family. This classification is often criticized for being too “lumping”, i.e., too
liberal in positing genetic relatedness between languages. The highest node in a
family tree is the family itself and languages form the lowest nodes (leaves).


15 Available for downloading at http://email.eva.mpg.de/~wichmann/listss14.zip.

An internal node in the tree is not necessarily binary. For instance, the
Dravidian language family has four branches emerging from the top node (see
figure 3 for the Ethnologue family tree of Dravidian languages).

Table 1: Distribution of language families in ASJP database. WN and WLs stand for WALS
Name and Word Lists.

Family Name WN # WLs Family Name WN # WLs


Afro-Asiatic AA 287 Mixe-Zoque MZ 15
Algic Alg 29 MoreheadU.Maro MUM 15
Altaic Alt 84 Na-Dene NDe 23
Arwakan Arw 58 Nakh-Daghestanian NDa 32
Australian Aus 194 Niger-Congo NC 834
Austro-Asiatic AuA 123 Nilo-Saharan NS 157
Austronesian An 1008 Otto-Manguean OM 80
Border Bor 16 Panoan Pan 19
Bosavi Bos 14 Penutian Pen 21
Carib Car 29 Quechuan Que 41
Chibchan Chi 20 Salish Sal 28
Dravidian Dra 31 Sepik Sep 26
Eskimo-Aleut EA 10 Sino-Tibetan ST 205
Hmong-Mien HM 32 Siouan Sio 17
Hokan Hok 25 Sko Sko 14
Huitotoan Hui 14 Tai-Kadai TK 103
Indo-European IE 269 Toricelli Tor 27
Kadugli Kad 11 Totonacan Tot 14
Khoisan Kho 17 Trans-NewGuinea TNG 298
Kiwain Kiw 14 Tucanoan Tuc 32
LakesPlain LP 26 Tupian Tup 47
Lower-Sepik-Ramu LSR 20 Uralic Ura 29
Macro-Ge MGe 24 Uto-Aztecan UA 103
Marind Mar 30 West-Papuan WP 33
Mayan May 107 WesternFly WF 38

Fig. 3: Ethnologue tree for the Dravidian language family.

5 Similarity measures
For the experiments described below, we have considered both string similarity
measures and distributional measures for computing the distance between a pair
of languages. As mentioned earlier, string similarity measures work at the level
of word pairs and provide an aggregate score of the similarity between word pairs
whereas distributional measures compare the n-gram profiles between a
language pair to yield a distance score.

5.1 String similarity measures

The different string similarity measures for a word pair that we have investigated
are the following:
– IDENT returns 1 if the words are identical, otherwise it returns 0.
– PREFIX returns the length of the longest common prefix divided by the length
of the longer word.
– DICE is defined as the number of shared bigrams divided by the total number
of bigrams in both the words.
– LCS is defined as the length of the longest common subsequence divided by
the length of the longer word (Melamed, 1999).
– TRIGRAM is defined in the same way as DICE but uses trigrams for computing
the similarity between a word pair.
– XDICE is defined in the same way as DICE but uses “extended bigrams”,
which are trigrams without the middle letter (Brew and McKelvie, 1996).
– Jaccard’s index, JCD, is a set cardinality measure that is defined as the ratio
of the number of shared bigrams between the two words to the size of the
union of the bigrams of the two words.
– LDN, as defined above.

Each word-pair similarity score is converted to its distance counterpart by
subtracting the score from 1.0.16 Note that this conversion can sometimes result
in a negative distance which is due to the double normalization involved in
LDND.17 This distance score for a word pair is then used to compute the pairwise
distance between a language pair. The distance computation between a language
pair is performed as described in section 2.1. Following the naming convention of
LDND, a suffix “D” is added to the name of each measure to indicate its LDND
distance variant.
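For illustration, several of the word-pair measures listed above can be sketched in Python as follows (simplified sketches; details such as multiset versus set counting of bigrams, or the factor of 2 in the conventional Dice coefficient, may differ from the implementations actually evaluated):

def bigrams(w):
    return [w[i:i + 2] for i in range(len(w) - 1)]

def prefix(a, b):
    # longest common prefix, normalised by the longer word
    p = 0
    while p < min(len(a), len(b)) and a[p] == b[p]:
        p += 1
    return p / max(len(a), len(b))

def dice(a, b):
    A, B = bigrams(a), bigrams(b)
    shared = sum(min(A.count(g), B.count(g)) for g in set(A))
    return 2 * shared / (len(A) + len(B)) if A or B else 0.0

def jcd(a, b):
    A, B = set(bigrams(a)), set(bigrams(b))
    return len(A & B) / len(A | B) if A | B else 0.0

def lcs(a, b):
    # longest common subsequence, normalised by the longer word
    table = [[0] * (len(b) + 1) for _ in range(len(a) + 1)]
    for i, ca in enumerate(a, start=1):
        for j, cb in enumerate(b, start=1):
            table[i][j] = table[i - 1][j - 1] + 1 if ca == cb \
                else max(table[i - 1][j], table[i][j - 1])
    return table[-1][-1] / max(len(a), len(b))

# each similarity is turned into a distance as 1 - similarity
print(1 - dice("hound", "hund"), 1 - lcs("hound", "hund"))
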

5.2 N-gram similarity

N-gram similarity measures are inspired by a line of work initially pursued in the
context of information retrieval, aiming at automatic language identification in a
multilingual document. Cavnar and Trenkle (1994) used character n-grams for
text categorization. They observed that different document categories –
including documents in different languages – have characteristic character n-
gram profiles. The rank of a character n-gram varies across different categories
and documents belonging to the same category have similar character n-gram
Zipfian distributions.


16 Lin (1998) investigates three distance to similarity conversion techniques and motivates the
results from an information-theoretical point of view. In this article, we do not investigate the
effects of similarity to distance conversion. Rather, we stick to the traditional conversion
technique
17 Thus, the resulting distance is not a true distance metric.

Building on this idea, Dunning (1994, 1998) postulates that each language has its
own signature character (or phoneme; depending on the level of transcription) n-
gram distribution. Comparing the character n-gram profiles of two languages can
yield a single point distance between the language pair. The comparison
procedure is usually accomplished through the use of one of the distance
measures given in Singh 2006. The following steps are followed for extracting the
phoneme n-gram profile for a language:
– An n-gram is defined as the consecutive phonemes in a window of N. The
value of N usually ranges from 1 to 5.
– All n-grams are extracted for a lexical item. This step is repeated for all the
lexical items in a word list.
– All the extracted n-grams are mixed and sorted in the descending order of
their frequency. The relative frequency of the n-grams is computed.
– Only the top G n-grams are retained and the rest of them are discarded. The
value of G is determined empirically.

For a language pair, the n-gram profiles can be compared using one of the
following distance measures:
1. Out-of-Rank measure is defined as the aggregate sum of the absolute differ-
ence in the rank of the shared n-grams between a pair of languages. If there
are no shared n-grams between the two profiles, then the difference in
ranks is assigned a maximum out-of-place score.
2. Jaccard’s index is a set cardinality measure. It is defined as the ratio of the
cardinality of the intersection of the n-grams between the two languages to
the cardinality of the union of their n-grams.
3. Dice distance is related to Jaccard’s Index. It is defined as the ratio of twice
the number of shared n-grams to the total number of n-grams in both the
language profiles.
4. Manhattan distance is defined as the sum of the absolute difference between
the relative frequency of the shared n-grams.
5. Euclidean distance is defined in a similar fashion to Manhattan distance
where the individual terms are squared.
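For illustration, the profile extraction steps and two of the profile comparisons above can be sketched as follows (a simplified sketch; the n-gram range and the cut-off G stand in for the empirically determined values):

from collections import Counter

def ngram_profile(word_list, n_max=5, top_g=200):
    counts = Counter()
    for word in word_list:
        for n in range(1, n_max + 1):
            counts.update(word[i:i + n] for i in range(len(word) - n + 1))
    total = sum(counts.values())
    # relative frequencies of the G most frequent n-grams; the rest are discarded
    return {g: c / total for g, c in counts.most_common(top_g)}

def jaccard_distance(p, q):
    # 1 minus the ratio of shared n-grams to the union of n-grams
    return 1 - len(set(p) & set(q)) / len(set(p) | set(q))

def manhattan_distance(p, q):
    # sum of absolute differences of relative frequencies over the shared n-grams
    return sum(abs(p[g] - q[g]) for g in set(p) & set(q))
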

While replicating the original ASJP experiments on the version 12 ASJP database,
we also tested whether the above distributional measures (1–5) perform as well as LDN.
Unfortunately, the results were not encouraging, and we did not repeat the
experiments on version 14 of the database. One main reason for this result is the
relatively small size of the ASJP concept list, which provides a poor estimate of
the true language signatures.

This factor speaks equally, or even more, against including another class of n-
gram-based measures, namely information-theoretic measures such as cross
entropy and KL-divergence. These measures have been well-studied in natural
language processing tasks such as machine translation, natural language
parsing, sentiment identification, and also in automatic language identification.
However, the probability distributions required for using these measures are
usually estimated through maximum likelihood estimation, which requires a fairly
large amount of data, and the short ASJP concept lists will hardly qualify in this
regard.

6 Evaluation measures
The measures which we have used for evaluating the performance of string
similarity measures given in section 5 are the following three:
1. dist was originally suggested by Wichmann et al. (2010a), and tests if LDND
is better than LDN at the task of distinguishing related languages from un-
related languages.
2. RW – a special case of Pearson’s r called point biserial correlation (Tate,
1954) – computes the agreement between the intra-family pairwise distances
and the WALS classification for the family.
3. γ is related to Goodman and Kruskal’s Gamma (1954) and measures the
strength of association between two ordinal variables. In this paper, it is used
to compute the level of agreement between the pairwise intra-family
distances and the family’s Ethnologue classification.

6.1 Distinctiveness measure (dist)

The dist measure for a family consists of three components: the mean of the
pairwise distances inside a language family (din); and the mean of the pairwise
distances from each language in a family to the rest of the language families (dout).
sdout is defined as the standard deviation of all the pairwise distances used to
compute dout. Finally, dist is defined as (din−dout)/sdout. The resistance of a string
similarity measure to other language families is reflected by the value of sdout.
A comparatively higher dist value suggests that a string similarity measure is
particularly resistant to random similarities between unrelated languages and
performs well at distinguishing languages belonging to the same language family
from languages in other language families.
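For illustration, the dist measure as defined above can be sketched as follows (a minimal sketch; the input is assumed to be the full pairwise distance matrix together with the indices of the languages belonging to one family):

import numpy as np

def dist_measure(d, family_idx):
    family_idx = set(family_idx)
    inside, outside = [], []
    for i in range(len(d)):
        for j in range(i + 1, len(d)):
            if i in family_idx and j in family_idx:
                inside.append(d[i][j])                  # intra-family distances (for d_in)
            elif (i in family_idx) != (j in family_idx):
                outside.append(d[i][j])                 # family-to-other-families distances (for d_out)
    d_in, d_out = np.mean(inside), np.mean(outside)
    sd_out = np.std(outside)
    return (d_in - d_out) / sd_out                      # dist as defined above
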

6.2 Correlation with WALS

The WALS database provides a three-level classification. The top level is the
language family, the second level is the genus and the lowest level is the
language itself. If two languages belong to different families, then the distance is
3. Two languages that belong to different genera in the same family have a
distance of 2. If the two languages fall in the same genus, they have a distance of
1. This allows us to define a distance matrix for each family based on WALS. The
WALS distance matrix can be compared to the distance matrices of any string
similarity measure using point biserial correlation – a special case of Pearson’s r.
If a family has a single genus in the WALS classification there is no computation
of RW and the corresponding row for a family is empty in table 7.
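For illustration, the RW computation for a single family can be sketched as follows (a minimal sketch, assuming a genus label for every language of the family):

import numpy as np

def rw_for_family(lexical, genus):
    # lexical: intra-family distance matrix; genus: genus label per language
    lex, wals = [], []
    for i in range(len(genus)):
        for j in range(i + 1, len(genus)):
            lex.append(lexical[i][j])
            wals.append(1 if genus[i] == genus[j] else 2)   # WALS distance within a family
    if len(set(wals)) < 2:
        return None          # single-genus family: RW is not computed (cf. table 7)
    return np.corrcoef(lex, wals)[0, 1]   # Pearson's r; point biserial, since wals is dichotomous
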

6.3 Agreement with Ethnologue

Given a distance-matrix d of order N × N, where each cell dij is the distance be-
tween two languages i and j; and an Ethnologue tree E, the computation of γ for
a language family is defined as follows:
1. Enumerate all the triplets for a language family of size N. A triplet t for a
language family is defined as {i, j, k}, where i ≠ j ≠ k are languages belonging
to the family. A language family of size N has N(N−1)(N−2)/6 triplets.
2. For the members of each such triplet t, there are three lexical distances dij ,
dik, and djk. The expert classification tree E can treat the three languages {i, j,
k} in four possible ways (| denotes a partition): {i, j | k}, {i, k | j}, {j, k | i} or can
have a tie where all languages emanate from the same node. All ties are
ignored in the computation of γ.18
3. A distance triplet dij , dik, and djk is said to agree completely with an Ethno-
logue partition {i, j | k} when dij < dik and dij < djk. A triplet that satisfies these
conditions is counted as a concordant comparison, C; else it is counted as a
discordant comparison, D.


18 We do not know what a tie in the gold standard indicates: uncertainty in the classification,
or a genuine multi-way branching? Whenever the Ethnologue tree of a family is completely
unresolved, it is shown by an empty row. For example, the family tree of Bosavi languages is a
star structure. Hence, the corresponding row in table 5 is left empty.

4. Steps 2 and 3 are repeated for all the triplets to yield γ for a family, defined as
γ = (C−D)/(C+D). γ lies in the range [−1, 1], where a score of −1 indicates perfect
disagreement and a score of +1 indicates perfect agreement.
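For illustration, this triplet-based computation can be sketched as follows (a minimal sketch; the partition function representing the Ethnologue tree is a hypothetical interface returning the pair grouped together, or None for a tie):

from itertools import combinations

def gamma(d, partition):
    concordant = discordant = 0
    for i, j, k in combinations(range(len(d)), 3):
        grouped = partition(i, j, k)           # e.g. (i, j) for the partition {i, j | k}; None for a tie
        if grouped is None:
            continue                           # ties are ignored
        a, b = grouped
        c = ({i, j, k} - {a, b}).pop()         # the language left out of the pair
        if d[a][b] < d[a][c] and d[a][b] < d[b][c]:
            concordant += 1                    # concordant comparison, C
        else:
            discordant += 1                    # discordant comparison, D
    total = concordant + discordant
    return (concordant - discordant) / total if total else float("nan")
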

At this point, one might wonder about the decision for not using an off-the-
shelf tree-building algorithm to infer a tree and compare the resulting tree with
the Ethnologue classification. Although both Pompei et al. (2011) and Huff and
Lonsdale (2011) compare their inferred trees – based on Neighbor-Joining and
Minimum Evolution algorithms – to Ethnologue trees using cleverly crafted tree-
distance measures (GRF and GQD), they do not make the more intuitively useful
direct comparison of the distance matrices to the Ethnologue trees. The tree
inference algorithms use heuristics to find the best tree from the available tree
space. The number of possible rooted, non-binary and unlabeled trees is quite
large even for a language family of size 20 – about 256 × 10⁶.
A tree inference algorithm uses heuristics to reduce the tree space to find the
best tree that explains the distance matrix. A tree inference algorithm can make
mistakes while searching for the best tree. Moreover, there are many variations
of Neighbor-Joining and Minimum Evolution algorithms.19 Ideally, one would
have to test the different tree inference algorithms and then decide the best one
for our task. However, the focus of this paper rests on the comparison of different
string similarity algorithms and not on tree inference algorithms. Hence, a direct
comparison of a family’s distance matrix to the family’s Ethnologue tree
circumvents the choice of the tree inference algorithm.

7 Results and discussion


In table 2 we give the results of our experiments. We only report the average
results for all measures across the families listed in table 1. Further, we check the
correlation between the performance of the different string similarity measures
across the three evaluation measures by computing Spearman’s ρ. The pairwise
ρ is given in table 3. The high correlation value of 0.95 between RW and γ suggests
that all the measures agree roughly on the task of internal classification.
The average scores in each column suggest that the string similarity measures
exhibit different degrees of performance. How does one decide which measure is
the best in a column? What kind of statistical testing procedure should be


19 http://www.atgc-montpellier.fr/fastme/usersguide.php
adopted for deciding upon a measure? We address these questions through the
following procedure:
1. For a column i, sort the average scores, s in descending order.
2. For a row index 1 ≤ r ≤ 16, test the significance of sr ≥ sr+1 through a sign test
(Sheskin, 2003). This test yields a p−value.

The above significant tests are not independent by themselves. Hence, we


cannot reject a null hypothesis H0 at a significance level of α = 0.01. The α needs
to be corrected for multiple tests. Unfortunately, the standard Bonferroni’s
multiple test correction and Fisher’s Omnibus test work for a global null
hypothesis and not at the level of a single test. We follow the procedure, called
False Discovery Rate (FDR), given by Benjamini and Hochberg (1995) for ad-
justing the α value for multiple tests. Given H1 . . . Hm null hypotheses and P1 . . .
Pm p-values, the procedure works as follows:
1. Sort the Pk, 1 ≤ k ≤ m, values in ascending order. k is the rank of a p-value.
2. The adjusted α*k value for Pk is (k/m)α.
3. Reject all the H0s from 1, . . . , k where Pk+1 > α*k.
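For illustration, the procedure can be sketched as follows (a minimal sketch of the Benjamini–Hochberg step, applied to the p-values obtained from the sign tests):

def fdr_reject(p_values, alpha=0.05):
    m = len(p_values)
    order = sorted(range(m), key=lambda i: p_values[i])   # hypotheses by ascending p-value
    k_max = 0
    for rank, idx in enumerate(order, start=1):
        if p_values[idx] <= (rank / m) * alpha:           # adjusted threshold (k/m) * alpha
            k_max = rank
    # reject the hypotheses with the k_max smallest p-values
    return {order[i] for i in range(k_max)}
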

Table 2: Average results for each string similarity measure across the 50 families. The rows are
sorted by the name of the measure.

Measure Average Dist Average RW Average γ


DICE 3.3536 0.5449 0.6575
DICED 9.4416 0.5495 0.6607
IDENT 1.5851 0.4013 0.2345
IDENTD 8.163 0.4066 0.3082
JCD 13.9673 0.5322 0.655
JCDD 15.0501 0.5302 0.6622
LCS 3.4305 0.6069 0.6895
LCSD 6.7042 0.6151 10.6984
LDN 3.7943 0.6126 0.6984
LDND 7.3189 0.619 0.7068
PREFIX 3.5583 0.5784 0.6747
PREFIXD 7.5359 0.5859 0.6792
TRIGRAM 1.9888 0.4393 0.4161
TRIGRAMD 9.448 0.4495 0.5247
XDICE 0.4846 0.3085 0.433
XDICED 2.1547 0.4026 0.4838
Average 6.1237 0.5114 0.5739

Table 3: Spearman’s ρ between γ, RW, and Dist

Dist RW
γ 0.30 0.95
Dist 0.32

The above procedure ensures that the chance of incorrectly rejecting a null
hypothesis is 1 in 20 for α = 0.05 and 1 in 100 for α = 0.01. In this experimental
context, this suggests that we erroneously reject 0.75 true null hypotheses out of
15 hypotheses for α = 0.05 and 0.15 hypotheses for α = 0.01. We report the Dist, γ,
and RW for each family in tables 5, 6, and 7. In each of these tables, only those
measures which are above the average scores from table 2 are reported.
The FDR procedure for γ suggests that no sign test is significant. This is in
agreement with the result of Wichmann et al. (2010a), who showed that the choice
of LDN or LDND is quite unimportant for the task of internal classification. The
FDR procedure for RW suggests that LDN > LCS, LCS > PREFIXD, DICE > JCD, and
JCD > JCDD. Here A > B denotes that A is significantly better than B. The FDR
procedure for Dist suggests that JCDD > JCD, JCD > TRID, DICED > IDENTD, LDND
> LCSD, and LCSD > LDN.
The results point towards an important direction in the task of building
computational systems for automatic language classification. The pipeline for
such a system consists of (1) distinguishing related languages from unrelated
languages; and (2) internal classification. JCDD performs the best with
respect to Dist. Further, JCDD is derived from JCD and can be computed in O(m +
n), for two strings of length m and n. In comparison, LDN is in the order of O(mn).
In general, the computational complexity for computing distance between two
word lists for all the significant measures is given in table 4. Based on the
computational complexity and the significance scores, we propose that JCDD be
used for step 1 and a measure like LDN be used for internal classification.

Table 4: Computational complexity of top performing measures for computing distance be-
tween two word lists. Given two word lists each of length l. m and n denote the lengths of a
word pair wa and wb and C = l(l − 1)/2.

Measure Complexity
JCDD C · O(m + n + min(m − 1, n − 1))
JCD l · O(m + n + min(m − 1, n − 1))
LDND C · O(mn)
LDN l · O(mn)
PREFIXD C · O(max(m, n))
LCSD C · O(mn)
LCS l · O(mn)
DICED C · O(m + n + min(m − 2, n − 2))
DICE l · O(m + n + min(m − 2, n − 2))

8 Conclusion
In this article, we have presented the first known attempt to apply more than 20
different similarity (or distance) measures to the problem of genetic classification
of languages on the basis of Swadesh-style core vocabulary lists. The experiments
were performed on the wide-coverage ASJP database (about half the world’s
languages).
We have examined the various measures at two levels, namely: (1) their ca-
pability of distinguishing related and unrelated languages; and (2) their perfor-
mance as measures for internal classification of related languages. We find that
the choice of string similarity measure (among the tested pool of measures) is not
very important for the task of internal classification whereas the choice affects
the results of discriminating related languages from unrelated ones.

Acknowledgments
The authors thank Søren Wichmann, Eric W. Holman, Harald Hammarström, and
Roman Yangarber for useful comments which have helped us to improve the
presentation. The string similarity experiments have been made possible through
the use of ppss software20 recommended by Leif-Jöran Olsson. The first author
would like to thank Prasant Kolachina for the discussions on parallel
implementations in Python. The work presented here was funded in part by the
Swedish Research Council (the project Digital areal linguistics; contract no. 2009-
1448).


20 http://code.google.com/p/ppss/

References
Atkinson, Quentin D. & Russell D. Gray. 2005. Curious parallels and curious connections —
phylogenetic thinking in biology and historical linguistics. Systematic Biology 54(4). 513–
526.
Bakker, Dik, André Müller, Viveka Velupillai, Søren Wichmann, Cecil H. Brown, Pamela Brown,
Dmitry Egorov, Robert Mailhammer, Anthony Grant & Eric W. Holman. 2009. Adding
typology to lexicostatistics: A combined approach to language classification. Linguistic
Typology 13(1). 169–181.
Benjamini, Yoav & Yosef Hochberg. 1995. Controlling the false discovery rate: A practical and
powerful approach to multiple testing. Journal of the Royal Statistical Society B 57(1).
289–300.
Bergsland, Knut & Hans Vogt. 1962. On the validity of glottochronology. Current Anthropology,
3(2). 115–153.
Bouchard-Côté, Alexandre, David Hall, Thomas L. Griffiths & Dan Klein. 2013. Automated
reconstruction of ancient languages using probabilistic models of sound change.
Proceedings of the National Academy of Sciences, 110(11). 4224–4229.
Brew, Chris & David McKelvie. 1996. Word-pair extraction for lexicography. In Kemal Oflazer &
Harold Somers (eds.), Proceedings of the Second International Conference on New
Methods in Language Processing, 45–55. Ankara.
Brown, Cecil H., Eric W. Holman, Søren Wichmann, & Viveka Velupillai. 2008. Automated
classification of the world’s languages: A description of the method and preliminary
results. Sprachtypologie und Universalienforschung 61(4). 285–308.
Cavnar, William B. & John M. Trenkle. 1994. N-gram-based text categorization. In Proceedings
of SDAIR-94, 3rd Annual Symposium on Document Analysis and Information Retrieval,
161–175. Las Vegas (NV): UNLV Publications.
Christiansen, Chris, Thomas Mailund, Christian N. S. Pedersen, Martin Randers, & Martin S.
Stissing. 2006. Fast calculation of the quartet distance between trees of arbitrary
degrees. Algorithms for Molecular Biology 1. Article No. 16.
Dryer, Matthew S. 2000. Counting genera vs. counting languages. Linguistic Typology 4. 334–
350.
Dryer, Matthew S. & Martin Haspelmath. 2013. The world atlas of language structures online.
Leipzig: Max Planck Institute for Evolutionary Anthropology.
http://wals.info (accessed on 26 November 2014).
Dunning, Ted E. 1994. Statistical identification of language. Technical Report CRL MCCS-94-
273. Las Cruces (NM): New Mexico State University, Computing Research Lab.
Dunning, Ted E. 1998. Finding structure in text, genome and other symbolic sequences.
Sheffield: University of Sheffield.
Durie, Mark & Malcolm Ross (eds.). 1996. The comparative method reviewed: Regularity and
irregularity in language change. Oxford & New York: Oxford University Press.
Dyen, Isidore, Joseph B. Kruskal, & Paul Black. 1992. An Indo-European classification: A
lexicostatistical experiment. Transactions of the American Philosophical Society 82(5). 1–
132.
Ellegård, Alvar. 1959. Statistical measurement of linguistic relationship. Language 35(2). 131–
156.
Ellison, T. Mark & Simon Kirby. 2006. Measuring language divergence by intra-lexical
comparison. In Proceedings of the 21st International Conference on Computational
Linguistics and 44th Annual Meeting of the Association for Computational Linguistics,
pages 273–280. Association for Computational Linguistics.
http://www.aclweb.org/anthology/P06-1035 (accessed 27 November 2014).
Embleton, Sheila M. 1986. Statistics in historical linguistics (Quantitative Linguistics 30).
Bochum: Brockmeyer.
Felsenstein, Joseph. 2002. PHYLIP (phylogeny inference package) version 3.6 a3. Distributed
by the author. Seattle (WA): University of Washington, Department of Genome Sciences.
Felsenstein, Joseph. 2004. Inferring phylogenies. Sunderland (MA): Sinauer Associates.
Goodman, Leo A. & William H. Kruskal. 1954. Measures of association for cross classifications.
Journal of the American Statistical Association 49(268). 732–764.
Greenhill, Simon J. & Russell D. Gray. 2009. Austronesian language phylogenies: Myths and
misconceptions about Bayesian computational methods. In Alexander Adelaar & Andrew
Pawlye (eds.), Austronesian historical linguistics and culture history: A festschrift for
Robert Blust, 375–397. Canberra: Pacific Linguistics
Greenhill, Simon J., Robert Blust & Russell D. Gray. 2008. The Austronesian basic vocabulary
database: From bioinformatics to lexomics. Evolutionary Bioinformatics 4. 271–283.
Haspelmath, Martin, Matthew S. Dryer, David Gil, and Bernard Comrie. 2011. WALS online.
Munich: Max Planck Digital Library. http://wals.info.
Hauer, Bradley & Grzegorz Kondrak. 2011. Clustering semantically equivalent words into
cognate sets in multilingual lists. In Proceedings of 5th International Joint Conference on
Natural Language Processing, pages 865–873. Asian Federation of Natural Language
Processing. http://www.aclweb.org/anthology/I11-1097 (accessed 27 November 2014).
Holman, Eric W., Søren Wichmann, Cecil H. Brown, Viveka Velupillai, André Müller & Dik
Bakker. 2008. Advances in automated language classification. In Antti Arppe, Kaius
Sinnemäki & Urpu Nikanne (eds.), Quantitative investigations in theoretical linguistics,
40–43. Helsinki: University of Helsinki.
Huff, Paul & Deryle Lonsdale. 2011. Positing language relationships using ALINE. Language
Dynamics and Change 1(1). 128–162.
Huffman, Stephen M. 1998. The genetic classification of languages by n-gram analysis: A
computational technique. Washington (DC): Georgetown University dissertation.
Inkpen, Diana, Oana Frunza & Grzegorz Kondrak. 2005. Automatic identification of cognates
and false friends in French and English. In Proceedings of the International Conference
Recent Advances in Natural Language Processing, 251–257.
Jäger, Gerhard. 2013. Phylogenetic inference from word lists using weighted alignment with
empirically determined weights. Language Dynamics and Change 3(2). 245–291.
Kondrak, Grzegorz. 2000. A new algorithm for the alignment of phonetic sequences. In
Proceedings of the First Meeting of the North American Chapter of the Association for
Computational Linguistics, 288–295.
Kondrak, Grzegorz. 2002a. Algorithms for language reconstruction. Toronto: University of
Toronto dissertation.
Kondrak, Grzegorz. 2002b. Determining recurrent sound correspondences by inducing
translation models. In Proceedings of the 19th international conference on Computational
linguistics, Volume 1. Association for Computational Linguistics.
http://www.aclweb.org/anthology/C02-1016?CFID=458963627&CFTOKEN=39778386
(accessed 26 November 2014).
Kondrak, Grzegorz & Tarek Sherif. 2006. Evaluation of several phonetic similarity algorithms
on the task of cognate identification. In Proceedings of ACL Workshop on Linguistic
Distances, 43–50. Association for Computational Linguistics.
Kroeber, Alfred L. & C. Douglas Chrétien. 1937. Quantitative classification of Indo-European
languages. Language 13(2). 83–103.
Levenshtein, Vladimir I. 1966. Binary codes capable of correcting deletions, insertions and
reversals. Soviet Physics - Doklady 10(8). 707–710.
Lewis, Paul M. (ed.). 2009. Ethnologue: Languages of the world, 16th edn. Dallas (TX): SIL
International.
Lin, Dekang. 1998. An information-theoretic definition of similarity. In Proceedings of the 15th
International Conference on Machine Learning, Volume 1, 296–304.
List, Johann-Mattis. 2012. LexStat: Automatic detection of cognates in multilingual wordlists.
In Proceedings of the EACL 2012 Joint Workshop of LINGVIS & UNCLH, 117–125. Association
for Computational Linguistics. http://www.aclweb.org/anthology/W12-0216 (accessed 27
November 2014).
Melamed, Dan I. 1999. Bitext maps and alignment via pattern recognition. Computational
Linguistics 25(1). 107–130.
Nichols, Johanna. 1996. The comparative method as heuristic. In Mark Durie & Malcom Ross
(eds), The comparative method revisited: Regularity and Irregularity in Language Change,
39–71. New York: Oxford University Press.
Nordhoff, Sebastian & Harald Hammarström. 2011. Glottolog/Langdoc: Defining dialects,
languages, and language families as collections of resources. In Proceedings of the First
International Workshop on Linked Science, Volume 783. http://ceur-ws.org/Vol-
783/paper7.pdf (accessed 27 November 2014).
Petroni, Filippo & Maurizio Serva. 2010. Measures of lexical distance between languages.
Physica A: Statistical Mechanics and its Applications 389(11). 2280–2283.
Pompei, Simone, Vittorio Loreto, & Francesca Tria. 2011. On the accuracy of language trees.
PloS ONE 6(6). e20109.
Rama, Taraka & Anil K. Singh. 2009. From bag of languages to family trees from noisy corpus.
In Proceedings of the International Conference RANLP-2009, 355–359. Association for
Computational Linguistics. http://www.aclweb.org/anthology/R09-1064 (accessed 27
November 2014).
Robinson, David F. & Leslie R. Foulds. 1981. Comparison of phylogenetic trees. Mathematical
Biosciences 53(1-2). 131–147.
Saitou, Naruya & Masatoshi Nei. 1987. The neighbor-joining method: A new method for
reconstructing phylogenetic trees. Molecular Biology and Evolution 4(4). 406–425.
Sankoff, David. 1969. Historical linguistics as stochastic process. Montreal: McGill University
dissertation.
Sheskin, David J. 2003. Handbook of parametric and nonparametric statistical procedures. Boca
Raton (FL): Chapman & Hall/CRC Press.
Singh, Anil K. 2006. Study of some distance measures for language and encoding
identification. In Proceedings of Workshop on Linguistic Distances, 63–72. Association for
Computational Linguistics. http://www.aclweb.org/anthology/W06-1109 (accessed 27
November 2014).
Singh, Anil K. & Harshit Surana. 2007. Can corpus based measures be used for comparative
study of languages? In Proceedings of Ninth Meeting of the ACL Special Interest Group in
Computational Morphology and Phonology, 40–47. Association for Computational
Linguistics. http://aclweb.org/anthology/W07-1306 (accessed 27 November 2014).
Sokal, Robert R. & Charles D. Michener. 1958. A statistical method for evaluating systematic
relationships. University of Kansas Science Bulletin 38. 1409–1438.
Swadesh, Morris. 1950. Salish internal relationships. International Journal of American
Linguistics 16(4). 157–167.
Swadesh, Morris. 1952. Lexico-statistic dating of prehistoric ethnic contacts with special
reference to North American Indians and Eskimos. Proceedings of the American
Philosophical Society 96(4). 452–463.
Swadesh, Morris. 1955. Towards greater accuracy in lexicostatistic dating. International
Journal of American Linguistics 21(2). 121–137.
Tate, Robert F. 1954. Correlation between a discrete and a continuous variable. Point-biserial
correlation. The Annals of Mathematical Statistics 25(3). 603–607.
Wichmann, Søren & Taraka Rama. 2014. Jackknifing the black sheep: ASJP classification
performance and Austronesian. Submitted to the proceedings of the symposium "Let's talk about trees", National Museum of Ethnology, Osaka, February 9-10, 2013.
Wichmann, Søren, Eric W. Holman, Dik Bakker & Cecil H. Brown. 2010a. Evaluating linguistic
distance measures. Physica A: Statistical Mechanics and its Applications 389(17). 3632–
3639.
Wichmann, Søren, André Müller, Viveka Velupillai, Cecil H. Brown, Eric W. Holman, Pamela
Brown, Matthias Urban, Sebastian Sauppe, Oleg Belyaev, Zarina Molochieva, Annkathrin
Wett, Dik Bakker, Johann-Mattis List, Dmitry Egorov, Robert Mailhammer, David Beck &
Helen Geyer. 2010b. The ASJP database (version 12).
Wichmann, Søren, André Müller, Viveka Velupillai, Annkathrin Wett, Cecil H. Brown, Zarina
Molochieva, Sebastian Sauppe, Eric W. Holman, Pamela Brown, Julia Bishoffberger, Dik
Bakker, Johann-Mattis List, Dmitry Egorov, Oleg Belyaev, Matthias Urban, Robert
Mailhammer, Helen Geyer, David Beck, Evgenia Korovina, Pattie Epps, Pilar Valenzuela,
Anthony Grant, & Harald Hammarström. 2011. The ASJP database (version 14).
Wieling, Martijn, Jelena Prokić & John Nerbonne. 2009. Evaluating the pairwise string
alignment of pronunciations. In Proceedings of the EACL 2009 Workshop on Language
Technology and Resources for Cultural Heritage, Social Sciences, Humanities, and
Education, 26–34. Association for Computational Linguistics.
Appendix
Table 5: Distances for families and measures above average

Family JCDD JCD TRIGRAMD DICED IDENTD PREFIXD LDND LCSD LDN
Bos 15.0643 14.436 7.5983 10.9145 14.4357 10.391 8.6767 8.2226 4.8419
NDe 19.8309 19.2611 8.0567 13.1777 9.5648 9.6538 10.1522 9.364 5.2419
NC 1.7703 1.6102 0.6324 1.1998 0.5368 1.0685 1.3978 1.3064 0.5132
Pan 24.7828 22.4921 18.5575 17.2441 12.2144 13.7351 12.7579 11.4257 6.8728
Hok 10.2645 9.826 3.6634 7.3298 4.0392 3.6563 4.84 4.6638 2.7096
Chi 4.165 4.0759 0.9642 2.8152 1.6258 2.8052 2.7234 2.5116 1.7753
Tup 15.492 14.4571 9.2908 10.4479 6.6263 8.0475 8.569 7.8533 4.4553
WP 8.1028 7.6086 6.9894 5.5301 7.0905 4.0984 4.2265 3.9029 2.4883
AuA 7.3013 6.7514 3.0446 4.5166 3.4781 4.1228 4.7953 4.3497 2.648
An 7.667 7.2367 4.7296 5.3313 2.5288 4.3066 4.6268 4.3107 2.4143
Que 62.227 53.7259 33.479 29.7032 27.1896 25.9791 23.7586 21.7254 10.8472
Kho 6.4615 6.7371 3.3425 4.4202 4.0611 3.96 3.8014 3.3776 2.1531
Dra 18.5943 17.2609 11.6611 12.4115 7.3739 10.2461 9.8216 8.595 4.8771
Aus 2.8967 3.7314 1.5668 2.0659 0.7709 1.8204 1.635 1.5775 1.4495
Tuc 25.9289 24.232 14.0369 16.8078 11.6435 12.5345 12.0163 11.0698 5.8166
Ura 6.5405 6.1048 0.2392 1.6473 -0.0108 3.4905 3.5156 3.1847 2.1715
Arw 6.1898 6.0316 4.0542 4.4878 1.7509 2.9965 3.5505 3.3439 2.1828
May 40.1516 37.7678 17.3924 22.8213 17.5961 14.4431 15.37 13.4738 7.6795
LP 7.5669 7.6686 3.0591 5.3684 5.108 4.8677 4.3565 4.2503 2.8572
OM 4.635 4.5088 2.8218 3.3448 2.437 2.6701 2.7328 2.4757 1.3643
Car 15.4411 14.6063 9.7376 10.6387 5.1435 7.7896 9.1164 8.2592 5.0205
TNG 1.073 1.216 0.4854 0.8259 0.5177 0.8292 0.8225 0.8258 0.4629
MZ 43.3479 40.0136 37.9344 30.3553 36.874 20.4933 18.2746 16.0774 9.661
Bor 9.6352 9.5691 5.011 6.5316 4.1559 6.5507 6.3216 5.9014 3.8474
Pen 5.4103 5.252 3.6884 3.8325 2.3022 3.2193 3.1645 2.8137 1.5862
MGe 4.2719 4.0058 1.0069 2.5482 1.6691 2.0545 2.4147 2.3168 1.1219
ST 4.1094 3.8635 0.9103 2.7825 2.173 2.7807 2.8974 2.7502 1.3482
Tor 3.2466 3.1546 2.2187 2.3101 1.7462 2.1128 2.0321 1.9072 1.0739
TK 15.0085 13.4365 5.331 7.7664 7.5326 8.1249 7.6679 6.9855 2.8723
IE 7.3831 6.7064 1.6767 2.8031 1.6917 4.1028 4.0256 3.6679 1.4322
Alg 6.8582 6.737 4.5117 5.2475 1.2071 4.5916 5.2534 4.5017 2.775
NS 2.4402 2.3163 1.1485 1.6505 1.1456 1.321 1.3681 1.3392 0.6085
Sko 6.7676 6.3721 2.5992 4.6468 4.7931 5.182 4.7014 4.5975 2.5371
AA 1.8054 1.6807 0.7924 1.2557 0.4923 1.37 1.3757 1.3883 0.6411
LSR 4.0791 4.3844 2.2048 2.641 1.5778 2.1808 2.1713 2.0826 1.6308
Mar 10.9265 10.0795 8.5836 7.1801 6.4301 5.0488 4.7739 4.5115 2.8612
Alt 18.929 17.9969 6.182 9.1747 7.2628 9.4017 8.8272 7.9513 4.1239
Sep 6.875 6.5934 2.8591 4.5782 4.6793 4.3683 4.1124 3.8471 2.0261
Hui 21.0961 19.8025 18.4869 14.7131 16.1439 12.4005 10.2317 9.2171 4.9648
NDa 7.6449 7.3732 3.2895 4.8035 2.7922 5.7799 5.1604 4.8233 2.3671
Sio 13.8571 12.8415 4.2685 9.444 7.3326 7.8548 7.9906 7.1145 4.0156
Kad 42.0614 40.0526 27.8429 25.6201 21.678 17.0677 17.5982 15.9751 9.426
MUM 7.9936 7.8812 6.1084 4.7539 4.7774 3.8622 3.4663 3.4324 2.1726
WF 22.211 20.5567 27.2757 15.8329 22.4019 12.516 11.2823 10.4454 5.665
Sal 13.1512 12.2212 11.3222 9.7777 5.2612 7.4423 7.5338 6.7944 3.4597
Kiw 43.2272 39.5467 46.018 30.1911 46.9148 20.2353 18.8007 17.3091 10.3285
UA 21.6334 19.6366 10.4644 11.6944 4.363 9.6858 9.4791 8.9058 4.9122
Tot 60.4364 51.2138 39.4131 33.0995 26.7875 23.5405 22.6512 21.3586 11.7915
HM 8.782 8.5212 1.6133 4.9056 4.0467 5.7944 5.3761 4.9898 2.8084
EA 27.1726 25.2088 24.2372 18.8923 14.1948 14.2023 13.7316 12.1348 6.8154

Table 6: GE for families and measures above average.

Family LDND LCSD LDN LCS PREFIXD PREFIX JCDD DICED DICE JCD
WF
Tor 0.7638 0.734 0.7148 0.7177 0.7795 0.7458 0.7233 0.7193 0.7126 0.7216
Chi 0.7538 0.7387 0.7748 0.7508 0.6396 0.7057 0.7057 0.7057 0.7057 0.7477
HM 0.6131 0.6207 0.5799 0.5505 0.5359 0.5186 0.4576 0.429 0.4617 0.4384
Hok 0.5608 0.5763 0.5622 0.5378 0.5181 0.4922 0.5871 0.5712 0.5744 0.5782
Tot 1 1 1 1 0.9848 0.9899 0.9848 0.9899 0.9949 0.9848
Aus 0.4239 0.4003 0.4595 0.4619 0.4125 0.4668 0.4356 0.4232 0.398 0.4125
WP 0.7204 0.7274 0.7463 0.7467 0.6492 0.6643 0.6902 0.6946 0.7091 0.697
MUM 0.7003 0.6158 0.7493 0.7057 0.7302 0.6975 0.5477 0.5777 0.6594 0.6213
Sko 0.7708 0.816 0.7396 0.809 0.7847 0.7882 0.6632 0.6944 0.6458 0.6181
ST 0.6223 0.6274 0.6042 0.5991 0.5945 0.5789 0.5214 0.5213 0.5283 0.5114
Sio 0.8549 0.8221 0.81 0.7772 0.8359 0.8256 0.772 0.7599 0.7444 0.7668
Pan 0.3083 0.3167 0.2722 0.2639 0.275 0.2444 0.2361 0.2694 0.2611 0.2306
AuA 0.5625 0.5338 0.5875 0.548 0.476 0.4933 0.5311 0.5198 0.5054 0.5299
Mar 0.9553 0.9479 0.9337 0.9017 0.9256 0.9385 0.924 0.918 0.9024 0.9106
Kad
May 0.7883 0.7895 0.7813 0.7859 0.7402 0.7245 0.8131 0.8039 0.7988 0.8121
NC 0.4193 0.4048 0.3856 0.3964 0.2929 0.2529 0.3612 0.3639 0.2875 0.2755
Kiw
Hui 0.9435 0.9464 0.9435 0.9464 0.9464 0.9435 0.8958 0.9107 0.9137 0.8988
LSR 0.7984 0.7447 0.7234 0.6596 0.7144 0.692 0.7626 0.748 0.6484 0.6775
TK 0.7757 0.7698 0.7194 0.7158 0.7782 0.7239 0.6987 0.6991 0.6537 0.6705
LP 0.6878 0.6893 0.7237 0.7252 0.6746 0.7065 0.627 0.6594 0.6513 0.6235
Que 0.737 0.7319 0.758 0.7523 0.742 0.7535 0.7334 0.7335 0.7502 0.7347
NS 0.5264 0.4642 0.4859 0.4532 0.4365 0.3673 0.5216 0.5235 0.4882 0.4968
AA 0.6272 0.6053 0.517 0.459 0.6134 0.5254 0.5257 0.5175 0.4026 0.5162
Ura 0.598 0.5943 0.6763 0.6763 0.5392 0.6495 0.7155 0.479 0.6843 0.7003
MGe 0.6566 0.6659 0.6944 0.716 0.6011 0.662 0.7245 0.7099 0.7508 0.6983
Car 0.325 0.3092 0.3205 0.3108 0.2697 0.2677 0.313 0.3118 0.2952 0.316
Bor 0.7891 0.8027 0.7823 0.7914 0.7755 0.7619 0.7846 0.8005 0.7914 0.7823
Bos
EA 0.844 0.8532 0.8349 0.8349 0.8716 0.8899 0.8716 0.8716 0.8899 0.8899
TNG 0.6684 0.6692 0.6433 0.6403 0.643 0.6177 0.5977 0.5946 0.5925 0.5972
Dra 0.6431 0.6175 0.6434 0.6288 0.6786 0.6688 0.6181 0.6351 0.655 0.6112
IE 0.7391 0.7199 0.7135 0.6915 0.737 0.7295 0.5619 0.5823 0.6255 0.5248
OM 0.9863 0.989 0.9755 0.9725 0.9527 0.9513 0.9459 0.9472 0.9403 0.9406
Tuc 0.6335 0.623 0.6187 0.6089 0.6189 0.6153 0.5937 0.5983 0.5917 0.5919
Arw 0.5079 0.4825 0.4876 0.4749 0.4475 0.4472 0.4739 0.4773 0.4565 0.4727
NDa 0.9458 0.9578 0.9415 0.9407 0.9094 0.9121 0.8071 0.8246 0.8304 0.8009
Alg 0.5301 0.5246 0.5543 0.5641 0.4883 0.5147 0.4677 0.4762 0.5169 0.5106
Sep 0.8958 0.8731 0.9366 0.9388 0.8852 0.9048 0.8535 0.8724 0.892 0.8701
NDe 0.7252 0.7086 0.7131 0.7017 0.7002 0.6828 0.6654 0.6737 0.6715 0.6639
Pen 0.8011 0.7851 0.8402 0.831 0.8092 0.8092 0.7115 0.7218 0.7667 0.7437
An 0.2692 0.2754 0.214 0.1953 0.2373 0.1764 0.207 0.2106 0.1469 0.2036
Tup 0.9113 0.9118 0.9116 0.9114 0.8884 0.8921 0.9129 0.9127 0.9123 0.9119
Kho 0.8558 0.8502 0.8071 0.7903 0.8801 0.8333 0.8052 0.8146 0.736 0.7378
Alt 0.8384 0.8366 0.85 0.8473 0.8354 0.8484 0.8183 0.8255 0.8308 0.8164
UA 0.8018 0.818 0.7865 0.8002 0.7816 0.7691 0.8292 0.8223 0.8119 0.8197
Sal 0.8788 0.8664 0.8628 0.8336 0.8793 0.8708 0.7941 0.798 0.7865 0.7843
MZ 0.7548 0.7692 0.7476 0.7524 0.7356 0.7212 0.6707 0.6779 0.6731 0.6683
Table 7: RW for families and measures above average.

Family LDND LCSD LDN LCS PREFIXD PREFIX DICED DICE JCD JCDD TRIGRAMD
NDe 0.5761 0.5963 0.5556 0.5804 0.5006 0.4749 0.4417 0.4372 0.4089 0.412 0.2841
Bos
NC 0.4569 0.4437 0.4545 0.4398 0.3384 0.3349 0.3833 0.3893 0.3538 0.3485 0.2925
Hok 0.8054 0.8047 0.8048 0.8124 0.6834 0.6715 0.7987 0.8032 0.7629 0.7592 0.5457
Pan
Chi 0.5735 0.5775 0.555 0.5464 0.5659 0.5395 0.5616 0.5253 0.5593 0.5551 0.4752
Tup 0.7486 0.7462 0.7698 0.7608 0.6951 0.705 0.7381 0.7386 0.7136 0.7125 0.6818
WP 0.6317 0.6263 0.642 0.6291 0.5583 0.5543 0.5536 0.5535 0.5199 0.5198 0.5076
AuA 0.6385 0.6413 0.5763 0.5759 0.6056 0.538 0.5816 0.5176 0.5734 0.5732 0.5147
Que
An 0.1799 0.1869 0.1198 0.1003 0.1643 0.0996 0.1432 0.0842 0.1423 0.1492 0.1094
Kho 0.7333 0.7335 0.732 0.7327 0.6826 0.6821 0.6138 0.6176 0.5858 0.582 0.4757
Dra 0.5548 0.5448 0.589 0.5831 0.5699 0.6006 0.5585 0.589 0.5462 0.5457 0.5206
Aus 0.2971 0.2718 0.3092 0.3023 0.2926 0.3063 0.2867 0.257 0.2618 0.2672 0.2487
Tuc
Ura 0.4442 0.4356 0.6275 0.6184 0.4116 0.6104 0.2806 0.539 0.399 0.3951 0.1021
Arw
May
LP 0.41 0.4279 0.4492 0.4748 0.3864 0.4184 0.3323 0.336 0.3157 0.3093 0.1848
OM 0.8095 0.817 0.7996 0.7988 0.7857 0.7852 0.7261 0.7282 0.6941 0.6921 0.6033
Car
MZ
TNG 0.5264 0.5325 0.4633 0.4518 0.5 0.472 0.469 0.4579 0.4434 0.4493 0.3295
Bor
Pen 0.8747 0.8609 0.8662 0.8466 0.8549 0.8505 0.8531 0.8536 0.8321 0.8308 0.7625
MGe 0.6833 0.6976 0.6886 0.6874 0.6086 0.6346 0.6187 0.6449 0.6054 0.6052 0.4518
ST 0.5647 0.5596 0.5435 0.5261 0.5558 0.5412 0.4896 0.4878 0.4788 0.478 0.3116
IE 0.6996 0.6961 0.6462 0.6392 0.6917 0.6363 0.557 0.5294 0.5259 0.5285 0.4541
TK 0.588 0.58 0.5004 0.4959 0.5777 0.4948 0.5366 0.4302 0.5341 0.535 0.4942
Tor 0.4688 0.4699 0.4818 0.483 0.4515 0.4602 0.4071 0.4127 0.375 0.3704 0.3153
Alg 0.3663 0.3459 0.4193 0.4385 0.3456 0.3715 0.2965 0.3328 0.291 0.2626 0.1986
NS 0.6118 0.6072 0.5728 0.5803 0.5587 0.5118 0.578 0.5434 0.5466 0.5429 0.4565
Sko 0.8107 0.8075 0.806 0.7999 0.7842 0.7825 0.6798 0.6766 0.6641 0.6664 0.5636
AA 0.6136 0.6001 0.4681 0.431 0.6031 0.4584 0.5148 0.3291 0.4993 0.4986 0.4123
LSR 0.5995 0.5911 0.6179 0.6153 0.5695 0.5749 0.5763 0.5939 0.5653 0.5529 0.5049
Mar 0.654 0.6306 0.6741 0.6547 0.6192 0.6278 0.568 0.5773 0.5433 0.5366 0.4847
Alt 0.8719 0.8644 0.8632 0.8546 0.8634 0.8533 0.7745 0.7608 0.75 0.7503 0.6492
Hui 0.6821 0.68 0.6832 0.6775 0.6519 0.6593 0.5955 0.597 0.5741 0.5726 0.538
Sep 0.6613 0.656 0.6662 0.6603 0.6587 0.6615 0.6241 0.6252 0.6085 0.6079 0.5769
NDa 0.6342 0.6463 0.6215 0.6151 0.6077 0.5937 0.501 0.5067 0.4884 0.4929 0.4312
Sio
Kad
WF
MUM
Sal 0.6637 0.642 0.6681 0.6463 0.6364 0.6425 0.5423 0.5467 0.5067 0.5031 0.4637
Kiw
UA 0.9358 0.9332 0.9296 0.9261 0.9211 0.9135 0.9178 0.9148 0.8951 0.8945 0.8831
Tot
EA 0.6771 0.6605 0.6639 0.6504 0.6211 0.6037 0.5829 0.5899 0.5317 0.5264 0.4566
HM
Vassiliki Rentoumi, Anastasia Krithara, Nikos Tzanos
Predicting Sales Trends
Can sentiment analysis on social media help?

1 Introduction
Over the last few years, social media have gained a lot of attention, with many users sharing their opinions and experiences on them. As a result, there is an aggregation of personal wisdom and different viewpoints. If all this information is extracted and analyzed properly, the data on social media can lead to useful predictions of several human-related events. Such predictions have great benefits in many areas, such as finance and product marketing (Yu & Kak 2012). The latter has attracted the attention of researchers in the social network analysis field, and several approaches have been proposed.
Although a lot of research has been conducted on predicting the outcomes of events related to finance, the stock market and politics, so far there is no analogous research focusing on the prediction of product sales. Most of the current approaches analyse the sentiment of tweets in order to predict different events. The limitation of current approaches is that they use sentiment metrics in a strictly quantitative way, taking into account the raw number of favourites or the fraction of likes over dislikes, which are not always accurate or representative of people's sentiment. In other words, the current approaches do not take into account the sentiment trends or the longitudinal sentiment fluctuation expressed by people over time about a product, which could help in estimating a future trend.
To this end, we propose a computational approach, which conducts predic-
tions on the sales trends of products based on the public sentiment (i.e. posi-
tive/negative/neutral stance) expressed via Twitter. The sentiment expressed in a tweet is determined on a per-tweet basis from its context, by taking into account the relations among its words. The sentiment feature used
for making predictions on sales trends is not considered as an isolated parameter
but is used in correlation and in interaction with other features extracted from
sequential historical data.
The rest of the chapter is organized as follows: In the next section (2) the
state-of-the-art in the area of prediction of various human related events exploit-
ing sentiment analysis is discussed. Then, in section 3, the proposed approach is
presented. The method is evaluated in section 4, where the conducted experiments and the obtained results are presented. The chapter is concluded in section 5.

2 State of the art


Stock market prediction has attracted much attention both from academia and
business. Research in this direction was initiated by Eppen & Fama (1969), Fama (1991) and Cootner (1964). Their approaches were based on random walk theory
and the Efficient Market Hypothesis (EMH) (Fama 1965). According to the EMH
stock market prices are largely driven by new information, i.e. news, rather than
present and past prices. Since news is unpredictable, stock market prices will fol-
low a random walk pattern and cannot be predicted with more than 50 percent
accuracy (Qian & Rasheed 2007, Bollen et al. 2011).
According to Bollen et al. (2011), there are two main problems with EMH.
First, there are numerous studies that show that stock market prices do not follow
a random walk and can indeed to some degree be predicted (Qian & Rasheed
2007, Gallagher & Taylor 2002, Kavussanos & Dockery 2001). The latter puts into
question the basic assumptions of EMH. Second, recent research has shown that
even though news is unpredictable, indicators can be extracted from online so-
cial media (like blogs, Twitter and feeds) in order to predict changes in various
economic and commercial indicators. For example, Gruhl et al. (2005) show that
online chat activity can help to predict book sales. In the same vein, in Mishne &
Glance (2006) the authors perform sentiment analysis on blogs to predict
movie sales. Most recently Asur & Huberman (2010) demonstrate how public sen-
timent related to movies, as expressed on Twitter, can actually predict box office
receipts.
To this end, different sentiment tracking approaches have been proposed in
the literature and significant progress has been made in extracting public mood
directly from social media content such as blog content (Gilbert & Karahalios
2010, Mishne & Glance 2006, Liu et al. 2007, Dodds & Danforth 2010). Twitter has
gained a lot of attention towards this direction. Although tweets are limited to
only 140 characters, if we aggregate the millions of tweets submitted to Twitter at
any given moment, we may have an accurate representation of public mood and
sentiment (Pak & Paroubek 2010). As a result, real-time sentiment-tracking meth-
ods have been proposed, such as Dodds & Danforth (2010) and “Pulse of Nation”.
The proposed approaches use different metrics about social media which have
been employed in prediction. The metrics used may be divided into two catego-
ries (Yu & Kak 2012): message characteristics and social network characteristics.
The message characteristics focus on the messages themselves, such as the sen-
timent and time series metrics. On the other hand, the social network character-
istics concern structural features.
The sentiment metrics are the static features of posts. With a qualitative sen-
timent analysis system, the messages could be assigned a positive, negative, or
neutral sentiment. Thus naturally the numbers of positive, negative, neutral,
non-neutral, and total posts are five elementary content predictors.
These metrics may have different prediction power at different stages (Yu &
Kak 2012). Various other studies try to calculate the relative strength
of the computed sentiment. To this end, the prediction approach adopted by
Zhang & Skiena (2009), computes various ratios between different types of senti-
ment bearing posts: they specifically use the ratio between the total number of
positive and total posts, the ratio between the total counts of negative and total
posts and the ratio between the counts of neutral and total posts. In Asur & Hu-
berman (2010), they calculate the ratio between the total number of neutral and
non-neutral posts, and the ratio between the numbers of positive and negative
posts.
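
To make these ratio metrics concrete, the minimal sketch below computes the elementary counts and the ratios just described from a list of post polarities; the function name, the input format and the feature names are illustrative assumptions rather than the exact formulations of the cited works.

```python
from collections import Counter

def sentiment_ratio_features(polarities):
    """Compute elementary count and ratio predictors from post polarities.

    `polarities` is assumed to be a list of strings in
    {"positive", "negative", "neutral"}; all feature names are illustrative.
    """
    counts = Counter(polarities)
    total = len(polarities) or 1            # avoid division by zero
    pos, neg, neu = counts["positive"], counts["negative"], counts["neutral"]
    non_neutral = pos + neg
    return {
        # elementary content predictors
        "n_positive": pos, "n_negative": neg, "n_neutral": neu,
        "n_non_neutral": non_neutral, "n_total": total,
        # ratios in the spirit of Zhang & Skiena (2009)
        "pos_over_total": pos / total,
        "neg_over_total": neg / total,
        "neu_over_total": neu / total,
        # ratios in the spirit of Asur & Huberman (2010)
        "neu_over_non_neutral": neu / (non_neutral or 1),
        "pos_over_neg": pos / (neg or 1),
    }

print(sentiment_ratio_features(["positive", "positive", "negative", "neutral"]))
```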
Further, with the use of time series metrics researchers try to investigate the
posts more dynamically, including the speed and process of the message gener-
ation. In that case, different time window sizes can be taken into account in order to calculate the generation rate of posts (e.g. hourly, daily or weekly). The intuition behind the use of the post generation rate is that a higher rate implies the engagement of more people with a topic, and thus the topic is considered more attractive. For example, Asur & Huberman (2010) have shown that the daily post generation rate before the release of a movie, used as a feature, is a good predictor of movie box-office revenue.
As mentioned before, the social network characteristics measuring structural features can be of great importance in prediction methodologies. For example, centrality, which measures the relative importance of a node within a network, the number of followers of a user, as well as retweets on Twitter, when combined with the message characteristics, can provide invaluable information about the importance and the relative strength of the computed sentiment.
Concerning the prediction methods used, different approaches have been
proposed. For example, prediction of future product sales using a Probabilistic
Latent Semantic Analysis (PLSA) model for sentiment analysis in blogs, is pro-
posed by Liu et al. (2007). A Bayesian approach has also been proposed in Liu et
al. (2007). In particular, if the prediction result is discrete, the Bayes classifier can
be applied directly; otherwise the prediction result must be discretized first (Sharda & Delen 2006). A regression model has also been used in Szabo & Huberman (2010). Regression methods analyse the relationship between the dependent variable (the prediction result) and one or more independent variables, such as the social network characteristics.
Another direction that has started to be explored is model-based prediction. A mathematical model of the object is built before prediction, which requires deep insight into the object (i.e. knowledge about social media in order to develop effective models for them). Up to now, there are few works in this direction (e.g. Romero et al. 2011, Lerman & Hogg 2010).
The currently existing approaches for the prediction of sales trends, as discussed in the previous paragraphs, are usually based on common classifiers like SVM or Naive Bayes, which both exploit a bag-of-words representation of features. These methods cannot encode the structural relations of historical data, which are very often responsible for revealing a future sales trend. Therefore, we explore an approach based on structural models which, contrary to bag-of-words classification approaches, takes advantage of the sequential structure of historical data and is thus able to exploit patterns learned from their relations, patterns which can reveal the fluctuation of sales and lead to the correct prediction of a sales trend.
The limitation of current approaches is that they use sentiment metrics in a strictly quantitative way, taking into account the raw number of favourites or the fraction of likes over dislikes, which are not always accurate or representative of people's sentiment; for example, somebody could "like" a product in general but could also make a negative comment (post) on aspects of it. Most importantly, by just calculating ratios of positive over negative posts, the current approaches do not take into account the sentiment trends or the longitudinal sentiment fluctuation expressed by people over time about a product, which could help in estimating a future trend. In our approach, on the other hand, the sentiment expressed in a tweet is determined from the context of each tweet, by taking into account the relations among its words, as described in Section 3. The sentiment feature is then integrated in our methodology for the prediction of sales trends not as an isolated parameter but as a feature in correlation and in interaction with all the remaining features and sentiments included within our historical data.
3 Methodology
The proposed approach for sales prediction, which is based on sentiment analysis, is articulated in two consecutive stages as follows:
– sentiment analysis stage
– prediction of sales' trends of products based on sentiment.

The output of the first stage (i.e. tweets annotated with positive, negative or neutral sentiment) provides the input for its successor.

3.1 Sentiment Analysis of tweets

In Sentiment Analysis, the problem of detecting the polarity orientation of sentences (i.e. tweets) can be regarded as a classification problem: each sentence can
be represented by an ordered sequence of features concerning words. These fea-
tures can be effectively exploited by a Conditional Random Fields (CRF) model
(Lafferty et al. 2001).
The motivation for using CRF in sentiment analysis is based on the principle
that the meaning a sentence can imply is tightly bound to the ordering of its constituent words. Especially in the case of the English language, which lacks rich inflection, the syntactic and semantic relations of its constituents are mainly im-
plied through word ordering. In English the change of word ordering in a sen-
tence can significantly affect its meaning. Similarly, we have observed that word ordering can affect not only the meaning but also the polarity that a sentence
conveys. This can be further illustrated through the following examples:
a) Honestly, they could not have answered those questions.
b) They could not have answered those questions honestly.

Examples (a) and (b) contain the same word information, but convey a different
polarity orientation. Example (a) implies a positive polarity since the author,
placing the adverb in the initial position is trying to support its claim, providing
excuses for the behaviour of someone. On the other hand in example (b), the ad-
verb relocation at the end of the sentence changes the meaning of the sentence
and creates a negative connotation. Here the author criticizes the non-honest be-
haviour of someone, implying a negative polarity.
CRF is particularly appropriate for capturing such fine distinctions since it is able
to capture the relations among sentence constituents (word senses) as a function
of their sentential position.
Moreover, considering structured models in sentiment analysis, Choi et al.
(2006) use CRFs to learn a sequence model in order to assign sentiments to the
sources (persons or entities) to which these sentences belong. Mao & Lebanon
(2007) propose a sequential CRF regression model to measure sentence level po-
larity for determining the sentiments flow of authors in reviews. McDonald et al.
(2007) propose a method for learning a CRF model to decide upon the polarity
orientation of a document and its sentences. In the last two approaches above the
aim is to determine the document level polarity, detecting the polarity orientation
of each constituent sentence.
In Sadamitsu et al. (2008), structured models are utilized in order to deter-
mine the sentiment of documents which are conceived as sequences of sentences.
They are based on the assumption that the polarity of the contributing sentences
affects the overall document polarity. Similarly in our approach, although sen-
tence polarity is targeted, we investigate how the polarities of the individual
words in a sentence contribute to the overall sentence level polarity.
Sadamitsu et al. (2008) proposed a method for sentiment analysis of product
reviews utilizing inter-sentence structures. This approach is based on the as-
sumption that within a positive product review there can also be negative sen-
tences. They claim that this polarity reversal occurs on a sentence basis, therefore
the sentential structures can be modelled by Hidden Conditional Random Fields
(HCRF) (Quattoni et al. 2004, Gunawardana et al. 2005). HCRF discriminative
models are trained with polarity and valence reversers (e.g. negations) for words.
Weights for these features are learned for positive and negative document struc-
tures.
Sadamitsu et al. (2008) used HMMs at the level of sentences, for determining
the sentiment of a document. This approach was soon abandoned since increasing the number of HMM states led to lower accuracy in sentiment detection.
They attribute this effect to the fact that HMM models are generative and not dis-
criminative models.
In Rentoumi et al. (2009), structured models such as Hidden Markov Models
(HMMs) are exploited in sentiment classification of headlines. The advantage of
HMMs against other machine learning approaches employed in sentiment analy-
sis is that the majority of them are based on flat bag-of-features representations
sentences, without capturing the structural nature of sub-sentential interactions.
On the contrary, HMMs being sequential models encode this structural infor-
mation, since sentence elements are represented as sequential features.
In Rentoumi et al. (2012), the authors proposed the use of CRF for computing the
polarity at the sentence level: this is motivated by the fact that CRF models exploit
more information about a sentence, in comparison to HMMs or other approaches
(i.e. bag-of-words). In particular in Rentoumi et al. (2012) the authors provided
further experimental evidence that metaphorical expressions assigned with a polarity orientation, when exploited by CRF models, have been proven valuable in
revealing the overall polarity of a sentence.
On the other hand a bag-of-words classifier, such as an SVM classifier, which
cannot exploit adequately the structure of sentences, can be proven misleading
in assigning a correct polarity on examples (a) and (b), since the bag-of-words
representation is the same for both sentences.

Fig. 1: A linear chain CRF.

CRF is an undirected graph model which specifies the joint probability for the
existence of possible sequences given an observation. We used a linear chain CRF
implementation1 with default settings. Figure 1 shows the abstract structure of
the model, where x indicate word senses and y states. A linear chain CRF is able
to model arbitrary features from the input instead of just the information con-
cerning the previous state of a current observation (as in Hidden Markov Models).
Therefore it is able to take into account contextual information concerning the constituents of a sentence. As already discussed, this is valuable when it comes to evaluating the polarity orientation of a sentence.


1 http://crfpp.sourceforge.net/
In discriminative CRF, we can directly model the conditional probability
p(Y|X) of the output Y given the input X. As p(X) is not modelled, CRF allows ex-
ploiting the structure of X without modelling the interactions between its parts,
but only those with the output. In CRF the conditional probability p(Y|X) is com-
puted as follows:
p(Y|X) = \frac{1}{Z(X)} \exp\left( \sum_{t=1}^{T} \sum_{k=1}^{K} \lambda_k f_k(y_t, y_{t-1}, x_t) \right)

Z(X) is a normalization factor that depends on X and on the parameters λ of the training process. In order to train the classification model, i.e. to label x_t (depicted in Figure 1) as indicating a positive/negative/neutral sentiment for the sentence, we exploit information from the previous (x_{t-1}) and the following (x_{t+1}) words through the feature functions f_k. A feature function f_k looks at a pair of adjacent states y_{t-1}, y_t, the whole input sequence X and the position of the current word. This way CRF exploits the structural information of a sentential sequence.
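
To make the formula concrete, the toy example below enumerates all state sequences for a three-word input and computes p(Y|X) by brute force; the two feature functions, their weights and the word/state inventories are invented purely for illustration and are not part of the system described here.

```python
import math
from itertools import product

# Toy linear-chain CRF: two states, two hand-written feature functions with
# illustrative weights; X is a three-word "tweet".  All values are made up.
STATES = ["positive", "negative"]
X = ["not", "too", "bad"]
LAMBDA = [1.2, 0.8]

def f0(y_t, y_prev, x_t):      # fires when "bad" appears in a positive sentence
    return 1.0 if (x_t == "bad" and y_t == "positive") else 0.0

def f1(y_t, y_prev, x_t):      # fires when the state equals the previous state
    return 1.0 if y_t == y_prev else 0.0

FEATURES = [f0, f1]

def score(Y):
    """Unnormalized score exp(sum_t sum_k lambda_k f_k(y_t, y_{t-1}, x_t))."""
    s = 0.0
    for t, x_t in enumerate(X):
        y_prev = Y[t - 1] if t > 0 else "START"
        s += sum(l * f(Y[t], y_prev, x_t) for l, f in zip(LAMBDA, FEATURES))
    return math.exp(s)

Z = sum(score(Y) for Y in product(STATES, repeat=len(X)))   # normalization Z(X)
best = max(product(STATES, repeat=len(X)), key=lambda Y: score(Y) / Z)
print(best, score(best) / Z)
```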
The CRF component assigns a sentence (i.e. tweet) polarity for every word
participating in a sentential sequence. This sentence level polarity corresponds
to the polarity of a sentence in which this word would most probably participate,
according to the contextual information. Then a majority voting process is used
to determine the polarity class for a specific sentence. Thus the majority voting
does not involve judging the sentence level polarity from word level polarities,
but from the sentence level polarities already given to the words with the CRF
component. In other words, the final polarity of the sentence is the polarity that
a sentence would have wherein the majority of the words would participate. For
example for a sentence of five words, CRF would assign a sentence level polarity
to each word. If there are 3 negative and 2 positive sentence level polarities given,
then the dominant polarity would be the final result, i.e. the negative.
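
A minimal sketch of this majority voting step is shown below; it assumes that the CRF has already emitted one sentence-level label per word, and the function and variable names are ours.

```python
from collections import Counter

def sentence_polarity_by_vote(word_level_labels):
    """Majority vote over the sentence-level polarities assigned to each word.

    `word_level_labels` is assumed to be a list such as
    ["negative", "negative", "positive", "negative", "positive"],
    i.e. the label the CRF gave to every word of one tweet.
    """
    votes = Counter(word_level_labels)
    label, _ = votes.most_common(1)[0]     # the dominant polarity wins
    return label

# Example from the text: 3 negative vs. 2 positive labels -> "negative"
print(sentence_polarity_by_vote(
    ["negative", "negative", "positive", "negative", "positive"]))
```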
In order to train the CRF we used tweets derived from Sanders' collection2.
This corpus consists of 2471 manually-classified tweets. Each tweet was classified
with respect to one of four different topics into one of the three classes (positive,
negative, and neutral). Every word in a sentence is represented by a feature vector
of the form:

feature_vector = (token, sentence_pol)


2 http://www.sananalytics.com/lab/twitter-sentiment/
where "token" is the word and "sentence_pol" is the sentence level polarity class
(positive/negative/neutral). Each sentence (i.e. tweet) is represented by a se-
quence of such feature vectors. Within the test set, the last feature representing
the class (pos/neg) for the sentence is absent and is assessed. The test set consists
of a corpus of four subsets of tweets corresponding to four different topics: ipad,
sony experia, samsung galaxy, kindle fire.
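
For illustration, the sketch below writes such (token, sentence_pol) sequences in the one-token-per-line, tab-separated format used by linear-chain CRF tools such as CRF++ (sequences separated by blank lines); the naive whitespace tokenisation and the file layout are our simplifying assumptions, not a description of the exact preprocessing used here.

```python
def write_crf_training_file(labelled_tweets, path):
    """Write (token, sentence_pol) sequences in a CRF++-style training file.

    `labelled_tweets` is assumed to be a list of (text, polarity) pairs, e.g.
    [("great battery life", "positive"), ...]; every token of a tweet is
    paired with the sentence-level polarity of that tweet, and tweets are
    separated by blank lines.
    """
    with open(path, "w", encoding="utf-8") as out:
        for text, polarity in labelled_tweets:
            for token in text.split():          # naive whitespace tokenisation
                out.write(f"{token}\t{polarity}\n")
            out.write("\n")                     # blank line ends the sequence

write_crf_training_file(
    [("great battery life", "positive"), ("screen keeps crashing", "negative")],
    "sentiment_train.crf",
)
```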

3.2 Prediction Methodology

In the current section, we are describing the methodology adopted for the predic-
tion of the fluctuation of sales, and in particular, of electronic devices. The pro-
posed methodology is based on structural machine learning and on sentiment
analysis of historical data from Twitter. The rationale behind the adopted meth-
odology for prediction lies in the assumption that a prediction for the sales fluc-
tuation of an electronic device (e.g. for ipad) can be based on the sales fluctuation
of similar gadgets (e.g. Samsung galaxy, kindle fire etc.). As it has been already
argued in Section 2 the fluctuation of sales for a product is highly dependent on
social media, and particularly it is argued that there exists a strong correlation
between the spikes in sale rank and the number of related blog posts. Even
though there is strong correlation between blog posts and sales' trends, only few
researchers work on prediction of sales based on social media. We rely on historical data from Twitter and we exploit sentiment as a strong prediction indicator. The latter is based on the fact that on Twitter, organization or product brands are referred to in about 19% of all the tweets, and the majority of them com-
prise strongly expressed negative or positive sentiments.
In the proposed methodology, the prediction about a product's sales is based
on the sentiment (positive/negative/neutral) expressed in tweets about similar
products, whose sales trends (increase/decrease) is already known. We found
there is a correlation among similar products, largely because they share the
same market competition. Therefore using historical data that can adequately
simulate the sales trends for a number of similar products can presumably reveal
a generic pattern which, consequently, can lead to accurate prediction. There-
fore, we are introducing a classification function which relates tweets' sentiment
and historical information (i.e. tweets' dates) concerning specific electronic de-
vices with their sales trends. Doing so, we intend to test the extent to which these
features could be seen as a valuable indicator in the correct prediction of sales
trends (i.e. increase/decrease).
In our case, the sales trends prediction problem can be seen as a discrimina-
tive machine learning classification problem which is dealt with the assistance of
CRF. Table 1 summarizes the number of tweets, accompanied by their corresponding
yearly quartiles that were used for training and testing the CRF model in order to
make predictions of sales' trend for a future (i.e. unknown) quartile for an elec-
tronic device. For this reason historical tweets' data have been collected for four
electronic devices (i.e. ipad, samsung galaxy, sony experia, kindle fire).

Table 1: Historical Twitter data for training and testing the CRF prediction approach

Device   # of historical tweets used for training/testing   Tweets' dates (from, to)   Prediction dates   Correct class
kindle fire 100 01/01/2012 31/03/2012 1st quartile 2013 increase
kindle fire 01/10/2012 31/12/2012 increase
kindle fire 100 01/10/2011 31/12/2011 4th quartile 2012 decrease
kindle fire 01/07/2012 30/09/2012 decrease
ipad 91 01/04/2010 30/06/2010 2nd quartile 2011 increase
ipad 01/01/2011 31/03/2011 increase
ipad 86 01/10/2011 31/12/2011 4th quartile 2011 decrease
ipad 01/07/2012 30/09/2012 decrease
sony experia 90 01/04/2011 30/06/2011 2nd quartile 2012 increase
sony experia 01/01/2012 31/03/2012 increase
sony experia 48 01/10/2010 31/12/2010 4th quartile 2011 decrease
sony experia 01/07/2011 30/09/2011 decrease
samsung galaxy 84 01/10/2011 31/12/2011 4th quartile 2012 increase
samsung galaxy 01/07/2012 30/09/2012 increase
samsung galaxy 100 01/10/2009 31/12/2009 4th quartile 2010 decrease
samsung galaxy 01/07/2010 30/09/2010 decrease

As Table 1 depicts, in order to train the CRF we exploited sequential historical data consisting of tweets (i.e. training set) corresponding to four yearly quartiles
for each device; two of them represent the class “increase” and two are represent-
ing the class “decrease”. The prediction class indicates the correct prediction
trend (increase/decrease) for a particular product in a future time frame as found
in the official financial journals.
In essence, this study involves predictions for four distinct devices (ipad, samsung galaxy, sony experia, kindle fire). For instance, as Table 1 shows, in order to predict the sales performance of the iPad (i.e. increase/decrease) in the second quartile of 2011, we used as a training set historical
tweets' data and sentiment values expressed on Twitter about Sony Experia, Kin-
dle Fire and Samsung Galaxy devices in specific early quartiles for which we al-
ready knew their sales fluctuation. In general in order to make predictions for
each device, CRF is trained with tweets' sequential data for the remaining three
devices. The tweets' data for each device as shown in Table 1 are taken from two
yearly quartiles for each class, decrease/increase.
As a test set we used tweet sequences for the iPad taken from the quartile immediately preceding the one for which we wish to make a prediction, that is, the first quartile of 2011, as well as from the corresponding quartile of the previous year, that is, the second quartile of 2010. The information about the specific time intervals of drop or rise of sales has been taken from prestigious financial and IT online media3. Training and test data integrated in the prediction methodology are detailed in Table 1.
More formally, the CRF component assigns a sales trend (i.e. increase/decrease) to every tweet in a tweet sequence, which is defined as a series of consecutive tweets taken from two yearly quartiles. This trend corresponds to the trend of the tweet sequence in which this tweet would most probably participate. Then a majority voting process is used to decide the prediction class for a specific tweet sequence. For example, if 3 decrease and 2 increase sequence-based sales trends are given to the tweets comprising the sequence, then the dominant trend for this sequence would be the class decrease.
More formally, within the training process every tweet in a tweet data sequence is represented by a feature vector of the form:

feature_vector = (Date, SentimentProbability, SentimentPolarity, SalesTrend)

where "𝐷𝐷𝐷𝐷𝑓𝑓𝑓𝑓𝑓𝑓𝑓𝑓𝑓𝑓𝑓𝑓" is the date of the corresponding tweet, probability is the output
“Sentiment Probablity” derived within the sentiment analysis stage (a) for that
tweet, polarity is the output sentiment value, positive, negative, neutral extracted
within the sentiment analysis stage (a) and the prediction trend is the trend (in-
crease/decrease) assigned to this tweet representing the whole tweets' sequence
sales trend; each tweets' sequence representing a class (increase or decrease) is
consisting of tweets corresponding to two yearly quartiles. Therefore, each
tweets' data sequence is represented by a sequence of such feature vectors put in
chronological order (according to their features values "𝐷𝐷𝐷𝐷𝑓𝑓𝑓𝑓𝑓𝑓𝑓𝑓𝑓𝑓𝑓𝑓"). Within the test


3 Among others: The Financial Times, Business Insider, Reuters, The Telegraph, CNET, ZDnet,
Mobile Statistics.
set, the last feature representing the trend (increase/decrease) for a tweets' se-
quence representing a device is absent and is assessed.
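
A hedged sketch of how such chronologically ordered feature-vector sequences could be assembled is given below; the field names mirror the vector above, while the data structures and function names are illustrative assumptions rather than the authors' actual code.

```python
from datetime import date

def build_sequence(tweets, sales_trend=None):
    """Turn annotated tweets into a chronologically ordered feature sequence.

    `tweets` is assumed to be a list of dicts with keys "date" (datetime.date),
    "sentiment_prob" (float) and "polarity" ("positive"/"negative"/"neutral"),
    as produced by the sentiment analysis stage.  `sales_trend` is
    "increase"/"decrease" for training sequences and None for test sequences,
    where the trend is the value to be predicted.
    """
    ordered = sorted(tweets, key=lambda t: t["date"])     # chronological order
    return [
        (t["date"].isoformat(), t["sentiment_prob"], t["polarity"], sales_trend)
        for t in ordered
    ]

example = [
    {"date": date(2012, 1, 5), "sentiment_prob": 0.82, "polarity": "positive"},
    {"date": date(2012, 2, 17), "sentiment_prob": 0.64, "polarity": "negative"},
]
print(build_sequence(example, sales_trend="increase"))
```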

4 Experimental Results
To verify our claim that the proposed prediction approach is well suited for the particular task, we compare it with a system variant that uses Naive Bayes instead of CRF within the last step of the methodology (stage b, which assigns a sales trend prediction class to tweets' sequential data).
In order to evaluate this method of classifying tweets' sequential data concerning four electronic devices, we used a 4-fold cross-validation method. As described in the previous section, data for four electronic devices were taken into consideration; the algorithm was run 4 times, each time using a different combination of 3 subsets for training and the remaining one for testing. Thus each classification fold predicts a sales trend for one of the four devices under question, using its corresponding tweets' sequential data as a test set, while the tweets' sequential data for the three remaining devices are used as a training set (Table 1). The result of the cross-validation is the average performance of the algorithm over the 4 runs. The available tweets' data sequences (data points for training and testing) are two for each product, each one representing a product's trend to increase or to decrease.
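
The leave-one-device-out evaluation loop described above can be sketched as follows; the `train_and_predict` wrapper around CRF training and prediction, and the data layout, are assumptions made for illustration only.

```python
def cross_validate(sequences_by_device, train_and_predict):
    """4-fold, leave-one-device-out cross-validation.

    `sequences_by_device` maps a device name to its labelled tweet sequences
    (each sequence assumed to carry its true "trend" label);
    `train_and_predict(train_seqs, test_seqs)` is assumed to return the
    predicted trend labels for `test_seqs`.
    """
    devices = list(sequences_by_device)
    correct = total = 0
    for held_out in devices:                       # one fold per device
        train = [s for d in devices if d != held_out
                 for s in sequences_by_device[d]]
        test = sequences_by_device[held_out]
        predictions = train_and_predict(train, test)
        for seq, predicted in zip(test, predictions):
            total += 1
            correct += (predicted == seq["trend"])
    return correct / total                         # average accuracy over folds

# Usage (assuming suitable data and a CRF wrapper):
# accuracy = cross_validate(data, crf_train_and_predict)
```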
Table 2 illustrates, for the increase/decrease prediction task, the confusion matrices and the accuracy scores for the proposed methodology in comparison with a Naive Bayes approach. Each class (increase/decrease) includes four tweets' data sequences, each representing one of the four devices. As Table 2 reports, our method exhibits a higher level of accuracy than that of the NB variant (0.50) (p < 0.05). Moreover, the results indicate that the structural information of the tweet data series exploited by the CRF is well suited for this task.
Table 2: System Evaluation using only CRF for prediction of sales' trends vs Naive Bayes: error
threshold 5%, p-value: 0.06

                 Prediction CRF          Prediction NB
                 increase  decrease      increase  decrease
increase n=4 4 4 3 1
decrease n=4 3 1 2 2
accuracy 88% 63%

In parallel, we aim to show that our method, when it integrates more linguistic information in the CRF component, is more representative of the structure of the sentence and can therefore judge the polarity orientation of a sentence more accurately than a feature representation that does not take enough information from the sentential context into account.

5 Conclusions & Discussion


This chapter presents a two-stage approach for the prediction of sales trends of products based on tweets' sequential data, taking into account the sentiment values expressed through these tweets. We provided empirical and experimental evidence of the appropriateness of the particular computational approach, which is based on structural models, compared to approaches that are based on bag-of-words representations.
In particular, we showed that since CRF exploits structural information concerning a tweet data time series, it can capture non-local dependencies between important sentiment-bearing tweets. In doing so, we verified the assumption that the representation of structural information exploited by the CRF simulates the semantic and sequential structure of the time series data and the sales trend that they represent.

References
Asur, Sitaram & Bernardo A. Huberman. 2010. Predicting the future with social media. In Pro-
ceedings of the IEEE/WIC/ACM international conference on web intelligence and intelligent
agent technology, 492–499.
Bollen, Johan, Huina Mao & Xiao-Jun Zeng. 2011. Twitter mood predicts the stock market. Jour-
nal of Computational Science 2(1). 1–8.
Choi, Yejin, Eric Breck & Claire Cardie. 2006. Joint extraction of entities and relations for opin-
ion recognition. In Proceedings of the conference on empirical methods in natural lan-
guage processing, 431–439. Association for Computational Linguistics.
Cootner, Paul H. 1964. The random character of stock market prices. Cambridge (MA): MIT
Press.
Dodds, Peter & Christopher Danforth. 2010. Measuring the happiness of large-scale written ex-
pression: Songs, blogs, and presidents. Journal of Happiness Studies 11(4). 441–456.
Eppen, Gary D. & Eugene F. Fama. 1969. Cash balance and simple dynamic portfolio problems
with proportional costs. International Economic Review 10(2). 119–133.
Fama, Eugene F. 1965. The behavior of stock-market prices. Journal of Business 38(1). 34–105.
Fama, Eugene F. 1991. Efficient capital markets: II. Journal of Finance 46(5). 1575–1617.
Gallagher, Liam A. & Mark P. Taylor. 2002. Permanent and temporary components of stock
prices: Evidence from assessing macroeconomic shocks. Southern Economic Journal
69(2). 345–362.
Gilbert, Eric & Karrie Karahalios. 2010. Widespread worry and the stock market. In Proceedings
of the fourth international AAAI conference on weblogs and social media, 58–65.
Gruhl, Daniel, Ramanthan Guha, Ravi Kumar, Jasmine Novak & Andrew Tomkins. 2005. The pre-
dictive power of online chatter. In Proceedings of the eleventh ACM SIGKDD international
conference on knowledge discovery in data mining, 78–87.
Gunawardana, Asela, Milind Mahajan, Alex Acero & John C. Platt. 2005. Hidden conditional ran-
dom fields for phone classification. In Ninth European conference on speech communica-
tion and technology, 1117–1120.
Kavussanos, Manolis & Everton Dockery. 2001. A multivariate test for stock market efficiency:
the case of ASE. Applied Financial Economics 11(5). 573–579.
Lafferty, John, Andrew McCallum & Fernardo Pereira. 2001. Conditional random fields: Proba-
bilistic models for segmenting and labeling sequence data. In Proceedings of the 18th in-
ternational conference on machine learning, 282–289.
Lerman, Kristina & Tad Hogg. 2010. Using a model of social dynamics to predict popularity of
news. In Proceedings of the 19th international conference on World Wide Web, 621–630.
Liu, Yang, Xiangji Huang, Aijun An & Xiaohui Yu. 2007. ARSA: A sentiment-aware model for predicting sales performance using blogs. In Proceedings of the 30th annual inter-
national ACM SIGIR conference on research and development in information retrieval,
607–614.
Mao, Yi & Guy Lebanon. 2007. Isotonic conditional random fields and local sentiment flow. In
Advances in Neural Information Processing Systems 19. http://papers.nips.cc/pa-
per/3152-isotonic-conditional-random-fields-and-local-sentiment-flow.pdf (accessed 1
December 2014).
McDonald, Ryan, Kerry Hannan, Tyler Neylon, Mike Wells & Jeff Reynar. 2007. Structured mod-
els for fine-to-coarse sentiment analysis. In Proceedings of the 45th annual meeting of the
Association of computational linguistics, 432–439.
Mishne, Gilad & Natalie Glance. 2006. Predicting movie sales from blogger sentiment. In AAAI
symposium on computational approaches to analysing weblogs, 155–158.
Pak, Alexander & Patrick Paroubek. 2010. Twitter as a corpus for sentiment analysis and
opinion mining. In Proceedings of the 7th conference on international language resources
and evaluation, 1320–1326.
Qian, Bo & Khaled Rasheed. 2007. Stock market prediction with multiple classifiers. Applied
Intelligence 26(1). 25–33.
Quattoni, Ariadna, Michael Collins & Trevor Darrell. 2004. Conditional random fields for object
recognition. Proceedings of Neural Information Processing Systems, 1097–1104.
Rentoumi, Vassiliki, George Giannakopoulos, Vangelis Karkaletsis & George Vouros. 2009.
Sentiment analysis of figurative language using a word sense disambiguation approach.
In International conference on recent advances in natural language processing (RANLP
2009), 370–375.
Rentoumi, Vassiliki, George A. Vouros, Vangelis Karkaletsis & Amalia Moser. 2012. Investigat-
ing Metaphorical Language in Sentiment Analysis: a Sense-to-Sentiment Perspective.
ACM Transactions on Speech and Language Processing 9(3). Article no. 6.
Romero, Daniel M., Wojciech Galuba, Sitaram Asur & Bernardo A. Huberman. 2011. Influence
and passivity in social media. In Proceedings of the 20th international conference com-
panion on World Wide Web, 113–114.
Sadamitsu, Kugatsu, Satoshi Sekine & Mikio Yamamoto. 2008. Sentiment analysis based on
probabilistic models using inter-sentence information. In Proceedings of the 6th interna-
tional conference on language resources and evaluation, 2892–2896.
Sharda, Ramesh & Dursun Delen. 2006. Predicting box-office success of motion pictures with
neural networks. Expert Systems with Applications 30(2). 243–254.
Szabo, Gabor & Bernardo A. Huberman. 2010. Predicting the popularity of online content. Com-
munications of the ACM 53(8). 80–88.
Yu, Sheng & Subhash Kak. 2012. A survey of prediction using social media. The Computing Re-
search Repository (CoRR).
Zhang, Wenbin & Steven Skiena. 2009. Improving movie gross prediction through news analy-
sis. In Proceedings of the IEEE/WIC/ACM international joint conference on web intelligence
and intelligent agent technology - volume 1, 301–304.
Andrij Rovenchak
Where Alice Meets Little Prince
Another approach to study language relationships

1 Introduction
Quantitative methods, including those developed in physics, proved to be quite
successful for studies in various domains, including biology (Ogasawara et al.
2003), social sciences (Gulden 2002), and linguistics (Fontanari and Perlovsky
2004, Ferrer i Cancho 2006, Čech et al. 2011).
According to the recently suggested model (Rovenchak and Buk 2011a),
which is based on the analogy between word frequency structure of text and
Bose-distribution in physics, a set of parameters can be obtained to describe
texts. These parameters are related to the grammar type or, being more precise,
to the analyticity level of a language (Rovenchak and Buk 2011b).
One of the newly defined parameters is an analog of the temperature in phys-
ics. One should not confuse this term with other definitions of the "temperature of
text” introduced by, e. g., Mandelbrot (1953), de Campos (1982), Kosmidis et al.
(2006), Miyazima and Yamamoto (2008), etc.
In the present work, two famous novels are analyzed, namely Alice’s Adven-
tures in Wonderland by Lewis Carroll (also known under a shortened title as Alice
in Wonderland) and The Little Prince by Antoine de Saint-Exupéry. These texts
were translated into numerous languages from different language families and
thus are seen as a proper material for contrastive studies. The calculations are
made for about forty translations of The Little Prince (LP) and some twenty trans-
lations of Alice in Wonderland (AW) with both texts available in thirteen lan-
guages.
The following are the languages for LP:
Arabic, Armenian, Azerbaijani, Bamana, Belarusian, Bulgarian, Catalan,
Czech, German, Dutch, English, Esperanto*, Spanish, Estonian, Euskara
(Basque), Farsi, French, Georgian, Greek, Hebrew, Hindi, Croatian, Hungarian,
Italian, Lojban*, Korean, Latvian, Lithuanian, Mauritian Creole, Mongolian,
Polish, Portuguese, Romanian, Russian (2 texts), Serbian, Turkish, Ukrainian, Vi-
etnamese, Chinese, Thai.
The following are the languages for AW:
Bulgarian, Cymraeg (Welsh), German, English, Esperanto*, French, Gaelic, Hawaiian, Hungarian, Italian, Lojban*, Latin, Lingua Franca Nova*, Polish, Roma-
nian, Russian (3 texts), Swedish, Swahili, Ukrainian, Chinese.
Therefore, both texts are available in the following thirteen languages:
Bulgarian, German, English, Esperanto*, French, Hungarian, Italian,
Lojban*, Polish, Romanian, Russian, Ukrainian, Chinese.
In the above lists, asterisks mark artificial languages.
For both texts, the available languages provide enough diversity to study
interlingual relations within groups and families, as well as a variety of grammar
types, from highly synthetic to highly analytic languages. Note that some ideas,
on which this contribution is based, were recently applied by the author to a dif-
ferent language material (Rovenchak 2014).

2 Method description
For a given text, the frequency list of words is compiled. In order to avoid issues
of lemmatization for inflectional languages, words are understood as orthograph-
ical words – i.e., alphanumeric sequences between spaces and/or punctuation
marks. Thus, for instance, ‘hand’ and ‘hands’ are treated as different words (dif-
ferent types).
Orthographies without “western” word separation can be studied within this
approach with some precautions. These will be discussed in Section 3 with re-
spect to texts in Chinese, Japanese, and Thai languages.
Let the types with absolute frequency equal to j occupy the jth "energy level". The number of such types is the occupation number Nj. A somewhat special role is assigned to hapax legomena (types or words occurring only once in a given text); their number is N1.
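
As an illustration, the short sketch below builds the frequency spectrum Nj (and hence N1) from a raw text, treating orthographic words as described above; the tokenisation regex is a simplifying assumption.

```python
import re
from collections import Counter

def frequency_spectrum(text):
    """Return {j: N_j}, the number of types occurring exactly j times.

    Words are taken as orthographic words: maximal runs of alphanumeric
    characters between spaces and/or punctuation (a simplifying assumption).
    """
    tokens = re.findall(r"\w+", text.lower())
    type_freqs = Counter(tokens)                 # absolute frequency per type
    spectrum = Counter(type_freqs.values())      # N_j: types with frequency j
    return dict(spectrum)

spectrum = frequency_spectrum("the cat saw the dog and the dog saw a cat")
print(spectrum)            # {3: 1, 2: 3, 1: 2}
print(spectrum.get(1, 0))  # N_1, the number of hapax legomena
```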
Evidently, there is no natural ordering of words within the level they occupy,
which is seen as an analog of indistinguishability of particles in quantum physics
and thus a quantum distribution can be applied to model the frequency behavior
of words. Since the number of types with a given frequency can be very large, it
seems that the Bose-distribution (Isihara 1971: 82; Huang 1987: 183) is relevant to
this problem. The occupation number for the jth level in the Bose-distribution
equals:
N_j = \frac{1}{z^{-1} e^{\varepsilon_j / T} - 1},    (1)
where T is temperature, z is called activity or fugacity, and εj is the energy of the
jth level (excitation spectrum).
As was shown previously (Rovenchak and Buk 2011a; 2011b), the following
power dependence of εj on the level number j leads to a proper description of the
observed data for small j:
\varepsilon_j = (j - 1)^\alpha,    (2)

where the values of α appear to be within the domain between 1 and 2. The unity
is subtracted to ensure that the lowest energy ε1 = 0. Therefore, one can easily
obtain the relation between the number of hapaxes N1 and fugacity analog z:
z = \frac{N_1}{N_1 + 1}.    (3)
The parameters T and α are defined from observable frequency data for every text
or its part. The procedure is as follows. First, the fugacity analog z is calculated
from the number of hapaxes using Eq. (3). Then, observed values of Nj with
j = 2 ÷ jmax are fitted to
N_j = \frac{1}{z^{-1} e^{(j-1)^\alpha / T} - 1}.    (4)
The upper limit jmax for fitting can be defined empirically. Due to the nature of word frequency distributions, the occupations of high levels rapidly decrease; more precisely, Nj is either 1 or 0 for sufficiently large j, as this is the number of words with distinct large absolute frequencies.
To define the value of jmax one can use some naturally occurring separators
linked to frequency distributions. In this work, the k-point is applied. The defini-
tion of the k-point is similar to that of the h-point rh in the rank-frequency distri-
bution, being the solution of equation f(rh) = rh, where f(r) is the absolute fre-
quency of a word with rank r. The k-point corresponds to the so-called cumulative
distribution (Popescu and Altmann 2006), thus it is the solution of equation
Nj = j. An extension of these definitions must be applied if the respective equa-
tions do not have integer roots.
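A minimal sketch of how the frequency spectrum and the k-point can be obtained (plain Python; the tokenization rule and the handling of a non-integer root are illustrative choices, not prescribed in the text):

```python
import re
from collections import Counter

def frequency_spectrum(text):
    """Return {j: N_j}, the number of types occurring exactly j times."""
    # Orthographic words: alphanumeric sequences between spaces/punctuation marks.
    tokens = re.findall(r"\w+", text)
    type_freq = Counter(tokens)                 # type -> absolute frequency
    return dict(Counter(type_freq.values()))    # j -> N_j

def k_point(spectrum):
    """Approximate solution of N_j = j (first j where N_j no longer exceeds j)."""
    j = 1
    while spectrum.get(j, 0) > j:
        j += 1
    return j
```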
In practice, however, some average values were defined based on different
translations of each novel. Such an approach simplifies the automation of the
calculation process and, according to preliminary verifications, does not influ-
ence the values of the fitting parameters significantly due to the rapid exponen-
tial decay of the fitting function (4) for large j.
The use of the k-point can be justified from the observations that the low-
frequency vocabulary is composed mostly of the autosemantic (full-meaning)
words (Popescu et al. 2009: 37).
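The fitting step itself can be sketched as follows, under the assumption that scipy's curve_fit is an acceptable optimizer (the starting values p0 are my guess, not the author's):

```python
import numpy as np
from scipy.optimize import curve_fit

def fit_bose_parameters(spectrum, j_max):
    """Estimate T and alpha of Eq. (4) from the observed occupation numbers N_j."""
    z = spectrum[1] / (spectrum[1] + 1.0)        # Eq. (3): fugacity analog from hapaxes

    j = np.arange(2, j_max + 1, dtype=float)
    n_obs = np.array([spectrum.get(int(v), 0) for v in j], dtype=float)

    def bose(j, T, alpha):                       # Eq. (4)
        return 1.0 / (np.exp((j - 1.0) ** alpha / T) / z - 1.0)

    (T, alpha), pcov = curve_fit(bose, j, n_obs, p0=(50.0, 1.5))
    return T, alpha, np.sqrt(np.diag(pcov))      # estimates and their uncertainties
```

The parameter τ = ln T / ln N can then be obtained directly from the fitted T and the text length N.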

Note that the parameters of the model of the frequency distribution (T, z, and α)
are not given any in-depth interpretation based on the analogy with the respective
physical quantities. The possibility of such an attribution is yet to be studied.

3 Analysis of full texts


Full texts were analyzed first. Several comments are required regarding translations
into Chinese, Japanese, and Thai. None of these languages uses a writing
system with the word separation typical of western orthographies. Therefore, the
frequency of characters instead of that of words was studied in the Chinese translations.
For the Japanese translations, special software was used to obtain word
frequencies. The Thai translation of The Little Prince is included for comparison:
in the Thai script, spaces separate phrases but not single words.
Several frequency lists are shown in Tables 1 and 2 to demonstrate how the
k-point is used to estimate jmax.

Table 1: Observed values of Nj for different translations of The Little Prince. The nearest js to
the k-points are marked in bold italic and grey-shaded.

j Chinese Lojban English French Polish Russian Ukrainian


1 420 582 1032 1578 2025 1916 2207
2 213 221 359 399 574 571 513
3 140 124 206 176 261 271 216
4 92 84 115 107 122 146 115
5 78 69 65 71 95 96 81
6 43 45 39 44 52 39 54
7 51 53 37 40 37 38 27
8 41 25 27 28 33 43 28
9 20 21 22 18 24 24 18
10 20 23 19 22 15 18 19
11 23 27 20 10 8 11 12
12 17 17 18 12 15 11 11
13 17 16 12 14 10 7 10
14 18 19 13 3 15 8 14
15 15 6 7 12 3 6 5
16 11 10 7 7 9 4 4
17 10 7 6 4 6 7 6
18 16 5 6 9 4 6 9
19 9 9 6 5 6 4 3
20 8 6 4 4 2 6 4

As Table 1 demonstrates, one can use the following value for The Little Prince:
jmax = 10, which appears to be close to the mean of the k-points for the whole set
of languages.

Table 2: Observed values of Nj for different translations of Alice in Wonderland. The nearest js
to the k-points are marked in bold italic and grey-shaded.

j Chinese Lojban English French Polish Russian Ukrainian


1 460 939 1456 2070 3141 2709 3328
2 241 294 558 615 1000 1051 1142
3 146 172 312 305 468 474 524
4 86 118 187 183 225 292 243
5 97 91 117 110 127 171 140
6 72 60 71 84 98 110 77
7 57 57 55 57 65 59 67
8 32 37 59 52 59 52 40
9 30 29 29 32 40 27 32
10 37 36 46 34 27 30 34
11 32 31 27 20 20 26 29
12 36 25 33 23 17 17 18
13 22 23 17 22 18 18 6
14 25 22 18 20 14 16 11
15 19 19 16 17 11 6 12
16 11 16 9 20 11 7 6
17 18 7 12 8 11 9 5
18 16 15 9 3 6 11 8
19 15 12 8 8 10 4 7
20 12 12 8 7 5 3 9

From Table 2 the estimate for Alice can be taken as jmax = 15 on the same grounds
as above.
After the fitting procedure is applied, a pair of T and α parameters is obtained for
each text. It was established previously (Rovenchak and Buk 2011a; 2011b)
that the parameter τ = ln T / ln N depends only weakly on text length N and
thus can be used to compare texts in different languages, which contain different
numbers of tokens. This is related to the parameter scaling discussed in Section 4.
Each text is thus represented by a point on the α–τ plane or, more precisely,
by a domain determined by the uncertainty of the fitting. Results are shown
in Figures 1 and 2.

Fig. 1: Positions of different translations of Alice in Wonderland on the α–τ plane (a), with an
enlarged view in the bottom panel (b). Language codes are explained in the Appendix. For clarity,
not all the analyzed languages are shown.

Fig. 2: Positions of different translations of The Little Prince on the α–τ plane (a), with an
enlarged view in the bottom panel (b). Language codes are explained in the Appendix. For clarity,
not all the analyzed languages are shown.

Interestingly, texts are mostly located on a wide band stretching from small-α–small-τ
to large-α–large-τ values. The lower-left corner is occupied by highly analytic
languages, namely Chinese, Vietnamese, Lojban, Hawaiian, Bamana, and Mauritian
Creole. The top-right corner is, on the contrary, occupied by highly synthetic
languages (the Slavic group, Hungarian, Latin, etc.). Intermediate positions
correspond to languages with a medium level of analyticity.

4 Parameter scaling
The main focus of this study is the dependence of the defined parameters on text
length, that is, the development of the parameters in the course of text production.
The parameters appear to be quite stable (their behavior being easily
predictable) for highly analytic languages (Chinese and Lojban, an artificial language).
On the other hand, even if a difference is observed between the versions of the
two novels in the same language, the divergence between different translations
is not significant. This means that translator strategies do not influence the parameter
values much; rather, text type (genre) is more important.
To study the scaling of the temperature parameter, the languages were
grouped into several classes depending on the coarsened mean value of the α ex-
ponent:
α = 1.15: Lojban;
α = 1.20: Chinese;
α = 1.40: English;
α = 1.45: French;
α = 1.50: Bulgarian, German, Esperanto, Italian, Romanian;
α = 1.60: Hungarian, Polish, Russian, Ukrainian.
With the value of α fixed as above, the temperature T was calculated consecutively
for the first 1000, 2000, 3000, etc. words of each text. The results for T demonstrate
a monotonically increasing dependence on the number of words N.
The simplest model for the temperature scaling was tested, namely:

\[ T = t\, N^{\beta}. \qquad (5) \]

Obviously, the new parameters, t and β, are not independent of the previ-
ously defined τ = ln T / ln N:
\[ \beta = \tau - \frac{\ln t}{\ln N}, \qquad (6) \]
therefore, β coincides with τ in the limit of an infinitely long text.
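A sketch of the scaling analysis is given below. The prefix-wise refitting of T at fixed α is delegated to a hypothetical helper fit_T (a routine like the one sketched in Section 2); fitting t and β by ordinary least squares on the log–log scale is my choice, not necessarily the author's:

```python
import numpy as np

def temperature_scaling(tokens, alpha, fit_T, step=1000, n_min=10000, n_max=40000):
    """Fit T = t * N**beta (Eq. 5) to temperatures of growing text prefixes."""
    sizes, temps = [], []
    for n in range(step, len(tokens) + 1, step):
        sizes.append(n)
        temps.append(fit_T(tokens[:n], alpha))   # T of the first n tokens, alpha fixed

    sizes, temps = np.array(sizes, dtype=float), np.array(temps, dtype=float)
    mask = (sizes >= n_min) & (sizes <= n_max)   # fitting range used in this chapter
    beta, log_t = np.polyfit(np.log(sizes[mask]), np.log(temps[mask]), 1)
    return np.exp(log_t), beta                   # t and beta of Eq. (5)
```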

Results of the fitting are demonstrated in Figure 3; the parameters of the temperature
scaling are given in Table 3. Note that the range of N for the fitting was set to
N = 10000 ÷ 40000.

Fig. 3: “Temperature” T (vertical axis) versus text length N (horizontal axis) for several
languages. Longer series of data correspond to Alice in Wonderland. The solid line is a fit for
Alice in Wonderland, the dashed line a fit for The Little Prince.

According to Eq. (6), the scaling exponent β is related to language analyticity in
the same way as the parameter τ, which is confirmed by Table 3. The abnormally large
difference between these parameters for some translations of the two texts under
consideration (e.g., Italian or Hungarian, cf. also Figures 1 and 2) requires additional
detailed analysis.

Table 3: Scaling parameters for temperature T. The results of fitting are within intervals t ± Δt
and β ± Δβ. Language codes are explained in the Appendix.

Code   Text   t      Δt     β      Δβ

blg    LP     0.19   0.06   0.87   0.04
       AW     0.20   0.04   0.88   0.02
deu    LP     1.19   0.24   0.64   0.02
       AW     0.62   0.05   0.72   0.01
eng    LP     0.49   0.07   0.73   0.02
       AW     0.57   0.05   0.72   0.01
epo    LP     0.23   0.05   0.82   0.02
       AW     0.61   0.04   0.73   0.01
fra    LP     0.85   0.18   0.68   0.02
       AW     0.53   0.05   0.73   0.01
hun    LP     0.28   0.06   0.84   0.02
       AW     0.15   0.02   0.93   0.01
ita    LP     0.52   0.12   0.74   0.03
       AW     0.21   0.02   0.89   0.01
jbo    LP     2.58   0.51   0.51   0.02
       AW     4.20   0.46   0.45   0.01
pol    LP     0.15   0.04   0.93   0.03
       AW     0.23   0.03   0.88   0.01
ron    LP     0.46   0.12   0.75   0.03
       AW     0.47   0.05   0.76   0.01
rus    LP1    0.10   0.03   0.96   0.04
       LP2    0.13   0.02   0.92   0.01
       AW1    0.17   0.02   0.92   0.01
       AW2    0.14   0.01   0.94   0.01
       AW3    0.14   0.02   0.94   0.01
ukr    LP     0.33   0.12   0.82   0.04
       AW     0.16   0.01   0.93   0.01
zho    LP     5.61   1.09   0.44   0.02
       AW     9.87   1.37   0.38   0.01

5 Text trajectories
Text trajectories are obtained for each translation. The values of the two parameters
correspond to a point, and their change as the text grows traces a trajectory on the
respective plane.
First of all, one can observe that stability of the parameter values is not
achieved at small text lengths. Generally, a number of tokens
greater than several thousand is preferable, and translations of The Little Prince
are close to this lower edge with respect to text length (cf. Rovenchak and Buk
2011b).

Fig. 4: Trajectories of Russian translations of The Little Prince (+ and × symbols) and Alice in
Wonderland (□, ■, and ✴). As in the scaling analysis, a step of 1000 tokens was used; thus, the first
point corresponds to the first 1000 tokens, the second one to the first 2000 tokens, etc. Error bars are
not shown for clarity.

Results for text trajectories are demonstrated in Figure 4 for Russian, with two
translations of The Little Prince and three translations of Alice in Wonderland.
There is a qualitative similarity in the shapes of these trajectories, while numeri-
cally different results are obtained for different texts, which, as was already men-
tioned in Section 4, suggests the influence of genre on the parameter values (even
within a single language).

6 Discussion
A method to analyze texts using an approach inspired by a statistical-mechanical
analogy is described and applied to the study of translations of two novels, Alice
in Wonderland and The Little Prince. A result obtained earlier is confirmed:
there exists a correlation between the level of language analyticity and the values
of the parameters calculated in this approach.
The behavior of the parameters in the course of text production is studied. Within
the same language, the dependence on the translator of a given text appears much
weaker than the dependence on the text genre. So far, it is not yet possible to
provide an exact attribution of a language with respect to the parameter values,
since the influence of genre has not been studied in detail.
Further prospects of the presented approach lie in analyzing texts of
different genres that have been translated into a number of languages, preferably from
different families. These could be religious texts (though generally very
specific in language style) or some popular works of fiction (e.g., Pinocchio by Carlo
Collodi, Twenty Thousand Leagues under the Sea by Jules Verne, Winnie-the-Pooh
by Alan A. Milne, The Alchemist by Paulo Coelho, etc.). With texts translated into
now-extinct languages (e.g., Latin or Old Church Slavonic), some hints about
language evolution can be obtained as well (Rovenchak 2014).

Acknowledgements
I am grateful to Michael Everson for providing electronic versions of Alice in Won-
derland translated into Cymraeg (Welsh), Gaelic, Hawaiian, Latin, and Swedish.

References
de Campos, Haroldo. 1982. The informational temperature of the text. Poetics Today 3(3). 177–
187.
Čech, Radek, Ján Mačutek & Zdeněk Žabokrtský. 2011. The role of syntax in complex networks:
Local and global importance of verbs in a syntactic dependency network. Physica A
390(20). 3614–3623.
Ferrer i Cancho, Ramon. 2006. Why do syntactic links not cross? Europhysics Letters 76(6).
1228–1234.
Fontanari, José F. & Leonid I. Perlovsky. 2004. Solvable null model for the distribution of word
frequencies. Physical Review E 70(4). 042901.
Gulden, Timothy R. 2002. Spatial and temporal patterns of civil violence in Guatemala, 1977–
1986. Politics and the Life Sciences 21(1). 26–36.
Huang, Kerson. 1987. Statistical mechanics. 2nd edn. New York: Wiley.
Isihara, Akira. 1971. Statistical physics. New York & London: Academic Press.
Kosmidis, Kosmas, Alkiviadis Kalampokis & Panos Argyrakis. 2006. Statistical mechanical ap-
proach to human language. Physica A 366. 495–502.
Mandelbrot, Benoit. 1953. An informational theory of the statistical structure of language. In
Willis Jackson (ed.), Communication theory, 486–504. London: Butterworths.
Miyazima, Sasuke & Keizo Yamamoto. 2008. Measuring the temperature of texts. Fractals
16(1). 25–32.
Ogasawara, Osamu, Shoko Kawamoto & Kousaku Okubo. 2003. Zipf's law and human tran-
scriptomes: An explanation with an evolutionary model. Comptes rendus biologies
326(10–11). 1097–1101.
Popescu, Ioan-Iovitz & Gabriel Altmann. 2006. Some aspects of word frequencies. Glottomet-
rics 13. 23–46.
Popescu, Ioan-Iovitz, Gabriel Altmann, Peter Grzybek, Bijapur D. Jayaram , Reinhard Köhler,
Viktor Krupa, Ján Mačutek, Regina Pustet, Ludmila Uhlířová & Matummal N. Vidya. 2009.
Word frequency studies (Quantitative Linguistics 64). Berlin & New York: Mouton de Gruy-
ter.
Rovenchak, Andrij. 2014. Trends in language evolution found from the frequency structure of
texts mapped against the Bose-distribution. Journal of Quantitative Linguistics 21(3). 281–
294.
Rovenchak, Andrij & Solomija Buk. 2011a. Application of a quantum ensemble model to lin-
guistic analysis. Physica A 390(7). 1326–1331.
Rovenchak, Andrij & Solomija Buk. 2011b. Defining thermodynamic parameters for texts from
word rank–frequency distributions. Journal of Physical Studies 15(1). 1005.

Appendix
Language Codes

Code Language Code Language

ara Arabic hin Hindi


arm Armenian hrv Croatian
aze Azerbaijani hun Hungarian
bam Bamana ita Italian
bel Belarusian jbo Lojban
blg Bulgarian jpn Japanese
cat Catalan kor Korean
cym Cymraeg (Welsh) lat Latin
cze Czech lfn Lingua Franca Nova
deu German lit Lithuanian
dut Dutch mcr Mauritian Creole
eng English mon Mongolian
epo Esperanto pol Polish
eps Spanish ron Romanian
est Estonian rus Russian
eus Euskara (Basque) ser Serbian
far Farsi sve Swedish
fra French swa Swahili
gai Gaelic tur Turkish
geo Georgian ukr Ukrainian
gre Greek vie Vietnamese
haw Hawaiian zho Chinese
Peter Zörnig
A Probabilistic Model for the Arc Length in
Quantitative Linguistics

1 Introduction
Texts are written or spoken in the form of linear sequences of some entities. From
the qualitative point of view, they are sequences of phonic, lexical, morphological,
semantic and syntactic units. Qualitative properties are secondary for
quantitative linguistics; hence they are simply omitted from the research. Thus
a text may be presented as a sequence (x1,…,xn) whose elements represent phonic,
lexical, grammatical, morphological, syntactic or semantic entities. Of
course, the positions of the entities depend, among other things, on the grammar, when a
sentence is constructed, or on a historical succession of events, when a story is
written, and both aspects are known to the speaker. What is unknown, and happens
without conscious application, are the effects of some background laws
which cannot be learned or simply applied. The speaker abides by them without
knowing them.
A number of such laws – i.e. derived and sufficiently positively tested state-
ments - can be discovered using the sequential form of the text. If such a se-
quence is left in its qualitative form, there appear distances between equal enti-
ties, and every text may be characterized using some function of the given dis-
tances. If the text is transformed into a quantitative sequence (e.g. in terms of
word lengths, sentence lengths, word frequencies, semantic complexities etc.),
then we have a time series which is the object of an extensive statistical disci-
pline.
Even if one does not attribute the positions to the entities, one may find
Markov chains, transition probabilities, autocorrelations, etc. If one replaces the
entities by quantities, a number of other methods become available which may
reveal hidden structures of language that would not be accessible in any other
way. The original complete text shows only deliberate facts such as meanings
and grammatical rules, but rewriting it e.g. in terms of measured morphological
complexities may reveal some background mechanisms. Their knowledge may
serve as a basis for future research.
The text in its quantitative form may change to a fractal, it may display regular
or irregular oscillation, the distances between neighbors can be measured,
etc. Here we restrict ourselves to the study of the arc length, which expresses the sum
of Euclidean distances between the measured properties of neighbors. The arc
length of a sequence (x1,…,xn) is defined by
\[ L := \sum_{i=1}^{n-1} \sqrt{(x_i - x_{i+1})^2 + 1}. \qquad (1) \]

It is illustrated geometrically in Fig. 1 for the sequence (x1,…,x6) = (4,1,2,5,2,3),
where L = √10 + √2 + √10 + √10 + √2 = 12.32.

Fig. 1: Arc length
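Definition (1) is easy to check directly on this sequence (a minimal sketch):

```python
from math import sqrt

def arc_length(x):
    """Arc length of a sequence according to Eq. (1)."""
    return sum(sqrt((a - b) ** 2 + 1) for a, b in zip(x, x[1:]))

print(round(arc_length([4, 1, 2, 5, 2, 3]), 2))   # 12.32, as in Fig. 1
```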

The arc length is an alternative to the usual statistical measures of variation: it
is frequently used in linguistics (see Popescu et al. (2009, 2010, 2011) and
Wälchli (2011, p. 7)). In the aforementioned literature, (x1,…,xn) is assumed to be
a rank-frequency sequence. In the present article the elements of the sequence
may represent any linguistic entities.

2 The probability model


We model the arc length as a discrete random variable defined by
\[ L = \sum_{i=1}^{n-1} \sqrt{(X_i - X_{i+1})^2 + 1}, \qquad (2) \]

where X1,...,Xn are independently and identically distributed random variables


assuming values from the set of integers M = {1,...,m}. The number m represents
the inventory size of the abstract text (X1,...,Xn) of length n, whose elements are
assumed to be integers that count observed linguistic entities.
We are interested in the characteristics of the distribution of L. To this end
we first introduce the following two auxiliary variables Z and B related to an
element of the sum in (2). Let X and Y be independently and identically distrib-
uted variables taking values over M, such that
P(X=i) = P(Y=i) = pi for i = 1,...,m (pi ≥ 0 for all i and p1+...+pm = 1) (3)
We consider the variables
\[ Z := |X - Y| \qquad (4) \]
and
\[ B := \sqrt{Z^2 + 1} = \sqrt{(X - Y)^2 + 1}. \qquad (5) \]
Example 1: For m = 5 the possible values of Z are illustrated in the following
table:
Y \ X   1   2   3   4   5
1 0 1 2 3 4
2 1 0 1 2 3
3 2 1 0 1 2
4 3 2 1 0 1
5 4 3 2 1 0

a) It follows immediately that Z assumes the values 0, 1,…,4 with probabilities
P(Z = 0) = p_1² + … + p_5²,
P(Z = 1) = 2 (p_1 p_2 + p_2 p_3 + p_3 p_4 + p_4 p_5),
P(Z = 2) = 2 (p_1 p_3 + p_2 p_4 + p_3 p_5),
P(Z = 3) = 2 (p_1 p_4 + p_2 p_5),
P(Z = 4) = 2 p_1 p_5.
For example, Z takes on the value 3 in the four cases X = 4 and Y = 1; X = 5 and
Y = 2; X = 1 and Y = 4; X = 2 and Y = 5, occurring with probabilities p_1 p_4, p_2 p_5,
p_1 p_4, p_2 p_5, respectively.
b) The variable B has the value √(i² + 1) if and only if Z has the value i,
since B is a monotone increasing function of Z (see (4), (5)). Therefore
P(B = √(i² + 1)) = P(Z = i) for i = 0,...,4, and from part (a) we obtain:
P(B = 1) = p_1² + … + p_5²,
P(B = √2) = 2 (p_1 p_2 + p_2 p_3 + p_3 p_4 + p_4 p_5),
P(B = √5) = 2 (p_1 p_3 + p_2 p_4 + p_3 p_5),
P(B = √10) = 2 (p_1 p_4 + p_2 p_5),
P(B = √17) = 2 p_1 p_5.


Generalizing the example yields the following result.
Theorem 1: The distributions of Z and B are given by
a) P(Z = 0) = p_1² + … + p_m²,
P(Z = i) = 2 (p_1 p_{1+i} + p_2 p_{2+i} + ... + p_{m−i} p_m) for i = 1,...,m−1;
b) P(B = 1) = p_1² + … + p_m²,
P(B = √(i² + 1)) = 2 (p_1 p_{1+i} + p_2 p_{2+i} + ... + p_{m−i} p_m) for i = 1,...,m−1.
To simplify the presentation below we introduce the following notation for the
probabilities occurring in Theorem 1:
q_0 := p_1² + … + p_m²,
q_i := 2 (p_1 p_{1+i} + p_2 p_{2+i} + ... + p_{m−i} p_m) for i = 1,...,m−1.
The next statement holds trivially.
Theorem 2: The expectations of Z and B are given by
a) \( E(Z) = \sum_{i=1}^{m-1} i\, q_i \)
b) \( E(B) = \sum_{i=0}^{m-1} \sqrt{i^2 + 1}\, q_i \)

Example 2:
(a) Assuming that the elements of M are chosen with equal probabilities,
i.e. p1=…= pm = 1/m, the preceding theorems yield in particular
\[ P(Z = 0) = q_0 = \frac{1}{m}; \qquad P(Z = i) = q_i = 2\,\frac{m - i}{m^2} \quad \text{for } i = 1, \ldots, m-1, \]
\[ P(B = 1) = \frac{1}{m}; \qquad P\bigl(B = \sqrt{i^2 + 1}\bigr) = 2\,\frac{m - i}{m^2} \quad \text{for } i = 1, \ldots, m-1, \]
\[ E(Z) = \sum_{i=1}^{m-1} i\, q_i = \sum_{i=1}^{m-1} i \cdot 2\,\frac{m - i}{m^2} = \frac{m^2 - 1}{3m}, \]
\[ E(B) = \sum_{i=0}^{m-1} \sqrt{i^2 + 1}\, q_i = \frac{1}{m} + \frac{2}{m^2} \sum_{i=1}^{m-1} \sqrt{i^2 + 1}\,(m - i). \qquad (6) \]

This expectation can be approximated by linear terms as
\[ E(B) \approx \begin{cases} 0.31\,m + 0.46 & \text{for } 4 \le m \le 20, \\ m/3 & \text{for } m > 20. \end{cases} \qquad (7) \]

It can be easily verified that the relative error satisfies
\[ \frac{E(B) - 0.31\,m - 0.46}{E(B)} \le 0.023 \]
for 4 ≤ m ≤ 20, and for m > 20 the sequence of relative errors is monotone
decreasing, starting with the value 0.021. The underlying idea for obtaining
the approximation (7) in the case m > 20 is as follows. The sum in (6)
is approximately equal to
\[ \sum_{i=1}^{m-1} i\,(m - i) = \frac{m^3 - m}{6}. \]

Thus
\[ E(B) \approx \frac{1}{m} + \frac{2}{m^2} \cdot \frac{m^3 - m}{6} = \frac{1}{m} + \frac{m}{3} - \frac{1}{3m} \approx \frac{m}{3}. \]
(b) For m = 4, p1 = 1/2 , p2 = 1/4, p3 = 1/6, p4 = 1/12 we obtain
\[ q_0 = \frac{25}{72}, \quad q_1 = \frac{13}{36}, \quad q_2 = \frac{5}{24}, \quad q_3 = \frac{1}{12}, \]
\[ E(Z) = \sum_{i=1}^{3} i\, q_i = \frac{13}{36} + 2\cdot\frac{5}{24} + 3\cdot\frac{1}{12} = \frac{37}{36} = 1.0278, \]
\[ E(B) = \sum_{i=0}^{3} \sqrt{i^2 + 1}\, q_i = \frac{25}{72} + \sqrt{2}\,\frac{13}{36} + \sqrt{5}\,\frac{5}{24} + \sqrt{10}\,\frac{1}{12} = 1.5873. \]
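These values can be reproduced directly from Theorems 1 and 2; the following sketch prints the probabilities q_i and the two expectations for the data of Example 2(b):

```python
from math import sqrt

def q_probabilities(p):
    """q_0, ..., q_{m-1} of Theorem 1 for a probability vector p = (p_1, ..., p_m)."""
    m = len(p)
    q = [sum(pi * pi for pi in p)]                                         # q_0
    q += [2 * sum(p[k] * p[k + i] for k in range(m - i)) for i in range(1, m)]
    return q

p = [1/2, 1/4, 1/6, 1/12]                                 # Example 2(b)
q = q_probabilities(p)
E_Z = sum(i * q[i] for i in range(len(q)))                # Theorem 2(a): approx. 1.0278
E_B = sum(sqrt(i * i + 1) * q[i] for i in range(len(q)))  # Theorem 2(b): approx. 1.5873
print(q, E_Z, E_B)
```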

By using the simple relation V(X) = E(X²) − (E(X))² we obtain the following
formulas for the variances:
Theorem 3:
a) \( V(Z) = \sum_{i=1}^{m-1} i^2 q_i - \left( \sum_{i=1}^{m-1} i\, q_i \right)^2, \)
b) \( V(B) = \sum_{i=0}^{m-1} (i^2 + 1)\, q_i - \left( \sum_{i=0}^{m-1} \sqrt{i^2 + 1}\, q_i \right)^2. \)

Example 3:
(a) For the special case of equal probabilities pi the theorem yields
\[ V(Z) = \sum_{i=1}^{m-1} i^2 \cdot 2\,\frac{m-i}{m^2} - \left( \frac{m^2 - 1}{3m} \right)^2 = \frac{m^2 - 1}{6} - \left( \frac{m^2 - 1}{3m} \right)^2 = \frac{m^4 + m^2 - 2}{18 m^2}, \]
\[ V(B) = \frac{1}{m} + \frac{2}{m^2} \sum_{i=1}^{m-1} (i^2 + 1)(m - i) - \left[ \frac{1}{m} + \frac{2}{m^2} \sum_{i=1}^{m-1} \sqrt{i^2 + 1}\,(m - i) \right]^2 \]
\[ \phantom{V(B)} = \frac{m^2 + 5}{6} - \left[ \frac{1}{m} + \frac{2}{m^2} \sum_{i=1}^{m-1} \sqrt{i^2 + 1}\,(m - i) \right]^2. \]

(b) For the probabilities in Example 2(b) we obtain
\[ V(Z) = q_1 + 4 q_2 + 9 q_3 - \bigl(E(Z)\bigr)^2 = \frac{13}{36} + 4\cdot\frac{5}{24} + 9\cdot\frac{1}{12} - \left(\frac{37}{36}\right)^2 = 1.3395, \]
\[ V(B) = q_0 + 2 q_1 + 5 q_2 + 10 q_3 - \bigl(E(B)\bigr)^2 = \frac{25}{72} + 2\cdot\frac{13}{36} + 5\cdot\frac{5}{24} + 10\cdot\frac{1}{12} - 1.5873^2 = 0.4249. \]
In order to obtain the variance of the arc length (2) we need the following two
theorems. Let B_i = √((X_i − X_{i+1})² + 1) denote the summands in (2) (see also (5)).
Since the X_i are assumed to be independent, two of the variables B_i are also
independent if they are not directly consecutive. This means, e.g., that B_1 and B_j are
independent for j ≥ 3, but the variables B_1 = √((X_1 − X_2)² + 1) and
B_2 = √((X_2 − X_3)² + 1) are not independent. This can be made clear as follows.
Assume for example that m = 5, i.e. the values of the X_i are chosen from
M = {1,…,5}. Then the maximal length of a line segment of the arc is
√((1 − 5)² + 1) = √17 (see Fig. 1). But if, e.g., the second element of the sequence
(x_1,...,x_n) is x_2 = 3 (the end height of the first segment), then the length of the
second segment is at most max_{i∈M} √((3 − i)² + 1) = √5. Hence the first line segment
of the arc prevents the second from attaining the maximum length, i.e. there is a
dependency between these segments.

Theorem 4: Consider the random variables B_1 and B_2 above, representing any
two consecutive summands in (2). Then the expectation of the product of B_1 and
B_2 is
\[ E(B_1 B_2) = \sum_{i=1}^{m} \sum_{j=1}^{m} \sum_{k=1}^{m} p_i p_j p_k \sqrt{(i - j)^2 + 1}\, \sqrt{(j - k)^2 + 1}. \]

The proof follows from the definition of the expectation, since any element
of the triple sum represents a product of a value of the random variable B1 B2
and its probability (where the values of B1 B2 need not be distinct).
The last formula cannot be simplified, not even for equal p_i. The triple sum
consists of m³ summands, and for not too large values of m it can be calculated
by means of suitable software.
Example 4:
For m = 2, p1 = 3/5 and p2 = 2/5 we get
\[ E(B_1 B_2) = p_1 p_1 p_1 \cdot 1 \cdot 1 + p_1 p_1 p_2 \cdot 1 \cdot \sqrt{2} + p_1 p_2 p_1 \cdot \sqrt{2} \cdot \sqrt{2} + p_1 p_2 p_2 \cdot \sqrt{2} \cdot 1 \]
\[ \quad + p_2 p_1 p_1 \cdot \sqrt{2} \cdot 1 + p_2 p_1 p_2 \cdot \sqrt{2} \cdot \sqrt{2} + p_2 p_2 p_1 \cdot 1 \cdot \sqrt{2} + p_2 p_2 p_2 \cdot 1 \cdot 1 \]
\[ = \left(\tfrac{3}{5}\right)^3 + \left(\tfrac{3}{5}\right)^2 \tfrac{2}{5} \sqrt{2} + \left(\tfrac{3}{5}\right)^2 \tfrac{2}{5} \cdot 2 + \tfrac{3}{5}\left(\tfrac{2}{5}\right)^2 \sqrt{2} + \left(\tfrac{3}{5}\right)^2 \tfrac{2}{5} \sqrt{2} + \tfrac{3}{5}\left(\tfrac{2}{5}\right)^2 \cdot 2 \]
\[ \quad + \tfrac{3}{5}\left(\tfrac{2}{5}\right)^2 \sqrt{2} + \left(\tfrac{2}{5}\right)^3 = 1.4388. \]
By using Theorem 2(b) we now obtain the following statement.
Theorem 5: The covariance of B1 and B2 is
\[ \mathrm{Cov}(B_1, B_2) = E(B_1 B_2) - E(B_1)\,E(B_2) = \sum_{i=1}^{m} \sum_{j=1}^{m} \sum_{k=1}^{m} p_i p_j p_k \sqrt{(i - j)^2 + 1}\, \sqrt{(j - k)^2 + 1} - \left( \sum_{i=0}^{m-1} \sqrt{i^2 + 1}\, q_i \right)^2. \]

Example 5: Let m = 3, p_1 = 2/10, p_2 = 3/10, p_3 = 5/10; hence q_0 = 19/50, q_1 =
21/50, q_2 = 1/5. As illustrated in the above examples we obtain E(B_1 B_2) ≈ 2.0468
and E(B_1) = E(B_2) ≈ 1.4212, thus Cov(B_1, B_2) ≈ 0.0270. Since the covariance is not
zero, we have in particular verified that B_1 and B_2 are dependent variables.
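The triple sum of Theorem 4 is straightforward to evaluate numerically; the self-contained sketch below should reproduce E(B1B2) ≈ 2.0468 and Cov(B1, B2) ≈ 0.0270 for the probabilities of Example 5 (E(B) is written here as a double sum over the values of X and Y, which is equivalent to Theorem 2(b)):

```python
from math import sqrt

def expectation_B1B2(p):
    """E(B1*B2) of Theorem 4 for p = (p_1, ..., p_m); the sum has m**3 terms."""
    m = len(p)
    return sum(p[i] * p[j] * p[k]
               * sqrt((i - j) ** 2 + 1) * sqrt((j - k) ** 2 + 1)
               for i in range(m) for j in range(m) for k in range(m))

def expectation_B(p):
    """E(B) as a double sum over the values of X and Y."""
    m = len(p)
    return sum(p[i] * p[j] * sqrt((i - j) ** 2 + 1)
               for i in range(m) for j in range(m))

p = [0.2, 0.3, 0.5]                                 # Example 5
cov = expectation_B1B2(p) - expectation_B(p) ** 2   # Theorem 5
print(expectation_B1B2(p), cov)                     # approx. 2.0468 and 0.0270
```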
We are now able to determine some characteristics of the arc length L in (2).
From the additivity of the expectation it follows:
Theorem 6: The expectation of the arc length is
\[ E(L) = (n - 1)\, E(B) = (n - 1) \sum_{i=0}^{m-1} \sqrt{i^2 + 1}\, q_i. \]

This means in particular that E(L) increases linearly with n and that the relative arc
length E(L)/(n − 1) is an increasing function of the inventory size m.
For equal probabilities p_i we obtain (see Example 2(a))
\[ E(L) = (n - 1)\, E(B) = (n - 1) \left[ \frac{1}{m} + \frac{2}{m^2} \sum_{i=1}^{m-1} \sqrt{i^2 + 1}\,(m - i) \right] \qquad (8) \]
or the approximate relation
\[ E(L) \approx \begin{cases} (n-1)\,(0.31\,m + 0.46) & \text{for } 4 \le m \le 20, \\ (n-1)\, m/3 & \text{for } m > 20. \end{cases} \qquad (9) \]
An additivity law corresponding to Theorem 6 holds for the variance only
when a sum of independent random variables is given. As we have seen above,
this is not the case for the arc length. Therefore we make use of the following
fact to determine V(L) (see e.g. Ross (2007, p. 54)).
Theorem 7: The variance of a sum of arbitrary random variables Y1+…+Yn is
given by
\[ V(Y_1 + \cdots + Y_n) = \sum_{i=1}^{n} V(Y_i) + 2 \sum_{1 \le i < j \le n} \mathrm{Cov}\bigl(Y_i, Y_j\bigr). \]

From this we obtain:


Theorem 8: The variance of the arc length is
\[ V(L) = (n - 1)\, V(B) + 2 (n - 2)\, \mathrm{Cov}(B_1, B_2), \]
where V(B) and Cov(B_1, B_2) are given by Theorems 3(b) and 5, respectively.
Proof: From Theorem 7 it follows that
\[ V(L) = V(B_1 + \cdots + B_{n-1}) = V(B_1) + \cdots + V(B_{n-1}) + 2 \sum_{i=1}^{n-2} \mathrm{Cov}(B_i, B_{i+1}), \]
because the other covariances are zero. Since the last sum has (n − 2) elements,
the statement follows.
Theorem 9: The variance of the arc length can be alternatively computed as
\[ V(L) = (n - 1)\, E(B^2) + 2 (n - 2)\, E(B_1 B_2) + (5 - 3n)\, E(B)^2 \]
\[ = (n - 1) \sum_{i=0}^{m-1} (i^2 + 1)\, q_i + 2 (n - 2) \sum_{i=1}^{m} \sum_{j=1}^{m} \sum_{k=1}^{m} p_i p_j p_k \sqrt{(i - j)^2 + 1}\, \sqrt{(j - k)^2 + 1} + (5 - 3n) \left( \sum_{i=0}^{m-1} \sqrt{i^2 + 1}\, q_i \right)^2. \]

Proof: From Theorem 8 we obtain
\[ V(L) = (n - 1) \bigl[ E(B^2) - E(B)^2 \bigr] + 2 (n - 2) \bigl[ E(B_1 B_2) - E(B)^2 \bigr] = (n - 1)\, E(B^2) + 2 (n - 2)\, E(B_1 B_2) + (5 - 3n)\, E(B)^2. \]
Calculating E(B²), E(B_1 B_2) and E(B) by means of Theorems 2–4 yields the
statement.
We will now consider a concrete linguistic application.
Example 6: Consider the sequence of syllabic lengths of the verses in the poem
Der Erlkönig by Goethe:

(x1,…,x32) = (8,7,8,8,9,6,6,6,7,7,7,6,8,5,6,6,7,6,6,8,9,5,8,7,8,9,9,6,6,7,7,7) (10)

(see Popescu et al. (2013, p. 52)). The corresponding observed arc length is L
≈ 50.63.
We consider the probabilistic model in which a random variable Xi assumes
the values 1,…,9 with probabilities corresponding to the observed relative
frequencies in (10), i.e. (p1,…,p9) = (0,0,0,0,2/32,10/32,9/32,7/32,4/32) and m
= 9. From the definition of the numbers qi we obtain (q0,…,q8) = (125/512,
201/512, 31/128, 27/256, 1/64, 0, 0, 0, 0). Hence
\[ E(B) = \sum_{i=0}^{8} \sqrt{i^2 + 1}\, q_i \approx 1.7388, \qquad E(B^2) = \sum_{i=0}^{8} (i^2 + 1)\, q_i \approx 3.5605, \]
\[ E(B_1 B_2) = \sum_{i=1}^{9} \sum_{j=1}^{9} \sum_{k=1}^{9} p_i p_j p_k \sqrt{(i - j)^2 + 1}\, \sqrt{(j - k)^2 + 1} \approx 3.1115, \]
yielding
E(L) = (n − 1) · E(B) = 31 · 1.7388 ≈ 53.90,
V(L) = (n − 1) E(B²) + 2(n − 2) E(B_1B_2) + (5 − 3n) E(B)² = 31 · 3.5605 + 60 · 3.1115 − 91 · 1.7388² ≈ 21.93.
Thus in this example the observed arc length is smaller than expected in the
probability model; however, the absolute deviation between the observed and
expected arc length is less than the standard deviation σ = √V(L) ≈ 4.68.
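Combining Theorems 6 and 9, the whole example can be recomputed in a few self-contained lines; the printed values should be close to L ≈ 50.63, E(L) ≈ 53.90 and V(L) ≈ 21.93 (E(B) and E(B²) are written as double sums over the values of X and Y, equivalently to Theorem 2):

```python
from math import sqrt
from collections import Counter

x = [8,7,8,8,9,6,6,6,7,7,7,6,8,5,6,6,7,6,6,8,9,5,8,7,8,9,9,6,6,7,7,7]   # sequence (10)
n, m = len(x), 9

L_obs = sum(sqrt((a - b) ** 2 + 1) for a, b in zip(x, x[1:]))            # approx. 50.63

counts = Counter(x)
p = [counts.get(i, 0) / n for i in range(1, m + 1)]      # observed relative frequencies
idx = range(m)

E_B   = sum(p[i] * p[j] * sqrt((i - j) ** 2 + 1) for i in idx for j in idx)
E_B2  = sum(p[i] * p[j] * ((i - j) ** 2 + 1) for i in idx for j in idx)
E_B1B2 = sum(p[i] * p[j] * p[k] * sqrt((i - j) ** 2 + 1) * sqrt((j - k) ** 2 + 1)
             for i in idx for j in idx for k in idx)      # Theorem 4

E_L = (n - 1) * E_B                                                      # Theorem 6
V_L = (n - 1) * E_B2 + 2 * (n - 2) * E_B1B2 + (5 - 3 * n) * E_B ** 2     # Theorem 9
print(round(L_obs, 2), round(E_L, 2), round(V_L, 2))
```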

3 Applications
We study the sequences (x1,…,xn) determined for 32 texts from 14 different lan-
guages (see Table 1), where xi represents the length of the i-th word in the text.
The texts are the same as in Zörnig (2013, Section 4, Tab. 1-9). The first column
of Table 1 indicates the text, e.g. number 3a indicates the third text in Table 3 of

the aforementioned article. For each sequence the observed arc length L was
determined according to (1) (see column 2 of Table 1).

Table 1: Characteristics of the arc length for 32 texts

Text   L   Lmin   Lmax   E(L)   V(L)   (L − E(L))/√V(L)   n   L/(n − 1)   m
1a) Bulgarian 1644.85 927.07 1994.13 1571.16 748.80 2.6930 926 1.7782 6
N. Ostrovskij
1b) Hungarian 2841.26 1316.31 3679.39 2792.90 2621.41 0.9446 1314 2.1639 9
A nomina lizmus...
1c) Hungarian 1016.71 460.31 1244.57 956.70 920.81 1.9775 458 2.2247 9
Kunczekolbász
1d) Macedonian 2251.33 1124.07 2719.36 2101.74 1413.60 3.9787 1123 2.0065 6
N. Ostrovskij
2a) Romanian 1681.01 892.49 2008.51 1591.80 1065.11 2.7337 891 1.8888 7
O. Paler
2b) Romanian 2718.28 1512.49 3361.98 2685.09 1903.83 0.7606 1511 1.8002 7
N. Steinhardt
2c) Russian 1319.67 793.49 1683.19 1339.49 726.42 -0.7354 792 1.6684 7
N. Ostrovskij
2d) Serbian 1703.14 1002.07 2075.82 1651.02 808.56 1.8328 1001 1.7031 6
N. Ostrovskij
3a) Slovak 1445.03 876.23 1783.55 1448.23 733.63 -0.1182 873 1.6571 6
Bachletová
3b) Slovak 1655.59 925.49 2139.51 1675.45 939.62 -0.6479 924 1.7937 7
Bachletová
3c) Slovenian 1556.70 978.07 1923.95 1566.72 694.73 -0.3798 977 1.5950 6
N. Ostrovskij
4a) Sundanese 2011.23 1283.66 2318.10 1937.87 652.49 2.8720 1183 1.5688 5
Aki Satimi
4b) Sundanese 664.27 417.07 761.00 637.63 251.58 1.6795 416 1.6007 6
Agustusan
4c) Indonesian 573.80 346.07 664.96 535.43 176.88 0.1784 345 1.6680 6
Pengurus
4d) Indonesian 456.16 281.07 551.39 446.27 189.24 0.7194 280 1.6350 6
Sekolah ditutup
5a) Bamana 4057.15 2617.90 4559.65 4059.86 3270.44 - 0.0474 2616 1.5515 8
Masadennin
5b) Bamana 3617.14 2394.49 4015.23 3585.09 2440.98 0.6488 2393 1.5122 7
Sonsanin
5c) Bamana 1893.39 1407.66 2123.92 1913.07 663.41 -0.7641 1407 1.3467 5
Namakɔrɔba
5d) Bamana 1739.97 1139.07 1966.32 1719.04 912.12 0.6930 1138 1.5303 6
Bamak’ sigicoya
6a) Vai 4079.90 3140.66 4579.51 4083.40 772.52 -0.1259 3140 1.2997 5
Mu ja vaa I
6b) Vai 631.40 495.24 719.79 632.45 95.25 -0.1081 495 1.2781 4
Sa’bu Mu’a’…
6c) Vai 571.29 426.24 612.39 552.79 105.29 1.8034 426 1.3442 4
Vande bɛ Wu’u
7a) Vai 2950.33 1663.07 3492.93 2817.18 1697.88 3.2314 1662 1.7762 6
Siika
7b) Tagalog 3794.27 1959.90 4309.66 3460.82 2393.52 6.8158 1958 1.9388 8
Rosales
7c) Tagalog 3238.09 1739.90 3740.27 3011.21 1946.99 5.1416 1738 1.8642 8
Hernandez
7d) Tagalog 2838.63 1467.90 3255.29 2594.65 1735.92 5.8558 1466 1.9376 8
Hernandez
8a) Romanian 1658.32 1003.07 1947.99 1583.01 742.53 2.7640 1002 1.6567 6
Popescu
8b) German 2587.27 1417.73 3228.75 2574.53 2515.87 0.2539 1415 1.8289 10
Assads
Familiendiktatur
8c) German 2153.84 1448.31 2652.87 2130.62 2228.24 0.4917 1146 1.8811 9
ATT00012
9a) German 2871.06 1569.73 3502.33 2817.12 2938.33 0.9952 1567 1.8334 10
Die Stadt des
Schweigens
9b) German 2475.40 1400.31 2972.81 2418.87 2074.72 1.2412 1398 1.7719 9
Terror in Ost-
Timor
9c) German 2558.37 1365.31 3147.71 2529.07 2613.70 0.5731 1363 1.8784 9
Unter Hackern...

The columns 3 and 4 contain the minimal and approximate maximal arc length
Lmin and Lmax that could be achieved by permuting the respective sequence (see
Popescu et al. (2013, Section 2)). Columns 5 and 6 contain the expectation and
variance of the arc length, when it is considered as a random variable as in Section 2.
In this context the probabilities are assumed to be equal to the observed relative
frequencies in the respective sequence. For example, the sequence (x1,...,xn) of the first
text in Table 1 has length n = 926 and consists of m = 6 different elements chosen
from M = {1,...,6}. The frequency of occurrence of the element i in the sequence
is ki, where (k1,...,k6) = (336, 269, 213, 78, 27, 3), see Zörnig (2013, p. 57).
Thus the probabilities in the probabilistic model are assumed to be pi = ki /n for i
= 1,…,6 (see also Example 6).
Column 7 contains the values of the standardized arc length, defined by
\[ L_{st} = \frac{L - E(L)}{\sqrt{V(L)}}, \]
which is calculated in order to make the observed arc lengths comparable (see e.g.
Ross (2007, p. 38) for this concept).
fourth text it was found Lst ≈ 3.98. This means that the observed arc length is
equal to the expected plus 3.98 times the standard deviation 𝜎𝜎, i.e. the observed
arc length is significantly larger than in the random model. The columns 8 to 10
𝐿𝐿
of Table 1 contain the sequence length n, the relative arc length 𝐿𝐿𝑟𝑟𝑟𝑟𝑟𝑟 = and
𝑁𝑁𝑁𝑁
the inventory size m.
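The model-based columns can be computed from a word-length sequence alone, as the following self-contained sketch illustrates (the probabilities p_i = k_i/n are taken from the observed sequence, as described above; the formulas are those of Theorems 4, 6 and 9):

```python
from math import sqrt

def arc_length_statistics(x, m):
    """Observed L, E(L), V(L) and the standardized arc length L_st for a sequence x over {1,...,m}."""
    n = len(x)
    L = sum(sqrt((a - b) ** 2 + 1) for a, b in zip(x, x[1:]))

    p = [x.count(i) / n for i in range(1, m + 1)]         # p_i = k_i / n
    idx = range(m)
    E_B  = sum(p[i] * p[j] * sqrt((i - j) ** 2 + 1) for i in idx for j in idx)
    E_B2 = sum(p[i] * p[j] * ((i - j) ** 2 + 1) for i in idx for j in idx)
    E_B1B2 = sum(p[i] * p[j] * p[k] * sqrt((i - j) ** 2 + 1) * sqrt((j - k) ** 2 + 1)
                 for i in idx for j in idx for k in idx)

    E_L = (n - 1) * E_B
    V_L = (n - 1) * E_B2 + 2 * (n - 2) * E_B1B2 + (5 - 3 * n) * E_B ** 2
    return L, E_L, V_L, (L - E_L) / sqrt(V_L)             # the last value is L_st
```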
It can be observed that the standardized arc length varies between -0.7641
(text 5c: Bamana) and 6.8158 (text 7b: Tagalog). In particular, the latter text is
characterized by an extremely high arc length which corresponds to a very fre-
quent change between different entities. The results, which should be validated
for further texts, can be used e.g. for typological purposes. The resulting quantities
are not isolated entities; their relations to other text properties must still
be investigated in order to incorporate them into synergetic control cycles.
Comparing the observed arc length L with the sequence length n, one can
easily detect the linear correlation illustrated graphically in Fig. 2. By means of a
simple linear regression we found the relation L = 254.58 + 1.5023 n correspond-
ing to the straight line.
As a measure for the goodness of fit we calculate the coefficient of determination
(see Zörnig (2013, p. 56)), defined by
\[ R^2 = 1 - \frac{\sum \bigl(254.58 + 1.5023\,n - L(n)\bigr)^2}{\sum \bigl(L(n) - L_{mean}\bigr)^2}, \]
where L(n) is the observed arc length corresponding to the sequence length
n and L_mean is the mean of the 32 observed L-values. The summation runs over all
32 values of n in Table 1. The above formula yields R² = 0.9082. Since R² > 0.9,
the fit can be considered good.
Finally, Fig. 3 illustrates graphically the relation between the inventory size m
and the observed relative arc length L_rel = L/(n − 1). Though different values of L_rel can
belong to the same inventory size m, there seems to be a slight tendency for
L_rel to increase with m. This observation is in accordance with the relations following
from the probability model (see the remark after Theorem 6). The simple
linear regression yields L_rel = 1.0192 + 0.1012 m, which corresponds to the
straight line. The coefficient of determination is R² = 0.5267.

Fig. 2: Linear regression

Fig. 3: Relative arc length in dependence on the inventory size



4 Simulating the distribution of the arc length


In Section 2 we derived formulas for the two most important characteristics of
the random variable arc length (2), namely expectation and variance. It would
be desirable to have a formula for the probability mass function (pmf) available.
But this is virtually impossible in view of the long sequences that may arise in
linguistics. A feasible approach to obtain more information about the pmf is
simulation. The simple procedure which can be performed by any statistical
software is as follows:
Generate k random sequences (x1,…,xn) with elements from a set M =
{1,…,m}, such that i is chosen with probability pi (p1+…+pm = 1). For each se-
quence calculate the arc length L according to formula (2) and store the k values
in a list. This list is an “approximation” for the probability distribution of L with
specified pi.
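A direct implementation of this procedure (a sketch; k = 5000 replications as used in the chapter, and any statistical software would serve equally well):

```python
import random
from math import sqrt

def simulate_arc_lengths(p, n, k=5000):
    """k simulated arc lengths of sequences of length n over M = {1,...,m},
    where element i is drawn with probability p[i-1]."""
    values = list(range(1, len(p) + 1))
    samples = []
    for _ in range(k):
        x = random.choices(values, weights=p, k=n)
        samples.append(sum(sqrt((a - b) ** 2 + 1) for a, b in zip(x, x[1:])))
    return samples        # an "approximation" of the distribution of L
```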
For two of the texts in Table 1, the histograms of the lists of L-values are
shown in Fig. 4. (As before, the probabilities have been set equal to the ob-
served relative frequencies in the respective texts). Each of the lists consisted of
k = 5000 values and it can be observed that the distributions are similar to nor-
mal distributions indicated by continuous lines in the histograms.
For most texts of Table 1, the statistical hypothesis that the list of L-values
originates from a normal distribution cannot be rejected by means of the
Shapiro-Wilk normality test.

Fig. 4: Histograms for the distribution of the arc length.



5 Concluding remarks
It is possible to consider generalizations or variations of the concept “arc
length”. By using the Minkowski distance we obtain a more general concept by
defining
\[ L = \sum_{i=1}^{n-1} \bigl( |X_i - X_{i+1}|^p + 1 \bigr)^{1/p} \quad \text{for } p \ge 1, \qquad (11) \]
which reduces to (2) for p = 2. One could also develop a continuous model
for the arc length by assuming that the variables Xi in (2) are distributed accord-
ing to a given probability density.
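A sketch of the generalized definition (11):

```python
def minkowski_arc_length(x, p=2):
    """Generalized arc length according to Eq. (11); p = 2 gives the Euclidean case (2)."""
    if p < 1:
        raise ValueError("p must be at least 1")
    return sum((abs(a - b) ** p + 1) ** (1 / p) for a, b in zip(x, x[1:]))
```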

References
Popescu, Ioan-Iovitz, Ján Mačutek & Gabriel Altmann. 2009. Aspects of word frequencies.
Lüdenscheid: RAM-Verlag.
Popescu, Ioan-Iovitz, Emmerich Kelih, Ján Mačutek, Radek Čech, Karl-Heinz Best & Gabriel
Altmann. 2010. Vectors and codes of text. Lüdenscheid: RAM-Verlag.
Popescu, Ioan-Iovitz, Radek Čech & Gabriel Altmann. 2011. The lambda-structure of texts.
Lüdenscheid: RAM-Verlag.
Popescu, Ioan-Iovitz, Peter Zörnig, Peter Grzybek, Sven Naumann & Gabriel Altmann. 2013.
Some statistics for sequential text properties. Glottometrics 26. 50–99.
Ross, Sheldon M. 2007. Introduction to probability models, 9th edn. Amsterdam: Elsevier.
Wälchli, Bernhard. 2011. Quantifying inner forms: A study in morphosemantics. Working paper
46. Bern: University of Bern, Institute for Linguistics. http://www.isw.unibe.ch (accessed
December 2013)
Zörnig, Peter. 2013. A continuous model for the distances between coextensive words in a text.
Glottometrics 25. 54–68.
Subject Index

accuracy, 188, 202, 206, 212 – generalized, 9, 10, 11, 12, 13, 14, 21, 22, 26,
adjectives, 4, 17, 160 29, 32
adverbs, 17, 42 – part of speech, 16, 17, 20, 25, 26, 27, 28
affix, 2 – partial, 9, 10, 11, 12, 13, 14, 18, 21, 23, 32
ANOVA, 109, 110 – rhymed, 17, 20, 23, 25, 30
arc, 20 – unrhymed, 18, 23, 24, 25, 28
– length, 4, 84, 231, 232, 236, 237, 238, 239, – weighted sum, 11
240, 241, 242, 243, 244, 245 – weights, 11, 12, 13, 18, 19, 21, 28
associative symmetric sums, 10 characters, 36, 39, 41, 91, 178, 202
author, 77, 78, 94, 149, 152, 157, 165, 205 – frequency, 220
– conscious control, 77 chi-square, 40, 41, 42, 43, 45, 46, 54, 94, 100
– subconscious control, 77 Chomsky, Noam, 67
authorship attribution, 28, 72, 110, 130 clauses, 58, 59, 60, 95
autocorrelation, 4, 35, 43, 231 – length, 58, 59, 60, 61
– function. See function coefficient of variation, 157
– index, 38, 39, 40, 41, 42, 45, 46, 51, 52 cognacy judgments, 171
– negative, 39, 44 cognates, 171, 172
– positive, 41, 44, 48 – automatic identification, 173, 174, 175
– semantic, 50 comparative method, 171
– textual, 35, 38, 39, 46, 47, 48, 49, 54 conditional random fields (CRF) model, 205,
automated similarity judgment program 206, 207, 208, 210, 211, 212, 213
(ASJP), 172, 176, 177, 178, 179, 180, 183, confidence intervals, 61, 65, 136, 139, 153
189, 193 confusion matrix, 212
autoregressive integrated moving average control cycle, 2, 3, 4
(ARIMA) model, 152, 161 corpus, 49, 112, 126, 176, 208
autoregressive–moving-average (ARMA) – Brown, 50
model, 152, 169 – Potsdam, 98
axioms, 9, 10, 14 – reference, 50
– commutativity, 11 correlation, 13, 30, 148, 161, 186, 209
Bayes classifier, 203, 204, 212 – coefficient, 13, 17
binary coding, 148, 162 – coefficient of determination, 61, 62, 63, 64,
Bonferroni correction, 80, 187 65, 127, 242
borderline conditions, 9 – negative, 161, 163
boundary conditions, 2, 5, 72 – Pearson’s r, 17, 185
centroid, 44 – point biserial, 185
chapters, 1, 8, 26, 27, 28, 29, 32, 75, 76, 147, – Spearman’s ρ, 186
149, 154, 155, 157, 158, 159, 161, 163, correspondence analysis, 46
164, 165 cross entropy, 184
– length, 72, 73, 74, 76, 148, 150 cross validation, 212
characteristics, 8, 9, 11, 13, 14, 17, 18, 21, 22, dependent variable, 204
23, 28, 32 dictionary, 2, 90, 91
– consistency, 8, 12, 13, 14, 16, 17, 18, 20, 21, dimensionality, 11
22, 23, 24, 25, 26, 28, 29, 32 dispersion, 157, 158

distribution – autocorrelation (ACF), 153, 161, 162, 163,


– Bose, 217, 218 164, 168, 170
– cumulative, 219 – autocovariance. See time series
– distances, 1 – continuous, 9
– empirical, 94 – cumulative distribution, 153
– frequency, 95, 130, 134, 144 – geometric, 96
– hyper-binomial, 98 – logistic growth, 114
– hyper-Pascal, 96 – partial autocorrelation (PACF), 152, 161, 170
– hyper-Poisson, 96, 97 – periodical, 152
– mixed negative binomial, 99 – power, 127, 130
– motif length, 130 – recurrence, 1
– motifs, 96, 98, 133, 134, 137, 138, 139 – symmetric, 9
– negative binomial, 99 – trend, 153
– normal, 39 fuzzy graph, 20, 21, 22, 23
– nouns, 46 fuzzy quantifiers, 12
– part of speech, 28 gender, 7, 42, 43, 44
– probability, 35, 89, 94, 95, 111 generalized estimation, 9, 11, 12, 13, 16, 26
– rank-frequency, 92, 93, 100, 219 goodness of fit, 61, 62, 63, 242
– right truncated Zipf-Alexeev, 135 hapax legomena, 218, 219
– stationary, 35, 36, 41, 44, 47 Haspelmath, Martin, 191
– verbs, 46 hidden conditional random fields (HCRF), 206
– verse length, 148 hidden Markov models (HMM), 206, 207
– word frequency, 219 Hurst’s coefficient, 4
– word length, 96, 128, 130, 133, 134, 135, hyperlinks, 46, 47, 49, 54
144 hyponymy, 50
– Zipf, 182 interjections, 42
– Zipf-Mandelbrot, 92, 94, 95, 100, 133 inter-language distance, 172, 173
document models intuition, 2, 203
– bag-of-words, 35, 37, 38, 54, 204, 207 KL-divergence, 184
– random, 57, 58, 59, 65 lambda indicator, 71, 72, 73, 74, 75, 76, 77,
– sequential, 35, 54 78, 79, 83, 84, 85
document weights, 44, 45, 47 language change, 171
Dryer, Matthew S., 191 language family, 171, 178, 179, 184, 185
Durbin-Watson statistic, 42 – Dravidian, 176, 180, 181
edges, 20, 22 – Indo-Aryan, 176
Ethnologue, 176, 177, 179, 180, 181, 184, 185, – Indo-European, 3, 171
186 – Mayan, 176
Euclidean – Salishan, 174
– dissimilarities, 38, 39, 40, 51, 54 languages
– distances, 51, 71 – Arabic, 134, 141
expectation–maximization (EM) algorithm, – artificial, 178, 218
173 – Chinese, 91
false discovery rate (FDR), 187 – Czech (modern), 77, 78, 79, 80, 82, 84, 134,
Flesh formula, 110 141, 144, 163
Fourier series, 4 – Czech (old), 147, 149, 161, 163
fractal, 4, 57, 231 – English, 77, 78, 79, 81, 82, 83, 85, 110, 152,
function, 46 171, 178, 205

– French, 49, 148 – granularity, 91


– German, 171 – L-motif, 90, 92, 93, 96, 97, 133, 134, 135,
– Greek (ancient), 150 136, 137, 138, 139, 140, 141, 142, 143,
– Greek (modern), 126 144, 145
– Italian, 93, 99, 100 – P-motif, 90
– Latin, 148, 149, 150, 152 – rank-frequency relation (RFR), 133, 135
– Polish, 148, 152 – R-motif, 97, 98, 99, 100, 107
– Russian, 148 – T-motif, 90
letters, 90, 155, 157 – word length, 125, 126, 127, 129, 130
Levenshtein distance, 172, 175, 176 moving average model, 152
lexical series, 152 multidimensional scaling (MDS), 51
lexicostatistics, 171, 172 multivariate estimation, 8
linear regression, 155, 242 navigation
lines, 42, 44, 157, 159, 161, 163 – free, 37, 38, 40, 44, 45
– endings, 160 – hypertext, 41, 46, 47, 48, 49, 54
– length, 155, 157, 165 – linear periodic, 40, 41
linguistic laws, 57, 58, 77, 92, 111, 150, 231 – textual, 36, 44
– Menzerath-Altmann, 57, 58, 61, 62, 63, 65, neighbor-joining (NJ) algorithm, 172, 174, 176
95, 125, 126, 127, 128, 130 neighbourhoodness, 35, 37, 54
– Piotrowski-Altmann, 112 n-grams, 181, 182, 183
– Zipf, 59 – bigrams, 182, 183
– Zipf-Mandelbrot, 125 – character, 176, 182, 183
linguistics – frequency, 151
– computational, 111 – relative frequency, 183
– computational historical, 171 – trigrams, 39, 41, 182
– corpus, 111 nodes, 20, 21, 22, 23, 203
– historical, 171, 173, 176 nouns, 16, 17, 46, 50, 51, 160
– quantitative, 231 null hypothesis, 39, 41, 80, 134, 138, 144,
– synergetic, 4, 6, 96, 150 187, 188
majority voting, 208, 211 oscillations, 1, 4, 152, 231
Mandelbrot, Benoit, 67 PageRank algorithm, 47
Markov chain, 37, 39, 47, 49, 231 parts-of-speech (POS) tags, 42, 44
Markov transition matrix, 35, 36 passages, 42, 149, 153, 157, 158, 159
Markov transition probability, 41, 231 permutations, 41
Menzerath, Paul, 67 – test, 41, 45, 48, 49
Miller, George A., 67 phonemes, 2, 175, 183
Minkowski distance, 245 polysemy, 2
Mitzenmacher, Michael, 67 polytextuality, 2, 91, 96
Moran's I index of spatial statistics, 39 positional weights, 35
morphemes, 2, 125, 160 predicate, 59
motifs, 4, 89, 90, 91, 95, 97, 107, 109, 125, probabilistic latent semantic analysis (PLSA),
129, 139 203
– categorical, 98 probability, 35, 44, 47, 50, 51, 55, 207
– D-motif, 98, 107 – conditional, 208
– F-motif, 90, 144 process
– F-motiv, 89 – autoregressive, 169
– frequency, 125, 136, 137 – moving average, 169

– random, 169 – syllables, 148, 161, 163, 164


pronouns, 17, 160 – symbolic, 4
proto-language, 171 – symbols, 4
pseudotext, 134, 135, 136, 137, 138 – texts, 112
punctuation marks, 218 – tweets, 211
quantity (syllabic), 154, 162, 163, 164, 165 – word length, 133
random variable, 152, 153, 168, 232, 237, 239, – words, 59
241, 244 Shapiro-Wilk normality test, 244
rhyme, 28, 29, 158, 159, 165 similarity models, 8, 9, 12, 13, 14, 16, 20, 22,
– frequency, 159, 160 23, 24, 25, 26, 27, 28, 32
– homogeneity, 158, 159 social media, 201, 202, 203, 204, 209
– inventory, 158, 159 – blogs, 202, 203, 209
– pairs, 148, 158 – Twitter, 201, 202, 209, 210, 211
– repertory, 158 social networks
rhythm, 4, 29, 154, 161, 165 – analysis, 201
runs test, 148, 154 – characteristics, 203, 204
sales trends, 201, 204, 209, 213 spatial analysis, 35
screeplot, 52, 53 standard deviation, 155, 157, 184, 239, 242
semantic similarities, 50 stress, 154, 162, 163, 164, 165, 178
sentences, 1, 2, 3, 57, 58, 59, 60, 61, 67, 68, string similarity, 171, 173, 175, 177, 181, 184,
90, 96, 206 185, 186, 187
– bag-of-features representation, 206 style development, 29
– length, 59, 95, 110, 231 subgraphs, 20, 21
– polarity, 205, 206, 207, 208, 209, 211, 213 support vector machines (SVM), 204, 207
– polarity reversal, 206 syllables, 2, 4, 90, 91, 92, 125, 126, 134, 154,
sentiment analysis, 201, 203, 205, 206, 209 155, 156, 157, 161, 165
sequences, 1, 2, 3, 92, 99, 151, 152, 153 – coding, 155
– aphanumeric, 218 – long, 154
– chapters, 149, 155, 156, 157, 158, 160, 161, – short, 154
163 – stressed, 110, 154, 163
– distances, 4 – unstressed, 154, 163
– graphemes, 57 synonyms, 2
– historical data, 204, 210 synsets, 50
– letters, 59 syntactic complexity, 91
– linear, 1, 109 syntagmatic relations, 133, 145
– lines, 161, 162 syntagmatic time, 151
– mathematical, 1 syntax, 29
– motifs, 94 term-document matrix, 44, 45, 46, 54
– multidimensional, 4 terms, 3, 46, 49
– musical, 4 – frequency, 152
– numerical, 4 – magnification factor, 48
– part of speech, 4, 99 – weights, 45
– phonemes, 183 text
– rank-frequency, 232 – categorization, 182
– relative errors, 235 – classification, 7, 32, 91, 92
– sentences, 206, 208 – entities, 4
– syllable lengths, 239 – frequency structure, 71, 77, 78, 84, 217

– genre, 71, 72, 79, 130, 163, 224, 228 unweighted pair group method with
– length, 71, 72, 73, 77, 78, 79, 83, 84, 85, arithmetic mean (UPGMA), 172
110, 111, 221, 224, 225, 227 u-test, 80
– production, 224 verbs, 4, 16, 17, 42, 44, 46, 50, 51, 53, 54, 160
– randomization, 59, 128, 129, 135, 136, 137, verse, 28, 29, 147, 149, 153, 157, 159, 165, 239
138 – length, 148, 150, 153, 154, 155, 156, 161, 165
– topic, 203, 208, 209 vertices, 20
– trajectories, 227 vocabulary "richness", 71, 109
textual neighbourhoods, 35 word length, 1, 90, 91, 110, 125, 126, 127, 129,
textual series, 151 133, 231
time series, 35, 109, 148, 151, 152, 154, 161, word lists, 171, 172, 173, 175, 176, 178, 179,
168, 169, 203, 213, 231 188, 191
– autocovariance, 168, 169 WordNet, 50, 51, 53
– variance, 168 words, 36, 155, 157
tokens, 49, 50, 109, 134, 138, 227 – autosemantic, 219
translations, 8, 13, 26, 27, 28, 29, 30, 32, 217, – content, 114, 116, 117, 118
219, 220, 221, 223, 224, 226, 227, 228 – frequency, 112, 113, 125, 218, 220, 231
translator, 224, 228 – function, 114
tree inference algorithm, 173, 186 – ordering, 205
trend analysis, 148, 154 – orthographical, 218
tweets, 201, 202, 204, 208, 209, 212 – repetition, 3, 77
types, 35, 36, 39, 40, 41, 45, 49, 54, 84, 109, – senses, 205, 207
133, 218 – unrhymed, 28, 33
type-token ratio, 71, 72, 109, 110, 158 – valence reversers, 206
undirected graph, 207 – zero syllabic, 134
– fuzzy, 20, 23 world atlas of language structures (WALS),
177, 179, 180, 184, 185
Authors Index

Albert, Réka, 59, 66 Coleridge, Samuel T., 13, 16, 17, 18, 20, 26,
Altmann, Gabriel, 1, 5, 6, 28, 29, 33, 57, 66, 27, 29, 30, 31
71, 72, 86, 87, 95, 96, 108, 110, 113, 122, Cootner, Paul H., 202, 214
123, 125, 130, 131, 219, 229 Covington, Michael A., 71, 86, 110, 122
Andreev, Sergey, 7, 28, 33 Cramer, Irene M., 61, 66, 110, 123, 125, 130
Andres, Jan, 57, 61, 66 Cressie, Noel A.C, 35, 55
Anselin, Luc, 35, 55 Cryer, Jonathan, 152, 166
Antić, Gordana, 134, 145 Danforth, Christopher, 202, 214
Argamon, Shlomo, 28, 33 Daňhelka Jiří, 149, 166
Asur, Sitaram, 202, 203, 213, 215 de Campos, Haroldo, 217, 229
Atkinson, Quentin D., 171, 190 de Saint-Exupéry, Antoine, 217
Baixeries, Jaume, 57, 66 Delen, Dursun, 204, 215
Bakker, Dik, 172, 190, 191, 193 Dementyev, Andrey V., 9, 33
Barabási, Albert-László, 59, 66 Dickens, Charles, 72, 74, 76
Bavaud, François, 35, 39, 40, 41, 51, 55 Dockery, Everton, 202, 214
Beliankou, Andrei, 97, 98, 108 Dodds, Peter, 202, 214
Benešová, Martina, 57, 66, 128, 130 Dryer, Matthew S., 179, 190, 191
Benjamini, Yoav, 187, 190 Dubois, Didier, 9, 10, 11, 12, 33
Bergsland, Knut, 171, 190 Dunning, Ted E., 183, 190
Best, Karl-Heinz, 96, 108, 114 Durie, Mark, 171, 190, 192
Bollen, Johan, 202, 213 Dyen, Isidore, 174, 175, 190
Borisov, Vadim V., 7, 9, 26, 33 Eder, Maciej, 147, 148, 150, 152, 166, 167
Boroda, Mojsej, 4, 6, 89, 108 Ellegård, Alvar, 171, 190
Bouchard-Côté, Alexandre, 173, 190 Ellison, T. Mark, 173, 191
Box, George E., 152, 166 Elvevåg, Brita, 59, 66
Brew, Chris, 182, 190 Embleton, Sheila M., 171, 191
Brown, Cecil H., 178, 190, 191, 193 Eppen, Gary D., 202, 214
Buk, Solomija, 217, 219, 221, 227, 229 Fama, Eugene F., 202, 214
Bunge, Mario, 111, 122 Fedulov, Alexander S., 9, 26, 33
Bychkov, Igor A., 9, 33 Felsenstein, Joseph, 172, 174, 191
Carroll, John B., 2, 6 Ferrer i Cancho, Ramon, 59, 66, 217, 229
Carroll, Lewis, 217 Fontanari, José F., 217, 229
Cavnar, William B., 182, 190 Foulds, Leslie R., 176, 192
Čech, Radek, 29, 33, 71, 72, 86, 128, 130, 217, Francis, Nelson W., 50, 55
229 Gallagher, Liam A., 202, 214
Choi, Yejin, 206, 214 Gilbert, Eric, 202, 214
Chomsky, Noam, 59, 67 Glance, Natalie, 202, 214
Chrétien, C. Douglas, 171, 192 Glass, Gene V., 152, 166
Christiansen, Chris, 176, 190 Goethe, 239
Cleveland, William S., 155, 166 Goodman, Leo A., 184, 191
Cliff, Andrew D., 39, 55 Gottman, John M., 152, 166
Cohen, Avner, 59, 66 Gray, Russell D., 171, 173, 190, 191
Greenacre, Michael, 46, 55

Greenhill, Simon J., 173, 176, 191 Kroeber, Alfred L., 171, 192
Gregson, Robert A., 152, 166 Kruskal, William H., 184, 190, 191
Grinstead, Charles M., 47, 55 Kubát, Miroslav, 110, 123
Gruhl, Daniel, 202, 214 Kučera, Henry, 50, 55
Gulden, Timothy R., 217, 229 Lafferty, John, 205, 214
Gumilev, Nikolay, 13, 26, 27, 28, 29, 30, 31 Le Roux, Brigitte, 40, 55
Gunawardana, Asela, 206, 214 Lebanon, Guy, 206, 214
Hammarström, Harald, 172, 189, 192, 193 Lebart, Ludovic, 39, 55
Hašek, Jaroslav, 72, 73, 75 Lee, Kwang H., 18, 33
Haspelmath, Martin, 177, 179, 190, 191 Lerman Kristina, 204, 214
Hauer, Bradley, 173, 191 Levenshtein, Vladimir I., 172, 192
Havel, Václav, 60, 61 Levik, Wilhelm, 13, 26, 27, 28, 29, 30, 31
Hay, Richard A., 152, 166 Lewis, Paul M., 176, 192
Herdan, Gustav, 151, 166 Li, Wientian, 59, 67
Hess, Carla W., 71, 86 Lin, Dekang, 182, 192
Hochberg, Yosef, 187, 190 List, Johann-Mattis, 173, 192, 193
Hogg, Tad, 204, 214 Liu, Haitao, 128, 131
Holman, Eric W., 172, 178, 189, 190, 191, 193 Liu, Yang, 202, 203, 214
Hoover, David L., 7, 33 Lonsdale, Deryle, 176, 186, 191
Hrabák, Josef, 161, 166 Mačutek, Ján, 60, 66, 71, 86, 95, 108, 125,
Hřebíček, Luděk, 57, 66 127, 130, 131, 133, 135, 145
Hu, Fengguo, 128, 131 Mandelbrot, Benoit, 59, 67, 217, 229
Huang, Kerson, 218, 229 Mao, Yi, 206, 213, 214
Huberman, Bernardo A., 202, 203, 204, 213, Mardia, Kanti V., 51, 55
215 Martynenko, Gregory, 71, 86
Huff, Paul, 176, 186, 191 Matesis, Pavlos, 126, 127, 128, 129
Huffman, Stephen M., 176, 191 McCleary, Richard, 152, 166
Inkpen, Diana, 175, 191 McDonald, Ryan, 206, 214
Isihara, Akira, 218, 229 McFall, Joe D., 71, 86, 110, 122
Jäger, Gerhard, 176, 191 McKelvie, David, 182, 190
Jenkins, Gwilym M., 152, 166 Melamed, Dan I., 182, 192
Jireček, Josef, 149, 166 Menzerath, Paul, 57, 67
Juola, Patrick, 7, 33 Michailidis, George, 126, 127, 128
Kak, Subhash, 201, 203, 215 Michener, Charles D., 172, 193
Karahalios, Karrie, 202, 214 Mikros, George K., 7, 33
Kavussanos, Manolis, 202, 214 Milička, Jiří, 110, 123, 125, 128, 131
Keeney, Ralph L., 12, 33 Miller, George A., 50, 55, 59, 67
Kelih, Emmerich, 58, 66, 127, 131 Miller, Rupert G., 80, 86
Kirby, Simon, 173, 191 Milliex-Gritsi, Tatiana, 126, 127, 128
Köhler, Reinhard, 4, 6, 29, 33, 89, 91, 94, 95, Mishne, Gilad, 202, 214
96, 97, 98, 108, 109, 110, 123, 125, 130, Mitzenmacher, Michael, 59, 67
131, 133, 134, 145, 150, 151, 166, 167 Miyazima, Sasuke, 217, 229
Kohlhase, Jörg, 114, 122 Molière, 42
Kondrak, Grzegorz, 173, 174, 191, 192 Moran, Patrick, 39, 55
Koppel, Moshe, 28, 33 Müller, Dieter, 71, 86
Kosmidis, Kosmas, 217, 229 Naumann, Sven, 28, 33, 89, 91, 94, 95, 96,
Krajewski, Marek, 148, 150, 152, 167 97, 98, 108, 125, 131, 133, 145

Nei, Masatoshi, 172, 192 Sharda, Ramesh, 204, 215


Nichols, Johanna, 173, 192 Sherif, Tarek, 174, 192
Nordhoff, Sebastian, 172, 192 Sheskin, David J., 187, 192
Ogasawara, Osamu, 217, 229 Shimoni, Anat R., 28, 33
Ord, John K., 39, 55 Singh, Anil K., 176, 183, 192
Ourednik, André, 49, 55 Skiena, Steven, 203, 215
Page, Lawrence, 47, 55 Snell, Laurie J., 47, 55
Pak, Alexander, 202, 214 Sokal, Robert R., 172, 193
Paroubek, Patrick, 202, 214 Solé, Ricard V., 66
Pawłowski, Adam, 147, 148, 150, 151, 152, Solovyev, Alexander P., 9, 33
166, 167, 170 Stede, Manfred, 98, 108
Pedersen, Ted, 51, 55 Surana, Harshit, 176, 192
Perlovsky, Leonid I., 217, 229 Swadesh, Morris, 171, 172, 174, 175, 178, 189,
Petroni, Filippo, 175, 192 193
Poe, Edgar Allan, 51 Szabo, Gabor, 204, 215
Pompei, Simone, 176, 186, 192 Tate, Robert F., 184, 193
Popescu, Ioan-Iovitz, 28, 33, 71, 72, 77, 79, Taylor, Mark P., 202, 214
84, 85, 86, 219, 229, 232, 239, 241, 245 Tešitelová, Marie, 71, 86
Prade, Henri, 9, 10, 11, 12, 33 Torgeson, Warren S., 51, 55
Qian, Bo, 202, 214 Trenkle, John M., 182, 190
Quattoni, Ariadna, 206, 215 Trevisani, Matilda, 111, 112, 123
Raiffa, Howard, 12, 33 Tuldava, Juhan, 71, 87
Rama, Taraka, 171, 173, 176, 192, 193 Tuzzi, Arjuna, 109, 111, 112, 123
Rasheed, Khaled, 202, 214 Vogt, Hans, 171, 190
Ratkowsky, David A., 71, 86 Wälchli, Bernhard, 232, 245
Rentoumi, Vassiliki, 201, 206, 215 Weizman, Michael, 71, 87
Resnik, Philip, 50, 51, 55 Wichmann, Søren, 173, 175, 178, 184, 188,
Richards, Brian, 71, 86 189, 190, 191, 193
Robinson, David F., 176, 192 Wieling, Martijn, 176, 193
Romero, Daniel M., 204, 215 Wilson, Victor L., 152, 166
Ross, Malcolm, 171, 190, 192 Wimmer, Gejza, 5, 6, 71, 87, 96, 108, 110, 123,
Ross, Sheldon M., 238, 241, 245 125, 127, 131, 135, 145
Rouanet, Henry, 40, 55 Woronczak, Jerzy, 147, 148, 150, 153, 165, 167
Rovenchak, Andrij, 127, 131, 217, 218, 219, Xanthos, Aris, 35, 40, 55
221, 227, 228, 229 Xanthoulis, Giannis, 126, 127, 128
Rudman, Joseph, 7, 33 Yamamoto, Keizo, 217, 229
Saaty, Thomas L., 18, 26, 33 Yangarber, Roman, 189
Sadamitsu, Kugatsu, 206, 215 Yu, Sheng, 201, 203, 214, 215
Saitou, Naruya, 172, 192 Zernov, Mikhail M., 21, 34
Sanada, Haruko, 125, 131, 133, 145 Zhang, Wenbin, 203, 215
Sankoff, David, 171, 192 Zimmermann, Hans-Jürgen, 11, 34
Schmid, Helmut, 42, 51, 55 Zörnig, Peter, 231, 239, 241, 242, 245
Serva, Maurizio, 175, 192 Zysno, Peter, 11, 34
Authors’ addresses

Altmann, Gabriel
Stüttinghauser Ringstrasse 44, DE-58515 Lüdenscheid, Germany
email: ram-verlag@t-online.de

Andreev, Sergey N.
Department of Foreign Languages
Smolensk State University
Przhevalskogo 4, RU-214000 Smolensk, Russia
email: smol.an@mail.ru

Bavaud, François
Department of Language and Information Sciences
University of Lausanne
CH-1015 Lausanne, Switzerland
email: francois.bavaud@unil.ch

Benešová, Martina
Department of General Linguistics
Faculty of Arts
Palacký University
Křížkovského 14, CZ-77147 Olomouc, Czech Republic
email: martina.benesova@upol.cz

Borin, Lars
Department of Swedish
University of Göteborg
Box 200, SE-40530 Göteborg, Sweden
email: lars.borin@svenska.gu.se

Borisov, Vadim V.
Department of Computer Engineering
Smolensk Branch of National Research University “Moscow Power Engineering Institute”
Energeticheskiy proezd 1, RU-214013 Smolensk, Russia
email: vbor67@mail.ru

Čech, Radek
Department of Czech Language
University of Ostrava
Reální 5, CZ-77147 Ostrava, Czech Republic
email: cechradek@gmail.com

Cocco, Christelle
Department of Language and Information Sciences
University of Lausanne
CH-1015 Lausanne, Switzerland
email: christelle.cocco@alumnil.unil.ch

Eder, Maciej
Institute of Polish Studies
Pedagogical University of Kraków
ul. Podchorążych 2, PL-30084 Kraków, Poland

and

Institute of Polish Language
Polish Academy of Sciences
al. Mickiewicza 31, PL-31120 Kraków, Poland
email: maciejeder@ijp-pan.krakow.pl

Köhler, Reinhard
Computerlinguistik und Digital Humanities
University of Trier
FB II / Computerlinguistik, DE-54286 Trier, Germany
email: koehler@uni-trier.de

Krithara, Anastasia
Institute of Informatics and Telecommunications
National Center for Scientific Research (NCSR) ‘Demokritos’
Terma Patriarchou Grigoriou, Aghia Paraskevi, GR-15310 Athens, Greece
email: akrithara@iit.demokritos.gr

Mačutek, Ján
Department of Applied Mathematics and Statistics
Comenius University
Mlynská dolina, SK-84248 Bratislava, Slovakia
email: jmacutek@yahoo.com

Mikros, George K.
Department of Italian Language and Literature
School of Philosophy
National and Kapodistrian University of Athens
Panepistimioupoli Zografou, GR-15784 Athens, Greece
email: gmikros@gmail.com

and

Department of Applied Linguistics
College of Liberal Arts
University of Massachusetts Boston
100 Morrissey Boulevard, US-02125 Boston, MA, United States of America
email: Georgios.Mikros@umb.edu

Milička, Jiří
Institute of Comparative Linguistics
Faculty of Arts
Charles University
Nám. Jana Palacha 2, CZ-11638 Praha 1, Czech Republic
email: jiri@milicka.cz

and

Department of General Linguistics
Faculty of Arts
Palacký University
Křížkovského 14, CZ-77117 Olomouc, Czech Republic

Pawłowski, Adam
Institute of Library and Information Science
University of Wrocław
pl. Uniwersytecki 9/13, PL-50137 Wrocław, Poland
email: apawlow@uni.wroc.pl

Rentoumi, Vassiliki
SENTImedia Ltd.
86 Hazlewell Road, Putney, London, SW15 6UR, UK
email: vassiliki.rentoumi@gmail.com

Rama K., Taraka
Department of Swedish
University of Göteborg
Box 200, SE-40530 Göteborg, Sweden
email: taraka.rama.kasicheyanula@gu.se

Rovenchak, Andrij
Department for Theoretical Physics
Ivan Franko National University of Lviv
12 Drahomanov St., UA-79005 Lviv, Ukraine
email: andrij.rovenchak@gmail.com

Tuzzi, Arjuna
Department of Philosophy, Sociology, Education and Applied Psychology
University of Padova
via Cesarotti 10/12, IT-35123 Padova, Italy
email: arjuna.tuzzi@unipd.it

Tzanos, Nikos
SENTImedia Ltd.
86 Hazlewell Road, Putney, London, SW15 6UR, UK
email: corelon@gmail.com

Xanthos, Aris
Department of Language and Information Sciences
University of Lausanne
CH-1015 Lausanne, Switzerland
email: aris.xanthos@unil.ch

Zörnig, Peter
Department of Statistics
Institute of Exact Sciences
University of Brasília
70910-900 Brasília-DF, Brazil
email: peter@unb.br
