Historical Linguistics: Numerical Methods: Sheila M Embleton, York University, Toronto, ON, Canada

© 2015 Elsevier Ltd. All rights reserved.
International Encyclopedia of the Social & Behavioral Sciences, 2nd edition, Volume 11. http://dx.doi.org/10.1016/B978-0-08-097086-8.52012-1

Abstract

Numerical methods in historical linguistics strive to determine how closely two languages are related. Pairwise, this can be used to reconstruct a tree for a language family (lexicostatistics). More ambitious methods assign dates to splits within trees (glottochronology). The methods have been extended to distant relationships, often to try to answer whether the languages involved are related at all. All methods, no matter how sophisticated, rest on the simplest observation: since languages change over time, originally related languages gradually become less similar. This enables inferences about diachrony from synchronic information only, attractive when little is known about history. Methodological difficulties come in quantifying rates of change, measuring difference, and accounting for factors interfering with normal change processes (e.g., mutual influence between languages). The attractions of numerical methods are the same as in any humanities discipline: speed, objectivity, replicability, and ability to handle large volumes of data efficiently.

Introduction

The main goal of numerical methods in historical linguistics has always been to answer the question of how closely two languages are related. Applied pairwise to languages, this can be used to reconstruct a tree for a whole language family (lexicostatistics). More ambitious methods have additionally assigned dates to the splits within those trees (glottochronology). The questions and methods have been extended to involve distant (hence controversial) relationships between languages, so the basic question becomes not just how closely two languages are related, but whether they are related at all. All methods in this area, no matter how sophisticated, rest on the simplest of observations: since languages change over time, two originally related languages gradually become less similar over time. This enables the linguist to infer something about diachrony from synchronic information only, particularly attractive in situations where little is known about the history. The methodological difficulties come in quantifying rates of change, measuring difference, and accounting for the myriad of factors that can interfere with the so-called normal change processes (e.g., mutual influence between languages). The attractions of numerical methods in historical linguistics are the same as those attributed to quantitative methods in almost any discipline in the humanities or social sciences: speed, objectivity, replicability, and ability to handle large volumes of data efficiently.

Early Methods

The earliest precursor to a numerical method in historical linguistics is a statement made by Robert Gordon Latham (1850: p. 565, see also Hewes in Hymes, 1960: p. 338): “the average rate at which languages change is capable of being approximated” and “the maximum difference, at a given period, between two or more languages is also capable of being approximated”. He never pursued this, and the next milestone in the development of numerical methods in historical linguistics was a statement by Sapir (1916: p. 76): “The greater the degree of linguistic differentiation within a stock the greater is the period of time that must be assumed for the development of such differentiation”. Sapir’s influence on later linguists was great, so even though he himself never followed this statement up by taking the basic steps of quantifying linguistic differentiation or rate of change, this can be seen as the genesis of the most widely known numerical method in historical linguistics, that of Swadesh (1950, and see below).

The first explicit numerical treatment of genetic proximity of languages was that of Kroeber and Chrétien (1937), following a method created by two anthropologists, Jan Czekanowski (1928) and Stanislaw Klimek (1935), for use in physical anthropology and ethnography. For each pair of languages, on a list of N features, calculate a (number of features shared by both languages), b (number of features in the first language but not in the second), c (number of features in the second language but not in the first), and d (number of features in neither language). This produces a 2 × 2 contingency table, a well-studied mathematical entity enabling various forms of analysis. In particular, the quantity $Nr^2$, where $r = (ad - bc)/\sqrt{(a + b)(c + d)(a + c)(b + d)}$, is $\chi^2$ (one degree of freedom), enabling one to test the statistical significance of the relationship between the two languages. The values of r for all pairs of languages under consideration can become the input to various methods (e.g., hierarchical cluster analysis) for family tree reconstruction. The method is simple but not unproblematic (e.g., construction of the feature list, difficulties with nonindependence of features, interpretation of correlations especially if negative) (for complete treatment and commentary, see Embleton, 1986).

The second numerical method in historical linguistics is Ross’s root-retention model (1950). For N features and m languages, construct an N × m table in which a cell has a cross if that language exhibits that feature. The question of relationship of languages can then be recast as whether or not the crosses are distributed randomly in the table. For each pair of languages, calculate $n_1$ (number of features found in the first language, the number of crosses in the first column), $n_2$ (number of features found in the second language, the number of crosses in the second column), and r (number of features shared by both
languages, the number of rows with a cross in both columns). Then the probability for that pair of languages of obtaining the shared number of crosses or more is

$$P = \sum_{R=r}^{n_1} \frac{\binom{n_1}{R}\binom{N - n_1}{n_2 - R}}{\binom{N}{n_2}}$$

for which a normal approximation exists. These probabilities can be tested for significance, and could become the input for various family tree reconstruction methods. Interestingly, Ross’s method can be shown to be mathematically equivalent under certain conditions to that of Kroeber and Chrétien, and thus subject to the same problems (for complete treatment and commentary on this and the method in general, see Embleton, 1986). Although certain elements of these models recur today, both have largely been abandoned, although the problems encountered and lessons learned remain relevant.

Historically, the next method to evolve was Swadesh’s, but given its centrality to the field and ongoing relevance, the author deals with that in a separate section. A final numerical method to consider is that of counterindications originated by Gleason (1959), later rediscovered by others. All topologically possible family trees for the languages in question are constructed, each is evaluated for how well it fits against a fixed list of features (every time the tree does not fit the distribution of that feature, a counterindication is recorded), and the tree with the best score (even if not perfect) is chosen. This method is difficult to implement for two reasons. Even with modern high-speed computing and sophisticated algorithms, the number of topologically possible trees for even a moderately sized language family quickly becomes unworkably large. The method is also highly vulnerable to the effect of borrowing across languages. A variant of Gleason’s method, relying on hierarchical clustering and attempting to estimate branch lengths (i.e., used glottochronologically), is shown in Embleton (1986), along with a complete discussion of the method itself. Taylor et al. (2000), using recent advances and sophisticated techniques from computer science and biological cladistics, are developing a character-based (rather than distance-based) method for family tree reconstruction with some similarities to Gleason’s method. Despite difficulties with dealing with borrowing in realistic amounts, such methods are promising partly because they closely parallel the cognitive processes of skilled practitioners of traditional comparative historical reconstruction techniques, while being more objective and simultaneously handling vast amounts of data.

Swadesh

By far the best known numerical method in historical linguistics is that introduced by Swadesh (e.g., 1950), actualizing Sapir’s statement by regarding lexical replacement as analogous to radioactive decay. As with carbon-14 dating, dates could then be provided for branching points in the tree (glottochronology). Take a list of N meanings of supposed basic core vocabulary, found universally in all cultures, expressing basic noncultural concepts, expected to be more resistant to change and borrowing than other vocabulary items. Such a list, now known as a Swadesh list, contains such items as lower numerals, simple kinship terms, body parts, topographical terms, widespread flora and fauna, pronouns (personal, demonstrative, and interrogative), verbs denoting basic actions, and naturally occurring noncultural phenomena. Assuming a constant rate of replacement, R, for words representing meanings on this list, one can calculate the time elapsed (time depth) since any pair of languages split using the formula

$$t = -\frac{1}{2R}\ln\left(\frac{c}{N}\right)$$

where c is the number of cognates on the Swadesh list between the two languages compared. By calculating time depths for all pairs of languages under consideration, we can construct a family tree with dates.

Swadesh’s method was greeted with widespread enthusiasm and widely applied in the late 1950s, the 1960s, and the early 1970s, almost with a sense of euphoria about finally having an exact tool for obtaining insights into hitherto unknowable linguistic prehistory. The debate over Swadesh’s method quickly became emotional, but regardless of the researcher’s subjectivity, a common core of problems can be identified, sometimes with potential solutions.

Difficulties with particular meanings on the Swadesh list include the following:

1. Some meanings cannot be translated into the target language. This is fairly rare, and best treated as missing data, reducing N accordingly.
2. Sometimes there are multiple possible translations in the target language for a meaning on the list. Different researchers might make different but equally valid choices from the range of possibilities (‘multiple synonymy’). Solutions include always counting only the most frequent possible translation, choosing at random, always choosing to maximize the cognate count, always choosing to minimize the cognate count, or various fractional scoring systems.

Difficulties with cognacy judgments (remember that one is not necessarily dealing with languages where the history and hence cognacy are well known) are as follows:

1. Due to accumulated sound change, cognates may not be recognizable, causing us to overestimate time depth.
2. Chance similarities may be wrongly interpreted as cognates, causing us to underestimate time depth. In practice, one might hope that this and the preceding cancel out.
3. Sometimes only one part of a word is cognate to a word in another language (‘partial cognation’); how should one score this? Solutions include always counting this as positive, always counting this as negative, or various fractional scoring systems.

Problems with the universality of R, which was shown to vary over time, between languages, and according to meaning. R had been estimated from languages with written historical records as about 14% per millennium for the standard 100-word Swadesh list and about 19% per millennium for the
standard 200-word Swadesh list. Not only might the very fact that these languages are written bias the rate (one might expect less replacement in a language with a lengthy written history), but 11 of the 13 pairs used to establish the rate were Indo-European (representing only three of its branches).

Problems with the family tree (Stammbaum) model, generally centered on the splits not being clean (thus giving any precise date is artificial and meaningless). Languages are often in close physical proximity even after they split, mutually influencing one another; tendencies in the two languages predating the split may come to fruition only afterward. Both these interfere with the model, but are more properly problems with the family tree model, and Swadesh’s adoption of it, rather than problems with Swadesh’s method itself.

Borrowings between languages (related or unrelated) can alter replacement rates quite substantially. For some languages, even the Swadesh list contains a high amount of borrowed vocabulary.

For various reasons (see Embleton, 1986, and elsewhere), the use of Swadesh-style lexicostatistics and glottochronology waned by the mid-1970s, although applications lingered on in several language areas, and the methods continued to be refined by applied mathematicians.

Sankoff

Most refinements introduced by applied mathematicians involved adding further parameters to the basic Swadesh model. For example, parameters were added to account for drift and for ‘chance recurrent cognation’ (i.e., when a word was replaced twice, such that the original word was restored), replacement rates were allowed to vary from meaning to meaning, a very crude allowance was made for borrowing, and there was talk of allowing the rates to vary from language to language and from time to time (further details, along with some models, can be found in Embleton, 1986). Some of these are useful refinements (e.g., allowing replacement and borrowing rates to vary over time, between languages, and between meanings), but these are very much a mathematician’s way of solving the problem, producing quite elegant solutions. But in the real world in which the linguist works, all those parameters have to be estimated – and this is surprisingly difficult even for well-studied languages for which we actually have both long and extensive written records. It is also the case that an extraordinary amount of data is required to estimate so many parameters, and this is most often simply impossible. So these elegant theoretical solutions have very little practical import.

However, the single most important development was the incorporation of a geographical dimension into the family tree, thus allowing for borrowing to be largely dealt with. The basic insight here is due to Sankoff (1972), with correction and further refinement by Embleton (see 1986 for the most comprehensive treatment). The resulting set of differential equations required to reconstruct the tree and estimate the branch lengths is complicated, reflecting the increased power of the method (equations are given in Embleton, 1986, and elsewhere). The parameters required all have some reasonable chance of being estimated, even for lesser studied families; these are a similarity measure for each language pair, replacement rates for each language, borrowing rates between each language pair (not necessarily symmetric), and knowledge of which languages are in contact with one another.

Starostin

Largely independently of the work described above, research on numerical methods in historical linguistics was proceeding in the Soviet Union (later Russia or ‘in diaspora’). Unlike the impetus for Swadesh’s work, which was largely to aid in providing histories and trees for relatively little studied language families (e.g., Amerindian), much of this work gained its impetus from questions of more distant linguistic relationship. This had its effect on the development of the methods, because it was crucial that the methods be able to produce useful results even at much greater time depths, where Swadesh’s method typically has difficulty distinguishing the signal (cognacy, true genetic relationship) from the noise (chance resemblance, contact relationship).

Historically, the first advance here was due to Dolgopolsky (1986), who capitalized on the fact that different meanings change at different rates. He compiled a list of the 15 most stable meanings, based on examination of 140 languages. One can then calculate the probability of fortuitous coincidence of the words representing those 15 most stable meanings for any language pair; if the probability is significantly less than that expected by mere chance, this can be taken as evidence of relationship. This method is most striking in its extreme simplicity, but can also be related to some more complex and ongoing work, largely carried out in Eastern Europe, founded on the observation that several characteristics of a word (e.g., length, number of meanings, age) are intercorrelated with its frequency. Words of more recent origin tend to be less frequent, longer, and have fewer meanings, whereas older words tend to be more frequent, shorter, and have more meanings. The age–frequency relationship in particular is being extensively researched, seeking a small hard core of ancient short vocabulary that may reveal distant relationships, and may even be able to be harnessed into a lexicostatistical or even glottochronological method.

Another advance has been made by Sergei Starostin, with work first published in the 1980s and continuing until his death in 2005, known as root glottochronology or etymostatistics (see Starostin, 1999). Compared to standard Swadesh-style glottochronology, the main innovation is the observation that while glottochronology regards any change (however small) in a word’s meaning as a loss of that word from that meaning, root glottochronology does not. Thus, while standard glottochronology sees English dog and German Hund in the meaning ‘dog’ as noncognate, root glottochronology notices the existence of the English word hound; this more conservative approach to the maintenance of a language’s root stock enables one to work better at greater time depths. Starostin’s method also gives more accurate results for known cases where the time depths are more shallow (standard glottochronology tends to give dates that are too recent). Root glottochronology takes a text, containing a given set of words in the parent language, and then measures the number of these words still extant (in any derivative, with any semantic shift) in the daughter language. It proceeds only with detailed comparative historical and etymological analysis
(eliminating look-alikes and loans), and can also be repeated for more than one text (hence a different set of words), giving greater precision and potentially some estimate of error (see Starostin, 1999, for formulas and more detailed explanation).

Ringe

Various linguists have used probabilistic methods to assess similarity between vocabularies of two different languages, usually focusing on phonological similarities (an early example is Ross, 1950). The basic question is whether the observed degree of similarity between the two vocabularies is greater than what could be expected based on chance. Although the problem is conceptually simple, it quickly becomes mathematically intractable even between just two languages, as one has to wrestle with different phoneme distributions, different phoneme frequencies, how to measure phonetic distance between different phone(me)s, different phonotactic structures, and lengths of roots, quite apart from the questions (familiar from glottochronology) of how much semantic latitude to allow between the words compared. The most recent and sophisticated treatment of this problem is by Ringe (e.g., 1998 and references therein; McMahon and McMahon, 2005: pp. 50–68), who attempts to account for many of the above issues, albeit with simplifying assumptions. This research remains active, yet without consensus as to its utility, partly because of the degree of the simplifying assumptions, and also partly because it has mostly been applied in the realm of long-distance relationships, where the emotions over the results can sometimes cloud the objectivity with which the method is viewed.

Current Work and Future Directions

Research in numerical methods in historical linguistics has reentered the mainstream after several decades on the sidelines. Several recent collections, published in mainstream venues, demonstrate the diversity and debates within the field; see in particular two special issues of journals (McMahon, 2005; Wichmann and Grant, 2010), as well as Holman et al. (2011), Eska and Ringe (2004), and Levinson and Gray (2012). The most active areas of current research are improvements to the basic Swadesh model; exploration of the age–frequency relationship; root glottochronology; computer-implemented tree reconstruction using methods borrowed from biological cladistics, due to the conceptual parallels between language change and biological evolution (computational cladistics, computational phylogenetics); and Ringe’s probabilistic methods. The degree of mathematical, statistical, and computational expertise required is extremely high, which prevents the majority of historical linguists from ever mastering the methods, and consequently renders them unable to do more than use results produced by others or participate in research teams where others control the methods. There is a growing realization, however, that numerical methods, while not providing any magic solutions to any long-standing problems, make a useful contribution as one of the tools that we use to explore our linguistic history and prehistory.

See also: Language Families, Archaeology and History of; Lexicology and Lexicography.

Bibliography

Czekanowski, Jan, 1928. Na marginesie recenzji P.K. Moszyńskiego o Książce: Wstęp do Historji Slowian. Lud 7 (Series II). Reprint Lwow.
Dolgopolsky, Aaron, 1986. A probabilistic hypothesis concerning the oldest relationships among the language families in northern Eurasia. In: Shevoroshkin, Vitaliy, Markey, Thomas (Eds.), Typology, Relationship and Time: A Collection of Papers on Language Change and Relationship. Karoma, Ann Arbor, MI, pp. 28–50.
Embleton, Sheila M., 1986. Statistics in Historical Linguistics. In: Quantitative Linguistics, vol. 30. Brockmeyer, Bochum, Germany.
Eska, Joseph F., Ringe, Donald, 2004. Recent work in computational linguistic phylogeny. Language 80, 569–582.
Gleason Jr., Henry A., 1959. Counting and calculating for historical reconstruction. Anthropological Linguistics 1, 22–32.
Holman, Eric W., Brown, Cecil H., Wichmann, Søren, et al., 2011. Automated dating of the world’s language families based on lexical similarity. Current Anthropology 52 (6), 841–875.
Hymes, Dell H., 1960. Lexicostatistics so far. Current Anthropology 1, 3–44, 338–345.
Klimek, Stanislaw, 1935. The structure of California Indian culture, culture element distributions. University of California Publications in American Archaeology and Ethnology 37 (1), 1–70.
Kroeber, Alfred L., Chrétien, Charles D., 1937. Quantitative classification of Indo-European languages. Language 13, 83–103.
Latham, Robert G., 1850. The Natural History of the Varieties of Man. Van Voorst, London.
Levinson, Stephen C., Gray, Russell D., 2012. Tools from evolutionary biology shed new light on the diversification of languages. Trends in Cognitive Sciences 16, 167–173.
McMahon, April, 2005. Introduction. Transactions of the Philological Society 103 (2), 113–119.
McMahon, April, McMahon, Robert, 2005. Language Classification by Numbers. Oxford University Press, Oxford.
Ringe Jr., Donald R., 1998. Probabilistic evidence for Indo-Uralic. In: Salmons, Joseph C., Joseph, Brian D. (Eds.), Nostratic: Sifting the Evidence. Benjamins, Amsterdam, pp. 153–197.
Ross, Alan S.C., 1950. Philological probability problems. Journal of the Royal Statistical Society B 12, 19–59.
Sankoff, David, 1972. Reconstructing the history and geography of an evolutionary tree. American Mathematical Monthly 79, 596–603.
Sapir, Edward, 1916. Time Perspective in Aboriginal American Culture. In: Memoir 90, Anthropological Series, vol. 13. Government Printing Bureau, Ottawa, Canada, pp. 389–462.
Starostin, Sergei, 1999. Comparative-historical linguistics and lexicostatistics. In: Renfrew, Colin, McMahon, April, Trask, Larry (Eds.), Time Depth in Historical Linguistics, vol. 2. McDonald Institute for Archaeological Research, Cambridge, UK, pp. 377–413.
Swadesh, Morris, 1950. Salish internal relationships. International Journal of American Linguistics 16, 157–167.
Taylor, Ann, Warnow, Tandy, Ringe, Donald, 2000. Character-based reconstruction of a linguistic cladogram. In: Smith, John Charles, Bentley, Delia (Eds.), Historical Linguistics 1995. Selected Papers from the 12th International Conference on Historical Linguistics, Manchester, August 1995, General Issues and Non-Germanic Languages, vol. 1. Benjamins, Amsterdam, pp. 393–408.
Wichmann, Søren, Grant, Anthony P. (Eds.), 2010. Quantitative approaches to Linguistic Diversity. Diachronica 27(2), 168 pp. (pp. 341–358).
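
As a rough illustration of the three closed-form calculations described under Early Methods and Swadesh, the following Python sketch computes the Kroeber and Chrétien correlation r with its Nr² statistic, Ross's probability of r or more shared features, and a Swadesh-style time depth. It is a minimal sketch under stated assumptions, not the author's implementation: the function names and the example counts are invented for illustration, the Ross sum is truncated at min(n1, n2) because later terms are zero, and the time-depth formula is assumed to carry a minus sign so that depths come out positive. The rate 0.14 is the article's per-millennium figure for the 100-word list.

```python
from math import comb, log, sqrt


def phi_and_chi2(a, b, c, d):
    """Return (r, N * r**2) for a 2 x 2 table of feature counts.

    a: features in both languages, b: first only,
    c: second only, d: neither (hypothetical argument names).
    """
    n = a + b + c + d
    denom = sqrt((a + b) * (c + d) * (a + c) * (b + d))
    r = (a * d - b * c) / denom
    return r, n * r * r


def ross_tail_probability(n_features, n1, n2, r):
    """Probability of r or more shared features if crosses fall at random.

    This is a hypergeometric tail; terms beyond min(n1, n2) are zero,
    so the sum stops there (an assumption about the upper limit).
    """
    upper = min(n1, n2)
    total = sum(
        comb(n1, k) * comb(n_features - n1, n2 - k)
        for k in range(r, upper + 1)
    )
    return total / comb(n_features, n2)


def time_depth(cognates, n_meanings, rate):
    """Time depth t = -ln(c / N) / (2R), in the time unit of the rate."""
    return -log(cognates / n_meanings) / (2 * rate)


if __name__ == "__main__":
    # Invented counts, for illustration only.
    r, chi2 = phi_and_chi2(40, 10, 10, 40)
    print("r =", r, "N*r^2 =", chi2)
    print("Ross tail:", ross_tail_probability(100, 50, 50, 40))
    print("time depth (millennia):", time_depth(74, 100, 0.14))
```

The Nr² value would be compared against a chi-square distribution with one degree of freedom, and the pairwise r values or time depths could then feed a clustering step of the kind the article mentions. None of the refinements discussed under Sankoff (borrowing, varying rates) are modeled here.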
