Historical Linguistics: Numerical Methods: Sheila M Embleton, York University, Toronto, ON, Canada
Abstract
Numerical methods in historical linguistics strive to determine how closely two languages are related. Pairwise, this can be
used to reconstruct a tree for a language family (lexicostatistics). More ambitious methods assign dates to splits within trees
(glottochronology). The methods have been extended to distant relationships, often to try to answer whether the languages
involved are related at all. All methods, no matter how sophisticated, rest on the simplest observation: since languages change
over time, originally related languages gradually become less similar. This enables inferences about diachrony from
synchronic information only, attractive when little is known about history. Methodological difficulties come in quantifying
rates of change, measuring difference, and accounting for factors interfering with normal change processes (e.g., mutual
influence between languages). The attractions of numerical methods are the same as in any humanities discipline: speed,
objectivity, replicability, and ability to handle large volumes of data efficiently.
International Encyclopedia of the Social & Behavioral Sciences, 2nd edition, Volume 11 http://dx.doi.org/10.1016/B978-0-08-097086-8.52012-1 23
languages, the number of rows with a cross in both columns). Then the probability for that pair of languages of obtaining the shared number of crosses or more is

$$\sum_{R=r}^{n_1} \frac{\binom{n_1}{R}\binom{N-n_1}{n_2-R}}{\binom{N}{n_2}}$$

for which a normal approximation exists. These probabilities can be tested for significance, and could become the input for various family tree reconstruction methods. Interestingly, Ross’s method can be shown to be mathematically equivalent under certain conditions to that of Kroeber and Chrétien, and thus subject to the same problems (for complete treatment and commentary on this and the method in general, see Embleton, 1986). Although certain elements of these models recur today, both have largely been abandoned, although the problems encountered and lessons learned remain relevant.
Historically, the next method to evolve was Swadesh’s, but given its centrality to the field and ongoing relevance, the author deals with that in a separate section. A final numerical method to consider is that of counterindications originated by Gleason (1959), later rediscovered by others. All topologically possible family trees for the languages in question are constructed, each is evaluated for how well it fits against a fixed list of features (every time the tree does not fit the distribution of that feature, a counterindication is recorded), and the tree with the best score (even if not perfect) is chosen. This method is difficult to implement for two reasons. Even with modern high-speed computing and sophisticated algorithms, the number of topologically possible trees for even a moderately sized language family quickly becomes unworkably large. The method is also highly vulnerable to the effect of borrowing across languages. A variant of Gleason’s method, relying on hierarchical clustering and attempting to estimate branch lengths (i.e., used glottochronologically), is shown in Embleton (1986), along with a complete discussion of the method itself. Taylor et al. (2000), using recent advances and sophisticated techniques from computer science and biological cladistics, are developing a character-based (rather than distance-based) method for family tree reconstruction with some similarities to Gleason’s method. Despite difficulties with dealing with borrowing in realistic amounts, such methods are promising partly because they closely parallel the cognitive processes of skilled practitioners of traditional comparative historical reconstruction techniques, while being more objective and simultaneously handling vast amounts of data.
Swadesh
By far the best known numerical method in historical linguistics is that introduced by Swadesh (e.g., 1950), actualizing Sapir’s statement by regarding lexical replacement as analogous to radioactive decay. As with carbon-14 dating, dates could then be provided for branching points in the tree (glottochronology). Take a list of N meanings of supposed basic core vocabulary, found universally in all cultures, expressing basic noncultural concepts, expected to be more resistant to change and borrowing than other vocabulary items. Such a list, now known as a Swadesh list, contains such items as lower numerals, simple kinship terms, body parts, topographical terms, widespread flora and fauna, pronouns (personal, demonstrative, and interrogative), verbs denoting basic actions, and naturally occurring noncultural phenomena. Assuming a constant rate of replacement, R, for words representing meanings on this list, one can calculate the time elapsed (time depth) since any pair of languages split using the formula

$$t = -\frac{1}{2R}\,\ln\frac{c}{N}$$

where c is the number of cognates on the Swadesh list between the two languages compared. By calculating time depths for all pairs of languages under consideration, we can construct a family tree with dates.
Swadesh’s method was greeted with widespread enthusiasm and widely applied in the late 1950s, the 1960s, and the early 1970s, almost with a sense of euphoria about finally having an exact tool for obtaining insights into hitherto unknowable linguistic prehistory. The debate over Swadesh’s method quickly became emotional, but regardless of the researcher’s subjectivity, a common core of problems can be identified, sometimes with potential solutions.
Difficulties with particular meanings on the Swadesh list include the following:
1. Some meanings cannot be translated into the target language. This is fairly rare, and best treated as missing data, reducing N accordingly.
2. Sometimes there are multiple possible translations in the target language for a meaning on the list. Different researchers might make different but equally valid choices from the range of possibilities (‘multiple synonymy’). Solutions include always counting only the most frequent possible translation, choosing at random, always choosing to maximize the cognate count, always choosing to minimize the cognate count, or various fractional scoring systems.
Difficulties with cognacy judgments (remember that one is not necessarily dealing with languages where the history and hence cognacy are well known) are as follows:
1. Due to accumulated sound change, cognates may not be recognizable, causing us to overestimate time depth.
2. Chance similarities may be wrongly interpreted as cognates, causing us to underestimate time depth. In practice, one might hope that this and the preceding cancel out.
3. Sometimes only one part of a word is cognate to a word in another language (‘partial cognation’); how should one score this? Solutions include always counting this as positive, always counting this as negative, or various fractional scoring systems.
Problems with the universality of R, which was shown to vary over time, between languages, and according to meaning. R had been estimated from languages with written historical records as about 14% per millennium for the standard 100-word Swadesh list and about 19% per millennium for the
standard 200-word Swadesh list. Not only might the very fact that these languages are written bias the rate (one might expect less replacement in a language with a lengthy written history), but 11 of the 13 pairs used to establish the rate were Indo-European (representing only three of its branches).
Problems with the family tree (Stammbaum) model, generally centered on the splits not being clean (thus giving any precise date is artificial and meaningless). Languages are often in close physical proximity even after they split, mutually influencing one another; tendencies in the two languages predating the split may come to fruition only afterward. Both these interfere with the model, but are more properly problems with the family tree model, and Swadesh’s adoption of it, rather than problems with Swadesh’s method itself.
Borrowings between languages (related or unrelated) can alter replacement rates quite substantially. For some languages, even the Swadesh list contains a high amount of borrowed vocabulary.
For various reasons (see Embleton, 1986, and elsewhere), the use of Swadesh-style lexicostatistics and glottochronology waned by the mid-1970s, although applications lingered on in several language areas, and the methods continued to be refined by applied mathematicians.
Sankoff
Most refinements introduced by applied mathematicians involved adding further parameters to the basic Swadesh model. For example, parameters were added to account for drift and for ‘chance recurrent cognation’ (i.e., when a word was replaced twice, such that the original word was restored), replacement rates were allowed to vary from meaning to meaning, a very crude allowance was made for borrowing, and there was talk of allowing the rates to vary from language to language and from time to time (further details, along with some models, can be found in Embleton, 1986). Some of these are useful refinements (e.g., allowing replacement and borrowing rates to vary over time, between languages, and between meanings), but these are very much a mathematician’s way of solving the problem, producing quite elegant solutions. But in the real world in which the linguist works, all those parameters have to be estimated – and this is surprisingly difficult even for well-studied languages for which we actually have both long and extensive written records. It is also the case that an extraordinary amount of data is required to estimate so many parameters, and this is most often simply impossible. So these elegant theoretical solutions have very little practical import.
However, the single most important development was the incorporation of a geographical dimension into the family tree, thus allowing for borrowing to be largely dealt with. The basic insight here is due to Sankoff (1972), with correction and further refinement by Embleton (see 1986 for the most comprehensive treatment). The resulting set of differential equations required to reconstruct the tree and estimate the branch lengths is complicated, reflecting the increased power of the method (equations are given in Embleton, 1986, and elsewhere). The parameters required all have some reasonable chance of being estimated, even for lesser studied families; these are a similarity measure for each language pair, replacement rates for each language, borrowing rates between each language pair (not necessarily symmetric), and knowledge of which languages are in contact with one another.
Starostin
Largely independently of the work described above, research on numerical methods in historical linguistics was proceeding in the Soviet Union (later Russia or ‘in diaspora’). Unlike the impetus for Swadesh’s work, which was largely to aid in providing histories and trees for relatively little studied language families (e.g., Amerindian), much of this work gained its impetus from questions of more distant linguistic relationship. This had its effect on the development of the methods, because it was crucial that the methods be able to produce useful results even at much greater time depths, where Swadesh’s method typically has difficulty distinguishing the signal (cognacy, true genetic relationship) from the noise (chance resemblance, contact relationship).
Historically, the first advance here was due to Dolgopolsky (1986), who capitalized on the fact that different meanings change at different rates. He compiled a list of the 15 most stable meanings, based on examination of 140 languages. One can then calculate the probability of fortuitous coincidence of the words representing those 15 most stable meanings for any language pair; if the probability is significantly less than that expected by mere chance, this can be taken as evidence of relationship. This method is most striking in its extreme simplicity, but can also be related to some more complex and ongoing work, largely carried out in Eastern Europe, founded on the observation that several characteristics of a word (e.g., length, number of meanings, age) are intercorrelated with its frequency. Words of more recent origin tend to be less frequent, longer, and have fewer meanings, whereas older words tend to be more frequent, shorter, and have more meanings. The age–frequency relationship in particular is being extensively researched, seeking a small hard core of ancient short vocabulary that may reveal distant relationships, and may even be able to be harnessed into a lexicostatistical or even glottochronological method.
Another advance has been made by Sergei Starostin, with work first published in the 1980s and continuing until his death in 2005, known as root glottochronology or etymostatistics (see Starostin, 1999). Compared to standard Swadesh-style glottochronology, the main innovation is the observation that while glottochronology regards any change (however small) in a word’s meaning as a loss of that word from that meaning, root glottochronology does not. Thus, while standard glottochronology sees English dog and German Hund in the meaning ‘dog’ as noncognate, root glottochronology notices the existence of the English word hound; this more conservative approach to the maintenance of a language’s root stock enables one to work better at greater time depths. Starostin’s method also gives more accurate results for known cases where the time depths are more shallow (standard glottochronology tends to give dates that are too recent). Root glottochronology takes a text, containing a given set of words in the parent language, and then measures the number of these words still extant (in any derivative, with any semantic shift) in the daughter language. It proceeds only with detailed comparative historical and etymological analysis
(eliminating look-alikes and loans), and can also be repeated for more than one text (hence a different set of words), giving greater precision and potentially some estimate of error (see Starostin, 1999, for formulas and more detailed explanation).
numerical methods while not providing any magic solutions to any long-standing problems, make a useful contribution as one of the tools that we use to explore our linguistic history and prehistory.
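As a rough illustration of the two closed-form calculations mentioned earlier (the tail probability of shared crosses and the Swadesh time depth), the following Python sketch may be useful. It is not taken from Embleton's text and rests on stated assumptions: that the tail probability is a hypergeometric sum over N rows, with n1 and n2 crosses for the two languages and r shared crosses; that the time depth is t = -(1/(2R)) ln(c/N); and that a replacement rate of "14% per millennium" can be entered as R = 0.14, giving t in millennia. The function names are hypothetical.

```python
import math


def shared_tail_probability(N, n1, n2, r):
    """Probability of r or more shared crosses (assumed hypergeometric null).

    N  -- number of rows (features) compared
    n1 -- crosses for the first language
    n2 -- crosses for the second language
    r  -- observed number of rows with a cross in both columns
    """
    if not (0 <= r and 0 <= n1 <= N and 0 <= n2 <= N):
        raise ValueError("need 0 <= r and 0 <= n1, n2 <= N")
    # Terms with R > n2 are zero, so the sum can stop at min(n1, n2).
    upper = min(n1, n2)
    numerator = sum(
        math.comb(n1, R) * math.comb(N - n1, n2 - R)
        for R in range(r, upper + 1)
    )
    return numerator / math.comb(N, n2)


def time_depth(c, N, R):
    """Time depth t = -(1 / (2R)) * ln(c / N), in the time unit of R.

    c -- number of cognates between the two lists
    N -- number of meanings compared
    R -- assumed constant replacement rate per unit time
    """
    if not (0 < c <= N) or R <= 0:
        raise ValueError("need 0 < c <= N and R > 0")
    # ln(N / c) equals -ln(c / N); this form avoids returning -0.0.
    return math.log(N / c) / (2 * R)
```

Under these assumptions, time_depth(74, 100, 0.14) comes out at a little over one millennium. The sketch ignores every complication discussed above (variable rates, borrowing, uncertain cognacy judgments), so it shows only the arithmetic, not a usable dating procedure.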