
A Type-Token Identity in the Simon-Yule Model of Text

Ye-Sho Chen Department of Quantitative Business Analysis, Louisiana State University, Baton Rouge, LA 70803
Ferdinand F. Leimkuhler

School of Industrial Engineering, Purdue University, West Lafayette, IN 47907

There are three significant results in this paper. First, we establish a type-token identity relating the type-token ratio and the bilogarithmic type-token ratio. The plays of Shakespeare and other interesting texts serve as demonstrative examples. Second, the Simon-Yule model of Zipf's law is used to derive the type-token identity and provide a promising statistical model of text generation. Third, a realistic refinement of the Simon-Yule model is made to allow for a decreasing entry rate of new words. Simulation methods are used to show that the type-token identity is preserved with this change in assumptions.

1. Introduction

Linguistics is a relatively old discipline that has taken on new life because of recent developments in natural language understanding and artificial intelligence. Edmundson [4,5] classified the field of linguistics into three subfields: linguistic metatheory, which is the study of the universal properties of natural languages; computational linguistics, in which the use of a computer is indispensable; and mathematical linguistics, in which the use of mathematical methods and theories is paramount. Within mathematical linguistics itself, Edmundson followed Y. Bar-Hillel's dichotomy of algebraic linguistics and statistical linguistics, which comprise deterministic and stochastic models of language, respectively. Edmundson continued that deterministic models and stochastic models deserve equal attention; it was noted that models of linguistic competence have typically been deterministic, while models of linguistic performance have been stochastic. In any case, researchers such as Miller and Chomsky [14] have also stressed this difference and have indicated their preferences and reasons.

Deterministic models of text include grammatical models and semantic models. Stochastic models include models of text generation, models of sentence structure, models of vocabulary size, models of rank-frequency relations, and models of type-token relations. Type-token models are concerned with relationships between the number of different words (types) and the total number of words (tokens) in a literary text. Usually, each word form is regarded as a distinct type, but sometimes inflected word forms are counted with the root word; see Yule [25] and Thorndike [23]. Recently, Edmundson [5] classified models of type-token relations as: the type-token ratio [8], the bilogarithmic type-token ratio [2,8], and indices of vocabulary richness [6,7] and vocabulary concentration [25]. In this paper, we identify an interesting identity relating the type-token ratio and the bilogarithmic type-token ratio. We give a theoretical justification of the identity based on the stochastic model of text proposed by Simon in 1955 [16]. The model is called the Simon-Yule model because Simon derived the same equations as Yule [24], who studied a biological problem. This model is useful for its ability to describe the rank-frequency relations observed by Zipf [26], the so-called Zipf's law. Subsequently, a basic assumption underlying the Simon-Yule model is relaxed to increase realism. The modified model is tested and also shown to support the type-token identity. Further refinements relating to computational models of text generation are discussed.

2. The Type-Token Identity

Received July 22, 1986; revised December 10, 1986; accepted December 19, 1986. © 1989 by John Wiley & Sons, Inc.

Define t as the total number of words, or tokens, used in a text, and V_t as the number of different words, or types, found in the same text. Compared with the bilogarithmic type-token ratio ln V_t/ln t, the type-token ratio V_t/t is not as stable. However, when the two ratios are summed, the following identity is approximately true:

    V_t/t + ln V_t/ln t = 1.    (1)
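For any tokenized text, both ratios in Eq. (1) take only a few lines of code to compute. A minimal sketch (the helper `type_token_stats` is our own illustration, not code from the paper):

```python
from math import log

def type_token_stats(tokens):
    """Compute the type-token ratio V_t/t, the bilogarithmic
    ratio ln(V_t)/ln(t), and their sum (Eq. 1)."""
    t = len(tokens)           # number of tokens
    v = len(set(tokens))      # number of types (distinct words)
    ttr = v / t
    bilog = log(v) / log(t)
    return ttr, bilog, ttr + bilog

# Check against the Hamlet row of Table 1: t = 29,551 tokens, V_t = 4700 types.
t, v = 29551, 4700
print(v / t)                      # ≈ 0.159
print(log(v) / log(t))            # ≈ 0.821
print(v / t + log(v) / log(t))    # ≈ 0.980
```

The printed values reproduce the last three columns of the Hamlet row in Table 1.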

JOURNAL OF THE AMERICAN SOCIETY FOR INFORMATION SCIENCE. 40(1):45-53, 1989. CCC 0002-8231/89/010045-09$04.00

The last column of Tables 1 and 2 provides several examples. Table 1 shows the fourteen comedies, the fourteen tragedies, and the ten historical plays of Shakespeare [19]. Table 2 lists examples of type-token relationships in texts written in Czech by different authors [21]; the texts are drawn from diverse fields, including fiction, popular, and scientific literature. Elsewhere, Herdan [8] showed that the bilogarithmic type-token ratio provided linear fits for sixty-one works of Chaucer and two works of Pushkin. In Tables 1 and 2, a bilogarithmic ratio of about 0.8 shows up in almost all texts. In 1957, the mathematician Devooght [3] derived the constancy of the bilogarithmic type-token ratio from the generalized Zipf's law [12] under the assumption

that the vocabulary size V_t approaches infinity as does the text length t. Herdan [8, p. 26] pointed out: "It appears, however, that to use the Zipf-Mandelbrot law as the starting point for the derivation of the bilogarithmic type/token relation makes matters unnecessarily complicated, and even unsatisfactory from the mathematical point of view, since, in order to arrive at the relation, Devooght has to make certain assumptions which are not in accordance with reality, e.g. to let the vocabulary approach infinity." As an alternative, Herdan assumed that the growth rate of the vocabulary is proportional simultaneously to (a) increasing text length, (b) the size of the vocabulary already accumulated, and (c) the particular conditions of writing,

TABLE 1. The 14 comedies, the 14 tragedies, and the 10 historical plays of Shakespeare [19].

Play                             t        V_t    V_t/t   ln V_t/ln t   V_t/t + ln V_t/ln t

COMEDY
1. Comedy of Errors              14,369   2522   0.176   0.818   0.994
2. The Tempest                   16,036   3149   0.196   0.832   1.028
3. A Midsummer Night's Dream     16,087   2984   0.185   0.826   1.012
4. Two Gentlemen of Verona       16,883   2718   0.161   0.812   0.973
5. Twelfth Night                 19,401   3096   0.160   0.814   0.974
6. The Taming of the Shrew       20,411   3240   0.159   0.815   0.973
7. Much Ado About Nothing        20,768   2954   0.142   0.804   0.946
8. Merchant of Venice            20,921   3265   0.156   0.813   0.969
9. Love's Labour's Lost          21,033   3772   0.179   0.827   1.007
10. Merry Wives of Windsor       21,119   3267   0.155   0.813   0.967
11. Measure for Measure          21,269   3325   0.156   0.814   0.970
12. As You Like It               21,305   3248   0.152   0.811   0.964
13. All's Well That Ends Well    22,550   3513   0.156   0.815   0.970
14. A Winter's Tale              24,543   3913   0.159   0.818   0.978

TRAGEDY
1. Macbeth                       16,436   3306   0.201   0.835   1.036
2. Pericles                      17,723   3270   0.185   0.827   1.012
3. Timon of Athens               17,748   3269   0.184   0.827   1.011
4. Julius Caesar                 19,110   2867   0.150   0.808   0.958
5. Titus Andronicus              19,790   3397   0.172   0.822   0.994
6. Two Noble Kinsmen             23,403   3895   0.166   0.822   0.988
7. Anthony and Cleopatra         23,742   3906   0.165   0.821   0.986
8. Romeo and Juliet              23,913   3707   0.155   0.815   0.970
9. King Lear                     25,221   4166   0.165   0.822   0.987
10. Troilus and Cressida         25,516   4251   0.167   0.823   0.990
11. Othello                      25,887   3783   0.146   0.811   0.957
12. Coriolanus                   26,579   4015   0.151   0.814   0.965
13. Cymbeline                    26,778   4260   0.159   0.820   0.979
14. Hamlet                       29,551   4700   0.159   0.821   0.980

HISTORICAL PLAY
1. King John                     20,386   3576   0.175   0.825   1.000
2. Henry VI Part 1               20,515   3812   0.186   0.830   1.015
3. Richard II                    21,809   3671   0.168   0.822   0.990
4. Henry VI Part 3               23,295   3581   0.154   0.814   0.968
5. Henry VIII                    23,295   3558   0.153   0.813   0.966
6. Henry IV Part 1               23,955   3817   0.159   0.818   0.977
7. Henry VI Part 2               24,450   4058   0.166   0.822   0.988
8. Henry V                       25,577   4562   0.178   0.830   1.009
9. Henry IV Part 2               25,706   4122   0.160   0.820   0.980
10. Richard III                  28,309   4092   0.145   0.811   0.956


TABLE 2. Examples of type-token relationships in Czech [21].

Text   t        V_t    V_t/t   ln V_t/ln t   V_t/t + ln V_t/ln t
1      8374     1337   0.160   0.797   0.957
2      8729     2116   0.242   0.844   1.086
3      9290     2360   0.254   0.850   1.104
4      10,103   2831   0.280   0.862   1.142
5      14,714   2790   0.190   0.827   1.017
6      17,249   1831   0.106   0.770   0.876
7      18,085   2372   0.131   0.792   0.923
8      18,448   3088   0.167   0.818   0.985
9      20,340   3870   0.190   0.833   1.023
10     20,603   3502   0.170   0.822   0.992
11     21,640   5006   0.231   0.853   1.085
12     21,963   4145   0.189   0.833   1.022
13     23,231   3970   0.171   0.824   0.995
14     23,802   3381   0.142   0.806   0.948
15     24,353   4858   0.199   0.840   1.040
16     25,658   4308   0.175   0.827   1.002
17     26,908   3577   0.133   0.802   0.935
18     28,084   6111   0.218   0.851   1.069
19     29,803   5559   0.187   0.837   1.024
20     29,813   4516   0.151   0.817   0.968
21     30,145   6498   0.216   0.851   1.067
22     30,281   5469   0.181   0.834   1.015
23     31,195   4188   0.134   0.806   0.940
24     31,250   4366   0.140   0.810   0.950
25     31,655   3388   0.107   0.784   0.891
26     32,972   4750   0.144   0.814   0.958
27     33,700   5916   0.176   0.833   1.009
28     33,774   6265   0.185   0.838   1.024
29     35,187   6927   0.197   0.845   1.042
30     35,273   6939   0.197   0.845   1.041
31     29,360   5539   0.189   0.838   1.027
32     47,542   8673   0.182   0.842   1.024
33     55,164   8675   0.157   0.831   0.988
such as style, content, etc. Treating V_t and t as continuous variables, he established two differential equations from these assumptions and derived the bilogarithmic type-token relation. Mandelbrot [12] also attempted to explain the type-token relationship by Zipf's law. First, he assumed [13, p. 216] that the uses of successive words are independent, as in the multinomial urn model. Second, he treated V_t and t as continuous variables. Finally, he approximated the sum of certain terms by integration. The type-token identity in Eq. (1) is quite stable except when t is relatively small or large. Table 3 shows that when t ≤ 1100 or t ≥ 884,647, the equation does not hold. So we ask: why does Equation (1) remain stable, and under what conditions is it true? We pursue these questions in the following sections after introducing the Simon-Yule model of text generation.

3. The Simon-Yule Model of Text Generation

According to Simon, the stochastic process by which words are chosen to be included in written text is a twofold process. Words are selected by an author by processes of association (i.e., sampling earlier segments of his word sequences) and by imitation (i.e., sampling from other works, by himself or by other authors). Simon's selection processes are stated in the following assumptions, where f(n, t) is the number of different words that have occurred exactly n times in the first t words.

Assumption I: There is a constant probability, α, that the (t + 1)-st word is a new word, that is, a word that has not occurred in the first t words.

Assumption II: The probability that the (t + 1)-st word is a word that has appeared n times is proportional to nf(n, t), that is, to the total number of occurrences of all the words that have appeared exactly n times.

A theoretical justification of Eq. (1) based on these two assumptions is discussed in Section 4. With regard to the first assumption of a constant probability α, Simon [9, p. 4] noted: "We cannot conclude from this that the theory should be rejected; the only valid conclusion to be drawn is that the theory is only a first approximation and that the next step in the investigation is to look for an additional mechanism that should be incorporated in the theory so as to give a better second approximation." This problem is discussed further in Section 5.
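The two assumptions translate directly into a short generative procedure. A minimal sketch (the function `simon_yule` and its parameters are our own illustration, not code from the paper): drawing the next token uniformly from the text generated so far selects each existing word with probability proportional to its current frequency, which is exactly Assumption II.

```python
import random
from math import log

def simon_yule(n_tokens, alpha, seed=0):
    """Generate token IDs under the Simon-Yule assumptions.
    With probability alpha the next token is a brand-new word
    (Assumption I); otherwise it repeats a token drawn uniformly
    from the text so far, which picks each old word with
    probability proportional to its frequency (Assumption II)."""
    rng = random.Random(seed)
    text = [0]              # V_0 = 1: a seed word that starts the text
    next_id = 1
    for _ in range(n_tokens):
        if rng.random() < alpha:
            text.append(next_id)           # a new word enters
            next_id += 1
        else:
            text.append(rng.choice(text))  # imitation/association
    return text[1:]         # exclude the seed word, as in Section 4

tokens = simon_yule(20000, alpha=0.16)
t, v = len(tokens), len(set(tokens))
print(v / t)                    # close to alpha, as Lemma 1 predicts
print(v / t + log(v) / log(t))  # close to 1, as in Eq. (1)
```

With α = 0.16 and t = 20,000, the generated text lands in the same range as the Shakespeare plays of Table 1.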


TABLE 3. Examples of type-token relationship [15].

Text                                                     t        V_t      V_t/t   ln V_t/ln t   V_t/t + ln V_t/ln t
1. La Corse by Rousseau (French)                         300      157      0.523   0.886   1.410
2. La Corse                                              1100     446      0.405   0.871   1.277
3. I-III John (Greek)                                    2599     623      0.240   0.818   1.058
4. Language of Peiping (Chinese)                         13,248   3332     0.252   0.855   1.106
5. La Corse                                              19,400   3223     0.166   0.818   0.984
6. The Captain's Daughter by Pushkin (Russian)           29,345   4783     0.163   0.824   0.987
7. Plautus (Latin)                                       33,094   8437     0.255   0.869   1.124
8. Epistles of Paul (Greek)                              37,327   7281     0.195   0.845   1.040
9. American Newspaper (English) (Compiled by Eldridge)   43,989   6002     0.136   0.814   0.950
10. Moby Dick by Melville (English)                      136,800  14,172   0.104   0.808   0.912
11. Ulysses by Joyce (English)                           260,430  29,899   0.115   0.826   0.941
12. Shakespeare (English)                                884,647  31,534   0.036   0.757   0.792
13. Shakespeare (Taylor poem)*                           429      258      0.601   0.916   1.518

*This poem was recently discovered by Shakespeare scholar Gary Taylor on November 14, 1985 [11, 20]. In their interesting paper, Thisted and Efron [22] used a nonparametric empirical Bayes model [6] to examine the poem and found that it was actually written by Shakespeare. The poem is called the Taylor poem after its discoverer.

Assumption II in the Simon-Yule model is intended to incorporate imitation and association as the basis for a stochastic model of text generation. Simon [9, p. 97] argued: "The rationale ... (is) as follows: Writing and speaking involve both imitative and associative processes. They involve imitation because any given piece of writing or speech is simply a segment from the whole stream of communication in the language. The subjects of communication depend, in large measure, upon what subjects have been previously communicated about in the language, and are being communicated about contemporaneously. Vocabulary choice depends sensitively on the choices of other writers and speakers (e.g. the choice, in American English, among auto, vehicle, and car, automobile, motor)." Simon continued: "Communication also involves association, because the associative processes in the communicator's memory are used by him in generating the sentences he writes or utters. Both imitation and association will tend to evoke any particular word with a probability somewhat proportional to its frequency of occurrence in the language, and to its frequency of use by the communicator." Successive refinements of the second assumption are discussed in Section 6.

4. A Theoretical Justification of the Identity Based on the Simon-Yule Model

Let us define V_0 = 1, which serves as the first old word for the generation of text. Without loss of generality and for the convenience of mathematical treatment, we do not include this word in the type-token study. The following lemma shows that V_t/t = α when t is large. This is true in any written text, where t is usually over 10,000 words.

Lemma 1: For t = 1, 2, ..., we have

(1)  lim_{t→∞} V_t/t = α with probability one;
(2)  var(V_t/t) = α(1 − α)/t.

Proof: For t = 1, 2, ..., define

    X_t = 1 if the t-th word is a new word, and X_t = 0 if the t-th word is an old word.

Then P(X_t = 1) = α and P(X_t = 0) = 1 − α. That is, X_t has a Bernoulli distribution with parameter α. Also, the random variables X_1, X_2, ..., X_t are independent. Thus, V_t = X_1 + X_2 + ··· + X_t has a binomial distribution with the two parameters t and α.


(1) From the strong law of large numbers, we have

    lim_{t→∞} V_t/t = lim_{t→∞} (X_1 + ··· + X_t)/t = α with probability one.

(2) var(V_t) = tα(1 − α) implies var(V_t/t) = α(1 − α)/t.

Theorem 1: For t = 1, 2, ...,

    ln V_t / ln t = 1 + (ln α)/(ln t) + a term of smaller order,

with probability one.

Proof: For t = 1, 2, ..., we have

    ln V_t / ln t = [ln t + ln(V_t/t)] / ln t = 1 + ln(V_t/t) / ln t.

By the Mean Value Theorem,

    f(V_t/t) − f(α) = f′(α*)(V_t/t − α),

where f(x) is continuous for V_t/t ≤ x ≤ α (or V_t/t ≥ x ≥ α) and possesses a derivative at each x for V_t/t < x < α (or V_t/t > x > α), and α* is a number between V_t/t and α. Let f(x) = ln x, x > 0; then

    ln(V_t/t) = ln α + (1/α*)(V_t/t − α),

and

    ln V_t / ln t = 1 + (ln α)/(ln t) + (1/(α* ln t))(V_t/t − α).

From Lemma 1, one has

    ln V_t / ln t = 1 + (ln α)/(ln t) + a term of smaller order,

with probability one.

When t is as large as the size of a written text, the term of smaller order indicated above may be neglected. In such cases, we can consider the possible range for ln V_t/ln t in the following corollary.

Corollary 1: Let 0 < α_min ≤ α ≤ α_max < 1 and t_min ≤ t ≤ t_max; then

    1 + (ln α_min)/(ln t_min) ≤ ln V_t/ln t ≤ 1 + (ln α_max)/(ln t_max).

Proof: Since α_min, α, and α_max are all less than one and greater than zero, we have ln α_min ≤ ln α ≤ ln α_max < 0. Also, since ln t_min ≤ ln t ≤ ln t_max, we have

    (ln α_min)/(ln t_min) ≤ (ln α)/(ln t) ≤ (ln α_max)/(ln t_max).

From Theorem 1, we obtain

    1 + (ln α_min)/(ln t_min) ≤ ln V_t/ln t ≤ 1 + (ln α_max)/(ln t_max).

Corollary 1 shows that, given the values (or the ranges) of α and t, we can predict the value (or the range) of ln V_t/ln t. Two examples are illustrated below. In Table 1, we see that the last two columns have very stable numbers: the column ln V_t/ln t has numbers around 0.8, and the column V_t/t + ln V_t/ln t has numbers around 1.00. We explain this interesting phenomenon in the following corollary.

Corollary 2: If 0.145 ≤ α ≤ 0.201 and 16,436 ≤ t ≤ 29,551, then

    0.801 ≤ ln V_t/ln t ≤ 0.844

and

    0.946 ≤ V_t/t + ln V_t/ln t ≤ 1.045.

Proof: Using Corollary 1, the proof is immediate.

As we can see in Table 1, the numbers in the ln V_t/ln t column are all within 0.801-0.844, and the numbers in the last column are all within 0.946-1.045. In Table 2, we also notice the stable patterns in the last two columns: the column ln V_t/ln t has numbers around 0.8, and the column V_t/t + ln V_t/ln t has numbers around 1.00. The following corollary explains this interesting phenomenon.

Corollary 3: If 0.106 ≤ α ≤ 0.280 and 8,374 ≤ t ≤ 55,164, then

    0.752 ≤ ln V_t/ln t ≤ 0.883

and

    0.858 ≤ V_t/t + ln V_t/ln t ≤ 1.163.

Proof: Using Corollary 1, the proof is immediate.

As we can see in Table 2, the numbers in the ln V_t/ln t column are all within 0.752-0.883, and the numbers in the last column are all within 0.858-1.163.
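The bounds quoted in Corollaries 2 and 3 follow mechanically from Corollary 1 and can be checked numerically. A quick verification (the helper `bilog_bounds` is our own, using the α and t ranges stated above):

```python
from math import log

def bilog_bounds(a_min, a_max, t_min, t_max):
    """Corollary 1: bounds on ln(V_t)/ln(t) given the ranges of alpha and t."""
    lo = 1 + log(a_min) / log(t_min)
    hi = 1 + log(a_max) / log(t_max)
    return lo, hi

# Corollary 2 (Table 1, Shakespeare plays)
lo, hi = bilog_bounds(0.145, 0.201, 16436, 29551)
print(round(lo, 3), round(hi, 3))   # 0.801 0.844

# Corollary 3 (Table 2, Czech texts)
lo, hi = bilog_bounds(0.106, 0.280, 8374, 55164)
print(round(lo, 3), round(hi, 3))   # 0.752 0.883
```

Adding the α range to each bound reproduces the intervals 0.946-1.045 and 0.858-1.163 for the sum in Eq. (1).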

Simon recommended further refinement of his model by modifying the assumptions so as to better represent the real world. This process of successive approximation is demonstrated in his study of the size distribution of business firms [9], where he was able to give a significant economic explanation of the two assumptions and to show the effect of public policy on the size of firms. Furthermore, he explained the concavity to the origin of the log-log plot of Zipf's law by allowing for mergers and acquisitions, autocorrelated growth of firms, and a decreasing entry rate for new firms. Since the latter modification is directly applicable to text, it is considered below in greater detail. In this section the first assumption of the Simon-Yule model is modified [18] so that the entry rates for new


words are a decreasing function of the length of the text. That is,

Assumption I′: There is a decreasing probability function α(t), 0 ≤ α(t) ≤ 1, that the (t + 1)-st word is a new word, that is, a word that has not occurred in the first t words.

Even for this slight modification, the problem is analytically intractable. We cannot extend the proofs of Lemma 1 and Theorem 1 to examine Equation (1), since V_t no longer has a binomial distribution. A good way to examine the type-token relationship is by using simulation methods. Simon used computer simulation methods to examine the fit of the stochastic models to word-frequency data under relaxation of the assumption of a constant rate of entry of new words. We do three experiments by choosing:

(1) α(t) = 0.5, 1 ≤ t ≤ 100; α(t) = 0.179, 100 < t ≤ 10,000.

(2) α(t) = 0.386, 1 ≤ t ≤ 1000; α(t) = 0.217, 1000 < t ≤ 2000; α(t) = 0.160, 2000 < t ≤ 3000; α(t) = 0.240, 3000 < t ≤ 4000; α(t) = 0.160, 4000 < t ≤ 5000; α(t) = 0.139, 5000 < t ≤ 10,557.

(3) α(t) = 0.217, 1 ≤ t ≤ 26,600; α(t) = 0.160, 26,600 < t ≤ 45,900; α(t) = 0.089, 45,900 < t ≤ 84,000; α(t) = 0.093, 84,000 < t ≤ 109,400; α(t) = 0.089, 109,400 < t ≤ 134,200; α(t) = 0.075, 134,200 < t ≤ 160,600; α(t) = 0.065, 160,600 < t ≤ 186,800; α(t) = 0.072, 186,800 < t ≤ 213,400; α(t) = 0.078, 213,400 < t ≤ 234,100.

These three functions are used in Simon's simulations of Zipf's law [18]. The rationale for the first function is that certain common function words come into a text at a very early stage, say t = 100 or 200. Thus, the entry rate is initially quite high, drops off rapidly, and then maintains itself at a relatively constant low level. The second function is estimated from a piece of continuous prose, 10,557 words in length, written by a schizophrenic, Jackson M. The last function comes from a very large sample of text from a Russian physics journal, 234,096 words in length. The simulation program, originally written by Brien Johnson [10], is modified and tested here. Table 4 to Table 6

TABLE 4. Type-token relationship with decreasing entry rates for new words: Case I.

t       V_t    V_t/t     ln V_t/ln t   V_t/t + ln V_t/ln t

Simulation I
4750    931    0.19600   0.80751   1.00351
5612    1099   0.19583   0.81112   1.00695
6474    1264   0.19524   0.81386   1.00910
7336    1411   0.19234   0.81479   1.00713
7767    1486   0.19132   0.81538   1.00670
8198    1546   0.18858   0.81488   1.00346
8629    1609   0.18646   0.81468   1.00115
9060    1673   0.18466   0.81460   0.99926
9491    1741   0.18344   0.81482   0.99826
9922    1820   0.18343   0.81571   0.99914

Simulation II
2879    594    0.20632   0.80185   1.00817
3598    730    0.20289   0.80520   1.00809
4317    864    0.20014   0.80780   1.00794
5036    988    0.19619   0.80894   1.00513
5755    1127   0.19583   0.81167   1.00750
6474    1264   0.19524   0.81386   1.00910
7193    1390   0.19324   0.81490   1.00815
7912    1504   0.19009   0.81504   1.00513
8631    1609   0.18642   0.81466   1.00108
9350    1715   0.18342   0.81451   0.99793

Simulation III
2784    569    0.20438   0.79982   1.00420
3794    790    0.20288   0.80705   1.00993
4449    886    0.19915   0.80790   1.00705
5559    1086   0.19536   0.81064   1.00600
6669    1290   0.19343   0.81343   1.00686
7779    1488   0.19128   0.81539   1.00667
8334    1570   0.18838   0.81510   1.00349
8889    1647   0.18529   0.81459   0.99987
9444    1735   0.18371   0.81489   0.99860
9999    1832   0.18322   0.81574   0.99896


TABLE 5. Type-token relationship with decreasing entry rates for new words: Jackson data.

t        V_t      V_t/t     ln V_t/ln t   V_t/t + ln V_t/ln t

Simulation I
3163     825      0.26083   0.83325   1.09408
3952     993      0.25127   0.83322   1.08449
4741     1114     0.23497   0.82889   1.06386
5530     1245     0.22514   0.82698   1.05212
6319     1357     0.21475   0.82422   1.03897
7108     1475     0.20751   0.82269   1.03020
7897     1574     0.19931   0.82028   1.01959
8686     1660     0.19111   0.81753   1.00864
9475     1755     0.18522   0.81585   1.00107
10,264   1860     0.18122   0.81507   0.99629

Simulation II
4936     1153     0.23359   0.82901   1.06260
5876     1293     0.22005   0.82556   1.04561
6816     1426     0.20921   0.82277   1.03199
7756     1560     0.20113   0.82093   1.02207
8461     1637     0.19348   0.81836   1.01184
8931     1686     0.18878   0.81674   1.00552
9166     1717     0.18732   0.81641   1.00373
9401     1747     0.18584   0.81605   1.00188
9871     1807     0.18306   0.81539   0.99845
10,341   1869     0.18074   0.81494   0.99567

Simulation III
33,338   5091     0.15271   0.81956   0.97226
55,560   8246     0.14842   0.82538   0.97380
66,671   9828     0.14741   0.82764   0.97505
88,893   12,845   0.14450   0.83024   0.97474
133,337  18,916   0.14187   0.83451   0.97638
155,559  21,972   0.14125   0.83628   0.97752
166,670  23,568   0.14141   0.83731   0.97872
188,892  26,628   0.14097   0.83873   0.97970
222,225  31,341   0.14103   0.84090   0.98193
233,336  32,900   0.14100   0.84151   0.98251
list simulation results for each function α(t). The interesting observations are

    ln V_t / ln t ≈ 0.80

and

    V_t/t + ln V_t/ln t ≈ 1.00.
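The refined process is straightforward to reproduce in outline. A sketch of the decreasing-entry-rate simulation for the Jackson M rates, function (2) above (this is our own re-implementation, not Johnson's program [10]; the helper names are ours):

```python
import random
from math import log

# Piecewise-constant entry rate alpha(t) for the Jackson M prose
# sample, function (2) in the text.
STEPS = [(1000, 0.386), (2000, 0.217), (3000, 0.160),
         (4000, 0.240), (5000, 0.160), (10557, 0.139)]

def alpha_jackson(t):
    for upper, a in STEPS:
        if t <= upper:
            return a
    return STEPS[-1][1]

def simulate(n_tokens, alpha_fn, seed=1):
    """Simon-Yule process with a decreasing entry rate alpha(t)."""
    rng = random.Random(seed)
    text = [0]                         # seed word, V_0 = 1
    next_id = 1
    for t in range(1, n_tokens + 1):
        if rng.random() < alpha_fn(t):
            text.append(next_id)       # a new word enters
            next_id += 1
        else:
            text.append(rng.choice(text))  # frequency-proportional repeat
    return text[1:]

tokens = simulate(10557, alpha_jackson)
t, v = len(tokens), len(set(tokens))
print(v / t, log(v) / log(t), v / t + log(v) / log(t))
```

The printed sum stays close to 1.00, in line with the Simulation I and II rows of Table 5.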

6. Conclusions

In this paper we have made three significant contributions:

(1) We establish a type-token identity relating the type-token ratio and the bilogarithmic type-token ratio. The plays of Shakespeare and other interesting texts serve as demonstrative examples.

(2) The type-token identity is derived from the Simon-Yule model, which is useful in explaining Zipf's law. An important implication of this result is to provide further support for the use of the Simon-Yule model as a promising statistical model of text generation.

(3) A realistic refinement of the Simon-Yule model is made by considering a decreasing entry rate for new words in the generation of text. Simulation methods are used to show that the type-token identity is preserved under this assumption.

Further refinements of the Simon-Yule model of text are possible and should be based on a deeper theoretical understanding of the nature of text generation. A promising way of doing this is to effect a closer relationship between statistical and computational models of text. Statistical models provide a powerful descriptive approach based on empirical observation. Computational models, on the other hand, take a constructive approach and attempt to create text that is similar to human writing through a deeper understanding of linguistic processes. Current computational approaches, however, do not take advantage of the inherent statistical properties of text generation. The authors believe that further refinement of the Simon-Yule model based on computational theory is a promising way to bridge the gap between the deterministic and stochastic approaches and to develop better models of text.

Acknowledgment
This research was supported in part by the National Science Foundation Grant IST-7911893Al and by the Council on Research of Louisiana State University.


TABLE 6. Type-token relationship with decreasing entry rates for new words: Russian data.

t        V_t      V_t/t     ln V_t/ln t   V_t/t + ln V_t/ln t

Simulation I
30,008   6235     0.20778   0.84758   1.05536
50,008   9163     0.18323   0.84316   1.02639
70,008   10,932   0.15615   0.83356   0.98971
80,008   11,793   0.14740   0.83041   0.97781
90,008   12,690   0.14099   0.82827   0.96925
110,008  14,496   0.13177   0.82541   0.95718
130,008  16,263   0.12509   0.82347   0.94856
150,008  17,811   0.11873   0.82121   0.93995
180,008  19,878   0.11043   0.81791   0.92834
220,008  22,789   0.10358   0.81568   0.91926

Simulation II
11,385   2423     0.21282   0.83434   1.04716
56,893   9794     0.17215   0.83931   1.01146
68,270   10,797   0.15815   0.83432   0.99247
79,647   11,753   0.14756   0.83044   0.97801
102,401  13,774   0.13451   0.82611   0.96062
113,778  14,812   0.13018   0.82487   0.95506
136,532  16,836   0.12331   0.82299   0.94630
159,286  18,502   0.11616   0.82028   0.93643
204,794  21,581   0.10538   0.81601   0.92139
227,548  23,396   0.10282   0.81558   0.91840

Simulation III
55,560   9671     0.17406   0.83997   1.01404
66,671   10,664   0.15995   0.83499   0.99494
77,782   11,595   0.14907   0.83099   0.98006
88,893   12,594   0.14168   0.82851   0.97018
100,004  13,580   0.13579   0.82658   0.96237
111,115  14,589   0.13130   0.82525   0.95655
133,337  16,548   0.12411   0.82318   0.94728
155,559  18,211   0.11707   0.82057   0.93764
177,781  19,726   0.11096   0.81812   0.92908
233,336  23,856   0.10224   0.81550   0.91774

References

1. Chen, Y. S. Statistical Models of Text: A System Theory Approach. Ph.D. dissertation, Purdue University; 1985.
2. Chotlos, J. Studies in Language Behavior. Psychological Monographs. V56; 1944.
3. Devooght, J. Sur la loi de Zipf-Mandelbrot. Bull. Cl. Sci. Acad. Roy. Belg. 4; 1957.
4. Edmundson, H. P. Statistical Inference in Mathematical and Computational Linguistics. International Journal of Computer and Information Sciences. 6(2):95-129; 1977.
5. Edmundson, H. P. Mathematical Models of Text. Information Processing & Management. 20(1-2):261-268; 1984.
6. Efron, B.; Thisted, R. Estimating the Number of Unseen Species: How Many Words Did Shakespeare Know? Biometrika. 63:435-447; 1976.
7. Guiraud, P. Les Caractères Statistiques du Vocabulaire. Paris: Presses Universitaires de France; 1954.
8. Herdan, G. Type-Token Mathematics: A Textbook of Mathematical Linguistics. The Hague: Mouton & Co.; 1960.
9. Ijiri, Y.; Simon, H. A. Skew Distributions and the Sizes of Business Firms. North-Holland Publishing Company; 1977.
10. Johnson, B. D. Analysis and Simulation of the Information Productivity of Scientific Journals. Master's thesis, School of Industrial Engineering, Purdue University; 1983.
11. Lelyveld, J. A Scholar's Find: Shakespearean Lyric. The New York Times, (November 24, 1985); 1-12. With corrections of Editors' Note, (November 25, 1985); 2.
12. Mandelbrot, B. An Information Theory of the Statistical Structure of Language. Proceedings of the Symposium on Applications of Communication Theory, London, September 1952. London: Butterworths; 1953: 486-500.
13. Mandelbrot, B. Final Note on a Class of Skew Distribution Functions: Analysis and Critique of a Model Due to H. A. Simon. Information and Control. 4:198-216; 1961.
14. Miller, G.; Chomsky, N. Finitary Models of Language Users. In: Luce, R.; Bush, R.; Galanter, E., eds. Handbook of Mathematical Psychology, Vol. II, pp. 419-491. New York: Wiley; 1963.
15. Parunak, A. Graphical Analysis of Ranked Counts (of Words). Journal of the American Statistical Association. 74(365):25-30; 1979.
16. Simon, H. A. On a Class of Skew Distribution Functions. Biometrika. 42:425-440; 1955.
17. Simon, H. A. Some Further Notes on a Class of Skew Distribution Functions. Information and Control. 3:80-88; 1960.
18. Simon, H. A. Some Monte Carlo Estimates of the Yule Distribution. Behavioral Science. 8:203-210; 1963.
19. Spevack, M. A Complete and Systematic Concordance to the Works of Shakespeare. Vols. I-IV. Hildesheim: Georg Olms; 1968.


20. Taylor, G. Shakespeare's New Poem: A Scholar's Clues and Conclusions. New York Times Book Review, (December 15, 1985); 11-14.
21. Tesitelova, M. On the So-Called Vocabulary Richness. Prague Studies in Mathematical Linguistics. 103-120; 1971.
22. Thisted, R.; Efron, B. Did Shakespeare Write a Newly-Discovered Poem? Technical Report No. 244. Department of Statistics, Stanford University; April 1986.
23. Thorndike, E. L. Book Review: National Unity and Disunity by G. K. Zipf. Science. July 4, 1941, V94, p. 19.
24. Yule, G. U. A Mathematical Theory of Evolution, Based on the Conclusions of Dr. J. C. Willis, F.R.S. Philosophical Transactions of the Royal Society of London, Series B. 213:21-87; 1924.
25. Yule, G. U. A Statistical Study of Vocabulary. Cambridge, England: Cambridge University Press; 1944.
26. Zipf, G. K. Human Behavior and the Principle of Least Effort. Reading, MA: Addison-Wesley; 1949.

