Professional Documents
Culture Documents
Munro 03 Queuing
Munro 03 Queuing
Robert Munro
University of Sydney
rmunro@it.usyd.edu.au
#'(
/10230
54 $ .
0.2 0.4
0.15 0.3
0.1 0.2
0.05 0.1
0 0
2 4 6 8 10 12 14 16 18 20 2 4 6 8 10 12 14
Figure 1: Fits for the term ‘Dalai’ Figure 4: Fits for the term ‘Australia’
0.3 0.06
QU(x) QU2(x)
P(x) P2(x)
’doc_freq/the__DT’ ’term_freq/Iraq__NNP’
0.25 0.05
0.2 0.04
0.15 0.03
0.1 0.02
0.05 0.01
0 0
10 20 30 40 50 60 2 4 6 8 10 12 14 16 18 20
Figure 2: Fits for the term ‘the’ Figure 5: Fits for the term ‘Iraq’
0.16 0.2
QU(x) QU2(x)
P2(x) QU(x)
’doc_freq/Kabul__NNP’ 0.18 P2(x)
0.14 P(x)
’doc_freq/Kashmir__NNP’
0.16
0.12
0.14
0.1
0.12
0.08 0.1
0.08
0.06
0.06
0.04
0.04
0.02
0.02
0 0
2 4 6 8 10 12 14 16 18 20 2 4 6 8 10 12 14 16 18 20
Figure 3: Fits for the term ‘Kabul’ Figure 6: Fits for the term ‘Kashmir’
4 Testing Framework POS doc freq term freq
noun 95.9% 99.9%
4.1 Corpus
prop n. 93.5% 99.8%
The corpus used was a set of 50,000 Reuters verb 90.5% 100.0%
newswires from the year 1996. This constituted adj 91.8% 100.0%
approximately 15 million words. Newswires adverb 90.9% 100.0%
make an interesting register for this sort of anal- all 94.3% 99.9%
ysis as each article typically addresses only one
main topic or event, which is probably a better Table 1: Percent of terms for which !&":
was a
delimiter than using the raw number of words or better fit than
sentences.
The words were marked up for part-of-speech POS doc freq term freq
by the brill-tagger (Brill, 1992). The terms con- noun 94.9% 94.6%
sidered were the word/POS pairs, so that a rough prop n. 91.6% 90.9%
division between uses occurred. After some con- verb 90.5% 89.9%
sideration, it was decided that the words should adj 89.6% 89.0%
not be stemmed or lemmatised - either the distri- adverb 89.8% 87.5%
butions of all cases/tenses of a word are similar, all 93.1% 92.6%
or there are differences which may themselves be
interesting to investigate. Table 2: Percent of terms for which !&"
was
To reduce noise, words that occurred with a fre- a better fit than
Level
better fit than (troubled) Murtaza Bhutto, Portillo, Sum-
people: mers, Pen, Dutroux, Espy,
(troubled) tolars, dinars, kroons, dirhams,
Comparisons were made between the fits of all currencies: bolivars, drachmas,
Poisson / queue utilisation combinations. There (troubled) Loewen, Michelin, Tractebel,
were no significant results for which the single companies: Olivetti, Erste, Truck, DWT,
Poisson was more accurate than a double QU. The other: inning, homer, vs, Manitoba,
other results are given in Tables 1, 2 and 3. canola, Chernobyl
The results demonstrate clearly that a queue
utilisation model of word frequency distributions Table 4: Terms for which the !&"#
for term frequency is given in Ta- bles 2 and 3 where the double Poisson was a little
ble 4 (my groupings). These distributions are not less able to improve the representation of some
emergent for the top 1% of words for document nouns.
frequency, given in Table 5. In brief, while a document frequency model
The trend for the term frequency distribution more accurately described the distribution of
continued, with the next 1% of containing terms terms, a term frequency model would seem to be
such as Ansett, Chechnya, Iraq, Mother Theresa, much more useful for comparing with Poisson to
Pope, Tibet and Tyson, the one outlier being the extract topics from a very large volumes of infor-
determiner the. By contrast, the bottom 0.5% of mation.
terms, given in Table 6, are relatively mild and not
6 Discussion and Conclusion
particularly indicative of what the topics may be.
The ordering of terms for which a single There are some proper nouns that serve as more
!&":
was a better fit than a single Poisson cor- static exophoric referents than others. The ones
responded well to the ordering of terms for the in Table 4 are among the most strict, with the no-
double Poisson, with the majority of terms in the table absence of even the most common colloca-
from, which, with, it, an, then, spokesman, A third reason, that is worth speculating about
was, to, in, and, of, a, by, the, seek, not, but, but is hard to confirm in a study necessarily ab-
for, bullish, discounts, before, we, trader, stracted from the text itself by scale, is that the
added, at, day, will, as, casualties, been, up, phenomenon is capturing an underlying subjec-
because, durum, cruise, futures, its, on, be- tivity. In news reporting, the use of narratives of
tween, sharply, firmer, be, after, hand, this, conflict are not limited to articles describing mil-
have, wheat, while, nation, now itary conflicts, often being used in sports report-
ing. Here, only the terms inning and homer evi-
Table 5: Terms for which the !#"#
References
M. Banko and E. Brill. 2001. Scaling to very very
large corpora for natural language disambiguation.