Level statistics of words: Finding keywords in literary texts and symbolic sequences
Statistical keyword extraction is a critical step in information science, with multiple applications in text-mining and information-retrieval systems [1]. Since Luhn [2] proposed the analysis of the frequency of occurrence of words in a text as a method for keyword extraction, many refinements have been developed. With a few exceptions [3], the basic principle for keyword extraction is comparison to a corpus of documents taken as a reference. For a collection of documents, modern term-weighting schemes use the frequency of a term in a document and the proportion of documents containing that term [4]. Following a different approach, the probabilistic model of information retrieval related the significance of a term to its frequency fluctuations between documents [5-7]. The frequency-analysis approach to keyword detection seems to work properly in this context.

However, a more general approach should try to detect keywords in a single text without knowing a priori the subject of the text, i.e., without using a reference corpus. The applications of such an algorithm are clear: internet searches, data mining, automatic classification of documents, etc. In this case, the information provided by the frequency of a word is not very useful, since there are no other texts to compare against. In addition, frequency analysis is of little use in a single document for two main reasons: (i) two words with very different relevance in the text can have a similar frequency (see Fig. 1); (ii) a randomization of the text preserves the frequency values but destroys the information, which must therefore also be stored in the ordering of the words, and not only in the words themselves. Thus, to detect keywords, we propose using the spatial distribution of the words along the text and not only their frequencies, in order to take into account the structure of the text as well as its composition.

Inspired by the level statistics of quantum-disordered systems following random matrix theory [8], Ortuño et al. [9] have shown that the spatial distribution of a relevant word in a text is very different from that of a nonrelevant word. In this approach, each occurrence of a particular word is considered an "energy level" e_i within an "energy spectrum" formed by all the occurrences of the analyzed word within the text. The value of an energy level e_i is given simply by the position of the analyzed word in the text. For example, in the sentence "A great scientist must be a good teacher and a good researcher," the spectrum corresponding to the word "a" is formed by three energy levels (1, 6, 10). Figure 1 shows an example from a real book. Following the physics analogy, the nearest-neighbor spacing distribution P(d) was used in [9] to characterize the spatial distribution of a particular word, and to show the relationship between word clustering and word semantic meaning. P(d) is obtained as the normalized histogram of the set of distances (or spacings) (d_1, d_2, ..., d_n) between consecutive occurrences of a word, with d_i = e_{i+1} - e_i. As seen in Fig. 1, a nonrelevant word (such as "but") is placed at random along the text, while a relevant word (such as "Quixote") appears in the text forming clusters, and this difference is reflected in their corresponding P(d) distributions. In the case of a relevant word the energy levels attract each other, while for a nonrelevant word the energy levels are uncorrelated and therefore distributed at random; so the higher the relevance of a word, the larger the clustering (the attraction) and the larger the deviation of P(d) from the random expectation. The connection between word attraction (clustering) and relevance comes from the fact that a relevant word is usually the main subject of local contexts, and therefore it appears more often in some areas and less frequently in others, giving rise to clusters.

FIG. 1. Spectra of the words "Quixote" (288 occurrences shown) and "but" (248 occurrences shown) obtained in the first 50 000 words of the book Don Quixote, by Miguel de Cervantes. Both words have a similar frequency in the whole text (around 2150), and also in the part shown in the figure. (Horizontal axis: position, in words.)
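To make the construction concrete, here is a minimal sketch of the spectrum and of P(d) in Python. This is my own illustration, not the authors' code; the tokenizer, the file name, and the lowercasing are assumptions (the last chosen so that "A" and "a" share one spectrum, as in the example sentence above), and positions are 1-indexed to match the (1, 6, 10) convention.

```python
from collections import Counter
import re

def spectrum(tokens, word):
    """'Energy levels' e_i: 1-indexed positions of `word` in the
    token stream, matching the (1, 6, 10) example above."""
    return [i + 1 for i, t in enumerate(tokens) if t == word]

def spacing_distribution(levels):
    """P(d): normalized histogram of spacings d_i = e_{i+1} - e_i."""
    spacings = [b - a for a, b in zip(levels, levels[1:])]
    return {d: c / len(spacings)
            for d, c in sorted(Counter(spacings).items())}

# Hypothetical usage on a local copy of Don Quixote (file name assumed;
# texts can be obtained from Project Gutenberg, see note [15]):
text = open("quixote.txt", encoding="utf-8").read().lower()
tokens = re.findall(r"[a-z]+", text)
print(spacing_distribution(spectrum(tokens, "quixote")))
```

For a clustered word, P(d) piles weight onto small spacings (inside clusters) plus a tail of large inter-cluster gaps; a randomly placed word shows the geometric-like decay expected by chance.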
We present here an approach to the problem of keyword detection which does not need a corpus and which is based on the principle that real keywords are clustered in the text (as in [9]); in addition, we introduce the idea that the clustering must be statistically significant, i.e., not due to statistical fluctuations. This is of fundamental importance when analyzing any text, but it is critical when analyzing short documents (articles, etc.), where fluctuations are important since all the words have small frequencies. As the statistical fluctuations depend on the frequency of the corresponding word (see below), our approach combines the information provided by the clustering of a word (the spatial structure along the text) with that provided by its frequency. Furthermore, we extend the approach to symbolic sequences in which the word boundaries are not known. In particular, we model a generic symbolic sequence (i.e., a chain of "letters" from a certain alphabet) …

[FIG. 2. (a) ⟨σ_nor⟩ versus word count n for p = 0.01, 0.05, and 0.1 (inset: ⟨σ⟩ versus n). (b) ⟨σ_nor⟩ and sd(σ_nor), and the distribution P(σ_nor) for n = 50, 100, and 500.]
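The frequency dependence of these fluctuations (the quantities ⟨σ_nor⟩(n) and sd(σ_nor)(n) plotted in Fig. 2) can be estimated by Monte Carlo using the random-text model of note [14]: a word occurring with probability p is modeled as n marks placed at random along the sequence. The sketch below is a hedged illustration, not the paper's procedure; in particular, it uses a stand-in clustering measure σ, the standard deviation of the normalized spacings d_i/⟨d⟩ in the spirit of [9], since the precise definition of σ_nor lies in a part of the paper outside this excerpt.

```python
import numpy as np

rng = np.random.default_rng(0)

def sigma(levels):
    # Stand-in clustering measure (assumption): std of the normalized
    # spacings d_i / <d>. Uncorrelated placements give values near 1;
    # clustered placements give larger values.
    d = np.diff(np.asarray(levels))
    return float(np.std(d / d.mean()))

def random_moments(p, n, trials=2000):
    # Monte Carlo estimate of <sigma>(n) and sd(sigma)(n) for a word
    # of count n placed at random, per the binary-sequence model of
    # note [14].
    length = int(n / p)  # sequence length implied by frequency p
    vals = [sigma(np.sort(rng.choice(length, size=n, replace=False)))
            for _ in range(trials)]
    return float(np.mean(vals)), float(np.std(vals))
```

Running this for decreasing n shows sd(σ) growing as words become rarer, which is exactly why a fixed clustering threshold cannot be applied uniformly to words of different frequency.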
C(\sigma_{\mathrm{nor}}, n) \equiv \frac{\sigma_{\mathrm{nor}} - \langle \sigma_{\mathrm{nor}} \rangle(n)}{\mathrm{sd}(\sigma_{\mathrm{nor}})(n)}. \qquad (4)

TABLE I. The first 20 "words" extracted from the book Relativity: The Special and General Theory, by A. Einstein, with spaces and punctuation marks removed.
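With Monte Carlo estimates of the moments in hand, Eq. (4) is a plain standardization: a z-score of the observed clustering against the random expectation at the same frequency. A minimal sketch, reusing the hypothetical sigma and random_moments helpers from the previous block:

```python
def C_score(levels, p, trials=2000):
    # Eq. (4): C = (sigma_nor - <sigma_nor>(n)) / sd(sigma_nor)(n),
    # with the stand-in sigma playing the role of sigma_nor.
    mean_s, sd_s = random_moments(p, n=len(levels), trials=trials)
    return (sigma(levels) - mean_s) / sd_s
    # A large positive C flags clustering too strong to be a fluctuation.
```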
We find that words with semantic meaning and their parents are strongly clustered, while irrelevant and common words are randomly distributed.

For keyword extraction we use two principles: (i) for any word length ℓ, we apply a threshold to C to remove irrelevant words, which is taken as a percentile of the C distribution (usually a p value ≤ 0.05); and (ii) we explore the "lineages" of all words (from short, "general" ℓ-words to longer, "specialized" ones) to extract only the words with semantic meaning and none of their parents, which might also be highly clustered. The lineage of a word can easily be established by iteratively detecting the child word with the highest C value (see the sketch below). The result of such an algorithm [21] for a famous book without spaces and punctuation marks can be seen in Table I, and for other books in [16]. It is remarkable that this algorithm, based on the spatial attraction of the relevant words, when applied to a text without spaces is able to detect not only real hidden keywords but also combinations of words, or whole sentences, full of meaning for the text considered, thus supporting the validity of our approach.

In conclusion, our algorithm has proven to work properly on a generic symbolic sequence, and is thus potentially useful for analyzing other specific symbolic sequences of interest, for example spoken language, where only sentences, but not individual words, can be preceded and followed by silence, or DNA sequences, where commaless codes are the rule.

We thank the Spanish Junta de Andalucía (Grant Nos. P07-FQM3163 and P06-FQM1858) and the Spanish Government (Grant No. BIO2008-01353) for financial support.
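Note [21] spells out the lineage procedure step by step; the sketch below paraphrases it in code. The words_by_length mapping (word length → words occurring in the spaceless text), the score callable (the C value of Eq. (4)), and the substring test defining a child are representational assumptions of mine, not the authors' implementation.

```python
def extract_semantic_units(words_by_length, score, C0, l0, l_max):
    # Lineage exploration per note [21]: start from each significant
    # l0-word, repeatedly follow the child (one letter longer, still
    # containing the current word) with the highest C, then keep the
    # longest member of the lineage with C > C0 as the semantic unit.
    units = set()
    for w in words_by_length.get(l0, []):
        if score(w) <= C0:
            continue
        lineage, cur = [w], w
        for l in range(l0 + 1, l_max + 1):
            kids = [c for c in words_by_length.get(l, []) if cur in c]
            if not kids:
                break
            cur = max(kids, key=score)
            lineage.append(cur)
        # w itself satisfies C > C0, so this max is never empty
        units.add(max((u for u in lineage if score(u) > C0), key=len))
    return units
```

As in [21], one would repeat this for several initial values of ℓ0 and then remove the redundancies (repeated words or semantic units) across the resulting lineages.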
[1] WordNet: An Electronic Lexical Database, edited by C. Fellbaum (MIT Press, Cambridge, MA, 1998); E. Frank et al., Proceedings of the 16th International Joint Conference on Artificial Intelligence, 1999, p. 668; L. van der Plas et al., in Proceedings of the 4th International Conference on Language Resources and Evaluation, edited by M. T. Lino, M. F. Xavier, F. Ferreira, R. Costa, and R. Silva (European Language Resources Association, 2004), p. 2205; K. Cohen and L. Hunter, PLOS Comput. Biol. 4, e20 (2008).
[2] H. P. Luhn, IBM J. Res. Dev. 2, 157 (1958).
[3] Y. Matsuo and M. Ishizuka, Int. J. Artif. Intell. 13, 157 (2004); G. Palshikar, in Proceedings of the Second International Conference on Pattern Recognition and Machine Intelligence (PReMI07), Vol. 4815 of Lecture Notes in Computer Science, edited by A. Ghosh, R. K. De, and S. K. Pal (Springer-Verlag, Berlin, 2007), p. 503.
[4] G. Salton and M. J. McGill, Introduction to Modern Information Retrieval (McGraw-Hill, New York, 1983); K. Sparck Jones, J. Document. 28, 11 (1972); S. E. Robertson et al., in Overview of the Third Text REtrieval Conference (TREC-3), NIST SP 500-225, edited by D. K. Harman (National Institute of Standards and Technology, Gaithersburg, MD, 1995), pp. 109-126.
[5] A. Bookstein and D. R. Swanson, J. Am. Soc. Inf. Sci. 25, 312 (1974); 26, 45 (1975).
[6] S. P. Harter, J. Am. Soc. Inf. Sci. 26, 197 (1975).
[7] A. Berger and J. Lafferty, Proc. ACM SIGIR99, 1999, p. 222; J. Ponte and W. Croft, Proc. ACM SIGIR98, 1998, p. 275.
[8] T. A. Brody et al., Rev. Mod. Phys. 53, 385 (1981); M. L. Mehta, Random Matrices (Academic Press, New York, 1991).
[9] M. Ortuño et al., Europhys. Lett. 57, 759 (2002).
[10] P. Carpena, P. Bernaola-Galván, and P. C. Ivanov, Phys. Rev. Lett. 93, 176804 (2004).
[11] H. D. Zhou and G. W. Slater, Physica A 329, 309 (2003); M. J. Berryman, A. Allison, and D. Abbott, Fluct. Noise Lett. 3, L1 (2003).
[12] V. T. Stefanov, J. Appl. Probab. 40, 881 (2003).
[13] S. Robin and J. J. Daudin, Ann. Inst. Stat. Math. 53, 895 (2001).
[14] We simulate random texts as random binary sequences 010010100001…. The symbol "1" appears with probability p and models a word in a text, and the symbol "0" accounts for the rest of the words with probability 1 - p, so in a realistic case p is very small.
[15] All the texts analyzed have been downloaded from the Project Gutenberg web page: www.gutenberg.org
[16] http://bioinfo2.ugr.es/TextKeywords
[17] H. J. Bussemaker, H. Li, and E. D. Siggia, Proc. Natl. Acad. Sci. U.S.A. 97, 10096 (2000).
[18] We use only lowercase letters to reduce the alphabet.
[19] Here the nearest-neighbor distances are measured in letters, not in words, since we advance letter by letter while exploring the text. Thus we ignore the "reading frame," which would be relevant only in texts with fixed-length words placed consecutively, such as the words of length 3 (codons) in the coding regions of DNA. In addition, we consider only nonoverlapping occurrences of a word. Although overlapping is not very likely in a text without spaces, it can happen in other symbolic sequences, such as DNA, and the expected P(d) is different for overlapping and nonoverlapping randomly placed words (see S. Robin, F. Rodolphe, and S. Schbath, DNA, Words and Models (Cambridge University Press, Cambridge, England, 2005)).
[20] T. H. Cormen et al., Introduction to Algorithms (The MIT Press, Cambridge, MA, 1990), pp. 485-488.
[21] The algorithm proceeds as follows: we consider an initial length ℓ0 and start with a given ℓ0-word for which C > C0, where C0 corresponds to a p value = 0.05. We then find the child (ℓ0 + 1)-word with the highest C value, and proceed in the same way in the following generations: we find the successive child (ℓ0 + i)-word with the highest C value, i = 1, 2, .... This process stops at a previously chosen maximal word length ℓmax. Finally, we choose as the representative of the lineage the longest word for which C > C0 and define it as the extracted semantic unit. We repeat this algorithm for all the ℓ0-words with C > C0, and we also repeat the whole process for different initial values of ℓ0. Finally, we remove the remaining redundancies (repeated words or semantic units) that arise from exploring lineages from different initial ℓ0.