Chapter Two
Text/Document Operations and Automatic Indexing
Statistical Properties of Text
o How is the frequency of different words distributed?
o How fast does vocabulary size grow with the size of a corpus?
Such factors affect the performance of an IR system and can be used to select
suitable term weights and other aspects of the system.
o A few words are very common.
The 2 most frequent words (e.g. "the", "of") can account for about 10% of word occurrences.
o Most words are very rare.
Half the words in a corpus appear only once; such words are known as hapax legomena ("said only once")
o This is called a "heavy-tailed" distribution, since most of the probability mass is in
the "tail" of the distribution
Sample Word Frequency Data
Frequencies from 336,310 documents in the 1 GB TREC Volume 3 corpus: 125,720,891 total
word occurrences; 508,209 unique words.
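These properties can be observed directly by counting word frequencies. The following is a minimal sketch in Python, using a tiny made-up text in place of the TREC collection; the figures it prints apply only to this toy sample:

```python
from collections import Counter

# Hypothetical toy corpus; a real study would use a large collection such as TREC.
text = """the cat sat on the mat and the dog sat on the log
the cat and the dog are friends of the owner of the house"""

tokens = text.split()
freq = Counter(tokens)
total = sum(freq.values())

# Share of all occurrences taken by the 2 most frequent words
top2 = sum(count for _, count in freq.most_common(2))
print(f"top-2 share: {top2 / total:.0%}")

# Words that appear only once (hapax legomena)
hapax = [w for w, c in freq.items() if c == 1]
print(f"{len(hapax)} of {len(freq)} unique words occur only once")
```

Even on this toy sample the same shape appears: the top two words cover a large share of all occurrences, while roughly half the unique words occur exactly once.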
Zipf’s distributions
Rank Frequency Distribution
For all the words in a collection of documents, for each word w:
f is the frequency with which w appears
r is the rank of w in order of frequency (the most commonly occurring word has rank 1, etc.)
o The table shows the most frequently occurring words from the 336,310-document
collection containing 125,720,891 total words, of which 508,209 are unique.
Zipf’s law: modeling word distribution
o The collection frequency of the ith most common term is proportional to 1/i:
fi ∝ 1/i
If the most frequent term occurs f1 times, then the second most frequent term has
half as many occurrences, the third most frequent term has a third as many, etc.
o Zipf's Law states that the frequency of the ith most frequent word is 1/i^θ times that
of the most frequent word.
The occurrence of some event (P), as a function of the rank (i), where the rank is
determined by the frequency of occurrence, is a power-law function: Pi ~
1/i^θ, with the exponent θ close to unity.
More Examples: Zipf’s Law
o Illustration of the Rank-Frequency Law. Let the total number of word occurrences in
the sample be N = 1,000,000
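The illustration can be sketched numerically. A common form of the rank-frequency law predicts that the rank-i word occurs about A·N/i times; the constant A ≈ 0.1 often quoted for English text is an assumption here, not a figure from the notes above:

```python
# Sketch of the rank-frequency illustration under Zipf's law with theta = 1.
N = 1_000_000  # total word occurrences in the sample
A = 0.1        # assumed empirical Zipf constant for English text

for rank in (1, 2, 3, 10, 100, 1000):
    predicted = A * N / rank
    print(f"rank {rank:>4}: predicted frequency ~ {predicted:,.0f}")
```

So the most frequent word is predicted to occur about 100,000 times, the second about 50,000 times, the third about 33,333 times, and so on, halving/thirding exactly as the law describes.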
o Luhn (1958) suggested that both extremely common and extremely uncommon
words were not very useful for document representation & indexing.
Heaps' distribution
o Distribution of the size of the vocabulary: Heaps' law states a power-law relationship
between vocabulary size V and the number of tokens n, V = K·n^β (with 0 < β < 1);
the relationship is linear on a log-log scale
Example
o We want to estimate the size of the vocabulary for a corpus of 1,000,000 words.
However, we only know statistics computed on smaller corpora sizes:
For 100,000 words, there are 50,000 unique words
For 500,000 words, there are 150,000 unique words
What is the estimated vocabulary size for a corpus of 1,000,000 words?
How about for a corpus of 1,000,000,000 words?
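Taking Heaps' law V = K·n^β, the two known (n, V) points determine K and β, which can then be used to extrapolate. A minimal sketch of the calculation:

```python
import math

# Fit Heaps' law V = K * n**beta from the two known (corpus size, vocabulary) points.
n1, v1 = 100_000, 50_000
n2, v2 = 500_000, 150_000

beta = math.log(v2 / v1) / math.log(n2 / n1)  # exponent (about 0.68 here)
K = v1 / n1 ** beta                           # scaling constant

def vocab_size(n: float) -> float:
    """Predicted vocabulary size for a corpus of n tokens."""
    return K * n ** beta

print(f"beta = {beta:.3f}, K = {K:.2f}")
print(f"V(1e6) ~ {vocab_size(1e6):,.0f}")  # roughly 241,000 unique words
print(f"V(1e9) ~ {vocab_size(1e9):,.0f}")  # roughly 27 million unique words
```

Note that the extrapolation is sublinear: going from 10^6 to 10^9 tokens (a factor of 1000) grows the predicted vocabulary by only a factor of about 1000^0.68 ≈ 110.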
Text Operations
o Not all words in a document are equally significant to represent the
contents/meanings of a document
Some words carry more meaning than others
Nouns are the most representative of a document's content
o Therefore, we need to preprocess the text of the documents in a collection to select
index terms. Using the full set of words in a collection to index documents creates
too much noise for the retrieval task
Reducing noise means reducing the set of words that can be used to refer to a
document
o Preprocessing is the process of controlling the size of the vocabulary, i.e. the
number of distinct words used as index terms. Preprocessing generally leads to an
improvement in information retrieval performance. However, some search
engines on the Web omit preprocessing: every word in the document is an index
term
o Text operations are the process of transforming text into a logical
representation
o There are 5 main operations for selecting index terms, i.e. for choosing the words/stems
(or groups of words) to be used as index terms:
Lexical analysis (tokenization) of the text
Elimination of stopwords
Stemming of the remaining words
Selection of index terms
Construction of term categorization structures (thesauri)
o However, stemming may hurt precision, because users can no longer target just a
particular form.
o Stemming may be performed using
Algorithms that strip off suffixes according to substitution rules.
Large dictionaries that provide the stem of each word.
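The suffix-stripping approach can be sketched with a few substitution rules. This is a toy illustration, not a full Porter stemmer; the particular rules and the minimum-length guard are illustrative assumptions:

```python
# Toy suffix-stripping stemmer: apply the first matching substitution rule.
# The rule list is illustrative; a real stemmer (e.g. Porter's) has many more
# rules plus conditions on the shape of the remaining stem.
RULES = [
    ("sses", "ss"),  # classes  -> class
    ("ies",  "y"),   # ponies   -> pony
    ("ing",  ""),    # indexing -> index
    ("s",    ""),    # cats     -> cat
]

def stem(word: str) -> str:
    for suffix, replacement in RULES:
        # Length guard keeps very short words (e.g. "is") untouched.
        if word.endswith(suffix) and len(word) > len(suffix) + 2:
            return word[: -len(suffix)] + replacement
    return word

for w in ["classes", "ponies", "indexing", "cats", "is"]:
    print(w, "->", stem(w))
```

Rule order matters: "sses" must be tried before the bare "s" rule, otherwise "classes" would be stripped to "classe" instead of "class".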
Index term selection (indexing)
o Objective: Increase efficiency by extracting from the resulting document a
selected set of terms to be used for indexing the document.
If a full-text representation is adopted, then all words are used for indexing.
o Indexing is a critical process: a user's ability to find documents on a particular
subject is limited by whether the indexing process has created index terms for that
subject.
o Indexing can be done manually or automatically.
o Historically, manual indexing was performed by professional indexers associated
with library organizations. However, automatic indexing is more common now
(or, with full text representations, indexing is altogether avoided).
What are the advantages of manual indexing?
Ability to perform abstractions (conclude what the subject is) and determine additional
related terms.
Ability to judge the value of concepts.
What are the advantages of automatic indexing?
Reduced cost: Once initial hardware cost is amortized, operational cost is cheaper than
wages for human indexers.
Reduced processing time
Improved consistency.
o Controlled vocabulary: Index terms must be selected from a predefined set of
terms (the domain of the index).