
Automatic Indexing

• Automatic indexing is the process of analyzing an item to extract the information to be permanently
kept in an index.

• Classes of Automatic Indexing


1. Statistical Indexing
2. Natural Language
3. Concept Linkages
4. Hypertext Linkages
Statistical Indexing

• Statistical indexing uses frequency of occurrence of events to calculate a number that is used to indicate
the potential relevance of an item.
1. Probabilistic Weighting
• The use of probability theory is a natural choice because it is the basis of evidential reasoning (i.e.,
drawing conclusions from evidence).
• It also leads to an invariant result that facilitates integration of results from different databases.

Probability Ranking Principle (PRP) and its Plausible Corollary


HYPOTHESIS: If a reference retrieval system’s response to each request is a ranking of the documents in
the collection in order of decreasing probability of usefulness to the user who submitted the request, where
the probabilities are estimated as accurately as possible on the basis of whatever data is available for this
purpose, then the overall effectiveness of the system to its users is the best obtainable on the basis of that
data.
PLAUSIBLE COROLLARY: The most promising source of techniques for estimating the probabilities of
usefulness for output ranking in IR is standard probability theory and statistics.
• There are several factors that make this hypothesis and its corollary difficult to apply in practice.
• Probabilities are usually based upon a binary condition; an item is relevant or not.
• But in information systems the relevance of an item is a continuous function from non-relevant to
absolutely useful.
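As a minimal illustration of the ranking principle above, once per-document probability estimates are available from some model (the document names and values below are hypothetical), ranking reduces to sorting by decreasing estimated probability:

```python
# Probability Ranking Principle in its simplest operational form: return the
# documents ranked in order of decreasing estimated probability of usefulness.
estimated_prob = {"doc1": 0.82, "doc2": 0.15, "doc3": 0.47}   # assumed estimates

ranking = sorted(estimated_prob, key=estimated_prob.get, reverse=True)
print(ranking)   # ['doc1', 'doc3', 'doc2']
```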
2. Vector Weighting

• In information retrieval, each position in the vector typically represents a processing token.
• There are two approaches to the domain of values in the vector: binary and weighted.

• Under the binary approach, the domain contains the values one and zero, with one representing the
existence of the processing token in the item.
• Binary vectors require a decision process to determine whether the degree to which a particular processing
token represents the semantics of an item is sufficient to include it in the vector.
• In the weighted approach, the domain is typically the set of all real positive numbers.
• The value for each processing token represents the relative importance of that processing token in
representing the semantics of the item.
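A small sketch of the two domains of values, assuming a toy vocabulary of processing tokens and using the raw term frequency as the weight:

```python
from collections import Counter

# Hypothetical vocabulary of processing tokens (one vector position per token).
VOCAB = ["computer", "memory", "disk", "network", "storage"]

def binary_vector(tokens):
    """1 if the processing token occurs in the item, 0 otherwise."""
    present = set(tokens)
    return [1 if term in present else 0 for term in VOCAB]

def weighted_vector(tokens):
    """Positive real weights; here simply the raw term frequency."""
    counts = Counter(tokens)
    return [float(counts[term]) for term in VOCAB]

item = "computer memory and computer disk storage".split()
print(binary_vector(item))    # [1, 1, 1, 0, 1]
print(weighted_vector(item))  # [2.0, 1.0, 1.0, 0.0, 1.0]
```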
Major algorithms that can be used to calculate the weights that represent a processing token –

a. Simple Term Frequency Algorithm


• In both the binary (unweighted) and weighted approaches, an automatic indexing process implements an
algorithm to determine the weight to be assigned to a processing token for a particular item.
• In a statistical system, the data that are potentially available for calculating a weight are
• frequency of occurrence of the processing token in an existing item (i.e., term frequency - TF),
• frequency of occurrence of the processing token in the existing database (i.e., total frequency -
TOTF) and
• the number of unique items in the database that contain the processing token (i.e., item frequency
- IF, frequently labeled in other publications as document frequency - DF).

• The simplest approach is to have the weight equal to the term frequency.
• This approach emphasizes the use of a particular processing token within an item.
• Use of the absolute (raw) term frequency biases weights toward longer items, where a term is more likely to
occur with a higher frequency.
• Thus, one normalization typically used in weighting algorithms compensates for the number of words in
an item.
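The three statistics above (TF, TOTF, IF/DF) and the simple term frequency weight can be computed directly from a collection of tokenized items; the sketch below uses an assumed toy collection:

```python
from collections import Counter

# Toy collection; each "item" is already reduced to processing tokens.
items = [
    ["indexing", "automatic", "indexing"],
    ["statistical", "indexing", "weights"],
    ["automatic", "weights", "weights"],
]

term = "indexing"
tf_per_item = [Counter(item)[term] for item in items]     # TF: frequency within each item
totf = sum(tf_per_item)                                   # TOTF: frequency across the whole database
item_freq = sum(1 for tf in tf_per_item if tf > 0)        # IF/DF: number of items containing the term

print(tf_per_item, totf, item_freq)   # [2, 1, 0] 3 2

# Simple term frequency weighting: the weight of the term in each item is just TF,
# optionally divided by item length to compensate for longer items.
weights = [tf / len(item) for tf, item in zip(tf_per_item, items)]
```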
• The term frequency weighting formula used in TREC 4 was:

Weightij = [ (1 + log(TFij)) / (1 + log(average TFi)) ] / [ slope * (number of unique terms in item i) + (1 - slope) * pivot ]

• where slope was set at .2 and the pivot was set to the average number of unique terms occurring in the
collection.
• In addition to compensating for document length, the formula is also intended to be insensitive to
anomalies introduced by stemming or misspellings.

There are many approaches to account for different document lengths when determining the value of Term
Frequency to use (both approaches below are sketched in code after this list) –
• Maximum term frequency - the term frequency for each word is divided by the maximum frequency of
the word in any item.
◦ This normalizes the term frequency values to a value between zero and one.
◦ The problem with this technique is that the maximum term frequency can be so large that it decreases
the value of term frequency in short items to too small a value and loses significance.
• Logarithmic term frequency - the log of the term frequency plus a constant is used to replace the
term frequency.
◦ The log function will perform the normalization when the term frequencies vary significantly due to the size
of documents.
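A brief sketch of the two normalizations above, using assumed toy frequencies; for the logarithmic form, a constant of 1 inside the log is assumed so that a frequency of zero stays zero:

```python
import math

# Term frequencies of one word across the items of a collection (assumed toy values).
tfs = [3, 1, 0, 25]

# Maximum term frequency: divide by the word's largest frequency in any item,
# which places every value between zero and one.
max_tf = max(tfs)
max_norm = [tf / max_tf for tf in tfs]

# Logarithmic term frequency: replace TF by the log of (TF plus a constant);
# the constant 1 is an assumption so that log(1 + 0) = 0.
log_norm = [math.log(1 + tf) for tf in tfs]

print(max_norm)   # [0.12, 0.04, 0.0, 1.0]
print(log_norm)
```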

• Another approach recognizes that the normalization process may be over-penalizing long documents.
• To compensate, a correction factor based upon document length was defined that maps the cosine
function into an adjusted normalization function.
• The function determines the document-length crossover point for longer documents where the
probability of relevance equals the probability of retrieval (given a query set).
• This value, called the “pivot point”, is used to apply an adjustment to the normalization process.
Pivoted function = (slope) * (old normalization) + (1.0 – slope) * (pivot)
• Slope and pivot are constants for any document/query set.
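A minimal sketch of the pivoted function above, assuming the cosine (vector length) factor plays the role of the "old normalization" and that a document's raw term weight is divided by the pivoted factor:

```python
# Pivoted normalization, following the function given above. The constants slope
# and pivot are fixed for a given document/query set.
SLOPE = 0.2   # the value reported for the TREC experiments above

def pivoted_normalization(old_norm: float, pivot: float, slope: float = SLOPE) -> float:
    return slope * old_norm + (1.0 - slope) * pivot

def pivoted_weight(raw_weight: float, old_norm: float, pivot: float) -> float:
    # Divide the raw weight by the adjusted (pivoted) factor rather than by the
    # old normalization alone (an assumed usage of the corrected factor).
    return raw_weight / pivoted_normalization(old_norm, pivot)

print(pivoted_weight(3.0, old_norm=12.5, pivot=10.0))   # hypothetical values
```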
b. Inverse Document Frequency
• The basic algorithm is improved by taking into consideration the frequency of occurrence of the
processing token in the database.
• One of the objectives of indexing an item is to discriminate the semantics of that item from other items
in the database.
• Algorithm - the weight assigned to a term should be inversely proportional to the number of items in the
database in which that term occurs.
• The un-normalized weighting formula is:

WEIGHTij = TFij * [ log2(n) - log2(IFj) + 1 ]
where -
• WEIGHTij is the vector weight that is assigned to term “j” in item “i”,
• TFij (term frequency) is the frequency of term “j” in item “i”,
• “n” is the number of items in the database and
• IFj (item frequency or document frequency) is the number of items in the database that have term “j” in
them.
• Subtracting log2(IFj) is the same as taking the log of n divided by IFj, so the weight is inversely related to
the term’s document frequency, which is the basis for the name of the algorithm.
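A sketch of the un-normalized TF * IDF weighting above, applied to an assumed toy collection:

```python
import math
from collections import Counter

def idf_weights(items):
    """Un-normalized TF * IDF weights per item, following the formula above.

    items: list of token lists; returns one dict of term -> weight per item.
    """
    n = len(items)                    # number of items in the database
    item_freq = Counter()             # IFj: number of items containing term j
    for item in items:
        item_freq.update(set(item))

    weighted = []
    for item in items:
        tf = Counter(item)            # TFij: frequency of term j in item i
        weighted.append({
            term: tf[term] * (math.log2(n) - math.log2(item_freq[term]) + 1)
            for term in tf
        })
    return weighted

items = [["automatic", "indexing"], ["indexing", "statistics"], ["automatic", "automatic"]]
# Both terms of the first item occur in 2 of the 3 items, so they receive equal weight here.
print(idf_weights(items)[0])
```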
c. Signal Weighting

• Inverse document frequency adjusts the weight of a processing token for an item based upon the number
of items that contain the term in the existing database.
• What it does not account for is the term frequency distribution of the processing token in the items that
contain the term.
• The distribution of the frequency of a processing token across the items that contain it can affect the ability
to rank items.

• In Information Theory, the information content value of an object is inversely proportional to the
probability of occurrence of that object.
• An instance of an event that occurs all the time has less information value than an instance of a seldom
occurring event.
• This is typically represented as INFORMATION = -Log2(p), where p is the probability of occurrence of
the event.
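A sketch of the information measure above, comparing an assumed uniform distribution with an assumed skewed one for a term's frequencies across the items that contain it; this is the kind of distributional difference signal weighting is intended to capture:

```python
import math

def information(p: float) -> float:
    """INFORMATION = -log2(p): rare events carry more information than common ones."""
    return -math.log2(p)

# Term frequency of one processing token in the items that contain it.
# A uniform distribution vs. a skewed one with the same total frequency (TOTF = 20):
uniform = [5, 5, 5, 5]
skewed  = [17, 1, 1, 1]

def average_information(tfs):
    """Average information of the term's frequency distribution (an assumed way to
    quantify how uniform or concentrated the distribution is)."""
    totf = sum(tfs)
    return sum((tf / totf) * information(tf / totf) for tf in tfs)

print(average_information(uniform))  # 2.0: every item equally likely
print(average_information(skewed))   # ~0.85: the distribution is concentrated
```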
d. Discrimination Value
• Another approach to creating a weighting algorithm is to base it upon the discrimination value of a
term.
• To achieve the objective of finding relevant items, it is important that the index discriminates among
items.
• The more all items appear the same, the harder it is to identify those that are needed.
• Discrimination value for each term “i”:

DISCRIMi = AVESIMi - AVESIM

where
• AVESIM is the average similarity between every pair of items in the database and
• AVESIMi is the same calculation except that term “i” is removed from all items.
• There are three possibilities with the DISCRIMi value being positive, close to zero or negative.
• A positive value indicates that removal of term “i” has increased the similarity between items. In this
case, leaving the term in the database assists in discriminating between items and is of value.
• A value close to zero implies that the term’s removal or inclusion does not change the similarity
between items.
• If the value of DISCRIMi is negative, the term’s effect on the database is to make the items appear more
similar since their average similarity decreased with its removal.
• Once the value of DISCRIMi is normalized as a positive number, it can be used in the standard weighting
formula as:

WEIGHTij = TFij * DISCRIMj (term “j” in item “i”, following the earlier notation)
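A sketch of the discrimination value computation, assuming cosine similarity between items represented as term frequency Counters (the toy documents are hypothetical):

```python
import math
from collections import Counter
from itertools import combinations

def cosine(a: Counter, b: Counter) -> float:
    dot = sum(a[t] * b[t] for t in a)
    na = math.sqrt(sum(v * v for v in a.values()))
    nb = math.sqrt(sum(v * v for v in b.values()))
    return dot / (na * nb) if na and nb else 0.0

def average_similarity(items):
    """AVESIM: average similarity over every pair of items."""
    pairs = list(combinations(items, 2))
    return sum(cosine(a, b) for a, b in pairs) / len(pairs)

def discrimination_value(items, term):
    """DISCRIMi = AVESIMi - AVESIM: positive when removing the term makes items more alike."""
    avesim = average_similarity(items)
    without = [Counter({t: c for t, c in item.items() if t != term}) for item in items]
    return average_similarity(without) - avesim

items = [Counter(doc.split()) for doc in
         ["automatic indexing of items",
          "statistical indexing of items",
          "hypertext linkages between items"]]
print(discrimination_value(items, "automatic"))  # positive: its removal raises average similarity
print(discrimination_value(items, "items"))      # negative: it occurs everywhere, so removing it lowers similarity
```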
3. Bayesian Model
