
Text Databases and Information Retrieval

ELLEN RILOFF and LEE HOLLAAR


Department of Computer Science, University of Utah <riloff,hollaar@cs.utah.edu>

Copyright © 1996, CRC Press.
ACM Computing Surveys, Vol. 28, No. 1, March 1996

The goal of a traditional information retrieval (IR) system is to search an information repository, such as a text database, and retrieve documents that are potentially relevant to a query. Since query-based IR systems must operate in real time, they must be able to search large volumes of text quickly and efficiently. Other information-retrieval applications, such as text categorization, text routing, and text filtering, are also becoming increasingly important. These applications are generally concerned with long-term information needs, where a topic is expected to be of interest for an extended period of time. Text categorization systems assign predefined category labels to texts. For example, a text categorization system for computer science might use categories such as operating systems, programming languages, artificial intelligence, or information retrieval. Text routing systems typically accept a set of user profiles and automatically classify texts so that relevant texts can be routed to appropriate users [Harman 1994]. Text filtering systems accept a list of topics that are, or are not, of interest and allow only texts that satisfy the filter to pass through to the user [Belkin and Croft 1992]. Text categorization systems are typically applied to static databases, while text routing and text filtering systems are usually applied to incoming data streams.

Information-retrieval systems must grapple with all of the ambiguities and idiosyncrasies inherent in natural language, such as synonymy (e.g., “start”, “begin”, and “initiate” have essentially the same meaning) and polysemy (e.g., “shot” has many different meanings, including the act of shooting, an injection, a quantity of liquor, a photograph, pellets, or an attempt). Phrases also require special attention because multiword expressions often have a composite meaning different from the individual words. For example, a “hot dog” does not usually refer to a warm canine, and an “operating system” does not usually refer to a system that is simply operating.

Most information-retrieval systems preprocess a document collection into an inverted file that allows the system to determine quickly which documents contain a given word. Stopword lists are commonly used to remove highly frequent words, such as “the” and “of,” under the assumption that they don’t contribute much to the meaning of a text. Stemming algorithms are sometimes used to reduce a word to its root form so that different morphological variations will match [Frakes and Baeza-Yates 1992]. An alternative text-representation scheme uses superimposed codewords to produce a fixed-length vector from the binary representations of words. The fixed-length vector is especially useful for parallel and hardware systems, but this method can sometimes falsely signal words that don’t actually appear in the original document.

Traditional information-retrieval methods retrieve documents by searching for relevant words or phrases. Most commercial IR systems allow the user to define a query using keywords and standard Boolean operators. These systems retrieve documents that precisely match the query. The vector-space model [Salton
1971] is a well-known method for automatic indexing that views each document and query as a vector in an N-dimensional space, where N is the number of relevant terms in the database. The query vector is compared to all of the document vectors using a similarity metric. Another retrieval model for automatic indexing uses probability estimates to determine whether a document satisfies a user’s query. For example, Bayesian inference networks have been used to compute the belief associated with a query for each document in a database.

Relevance feedback techniques can improve performance by asking the user for feedback about the retrieved texts [Salton 1989; Van Rijsbergen 1979]. The user labels a subset of the retrieved texts as relevant, and this information is fed back into the system to modify the original query, usually by adding new terms or by changing the weights of the original query terms. Relevance feedback has consistently been shown to improve the performance of IR systems.

Experiments with richer text representations have also been conducted using natural-language processing (NLP) techniques. Syntactic approaches have been used to generate more complex indexing terms consisting of phrases and head-modifier structures. Knowledge-based NLP systems have been used to generate conceptual meaning representations of queries and documents. Information extraction techniques [Lehnert and Sundheim 1991] have also been shown to be effective for text classification problems, and represent a compromise between word-based techniques and in-depth natural-language processing.

The future holds great promise for integrating information-retrieval techniques with natural-language processing systems. The strengths of these methodologies are largely complementary. IR systems use shallow text representations, which allows them to process large amounts of text quickly and efficiently. But the accuracy of these systems often suffers because of a lack of semantic analysis, especially for complex information requests. Natural-language processing systems, on the other hand, usually perform conceptual analyses, which allows them to produce richer meanings and representations. However, NLP techniques are more computationally expensive and therefore are more difficult to scale up to large text collections.

The information-retrieval community is facing new challenges posed by larger and more heterogeneous text databases, which have led to an explosion of new approaches and methodologies. As longer texts become available on-line, new approaches are needed to process texts that discuss multiple topics. A variety of techniques for subtopic identification and passage-based retrieval are actively being explored. Another area of active research is intelligent information retrieval, which draws upon techniques from artificial intelligence to generate richer text representations. Natural-language processing methods (such as information extraction), case-based reasoning techniques, and machine learning algorithms are all being applied to information retrieval tasks in the hopes of building more effective retrieval systems (for example, see ACM [1995]). Intelligent information retrieval is an exciting new direction for IR research.

REFERENCES

ACM. 1995. Proceedings of the 18th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. ACM, New York.
BELKIN, N. AND CROFT, W. B. 1992. Information filtering and information retrieval: Two sides of the same coin? Commun. ACM 35, 12, 29–38.
FRAKES, W. B. AND BAEZA-YATES, R., EDS. 1992. Information Retrieval: Data Structures and Algorithms. Prentice-Hall, Englewood Cliffs, NJ.
HARMAN, D., ED. 1994. The Second Text REtrieval Conference (TREC2). National Institute of Standards and Technology Special Publication 500-215, Gaithersburg, MD.
LEHNERT, W. G. AND SUNDHEIM, B. 1991. A performance evaluation of text analysis technologies. AI Mag. 12, 3, 81–94.
SALTON, G., ED. 1971. The SMART Retrieval System: Experiments in Automatic Document Processing. Prentice-Hall, Englewood Cliffs, NJ.
SALTON, G. 1989. Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer. Addison-Wesley, Reading, MA.
VAN RIJSBERGEN, C. J. 1979. Information Retrieval (2nd Ed.). Butterworths, London.
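
As a rough illustration of several ideas surveyed in this article (an inverted file, stopword removal, a Boolean keyword query, and vector-space ranking with a similarity metric), the following Python sketch may be helpful. It is not taken from any of the cited systems. The stopword list, the three toy documents, and all function names are hypothetical, and the vectors use raw term counts with cosine similarity, which is only one of many possible weighting and similarity choices.

```python
import math
from collections import Counter, defaultdict

# Hypothetical stopword list and toy collection, invented for illustration.
STOPWORDS = {"the", "of", "a", "an", "and", "is", "to", "in"}
DOCS = {
    "d1": "the operating system schedules the process",
    "d2": "a photograph of a hot dog",
    "d3": "the system retrieves a document in the database",
}

def tokenize(text):
    """Lowercase, split on whitespace, and drop stopwords (no stemming)."""
    return [w for w in text.lower().split() if w not in STOPWORDS]

def build_inverted_file(docs):
    """Map each term to the set of document ids that contain it."""
    index = defaultdict(set)
    for doc_id, text in docs.items():
        for term in tokenize(text):
            index[term].add(doc_id)
    return index

def boolean_and(index, query):
    """Return ids of documents that contain every non-stopword query term."""
    terms = tokenize(query)
    if not terms:
        return set()
    postings = [index.get(t, set()) for t in terms]
    return set.intersection(*postings)

def cosine(u, v):
    """Cosine similarity between two sparse term-count vectors."""
    dot = sum(u[t] * v[t] for t in u if t in v)
    norm_u = math.sqrt(sum(c * c for c in u.values()))
    norm_v = math.sqrt(sum(c * c for c in v.values()))
    return dot / (norm_u * norm_v) if norm_u and norm_v else 0.0

def rank(docs, query):
    """Score every document against the query; best match first."""
    q = Counter(tokenize(query))
    scored = [(cosine(q, Counter(tokenize(text))), doc_id)
              for doc_id, text in docs.items()]
    return sorted(scored, key=lambda pair: (-pair[0], pair[1]))

INDEX = build_inverted_file(DOCS)
```

Under these assumptions, `boolean_and(INDEX, "operating system")` returns only the documents that contain both terms exactly, whereas `rank(DOCS, "operating system")` orders every document by similarity, so a partial match such as `d3` still receives a nonzero score.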
