

Artificial Intelligence for Information Retrieval

Chapter · January 2008


DOI: 10.4018/9781599048499.ch023


1 author:

Thomas Mandl
Universität Hildesheim

All content following this page was uploaded by Thomas Mandl on 11 March 2019.



Encyclopedia of
Artificial Intelligence

Juan Ramón Rabuñal Dopico


University of A Coruña, Spain

Julián Dorado de la Calle


University of A Coruña, Spain

Alejandro Pazos Sierra


University of A Coruña, Spain

Information Science Reference
Hershey • New York
Director of Editorial Content: Kristin Klinger
Managing Development Editor: Kristin Roth
Development Editorial Assistant: Julia Mosemann, Rebecca Beistline
Senior Managing Editor: Jennifer Neidig
Managing Editor: Jamie Snavely
Assistant Managing Editor: Carole Coulson
Typesetter: Jennifer Neidig, Amanda Appicello, Cindy Consonery
Cover Design: Lisa Tosheff
Printed at: Yurchak Printing Inc.

Published in the United States of America by


Information Science Reference (an imprint of IGI Global)
701 E. Chocolate Avenue, Suite 200
Hershey PA 17033
Tel: 717-533-8845
Fax: 717-533-8661
E-mail: cust@igi-global.com
Web site: http://www.igi-global.com/reference

and in the United Kingdom by


Information Science Reference (an imprint of IGI Global)
3 Henrietta Street
Covent Garden
London WC2E 8LU
Tel: 44 20 7240 0856
Fax: 44 20 7379 0609
Web site: http://www.eurospanbookstore.com

Copyright © 2009 by IGI Global. All rights reserved. No part of this publication may be reproduced, stored or distributed in any form or by any
means, electronic or mechanical, including photocopying, without written permission from the publisher.
Product or company names used in this set are for identification purposes only. Inclusion of the names of the products or companies does not indicate
a claim of ownership by IGI Global of the trademark or registered trademark.

Library of Congress Cataloging-in-Publication Data

Encyclopedia of artificial intelligence / Juan Ramon Rabunal Dopico, Julian Dorado de la Calle, and Alejandro Pazos Sierra, editors.
p. cm.
Includes bibliographical references and index.
Summary: "This book is a comprehensive and in-depth reference to the most recent developments in the field covering theoretical developments, techniques, technologies, among others"--Provided by publisher.
ISBN 978-1-59904-849-9 (hardcover) -- ISBN 978-1-59904-850-5 (ebook)
1. Artificial intelligence--Encyclopedias. I. Rabunal, Juan Ramon, 1973- II. Dorado, Julian, 1970- III. Pazos Sierra, Alejandro.
Q334.2.E63 2008
006.303--dc22
2008027245

British Cataloguing in Publication Data


A Cataloguing in Publication record for this book is available from the British Library.

All work contributed to this encyclopedia set is new, previously-unpublished material. The views expressed in this encyclopedia set are those of the
authors, but not necessarily of the publisher.

If a library purchased a print copy of this publication, please go to http://www.igi-global.com/agreement for information on activating
the library's complimentary electronic access to this publication.


Artificial Intelligence for Information Retrieval


Thomas Mandl
University of Hildesheim, Germany

INTRODUCTION

This article describes the most prominent approaches to applying artificial intelligence technologies to information retrieval (IR). Information retrieval is a key technology for knowledge management. It deals with the search for information and the representation, storage and organization of knowledge. Information retrieval is concerned with search processes in which a user needs to identify a subset of information which is relevant for his information need within a large amount of knowledge. The information seeker formulates a query trying to describe his information need. The query is compared to document representations which were extracted during an indexing phase. The representations of documents and queries are typically matched by a similarity function such as the Cosine. The most similar documents are presented to the users who can evaluate the relevance with respect to their problem (Belkin, 2000). The problem of properly representing documents and matching imprecise representations soon led to the application of techniques developed within Artificial Intelligence to information retrieval.

BACKGROUND

In the early days of computer science, information retrieval (IR) and artificial intelligence (AI) developed in parallel. In the 1980s, they started to cooperate and the term intelligent information retrieval was coined for AI applications in IR. In the 1990s, information retrieval saw a shift from set based Boolean retrieval models to ranking systems like the vector space model and probabilistic approaches. These approximate reasoning systems opened the door for more intelligent value added components. The large number of text documents available in professional databases and on the internet has led to a demand for intelligent methods in text retrieval and to considerable research in this area. The need for better preprocessing to extract more knowledge from data has become an important way to improve systems. Off the shelf approaches promise worse results than systems adapted to users, domain and information needs. Today, most techniques developed in AI have been applied to retrieval systems with more or less success. When data from users is available, systems often use machine learning to optimize their results.

Artificial Intelligence Methods in Information Retrieval

Artificial intelligence methods are employed throughout the standard information retrieval process and for novel value added services. The first section gives a brief overview of information retrieval. The subsequent sections are organized along the steps in the retrieval process and give examples for applications.

Information Retrieval

Information retrieval deals with the storage and representation of knowledge and the retrieval of information relevant for a specific user problem. The information seeker formulates a query trying to describe his information need. The query is compared to document representations. The representations of documents and queries are typically matched by a similarity function such as the Cosine or the Dice coefficient. The most similar documents are presented to the users who can evaluate the relevance with respect to their problem.

Indexing usually consists of several phases. First, after word segmentation, stopwords are removed. These common words like articles or prepositions contain little meaning by themselves and are ignored in the document representation. Second, word forms are transformed into their basic form, the stem. During the stemming phase, e.g. houses would be transformed into house. For the document representation, different word forms are usually not necessary. The importance of a word for a document can be different. Some words better describe the content of a document than others.


The weight of a stem is determined by its frequency within the text of a document (Savoy, 2003).

In multimedia retrieval, the context is essential for the selection of a form of query and document representation. Different media representations may be matched against each other or transformations may become necessary (e.g. to match terms against pictures or spoken language utterances against documents in written text).

As information retrieval needs to deal with vague knowledge, exact processing methods are not appropriate. Vague retrieval models like the probabilistic model are more suitable. Within these models, terms are provided with weights corresponding to their importance for a document. These weights mirror different levels of relevance.

The results of current information retrieval systems are usually sorted lists of documents where the top results are more likely to be relevant according to the system. In some approaches, the user can judge the documents returned to him and tell the system which ones are relevant for him. The system then re-sorts the result set. Documents which contain many of the words present in the relevant documents are ranked higher. This relevance feedback process is known to greatly improve the performance. Relevance feedback is also an interesting application for machine learning. Based on human decisions, the optimization step can be modeled with several approaches, e.g. with rough sets (Singh & Dey, 2005). In Web environments, a click is often interpreted as an implicit positive relevance judgment (Joachims & Radlinski, 2007).

Advanced Representation Models

In order to represent documents in natural language, the content of these documents needs to be analyzed. This is a hard task for computer systems. Robust semantic analysis for large text collections or even multimedia objects has yet to be developed. Therefore, text documents are represented by natural language terms mostly without syntactic or semantic context. This is often referred to as the bag-of-words approach. These keywords or terms can only imperfectly represent an object because their context and relations to other terms are lost.

However, great progress has been made and systems for semantic analysis are getting competitive. Advanced syntactic and semantic parsing for robust processing of mass data has been derived from computational linguistics (Hartrumpf, 2006).

For application and domain specific knowledge, another approach is taken to improve the representation of documents. The representation scheme is enriched by exploiting knowledge about concepts of the domain (Lin & Demner-Fushman, 2006).

Match Between Query and Document

Once the representation has been derived, a crucial aspect of an information retrieval system is the similarity calculation between query and document representation. Most systems use mathematical similarity functions such as the Cosine. The decision for a specific function is based on heuristics or empirical evaluations. Several approaches use machine learning for long term optimization of the matching between term and document. For example, one approach applies genetic programming to adapt a weighting function to a collection (Almeida et al., 2007).

Neural networks have been applied widely in IR. Several network architectures have been applied for retrieval tasks, most often so-called spreading activation networks. Spreading activation networks are simple Hopfield-style networks; however, they do not use the learning rule of Hopfield networks. They typically consist of two layers representing terms and documents. The weights of connections between the layers are bi-directional and initially set according to the results of the traditional indexing and weighting algorithms (Belkin, 2000). The neurons corresponding to the terms of the user's query are activated in the term layer and activation spreads along the weights into the document layer and back. Activation represents relevance or interest and reaches potentially relevant terms and documents. The most highly activated documents are presented to the user as the result. A closer look at the models reveals that they very much resemble the traditional vector space model of Information Retrieval (Mandl, 2000). It is not until after the second step that the associative nature of the spreading activation process leads to results different from a vector space model. The spreading activation networks successfully tested with mass data do not take advantage of this associative property. In some systems the process is halted after only one step from the term layer into the document layer, whereas others make one more step back to the term layer to facilitate learning (Kwok & Grunfeld, 1996).

Queries in information retrieval systems are usually short and contain few words. Longer queries have a higher probability to achieve good results. As a consequence, systems try to add good terms to a query entered by a user. Several techniques have been applied. Either these terms are taken from top ranked documents or terms similar to the original ones are used. Another technique is to use terms from documents from the same category. For this task, classification algorithms from machine learning are used (Sebastiani, 2002).

Link analysis applies well known measures from bibliometric analysis to the Web. The number of links pointing to a Web page is used as an indicator for its quality (Borodin et al., 2005). PageRank assigns an authority value to each Web page which is primarily a function of its back links. Additionally, it assumes that links from pages with high authority should be weighted higher and should result in a higher authority for the receiving page. To account for the different values each page has to distribute, the algorithm is carried out iteratively until the result converges (Borodin et al., 2005). Machine Learning approaches complement link analysis. Decisions of humans about the quality of Web pages are used to determine design features of these pages which are good indicators of their quality. Machine learning models are applied to determine the quality of pages not judged yet (Mandl, 2006; Ivory & Hearst, 2002).

Learning from users has been an important strategy to improve systems. In addition to the content, artificial intelligence methods have been used to improve the user interface.

Value Added Components for User Interfaces

Several researchers have implemented information retrieval systems based on the Kohonen self organizing map (SOM), a neural network model for unsupervised classification. They provide an associative user interface where neighborhood of documents expresses a semantic relation. Implementations for large collections can be tested on the internet (Kohonen, 1998). The SOM consists of a usually two-dimensional grid of neurons, each associated with a weight vector. Input documents are classified according to the similarity between the input pattern and the weight vectors, and the algorithm adapts the weights of the winning neuron and its neighbors. In that way, neighboring clusters have a high similarity.

The information retrieval applications of SOMs classify documents and assign the dominant term as name for the cluster. For real world large scale collections, one two-dimensional grid is not sufficient. It would either be too big, or each node would contain too many documents. Neither would be helpful for users; therefore, a layered architecture is adopted. The highest layer consists of nodes which represent clusters of documents. The documents of these nodes are again analyzed by a SOM. For the user, the system consists of several two-dimensional maps of terms where similar terms are close to each other. After choosing one node, he may reach another two-dimensional SOM.

The information retrieval paradigm for the SOM is browsing and navigating between layers of maps. The SOM seems to be a very natural visualization. However, the SOM approach has some serious drawbacks.

• The interface for interacting with several layers of maps makes the system difficult to browse.
• Users of large text collections need primarily search mechanisms which the SOM itself does not offer.
• The similarity of the document collection is reduced to two dimensions omitting many potentially interesting aspects.
• The SOM unfolds its advantages for human-computer interaction better for a small number of documents. A very encouraging application would be the clustering of the result set. The neurons would fit on one screen, the number of terms would be limited and therefore, the reduction to two dimensions would not omit so many aspects.

User Classification and Personalization

Adaptive information retrieval approaches intend to tailor the results of a system to one user and his interests and preferences. The most popular representation scheme follows the one used in information retrieval, where a document-term matrix stores the importance or weight of each term for each document. When a term appears in a document, this weight should be different from zero. User interest can also be stored like a document. Then the interest is a vector of terms. These terms can be ones that a user has entered or selected in a user interface or which the system has extracted from documents for which the user has shown interest by viewing or downloading them (Agichtein et al., 2006).

An example for such a system is UCAIR which can be installed as a browser plugin. UCAIR relies on a standard web search engine to obtain a search result and a primary ranking. This ranking is then modified by re-ranking the documents based on implicit feedback and a stored user interest profile (Shen et al., 2005).

Most systems use this method of storing the user interest in a term vector. However, this method has several drawbacks. The interest profile may not be stable and the user may have a variety of diverging interests for work and leisure which are mixed in one profile.

Advanced individualization techniques personalize the underlying system functions. The results of empirical studies have shown that relevance feedback is an effective technique to improve retrieval quality. Learning methods for information retrieval need to extend the range of relevance feedback effects beyond the modification of the query in order to achieve long-term adaptation to the subjective point of view of the user. The mere change of the query often results in improved quality; however, the information is lost after the current session.

Some systems change the document representation according to the relevance feedback information. In a vector space metaphor, the relevant documents are moved toward the query representation. This approach also entails some problems. Because only a fraction of the documents are affected by the modifications, the basic data from the indexing process is changed to a somewhat heterogeneous state. The original indexing result is not available anymore.

Certainly, this technique is inadequate for fusion approaches where several retrieval methods are combined. In this case, several basic representations would need to be changed according to the influence of the corresponding methods on the relevant documents. The indexes are usually heterogeneous, which is often considered an advantage of fusion approaches. A high computational overhead would be the consequence.

The MIMOR (Multiple Indexing and Method-Object Relations) approach does not rely on changes to the document or the query representation when processing relevance feedback information for personalization. Instead, it focuses on the central aspect of a retrieval function, the calculation of the similarity between document and query. Like other fusion methods, MIMOR accepts the results of individual retrieval systems as if from black boxes. These results are fused by a linear combination which is stored over many sessions. The weights for the systems change through learning. They adapt according to relevance feedback information provided by users and create a long-term model for future use. That way, MIMOR learns which systems were successful in the past (Mandl & Womser-Hacker, 2004).

FUTURE TRENDS

Information retrieval systems are applied in more and more complex and diverse environments. Searching e-mail, social computing collections and other specific domains poses new challenges which lead to innovative systems. These retrieval applications require thorough and user oriented evaluation. New evaluation measures and standardized test collections are necessary to achieve reliable evaluation results.

In user adaptation, recommendation systems are an important trend for future improvement. Recommendation systems need to be seen in the context of social computing applications. System developers face the growth of user generated content which allows new reasoning methods.

New applications like question answering relying on more intelligent processing can be expected to gain more market share in the near future (Hartrumpf, 2006).

CONCLUSION

Knowledge management is of central importance for the information society. Documents written in natural language contain an important share of the knowledge available. Consequently, retrieval is crucial for the success of knowledge management systems. AI technologies have been widely applied in retrieval systems. Exploiting knowledge more efficiently is a major research field. In addition, user oriented value added systems require intelligent processing and machine learning in many forms.

An important future trend for AI methods in IR will be the context specific adaptation of retrieval methods.



Machine learning can be applied to find optimized functions for collections or queries.

REFERENCES

Agichtein, E., Brill, E., Dumais, S., & Ragno, R. (2006). Learning user interaction models for predicting web search result preferences. Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR) Seattle. ACM Press. 3-10.

de Almeida, H.M., Gonçalves, M.A., Cristo, M., & Calado, P. (2007). A combined component approach for finding collection-adapted ranking functions based on genetic programming. Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR) Amsterdam. ACM Press. 399-406.

Belkin, R. (2000). Finding out about: a Cognitive Perspective on Search Engine Technology and the WWW. Cambridge et al.: Cambridge University Press.

Brusilovsky, P., Kobsa, A. & Nejdl, W. (eds.) (2007). The adaptive Web: Methods and strategies of Web personalization. Heidelberg: Springer.

Boughanem, M., & Soulé-Dupuy, C. (1998). Mercure at trec6. In Voorhees, E. & Harman, D. (eds.). The sixth text retrieval conf (TREC-6). NIST Special Publ 500-240. Gaithersburg, MD.

Borodin, A., Roberts, G., Rosenthal, J. & Tsaparas, P. (2005). Link analysis ranking: algorithms, theory, and experiments. ACM Transactions on Internet Technology (TOIT) 5(1), 231-297.

Hartrumpf, S. (2006). Extending Knowledge and Deepening Linguistic Processing for the Question Answering System InSicht. Accessing Multilingual Information Repositories, 6th Workshop of the Cross-Language Evaluation Forum, CLEF. [LNCS 4022] Springer. 361-369.

Ivory, M. & Hearst, M. (2002). Statistical Profiles of Highly-Rated Sites. ACM CHI Conference on Human Factors in Computing Systems. ACM Press. 367-374.

Joachims, T., & Radlinski, F. (2007). Search engines that learn from implicit feedback. IEEE Computer 40(8), 34-40.

Kohonen, T. (1998). Self-organization of very large document collections: state of the art. In Proceedings 8th Intl Conf Artificial Neural Networks. Springer, 1. 65-74.

Kwok, K., & Grunfeld, L. (1996). TREC-4 ad-hoc, routing retrieval and filtering experiments using PIRCS. In: Harman, Donna (ed.). The fourth Text Retrieval Conference (TREC-4). NIST Special Publ 500-236. Gaithersburg, MD.

Lin, J., & Demner-Fushman, D. (2006). The role of knowledge in conceptual retrieval: a study in the domain of clinical medicine. In Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR) Seattle. ACM Press. 99-106.

Mandl, T. (2000). Tolerant Information Retrieval with Backpropagation Networks. Neural Computing & Applications 9(4), 280-289.

Mandl, T. (2006). Implementation and evaluation of a quality based search engine. In Proceedings of the 17th ACM Conference on Hypertext and Hypermedia (HT '06) Odense, Denmark. ACM Press. 73-84.

Mandl, T., & Womser-Hacker, C. (2004). A Framework for long-term Learning of Topical User Preferences in Information Retrieval. New Library World, 105(5/6) 184-195.

Savoy, J. (2003). Cross-language information retrieval: experiments based on CLEF 2000 corpora

Sebastiani, F. (2002). Machine Learning in Automated Text Categorization. ACM Computing Surveys 34(1), 1-47.

Shen, X., Tan, B. & Zhai, C. (2005). Context-sensitive information retrieval using implicit feedback. Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR), Seattle, WA. ACM Press. 43-50.

Singh, S., & Dey, Lipika (2005). A rough-fuzzy document grading system for customized text information retrieval. Information Processing & Management 41(2), 195-216.

Zhang, D., Chen, X. & Lee, W. (2005). Text classification with kernels on the multinomial manifold. Annual International ACM Conference on Research and Development in Information Retrieval (SIGIR) Seattle. ACM Press. 266-273.



KEY TERMS

Adaptation: Adaptation is a process of modification based on input or observation. An information system should adapt itself to the specific needs of individual users in order to produce optimized results.

Indexing: Indexing means the assignment of terms (words) which represent a document in an index. Indexing can be carried out manually or automatically. Automatic indexing requires the elimination of stopwords and stemming.

Information Retrieval: Information retrieval is concerned with the representation of knowledge and the subsequent search for relevant information within these knowledge sources. Information retrieval provides the technology behind search engines.

Link Analysis: The links between pages on the web are a large knowledge source which is exploited by link analysis algorithms for many ends. Many algorithms similar to PageRank determine a quality or authority score based on the number of in-coming links of a page. Furthermore, link analysis is applied to identify thematically similar pages, web communities and other social structures.

Recommendation Systems: Actions or content are suggested to the user based on past experience collected from other users. Very often, documents are recommended based on similarity profiles between users.

Term Expansion: Terms not present in the original query to an information retrieval system entered by the user are added automatically. The expanded query is then sent to the system again.

Weighting: Weighting determines the importance of a term for a document. Weights are calculated using many different formulas which consider the frequency of each term in a document and in the collection as well as the length of the document and the average or maximum length of any document in the collection.
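As an illustration of the Weighting entry, the following minimal sketch combines the three ingredients named there: the frequency of a term in a document, its frequency in the collection, and the document length relative to the average length. The specific formula (term frequency divided by relative length, multiplied by a logarithmic inverse document frequency) is only one of many possible choices and is an assumption made for this example, not the formula of any particular system discussed in the article.

```python
import math
from collections import Counter


def term_weights(docs):
    """docs: list of documents, each a list of (stemmed) terms.

    Returns one {term: weight} dict per document.
    """
    if not docs:
        return []
    n_docs = len(docs)
    avg_len = sum(len(d) for d in docs) / n_docs
    # Document frequency: in how many documents does each term occur?
    df = Counter(term for d in docs for term in set(d))
    weights = []
    for d in docs:
        tf = Counter(d)                      # frequency within the document
        rel_len = len(d) / avg_len if avg_len else 1.0
        weights.append({
            # long documents are damped, collection-wide common terms score low
            t: (f / rel_len) * math.log(n_docs / df[t])
            for t, f in tf.items()
        })
    return weights


docs = [["house", "garden", "house"], ["house", "city"], ["city", "river", "river", "river"]]
for w in term_weights(docs):
    print(w)
```

A term that occurs in every document of the collection receives a weight of zero under this formula, which reflects the idea that such a term does not help to distinguish documents from each other.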


