Strathprints Institutional Repository
Ruthven, I. (2003) Re-examining the potential effectiveness of interactive query expansion. In:
26th Annual International ACM SIGIR Conference on Research and Development in Information
Retrieval, 2003-07-28 - 2003-08-01, Toronto.
Strathprints is designed to allow users to access the research output of the University of Strathclyde.
Copyright © and Moral Rights for the papers on this site are retained by the individual authors
and/or other copyright owners. You may not engage in further distribution of the material for any
profitmaking activities or any commercial gain. You may freely distribute both the url (http://
strathprints.strath.ac.uk/) and the content of this paper for research or study, educational, or
not-for-profit purposes without prior permission or charge.
Any correspondence concerning this service should be sent to Strathprints administrator:
strathprints@strath.ac.uk
http://strathprints.strath.ac.uk/
Re-examining the potential effectiveness of interactive query expansion
Ian Ruthven
Department of Computer and Information Sciences
University of Strathclyde
Glasgow G1 1XH
Ian.Ruthven@cis.strath.ac.uk
ABSTRACT
Much attention has been paid to the relative effectiveness of interactive query expansion versus automatic query expansion. Although interactive query expansion has the potential to be an effective means of improving a search, in this paper we show that, on average, human searchers are less likely than systems to make good expansion decisions. To enable good expansion decisions, searchers must have adequate instructions on how to use interactive query expansion functionalities. We show that simple instructions on using interactive query expansion do not necessarily help searchers make good expansion decisions and discuss difficulties found in making query expansion decisions.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: search process, relevance feedback.

General Terms
Experimentation, Human Factors

Keywords
Query expansion, Evaluation

1. INTRODUCTION
Query expansion techniques, e.g. [1, 5], aim to improve a user's search by adding new query terms to an existing query. A standard method of performing query expansion is to use relevance information from the user – those documents a user has assessed as containing relevant information. The content of these relevant documents can be used to form a set of possible expansion terms, ranked by some measure that describes how useful the terms might be in attracting more relevant documents, [13]. All or some of these expansion terms can be added to the query either by the user – interactive query expansion (IQE) – or by the retrieval system – automatic query expansion (AQE).

One argument in favour of AQE is that the system has access to more statistical information on the relative utility of expansion terms and can make a better selection of which terms to add to the user's query. The main argument in favour of IQE is that interactive query expansion gives more control to the user. As it is the user who decides the criteria for relevance in a search, the user should be able to make better decisions on which terms are likely to be useful, [10].

A number of comparative user studies of automatic versus interactive query expansion have come up with inconclusive findings regarding the relative merits of AQE versus IQE. For example, Koenemann and Belkin [10] demonstrated that IQE can outperform AQE for specific tasks, whereas Beaulieu [1] showed AQE as giving higher retrieval effectiveness in an operational environment. One reason for this discrepancy in findings is that the design of the interface, search tasks and experimental methodology can affect the uptake and effectiveness of query expansion techniques.

Magennis and Van Rijsbergen [12] attempted to gauge the effectiveness of IQE in live and simulated user experiments. In their experiments they estimated the performance that might be gained if a user was making very good IQE decisions (the potential effectiveness of IQE) compared to that of real users making the query modification decisions (the actual effectiveness of IQE). Their conclusion was that users tend to make sub-optimal decisions on query term utility.

In this paper we revisit this claim to investigate more fully the potential effectiveness of IQE. In particular we investigate how good a user's query term selection would have to be to increase retrieval effectiveness over automatic strategies for query expansion. We also compare human assessments of expansion term utility with those assessments made by the system.

The remainder of the paper is structured as follows. In section 2 we discuss the motivation behind our investigation and that of Magennis and Van Rijsbergen. In section 3 we describe our experimental methodology and data. In section 4 we investigate the potential effectiveness of IQE and in section 5 we compare potential strategies for helping users make IQE decisions. In section 6 we summarise our findings.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGIR'03, July 28–August 1, 2003, Toronto, Canada.
Copyright 2003 ACM 1-58113-646-3/03/0007…$5.00.
2. MOTIVATION
In this section we summarise the experiments carried out by Magennis and Van Rijsbergen, section 2.1, some limitations of these experiments, section 2.2, and discuss the motivation behind our work, section 2.3.

2.1 The potential effectiveness of IQE
In [11, 12] Magennis and Van Rijsbergen, based on earlier work by Harman [7], carried out an experiment to estimate how good IQE could be if performed by expert searchers.

Using the WSJ (1987-1992) test collection, a list of the top 20 expansion terms was created for each query, using terms taken from the top 20 retrieved documents. This list of possible expansion terms was ranked by applying the F4, [14], term reweighting formula to the set of unretrieved relevant documents. This set of documents consists of the relevant documents not yet seen by the user. This set could not be calculated in a real search environment as retrieval systems will only have knowledge of the set of documents that have been seen by the user. However, as query expansion aims to retrieve this set of documents, they form the best evidence on the utility of expansion terms.

Using these sets of expansion terms, Magennis and Van Rijsbergen simulated a user selecting expansion terms over four iterations of query expansion. At each iteration, from the list of 20 expansion terms, the top 0, 3, 6, 10 and 20 terms were isolated. These groups of terms simulated possible sets of expansion terms chosen by a user. By varying which group of terms was added at each iteration, all possible expansion decisions were simulated; for example, expansion by the top 3 terms at feedback iteration 1, the top 10 terms at feedback iteration 2, etc. The best simulation for each query was taken to be a measure of the best IQE decisions that could be made by a user; the potential effectiveness of IQE.

2.2 Limitations
One of the benefits of an approach such as that taken by Magennis and Van Rijsbergen is that it is possible to isolate the effect of query expansion itself. That is, by eliminating the effects of individual searchers and search interfaces the results can be used as baseline figures with which to compare user search effectiveness. However, in [11, 12], Magennis noted several limitations of this particular measurement of the potential effectiveness of IQE.

i. only certain combinations of terms are considered, i.e. the top 3, 6, 10 or 20 terms. Other combinations of terms are possible, e.g. the top 4 terms, and these may give better retrieval performance.

ii. real searchers are unlikely to add a consecutive set of expansion terms, i.e. the top 3, 6, 10 or 20 terms suggested by the system. It is more likely that searchers will choose terms from throughout the list of expansion terms. In this way, users can, for example, avoid poor expansion terms suggested by the system.

iii. the ranking of expansion terms is based on information from the unseen relevant documents; ones that the user has not yet viewed. In a real search environment the expansion terms will be ranked based on their presence or absence in the documents seen and assessed relevant by the user.

iv. only one document collection was used. Differences in the creation of test collections, the search topics used and the documents present in the test collection may affect their results and restrict the generality of their conclusions.

2.3 Aims of study
In our experiments we aim to overcome these limitations to create a more realistic evaluation of the potential effectiveness of interactive query expansion. In particular we aim to investigate how good IQE could be, how easy it is to make good IQE decisions, and guidelines for helping users make good IQE decisions. We also investigate what kind of IQE decisions are actually made by searchers when selecting new search terms. In the following section we describe how we obtain the query expansion results analysed in the first part of this paper. These experiments are also based on simulations of interactive query expansion decisions.

3. EXPERIMENTAL SETUP
In this section we outline the experimental methodology we used to simulate query expansion decisions. The experiments themselves were carried out on the Associated Press (AP 1998), San Jose Mercury News (SJM 1991), and Wall Street Journal (WSJ 1990-1992) collections, details of which are given in Table 1. These collections come from the TREC initiative [16].

Table 1: Collection statistics

                                                  AP      SJM     WSJ
Number of documents                               79919   90257   74520
Number of queries used                            32      39      28
Average words per query¹                          2.9     3.7     2.9
Average number of relevant documents per query    37.8    58.6    30.3

¹ Queries used in the experiments only. The query comes from the short title field of the TREC topic description.

For each query we use the top 25 retrieved documents to provide a list of possible expansion terms, as described below. Although each collection comes with a list of 50 topic (query) descriptions, we concentrate on those queries where query expansion could change the effectiveness of an existing query. This meant excluding some queries from each test collection: those queries for which there are no relevant documents, queries where no relevant documents were retrieved in the top 25 documents (as no expansion terms could be formed without at least one relevant document), and queries where all the relevant documents are found within the top 25 retrieved documents (as query expansion will not cause a change in retrieval effectiveness for these queries).

In our experiments we used the wpq method of ranking terms for query expansion, [13], as this has been shown to give good results for both AQE and IQE, [4]. The equation for calculating a weight for a term using wpq is shown below, where
the value r_t = the number of seen relevant documents containing term t, n_t = the number of documents containing t, R = the number of seen relevant documents for query q, and N = the number of documents in the collection:

    wpq_t = log[ r_t (N - n_t - R + r_t) / ((n_t - r_t) (R - r_t)) ] * [ r_t / R - (n_t - r_t) / (N - R) ]
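As a concrete illustration, the wpq weight can be computed as follows. This is a minimal sketch in Python (the function and argument names are ours, not from the paper); the 0.5 correction that keeps the logarithm finite when a term occurs in every seen relevant document is a common convention, and an assumption rather than something the paper specifies.

```python
import math

def wpq(r_t: int, n_t: int, R: int, N: int) -> float:
    """wpq weight for a candidate expansion term t.

    r_t: number of seen relevant documents containing t
    n_t: number of documents in the collection containing t
    R:   number of seen relevant documents for the query
    N:   number of documents in the collection
    """
    # Relevance weight: log-odds of t occurring in relevant versus
    # non-relevant documents (0.5 added to each count as a continuity
    # correction -- our assumption, not specified in the paper).
    w1 = math.log(((r_t + 0.5) * (N - n_t - R + r_t + 0.5)) /
                  ((n_t - r_t + 0.5) * (R - r_t + 0.5)))
    # Difference between t's occurrence rate in the seen relevant
    # documents and in the rest of the collection.
    return w1 * (r_t / R - (n_t - r_t) / (N - R))
```

Under this sketch, a term occurring in all five seen relevant documents of a 1000-document collection and in few other documents receives a much higher weight than a term occurring in one relevant document but common elsewhere.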
Our procedure was as follows:

For each query,

i. rank the documents using a standard tf*idf weighting to obtain an initial ranking of the documents.

ii. use the relevant documents in the top 25 retrieved documents to obtain a list of possible expansion terms, using the wpq formula to rank the expansion terms.

iii. using the top 15 expansion terms, create all possible sets of expansion terms. For each query this gives 32 768 possible sets of expansion terms. This simulates all possible selections of expansion terms using the top 15 terms, including no expansion of the query. Each of these 32 768 sets of terms represents a possible IQE decision that could be made by a user.

iv. using each combination of expansion terms, add the combination to the original query and use the new query to rank the documents, again using tf*idf. That is, for each query, we carry out 32 768 separate versions of query expansion.

v. calculate the recall-precision values for each version of the query. Here we use a full-freezing approach by which we only re-rank the unseen documents – those not used to create the list of expansion terms. This is a standard method of assessing the performance of a query expansion technique based on relevance information, [3].

We only use the top 15 expansion terms for query expansion as this is a computationally intensive method of creating possible queries. In a real interactive situation users may be shown more terms than this. However, it does allow us to concentrate on those terms that are considered by the system to be the best for query expansion.

For each query in each collection, therefore, we have a set of 32 768 possible IQE decisions that could be made by a searcher. For each possible IQE decision we can assess the effect of making this decision on the quality of the expanded query. We use this information in several ways; firstly, in section 4, we compare the possible IQE decisions against three methods of applying AQE. We then, in section 5, examine potential strategies for helping searchers make good IQE decisions. In section 5 we also compare the possible IQE decisions against human expansion decisions.
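Step iii above amounts to enumerating every subset of the top 15 terms. A minimal sketch of the enumeration follows (names are ours; `evaluate` is a hypothetical stand-in for steps iv-v, i.e. re-ranking with tf*idf under full freezing and scoring average precision):

```python
def all_iqe_decisions(expansion_terms):
    """Yield every subset of the ranked expansion terms, including the
    empty set (no expansion). With 15 terms this yields 2**15 = 32768
    simulated IQE decisions."""
    n = len(expansion_terms)
    for mask in range(2 ** n):
        yield [t for i, t in enumerate(expansion_terms) if mask & (1 << i)]

def best_and_worst_decisions(query_terms, expansion_terms, evaluate):
    """Score every simulated IQE decision for one query.

    `evaluate(terms)` is hypothetical: it should run the expanded query,
    re-ranking only the unseen documents, and return the average
    precision achieved."""
    scored = [(evaluate(query_terms + decision), decision)
              for decision in all_iqe_decisions(expansion_terms)]
    return max(scored), min(scored)
```

The best- and worst-scoring decisions correspond to the "IQE best" and "IQE poor" rows reported later in Table 3.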
4. COMPARING QUERY EXPANSION TECHNIQUES
In this section we examine the potential effectiveness of IQE against three possible strategies for applying AQE. In particular, we compare how likely a user is to make better query expansion decisions using IQE than allowing the system to perform AQE. Our three AQE techniques are:

Collection independent expansion. A common approach to AQE is to add a fixed number of terms, n, to each query. Our first AQE technique simulates this by adding the top six expansion terms to all queries, irrespective of the collection used. The value of six was chosen without prior knowledge of the effectiveness of adding this number of terms to any of the queries in the test collections used.

Collection dependent expansion. The previous approach to AQE adds the same number of expansion terms to all queries in all collections. When using a specific test collection we can calculate a better value of n; one that is specific to the test collection used. To calculate n, for each collection, we compared the average precision over all the queries used in each collection after the addition of the top n expansion terms, where n varied from 1 to 15. The value of n that gave the optimal value of average precision for the whole query set was taken to be the value of n for each query in the collection.

These values could not be calculated in an operational environment, where knowledge of all queries submitted is unavailable. However, this gives a stricter AQE baseline measure as the value of n is optimal for the collection used. The values for n are shown in Table 2, and are higher than the six terms added in the previous strategy.

Table 2: Optimal values of n

Collection    AP    SJM    WSJ
n             15    15     13
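The sweep used to pick the collection dependent value of n can be sketched as follows (the names are ours, and `avg_precision_with_top_n(query, n)` is a hypothetical helper standing in for re-running retrieval after adding the top n wpq-ranked terms):

```python
def optimal_n_for_collection(queries, avg_precision_with_top_n, max_n=15):
    """Return the single expansion depth n in 1..max_n that maximises
    mean average precision over the whole query set, as in the
    collection dependent AQE baseline."""
    def mean_ap(n):
        # Mean average precision over all queries when the top n
        # expansion terms are added to each query.
        return sum(avg_precision_with_top_n(q, n) for q in queries) / len(queries)
    return max(range(1, max_n + 1), key=mean_ap)
```

The query dependent strategy described next is the same idea applied per query rather than per collection.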
Query dependent expansion. The collection dependent expansion strategy adds a fixed number of terms to each query within a test collection. This is optimal for the entire query set but may be sub-optimal for individual queries, i.e. some queries may give better retrieval effectiveness for greater or smaller values of n. The query dependent expansion strategy calculates which value of n is optimal for each individual query. This may be implemented in an operational retrieval system by, for example, setting a threshold on the expansion term weights.

These three AQE methods act as baseline performance measures for comparing AQE with IQE.

4.1 Query expansion vs. no query expansion
We first compare the effect of query expansion against no query expansion; how good are different approaches to query expansion? In Table 3 we compare the AQE baselines against no query expansion: the performance of the original query with no additional query terms. Specifically, we compare how many queries in each collection give higher average precision than no query expansion; the percentage of queries that are improved by each AQE strategy. Also included in this table, in bold figures, are the average precision figures given by applying the techniques.

As can be seen, all AQE strategies were more likely, on average, to improve a query than harm it. That is, all techniques improved at least 50% of the queries where query expansion could make a difference to retrieval effectiveness.

The automatic strategy that is most specific to the query, the query dependent strategy, not only improves the highest percentage of queries (it is the most stable) but also gives the highest average precision over the queries (it is the most effective). Conversely, the automatic strategy that is least effective and improves the fewest queries is the one that is least tailored to either the query or collection: the collection independent strategy.

Finally, in row 7 of Table 3, we show the effect if a user was consistently making the worst IQE decisions possible, i.e. always choosing the combination of expansion terms that gave the lowest average precision of all possible decisions. Even though a user is unlikely to always make such poor decisions, these decisions are being made on terms selected from the top 15 expansion terms. So, although IQE can be effective, it is a technique that needs to be applied carefully. In the next section we examine how likely a user is to make a good decision using IQE.

Table 3: AQE baselines and example IQE decisions

Baseline                  AP      SJM     WSJ
Collection independent    56%     72%     50%
                          18.8    23.8    18.4
Collection dependent      72%     79%     53%
                          19.0    24.8    18.6
Query dependent           75%     90%     86%
                          20.1    26.7    21.1
IQE best                  94%     97%     96%
                          22.3    29.1    22.4
IQE poor                  7.61    7.27    7.44

4.2 AQE vs. IQE
In this section we look at how difficult it is to select a set of expansion terms that will perform better than AQE or no query expansion. We do this by comparing how many of the possible IQE decisions will give better average precision than the AQE baselines. In Table 4 we show the results of this analysis. For each collection we show how many possible IQE decisions gave greater average precision than each of the three baselines (top row in columns 2-4) and how many of the decisions gave a significantly higher average precision than the baselines (bold figures in columns 2-4).

Table 4: Percentage of combinations better than baselines

5.2 Trust the system
A second approach might be to advise users to concentrate on the terms suggested most strongly by the system. These are terms that are calculated by the system to be the most likely to improve a query, and in our experiment are the terms with the highest wpq score. In Table 6, we present the average wpq value

5.3.1 System analysis of expansion term utility
The system, or automatic, analysis of an expansion term is based on the overall impact of adding that term to all possible IQE decisions that do not already contain the term. That is, we estimate the likely impact of adding a new expansion term t to an existing set of expansion terms.

For each query, each expansion term, t, belongs to 50% (16384) of the possible IQE decisions (and does not belong to 50% of the possible decisions, including no query expansion). In
effect these two sets of possible decisions are identical except as relates to t: adding t to each IQE decision in the latter set would give an IQE decision in the former set. By comparing the average precision of all IQE decisions that contain t with that of the corresponding decisions that do not contain t, we can classify each of the top 15 expansion terms according to whether they are good, neutral or poor expansion terms. Good terms are those that are likely to improve the performance of a possible IQE decision (a set of expansion terms); neutral ones are those that generally make no difference; and poor expansion terms are those that are likely to decrease the performance of a set of expansion terms.
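This pairing can be sketched as follows, assuming average precision has been recorded for every possible decision. The names are ours; the paper reports the resulting percentages rather than a single label, and a term that improves and worsens equal percentages of decisions stays unclassified.

```python
def classify_term(t, decision_ap):
    """Classify expansion term t as good, neutral, poor or unclassified.

    decision_ap maps each possible IQE decision (a frozenset of terms)
    to its average precision. Each decision without t is paired with the
    corresponding decision formed by adding t."""
    improved = worsened = no_diff = 0
    for decision, ap in decision_ap.items():
        if t in decision:
            continue
        ap_with_t = decision_ap[decision | {t}]
        if ap_with_t > ap:
            improved += 1
        elif ap_with_t < ap:
            worsened += 1
        else:
            no_diff += 1
    counts = {"good": improved, "poor": worsened, "neutral": no_diff}
    top = max(counts.values())
    labels = [label for label, c in counts.items() if c == top]
    return labels[0] if len(labels) == 1 else "unclassified"
```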
We demonstrate this in Table 7, based on the TREC topic 259 'New Kennedy Assassination Theories' run on the AP collection. Each row shows what percentage of the 16384 possible decisions, not already containing the term in column 1, are improved, worsened, or show no difference after the addition of the term. For example, the addition of the term jfk will always improve retrieval effectiveness. That is, adding the term jfk to any set of expansion terms will increase retrieval effectiveness. Conversely, adding the term frenchi will always reduce the retrieval effectiveness, and the addition of the term warren will make no difference.

Table 7: Addition of expansion terms for TREC topic 259

Term          Improved    No difference    Worsened
jfk           100         0                0
oswald        3           0                97
dealei        37          4                59
kwitni        29          0                71
motorcad      64          4                32
marcello      100         0                0
warren        0           100              0
theorist      0           100              0
theori        18          0                82
depositori    67          0                33
documentari   40          19               41
belin         0           100              0
tippit        46          8                46
frenchi       0           0                100
bulletin      45          0                55

For this topic the good expansion terms are jfk, motorcad, marcello and depositori; the poor terms are oswald, dealei, kwitni, theori, documentari, frenchi and bulletin, and the neutral terms are warren, theorist and belin. The term tippit is good and poor for an equal percentage of combinations and cannot be classified.

5.3.2 Human analysis of expansion term utility
The automatic classification of expansion term utility presented in the previous section was compared against a set of human classifications of the same expansion terms.

We selected 8 queries from each collection and asked 3 human subjects to read the whole TREC topic and each of the relevant documents found within the top 25 retrieved documents. These were the relevant documents used to create the list of the top 15 expansion terms in the previous experiment. The subjects were given the full TREC topic description to provide some context for the search, and were shown the initial query that retrieved the documents. The subjects were then presented with the top 15 expansion terms. For each expansion term the subjects were asked whether they felt the term would be useful or not useful at retrieving additional relevant documents when added to the existing query.

We asked each subject to assess each of the 24 queries rather than distributing the queries across multiple subjects. This was to preserve any strategies the individual users may be employing when selecting expansion terms [8]. However, we did not ask the subjects to read the non-relevant retrieved documents as we felt this was too great a burden on the subjects.

The subjects' selection of expansion terms was compared against the automatic analysis from section 5.3.1 to compare the system classification against the human classification of expansion term utility. The comparison was done in three ways: first we compare how good the subjects are at detecting good expansion terms, section 5.3.2.1; then how good the subjects are at eliminating poor expansion terms, section 5.3.2.2; and finally we examine the decisions made by the subjects, section 5.3.2.3.

5.3.2.1 Detecting good expansion terms
For each subject we examine first whether the subjects can detect good expansion terms; whether the subjects can recognise the expansion terms that are likely to be useful in combination with other expansion terms.

Table 8: Percentage of good expansion terms detected

        Subject 1    Subject 2    Subject 3
AP      73%          60%          63%
SJM     50%          40%          42%
WSJ     62%          32%          45%
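The figures in Table 8 can be read as the overlap between the terms the system classifies as good and the terms a subject marked as useful; a small sketch under that assumption (the names are ours):

```python
def detection_rate(system_good_terms, subject_useful_terms):
    """Percentage of the system's good expansion terms that a subject
    also marked as useful (one cell of Table 8)."""
    good = set(system_good_terms)
    if not good:
        return 0.0
    return 100.0 * len(good & set(subject_useful_terms)) / len(good)
```

For example, if the system's good terms for a topic were jfk, marcello, motorcad and depositori, a subject marking jfk and marcello (plus other terms) as useful would detect 50% of the good terms.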