Strathprints Institutional Repository

Ruthven, I. (2003) Re-examining the potential effectiveness of interactive query expansion. In:
26th Annual International ACM SIGIR Conference on Research and Development in Information
Retrieval, 2003-07-28 - 2003-08-01, Toronto.
Strathprints is designed to allow users to access the research output of the University of Strathclyde.
Copyright © and Moral Rights for the papers on this site are retained by the individual authors
and/or other copyright owners. You may not engage in further distribution of the material for any
profitmaking activities or any commercial gain. You may freely distribute both the url (http://
strathprints.strath.ac.uk/) and the content of this paper for research or study, educational, or
not-for-profit purposes without prior permission or charge.
Any correspondence concerning this service should be sent to Strathprints administrator:
mailto:strathprints@strath.ac.uk

http://strathprints.strath.ac.uk/
http://eprints.cdlr.strath.ac.uk/1915/

This is an author-produced version of a paper published in Proceedings of the
Twenty-Sixth Annual International ACM SIGIR Conference on Research and
Development in Information Retrieval. This version has been peer-reviewed, but
does not include the final publisher proof corrections, published layout, or
pagination.
Re-examining the Potential Effectiveness of Interactive
Query Expansion

Ian Ruthven
Department of Computer and Information Sciences
University of Strathclyde
Glasgow G1 1XH
Ian.Ruthven@cis.strath.ac.uk

ABSTRACT
Much attention has been paid to the relative effectiveness of interactive query expansion versus automatic query expansion. Although interactive query expansion has the potential to be an effective means of improving a search, in this paper we show that, on average, human searchers are less likely than systems to make good expansion decisions. To enable good expansion decisions, searchers must have adequate instructions on how to use interactive query expansion functionalities. We show that simple instructions on using interactive query expansion do not necessarily help searchers make good expansion decisions, and we discuss the difficulties found in making query expansion decisions.

Categories and Subject Descriptors
H.3.3 [Information Search and Retrieval]: search process, relevance feedback.

General Terms
Experimentation, Human Factors

Keywords
Query expansion, Evaluation

1. INTRODUCTION
Query expansion techniques, e.g. [1, 5], aim to improve a user's search by adding new query terms to an existing query. A standard method of performing query expansion is to use relevance information from the user – those documents a user has assessed as containing relevant information. The content of these relevant documents can be used to form a set of possible expansion terms, ranked by some measure that describes how useful the terms might be in attracting more relevant documents [13]. All or some of these expansion terms can be added to the query either by the user – interactive query expansion (IQE) – or by the retrieval system – automatic query expansion (AQE).

One argument in favour of AQE is that the system has access to more statistical information on the relative utility of expansion terms and can make a better selection of which terms to add to the user's query. The main argument in favour of IQE is that interactive query expansion gives more control to the user. As it is the user who decides the criteria for relevance in a search, the user should be able to make better decisions on which terms are likely to be useful [10].

A number of comparative user studies of automatic versus interactive query expansion have come up with inconclusive findings regarding the relative merits of AQE versus IQE. For example, Koenemann and Belkin [10] demonstrated that IQE can outperform AQE for specific tasks, whereas Beaulieu [1] showed AQE giving higher retrieval effectiveness in an operational environment. One reason for this discrepancy in findings is that the design of the interface, search tasks and experimental methodology can affect the uptake and effectiveness of query expansion techniques.

Magennis and Van Rijsbergen [12] attempted to gauge the effectiveness of IQE in live and simulated user experiments. In their experiments they estimated the performance that might be gained if a user was making very good IQE decisions (the potential effectiveness of IQE) compared to that of real users making the query modification decisions (the actual effectiveness of IQE). Their conclusion was that users tend to make sub-optimal decisions on query term utility.

In this paper we revisit this claim to investigate more fully the potential effectiveness of IQE. In particular we investigate how good a user's query term selection would have to be to increase retrieval effectiveness over automatic strategies for query expansion. We also compare human assessments of expansion term utility with those made by the system.

The remainder of the paper is structured as follows. In section 2 we discuss the motivation behind our investigation and that of Magennis and Van Rijsbergen. In section 3 we describe our experimental methodology and data. In section 4 we investigate the potential effectiveness of IQE and in section 5 we compare potential strategies for helping users make IQE decisions. In section 6 we summarise our findings.

Permission to make digital or hard copies of all or part of this work for personal or classroom use is granted without fee provided that copies are not made or distributed for profit or commercial advantage and that copies bear this notice and the full citation on the first page. To copy otherwise, or republish, to post on servers or to redistribute to lists, requires prior specific permission and/or a fee.
SIGIR'03, July 28–August 1, 2003, Toronto, Canada.
Copyright 2003 ACM 1-58113-646-3/03/0007…$5.00.
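As a concrete illustration of the pipeline just described – form candidate terms from the documents judged relevant, rank them by some utility measure, then add terms either automatically (AQE) or by user choice (IQE) – the following sketch uses a simple frequency score as a placeholder for a real ranking measure such as wpq; all data, names and functions here are invented for illustration and are not the paper's implementation.

```python
from collections import Counter

def candidate_terms(relevant_docs):
    """Pool the terms of the documents judged relevant."""
    counts = Counter()
    for doc in relevant_docs:
        counts.update(doc.split())
    return counts

def rank_terms(counts, query):
    """Rank candidate expansion terms by a placeholder utility score
    (frequency in the relevant set), excluding terms already in the
    query. A real system would use a measure such as wpq instead."""
    return [t for t, _ in counts.most_common() if t not in query]

def expand_aqe(query, ranked, n=6):
    """AQE: the system adds the top-n suggested terms automatically."""
    return query + ranked[:n]

def expand_iqe(query, ranked, chosen):
    """IQE: the user picks any subset of the suggested terms."""
    return query + [t for t in ranked if t in chosen]

# Toy example with two 'relevant documents'.
docs = ["kennedy assassination theory oswald",
        "kennedy assassination jfk oswald motorcade"]
query = ["kennedy", "assassination"]
ranked = rank_terms(candidate_terms(docs), query)
print(expand_aqe(query, ranked, n=2))
print(expand_iqe(query, ranked, {"jfk"}))
```

The only difference between the two strategies is who chooses from the ranked list; everything up to that point (pooling and ranking) is shared.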
2. MOTIVATION
In this section we summarise the experiments carried out by Magennis and Van Rijsbergen (section 2.1) and some limitations of these experiments (section 2.2), and discuss the motivation behind our work (section 2.3).

2.1 The potential effectiveness of IQE
In [11, 12] Magennis and Van Rijsbergen, based on earlier work by Harman [7], carried out an experiment to estimate how good IQE could be if performed by expert searchers.

Using the WSJ (1987-1992) test collection, a list of the top 20 expansion terms was created for each query, using terms taken from the top 20 retrieved documents. This list of possible expansion terms was ranked by applying the F4 term reweighting formula [14] to the set of unretrieved relevant documents. This set consists of the relevant documents not yet seen by the user. It could not be calculated in a real search environment, as retrieval systems will only have knowledge of the set of documents that have been seen by the user. However, as query expansion aims to retrieve this set of documents, they form the best evidence on the utility of expansion terms.

Using these sets of expansion terms, Magennis and Van Rijsbergen simulated a user selecting expansion terms over four iterations of query expansion. At each iteration, from the list of 20 expansion terms, the top 0, 3, 6, 10 and 20 terms were isolated. These groups of terms simulated possible sets of expansion terms chosen by a user. By varying which group of terms was added at each iteration, all possible expansion decisions were simulated: for example, expansion by the top 3 terms at feedback iteration 1, the top 10 terms at feedback iteration 2, etc. The best simulation for each query was taken to be a measure of the best IQE decisions that could be made by a user; the potential effectiveness of IQE.

2.2 Limitations
One of the benefits of an approach such as that taken by Magennis and Van Rijsbergen is that it is possible to isolate the effect of query expansion itself. That is, by eliminating the effects of individual searchers and search interfaces, the results can be used as baseline figures with which to compare user search effectiveness. However, in [11, 12], Magennis noted several limitations of this particular measurement of the potential effectiveness of IQE.

i. Only certain combinations of terms are considered, i.e. the top 3, 6, 10 or 20 terms. Other combinations of terms are possible, e.g. the top 4 terms, and these may give better retrieval performance.

ii. Real searchers are unlikely to add a consecutive set of expansion terms, i.e. the top 3, 6, 10 or 20 terms suggested by the system. It is more likely that searchers will choose terms from throughout the list of expansion terms. In this way, users can, for example, avoid poor expansion terms suggested by the system.

iii. The ranking of expansion terms is based on information from the unseen relevant documents: ones that the user has not yet viewed. In a real search environment the expansion terms will be ranked based on their presence or absence in the documents seen and assessed relevant by the user.

iv. Only one document collection was used. Differences in the creation of test collections, the search topics used and the documents present in the test collection may affect the results and restrict the generality of their conclusions.

2.3 Aims of study
In our experiments we aim to overcome these limitations to create a more realistic evaluation of the potential effectiveness of interactive query expansion. In particular we aim to investigate how good IQE could be, how easy it is to make good IQE decisions, and guidelines for helping users make good IQE decisions. We also investigate what kind of IQE decisions are actually made by searchers when selecting new search terms. In the following section we describe how we obtain the query expansion results analysed in the first part of this paper. These experiments are also based on simulations of interactive query expansion decisions.

3. EXPERIMENTAL SETUP
In this section we outline the experimental methodology we used to simulate query expansion decisions. The experiments themselves were carried out on the Associated Press (AP 1998), San Jose Mercury News (SJM 1991), and Wall Street Journal (WSJ 1990-1992) collections, details of which are given in Table 1. These collections come from the TREC initiative [16].

Table 1: Collection statistics

                                     AP      SJM     WSJ
Number of documents                  79919   90257   74520
Number of queries used               32      39      28
Average words per query¹             2.9     3.7     2.9
Average number of relevant
documents per query                  37.8    58.6    30.3

¹ Queries used in the experiments only. The query comes from the short title field of the TREC topic description.

For each query we use the top 25 retrieved documents to provide a list of possible expansion terms, as described below. Although each collection comes with a list of 50 topic (query) descriptions, we concentrate on those queries where query expansion could change the effectiveness of an existing query. This meant excluding some queries from each test collection: those queries for which there are no relevant documents, queries where no relevant documents were retrieved in the top 25 documents (as no expansion terms could be formed without at least one relevant document), and queries where all the relevant documents are found within the top 25 retrieved documents (as query expansion will not cause a change in retrieval effectiveness for these queries).

In our experiments we used the wpq method of ranking terms for query expansion [13], as this has been shown to give good results for both AQE and IQE [4].
The equation for calculating a weight for a term using wpq is shown below, where rt = the number of seen relevant documents containing term t, nt = the number of documents containing t, R = the number of seen relevant documents for query q, and N = the number of documents in the collection.

    wpq_t = log[ (rt / (R − rt)) / ((nt − rt) / (N − nt − R + rt)) ] × ( rt/R − (nt − rt)/(N − R) )

Our procedure was as follows. For each query:

i. Rank the documents using a standard tf*idf weighting to obtain an initial ranking of the documents.

ii. Use the relevant documents in the top 25 retrieved documents to obtain a list of possible expansion terms, using the wpq formula to rank the expansion terms.

iii. Using the top 15 expansion terms, create all possible sets of expansion terms. For each query this gives 32 768 possible sets of expansion terms. This simulates all possible selections of expansion terms using the top 15 terms, including no expansion of the query. Each of these 32 768 sets of terms represents a possible IQE decision that could be made by a user.

iv. Using each combination of expansion terms, add the combination to the original query and use the new query to rank the documents, again using tf*idf. That is, for each query, we carry out 32 768 separate versions of query expansion.

v. Calculate the recall-precision values for each version of the query. Here we use a full-freezing approach by which we only re-rank the unseen documents – those not used to create the list of expansion terms. This is a standard method of assessing the performance of a query expansion technique based on relevance information [3].

We only use the top 15 expansion terms for query expansion as this is a computationally intensive method of creating possible queries. In a real interactive situation users may be shown more terms than this. However, it does allow us to concentrate on those terms that are considered by the system to be the best for query expansion.

For each query in each collection, therefore, we have a set of 32 768 possible IQE decisions that could be made by a searcher. For each possible IQE decision we can assess the effect of making this decision on the quality of the expanded query. We use this information in several ways: firstly, in section 4, we compare the possible IQE decisions against three methods of applying AQE. We then, in section 5, examine potential strategies for helping searchers make good IQE decisions. In section 5 we also compare the possible IQE decisions against human expansion decisions.

4. COMPARING QUERY EXPANSION TECHNIQUES
In this section we examine the potential effectiveness of IQE against three possible strategies for applying AQE, comparing how likely a user is to make better query expansion decisions using IQE than allowing the system to perform AQE. Our three AQE techniques are:

Collection independent expansion. A common approach to AQE is to add a fixed number of terms, n, to each query. Our first AQE technique simulates this by adding the top six expansion terms to all queries, irrespective of the collection used. The value of six was chosen without prior knowledge of the effectiveness of adding this number of terms to any of the queries in the test collections used.

Collection dependent expansion. The previous approach to AQE adds the same number of expansion terms to all queries in all collections. When using a specific test collection we can calculate a better value of n: one that is specific to the test collection used. To calculate n, for each collection, we compared the average precision over all the queries used in each collection after the addition of the top n expansion terms, where n varied from 1 to 15. The value of n that gave the optimal value of average precision for the whole query set was taken to be the value of n for each query in the collection.

These values could not be calculated in an operational environment, where the full set of queries that will be submitted is not known in advance. However, this gives a stricter AQE baseline measure as the value of n is optimal for the collection used. The values for n are shown in Table 2, and are higher than the six terms added in the previous strategy.

Table 2: Optimal values of n

Collection   AP   SJM   WSJ
n            15   15    13

Query dependent expansion. The collection dependent expansion strategy adds a fixed number of terms to each query within a test collection. This is optimal for the entire query set but may be sub-optimal for individual queries, i.e. some queries may give better retrieval effectiveness for greater or smaller values of n. The query dependent expansion strategy calculates which value of n is optimal for individual queries. This may be implemented in an operational retrieval system by, for example, setting a threshold on the expansion term weights.

These three AQE methods act as baseline performance measures for comparing AQE with IQE.

4.1 Query expansion vs. no query expansion
We first compare the effect of query expansion against no query expansion: how good are different approaches to query expansion? In Table 3 we compare the AQE baselines against no query expansion, the performance of the original query with no additional query terms. Specifically, we compare how many queries in each collection give higher average precision than no query expansion: the percentage of queries that are improved by each AQE strategy. Also included in this table are the average precision figures given by applying the techniques.
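Steps ii and iii above – scoring candidate terms with wpq and enumerating every subset of the top 15 – can be sketched as follows. The function mirrors the smoothing-free equation in section 3 (in practice a small smoothing constant is often added to each count), and the counts in the toy call are invented for illustration.

```python
import itertools
import math

def wpq(r_t, n_t, R, N):
    """wpq weight for one term: the log relevance odds ratio times the
    difference between the term's rate of occurrence in relevant and
    non-relevant documents (see the equation in section 3)."""
    odds = (r_t / (R - r_t)) / ((n_t - r_t) / (N - n_t - R + r_t))
    return math.log(odds) * (r_t / R - (n_t - r_t) / (N - R))

def all_iqe_decisions(terms):
    """Every subset of the candidate terms, including the empty set
    (no expansion). With 15 terms this yields 2**15 = 32 768 sets."""
    decisions = []
    for k in range(len(terms) + 1):
        decisions.extend(itertools.combinations(terms, k))
    return decisions

# Toy figures: 3 of 5 seen relevant documents contain the term,
# 50 of 10 000 collection documents contain it.
print(wpq(r_t=3, n_t=50, R=5, N=10000))
print(len(all_iqe_decisions(range(15))))
```

Each subset returned by `all_iqe_decisions` would then be appended to the original query and re-run, which is why the experiment is computationally intensive and stops at 15 terms.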
As can be seen, all AQE strategies were more likely, on average, to improve a query than to harm it. That is, all techniques improved at least 50% of the queries where query expansion could make a difference to retrieval effectiveness.

The automatic strategy that is most specific to the query, the query dependent strategy, not only improves the highest percentage of queries – it is the most stable – but also gives the highest average precision over the queries – it is the most effective. Conversely, the automatic strategy that is least effective and improves the fewest queries is the one that is least tailored to either the query or the collection – the collection independent strategy.

Table 3: AQE baselines and example IQE decisions (% queries improved / average precision)

Baseline                 AP           SJM          WSJ
Collection independent   56% / 18.8   72% / 23.8   50% / 18.4
Collection dependent     72% / 19.0   79% / 24.8   53% / 18.6
Query dependent          75% / 20.1   90% / 26.7   86% / 21.1
IQE best                 94% / 22.3   97% / 29.1   96% / 22.4
IQE middle               31% / 18.4   38% / 22.9   30% / 18.1
IQE worst                 0% / 11.9    0% / 15.8    0% / 14.0

We can compare these decisions against possible IQE decisions. Firstly, in row 5 of Table 3, we show the percentage of queries improved, and the average precision obtained, when using the best IQE decision for each query. This set of figures gives the best possible results on each collection when using query expansion: the highest potential performance of IQE using the top 15 expansion terms.

Comparing the performance of the best IQE decision against the AQE decisions, it can be seen that IQE has the potential to be the most stable technique overall in that it improves the most queries. It also has the potential to be the most effective query expansion technique as it gives the highest overall average precision. However, this is only a potential benefit; as we shall show in the remainder of this paper, it may not be easy for a user to select such an optimal set of terms.

For example, in row 6 of Table 3 we show the performance of a middle-performing IQE decision. This is obtained, for each query, by ranking the average precision of all 32 768 possible IQE decisions and selecting the IQE decision at position 16 384 (half way down the ranking). This decision is one that would be obtained if a user made query expansion decisions that were neither good nor poor compared to other possible decisions. This result shows that even fair IQE decisions can perform relatively poorly, improving less than half of the queries and giving poorer retrieval effectiveness than any of the AQE strategies.

Finally, in row 7 of Table 3, we show the effect if a user was consistently making the worst IQE decisions possible, i.e. always choosing the combination of expansion terms that gave the lowest average precision of all possible decisions. Even though a user is unlikely to always make such poor decisions, these decisions are being made on terms selected from the top 15 expansion terms. So, although IQE can be effective, it is a technique that needs to be applied carefully. In the next section we examine how likely a user is to make a good decision using IQE.

4.2 AQE vs. IQE
In this section we look at how difficult it is to select a set of expansion terms that will perform better than AQE or no query expansion. We do this by comparing how many of the possible IQE decisions will give better average precision than the AQE baselines. In Table 4 we show the results of this analysis. For each collection we show how many possible IQE decisions gave greater average precision than each of the three baselines (first figure in each cell) and how many of the decisions gave a significantly higher average precision than the baselines (second figure in each cell)².

Table 4: Percentage of combinations better than baselines

Baseline                 AP          SJM         WSJ
No expansion             59% / 30%   69% / 38%   53% / 21%
Collection independent   45% / 9%    36% / 11%   41% / 12%
Collection dependent     47% / 13%   35% / 9%    43% / 8%
Query dependent           9% / 1%    10% / 1%    10% / 2%

What we are trying to uncover here is how likely a user is to make good IQE decisions over a range of queries. The argument for IQE, based on this analysis, is mixed. On the positive side, over 50% of the possible IQE decisions give better performance than no query expansion, and over 20% of the possible decisions give significantly better performance (row 2). However, this also means that nearly half of the possible decisions will decrease retrieval performance³ and most decisions will not make any significant difference to the existing query performance.

Compared against the best AQE strategy (query dependent), only a small percentage (9-10%) of possible decisions are likely to be better than allowing the system to make the query expansion decisions.

² Measured using a t-test (p < 0.05), holding recall fixed and varying precision. Values were calculated on the set of recall-precision figures for each query, not the averaged value.
³ A small percentage (1%-3%) of possible decisions will neither increase nor decrease query performance.
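The per-query significance test described in footnote 2 – comparing two sets of precision figures at matched recall points – can be sketched with a paired t-statistic. The precision values below are invented for illustration, and a real analysis would also look up the corresponding p-value (e.g. via a statistics library) rather than stopping at the statistic.

```python
import math

def paired_t(xs, ys):
    """Paired t-statistic over matched observations, e.g. precision
    at the same recall points for two versions of a query."""
    diffs = [x - y for x, y in zip(xs, ys)]
    n = len(diffs)
    mean = sum(diffs) / n
    var = sum((d - mean) ** 2 for d in diffs) / (n - 1)
    return mean / math.sqrt(var / n)

# Invented precision figures at four matched recall points.
expanded = [0.50, 0.40, 0.35, 0.30]
baseline = [0.40, 0.35, 0.15, 0.15]
print(round(paired_t(expanded, baseline), 2))
```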
Based on this analysis it appears that it may be hard for users to make very good IQE decisions: ones that are better than a good AQE technique.

The collection independent strategy is the most realistic default AQE approach as it assumes no knowledge of collections or queries. However, although 35%-45% of possible IQE decisions are better than the collection independent strategy, this still means that searchers are more likely to make a poorer query expansion decision than the system. This is only true, however, if users lack any method of selecting good combinations of expansion terms. In the next section we analyse potential guidelines that could be given to users to help them make good IQE decisions.

5. POSSIBLE GUIDELINES FOR IQE
In this section we try to assess possible instructions that could be given to users to help them make use of IQE as a general search technique.

5.1 Select more terms
One reason for asking users to engage in IQE is to give more evidence to the retrieval system regarding the information for which they are looking. Users, especially in web searches, often use very short queries [9]. Presenting lists of possible expansion terms is one way to get users to give more information, in the form of query words, to the system.

A useful guideline to give to users, then, may be to expand the query with as many useful terms as possible. In Table 5 we compare the size of IQE decisions that led to an increase in retrieval effectiveness (good IQE decisions, Table 5, row 4) against those that led to a decrease in retrieval effectiveness (poor IQE decisions, Table 5, row 5). As can be seen, the size of the query expansion does not distinguish good decisions from poor decisions.

The size of the best IQE decisions (the average size of the combinations that gave the best average precision) is similar to the average size of both the good and poor combinations (Table 5, row 3). The average sizes of the best AQE decisions are also within a similar range (Table 5, row 2). So giving the system more evidence does not necessarily gain any improvement in effectiveness.

Table 5: Average size of query expansions

                  AP     SJM    WSJ
Query dependent   6.63   7.10   8.46
IQE best          7.29   5.56   7.16
IQE good          7.35   7.60   7.27
IQE poor          7.61   7.27   7.44

5.2 Trust the system
A second approach might be to advise users to concentrate on the terms suggested most strongly by the system. These are terms that are calculated by the system to be the most likely to improve a query, and in our experiment are the terms with the highest wpq score. In Table 6, we present the average wpq value of the terms chosen in good and poor IQE decisions, and also in the best IQE and AQE strategies.

The average wpq value for terms in good (row 4) and poor IQE decisions (row 5) is relatively similar. This means that sets of terms with high wpq values are not more likely to give good performance than sets of terms with lower wpq values.

The average value for the best AQE decisions (row 2) is generally higher than that of the IQE decisions. This, however, results in part from the fact that the query dependent AQE strategy adds a consecutive set of terms taken from the top of the expansion term ranking. As these terms are at the top of the term ranking, they will naturally have a higher wpq value.

The average term score for the best IQE decision (row 3) is also higher than either the good or poor IQE decisions, so there is some merit in choosing terms that the system recommends most highly – those with high wpq values.

Table 6: Average wpq of terms chosen

                  AP     SJM    WSJ
Query dependent   2.20   2.11   2.39
IQE best          2.94   1.91   2.26
IQE good          1.92   1.71   1.70
IQE poor          1.93   1.70   2.12

However, the lack of difference between the good and poor IQE decisions means we cannot simply recommend that the user concentrate more closely on the terms suggested by the system. That is, highly scored terms are useful, but the user must apply some additional strategy to select which of these terms to use for query expansion.

5.3 Use semantics
One of the more intuitive arguments in favour of IQE is that, unlike statistically-based query expansion techniques, humans can exploit semantic relationships for retrieval. That is, people can recognise expansion terms that are semantically related to the information they are seeking and expand the query using these terms. However, investigations such as the one presented in [2] indicate that searchers can find it difficult to use semantic information even when the system supports the recognition and use of semantic relationships.

Consequently, in this section we outline a small pilot experiment designed to compare system recommendations of term utility against human assessments of the same terms.

5.3.1 System analysis of expansion term utility
The system, or automatic, analysis of an expansion term is based on the overall impact of adding that term to all possible IQE decisions that do not already contain the term. That is, we estimate the likely impact of adding a new expansion term t to an existing set of expansion terms.

For each query, each expansion term t belongs to 50% (16 384) of the possible IQE decisions (and does not belong to the other 50% of possible decisions, including no query expansion).
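The system-side analysis of section 5.3.1 – pairing each decision that lacks a term t with the decision formed by adding t, and tallying improvements, losses and ties – can be sketched as follows; the average-precision function here is an invented stand-in for the real retrieval runs.

```python
from itertools import combinations

def classify_term(t, terms, avg_precision):
    """Compare every decision without t against the same decision plus
    t, and count whether adding t improved, worsened or made no
    difference to average precision (the method of section 5.3.1)."""
    others = [x for x in terms if x != t]
    improved = unchanged = worsened = 0
    for k in range(len(others) + 1):
        for combo in combinations(others, k):
            without = frozenset(combo)
            delta = avg_precision(without | {t}) - avg_precision(without)
            if delta > 0:
                improved += 1
            elif delta < 0:
                worsened += 1
            else:
                unchanged += 1
    return improved, unchanged, worsened

# Stand-in scoring: 'jfk' always helps, 'frenchi' always hurts.
def toy_ap(decision):
    return 0.2 + 0.1 * ("jfk" in decision) - 0.1 * ("frenchi" in decision)

terms = ["jfk", "frenchi", "warren"]
print(classify_term("jfk", terms, toy_ap))
print(classify_term("warren", terms, toy_ap))
```

With 15 terms per query, `others` has 14 elements and the loop visits the 16 384 decisions that exclude t, matching the 50/50 split described above.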
effect these two sets of possible decisions are identical except as terms are oswald, dealei, kwitni, theori, documentari, frenchi
relates to t: adding t to each IQE decision in the latter set would and bulletin, and the neutral terms are warren, theorist and
give an IQE decision in the former set. By comparing the belin. The term tippit is good and poor for an equal percentage
average precision of all IQE decisions that contain t, with the of combinations and cannot be classified.
corresponding decisions that do not contain t, we can classify
each of the top 15 expansion terms according to whether they 5.3.2 Human analysis of expansion term utility
are good, neutral or poor expansion terms. Good terms are those The automatic classification of expansion term utility presented
that are likely to improve the performance of a possible IQE in the previous section was compared against a set of human
decision (a set of expansion terms); neutral ones are those that classification of the same expansion terms.
generally make no difference and poor expansion terms are We selected 8 queries from each collection and asked 3
those that are likely to decrease the performance of a set of human subjects to read the whole TREC topic and each of the
expansion terms. relevant documents found within the top 25 retrieved
We demonstrate this in Table 7, based on the TREC topic documents. These were the relevant documents used to create
259 ‘New Kennedy Assassination Theories’ run on the AP collection. Each row shows what percentage of the 16384 possible decisions, not already containing the term in column 1, are improved, worsened, or show no difference after the addition of the term. For example, the addition of the term jfk will always improve retrieval effectiveness. That is, adding the term jfk to any set of expansion terms will increase retrieval effectiveness. Conversely, adding the term frenchi will always reduce retrieval effectiveness, and the addition of the term warren4 will make no difference.

4 From the Warren Commission which investigated the assassination of President Kennedy. This term and the term theori are the only ones to appear in the TREC topic description.

Table 7: Addition of expansion terms for TREC topic 259

Term         Improved   No difference   Worsened
jfk          100        0               0
oswald       3          0               97
dealei       37         4               59
kwitni       29         0               71
motorcad     64         4               32
marcello     100        0               0
warren       0          100             0
theorist     0          100             0
theori       18         0               82
depositori   67         0               33
documentari  40         19              41
belin        0          100             0
tippit       46         8               46
frenchi      0          0               100
bulletin     45         0               55

For simplicity, we classify terms simply by their predominant tendency. For the example in Table 7 the good terms are jfk, motorcad, marcello, and depositori. The poor terms are those whose addition predominantly worsens retrieval effectiveness.

the list of the top 15 expansion terms in the previous experiment. The subjects were given the full TREC topic description to provide some context for the search, and were shown the initial query that retrieved the documents. The subjects were then presented with the top 15 expansion terms. For each expansion term the subjects were asked whether they felt the term would be useful or not useful at retrieving additional relevant documents when added to the existing query5.

5 If the subjects could not decide whether the term was useful/not useful, they could assign the term to the category ‘cannot decide’.

We asked each subject to assess each of the 24 queries rather than distributing the queries across multiple subjects. This was to preserve any strategies the individual users may be employing when selecting expansion terms [8]. However, we did not ask the subjects to read the non-relevant retrieved documents as we felt this was too great a burden on the subjects.

The subjects’ selection of expansion terms was compared against the automatic analysis from section 5.3.1, comparing the system classification of expansion term utility with the human classification. The comparison was done in three ways: first we examine how good the subjects are at detecting good expansion terms (section 5.3.2.1), then how good they are at eliminating poor expansion terms (section 5.3.2.2), and finally the decisions made by the subjects (section 5.3.2.3).

5.3.2.1 Detecting good expansion terms
We first examine whether each subject can detect good expansion terms; that is, whether the subjects can recognise the expansion terms that are likely to be useful in combination with other expansion terms.

Table 8: Percentage of good expansion terms detected

      Subject 1   Subject 2   Subject 3
AP    73%         60%         63%
SJM   50%         40%         42%
WSJ   62%         32%         45%

In Table 8 we show the percentage of the good expansion terms, as classified in section 5.3.1, which were chosen by each subject as being possibly useful for query expansion.
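The automatic analysis from section 5.3.1 — testing each candidate term against every possible combination of the other 14 terms, giving 2^14 = 16384 decisions per term — can be sketched as follows. This is a minimal illustration rather than the experimental code: the `effectiveness` argument stands in for running the expanded query and computing a retrieval evaluation measure, and the toy scoring function is purely hypothetical.

```python
from itertools import combinations

def tally_term_additions(terms, effectiveness):
    """For each candidate term, compare the effectiveness of every subset
    of the *other* terms with and without the candidate added.
    Returns {term: (% improved, % no difference, % worsened)}."""
    results = {}
    for t in terms:
        others = [x for x in terms if x != t]
        improved = worsened = same = 0
        # Enumerate every subset of the remaining terms (2^14 with 15 terms).
        for r in range(len(others) + 1):
            for subset in combinations(others, r):
                before = effectiveness(set(subset))
                after = effectiveness(set(subset) | {t})
                if after > before:
                    improved += 1
                elif after < before:
                    worsened += 1
                else:
                    same += 1
        total = improved + same + worsened
        results[t] = (100 * improved / total,
                      100 * same / total,
                      100 * worsened / total)
    return results

# Toy scoring function: 'jfk' always helps, 'frenchi' always hurts.
def toy_effectiveness(query_terms):
    return ('jfk' in query_terms) - ('frenchi' in query_terms)

print(tally_term_additions(['jfk', 'frenchi', 'oswald'], toy_effectiveness))
```

On the toy function, jfk is improved in 100% of decisions and frenchi worsened in 100%, mirroring the behaviour described for Table 7; classifying each term by its predominant tendency then yields the good/poor labels compared against the subjects.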
The subjects varied in their ability to identify good expansion terms, being able to identify between 32% and 73% of the good expansion terms.

5.3.2.2 Eliminating poor expansion terms
If the subjects are not always good at detecting good expansion terms, perhaps they are better at eliminating poor expansion terms? In Table 9 we show the percentage of expansion terms that were assessed as being poor by the system but good by the subjects. As in the previous section, the subjects’ ability to correctly classify expansion terms varied, with at least 25% of the poor expansion terms being rated as good by the subjects. The implication here is that subjects may have difficulty spotting poor expansion terms.

Table 9: Percentage of poor expansion terms classified as good by subjects

      Subject 1   Subject 2   Subject 3
AP    54%         36%         43%
SJM   39%         26%         35%
WSJ   38%         45%         39%

One reason for the poor classification of terms may be that the subjects are only choosing certain types of terms. In Table 10 we compare the cases where the system classification (column 2) agreed or disagreed with the subjects’ classification (column 3) of terms.

Table 10: Comparison of system and subject classification

     System  User   S1          S2          S3
AP   Good    Good   692 (6.5)   570 (6.22)  666 (5.3)
     Poor    Good   622 (4.3)   914 (4.48)  601 (4.2)
     Good    Poor   429 (4.5)   830 (4.14)  578 (4.1)
     Poor    Poor   142 (1.7)   223 (1.8)   178 (1.6)
SJM  Good    Good   1321 (7.5)  1831 (7.3)  1542 (7.7)
     Poor    Good   766 (3.7)   867 (3.7)   802 (3.7)
     Good    Poor   390 (2.8)   405 (3.8)   397 (3.5)
     Poor    Poor   53 (1.4)    253 (1.9)   179 (1.6)
WSJ  Good    Good   833 (5.2)   204 (2.2)   765 (4.5)
     Poor    Good   1496 (3.9)  682 (2.8)   881 (3.3)
     Good    Poor   285 (2.6)   598 (4.0)   270 (2.7)
     Poor    Poor   427 (1.8)   966 (3.0)   470 (2.3)

For each case we give the average collection occurrence of the terms and (the figure in parentheses) their average occurrence within the relevant documents. For example, for the terms on which subject 1 and the system agreed that the terms were useful, these terms appeared in an average of 692 documents in the AP collection and an average of 6.5 relevant documents.

Appearing in many relevant documents appears initially to correlate with an assessment of good expansion term utility. However, the difference in relevant document occurrence between the good/poor and poor/good misclassifications is often slight.

The most apparent pattern from Table 10 is that subjects tend to classify terms with a high collection frequency as being good expansion terms. Conversely, terms with a low collection frequency are likely to be assessed as being poor expansion terms. This is not a universal pattern (Subject 2 on the WSJ collection, for example, does the opposite) but it is the main pattern, and suggests that searchers may not be assessing which terms are useful but which terms are recognisable.

5.3.3 Subjects’ reasons for expansion term selection
We discussed with each subject their reasons for their classification of expansion terms. Based on the subjects’ reasons for classification and the later automatic classification, we can suggest three reasons for the misclassification of expansion term utility.

i. Statistical relationships are important as well as semantic ones. Subjects tended to ignore terms if the terms appeared to have been suggested for purely statistical reasons, e.g. numbers. In general this may be a sensible approach if the query does not mention specific numbers or dates. However, the documents in the static collections we used are only a sample of the possible documents on the topics investigated. In this case, strong statistical relationships may be useful for future retrieval.

ii. Users cannot always identify semantic relationships. Making good use of semantic information means being able to identify semantic relationships between the information need and the possible expansion terms. For specialised or unusual terms, the subjects could be unsure of the value of these terms unless the relationship between these terms and the information need was made clear in the documents.

However, being able to recognise why expansion terms have been suggested, and the searcher’s ability to classify terms as useful or not, does not necessarily guarantee that the terms themselves will be seen as useful. Rather, we propose that searchers need more sophisticated support in assessing the potential quality of expansion terms.

iii. Users cannot always identify useful semantic relationships for retrieval. The difficulty most subjects experienced with selecting expansion terms is that, although they felt they could identify obvious semantic relationships, they could not identify which semantic relationships were going to attract more relevant documents. In short, the subjects felt they could not identify the effect of individual expansion terms on future retrieval. Instead the subjects concentrated mainly on terms they viewed as safe: those that were semantically related to the topic description rather than the retrieved relevant documents. That is, the subjects tended to concentrate on terms for new queries rather than modified or refined queries.

This type of decision-making can also be seen in other investigations, e.g. [15], which demonstrated that, although terms suggested from relevant documents can be useful terms, they are often not used as a main source of additional search terms.
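The frequency comparison behind Table 10 — averaging collection frequency and relevant-document frequency over each system/subject agreement category — can be sketched as below. This is an illustrative reconstruction, not the paper's code: the dictionary field names (`system`, `subject`, `coll_freq`, `rel_freq`) and the sample data are invented for the example.

```python
from collections import defaultdict

def classification_profile(terms):
    """For each (system classification, subject classification) pair,
    return (average collection frequency, average frequency in relevant
    documents) over the expansion terms in that category, as in Table 10."""
    groups = defaultdict(list)
    for t in terms:
        groups[(t['system'], t['subject'])].append(t)
    profile = {}
    for key, members in groups.items():
        n = len(members)
        avg_coll = sum(t['coll_freq'] for t in members) / n
        avg_rel = sum(t['rel_freq'] for t in members) / n
        profile[key] = (round(avg_coll, 1), round(avg_rel, 1))
    return profile

# Tiny invented input, not the paper's data.
sample = [
    {'system': 'Good', 'subject': 'Good', 'coll_freq': 700, 'rel_freq': 6.6},
    {'system': 'Good', 'subject': 'Good', 'coll_freq': 684, 'rel_freq': 6.4},
    {'system': 'Poor', 'subject': 'Good', 'coll_freq': 622, 'rel_freq': 4.3},
]
print(classification_profile(sample))
```

Comparing the average collection frequency across the subject's Good and Poor categories is then a direct test of the recognisability hypothesis: if subjects select on familiarity, the rows the subject labelled Good will show markedly higher collection frequencies.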
In a real interactive environment users can, of course, try out expansion terms, or add their own new terms, and see the effect on the type of documents retrieved. However, the lack of connection between the expansion terms and the documents used to provide those terms indicates that searchers may need more support in how to use query expansion as a general interactive technique.

6. CONCLUSIONS
In this paper we examined the potential effectiveness of interactive query expansion. This is mainly a simulation experiment and is intended to supplement rather than replace experimental investigations of real user IQE decision-making. There are several limitations to this work: for example, we only concentrated on altering the content of the query; future investigations will compare the results obtained here with those obtained when we use relevance weighting in addition to query expansion. We also do not differentiate between queries, although the success of query expansion can vary greatly across queries. We will consider this in future work; our intention here is to investigate the general applicability of query expansion.

The experimental results initially provided a comparison between AQE and IQE techniques. From Table 3, section 4.1, IQE has the potential to be an effective technique compared with AQE. One of the main claims for IQE is that searchers can be more adept than the system at identifying good expansion terms. This may be particularly true for certain types of search; e.g. in [6] Fowkes and Beaulieu showed that searchers preferred IQE when dealing with complex query statements. Subjects may also be better at targeting specific aspects of the search, i.e. focussing on parts of their information need.

However, the analyses presented here show that the potential benefits of IQE may not be easy to achieve. In particular, searchers have difficulty identifying useful terms for effective query expansion. The implication is that simple term presentation interfaces do not provide sufficient support and context to allow good query expansion decisions. Interfaces must support the identification of relationships between relevant material and suggested expansion terms, and should support the development of good expansion strategies by the searcher.

7. REFERENCES
[1] Beaulieu, M. Experiments with interfaces to support query expansion. Journal of Documentation. 53. 1. pp 8-19. 1997.
[2] Blocks, D., Binding, C., Cunliffe, D., and Tudhope, D. Qualitative evaluation of thesaurus-based retrieval. Proceedings of the 6th European Conference on Digital Libraries. Rome. Lecture Notes in Computer Science 2458. pp 346-361. 2002.
[3] Chang, Y. K., Cirillo, C., and Razon, J. Evaluation of feedback retrieval using modified freezing, residual collection & test and control groups. The SMART retrieval system - experiments in automatic document processing. G. Salton (ed). Chapter 17. pp 355-370. 1971.
[4] Efthimiadis, E. N. User-choices: a new yardstick for the evaluation of ranking algorithms for interactive query expansion. Information Processing and Management. 31. 4. pp 605-620. 1995.
[5] Efthimiadis, E. N. Query expansion. ARIST Volume 31: Annual Review of Information Science and Technology. Martha E. Williams (ed). 1996.
[6] Fowkes, H., and Beaulieu, M. Interactive searching behaviour: Okapi experiment for TREC-8. Proceedings of IRSG 2000: the 22nd Annual Colloquium on Information Retrieval Research. Cambridge. 2000.
[7] Harman, D. Towards interactive query expansion. Proceedings of the 11th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. pp 321-331. Grenoble. 1988.
[8] Iivonen, M. Consistency in the selection of search concepts and search terms. Information Processing and Management. 31. 2. pp 180-186. 1995.
[9] Jansen, B. J., Spink, A., and Saracevic, T. Real life, real users, and real needs: A study and analysis of user queries on the web. Information Processing and Management. 36. 2. pp 207-227. 2000.
[10] Koenemann, J., and Belkin, N. J. A case for interaction: a study of interactive information retrieval behavior and effectiveness. Proceedings of the Human Factors in Computing Systems Conference (CHI '96). pp 205-212. Zurich. 1996.
[11] Magennis, M. The potential and actual effectiveness of interactive query expansion. PhD thesis. University of Glasgow. 1997.
[12] Magennis, M., and van Rijsbergen, C. J. The potential and actual effectiveness of interactive query expansion. Proceedings of the 20th Annual International ACM SIGIR Conference on Research and Development in Information Retrieval. pp 324-332. Philadelphia. 1997.
[13] Robertson, S. E. On term selection for query expansion. Journal of Documentation. 46. 4. pp 359-364. 1990.
[14] Robertson, S. E., and Sparck Jones, K. Relevance weighting of search terms. Journal of the American Society for Information Science. 27. 3. pp 129-146. 1976.
[15] Spink, A., and Saracevic, T. Interaction in information retrieval: Selection and effectiveness of search terms. Journal of the American Society for Information Science. 48. 8. pp 741-761. 1997.
[16] Voorhees, E. M., and Harman, D. Overview of the sixth text retrieval conference (TREC-6). Information Processing and Management. 36. 1. pp 3-35. 2000.