Integrating: H.Ievaiim
MANAGEMENT AND INFORMATION RETRIEVAL NEEDS.
NATURAL LANGUAGE PROCESSING PLAYS AN IMPORTANT ROLE,

port specialists share their knowledge and experiences with other specialists by writing up their problems and solutions and making them available in a worldwide on-line information base. By sharing this knowledge among a large body of physically distributed support engineers, customer problems that have been previously encountered can be resolved much more quickly, thereby increasing customer satisfaction and decreasing Digital's cost of providing service.

Over the past decade, the range of products supported by Digital's specialists has grown steadily and now includes non-Digital products as well. This growth has been matched by a massive increase in the size of the on-line information base, which has surged from less than 10,000 articles into the hundreds of thousands.

As the size and importance of the database has grown, the customer support centers have improved their software's data management and information retrieval capabilities. In the early 1980s, articles were indexed with a controlled vocabulary and retrieved using Boolean expressions over this vocabulary. By the mid-1980s, however, this system was replaced by a full-text system, Stars, which proved much more effective. In a full-text system, all the words in a text are available as keywords for querying. Users do not have to learn a controlled vocabulary, and they can express their queries in natural language rather than the syntax of Boolean expressions.

Stars did not try to "understand" the natural language queries, nor did it simply match the query's tokens with articles containing those strings. Instead, it truncated words to remove suffixes, eliminated noise words like "a" and "of," added synonyms, and converted the results into Boolean expressions for matching against articles in the database. It could also remove terms from a query to broaden it if the original search resulted in no hits.

On the whole, this approach to information retrieval has worked well for specialists hunting down problem-solving information. However, the rapid growth in the size of databases has contributed to an increase in the number of false hits retrieved during searches, as well as an increase in the need to reformulate queries to guide the search. Particularly frustrating were cases in which a specialist knew an article existed but couldn't easily craft a query that would retrieve it. Problems like this led Digital's Stars development group to contact us at the company's AI Technology Center to explore the applicability of AI technologies to this critical component of customer service and support. Intrigued by the possibilities for exploiting expert systems, intelligent user/system dialogues, blackboard-based integration of knowledge sources, and natural language processing, we eagerly accepted the challenge and christened our collaboration the AI-Stars project.

Interviewing the users

Information retrieval research in the 1980s had already pursued a number of AI-influenced approaches. The Rubric system, for example, cast queries as rules for building up evidence about a document’s relevance.¹ I3R used a blackboard architecture to orchestrate a dialogue with the user, elaborate the user’s information needs, and find appropriate documents.² Natural language processing was used to automate the role of an expert intermediary in interactive document retrieval.³ Statistical methods had also emerged as attractive alternatives to Boolean and controlled vocabulary approaches; for example, Wide Area Information Servers⁴ drew on the vector space model, in which queries and documents are represented as weighted term vectors, allowing for the computation of statistical similarity metrics between them.⁵ Such systems work best with long queries, and it is possible to use the text of one article as a query to find others similar to it.

One of the conclusions researchers had come to after exploring this wide range of approaches was that no one retrieval method is likely to meet the needs of all users.⁶ Not only do the methods tend to retrieve different sets of relevant documents, but users themselves come with different task orientations, interaction preferences, skill levels, and so on. We therefore began our AI-Stars investigation by interviewing members of our user population and watching them as they performed their daily work of computer troubleshooting.

We had discovered many different styles of on-line searching. Some specialists liked to enter a broad (one- or two-word) query and then, based on the size of the resulting set, further restrict the search expression. Others started with longer queries of perhaps half a dozen terms and removed or added related terms if the initial query appeared to miss its mark. Users tended to understand that their natural language expression was transformed into a Boolean expression and, on some occasions, would resort to using optional explicit Boolean operators, including negation. The object of their task, to respond as quickly as possible to a specific customer inquiry, meant that they were primarily interested in “precision” searching: finding the one article that addressed a particular problem, as opposed to all the articles that might be even peripherally related.

For the most part, users reported liking the Stars system and appeared to be effective at using it. However, their confidence was shaken when they had trouble locating an article that they knew to exist; and they had little way of knowing, when a search failed, whether that meant there was no article that met their need or whether further searching was likely to be fruitful.

Users also reported some trouble coming to grips with the internal heuristics used by Stars to convert natural language queries into search expressions. Indeed, each of Stars’ internal query-processing strategies could in some cases lead to poor results and confuse users:

• Stemming sometimes reduced unrelated words to the same stem.
• Some apparent noise words were in fact content words in certain contexts.
• Automatic use of synonyms sometimes broadened the query in inappropriate ways.
• Users intended certain combinations of words in their queries to be contiguous phrases rather than independent search terms.
• The strategy for automatically broadening a query with no hits did not always do the “right” thing.

Perhaps reacting to these kinds of experiences, the users we interviewed were wary of many of the “enhancements” we initially proposed. Intelligent automated query reformulation and statistical nearest neighbor matching methods were perceived as potentially making the system even more difficult to understand. They argued that users should be able to readily comprehend and easily override any “intelligence” built into the system. Furthermore, their practical need to perform searches quickly argued against a user interface that engaged in interactive dialogues to elicit long and “accurate” statements of users’ information needs. They liked the convenience of launching searches with short natural language queries.

Users did volunteer suggestions for improvements, however. They wanted accurate stemming, which behaved according to their linguistic intuitions and did not, for example, accidentally conflate technical terms with similar natural language words. They wanted the system to match phrases, so that “C library” would be interpreted as a single expression rather than the combination of two potentially noncontiguous terms. They wanted more control over the system’s default behavior with respect to noise words, synonyms, and query broadening. They wanted the system to recognize certain expressions such as version numbers, so as to retrieve articles containing variant forms or specializations (for example, a query containing “V5.4” should match an article containing “version 5.4-1a”). Help in reformulating queries would be valuable as well, so long as the user remained firmly in control of reformulation.

These interviews dampened our enthusiasm for a number of possible AI approaches; we concluded that our initial efforts should focus on improving the linguistic performance of the existing Stars system. While some of the users’ requested functionality could be achieved through extensions to the user query language, such as proximity operators and wildcarding, we felt obliged to explore solutions that would minimize the user’s need to do anything more than type in a simple natural language query. We therefore decided to evaluate the feasibility of incorporating some of the basic building blocks of a standard natural language processing system, namely a computational lexicon, morphological analyzer, and parser.
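The version-number behavior users asked for (a query containing “V5.4” matching an article containing “version 5.4-1a”) can be illustrated with a minimal Python sketch. The regular expression, the function names, and the prefix-matching rule are all assumptions made for illustration; the article does not say how AI-Stars recognized special expressions.

```python
import re

# Hypothetical pattern covering forms such as "V5.4", "v5.4-1a", "version 5.4".
VERSION_RE = re.compile(r"\b(?:version\s+|v)(\d+(?:\.\d+)*)(?:-(\w+))?", re.IGNORECASE)


def parse_versions(text):
    """Return every version expression in text as a tuple of components."""
    found = []
    for match in VERSION_RE.finditer(text):
        parts = match.group(1).split(".")
        if match.group(2):
            # Keep a trailing qualifier such as "1a" as one more component.
            parts.append(match.group(2).lower())
        found.append(tuple(parts))
    return found


def specializes(article_version, query_version):
    """True if the article version equals or refines the query version."""
    return article_version[:len(query_version)] == query_version


def versions_match(query, article):
    """True if any version in the article specializes a version in the query."""
    return any(
        specializes(a, q)
        for q in parse_versions(query)
        for a in parse_versions(article)
    )


print(versions_match("BACKUP V5.4", "Fixed in version 5.4-1a"))  # True
print(versions_match("BACKUP V5.5", "Fixed in version 5.4-1a"))  # False
```

Treating a version as a tuple of components makes the "variant forms or specializations" rule a simple prefix test, and it keeps the more specific query (“V5.4-1a”) from matching a more general article (“version 5.4”).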
IO IEEE EXPERT
Developing a prototype
To implement our first prototype, we adapted a set of tools we had developed for a machine-aided translation system. This included a 17,000-word English dictionary containing information about each word’s grammatical category and stem forms. Its morphological analyzer contained a set of rules encoding English orthography and inflectional suffixation. Given a string, it proposed a set of potential stems that could then be verified against the lexicon. The parser was a simple bottom-up chart parser,⁷ which, given a set of syntactic rules and an English string, generated a table, or chart, by applying the grammar to the tokens in the input string and storing all the well-formed substrings. The chart can be
DECEMBER 1993 11
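The propose-then-verify behavior of the morphological analyzer described in this section can be sketched in a few lines of Python. The toy lexicon, the suffix rules, and the function names below are hypothetical; the prototype's actual rule set and 17,000-word dictionary are not reproduced in the article.

```python
# Hypothetical toy lexicon: citation form -> grammatical category.
LEXICON = {
    "copy": "verb",
    "tape": "noun",
    "stop": "verb",
    "library": "noun",
    "process": "noun",
}

# Hypothetical (suffix, replacement) rules for inflection and orthography.
SUFFIX_RULES = [
    ("ies", "y"),   # libraries -> library
    ("ied", "y"),   # copied -> copy
    ("es", ""),     # processes -> process
    ("s", ""),      # tapes -> tape
    ("ed", ""),     # stopped -> stopp (undoubled below)
    ("ed", "e"),    # taped -> tape
    ("ing", ""),    # stopping -> stopp (undoubled below)
    ("ing", "e"),   # taping -> tape
]


def propose_stems(word):
    """Propose candidate stems without consulting the lexicon."""
    word = word.lower()
    candidates = {word}
    for suffix, replacement in SUFFIX_RULES:
        if word.endswith(suffix) and len(word) > len(suffix):
            stem = word[:-len(suffix)] + replacement
            candidates.add(stem)
            # Undo consonant doubling: "stopp" -> "stop".
            if len(stem) >= 2 and stem[-1] == stem[-2]:
                candidates.add(stem[:-1])
    return candidates


def analyze(word):
    """Keep only the proposed stems that the lexicon confirms."""
    return sorted(stem for stem in propose_stems(word) if stem in LEXICON)


print(analyze("libraries"))  # ['library']
print(analyze("stopped"))    # ['stop']
print(analyze("widgets"))    # []
```

Over-generating candidates and letting the lexicon filter them keeps the rules simple: a rule may propose nonsense such as "librari", but only verified stems survive.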
Figure 2. The thesaurus window, query input window, and modified query reformulation workspace.
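The workspace in this figure lays query terms out as tiles arranged in columns. A minimal Python sketch of the rule of thumb the article gives for reading such a layout (tiles in the same column are ORed, tiles in different columns are ANDed) might look like the following. The `Tile` class and function name are hypothetical, inactive tiles are assumed to contribute nothing, and tiles stretched across several columns are deliberately left out.

```python
from dataclasses import dataclass


@dataclass
class Tile:
    term: str
    active: bool = True


def columns_to_boolean(columns):
    """OR the active tiles within each column, then AND the columns together."""
    clauses = []
    for column in columns:
        terms = [f'"{tile.term}"' for tile in column if tile.active]
        if not terms:
            continue  # a column with no active tiles adds no constraint
        if len(terms) == 1:
            clauses.append(terms[0])
        else:
            clauses.append("(" + " OR ".join(terms) + ")")
    return " AND ".join(clauses)


workspace = [
    [Tile("copy", active=False)],
    [Tile("scratch tape")],
    [Tile("v5.0"), Tile("version 5.0"), Tile("version 5", active=False)],
]
print(columns_to_boolean(workspace))
# "scratch tape" AND ("v5.0" OR "version 5.0")
```

Toggling the inactive "version 5" tile to active would add it to the OR clause of its column, which is the one-click broadening the workspace is meant to offer.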
from inactive terms. Each tile displays the term's citation form and the number of article postings for that term. Through the display, the results of all the default behaviors of the query interpreter are available at a glance: the morphological analysis, phrase matching, special expression recognition, and selection of active terms. The article postings provide feedback on the discriminative role played by each term in the query. For most configurations of the tiles, even the Boolean interpretation is visually manifest; users can apply the rule of thumb that tiles in the same column are ORed and those in different columns are ANDed. The tiles in Figure 1 can thus be interpreted as the query

. . .

But the visual display is not only a window into the system's internals. It also serves as a direct manipulation workspace for performing iterative query refinement:

• Clicking on a tile changes it from active to inactive, and vice versa.
• Dragging a tile to a new location changes its Boolean relationship with the other tiles.
• Stretching a tile across multiple columns effectively ORs the term with the AND of terms in the columns it spans.
• Selecting a tile option can pull up a thesaurus window for that term (see Figure 2).

The thesaurus provides a way to access related terms that can help broaden or narrow a search. To minimize user effort in adding thesaurus terms to the query, we divided the window into four categories of relationships:

• [...] phrase for the selected term in the workspace: It is placed as an active tile in the same column(s) as the previously selected workspace term, which is made inactive.
• Selecting a synonym from the Thesaurus window broadens the query by adding the item as an active tile to the same column(s) as the previously selected workspace term, without changing the term's activation.
• Selecting a conceptually related term narrows the query by appending the item as an active tile after the last occupied column of the chart (visually establishing a new column).
• Selecting a compound term has the same effect as adding a synonym. Compound terms are superstrings of the term that contain at least one nonalphabetic character. They are common in computer science domains where many terms, such as error messages and facility names, are composed of mnemonic strings embedded in longer strings, using nonalphabetic characters as delineators.

Selected thesaurus terms are automatically placed on the chart in a position (relative to the initial term) that makes sense for their relationship to that term, so the user does not have to decide where to place the terms.

Since each default thesaurus operation has an immediate visual manifestation, the user can adjust the query if the default is not what is desired. Figure 2 shows how the query reformulation workspace from Figure 1 would look after clicking on the tiles for “copy,” “BACKUP,” and “saveset,” and adding the phrase “scratch tape” from the thesaurus. This new tile configuration would translate into the query

(((“BACKUP” AND “saveset”) OR “BACKUP saveset”) AND “scratch tape” AND (“v5.0” OR “version 5.0”))

In lieu of interacting directly with the thesaurus, a user can opt to have prespecified categories of related terms (such as synonyms) added to the query automatically. In addition, other terms that might be useful in query reformulation can be included in the display as inactive tiles. In Figure 2, for example, an inactive tile has been added by default to generalize the version number (“version 5.0” → “version 5”). In this way, the ability to broaden the query according to this useful domain-specific generalization is only a mouse click away.

By making it easy to iteratively permute and rerun a query, we were hoping to overcome the traditional brittleness of all-or-none responses to single Boolean queries, without sacrificing the almost “surgical” precision of the Boolean model. Likewise, we believed that giving the user a convenient way to grasp how the various component terms contribute to the result would improve the user’s sense of whether to continue trying to refine a search or to assume that no relevant article exists.

From prototype to product

Evaluating an information retrieval system has always been problematic.¹⁰ Accurately calculating precision and recall for a given document collection and a given set of queries requires a full understanding of the collection’s contents and some way to determine the subjective relevance of each document to each query. Furthermore, different task orientations, user backgrounds, and the nature of the information stored in the collection might need to be considered. Given the additional problem that our prospective users were always too busy to extensively test a prototype, we had to make do with a series of short demonstrations followed by informal user evaluations. In general, users were pleased with the linguistic enhancements underlying the query interpreter, especially the handling of phrases and special expressions. Initial reactions to our slide presentation about the query reformulation workspace ranged widely; some users were intrigued and enthusiastic, others were baffled and skeptical. While we do not know how much of the latter response was due to a general unfamiliarity with direct-manipulation graphical user interfaces, we were pleased to find that most users interviewed, including some of the initially most dubious critics, felt at ease with the workspace after about five minutes of supervised interaction.

Encouraged by this feedback, the Stars development team decided to include the workspace in the next major release of the Stars product. Also slated for inclusion were a number of major architectural changes and functional enhancements, many designed specifically to support the customer support centers. The centers wanted to evolve their corporate network of loosely coupled Stars databases into a true worldwide distributed database, from which users could retrieve any piece of information in an identical manner from any site. New features included full client/server computing, user-defined semistructured information objects, data replication, hyperinformation linking, and logical database partitioning. As we prepared to reimplement the linguistic features we had prototyped, we had to consider how they would integrate into this new architecture. In addition, our experience implementing the prototype had made us aware of several weaknesses in our original design.

A dynamic lexicon. One of the first problems we encountered was the need to support a dynamic rather than static lexicon. The constant influx of new textual information from a variety of sources, including corporate publications, internal bulletin boards, and external support documentation, guaranteed that lexical knowledge (the words and phrases that comprise the application’s working vocabulary) would also remain highly dynamic. Indeed, each new computing product or technology is likely to introduce a wealth of new phrases, endow existing words with new meanings or new grammatical categories (for instance, a noun might be adopted as a verb), and even coin altogether novel words (such as “widget” and “laptop”).

Compared to a static lexicon, an evolving lexicon has significant implications for any information retrieval system. Since we cannot assume that all required vocabulary is known before the first article is loaded into the system, the system must include some mechanism for ensuring that the lexicon and the article database always remain synchronized. Not only must the lexicon be updated if new vocabulary is used in an article, but, if article index keys are based on linguistic analysis, then reindexing might be necessary whenever the linguistic information changes. Secondly, a dynamic lexicon is likely to require many of the same administrative functions that apply to a dynamic textual database, such as data distribution, security, activity logging, data subsetting, transactional consistency, and so on. We were therefore interested in exploring whether an information retrieval system could itself be an effective repository of lexical information, allowing the lexicon to share all the administrative facilities already provided.

As we began to add new words and phrases to our prototype’s lexicon, we were forced to confront the disadvantage of indexing articles by their citation forms: There was no way to update the article indexes without reapplying the parser/morphological analyzer to all the texts, a time-consuming process. What was worse, a new article might be added while the previous index transaction was still in progress and, conversely, new lexicon entries might be added while an article index update was already in progress to accommodate previous lexicon modifications. We did not want to block either lexicon or article updates, so we initially developed a scheme that time-stamped the updates. This facilitated the coordination of lexicon and index updates, and gave users the illusion of a consistent system by making new lexicon entries “invisible” until all article reindexing had completed. From a user’s perspective, however, it made no sense that, after updating the lexicon, it took some indeterminate amount of time before the update was reflected in the system’s search behavior. Therefore, we began to look into other possibilities.

We recognized that if we indexed on surface forms (the strings just as they appear in the text) in addition to citation forms, we could take advantage of a much more efficient scheme for updating article indexes. That is, whenever a new word was to be added to the lexicon, we could use our morphological generator to generate a list of all the inflectional variants of the word. We could then do a query on the OR of those terms, and the resulting article set would be the precise set that should be indexed by the newly added citation form.

Likewise, for phrases, we could do a query on the AND of the phrase components to identify the articles that potentially contained the phrase. However, without information specifying the locations of phrase components within the articles, it was impossible to compute the precise set without reparsing each article in the set to check for a confirmed occurrence of the phrase. We thus chose to build a concordance index, which contains not only the set of articles that contain each word but also the location (token number) of each occurrence of the word. Word adjacency can be tested by comparing token numbers in the index, enabling AI-Stars to find phrases in articles without reparsing.

We also decided to try indexing only on surface strings, and using query-time morphological analysis and generation as well as phrase detection (via concordance adjacency checking) to simulate the matching of articles by citation forms of words and phrases. This approach has the architectural advantage of functionally isolating linguistic processing from article indexing. If indexes no longer depend on linguistic processing, there is no longer a need to coordinate updates of the lexicon with updates to the article indexes. As soon as a new word or phrase is added to the lexicon, it can be used in a user query to access articles.

The disadvantage of this approach is that it moves more of the computational burden to query time. However, we have found that the major performance cost (the disk I/Os required to fetch the index entries for each inflectional variant) can be effectively mitigated by using a B-tree index that sorts the surface string keys according to a case-insensitive alphabetical sort. Since most inflectional variants of a word (in English) share the same prefix string, they will tend to cluster on the same disk block, making it likely that successive disk fetches will not be required to access the index entries for a set of inflectional variants.

Implementing a lexicon on top of an information retrieval engine. As mentioned earlier, treating the lexicon as a dynamic rather than a static data structure requires support for the full range of activities implied by a dynamic database. These include not only Add, Modify, and Delete functions, but also a host of other features that might be required in the real-world environment in which the lexicon must operate. Digital’s customer support lexicon is managed by a dedicated group of database administrators, whose responsibility also includes administrative oversight of the contents of the on-line textual databases. Activity logging and reporting, security, and data organization, subsetting, and distribution are some of the functions that not only must be provided for the text database but are desirable for lexical data as well.

We initially prototyped the AI-Stars lexicon on top of a standard relational database management system. However, as we added features and extended the system into a wide-area distributed client/server implementation, we began to explore the possibility of layering the natural language processing functions directly on top of the AI-Stars Collection Services, the system’s storage and indexing facilities. Several aspects of the system made this particularly attractive:

• In our distributed environment, Collection Services provide for the replication of document collections at multiple network sites, for both performance and reliability (availability) reasons. Updates to document collections are automatically propagated to all replication sites. By using the same facilities for lexical data, lexicon updates can be handled in an identical fashion.
• Collection Services support the construction of virtual databases (so-called derived collections), defined by applying a query filter to the union of one or more other collections. For textual databases, this provides a mechanism to logically categorize the data in multiple ways, as well as to provide restricted views of data for security reasons. For lexical data, this capability can be used to maintain sublanguage vocabularies or vocabularies involving security restrictions. Collections also provide a natural vehicle for providing separate lexicons for different natural languages.
• Lexical entries are semistructured objects. They can include free text data, such as definitions and usage examples. Layering the lexicon on the information retrieval system lets users query lexical data in exactly the same manner as any other semistructured information object in the AI-Stars database.

In spite of these important advantages, it was not clear to us at first whether performance (again, measured in disk I/Os) would be competitive with other approaches. Retrieving individual words is intrinsically less efficient in a concordance-based inverted indexing system, since each stem lookup retrieves not the set of words but a set of object identification tags that must in turn be used to access the word objects. Fortunately, however, this indirection (the use of intermediate sets of object IDs) can be exploited to recognize phrases. Rather than use an on-disk discrimination net, for example, we can perform phrase recognition using set intersection over the concordance index, testing whether contiguous words in a query are members of the same phrase (or phrases). Since linguistic analysis at query time must include both word and phrase recognition, the query time cost of using our concordance-indexing scheme turns out, on the whole, to be equivalent to that of alternative approaches. We therefore chose to reimplement the lexicon on top of the AI-Stars Collection Services.
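The phrase recognition just described (set intersection over the concordance index, testing whether contiguous query words belong to the same lexical entry) can be sketched as follows. This is a minimal in-memory illustration under stated assumptions: the entry IDs, the sample lexicon, and the function names are hypothetical, and nothing here models the Collection Services storage layer.

```python
from collections import defaultdict


def build_concordance(entries):
    """Map each word to {entry_id: [token positions]} over the lexical entries."""
    index = defaultdict(dict)
    for entry_id, text in entries.items():
        for position, token in enumerate(text.lower().split()):
            index[token].setdefault(entry_id, []).append(position)
    return index


def recognize_phrases(query, entries, index):
    """Return (start_token, phrase) for multiword entries found in the query."""
    tokens = query.lower().split()
    lengths = {eid: len(text.split()) for eid, text in entries.items()}
    found = []
    for start in range(len(tokens)):
        # Candidate entries must begin with the word at this query position.
        first = index.get(tokens[start], {})
        candidates = {eid for eid, positions in first.items() if 0 in positions}
        offset = 0
        while candidates:
            for eid in sorted(candidates):
                if offset >= 1 and lengths[eid] == offset + 1:
                    found.append((start, entries[eid]))
            offset += 1
            if start + offset >= len(tokens):
                break
            postings = index.get(tokens[start + offset], {})
            # Intersect the ID sets, then keep entries where the word is adjacent.
            shared = candidates.intersection(postings)
            candidates = {eid for eid in shared if offset in postings[eid]}
    return found


# Hypothetical lexicon of single words and phrases, keyed by object ID.
lexicon = {1: "scratch tape", 2: "tape", 3: "C library", 4: "scratch"}
concordance = build_concordance(lexicon)
print(recognize_phrases("mount the scratch tape", lexicon, concordance))
# [(2, 'scratch tape')]
```

Each word lookup returns object IDs rather than the entries themselves, which is the indirection the text mentions; the same ID sets then serve double duty for the adjacency test.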
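The query-time scheme described earlier in this section (indexing articles only on surface strings, then generating inflectional variants and checking token adjacency at query time) can also be sketched in Python. The sample articles, the function names, and in particular the naive `generate_variants` rule are assumptions for illustration; the article does not give the rules of the actual morphological generator.

```python
from collections import defaultdict


def build_surface_index(articles):
    """Concordance on surface strings: word -> {article_id: [token numbers]}."""
    index = defaultdict(lambda: defaultdict(list))
    for article_id, text in articles.items():
        for token_number, token in enumerate(text.lower().split()):
            index[token][article_id].append(token_number)
    return index


def generate_variants(citation_form):
    """Hypothetical, deliberately naive stand-in for a morphological generator."""
    base = citation_form.lower()
    return {base, base + "s", base + "ed", base + "ing"}


def articles_for_word(index, citation_form):
    """OR together the postings of every inflectional variant."""
    hits = set()
    for variant in generate_variants(citation_form):
        hits.update(index.get(variant, {}))
    return hits


def articles_for_phrase(index, first, second):
    """Articles where a variant of `first` directly precedes a variant of `second`."""
    hits = set()
    for left_form in generate_variants(first):
        for right_form in generate_variants(second):
            left = index.get(left_form, {})
            right = index.get(right_form, {})
            for article_id in left.keys() & right.keys():
                # Adjacency test: compare token numbers, no reparsing needed.
                if any(pos + 1 in right[article_id] for pos in left[article_id]):
                    hits.add(article_id)
    return hits


articles = {
    1: "mount the scratch tape before backups",
    2: "the tape was scratched",
    3: "mounting scratch tapes",
}
surface_index = build_surface_index(articles)
print(sorted(articles_for_word(surface_index, "tape")))               # [1, 2, 3]
print(sorted(articles_for_phrase(surface_index, "scratch", "tape")))  # [1, 3]
```

Because the index never stores citation forms, adding a word or phrase to the lexicon changes only what `generate_variants` is asked to expand; the article index itself is untouched. In a sorted key space such as the B-tree the text mentions, "tape" and "tapes" would also sit next to each other.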
("VMS system" ((S "VMS") (L "system" nouns)))
("LTA device" ((S "LTA") (L "device" nouns)))
("load host" ((S "load") (L "host" nouns)))

that have more than one candidate explanation with the same degree of corpus covering, or for words that do not appear to have any inflection variants in the corpus, we apply a set of suffix heuristics. The suffix -ly typically indicates an adverb, whereas -ion indicates a noun. For words for which no suffix heuristic applies, we classify the word as a noun by default.

articles in the database.

This algorithm has been quite effective, generating more than 15,000 noun phrases from a 100-Mbyte corpus, with an acceptably low percentage of false alarms (Figure 4 gives examples of acquired phrases). Most false alarms are due to errors made in part-of-speech tagging in step 1. These knowledge acquisition procedures make it possible to bring up new databases for use in Stars without an excessive burden of hand-tailoring the lexicon. Since the typical help desk environment certainly does not have the

We view our present work as part of an evolutionary approach to incorporating even more sophisticated linguistic processing. For example, a system that tried word sense disambiguation, concept identification, or the generation of syntactic variants of larger phrasal units could also use a graphical device like the query reformulation workspace to make its inferences public and malleable. At the same time, our experience integrating even modest amounts of

Transformation, Analysis, and Retrieval of Information by Computer, Addison-Wes-