
Integrating Natural Language Processing and Information Retrieval in a Troubleshooting Help Desk

Peter G. Anick, Digital Equipment Corporation

AS THE SIZE AND IMPORTANCE OF DEC'S ON-LINE DATABASE HAVE GROWN, ITS CUSTOMER SUPPORT SOFTWARE HAS EVOLVED TO HANDLE INCREASED DATA MANAGEMENT AND INFORMATION RETRIEVAL NEEDS. NATURAL LANGUAGE PROCESSING PLAYS AN IMPORTANT ROLE.

Digital Equipment Corporation provides telephone and electronic support to its customers through a worldwide network of support centers and field offices. In addition to troubleshooting customer problems, Digital's service and support specialists share their knowledge and experiences with other specialists by writing up their problems and solutions and making them available in a worldwide on-line information base. By sharing this knowledge among a large body of physically distributed support engineers, customer problems that have been previously encountered can be resolved much more quickly, thereby increasing customer satisfaction and decreasing Digital's cost of providing service.

Over the past decade, the range of products supported by Digital's specialists has grown steadily and now includes non-Digital products as well. This growth has been matched by a massive increase in the size of the on-line information base, which has surged from less than 10,000 articles into the hundreds of thousands.

As the size and importance of the database has grown, the customer support centers have improved their software's data management and information retrieval capabilities. In the early 1980s, articles were indexed with a controlled vocabulary and retrieved using Boolean expressions over this vocabulary. By the mid-1980s, however, this system was replaced by a full-text system, Stars, which proved much more effective. In a full-text system, all the words in a text are available as keywords for querying. Users do not have to learn a controlled vocabulary, and they can express their queries in natural language rather than the syntax of Boolean expressions.

Stars did not try to "understand" the natural language queries, nor did it simply match the query's tokens with articles containing those strings. Instead, it truncated words to remove suffixes, eliminated noise words like "a" and "of," added synonyms, and converted the results into Boolean expressions for matching against articles in the database. It could also remove terms from a query to broaden it if the original search resulted in no hits.

On the whole, this approach to information retrieval has worked well for specialists hunting down problem-solving information. However, the rapid growth in the size of databases has contributed to an increase in the number of false hits retrieved during searches, as well as an increase in the need to reformulate queries to guide the search. Particularly frustrating were cases in which a specialist knew an article existed but couldn't easily craft a query that would retrieve it. Problems like this led Digital's Stars development group to contact us at the company's AI Technology Center to explore the applicability of AI technologies to this critical component of customer service and support. Intrigued by the possibilities for exploiting expert systems, intelligent user/system dialogues, blackboard-based integration of knowledge sources, and natural language processing, we eagerly accepted the challenge and christened our collaboration the AI-Stars project.
Interviewing the users

Information retrieval research in the 1980s had already pursued a number of AI-influenced approaches. The Rubric system, for example, cast queries as rules for building up evidence about a document's relevance.[1] I3R used a blackboard architecture to orchestrate a dialogue with the user, elaborate the user's information needs, and find appropriate documents.[2] Natural language processing was used to automate the role of an expert intermediary in interactive document retrieval.[3] Statistical methods had also emerged as attractive alternatives to Boolean and controlled vocabulary approaches; for example, Wide Area Information Servers[4] drew on the vector space model, in which queries and documents are represented as weighted term vectors, allowing for the computation of statistical similarity metrics between them.[5] Such systems work best with long queries, and it is possible to use the text of one article as a query to find others similar to it.

One of the conclusions researchers had come to after exploring this wide range of approaches was that no one retrieval method is likely to meet the needs of all users.[6] Not only do the methods tend to retrieve different sets of relevant documents, but users themselves come with different task orientations, interaction preferences, skill levels, and so on. We therefore began our AI-Stars investigation by interviewing members of our user population and watching them as they performed their daily work of computer troubleshooting.

We discovered many different styles of on-line searching. Some specialists liked to enter a broad (one- or two-word) query and then, based on the size of the resulting set, further restrict the search expression. Others started with longer queries of perhaps half a dozen terms and removed or added related terms if the initial query appeared to miss its mark. Users tended to understand that their natural language expression was transformed into a Boolean expression and, on some occasions, would resort to using optional explicit Boolean operators, including negation. The object of their task, to respond as quickly as possible to a specific customer inquiry, meant that they were primarily interested in "precision" searching: finding the one article that addressed a particular problem, as opposed to all the articles that might be even peripherally related.

For the most part, users reported liking the Stars system and appeared to be effective at using it. However, their confidence was shaken when they had trouble locating an article that they knew to exist; and they had little way of knowing, when a search failed, whether that meant there was no article that met their need or whether further searching was likely to be fruitful. Users also reported some trouble coming to grips with the internal heuristics used by Stars to convert natural language queries into search expressions. Indeed, each of Stars' internal query-processing strategies could in some cases lead to poor results and confuse users:

- Stemming sometimes reduced unrelated words to the same stem.
- Some apparent noise words were in fact content words in certain contexts.
- Automatic use of synonyms sometimes broadened the query in inappropriate ways.
- Users intended certain combinations of words in their queries to be contiguous phrases rather than independent search terms.
- The strategy for automatically broadening a query with no hits did not always do the "right" thing.

Perhaps reacting to these kinds of experiences, the users we interviewed were wary of many of the "enhancements" we initially proposed. Intelligent automated query reformulation and statistical nearest neighbor matching methods were perceived as potentially making the system even more difficult to understand. They argued that users should be able to readily comprehend and easily override any "intelligence" built into the system. Furthermore, their practical need to perform searches quickly argued against a user interface that engaged in interactive dialogues to elicit long and "accurate" statements of users' information needs. They liked the convenience of launching searches with short natural language queries.

Users did volunteer suggestions for improvements, however. They wanted accurate stemming, which behaved according to their linguistic intuitions and did not, for example, accidentally conflate technical terms with similar natural language words. They wanted the system to match phrases, so that "C library" would be interpreted as a single expression rather than the combination of two potentially noncontiguous terms. They wanted more control over the system's default behavior with respect to noise words, synonyms, and query broadening. They wanted the system to recognize certain expressions such as version numbers, so as to retrieve articles containing variant forms or specializations (for example, a query containing "V5.4" should match an article containing "version 5.4-1a"). Help in reformulating queries would be valuable as well, so long as the user remained firmly in control of reformulation.

These interviews dampened our enthusiasm for a number of possible AI approaches; we concluded that our initial efforts should focus on improving the linguistic performance of the existing Stars system. While some of the users' requested functionality could be achieved through extensions to the user query language, such as proximity operators and wildcarding, we felt obliged to explore solutions that would minimize the user's need to do anything more than type in a simple natural language query. We therefore decided to evaluate the feasibility of incorporating some of the basic building blocks of a standard natural language processing system, namely a computational lexicon, morphological analyzer, and parser.

Developing a prototype

To implement our first prototype, we adapted a set of tools we had developed for a machine-aided translation system. This included a 17,000-word English dictionary containing information about each word's grammatical category and stem forms. Its morphological analyzer contained a set of rules encoding English orthography and inflectional suffixation. Given a string, it proposed a set of potential stems that could then be verified against the lexicon. The parser was a simple bottom-up chart parser,[7] which, given a set of syntactic rules and an English string, generated a table, or chart, by applying the grammar to the tokens in the input string and storing all the well-formed substrings. The chart can be thought of as a two-dimensional array, in which the input tokens span single columns along the topmost row. Larger constituents recognized by the parser span multiple columns directly below the tokens or phrases from which they are composed. For natural language processing, this data structure conveniently represents competing ambiguous interpretations as edges in different rows that span the same columns.

Our original intent was to use the chart parser to recognize potential noun phrases in user queries. However, many user queries consist simply of a string of nouns, without punctuation or function words to indicate noun phrase boundaries. We decided that a purely syntactic approach to recognizing noun phrases in user queries was impractical, since any sequence of two or more nouns in a query would be interpreted as a potential noun phrase. Instead, we adopted the strategy of storing useful noun phrases in the lexicon and constraining the parser to recognize only those phrases. The parser was also used to recognize special expressions, such as variants of operating systems' version numbers.

One common technique in implementing an efficient text retrieval system is to construct inverted indexes for descriptors, facilitating rapid access to the objects associated with each descriptor.[8] For our prototype, we chose to index articles by the citation forms of the words and phrases contained in them (that is, the string you would typically use to look up a word in a traditional dictionary). This required running a morphological and phrasal analysis over the entire text base at load time, but had the effect of reducing all inflectional variants of a word or phrase to a single index entry. By indexing on the citation form, we were also collapsing across the word's part of speech. That is, our index did not distinguish between "program" as a noun and as a verb, for example. However, since noun and verb senses of English words are often semantically related, we felt that any attempt to disambiguate words into grammatical categories at index time would have little tangible benefit for subsequent information retrieval.

At query time, the AI-Stars prototype parsed the user's query, as described above, to construct a chart containing the citation forms of words, phrases, and special expressions. It then applied heuristics to determine which of the identified terms should be included in the actual search expression to be matched against the article index. Function words like "the" and "of" were marked inactive. Likewise, any term whose chart column span was exceeded by the span of another term was labeled inactive, thus eliminating any individual terms that were part of a larger phrase. The prototype then applied an algorithm to the chart to construct a Boolean query from the remaining active terms. Roughly speaking, the algorithm ORed together terms that appeared in the same column on the chart, and then ANDed together these disjuncts.[9] AI-Stars ran the resulting query expression against the database index, and displayed the size of the matching article set to the user, who could then examine the list or modify the query and rerun it.
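To make the chart-to-query mapping concrete, here is a small Python sketch of the idea. It is our own simplified illustration rather than AI-Stars code: terms are represented only by the token columns they span, the noise-word list is an assumed stand-in, function words and subsumed terms are deactivated, and the remaining terms are grouped by starting column (a simplification of "same column") before being ORed and ANDed.

    # Illustrative sketch (not AI-Stars source) of building a Boolean query from a
    # chart of recognized terms.  Each term records the token columns it spans.
    FUNCTION_WORDS = {"the", "of", "a", "an", "on", "to", "for"}   # assumed noise-word list

    class Term:
        def __init__(self, citation_form, start, end):
            self.citation_form = citation_form    # dictionary headword or stored phrase
            self.start, self.end = start, end     # inclusive span of token columns
            self.active = True

    def deactivate_terms(terms):
        """Mark function words and terms subsumed by a longer term as inactive."""
        for t in terms:
            if t.citation_form.lower() in FUNCTION_WORDS:
                t.active = False
            for other in terms:
                wider = (other.end - other.start) > (t.end - t.start)
                if other is not t and wider and other.start <= t.start and t.end <= other.end:
                    t.active = False

    def boolean_query(terms):
        """OR terms that share a starting column, then AND the resulting disjunctions."""
        deactivate_terms(terms)
        columns = {}
        for t in terms:
            if t.active:
                columns.setdefault(t.start, []).append('"%s"' % t.citation_form)
        groups = [group for _, group in sorted(columns.items())]
        return " AND ".join("(" + " OR ".join(g) + ")" if len(g) > 1 else g[0]
                            for g in groups)

    # The query "BACKUP saveset on scratch tape", with two phrases in the lexicon:
    terms = [Term("BACKUP", 0, 0), Term("saveset", 1, 1), Term("BACKUP saveset", 0, 1),
             Term("on", 2, 2), Term("scratch tape", 3, 4)]
    print(boolean_query(terms))   # "BACKUP saveset" AND "scratch tape"

In this sketch the phrase "BACKUP saveset" suppresses its component words by default, mirroring the behavior described above; the user-visible workspace discussed next is what lets a searcher override such defaults.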
Reformulating queries

Adding linguistic components helped correct a number of shortcomings in the existing Stars engine, but we still had to address our users' second directive: to make the system's search behavior easier to interrogate and manipulate. Our recent experience with generating Boolean expressions from the underlying 2D chart data structure led us to craft a visual representation of the chart to serve as a user window into the internals of the Stars query interpreter. Figure 1 shows the visual display we developed; specifically, the query input window with the query reformulation workspace. Each term in the query is depicted as a tile in a 2D layout, with active terms in reverse video to visually distinguish them from inactive terms. Each tile displays the term's citation form and the number of article postings for that term. Through the display, the results of all the default behaviors of the query interpreter are available at a glance: the morphological analysis, phrase matching, special expression recognition, and selection of active terms. The article postings provide feedback on the discriminative role played by each term in the query. For most configurations of the tiles, even the Boolean interpretation is visually manifest; users can apply the rule of thumb that tiles in the same column are ORed and those in different columns are ANDed. The tiles in Figure 1 can thus be read directly as the Boolean query that will be run.

Figure 1. The query input window and query reformulation workspace.

But the visual display is not only a window into the system's internals. It also serves as a direct manipulation workspace for performing iterative query refinement:

- Clicking on a tile changes it from active to inactive, and vice versa.
- Dragging a tile to a new location changes its Boolean relationship with the other tiles.
- Stretching a tile across multiple columns effectively ORs the term with the AND of terms in the columns it spans.
- Selecting a tile option can pull up a thesaurus window for that term (see Figure 2).

Figure 2. The thesaurus window, query input window, and modified query reformulation workspace.

The thesaurus provides a way to access related terms that can help broaden or narrow a search. To minimize user effort in adding thesaurus terms to the query, we divided the window into four categories of relationships (a small sketch of the placement rules follows the list):

- Selecting a phrase substitutes that phrase for the selected term in the workspace: it is placed as an active tile in the same column(s) as the previously selected workspace term, which is made inactive.
- Selecting a synonym from the Thesaurus window broadens the query by adding the item as an active tile to the same column(s) as the previously selected workspace term, without changing the term's activation.
- Selecting a conceptually related term narrows the query by appending the item as an active tile after the last occupied column of the chart (visually establishing a new column).
- Selecting a compound term has the same effect as adding a synonym. Compound terms are superstrings of the term that contain at least one nonalphabetic character. They are common in computer science domains, where many terms, such as error messages and facility names, are composed of mnemonic strings embedded in longer strings, using nonalphabetic characters as delineators.
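As a concrete illustration of these placement rules, the short sketch below (ours, not product code) models the workspace as a list of columns, each holding (term, active) tiles, and applies the four operations as just described: synonyms and compound terms join the selected term's column, conceptually related terms open a new column, and a phrase replaces the selected term, which is deactivated.

    # Illustrative sketch of the thesaurus placement rules described in the text.
    # A workspace is a list of columns; each column is a list of [term, active] tiles.

    def find_column(workspace, term):
        for col in workspace:
            if any(t == term for t, _ in col):
                return col
        raise ValueError("term not on the chart: " + term)

    def add_synonym(workspace, selected, synonym):
        """Broaden: add the synonym as an active tile in the selected term's column."""
        find_column(workspace, selected).append([synonym, True])

    add_compound_term = add_synonym      # compound terms behave like synonyms

    def add_related_term(workspace, related):
        """Narrow: append the related term as an active tile in a new column."""
        workspace.append([[related, True]])

    def add_phrase(workspace, selected, phrase):
        """Substitute: place the phrase in the selected term's column, deactivating the term."""
        col = find_column(workspace, selected)
        for tile in col:
            if tile[0] == selected:
                tile[1] = False
        col.append([phrase, True])

    # Example: start from the query "BACKUP saveset", then broaden and narrow it.
    workspace = [[["BACKUP", True]], [["saveset", True]]]
    add_synonym(workspace, "saveset", "save set")
    add_related_term(workspace, "scratch tape")
    print(workspace)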
Selected thesaurus terms are automatically placed on the chart in a position (relative to the initial term) that makes sense for their relationship to that term, so the user does not have to decide where to place the terms. Since each default thesaurus operation has an immediate visual manifestation, the user can adjust the query if the default is not what is desired. Figure 2 shows how the query reformulation workspace from Figure 1 would look after clicking on the tiles for "copy," "BACKUP," and "saveset," and adding the phrase "scratch tape" from the thesaurus. This new tile configuration would translate into the query

(("BACKUP" AND "saveset") OR "BACKUP saveset") AND "scratch tape" AND ("v5.0" OR "version 5.0")

In lieu of interacting directly with the thesaurus, a user can opt to have prespecified categories of related terms (such as synonyms) added to the query automatically. In addition, other terms that might be useful in query reformulation can be included in the display as inactive tiles. In Figure 2, for example, an inactive tile has been added by default to generalize the version number ("version 5.0" to "version 5"). In this way, the ability to broaden the query according to this useful domain-specific generalization is only a mouse click away.

By making it easy to iteratively permute and rerun a query, we were hoping to overcome the traditional brittleness of all-or-none responses to single Boolean queries, without sacrificing the almost "surgical" precision of the Boolean model. Likewise, we believed that giving the user a convenient way to grasp how the various component terms contribute to the result would improve the user's sense of whether to continue trying to refine a search or to assume that no relevant article exists.

From prototype to product

Evaluating an information retrieval system has always been problematic.[10] Accurately calculating precision and recall for a given document collection and a given set of queries requires a full understanding of the collection's contents and some way to determine the subjective relevance of each document to each query. Furthermore, different task orientations, user backgrounds, and the nature of the information stored in the collection might need to be considered. Given the additional problem that our prospective users were always too busy to extensively test a prototype, we had to make do with a series of short demonstrations followed by informal user evaluations. In general, users were pleased with the linguistic enhancements underlying the query interpreter, especially the handling of phrases and special expressions. Initial reactions to our slide presentation about the query reformulation workspace ranged widely; some users were intrigued and enthusiastic, others were baffled and skeptical. While we do not know how much of the latter response was due to a general unfamiliarity with direct-manipulation graphical user interfaces, we were pleased to find that most users interviewed, including some of the initially most dubious critics, felt at ease with the workspace after about five minutes of supervised interaction.

Encouraged by this feedback, the Stars development team decided to include the workspace in the next major release of the Stars product. Also slated for inclusion were a number of major architectural changes and functional enhancements, many designed specifically to support the customer support centers. The centers wanted to evolve their corporate network of loosely coupled Stars databases into a true worldwide distributed database, from which users could retrieve any piece of information in an identical manner from any site. New features included full client/server computing, user-defined semistructured information objects, data replication, hyperinformation linking, and logical database partitioning. As we prepared to reimplement the linguistic features we had prototyped, we had to consider how they would integrate into this new architecture. In addition, our experience implementing the prototype had made us aware of several weaknesses in our original design.

A dynamic lexicon. One of the first problems we encountered was the need to support a dynamic rather than static lexicon. The constant influx of new textual information from a variety of sources, including corporate publications, internal bulletin boards, and external support documentation, guaranteed that lexical knowledge (the words and phrases that comprise the application's working vocabulary) would also remain highly dynamic. Indeed, each new computing product or technology is likely to introduce a wealth of new phrases, endow existing words with new meanings or new grammatical categories (for instance, a noun might be adopted as a verb), and even coin altogether novel words (such as "widget" and "laptop").

Compared to a static lexicon, an evolving lexicon has significant implications for any information retrieval system. Since we cannot assume that all required vocabulary is known before the first article is loaded into the system, the system must include some mechanism for ensuring that the lexicon and the article database always remain synchronized. Not only must the lexicon be updated if new vocabulary is used in an article, but, if article index keys are based on linguistic analysis, then reindexing might be necessary whenever the linguistic information changes. Secondly, a dynamic lexicon is likely to require many of the same administrative functions that apply to a dynamic textual database, such as data distribution, security, activity logging, data subsetting, transactional consistency, and so on. We were therefore interested in exploring whether an information retrieval system could itself be an effective repository of lexical information, allowing the lexicon to share all the administrative facilities already provided.
As we began to add new words and phrases to our prototype's lexicon, we were forced to confront the disadvantage of indexing articles by their citation forms: there was no way to update the article indexes without reapplying the parser/morphological analyzer to all the texts, a time-consuming process. What was worse, a new article might be added while the previous index transaction was still in progress and, conversely, new lexicon entries might be added while an article index update was already in progress to accommodate previous lexicon modifications. We did not want to block either lexicon or article updates, so we initially developed a scheme that time-stamped the updates. This facilitated the coordination of lexicon and index updates, and gave users the illusion of a consistent system by making new lexicon entries "invisible" until all article reindexing had completed. From a user's perspective, however, it made no sense that, after updating the lexicon, it took some indeterminate amount of time before the update was reflected in the system's search behavior. Therefore, we began to look into other possibilities.

We recognized that if we indexed on surface forms (the strings just as they appear in the text) in addition to citation forms, we could take advantage of a much more efficient scheme for updating article indexes. That is, whenever a new word was to be added to the lexicon, we could use our morphological generator to generate a list of all the inflectional variants of the word. We could then do a query on the OR of those terms, and the resulting article set would be the precise set that should be indexed by the newly added citation form. Likewise, for phrases, we could do a query on the AND of the phrase components to identify the articles that potentially contained the phrase. However, without information specifying the locations of phrase components within the articles, it was impossible to compute the precise set without reparsing each article in the set to check for a confirmed occurrence of the phrase. We thus chose to build a concordance index, which contains not only the set of articles that contain each word but also the location (token number) of each occurrence of the word. Word adjacency can be tested by comparing token numbers in the index, enabling AI-Stars to find phrases in articles without reparsing.

We also decided to try indexing only on surface strings, and using query-time morphological analysis and generation as well as phrase detection (via concordance adjacency checking) to simulate the matching of articles by citation forms of words and phrases. This approach has the architectural advantage of functionally isolating linguistic processing from article indexing. If indexes no longer depend on linguistic processing, there is no longer a need to coordinate updates of the lexicon with updates to the article indexes. As soon as a new word or phrase is added to the lexicon, it can be used in a user query to access articles.

The disadvantage of this approach is that it moves more of the computational burden to query time. However, we have found that the major performance cost (the disk I/Os required to fetch the index entries for each inflectional variant) can be effectively mitigated by using a B-tree index that sorts the surface string keys according to a case-insensitive alphabetical sort. Since most inflectional variants of a word (in English) share the same prefix string, they will tend to cluster on the same disk block, making it likely that successive disk fetches will not be required to access the index entries for a set of inflectional variants.

Implementing a lexicon on top of an information retrieval engine. As mentioned earlier, treating the lexicon as a dynamic rather than a static data structure requires support for the full range of activities implied by a dynamic database. These include not only Add, Modify, and Delete functions, but also a host of other features that might be required in the real-world environment in which the lexicon must operate. Digital's customer support lexicon is managed by a dedicated group of database administrators, whose responsibility also includes administrative oversight of the contents of the on-line textual databases. Activity logging and reporting, security, and data organization, subsetting, and distribution are some of the functions that not only must be provided for the text database but are desirable for lexical data as well.

We initially prototyped the AI-Stars lexicon on top of a standard relational database management system. However, as we added features and extended the system into a wide-area distributed client/server implementation, we began to explore the possibility of layering the natural language processing functions directly on top of the AI-Stars Collection Services, the system's storage and indexing facilities. Several aspects of the system made this particularly attractive:

- In our distributed environment, Collection Services provide for the replication of document collections at multiple network sites, for both performance and reliability (availability) reasons. Updates to document collections are automatically propagated to all replication sites. By using the same facilities for lexical data, lexicon updates can be handled in an identical fashion.
- Collection Services support the construction of virtual databases (so-called derived collections), defined by applying a query filter to the union of one or more other collections (a generic sketch of this idea follows the list). For textual databases, this provides a mechanism to logically categorize the data in multiple ways, as well as to provide restricted views of data for security reasons. For lexical data, this capability can be used to maintain sublanguage vocabularies or vocabularies involving security restrictions. Collections also provide a natural vehicle for providing separate lexicons for different natural languages.
- Lexical entries are semistructured objects. They can include free text data, such as definitions and usage examples. Layering the lexicon on the information retrieval system lets users query lexical data in exactly the same manner as any other semistructured information object in the AI-Stars database.
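The derived-collection idea can be conveyed with a generic sketch. The classes and method names below are hypothetical illustrations of "a query filter applied to the union of one or more collections" and make no claim about the actual Collection Services interface.

    # Generic illustration of a "derived collection": a filter applied to the
    # union of base collections.  Not the AI-Stars Collection Services API.

    class Collection:
        def __init__(self, documents):
            self.documents = list(documents)

        def members(self):
            return list(self.documents)

    class DerivedCollection:
        def __init__(self, bases, query_filter):
            self.bases = bases                  # one or more underlying collections
            self.query_filter = query_filter    # predicate applied to each document

        def members(self):
            seen, result = set(), []
            for base in self.bases:
                for doc in base.members():
                    if id(doc) not in seen and self.query_filter(doc):
                        seen.add(id(doc))
                        result.append(doc)
            return result

    # Example: a French-language sublexicon derived from two site lexicons.
    site_a = Collection([{"word": "fichier", "lang": "fr"}, {"word": "file", "lang": "en"}])
    site_b = Collection([{"word": "serveur", "lang": "fr"}])
    french = DerivedCollection([site_a, site_b], lambda d: d["lang"] == "fr")
    print([d["word"] for d in french.members()])   # ['fichier', 'serveur']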
In spite of these important advantages, it was not clear to us at first whether performance (again, measured in disk I/Os) would be competitive with other approaches. Retrieving individual words is intrinsically less efficient in a concordance-based inverted indexing system, since each stem lookup retrieves not the set of words but a set of object identification tags that must in turn be used to access the word objects. Fortunately, however, this indirection (the use of intermediate sets of object IDs) can be exploited to recognize phrases. Rather than use an on-disk discrimination net, for example, we can perform phrase recognition using set intersection over the concordance index, testing whether contiguous words in a query are members of the same phrase (or phrases). Since linguistic analysis at query time must include both word and phrase recognition, the query-time cost of using our concordance-indexing scheme turns out, on the whole, to be equivalent to that of alternative approaches. We therefore chose to reimplement the lexicon on top of the AI-Stars Collection Services.
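The concordance index and the query-time operations built on it can be sketched compactly. The fragment below is an illustration of the ideas in this section rather than the production data structures: it maps each surface string to (article, token-position) postings, simulates citation-form matching by ORing the postings of a hypothetical list of inflectional variants, and recognizes a two-word phrase by intersecting article sets and checking token adjacency, so no article needs to be reparsed.

    # Illustrative concordance index: surface string -> list of (article_id, token_position).
    # A sketch of the ideas in the text, not the AI-Stars implementation.
    from collections import defaultdict

    def build_concordance(articles):
        index = defaultdict(list)
        for art_id, text in articles.items():
            for pos, token in enumerate(text.lower().split()):
                index[token].append((art_id, pos))
        return index

    def articles_for_citation_form(index, variants):
        """Simulate citation-form matching by ORing postings of all inflectional variants."""
        return {art for v in variants for art, _ in index.get(v, [])}

    def articles_with_phrase(index, first, second):
        """Find articles where `second` immediately follows `first`, without reparsing."""
        positions = defaultdict(set)
        for art, pos in index.get(first, []):
            positions[art].add(pos)
        hits = set()
        for art, pos in index.get(second, []):
            if pos - 1 in positions.get(art, set()):
                hits.add(art)
        return hits

    articles = {1: "the backup copied the saveset to a scratch tape",
                2: "scratch the old tape labels before copying"}
    index = build_concordance(articles)
    # A hypothetical variant list such as a morphological generator might produce:
    print(articles_for_citation_form(index, ["copy", "copies", "copied", "copying"]))  # {1, 2}
    print(articles_with_phrase(index, "scratch", "tape"))                              # {1}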

Extending the query reformulation workspace. Another piece of code that underwent significant redesign was the query reformulation workspace. We recognized that the workspace interface, while originally designed for manipulating natural language queries, generalized nicely to the case of structured queries, which range over the values of structured data fields, such as author and date. For manipulating structured data, we needed to extend the chart workspace functionality to allow for the nesting of subcharts. That is, we allow a tile to correspond to either a value term, such as a word in a natural language query, or a full <field relational-operator value> restriction expression.

A user can expand any tile of the second type into its own chart window to manipulate its value content. But in addition, the Boolean relationships among multiple restriction expressions are now available for direct manipulation in the top-level query reformulation workspace. This structured-query interface is illustrated in Figure 3, in which the tile representing the restriction for the All Text field has been expanded in its own subquery window. The user can manipulate the tiles in either window (or both) and then rerun the query.

Figure 3. The structured query interface for a query containing multiple restrictions. The subquery window shows the workspace for the value of the All Text field.

When other engineering groups expressed interest in experimenting with this interface for other applications, we decided to repackage the operations on the chart data structure, which originally had been tied directly to the natural language parser. As our chart now had at least two clients (the parser and the direct manipulation interface), we decided to implement it as an abstract data type with its own callable interface. The abstract interface supports the creation, movement, and deletion of tiles. The operational semantics of tiles is implemented via callbacks associated with annotations on the tiles. A chart is itself defined as a subtype of tile, thereby allowing charts to be nested (as needed for the structured query interface described earlier).
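A minimal rendering of that abstraction might look as follows. The class and method names are hypothetical, but the structure mirrors the description: tiles carry annotations and callbacks, and a chart is itself a subtype of tile, so charts can nest for structured queries.

    # Minimal sketch of the chart abstract data type: a Tile carries annotations and
    # callbacks; a Chart is itself a subtype of Tile, so charts can nest (as the
    # structured query interface requires).  Method names here are hypothetical.

    class Tile:
        def __init__(self, label):
            self.label = label
            self.annotations = {}      # e.g., activation state, field name
            self.callbacks = {}        # event name -> handler

        def on(self, event, handler):
            self.callbacks[event] = handler

        def fire(self, event):
            handler = self.callbacks.get(event)
            if handler:
                handler(self)

    class Chart(Tile):
        def __init__(self, label):
            super().__init__(label)
            self.tiles = []            # value tiles or nested restriction charts

        def add_tile(self, tile):
            self.tiles.append(tile)

        def remove_tile(self, tile):
            self.tiles.remove(tile)

    # A structured query: the "All Text" restriction expands into its own subchart.
    query = Chart("query")
    all_text = Chart("All Text contains ...")       # nested chart for the field value
    all_text.add_tile(Tile("BACKUP saveset"))
    all_text.add_tile(Tile("scratch tape"))
    query.add_tile(all_text)
    query.add_tile(Tile("date > 1992-01-01"))       # another <field op value> restriction
    all_text.tiles[0].on("click", lambda t: t.annotations.update(active=False))
    all_text.tiles[0].fire("click")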
Using the morphological analyzer for technical terms. Among the most useful terms in some computer troubleshooting queries are technical symbols such as io$m_abort and sys$unwind. These terms, identifying such things as function library entry points and error messages, tend to have a fixed number of systematic syntactic variants, such as io$v_abort, $unwind_s, $unwind_g, and so on. Just as with inflectional variants of English words, users typically want to match on all variants; they also tend to be inconsistent about which variant they use in their queries. We have therefore found it convenient to treat such terms as we do inflectional variants, adding rules to our morphological analyzer/generator to cover technical symbols that undergo regular patterns of variation.
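To suggest how such symbols might be folded into the analyzer and generator, here is a toy rule for one family of symbols. The regular expression and the set of variant "kinds" are hypothetical stand-ins for the actual rule set, which is not published here; the point is only that variants can be mapped to, and regenerated from, a shared citation form.

    # Toy illustration of treating systematic technical-symbol variants like
    # inflectional variants.  The pattern and variant "kinds" below are hypothetical,
    # not the actual analyzer/generator rules.
    import re

    # Assume one family of symbols of the form <facility>$<kind>_<root>,
    # e.g., io$m_abort and io$v_abort sharing the citation form "io$_abort".
    SYMBOL_RULE = re.compile(r"^(?P<facility>[a-z]+)\$(?P<kind>[a-z])_(?P<root>[a-z_]+)$")

    def symbol_citation_form(token):
        """Map a symbol variant to its shared citation form, if the rule applies."""
        m = SYMBOL_RULE.match(token)
        return "%s$_%s" % (m.group("facility"), m.group("root")) if m else token

    def symbol_variants(citation_form, kinds=("m", "v")):
        """Regenerate the systematic variants of a symbol's citation form."""
        facility, root = citation_form.split("$_", 1)
        return ["%s$%s_%s" % (facility, kind, root) for kind in kinds]

    print(symbol_citation_form("io$m_abort"))   # io$_abort
    print(symbol_citation_form("io$v_abort"))   # io$_abort
    print(symbol_variants("io$_abort"))         # ['io$m_abort', 'io$v_abort']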
Populating the lexicon. Given the requirement for a dynamic lexicon, we designed an interactive interface to make it easy for database administrators to add new vocabulary, phrases, and thesaurus entries. However, manually identifying the thousands of new words and phrases that appear over time in on-line technical databases would be economically infeasible. We have therefore constructed a number of tools, based on techniques developed in the field of corpus linguistics,[11] to assist in the (semi)automatic population of our lexical knowledge base.

With each word in the lexicon, we store its uninflected form, its part of speech, inflectional paradigm, and orthographic properties, such as whether its final consonant gets doubled. Based on our assumption that in a large enough text corpus, words that inflect are likely to appear in several of their inflected forms, we developed the following algorithm for extracting information about unknown words in a corpus (a simplified sketch follows the list):

(1) For each surface string in the corpus, test whether it is already in the lexicon, using morphological analysis. Omit known surface strings from further consideration.

(2) Since we generally do not need to store inflectional information for proper nouns in the lexicon, eliminate them from the data using the simple heuristic that proper nouns are usually capitalized in English. Use the morphological guesser to generate a set of candidate lexical entries for each of the remaining (unknown and noncapitalized) strings.

(3) Apply abductive reasoning to choose a "best guess" explanation for each unknown surface form in the corpus. That is, choose the lexical entry that accounts for the greatest number of different surface forms that are actually found in the text. For example, if we found both "instantiate" and "instantiates" in a corpus, the algorithm would propose that "instantiate" is either the singular form of a noun whose plural is "instantiates" or it is the infinitive of a verb whose third person singular form is "instantiates." However, if it finds "instantiated" in the text as well, then it prefers the verb interpretation, as this interpretation would account for three unknown strings in the corpus, whereas the noun explanation covers only two.

(4) Many English verbs are also nouns, so it is useful to supplement morphological data with syntactic collocational data to test whether a word is a noun in addition to a verb. A simple heuristic for this would be to test for the appearance of a determiner (for example, "the" or "a") directly before the potential noun somewhere in the corpus.

(5) Use heuristics to break ties. For words that have more than one candidate explanation with the same degree of corpus covering, or for words that do not appear to have any inflection variants in the corpus, we apply a set of suffix heuristics. The suffix -ly typically indicates an adverb, whereas -ion indicates a noun. For words for which no suffix heuristic applies, we classify the word as a noun by default.
In English, a language with relatively little morphological inflection, the vast majority of new unknown words are nouns, and this classification is often made on the basis of the default case in step 5. However, we have found this algorithm particularly useful for languages such as French, in which a great deal of morphological inflection exists. We are currently experimenting with it to populate a French lexicon from scratch.

Extracting phrases. Even with the best parsers, syntactic analysis of large text corpora can be computationally expensive, slow, and error prone. Given that our goal was to extract possible noun phrases in an expedient manner, we eschewed full parsing in favor of a more streamlined, if less accurate, approach. We apply a three-pass algorithm (a toy sketch follows the list) that

(1) uses an ordered set of local syntactic context rules (along with the morphological analyzer) to heuristically tag each word in the corpus with a single unambiguous part of speech,

(2) uses a finite-state network that encodes a set of noun phrase recognition rules to identify certain sequences of parts of speech as potential noun phrases, and

(3) applies a filter to the result of step 2 to remove a known set of false alarms. These include candidate phrases that a person in the loop has already explicitly rejected.
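A compressed sketch of the three passes follows. The tiny tag table and the single adjective/noun + noun pattern stand in for the ordered context rules and the finite-state noun phrase network; it illustrates the control flow only, not the actual tools.

    # Compressed sketch of the three passes: (1) heuristically tag each token,
    # (2) accept noun-phrase-like tag sequences, (3) filter known false alarms.
    # The tiny tag table and the single ADJ/NOUN + NOUN pattern stand in for
    # the rule set and finite-state network used in the real tools.

    TAGS = {"the": "DET", "a": "DET", "patch": "NOUN", "file": "NOUN",
            "server": "NOUN", "node": "NOUN", "crashed": "VERB", "faulty": "ADJ"}

    REJECTED = {"faulty patch"}       # phrases a person in the loop has rejected

    def tag(tokens):
        """Pass 1: assign each token a single part of speech (default: NOUN)."""
        return [(tok, TAGS.get(tok, "NOUN")) for tok in tokens]

    def candidate_phrases(tagged):
        """Pass 2: accept adjacent ADJ/NOUN + NOUN pairs as potential noun phrases."""
        phrases = []
        for (w1, t1), (w2, t2) in zip(tagged, tagged[1:]):
            if t1 in ("ADJ", "NOUN") and t2 == "NOUN":
                phrases.append(w1 + " " + w2)
        return phrases

    def extract_phrases(text):
        tagged = tag(text.lower().split())
        return [p for p in candidate_phrases(tagged) if p not in REJECTED]   # pass 3

    print(extract_phrases("the server node crashed after a faulty patch file"))
    # ['server node', 'patch file']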
This algorithm has been quite effective, generating more than 15,000 noun phrases from a 100-Mbyte corpus, with an acceptably low percentage of false alarms (Figure 4 gives examples of acquired phrases). Most false alarms are due to errors made in part-of-speech tagging in step 1.

Figure 4. Examples of acquired phrases:
("VMS system" ((S "VMS") (L "system" nouns)))
("LTA device" ((S "LTA") (L "device" nouns)))
("load host" ((S "load") (L "host" nouns)))
("patch file" ((S "patch") (L "file" nouns)))
("server node" ((S "server") (L "node" nouns)))

These knowledge acquisition procedures make it possible to bring up new databases for use in Stars without an excessive burden of hand-tailoring the lexicon, since the typical help desk environment simply does not have the resources for manual administration of linguistic databases. One area where manual administration is still largely required is in specifying thesaurus relationships among words. While we are pursuing research in automatically extracting thesaurus relationships from corpora, this is still in the very early stages.[12] Fortunately, syntactically related words generally are also semantically related. Thus, we could use the database of noun phrases as a source for generating related terms. That is, words that appear together in phrasal relationships can be made available via the thesaurus as related terms.

The new version of Stars is in the early stages of field testing at several of Digital's customer support centers. Users have indicated that, subjectively at least, the new search engine allows them to retrieve articles with greater confidence and precision. Because they are adjusting to a number of new facilities in the tool (including the switch from a character cell interface to a window-based one), many have not yet begun to take full advantage of the query reformulation facilities. Thus it is still premature to speculate how much more effective users will be once they are comfortable with the range of new options. We are continuing to alter the user interface as users request. We are also looking into the possibility of integrating other retrieval models, where appropriate. For example, if the user finds an article that is closely related to but that does not completely fulfill some information need, it might be useful to allow a vector space or probabilistic search to test for other similar articles in the database.
We view our present work as part of an evolutionary approach to incorporating even more sophisticated linguistic processing. For example, a system that tried word sense disambiguation, concept identification, or the generation of syntactic variants of larger phrasal units could also use a graphical device like the query reformulation workspace to make its inferences public and malleable. At the same time, our experience integrating even modest amounts of natural language processing in an information retrieval system has made us aware of the subtle relationships that can exist between components. More sophisticated use of natural language processing in information retrieval will have to be accompanied by continued practical research in system architecture and performance.

Acknowledgments

The evolution of the ideas embodied in AI-Stars and its implementation have been a group effort. Jeffrey Robbins was the principal designer and implementor of the original Stars system. Past and current members of the AI-Stars team who have contributed to the work cited here include Bryan Alvey, Suzanne Artemieff, Jeff Brennen, Rex Flynn, David Hanssen, Jong Kim, Jim Moore, Mayank Prakash, and Clark Wright. I also wish to acknowledge Jim Miller and Ralph Swick of Digital's Cambridge Research Lab for their collaboration on the chart abstract data type, as well as the many database administrators and specialists who have offered their insights, critiques, and support.

References

1. R.M. Tong and D.G. Shapiro, "Experimental Investigations of Uncertainty in a Rule-Based System for Information Retrieval," Int'l J. Man-Machine Studies, Vol. 22, No. 3, 1985, pp. 265-282.

2. W.B. Croft and R.T. Thompson, "I3R: A New Approach to the Design of Document Retrieval Systems," J. Am. Soc. for Information Science, Vol. 38, 1987, pp. 389-404.

3. G. Guida and C. Tasso, "IR-NLI: An Expert Natural Language Interface to On-Line Databases," Proc. ACL Conf. Applied Natural Language Processing, Assoc. for Computational Linguistics, Morristown, N.J., 1983.

4. B. Kahle, "An Information System for Corporate Users: Wide-Area Information Servers," Tech. Report TMC-199, Thinking Machines, Cambridge, Mass.

5. G. Salton and M.J. McGill, Introduction to Modern Information Retrieval, McGraw-Hill, New York, 1983.

6. N.J. Belkin and W.B. Croft, "Retrieval Techniques," in Ann. Rev. Information Science and Technology, Vol. 22, Martha Williams, ed., Elsevier Science Publishers, New York, 1987, pp. 109-145.

7. M. Kay, "Algorithm Schemata and Data Structures in Syntactic Processing," Tech. Report CSL-80-12, Xerox Palo Alto Research Center, Palo Alto, Calif., 1980.

8. G. Salton, Automatic Text Processing: The Transformation, Analysis, and Retrieval of Information by Computer, Addison-Wesley, Reading, Mass., 1989.

9. P.G. Anick et al., "A Direct Manipulation Interface for Boolean Information Retrieval via Natural Language Query," Proc. ACM SIGIR Conf. on Research and Development in Information Retrieval, 1990.

10. Tech. Report TR91-12063, Dept. of Computer Science, Cornell Univ., Ithaca, N.Y., 1991.

11. Computational Linguistics, special issue on using large corpora, Vol. 19, No. 1, 1993.

12. J. Pustejovsky, S. Bergler, and P. Anick, "Lexical Semantic Techniques for Corpus Analysis," Computational Linguistics, Vol. 19, No. 2, 1993, pp. 331-358.

Peter G. Anick's research interests include expert system interfaces, full-text information retrieval, and machine-aided translation. He holds an MS from the University of Wisconsin-Madison and is pursuing a PhD at Brandeis University. He is a member of the Association for Computational Linguistics and the ACM SIGIR. He can be reached at DEC, 111 Locke Dr., LM02-1/D12, Marlboro, MA 01752; e-mail, anick@aiag.enet.dec.com.
