
DATA MINING

ABSTRACT

The possibilities for data mining from large text collections are virtually untapped. Text expresses a vast, rich range of information, but encodes this information in a form that is difficult to decipher automatically. Perhaps for this reason, there has been little work in text data mining to date, and most people who have talked about it have either conflated it with information access or have not made use of text directly to discover heretofore unknown information.

In this paper we will first define data mining, information access, and corpus-based computational linguistics, and then discuss the relationship of these to text data mining. The intent behind these contrasts is to draw attention to exciting new kinds of problems for computational linguists. We describe examples of what we consider to be real text data mining efforts and briefly outline our recent ideas about how to pursue exploratory data analysis over text.

INTRODUCTION

The nascent field of text data mining (TDM) has the peculiar distinction of having a name and a fair amount of hype but as yet almost no practitioners. We suspect this has happened because people assume TDM is a natural extension of the slightly less nascent field of data mining (DM), also known as knowledge discovery in databases and information archeology. Additionally, there are some disagreements about what actually constitutes data mining.

It turns out that "mining" is not a very good metaphor for what people in the field actually do. Mining implies extracting precious nuggets of ore from otherwise worthless rock. If data mining really followed this metaphor, it would mean that people were discovering new factoids within their inventory databases. In practice, however, this is not really the case. Instead, data mining applications tend to be (semi)automated discovery of trends and patterns across very large datasets, usually for the purposes of decision making. Part of what we wish to argue here is that in the case of text, it can be interesting to take the mining-for-nuggets metaphor seriously.

TDM vs. INFORMATION ACCESS

It is important to differentiate between text data mining and information access (or information retrieval, as it is more widely known). The goal of information access is to help users find documents that satisfy their information needs. The standard procedure is akin to looking for needles in a needle stack: the problem is not so much that the desired information is unknown, but rather that it coexists with many other valid pieces of information. Just because a user is currently interested in NAFTA and not Furbies does not mean that all descriptions of Furbies are worthless. The problem is one of homing in on what is currently of interest to the user.

As noted above, the goal of data mining is to discover or derive new information from data, finding patterns across datasets and/or separating signal from noise. The fact that an information retrieval system can return a document that contains the information a user requested does not imply that a new discovery has been made: the information had to have already been known to the author of the text; otherwise the author could not have written it down.

We have observed that many people, when asked about text data mining, assume it should have something to do with "making things easier to find on the web". For example, the description of the KDD-97 panel on Data Mining and the Web stated:

Two challenges are predominant for data mining on the Web. The first goal is to help users in finding useful information on the Web and in discovering knowledge about a domain that is represented by a collection of Web-documents. The second goal is to analyse the transactions run in a Web-based system, be it to optimize the system or to find information about the clients using the system.

This search-centric view misses the point that we might actually want to treat the information in the web as a large knowledge base from which we can extract new, never-before encountered information.

On the other hand, the results of certain types of text processing can yield tools that indirectly aid in the information access process. Examples include text clustering to create thematic overviews of text collections, automatically generating term associations to aid in query expansion, and using co-citation analysis to find general topics within a collection or identify central web pages.

Aside from providing tools to aid in the standard information access process, we think text data mining can contribute along another dimension. In the future we hope to see information access systems supplemented with tools for exploratory data analysis. Our efforts in this direction are embodied in the LINDI project, described below.

TDM AND COMPUTATIONAL LINGUISTICS

If we extrapolate from data mining (as practiced) on numerical data to data mining from text collections, we discover that there already exists a field engaged in text data mining: corpus-based computational linguistics! Empirical computational linguistics computes statistics over large text collections in order to discover useful patterns. These patterns are used to inform algorithms for various subproblems within natural language processing, such as part-of-speech tagging, word sense disambiguation, and bilingual dictionary creation.

It is certainly of interest to a computational linguist that the words "prices", "prescription", and "patent" are highly likely to co-occur with the medical sense of "drug", while "abuse", "paraphernalia", and "illicit" are likely to co-occur with the illegal-drug sense of this word. However, the kinds of patterns found and used in computational linguistics are not likely to be what the general business community hopes for when it uses the term text data mining.

Within the computational linguistics framework, efforts in automatic augmentation of existing lexical structures seem to fit the data-mining-as-ore-extraction metaphor. Examples include automatic augmentation of WordNet relations by identifying lexico-syntactic patterns that unambiguously indicate those relations, and automatic acquisition of subcategorization data from large text corpora. However, these serve the specific needs of computational linguistics and are not applicable to a broader audience.
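The sense-discriminating co-occurrence statistic described above can be sketched in a few lines. The four-sentence corpus, the sentence-level window, and the raw counts are our own simplifications for illustration; a real system would count over a large collection and use an association measure such as mutual information.

```python
from collections import Counter

# Toy corpus standing in for a large text collection; in practice these
# counts would be computed over millions of sentences.
sentences = [
    "the prescription drug lowered prices under the new patent",
    "police seized illicit drug paraphernalia in the abuse case",
    "patent disputes raised prescription drug prices",
    "drug abuse and illicit paraphernalia were reported",
]

# Count how often each word co-occurs with "drug" in the same sentence.
cooccur = Counter()
for s in sentences:
    words = set(s.split())
    if "drug" in words:
        for w in words - {"drug"}:
            cooccur[w] += 1

# Words like "prescription"/"prices"/"patent" cluster with the medical
# sense; "abuse"/"illicit"/"paraphernalia" with the illegal-drug sense.
print(cooccur.most_common())
```

From such counts, a word sense disambiguation algorithm can score an ambiguous occurrence of "drug" by which cluster of neighbors surrounds it.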
TDM AND CATEGORY METADATA

Some researchers have claimed that text categorization should be considered text data mining. Although analogies can be found in the data mining literature (e.g., referring to classification of astronomical phenomena as data mining), we believe that when applied to text categorization this is a misnomer. Text categorization is a boiling down of the specific content of a document into one (or more) of a set of pre-defined labels. This does not lead to discovery of new information; presumably the person who wrote the document knew what it was about. Rather, it produces a compact summary of something that is already known.

However, there are two recent areas of inquiry that make use of text categorization and do seem to fit within the conceptual framework of discovery of trends and patterns within textual data for more general-purpose usage.

One body of work uses text category labels (associated with Reuters newswire) to find "unexpected patterns" among text articles. The main approach is to compare distributions of category assignments within subsets of the document collection. For instance, distributions of commodities in country C1 are compared against those of country C2 to see if interesting or unexpected trends can be found. Extending this idea, one country's export trends might be compared against those of a set of countries that are seen as an economic unit (such as the G-7).

Another effort is that of the DARPA Topic Detection and Tracking initiative. While several of the tasks within this initiative are standard text analysis problems (such as categorization and segmentation), there is an interesting task called On-line New Event Detection, whose input is a stream of news stories in chronological order, and whose output is a yes/no decision for each story, made at the time the story arrives, indicating whether the story is the first reference to a newly occurring event. In other words, the system must detect the first instance of what will become a series of reports on some important topic. Although this can be viewed as a standard classification task (where the class is a binary assignment to the new-event class), it is more in the spirit of data mining, in that the focus is on discovery of the beginning of a new theme or trend.

The reason we consider these examples - using multiple occurrences of text categories to detect trends or patterns - to be "real" data mining is that they use text metadata to tell us something about the world, outside of the text collection itself. (However, since these applications use metadata associated with text documents, rather than the text directly, it is unclear whether they should be considered text data mining or standard data mining.) The computational linguistics applications tell us how to improve language analysis, but they do not discover more widely usable information.

The various contrasts made are summarized in Table 1.

Table 1: A classification of data mining and text data mining applications.

                      Finding Patterns             Finding Nuggets
                                                   Novel         Non-Novel
  Non-textual data    standard data mining         ?             database queries
  Textual data        computational linguistics    real TDM      information retrieval

TEXT DATA MINING AS EXPLORATORY DATA ANALYSIS

Another way to view text data mining is as a process of exploratory data analysis that leads to the discovery of heretofore unknown information, or to answers to questions for which the answer is not currently known.

Of course, it can be argued that the standard practice of reading textbooks, journal articles, and other documents helps researchers in the discovery of new information, since this is an integral part of the research process. However, the idea here is to use text for discovery in a more direct manner. Two examples are described below.

USING TEXT TO FORM HYPOTHESES ABOUT DISEASE

For more than a decade, Don Swanson has eloquently argued why it is plausible to expect new information to be derivable from text collections: experts can only read a small subset of what is published in their fields and are often unaware of developments in related fields. Thus it should be possible to find useful linkages between information in related literatures, if the authors of those literatures rarely refer to one another's work. Swanson has shown how chains of causal implication within the medical literature can lead to hypotheses for causes of rare diseases, some of which have received supporting experimental evidence.

For example, when investigating causes of migraine headaches, he extracted various pieces of evidence from titles of articles in the biomedical literature. Some of these clues can be paraphrased as follows:

- stress is associated with migraines
- stress can lead to loss of magnesium
- calcium channel blockers prevent some migraines
- magnesium is a natural calcium channel blocker
- spreading cortical depression (SCD) is implicated in some migraines
- high levels of magnesium inhibit SCD
- migraine patients have high platelet aggregability
- magnesium can suppress platelet aggregability

These clues suggest that magnesium deficiency may play a role in some kinds of migraine headache, a hypothesis which did not exist in the literature at the time Swanson found these links. The hypothesis has to be tested via non-textual means, but the important point is that a new, potentially plausible medical hypothesis was derived from a combination of text fragments and the explorer's medical expertise. (According to swanson91, subsequent study found support for the magnesium-migraine hypothesis.)

This approach has been only partially automated. There is, of course, a potential for combinatorial explosion of potentially valid links. Beeferman98 has developed a flexible interface and analysis tool for exploring certain kinds of chains of links among lexical relations within WordNet.
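Swanson's chaining of clues can be caricatured as an intersection of intermediate topics drawn from two disjoint literatures. The topic sets below merely paraphrase the clues listed above; they are not the output of any real extraction system.

```python
# Complementary-literatures idea in miniature: if the migraine literature
# links migraines to some intermediate topic B, and a disjoint literature
# links B to magnesium, the chain migraine -> B -> magnesium suggests a
# hypothesis worth testing outside the text.
migraine_literature = {      # topics co-occurring with "migraine" in titles
    "stress", "calcium channel blockers", "SCD", "platelet aggregability",
}
magnesium_literature = {     # topics co-occurring with "magnesium" in titles
    "stress", "calcium channel blockers", "SCD", "platelet aggregability",
    "kidney function",
}

# Shared intermediate topics form the candidate bridges; each one is only
# a lead that an expert (or a pruning algorithm) must still evaluate.
bridges = migraine_literature & magnesium_literature
print(sorted(bridges))
```

Even in this toy form, the number of candidate bridges grows with the product of the two literatures' vocabularies, which is why pruning dominates the problem.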
However, sophisticated new algorithms are
needed for helping in the pruning process, since a good pruning algorithm will want to take into account various kinds of semantic constraints. This may be an interesting area of investigation for computational linguists.

USING TEXT TO UNCOVER SOCIAL IMPACT

Switching to an entirely different domain, consider a recent effort to determine the effects of publicly financed research on industrial advances. After years of preliminary studies and building special-purpose tools, the authors found that the technology industry relies more heavily than ever on government-sponsored research results. The authors explored relationships among patent text and the published research literature, using a procedure which was reported as follows in broad97:

The CHI Research team examined the science references on the front pages of American patents in two recent periods - 1987 and 1988, as well as 1993 and 1994 - looking at all the 397,660 patents issued. It found 242,000 identifiable science references and zeroed in on those published in the preceding 11 years, which turned out to be 80 percent of them. Searches of computer databases allowed the linking of 109,000 of these references to known journals and authors' addresses. After eliminating redundant citations to the same paper, as well as articles with no known American author, the study had a core collection of 45,000 papers.

Armies of aides then fanned out to libraries to look up the papers and examine their closing lines, which often say who financed the research. That detective work revealed an extensive reliance on publicly financed science. Further narrowing its focus, the study set aside patents given to schools and governments and zeroed in on those awarded to industry. For 2,841 patents issued in 1993 and 1994, it examined the peak year of literature references, 1988, and found 5,217 citations to science papers. Of these, it found that 73.3 percent had been written at public institutions - universities, government labs and other public agencies, both in the United States and abroad.

Thus a heterogeneous mix of operations was required to conduct a complex analysis over large text collections. These operations included:

1. Retrieval of articles from a particular collection (patents) within a particular date range.
2. Identification of the citation pool (articles cited by the patents).
3. Bracketing of this pool by date, creating a new subset of articles.
4. Computation of the percentage of articles that remain after bracketing.
5. Joining these results with those of other collections to identify the publishers of articles in the pool.
6. Elimination of redundant articles.
7. Elimination of articles based on an attribute type (author nationality).
8. Location of full-text versions of the articles.
9. Extraction of a special attribute from the full text (the acknowledgement of funding).
10. Classification of this attribute (by institution type).
11. Narrowing the set of articles to consider by an attribute (institution type).
12. Computation of statistics over one of the attributes (peak year).
13. Computation of the percentage of articles for which one attribute has been assigned another attribute type (whose citation attribute has a particular institution attribute).

Because not all the data was available online, much of the work had to be done by hand, and special-purpose tools were required to perform the operations.

THE LINDI PROJECT

The objectives of the LINDI project are to investigate how researchers can use large text collections in the discovery of new important information, and to build software systems to help support this process. The main tools for discovering new information are of two types: support for issuing sequences of queries and related operations across text collections, and tightly coupled statistical and visualization tools for the examination of associations among concepts that co-occur within the retrieved documents. Both sets of tools make use of attributes associated specifically with text collections and their metadata. Thus the broadening, narrowing, and linking of relations seen in the patent example should be tightly integrated with analysis and interpretation tools, as needed in the biomedical example.

Following amant96, the interaction paradigm is that of a mixed-initiative balance of control between user and system. The interaction is a cycle in which the system suggests hypotheses and strategies for investigating these hypotheses, and the user either uses or ignores these suggestions and decides on the next move.

We are interested in an important problem in molecular biology: that of automating the discovery of the function of newly sequenced genes. Human genome researchers perform experiments in which they analyze the co-expression of tens of thousands of novel and known genes simultaneously. Given this huge collection of genetic information, the goal is to determine which of the novel genes are medically interesting, meaning that they are co-expressed with already understood genes that are known to be involved in disease. Our strategy is to explore the biomedical literature, trying to formulate plausible hypotheses about which genes are of interest.

Most information access systems require the user to execute and keep track of tactical moves, often distracting from the thought-intensive aspects of the problem. The LINDI interface provides a facility for users to build and reuse sequences of query operations via a drag-and-drop interface. These allow the user to repeat the same sequence of actions for different queries. In the gene example, this allows the user to specify a sequence of operations to apply to one co-expressed gene, and then iterate this sequence over a list of other co-expressed genes that can be dragged onto the template. (The Visage interface implements this kind of functionality within its information-centric framework.) These include the following operations (see Figure 1):

- Iteration of an operation over the items within a set. (This allows each item retrieved in a previous query to be used as a search term for a new query.)
- Transformation, i.e., applying an operation to an item and returning a transformed item (such as extracting a feature).
- Ranking, i.e., applying an operation to a set of items and returning a (possibly) reordered set of items with the same cardinality.
- Selection, i.e., applying an operation to a set of items and returning a (possibly) reordered set of items with the same or smaller cardinality.
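As a rough sketch, the four operation types above can be modeled as ordinary higher-order functions. The function names and the toy keyword-count scorer below are our own illustration, not LINDI's actual interface.

```python
from typing import Callable, Iterable, List, TypeVar

T = TypeVar("T")
U = TypeVar("U")

# Iteration: apply a per-item operation (e.g., a new query) to every item.
def iterate(op: Callable[[T], U], items: Iterable[T]) -> List[U]:
    return [op(x) for x in items]

# Transformation: turn one item into a transformed item (e.g., a feature).
def transform(op: Callable[[T], U], item: T) -> U:
    return op(item)

# Ranking: reorder a set of items, keeping the same cardinality.
def rank(score: Callable[[T], float], items: Iterable[T]) -> List[T]:
    return sorted(items, key=score, reverse=True)

# Selection: reorder and possibly shrink a set of items.
def select(keep: Callable[[T], bool], score: Callable[[T], float],
           items: Iterable[T]) -> List[T]:
    return [x for x in rank(score, items) if keep(x)]

# A hypothetical reusable template: rank documents by how often they
# mention genes, then keep only documents mentioning more than one.
docs = ["gene A binds calcium", "gene B", "gene A and gene C regulate SCD"]
hits = lambda d: d.count("gene")
top = select(lambda d: hits(d) > 1, hits, docs)
print(top)
```

Because each operation takes and returns plain sets of items, sequences of them compose into exactly the kind of drag-and-drop template the text describes, reusable across different starting genes.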

This system will allow maintenance of several different types of history, including history of commands issued, history of strategies employed, and history of hypotheses tested. For the history view, we plan to use a "spreadsheet" layout as well as a variation on a "slide sorter" view, which Visage uses for presentation creation but not for history retention.

Since gene function discovery is a new area, there is not yet a known set of exploration strategies. So initially the system must help an expert user generate and record good exploration strategies. The user interface provides a mechanism for recording and modifying sequences of actions. These include facilities that refer to metadata structure, allowing, for example, query terms to be expanded by terms one level above or below them in a subject hierarchy. Once a successful set of strategies has been devised, they can be re-used by other researchers and (with luck) by an automated version of the system. The intent is to build up enough strategies that the system will begin to be used as an assistant or advisor, ranking hypotheses according to projected importance and plausibility.

Thus the emphasis of this system is to help automate the tedious parts of the text manipulation process and to integrate underlying computationally-driven text analysis with human-guided decision making within exploratory data analysis over text.

Figure 1: A hypothetical sequence of operations for the exploration of gene function within a biomedical text collection, where the functions of genes A, B, and C are known, and commonalities are sought to hypothesize the function of the unknown gene. The mapping operation imposes a rank ordering on the selected keywords. The final operation is a selection of only those documents that contain at least one of the top-ranked keywords and that contain mentions of all three known genes.

SUMMARY

For almost a decade the computational linguistics community has viewed large text collections as a resource to be tapped in order to produce better text analysis algorithms. In this paper, we have attempted to suggest a new emphasis: the use of large online text collections to discover new facts and trends about the world itself. We suggest that to make progress we do not need fully artificially intelligent text analysis; rather, a mixture of computationally-driven and user-guided analysis may open the door to exciting new results.

Project: LINDI: Linking Information for Novel Discovery and Insight.

BIBLIOGRAPHY

Allan et al. 1998
J. Allan, J. Carbonell, G. Doddington, J. Yamron, and Y. Yang. 1998.

Narin et al. 1997
Francis Narin, Kimberly S. Hamilton, and Dominic Olivastro. 1997.

Armstrong 1994
Susan Armstrong, editor. 1994. Using Large Corpora. MIT Press.

Baeza-Yates and Ribeiro-Neto
Ricardo Baeza-Yates and Berthier Ribeiro-Neto. Modern Information Retrieval.

Peat and Willett 1991
Helen J. Peat and Peter Willett. 1991.

Bates 1990
Marcia J. Bates. 1990.

Craven et al. 1998
M. Craven, D. DiPasquo, D. Freitag, A. McCallum, T. Mitchell, K. Nigam, and S. Slattery. 1998.

Ramadan et al. 1989
N. M. Ramadan, H. Halvorson, A. Vandelinde, and S. R. Levine. 1989. Low brain magnesium in migraine.

Dagan et al. 1996
Ido Dagan, Ronen Feldman, and Haym Hirsh. 1996.
