Fayyad: From Data Mining to Knowledge Discovery in Databases
■ Data mining and knowledge discovery in databases have been attracting a significant amount of research, industry, and media attention of late. What is all the excitement about? This article provides an overview of this emerging field, clarifying how data mining and knowledge discovery in databases are related both to each other and to related fields, such as machine learning, statistics, and databases. The article mentions particular real-world applications, specific data-mining techniques, challenges involved in real-world applications of knowledge discovery, and current and future research directions in the field.

This article begins by discussing the historical context of KDD and data mining and their intersection with other related fields. A brief summary of recent KDD real-world applications is provided. Definitions of KDD and data mining are provided, and the general multistep KDD process is outlined. This multistep process has the application of data-mining algorithms as one particular step in the process. The data-mining step is discussed in more detail in the context of specific data-mining algorithms and their application. Real-world practical application issues are also outlined. Finally, the article enumerates challenges for future research and development and in particular discusses potential opportunities for AI technology in KDD systems.

Copyright © 1996, American Association for Artificial Intelligence. All rights reserved. 0738-4602-1996 / $2.00. Fall 1996.

Across a wide variety of fields, data are being collected and accumulated at a dramatic pace. There is an urgent need for a new generation of computational theories and tools to assist humans in extracting useful information (knowledge) from the rapidly growing volumes of digital data. These theories and tools are the subject of the emerging field of knowledge discovery in databases (KDD).

At an abstract level, the KDD field is concerned with the development of methods and techniques for making sense of data. The basic problem addressed by the KDD process is one of mapping low-level data (which are typically too voluminous to understand and digest easily) into other forms that might be more compact (for example, a short report), more abstract (for example, a descriptive approximation or model of the process that generated the data), or more useful (for example, a predictive model for estimating the value of future cases). At the core of the process is the application of specific data-mining methods for pattern discovery and extraction.1

Why Do We Need KDD?

The traditional method of turning data into knowledge relies on manual analysis and interpretation. For example, in the health-care industry, it is common for specialists to periodically analyze current trends and changes in health-care data, say, on a quarterly basis. The specialists then provide a report detailing the analysis to the sponsoring health-care organization; this report becomes the basis for future decision making and planning for health-care management. In a totally different type of application, planetary geologists sift through remotely sensed images of planets and asteroids, carefully locating and cataloging such geologic objects of interest as impact craters. Be it science, marketing, finance, health care, retail, or any other field, the classical approach to data analysis relies fundamentally on one or more analysts becoming
intimately familiar with the data and serving as an interface between the data and the users and products.

For these (and many other) applications, this form of manual probing of a data set is slow, expensive, and highly subjective. In fact, as data volumes grow dramatically, this type of manual data analysis is becoming completely impractical in many domains. Databases are increasing in size in two ways: (1) the number N of records or objects in the database and (2) the number d of fields or attributes of an object. Databases containing on the order of N = 10^9 objects are becoming increasingly common, for example, in the astronomical sciences. Similarly, the number of fields d can easily be on the order of 10^2 or even 10^3, for example, in medical diagnostic applications. Who could be expected to digest millions of records, each having tens or hundreds of fields? We believe that this job is certainly not one for humans; hence, analysis work needs to be automated, at least partially.

The need to scale up human analysis capabilities to handling the large number of bytes that we can collect is both economic and scientific. Businesses use data to gain competitive advantage, increase efficiency, and provide more valuable services to customers. Data we capture about our environment are the basic evidence we use to build theories and models of the universe we live in. Because computers have enabled humans to gather more data than we can digest, it is only natural to turn to computational techniques to help us unearth meaningful patterns and structures from the massive volumes of data. Hence, KDD is an attempt to address a problem that the digital information era made a fact of life for all of us: data overload.

Data Mining and Knowledge Discovery in the Real World

A large degree of the current interest in KDD is the result of the media interest surrounding successful KDD applications, for example, the focus articles within the last two years in Business Week, Newsweek, Byte, PC Week, and other large-circulation periodicals. Unfortunately, it is not always easy to separate fact from media hype. Nonetheless, several well-documented examples of successful systems can rightly be referred to as KDD applications and have been deployed in operational use on large-scale real-world problems in science and in business.

In science, one of the primary application areas is astronomy. Here, a notable success was achieved by SKICAT, a system used by astronomers to perform image analysis, classification, and cataloging of sky objects from sky-survey images (Fayyad, Djorgovski, and Weir 1996). In its first application, the system was used to process the 3 terabytes (10^12 bytes) of image data resulting from the Second Palomar Observatory Sky Survey, where it is estimated that on the order of 10^9 sky objects are detectable. SKICAT can outperform humans and traditional computational techniques in classifying faint sky objects. See Fayyad, Haussler, and Stolorz (1996) for a survey of scientific applications.

In business, the main KDD application areas include marketing, finance (especially investment), fraud detection, manufacturing, telecommunications, and Internet agents.

Marketing: In marketing, the primary application is database marketing systems, which analyze customer databases to identify different customer groups and forecast their behavior. Business Week (Berry 1994) estimated that over half of all retailers are using or planning to use database marketing, and those who do use it have good results; for example, American Express reports a 10- to 15-percent increase in credit-card use. Another notable marketing application is market-basket analysis (Agrawal et al. 1996) systems, which find patterns such as, "If customer bought X, he/she is also likely to buy Y and Z." Such patterns are valuable to retailers.

Investment: Numerous companies use data mining for investment, but most do not describe their systems. One exception is LBS Capital Management. Its system uses expert systems, neural nets, and genetic algorithms to manage portfolios totaling $600 million; since its start in 1993, the system has outperformed the broad stock market (Hall, Mani, and Barr 1996).

Fraud detection: HNC Falcon and Nestor PRISM systems are used for monitoring credit-card fraud, watching over millions of accounts. The FAIS system (Senator et al. 1995), from the U.S. Treasury Financial Crimes Enforcement Network, is used to identify financial transactions that might indicate money-laundering activity.

Manufacturing: The CASSIOPEE troubleshooting system, developed as part of a joint venture between General Electric and SNECMA, was applied by three major European airlines to diagnose and predict problems for the Boeing 737. To derive families of faults, clustering methods are used. CASSIOPEE received the European first prize for innovative applications (Manago and Auriol 1996).
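The market-basket patterns mentioned above ("if customer bought X, he/she is also likely to buy Y") are conventionally scored by support (how often the items co-occur) and confidence (how often the conclusion holds when the premise does). The sketch below is an illustration only, not any of the cited systems: the baskets and thresholds are invented, and it uses brute force where real systems rely on Apriori-style pruning (Agrawal et al. 1996) to scale to large databases.

```python
from itertools import combinations
from collections import Counter

def association_rules(baskets, min_support=0.3, min_confidence=0.6):
    """Find single-item rules 'if A then C' with their support and confidence."""
    n = len(baskets)
    item_counts = Counter(item for b in baskets for item in set(b))
    pair_counts = Counter(
        pair for b in baskets for pair in combinations(sorted(set(b)), 2)
    )
    rules = []
    for (x, y), count in pair_counts.items():
        support = count / n          # fraction of baskets containing both items
        if support < min_support:
            continue
        for a, c in ((x, y), (y, x)):  # evaluate the rule in both directions
            confidence = count / item_counts[a]
            if confidence >= min_confidence:
                rules.append((a, c, support, confidence))
    return rules

baskets = [
    {"bread", "milk"},
    {"bread", "diapers", "beer"},
    {"milk", "diapers", "beer"},
    {"bread", "milk", "diapers", "beer"},
    {"bread", "milk", "diapers"},
]
for a, c, s, conf in association_rules(baskets):
    print(f"if {a} then {c} (support={s:.2f}, confidence={conf:.2f})")
```

On these toy baskets, for example, every basket containing beer also contains diapers, so "if beer then diapers" comes out with confidence 1.0.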
Telecommunications: The telecommunications alarm-sequence analyzer (TASA) was built in cooperation with a manufacturer of telecommunications equipment and three telephone networks (Mannila, Toivonen, and Verkamo 1995). The system uses a novel framework for locating frequently occurring alarm episodes from the alarm stream and presenting them as rules. Large sets of discovered rules can be explored with flexible information-retrieval tools supporting interactivity and iteration. In this way, TASA offers pruning, grouping, and ordering tools to refine the results of a basic brute-force search for rules.

Data cleaning: The MERGE-PURGE system was applied to the identification of duplicate welfare claims (Hernandez and Stolfo 1995). It was used successfully on data from the Welfare Department of the State of Washington.

In other areas, a well-publicized system is IBM's ADVANCED SCOUT, a specialized data-mining system that helps National Basketball Association (NBA) coaches organize and interpret data from NBA games (U.S. News 1995). ADVANCED SCOUT was used by several of the NBA teams in 1996, including the Seattle Supersonics, which reached the NBA finals.

Finally, a novel and increasingly important type of discovery is one based on the use of intelligent agents to navigate through an information-rich environment. Although the idea of active triggers has long been analyzed in the database field, really successful applications of this idea appeared only with the advent of the Internet. These systems ask the user to specify a profile of interest and search for related information among a wide variety of public-domain and proprietary sources. For example, FIREFLY is a personal music-recommendation agent: It asks a user his/her opinion of several music pieces and then suggests other music that the user might like (<http://www.ffly.com/>). CRAYON (<http://crayon.net/>) allows users to create their own free newspaper (supported by ads); NEWSHOUND (<http://www.sjmercury.com/hound/>) from the San Jose Mercury News and FARCAST (<http://www.farcast.com/>) automatically search information from a wide variety of sources, including newspapers and wire services, and e-mail relevant documents directly to the user.

These are just a few of the numerous such systems that use KDD techniques to automatically produce useful information from large masses of raw data. See Piatetsky-Shapiro et al. (1996) for an overview of issues in developing industrial KDD applications.

Data Mining and KDD

Historically, the notion of finding useful patterns in data has been given a variety of names, including data mining, knowledge extraction, information discovery, information harvesting, data archaeology, and data pattern processing. The term data mining has mostly been used by statisticians, data analysts, and the management information systems (MIS) communities. It has also gained popularity in the database field. The phrase knowledge discovery in databases was coined at the first KDD workshop in 1989 (Piatetsky-Shapiro 1991) to emphasize that knowledge is the end product of a data-driven discovery. It has been popularized in the AI and machine-learning fields.

In our view, KDD refers to the overall process of discovering useful knowledge from data, and data mining refers to a particular step in this process. Data mining is the application of specific algorithms for extracting patterns from data. The distinction between the KDD process and the data-mining step (within the process) is a central point of this article. The additional steps in the KDD process, such as data preparation, data selection, data cleaning, incorporation of appropriate prior knowledge, and proper interpretation of the results of mining, are essential to ensure that useful knowledge is derived from the data. Blind application of data-mining methods (rightly criticized as data dredging in the statistical literature) can be a dangerous activity, easily leading to the discovery of meaningless and invalid patterns.

The Interdisciplinary Nature of KDD

KDD has evolved, and continues to evolve, from the intersection of research fields such as machine learning, pattern recognition, databases, statistics, AI, knowledge acquisition for expert systems, data visualization, and high-performance computing. The unifying goal is extracting high-level knowledge from low-level data in the context of large data sets.

The data-mining component of KDD currently relies heavily on known techniques from machine learning, pattern recognition, and statistics to find patterns from data in the data-mining step of the KDD process. A natural question is, How is KDD different from pattern recognition or machine learning (and related fields)? The answer is that these fields provide some of the data-mining methods that are used in the data-mining step of the KDD process. KDD focuses on the overall process of knowledge discovery from data, including how the data are stored and accessed, how algorithms can be scaled to massive data sets
and still run efficiently, how results can be interpreted and visualized, and how the overall man-machine interaction can usefully be modeled and supported. The KDD process can be viewed as a multidisciplinary activity that encompasses techniques beyond the scope of any one particular discipline such as machine learning. In this context, there are clear opportunities for other fields of AI (besides machine learning) to contribute to KDD. KDD places a special emphasis on finding understandable patterns that can be interpreted as useful or interesting knowledge. Thus, for example, neural networks, although a powerful modeling tool, are relatively difficult to understand compared to decision trees. KDD also emphasizes scaling and robustness properties of modeling algorithms for large noisy data sets.

Related AI research fields include machine discovery, which targets the discovery of empirical laws from observation and experimentation (Shrager and Langley 1990) (see Kloesgen and Zytkow [1996] for a glossary of terms common to KDD and machine discovery), and causal modeling for the inference of causal models from data (Spirtes, Glymour, and Scheines 1993). Statistics in particular has much in common with KDD (see Elder and Pregibon [1996] and Glymour et al. [1996] for a more detailed discussion of this synergy). Knowledge discovery from data is fundamentally a statistical endeavor. Statistics provides a language and framework for quantifying the uncertainty that results when one tries to infer general patterns from a particular sample of an overall population. As mentioned earlier, the term data mining has had negative connotations in statistics since the 1960s when computer-based data analysis techniques were first introduced. The concern arose because if one searches long enough in any data set (even randomly generated data), one can find patterns that appear to be statistically significant but, in fact, are not. Clearly, this issue is of fundamental importance to KDD. Substantial progress has been made in recent years in understanding such issues in statistics. Much of this work is of direct relevance to KDD. Thus, data mining is a legitimate activity as long as one understands how to do it correctly; data mining carried out poorly (without regard to the statistical aspects of the problem) is to be avoided. KDD can also be viewed as encompassing a broader view of modeling than statistics. KDD aims to provide tools to automate (to the degree possible) the entire process of data analysis and the statistician's "art" of hypothesis selection.

A driving force behind KDD is the database field (the second D in KDD). Indeed, the problem of effective data manipulation when data cannot fit in the main memory is of fundamental importance to KDD. Database techniques for gaining efficient data access, grouping and ordering operations when accessing data, and optimizing queries constitute the basics for scaling algorithms to larger data sets. Most data-mining algorithms from statistics, pattern recognition, and machine learning assume data are in the main memory and pay no attention to how the algorithm breaks down if only limited views of the data are possible.

A related field evolving from databases is data warehousing, which refers to the popular business trend of collecting and cleaning transactional data to make them available for online analysis and decision support. Data warehousing helps set the stage for KDD in two important ways: (1) data cleaning and (2) data access.

Data cleaning: As organizations are forced to think about a unified logical view of the wide variety of data and databases they possess, they have to address the issues of mapping data to a single naming convention, uniformly representing and handling missing data, and handling noise and errors when possible.

Data access: Uniform and well-defined methods must be created for accessing the data and providing access paths to data that were historically difficult to get to (for example, stored offline).

Once organizations and individuals have solved the problem of how to store and access their data, the natural next step is the question, What else do we do with all the data? This is where opportunities for KDD naturally arise.

A popular approach for analysis of data warehouses is called online analytical processing (OLAP), named for a set of principles proposed by Codd (1993). OLAP tools focus on providing multidimensional data analysis, which is superior to SQL in computing summaries and breakdowns along many dimensions. OLAP tools are targeted toward simplifying and supporting interactive data analysis, but the goal of KDD tools is to automate as much of the process as possible. Thus, KDD is a step beyond what is currently supported by most standard database systems.

Basic Definitions

KDD is the nontrivial process of identifying valid, novel, potentially useful, and ultimately understandable patterns in data (Fayyad, Piatetsky-Shapiro, and Smyth 1996).

[Figure 1. An overview of the steps composing the KDD process: Selection, Preprocessing, Transformation, Data Mining, and Interpretation/Evaluation, leading from raw data through preprocessed data, transformed data, and patterns to knowledge.]

Here, data are a set of facts (for example, cases in a database), and pattern is an expression in some language describing a subset of the data or a model applicable to the subset. Hence, in our usage here, extracting a pattern also designates fitting a model to data; finding structure from data; or, in general, making any high-level description of a set of data. The term process implies that KDD comprises many steps, which involve data preparation, search for patterns, knowledge evaluation, and refinement, all repeated in multiple iterations. By nontrivial, we mean that some search or inference is involved; that is, it is not a straightforward computation of predefined quantities like computing the average value of a set of numbers.

The discovered patterns should be valid on new data with some degree of certainty. We also want patterns to be novel (at least to the system and preferably to the user) and potentially useful, that is, lead to some benefit to the user or task. Finally, the patterns should be understandable, if not immediately then after some postprocessing.

The previous discussion implies that we can define quantitative measures for evaluating extracted patterns. In many cases, it is possible to define measures of certainty (for example, estimated prediction accuracy on new data) or utility (for example, gain, perhaps in dollars saved because of better predictions or speedup in response time of a system). Notions such as novelty and understandability are much more subjective. In certain contexts, understandability can be estimated by simplicity (for example, the number of bits to describe a pattern). An important notion, called interestingness (for example, see Silberschatz and Tuzhilin [1995] and Piatetsky-Shapiro and Matheus [1994]), is usually taken as an overall measure of pattern value, combining validity, novelty, usefulness, and simplicity. Interestingness functions can be defined explicitly or can be manifested implicitly through an ordering placed by the KDD system on the discovered patterns or models.

Given these notions, we can consider a pattern to be knowledge if it exceeds some interestingness threshold, which is by no means an attempt to define knowledge in the philosophical or even the popular view. As a matter of fact, knowledge in this definition is purely user oriented and domain specific and is determined by whatever functions and thresholds the user chooses.

Data mining is a step in the KDD process that consists of applying data analysis and discovery algorithms that, under acceptable computational efficiency limitations, produce a particular enumeration of patterns (or models) over the data. Note that the space of
patterns is often infinite, and the enumeration of patterns involves some form of search in this space. Practical computational constraints place severe limits on the subspace that can be explored by a data-mining algorithm.

The KDD process involves using the database along with any required selection, preprocessing, subsampling, and transformations of it; applying data-mining methods (algorithms) to enumerate patterns from it; and evaluating the products of data mining to identify the subset of the enumerated patterns deemed knowledge. The data-mining component of the KDD process is concerned with the algorithmic means by which patterns are extracted and enumerated from data. The overall KDD process (figure 1) includes the evaluation and possible interpretation of the mined patterns to determine which patterns can be considered new knowledge. The KDD process also includes all the additional steps described in the next section.

The notion of an overall user-driven process is not unique to KDD: analogous proposals have been put forward both in statistics (Hand 1994) and in machine learning (Brodley and Smyth 1996).

The KDD Process

The KDD process is interactive and iterative, involving numerous steps with many decisions made by the user. Brachman and Anand (1996) give a practical view of the KDD process, emphasizing the interactive nature of the process. Here, we broadly outline some of its basic steps:

First is developing an understanding of the application domain and the relevant prior knowledge and identifying the goal of the KDD process from the customer's viewpoint.

Second is creating a target data set: selecting a data set, or focusing on a subset of variables or data samples, on which discovery is to be performed.

Third is data cleaning and preprocessing. Basic operations include removing noise if appropriate, collecting the necessary information to model or account for noise, deciding on strategies for handling missing data fields, and accounting for time-sequence information and known changes.

Fourth is data reduction and projection: finding useful features to represent the data depending on the goal of the task. With dimensionality reduction or transformation methods, the effective number of variables under consideration can be reduced, or invariant representations for the data can be found.

Fifth is matching the goals of the KDD process (step 1) to a particular data-mining method. For example, summarization, classification, regression, clustering, and so on, are described later as well as in Fayyad, Piatetsky-Shapiro, and Smyth (1996).

Sixth is exploratory analysis and model and hypothesis selection: choosing the data-mining algorithm(s) and selecting method(s) to be used for searching for data patterns. This process includes deciding which models and parameters might be appropriate (for example, models of categorical data are different from models of vectors over the reals) and matching a particular data-mining method with the overall criteria of the KDD process (for example, the end user might be more interested in understanding the model than its predictive capabilities).

Seventh is data mining: searching for patterns of interest in a particular representational form or a set of such representations, including classification rules or trees, regression, and clustering. The user can significantly aid the data-mining method by correctly performing the preceding steps.

Eighth is interpreting mined patterns, possibly returning to any of steps 1 through 7 for further iteration. This step can also involve visualization of the extracted patterns and models or visualization of the data given the extracted models.

Ninth is acting on the discovered knowledge: using the knowledge directly, incorporating the knowledge into another system for further action, or simply documenting it and reporting it to interested parties. This process also includes checking for and resolving potential conflicts with previously believed (or extracted) knowledge.

The KDD process can involve significant iteration and can contain loops between any two steps. The basic flow of steps (although not the potential multitude of iterations and loops) is illustrated in figure 1. Most previous work on KDD has focused on step 7, the data mining. However, the other steps are as important (and probably more so) for the successful application of KDD in practice. Having defined the basic notions and introduced the KDD process, we now focus on the data-mining component, which has, by far, received the most attention in the literature.
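The multistep process outlined above can be made concrete with a toy pipeline skeleton. This is a sketch only: the function names and the trivial "above-average" model are invented stand-ins for real selection, cleaning, transformation, mining, and interpretation components, but the flow of data through the stages follows the steps described in the text.

```python
# Toy skeleton of the KDD process: selection -> preprocessing ->
# transformation -> data mining -> interpretation/evaluation.
# All components are illustrative stand-ins, not real algorithms.

def select_target(data, field):                      # steps 1-2: target data
    return [row[field] for row in data if field in row]

def clean(values):                                   # step 3: drop missing data
    return [v for v in values if v is not None]

def transform(values):                               # step 4: rescale to [0, 1]
    lo, hi = min(values), max(values)
    return [(v - lo) / (hi - lo) for v in values]

def mine(values):                                    # steps 5-7: fit a trivial "model"
    mean = sum(values) / len(values)
    return {"pattern": "above-average", "threshold": mean}

def evaluate(model, values):                         # steps 8-9: interpret the result
    hits = sum(v > model["threshold"] for v in values)
    return {"model": model, "coverage": hits / len(values)}

raw = [{"spend": 10}, {"spend": None}, {"spend": 30}, {"spend": 20}]
values = transform(clean(select_target(raw, "spend")))
report = evaluate(mine(values), values)
print(report)
```

The point of the skeleton is structural: the mining call is one line among several, and the surrounding selection, cleaning, and evaluation stages do most of the work of turning raw records into an interpretable result.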
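The data-dredging hazard noted earlier, that searching long enough in even randomly generated data turns up patterns that appear statistically significant, is easy to demonstrate. The numbers below are illustrative: the sketch draws a random target and 200 random candidate fields over 30 records and reports the strongest correlation found. With this many candidate fields, a sizable spurious correlation is virtually guaranteed.

```python
# Data dredging in miniature: the best of many random "fields" will
# correlate with a random target by chance alone.
import random

random.seed(0)  # fixed seed so the illustration is reproducible
n_records, n_fields = 30, 200
target = [random.gauss(0, 1) for _ in range(n_records)]

def corr(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    mx, my = sum(xs) / len(xs), sum(ys) / len(ys)
    cov = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    vx = sum((x - mx) ** 2 for x in xs) ** 0.5
    vy = sum((y - my) ** 2 for y in ys) ** 0.5
    return cov / (vx * vy)

best = max(
    abs(corr([random.gauss(0, 1) for _ in range(n_records)], target))
    for _ in range(n_fields)
)
print(f"best |correlation| among {n_fields} random fields: {best:.2f}")
```

None of these fields has any real relationship to the target, which is why the statistical safeguards discussed above (and honest accounting for how many patterns were searched) matter in the data-mining step.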
Articles
FALL 1996 43
Articles
44 AI MAGAZINE
Articles
FALL 1996 45
Articles
46 AI MAGAZINE
Articles
FALL 1996 47
Articles
estimation (Silverman 1986) and mixture modeling (Titterington, Smith, and Makov 1985).

Probabilistic Graphic Dependency Models

Graphic models specify probabilistic dependencies using a graph structure (Whittaker 1990; Pearl 1988). In its simplest form, the model specifies which variables are directly dependent on each other. Typically, these models are used with categorical or discrete-valued variables, but extensions to special cases, such as Gaussian densities, for real-valued variables are also possible. Within the AI and statistical communities, these models were initially developed within the framework of probabilistic expert systems; the structure of the model and the parameters (the conditional probabilities attached to the links of the graph) were elicited from experts. Recently, there has been significant work in both the AI and statistical communities on methods whereby both the structure and the parameters of graphic models can be learned directly from databases (Buntine 1996; Heckerman 1996). Model-evaluation criteria are typically Bayesian in form, and parameter estimation can be a mixture of closed-form estimates and iterative methods depending on whether a variable is directly observed or hidden. Model search can consist of greedy hill-climbing methods over various graph structures. Prior knowledge, such as a partial ordering of the variables based on causal relations, can be useful in terms of reducing the model search space. Although still primarily in the research phase, graphic model induction methods are of particular interest to KDD because the graphic form of the model lends itself easily to human interpretation.

Relational Learning Models

Although decision trees and rules have a representation restricted to propositional logic, relational learning (also known as inductive logic programming) uses the more flexible pattern language of first-order logic. A relational learner can easily find formulas such as X = Y. Most research to date on model-evaluation methods for relational learning is logical in nature. The extra representational power of relational models comes at the price of significant computational demands in terms of search. See Dzeroski (1996) for a more detailed discussion.

Discussion

Given the broad spectrum of data-mining methods and algorithms, our overview is inevitably limited in scope; many data-mining techniques, particularly specialized methods for particular types of data and domains, were not mentioned specifically. We believe the general discussion on data-mining tasks and components has general relevance to a variety of methods. For example, consider time-series prediction, which traditionally has been cast as a predictive regression task (autoregressive models, and so on). Recently, more general models have been developed for time-series applications, such as nonlinear basis functions, example-based models, and kernel methods. Furthermore, there has been significant interest in descriptive graphic and local data modeling of time series rather than purely predictive modeling (Weigend and Gershenfeld 1993). Thus, although different algorithms and applications might appear different on the surface, it is not uncommon to find that they share many common components. Understanding data mining and model induction at this component level clarifies the behavior of any data-mining algorithm and makes it easier for the user to understand its overall contribution and applicability to the KDD process.

An important point is that each technique typically suits some problems better than others. For example, decision tree classifiers can be useful for finding structure in high-dimensional spaces and in problems with mixed continuous and categorical data (because tree methods do not require distance metrics). However, classification trees might not be suitable for problems where the true decision boundaries between classes are described by a second-order polynomial (for example). Thus, there is no universal data-mining method, and choosing a particular algorithm for a particular application is something of an art. In practice, a large portion of the application effort can go into properly formulating the problem (asking the right question) rather than into optimizing the algorithmic details of a particular data-mining method (Langley and Simon 1995; Hand 1994).

Because our discussion and overview of data-mining methods has been brief, we want to make two important points clear:

First, our overview of automated search focused mainly on automated methods for extracting patterns or models from data. Although this approach is consistent with the definition we gave earlier, it does not necessarily represent what other communities might refer to as data mining. For example, some use the term to designate any manual
Articles
search of the data or search assisted by queries to a database management system or to refer to humans visualizing patterns in data. In other communities, it is used to refer to the automated correlation of data from transactions or the automated generation of transaction reports. We choose to focus only on methods that contain certain degrees of search autonomy.

Second, beware the hype: the state of the art in automated methods in data mining is still in a fairly early stage of development. There are no established criteria for deciding which methods to use in which circumstances, and many of the approaches are based on crude heuristic approximations to avoid the expensive search required to find optimal, or even good, solutions. Hence, the reader should be careful when confronted with overstated claims about the great ability of a system to mine useful information from large (or even small) databases.

Application Issues

For a survey of KDD applications as well as detailed examples, see Piatetsky-Shapiro et al. (1996) for industrial applications and Fayyad, Haussler, and Stolorz (1996) for applications in science data analysis. Here, we examine criteria for selecting potential applications, which can be divided into practical and technical categories. The practical criteria for KDD projects are similar to those for other applications of advanced technology and include the potential impact of an application, the absence of simpler alternative solutions, and strong organizational support for using the technology. For applications dealing with personal data, one should also consider the privacy and legal issues (Piatetsky-Shapiro 1995).

The technical criteria include considerations such as the availability of sufficient data (cases). In general, the more fields there are and the more complex the patterns being sought, the more data are needed. However, strong prior knowledge (see discussion later) can reduce the number of needed cases significantly. Another consideration is the relevance of attributes: it is important to have data attributes that are relevant to the discovery task; no amount of data will allow prediction based on attributes that do not capture the required information. Low noise levels (few data errors) are another consideration; high amounts of noise make it hard to identify patterns unless a large number of cases can mitigate random noise and help clarify the aggregate patterns. Changing and time-oriented data, although making application development more difficult, can make the resulting applications much more useful because it is easier to retrain a system than a human. Finally, perhaps one of the most important considerations is prior knowledge: it is useful to know something about the domain, such as what the important fields are, what the likely relationships are, what the user utility function is, what patterns are already known, and so on.

Research and Application Challenges

We outline some of the current primary research and application challenges for KDD. This list is by no means exhaustive and is intended to give the reader a feel for the types of problems that KDD practitioners wrestle with.

Larger databases: Databases with hundreds of fields and tables, millions of records, and multigigabyte sizes are commonplace, and terabyte (10^12 bytes) databases are beginning to appear. Methods for dealing with large data volumes include more efficient algorithms (Agrawal et al. 1996), sampling, approximation, and massively parallel processing (Holsheimer et al. 1996).

High dimensionality: Not only is there often a large number of records in the database, but there can also be a large number of fields (attributes, variables), so the dimensionality of the problem is high. A high-dimensional data set creates problems by increasing the size of the search space for model induction in a combinatorially explosive manner. In addition, it increases the chances that a data-mining algorithm will find spurious patterns that are not valid in general. Approaches to this problem include methods to reduce the effective dimensionality of the problem and the use of prior knowledge to identify irrelevant variables.

Overfitting: When the algorithm searches for the best parameters for one particular model using a limited set of data, it can model not only the general patterns in the data but also any noise specific to the data set, resulting in poor performance of the model on test data. Possible solutions include cross-validation, regularization, and other sophisticated statistical strategies.

Assessing statistical significance: A problem (related to overfitting) occurs when the system is searching over many possible models. For example, if a system tests N models at the 0.001 significance level, then on average, with purely random data, N/1000 of these models will be accepted as significant.
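The overfitting problem described above is easy to demonstrate. The sketch below is our own illustration, not taken from any system discussed in the article; the data are synthetic noise and the seed is arbitrary. It fits a 1-nearest-neighbor classifier, a model that simply memorizes its training data, to labels that carry no signal: it scores perfectly on the data it was fit to, while 5-fold cross-validation reveals that nothing general was learned.

```python
# Overfitting demonstration: a model that memorizes its training data
# (1-nearest neighbor) looks perfect on the data it was fit to, but
# cross-validation shows it has only captured noise.
import random

random.seed(1)
data = [([random.random(), random.random()], random.randint(0, 1))
        for _ in range(60)]  # features carry no information about labels

def predict(train, x):
    """1-nearest-neighbor: return the label of the closest training point."""
    nearest = min(train, key=lambda fx: sum((a - b) ** 2
                                            for a, b in zip(fx[0], x)))
    return nearest[1]

# Resubstitution accuracy: every point is its own nearest neighbor.
train_acc = sum(predict(data, x) == y for x, y in data) / len(data)

# 5-fold cross-validation: evaluate only on held-out points.
folds = [data[i::5] for i in range(5)]
correct = total = 0
for i, held_out in enumerate(folds):
    train = [fx for j, fold in enumerate(folds) if j != i for fx in fold]
    correct += sum(predict(train, x) == y for x, y in held_out)
    total += len(held_out)
cv_acc = correct / total

print(train_acc)  # 1.0: the memorizer is perfect on its own data
print(cv_acc)     # near 0.5: no better than guessing on held-out data
```

Regularization and the other statistical strategies mentioned above play a complementary role: instead of measuring the damage out of sample, they penalize the model's freedom to fit noise in the first place.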
FALL 1996 49
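The N/1000 arithmetic can be checked with a short simulation (again our own illustration; the number of tests and the seed are arbitrary). Every candidate pattern below is pure noise, yet roughly N/1000 of them pass an unadjusted 0.001-level test; dividing the significance level by the number of tests, a Bonferroni-style adjustment, all but eliminates the spurious acceptances.

```python
# Simulation of the multiple-testing problem: evaluate N pure-noise
# "patterns" at the 0.001 significance level. Under the null hypothesis,
# each p-value is uniform on [0, 1], so about N / 1000 spurious patterns
# are accepted; a Bonferroni-style level of 0.001 / N makes that rare.
import random

random.seed(0)
N = 10_000
p_values = [random.random() for _ in range(N)]

accepted = sum(p < 0.001 for p in p_values)
bonferroni_accepted = sum(p < 0.001 / N for p in p_values)

print(accepted)             # close to N / 1000 = 10
print(bonferroni_accepted)  # usually 0
```

Randomization testing, also mentioned in the discussion, attacks the same problem empirically by re-deriving the distribution of the test statistic from shuffled data rather than from an analytic null model.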
This point is frequently missed by many initial attempts at KDD. One way to deal with this problem is to use methods that adjust the test statistic as a function of the search, for example, Bonferroni adjustments for independent tests or randomization testing.

Changing data and knowledge: Rapidly changing (nonstationary) data can make previously discovered patterns invalid. In addition, the variables measured in a given application database can be modified, deleted, or augmented with new measurements over time. Possible solutions include incremental methods for updating the patterns and treating change as an opportunity for discovery by using it to cue the search for patterns of change only (Matheus, Piatetsky-Shapiro, and McNeill 1996). See also Agrawal and Psaila (1995) and Mannila, Toivonen, and Verkamo (1995).

Missing and noisy data: This problem is especially acute in business databases. U.S. census data reportedly have error rates as great as 20 percent in some fields. Important attributes can be missing if the database was not designed with discovery in mind. Possible solutions include more sophisticated statistical strategies to identify hidden variables and dependencies (Heckerman 1996; Smyth et al. 1996).

Complex relationships between fields: Hierarchically structured attributes or values, relations between attributes, and more sophisticated means for representing knowledge about the contents of a database will require algorithms that can effectively use such information. Historically, data-mining algorithms have been developed for simple attribute-value records, although new techniques for deriving relations between variables are being developed (Dzeroski 1996; Djoko, Cook, and Holder 1995).

Understandability of patterns: In many applications, it is important to make the discoveries more understandable by humans. Possible solutions include graphic representations (Buntine 1996; Heckerman 1996), rule structuring, natural language generation, and techniques for visualization of data and knowledge. Rule-refinement strategies (for example, Major and Mangano [1995]) can be used to address a related problem: the discovered knowledge might be implicitly or explicitly redundant.

User interaction and prior knowledge: Many current KDD methods and tools are not truly interactive and cannot easily incorporate prior knowledge about a problem except in simple ways. The use of domain knowledge is important in all the steps of the KDD process. Bayesian approaches (for example, Cheeseman [1990]) use prior probabilities over data and distributions as one form of encoding prior knowledge. Others employ deductive database capabilities to discover knowledge that is then used to guide the data-mining search (for example, Simoudis, Livezey, and Kerber [1995]).

Integration with other systems: A stand-alone discovery system might not be very useful. Typical integration issues include integration with a database management system (for example, through a query interface), integration with spreadsheets and visualization tools, and accommodation of real-time sensor readings. Examples of integrated KDD systems are described by Simoudis, Livezey, and Kerber (1995) and Stolorz et al. (1995).

Concluding Remarks: The Potential Role of AI in KDD

In addition to machine learning, other AI fields can potentially contribute significantly to various aspects of the KDD process. We mention a few examples of these areas here:

Natural language presents significant opportunities for mining in free-form text, especially for automated annotation and indexing prior to classification of text corpora. Limited parsing capabilities can help substantially in the task of deciding what an article refers to. Hence, the spectrum from simple natural language processing all the way to language understanding can help substantially. Natural language processing can also contribute significantly as an effective interface for stating hints to mining algorithms and for visualizing and explaining knowledge derived by a KDD system.

Planning considers a complicated data analysis process. It involves conducting complicated data-access and data-transformation operations; applying preprocessing routines; and, in some cases, paying attention to resource and data-access constraints. Typically, data processing steps are expressed in terms of desired postconditions and preconditions for the application of certain routines, which lends itself easily to representation as a planning problem. In addition, planning ability can play an important role in automated agents (see next item) to collect data samples or conduct a search to obtain needed data sets.

Intelligent agents can be fired off to collect necessary information from a variety of […]
[…] U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 73–95. Menlo Park, Calif.: AAAI Press.

Cheng, B., and Titterington, D. M. 1994. Neural Networks—A Review from a Statistical Perspective. Statistical Science 9(1): 2–30.

Codd, E. F. 1993. Providing OLAP (On-Line Analytical Processing) to User-Analysts: An IT Mandate. E. F. Codd and Associates.

Dasarathy, B. V. 1991. Nearest Neighbor (NN) Norms: NN Pattern Classification Techniques. Washington, D.C.: IEEE Computer Society.

Djoko, S.; Cook, D.; and Holder, L. 1995. Analyzing the Benefits of Domain Knowledge in Substructure Discovery. In Proceedings of KDD-95: First International Conference on Knowledge Discovery and Data Mining, 75–80. Menlo Park, Calif.: American Association for Artificial Intelligence.

Dzeroski, S. 1996. Inductive Logic Programming for Knowledge Discovery in Databases. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 59–82. Menlo Park, Calif.: AAAI Press.

Elder, J., and Pregibon, D. 1996. A Statistical Perspective on KDD. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 83–116. Menlo Park, Calif.: AAAI Press.

Etzioni, O. 1996. The World Wide Web: Quagmire or Gold Mine? Communications of the ACM (Special Issue on Data Mining), November 1996. Forthcoming.

Fayyad, U. M.; Djorgovski, S. G.; and Weir, N. 1996. From Digitized Images to On-Line Catalogs: Data Mining a Sky Survey. AI Magazine 17(2): 51–66.

Fayyad, U. M.; Haussler, D.; and Stolorz, P. 1996. KDD for Science Data Analysis: Issues and Examples. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), 50–56. Menlo Park, Calif.: American Association for Artificial Intelligence.

Fayyad, U. M.; Piatetsky-Shapiro, G.; and Smyth, P. 1996. From Data Mining to Knowledge Discovery: An Overview. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1–30. Menlo Park, Calif.: AAAI Press.

Fayyad, U. M.; Piatetsky-Shapiro, G.; Smyth, P.; and Uthurusamy, R., eds. 1996. Advances in Knowledge Discovery and Data Mining. Menlo Park, Calif.: AAAI Press.

Friedman, J. H. 1991. Multivariate Adaptive Regression Splines. Annals of Statistics 19:1–141.

Geman, S.; Bienenstock, E.; and Doursat, R. 1992. Neural Networks and the Bias/Variance Dilemma. Neural Computation 4:1–58.

Glymour, C.; Madigan, D.; Pregibon, D.; and Smyth, P. 1996. Statistics and Data Mining. Communications of the ACM (Special Issue on Data Mining), November 1996. Forthcoming.

Glymour, C.; Scheines, R.; Spirtes, P.; and Kelly, K. 1987. Discovering Causal Structure. New York: Academic.

Guyon, I.; Matic, N.; and Vapnik, V. 1996. Discovering Informative Patterns and Data Cleaning. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 181–204. Menlo Park, Calif.: AAAI Press.

Hall, J.; Mani, G.; and Barr, D. 1996. Applying Computational Intelligence to the Investment Process. In Proceedings of CIFER-96: Computational Intelligence in Financial Engineering. Washington, D.C.: IEEE Computer Society.

Hand, D. J. 1994. Deconstructing Statistical Questions. Journal of the Royal Statistical Society A 157(3): 317–356.

Hand, D. J. 1981. Discrimination and Classification. Chichester, U.K.: Wiley.

Heckerman, D. 1996. Bayesian Networks for Knowledge Discovery. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 273–306. Menlo Park, Calif.: AAAI Press.

Hernandez, M., and Stolfo, S. 1995. The MERGE-PURGE Problem for Large Databases. In Proceedings of the 1995 ACM-SIGMOD Conference, 127–138. New York: Association for Computing Machinery.

Holsheimer, M.; Kersten, M. L.; Mannila, H.; and Toivonen, H. 1996. Data Surveyor: Searching the Nuggets in Parallel. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 447–471. Menlo Park, Calif.: AAAI Press.

Horvitz, E., and Jensen, F., eds. 1996. Proceedings of the Twelfth Conference on Uncertainty in Artificial Intelligence. San Mateo, Calif.: Morgan Kaufmann.

Jain, A. K., and Dubes, R. C. 1988. Algorithms for Clustering Data. Englewood Cliffs, N.J.: Prentice-Hall.

Kloesgen, W. 1996. A Multipattern and Multistrategy Discovery Assistant. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 249–271. Menlo Park, Calif.: AAAI Press.

Kloesgen, W., and Zytkow, J. 1996. Knowledge Discovery in Databases Terminology. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 569–588. Menlo Park, Calif.: AAAI Press.

Kolodner, J. 1993. Case-Based Reasoning. San Francisco, Calif.: Morgan Kaufmann.

Langley, P., and Simon, H. A. 1995. Applications of Machine Learning and Rule Induction. Communications of the ACM 38:55–64.

Major, J., and Mangano, J. 1995. Selecting among Rules Induced from a Hurricane Database. Journal of Intelligent Information Systems 4(1): 39–52.

Manago, M., and Auriol, M. 1996. Mining for OR. ORMS Today (Special Issue on Data Mining), February, 28–32.

Mannila, H.; Toivonen, H.; and Verkamo, A. I. 1995. Discovering Frequent Episodes in Sequences. In Proceedings of the First International Conference on Knowledge Discovery and Data Mining (KDD-95), 210–215. Menlo Park, Calif.: American
Association for Artificial Intelligence.

Matheus, C.; Piatetsky-Shapiro, G.; and McNeill, D. 1996. Selecting and Reporting What Is Interesting: The KEFIR Application to Healthcare Data. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 495–516. Menlo Park, Calif.: AAAI Press.

Pearl, J. 1988. Probabilistic Reasoning in Intelligent Systems. San Francisco, Calif.: Morgan Kaufmann.

Piatetsky-Shapiro, G. 1995. Knowledge Discovery in Personal Data versus Privacy—A Mini-Symposium. IEEE Expert 10(5).

Piatetsky-Shapiro, G. 1991. Knowledge Discovery in Real Databases: A Report on the IJCAI-89 Workshop. AI Magazine 11(5): 68–70.

Piatetsky-Shapiro, G., and Matheus, C. 1994. The Interestingness of Deviations. In Proceedings of KDD-94, eds. U. M. Fayyad and R. Uthurusamy. Technical Report WS-03. Menlo Park, Calif.: AAAI Press.

Piatetsky-Shapiro, G.; Brachman, R.; Khabaza, T.; Kloesgen, W.; and Simoudis, E. 1996. An Overview of Issues in Developing Industrial Data Mining and Knowledge Discovery Applications. In Proceedings of the Second International Conference on Knowledge Discovery and Data Mining (KDD-96), eds. J. Han and E. Simoudis, 89–95. Menlo Park, Calif.: American Association for Artificial Intelligence.

Quinlan, J. 1992. C4.5: Programs for Machine Learning. San Francisco, Calif.: Morgan Kaufmann.

Ripley, B. D. 1994. Neural Networks and Related Methods for Classification. Journal of the Royal Statistical Society B 56(3): 409–437.

Senator, T.; Goldberg, H. G.; Wooton, J.; Cottini, M. A.; Umarkhan, A. F.; Klinger, C. D.; Llamas, W. M.; Marrone, M. P.; and Wong, R. W. H. 1995. The Financial Crimes Enforcement Network AI System (FAIS): Identifying Potential Money Laundering from Reports of Large Cash Transactions. AI Magazine 16(4): 21–39.

Shrager, J., and Langley, P., eds. 1990. Computational Models of Scientific Discovery and Theory Formation. San Francisco, Calif.: Morgan Kaufmann.

Silberschatz, A., and Tuzhilin, A. 1995. On Subjective Measures of Interestingness in Knowledge Discovery. In Proceedings of KDD-95: First International Conference on Knowledge Discovery and Data Mining, 275–281. Menlo Park, Calif.: American Association for Artificial Intelligence.

Silverman, B. 1986. Density Estimation for Statistics and Data Analysis. New York: Chapman and Hall.

Simoudis, E.; Livezey, B.; and Kerber, R. 1995. Using RECON for Data Cleaning. In Proceedings of KDD-95: First International Conference on Knowledge Discovery and Data Mining, 275–281. Menlo Park, Calif.: American Association for Artificial Intelligence.

Smyth, P.; Burl, M.; Fayyad, U.; and Perona, P. 1996. Modeling Subjective Uncertainty in Image Annotation. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 517–540. Menlo Park, Calif.: AAAI Press.

Spirtes, P.; Glymour, C.; and Scheines, R. 1993. Causation, Prediction, and Search. New York: Springer-Verlag.

Stolorz, P.; Nakamura, H.; Mesrobian, E.; Muntz, R.; Shek, E.; Santos, J.; Yi, J.; Ng, K.; Chien, S.; Mechoso, C.; and Farrara, J. 1995. Fast Spatio-Temporal Data Mining of Large Geophysical Datasets. In Proceedings of KDD-95: First International Conference on Knowledge Discovery and Data Mining, 300–305. Menlo Park, Calif.: American Association for Artificial Intelligence.

Titterington, D. M.; Smith, A. F. M.; and Makov, U. E. 1985. Statistical Analysis of Finite-Mixture Distributions. Chichester, U.K.: Wiley.

U.S. News. 1995. Basketball's New High-Tech Guru: IBM Software Is Changing Coaches' Game Plans. U.S. News and World Report, 11 December.

Weigend, A., and Gershenfeld, N., eds. 1993. Predicting the Future and Understanding the Past. Redwood City, Calif.: Addison-Wesley.

Weiss, S. I., and Kulikowski, C. 1991. Computer Systems That Learn: Classification and Prediction Methods from Statistics, Neural Networks, Machine Learning, and Expert Systems. San Francisco, Calif.: Morgan Kaufmann.

Whittaker, J. 1990. Graphical Models in Applied Multivariate Statistics. New York: Wiley.

Zembowicz, R., and Zytkow, J. 1996. From Contingency Tables to Various Forms of Knowledge in Databases. In Advances in Knowledge Discovery and Data Mining, eds. U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 329–351. Menlo Park, Calif.: AAAI Press.

Usama Fayyad is a senior researcher at Microsoft Research. He received his Ph.D. in 1991 from the University of Michigan at Ann Arbor. Prior to joining Microsoft in 1996, he headed the Machine Learning Systems Group at the Jet Propulsion Laboratory (JPL), California Institute of Technology, where he developed data-mining systems for automated science data analysis. He remains affiliated with JPL as a distinguished visiting scientist. Fayyad received the JPL 1993 Lew Allen Award for Excellence in Research and the 1994 National Aeronautics and Space Administration Exceptional Achievement Medal. His research interests include knowledge discovery in large databases, data mining, machine-learning theory and applications, statistical pattern recognition, and clustering. He was program cochair of KDD-94 and KDD-95 (the First International Conference on Knowledge Discovery and Data Mining). He is general chair of KDD-96, an editor in chief of the journal Data Mining and Knowledge Discovery, and coeditor of the 1996 AAAI Press book Advances in Knowledge Discovery and Data Mining.