
DATA MINING

Data mining, also known as knowledge discovery in databases (KDD), is the practice of automatically searching large stores of data for patterns. To do this, data mining uses computational techniques from statistics and pattern recognition.

Definition
Data mining has been defined as "the nontrivial extraction of implicit, previously unknown, and potentially useful information from data" and as "the science of extracting useful information from large data sets or databases". Although it is usually applied to the analysis of data, data mining, like artificial intelligence, is an umbrella term used with varied meanings in a wide range of contexts. It is usually associated with a business or organization's need to identify trends.

A simple example of data mining is its use in a retail sales department. If a store tracks
the purchases of a customer and notices that a customer buys a lot of silk shirts, the data
mining system will make a correlation between that customer and silk shirts. The sales
department will look at that information and may begin direct mail marketing of silk
shirts to that customer, or it may alternatively attempt to get the customer to buy a wider
range of products. In this case, the data mining system used by the retail store discovered
new information about the customer that was previously unknown to the company.
Another widely used (though hypothetical) example is that of a very large North
American chain of supermarkets. Through intensive analysis of the transactions and the
goods bought over a period of time, analysts found that beers and diapers were often
bought together. Though explaining this interrelation might be difficult, taking advantage
of it, on the other hand, should not be hard (e.g. placing the high-profit diapers next to the
high-profit beers). This technique is often referred to as "Market Basket Analysis".
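
As a minimal sketch of this kind of market basket analysis (the transactions and item names are invented for illustration), the following Python snippet counts item co-occurrences and reports the support and confidence of the hypothetical rule "beer → diapers":

    from itertools import combinations
    from collections import Counter

    # Hypothetical transaction log: each set is one customer's basket.
    transactions = [
        {"beer", "diapers", "chips"},
        {"beer", "diapers"},
        {"milk", "bread"},
        {"beer", "chips"},
        {"diapers", "beer", "milk"},
    ]

    item_counts, pair_counts = Counter(), Counter()
    for basket in transactions:
        item_counts.update(basket)
        pair_counts.update(combinations(sorted(basket), 2))

    n = len(transactions)
    support = pair_counts[("beer", "diapers")] / n                       # P(beer and diapers)
    confidence = pair_counts[("beer", "diapers")] / item_counts["beer"]  # P(diapers | beer)
    print(f"support = {support:.2f}, confidence = {confidence:.2f}")

On this toy data the rule holds with support 0.60 and confidence 0.75; a real system would enumerate all rules above chosen support and confidence thresholds.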

In statistical analyses in which there is no underlying theoretical model, data mining is often approximated via stepwise regression methods, wherein the space of 2^k possible relationships between a single outcome variable and k potential explanatory variables is searched intelligently. With the advent of grid computing, it became possible (when k is less than approximately 40) to examine all 2^k models. This procedure is called all-subsets or exhaustive regression. Some of the first applications of exhaustive regression involved the study of clinical data.
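
As an illustration of exhaustive regression, the sketch below (synthetic data; the AIC-style score is an illustrative choice, not a prescribed part of the method) enumerates all 2^k subsets of k candidate predictors, fits ordinary least squares to each, and keeps the best-scoring subset:

    from itertools import combinations
    import numpy as np

    rng = np.random.default_rng(0)
    n, k = 200, 6                          # observations, candidate predictors
    X = rng.normal(size=(n, k))
    y = 2.0 * X[:, 0] - 1.5 * X[:, 3] + rng.normal(size=n)  # truth uses x0, x3

    best = None
    for r in range(k + 1):                 # every subset size, 0..k
        for subset in combinations(range(k), r):
            # Design matrix: intercept column plus the chosen predictors.
            A = np.column_stack([np.ones(n)] + [X[:, j] for j in subset])
            coef, _, _, _ = np.linalg.lstsq(A, y, rcond=None)
            rss = float(np.sum((y - A @ coef) ** 2))
            # AIC-style penalty so larger models are not automatically favoured.
            score = n * np.log(rss / n) + 2 * (len(subset) + 1)
            if best is None or score < best[0]:
                best = (score, subset)

    print("best subset of predictors:", best[1])

With k = 6 this loop fits only 64 models; the point of the k < 40 bound above is that 2^40 is roughly 10^12 models, which is where even exhaustive search on a grid runs out of room.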
Data dredging
Used in the technical context of data warehousing and analysis, the term "data mining" is
neutral. However, it sometimes has a more pejorative usage that implies imposing
patterns (and particularly causal relationships) on data where none exist. This imposition
of irrelevant, misleading or trivial attribute correlation is more properly criticized as "data
dredging" in the statistical literature. Another term for this misuse of statistics is data
fishing.

Used in this latter sense, data dredging means scanning the data for any relationships and then, when one is found, coming up with an interesting explanation. (This is also referred to as "overfitting the model".) The problem is that large data sets invariably contain some striking relationships peculiar to that data, so any conclusions reached are likely to be highly suspect. In spite of this, some exploratory data work is always required in any applied statistical analysis to get a feel for the data, so the line between good statistical practice and data dredging is sometimes less than clear. The common approach in data mining to overcoming the problem of overfitting is to separate the data into two or three separate data sets (called the training set, validation set, and testing set). The model is built using the training and validation sets and is then tested using the testing set; the procedure can be repeated many times by resampling the data sets, in order to be more certain that a real pattern has been found and that the model is not merely capitalizing on chance (i.e. overfitting).
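
A minimal sketch of that holdout procedure on synthetic data (the 60/20/20 split and the toy "model family" of single-feature sign rules are arbitrary illustrative choices):

    import numpy as np

    rng = np.random.default_rng(42)
    X = rng.normal(size=(1000, 5))
    y = (X[:, 0] + rng.normal(scale=0.5, size=1000) > 0).astype(int)

    # Shuffle once, then carve out training / validation / test partitions.
    idx = rng.permutation(len(X))
    train, val, test = np.split(idx, [600, 800])       # 60% / 20% / 20%
    X_val, y_val = X[val], y[val]
    X_test, y_test = X[test], y[test]                  # held back until the end

    # Toy "model family": predict 1 when one chosen feature is positive.
    # (A real learner would be fitted on X[train], y[train] here.)
    def predict(X, feature):
        return (X[:, feature] > 0).astype(int)

    # Model selection uses the validation set only.
    val_acc = [np.mean(predict(X_val, f) == y_val) for f in range(X.shape[1])]
    best_f = int(np.argmax(val_acc))

    # One final estimate of real performance on untouched data.
    print("test accuracy:", np.mean(predict(X_test, best_f) == y_test))

Repeating the shuffle-and-split with different random seeds is the resampling mentioned above: a pattern that survives many splits is less likely to be an artifact of one particular partition.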

A more significant danger is finding correlations that do not really exist. Investment analysts appear to be particularly vulnerable to this: "There have always been a considerable number of pathetic people who busy themselves examining the last thousand numbers which have appeared on a roulette wheel, in search of some repeating pattern. Sadly enough, they have usually found it." However, when properly done, determining correlations in investment analysis has proven very profitable for statistical arbitrage operations (such as pairs trading strategies), and correlation analysis has also shown itself to be very useful in risk management. Indeed, finding correlations in the financial markets, when done properly, is not the same as finding false patterns in roulette wheels.
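
For illustration only, here is a short sketch with synthetic price series (a real pairs trading desk would use far more data and typically test for cointegration rather than mere correlation). Note that it correlates daily returns, not raw prices, since two trending prices look spuriously correlated even when the assets are unrelated:

    import numpy as np

    rng = np.random.default_rng(7)
    common = rng.normal(scale=0.01, size=250)     # shared market factor

    # Two synthetic assets driven mostly by the same factor.
    ret_a = common + rng.normal(scale=0.004, size=250)
    ret_b = common + rng.normal(scale=0.004, size=250)
    price_a = 100 * np.cumprod(1 + ret_a)
    price_b = 100 * np.cumprod(1 + ret_b)

    corr = np.corrcoef(ret_a, ret_b)[0, 1]        # co-movement of returns
    spread = np.log(price_a) - np.log(price_b)    # what a pairs trader watches
    print(f"return correlation: {corr:.2f}, spread std: {spread.std():.4f}")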

Most data mining efforts are focused on developing fine-grained, highly detailed models of some large data set. Other researchers have described an alternative approach that involves finding the minimal differences between elements in a data set, with the goal of developing simpler models that represent the relevant data.

Privacy concerns
There are also privacy concerns associated with data mining. For example, an employer with access to medical records might screen out applicants who have diabetes or have had a heart attack. Screening out such employees would cut insurance costs, but it creates ethical and legal problems.
Data mining of government or commercial data sets for national security or law enforcement purposes has also raised privacy concerns.

There are many legitimate uses of data mining. For example, a database of prescription drugs taken by a group of people could be used to find combinations of drugs with adverse reactions. Since such a combination may occur in only 1 out of 1000 people, a single case may not be apparent. A project involving many pharmacies could reduce the number of adverse drug reactions and potentially save lives. Unfortunately, there is also a huge potential for abuse of such a database.
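
A sketch of how pooling records might surface such a rare signal (all records, drug names, and thresholds here are invented): count how often each drug pair co-occurs with an adverse reaction and flag pairs whose reaction rate stands out:

    from itertools import combinations
    from collections import Counter

    # Hypothetical pooled records: (drugs taken, adverse reaction observed?).
    records = [
        ({"drugA", "drugB"}, True),
        ({"drugA", "drugC"}, False),
        ({"drugA", "drugB", "drugC"}, True),
        ({"drugB", "drugC"}, False),
        ({"drugA", "drugB"}, True),
        ({"drugC"}, False),
    ]

    pair_total, pair_adverse = Counter(), Counter()
    for drugs, adverse in records:
        for pair in combinations(sorted(drugs), 2):
            pair_total[pair] += 1
            if adverse:
                pair_adverse[pair] += 1

    for pair, total in pair_total.items():
        rate = pair_adverse[pair] / total
        if total >= 3 and rate > 0.5:             # arbitrary sketch thresholds
            print(pair, f"adverse reaction rate {rate:.0%} over {total} records")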

Essentially, data mining gives information that would not be available otherwise. It must
be properly interpreted to be useful. When the data collected involves individual people,
there are many questions concerning privacy, legality, and ethics.

Combinatorial game data mining


• Data mining from combinatorial game oracles:

Since the early 1990s, with the availability of oracles for certain combinatorial games, also called tablebases (e.g. for 3x3 chess with any beginning configuration, small-board dots-and-boxes, small-board hex, and certain endgames in chess, dots-and-boxes, and hex), a new area for data mining has opened up: the extraction of human-usable strategies from these oracles. This is pattern recognition at too high a level of abstraction for known statistical pattern recognition algorithms, or any other algorithmic approach, to be applied; at least, no one knows how to do it yet (as of January 2005). The method used is the full force of the scientific method: extensive experimentation with the tablebases, combined with intensive study of tablebase answers to well-designed problems and with knowledge of prior art (i.e. pre-tablebase knowledge), leading to flashes of insight. Berlekamp in dots-and-boxes and John Nunn in chess endgames are notable examples of people doing this work, though they were not and are not involved in tablebase generation.

Data Mining Improves Decision Making

Data mining uncovers patterns in data using predictive techniques. These patterns play a
critical role in decision making because they reveal areas for process improvement. Using
data mining, organizations can increase the profitability of their interactions with
customers, detect fraud, and improve risk management. The patterns uncovered using
data mining help organizations make better and timelier decisions.
Relational data mining
Relational data mining is a data mining technique for relational databases. Unlike traditional data mining algorithms, which look for patterns in a single table (propositional patterns), relational data mining algorithms look for patterns among multiple tables (relational patterns). For most types of propositional patterns, there are corresponding relational patterns; for example, there are relational classification rules, relational regression trees, relational association rules, and so on.

The most important theoretical foundation of relational data mining is inductive logic
programming.
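
To make the propositional/relational distinction concrete, here is a toy sketch (the tables and attribute names are invented): the pattern below refers to a customer row together with rows of a second, linked orders table, which no rule over a single flat table could express:

    # Toy relational data: two linked tables instead of one flat table.
    customers = [
        {"id": 1, "city": "Oslo"},
        {"id": 2, "city": "Bergen"},
        {"id": 3, "city": "Oslo"},
    ]
    orders = [
        {"customer_id": 1, "product": "skis"},
        {"customer_id": 1, "product": "wax"},
        {"customer_id": 3, "product": "skis"},
    ]

    # Relational pattern: "lives in Oslo AND has at least one order for skis".
    def matches(customer):
        return customer["city"] == "Oslo" and any(
            o["customer_id"] == customer["id"] and o["product"] == "skis"
            for o in orders
        )

    print([c["id"] for c in customers if matches(c)])   # -> [1, 3]

An inductive logic programming system searches for clauses of exactly this shape, with the join between the two tables expressed as a shared logical variable.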

DATA MINING FUTURE

Data mining is the analysis of large data sets to discover patterns of interest. It has come a long way from its early academic beginnings in the late seventies. Many of the early data mining software packages were based on a single algorithm.

Until the mid-nineties, data mining required considerable specialized knowledge and was mainly restricted to statisticians. Customer Relationship Management (CRM) software has played a great part in popularizing data mining among corporate users. Data mining in CRM systems is often hidden from end users: the algorithms are packaged behind business functionality such as churn analysis, the process of predicting which customers are most likely to defect to a competitor.
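
A minimal churn-analysis sketch (the features, labels, and choice of logistic regression are illustrative assumptions; a real CRM would draw on much richer behavioural data): fit a classifier on past customers labelled churned or retained, then rank current customers by predicted churn risk:

    import numpy as np
    from sklearn.linear_model import LogisticRegression

    rng = np.random.default_rng(1)
    n = 500
    # Invented features: months as a customer, support calls, monthly spend.
    X = np.column_stack([
        rng.integers(1, 60, n),
        rng.poisson(2, n),
        rng.normal(50, 15, n),
    ])
    # Synthetic ground truth: short tenure and many support calls raise churn.
    logits = -0.05 * X[:, 0] + 0.6 * X[:, 1] - 1.0
    y = (rng.random(n) < 1 / (1 + np.exp(-logits))).astype(int)

    model = LogisticRegression(max_iter=1000).fit(X, y)

    # Rank customers by predicted churn risk; the top of the list is who a
    # retention campaign would target first.
    risk = model.predict_proba(X)[:, 1]
    print("highest-risk customers:", np.argsort(risk)[::-1][:5])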

Data mining algorithms are now freely available, and database vendors have started to incorporate data mining modules. Developers can access data mining via open standards such as OLE DB for Data Mining on SQL Server 2000, so data mining functionality can now be added directly to application source code.

The complexity of data mining must be hidden from end users before it can truly take center stage in an organization. Business use cases can be designed, with tight constraints, around data mining algorithms.
