Discovering New Knowledge - Data Mining

Advantage of Data Mining
- The knowledge elicited from experts was already known by humans.
- However, there are considerable limitations to human knowledge.
- Enormous "gold mines" of knowledge are buried in databases.
- Therefore, knowing, understanding, and using this knowledge can give an organization a significant competitive advantage.
Example
- The knowledge about how to design better and more efficient products, based on data about:
  - product performance
  - customer preference
- The knowledge about how to better serve existing customers, based upon information about:
  - their buying habits
  - the problems they typically face
Chapter Objectives
- Introduce the student to the concept of Data Mining.
  - How it is different from knowledge elicitation from experts
  - How it is different from extracting existing knowledge from databases
- The objectives of data mining
  - Explanation of past events (descriptive DM)
  - Prediction of future events (predictive DM)
- (continued on the next slide)
Objectives (continued)
- Introduce the student to the different classes of methods available for DM
  - Symbolic (induction)
  - Connectionist (neural networks)
  - Statistical
- Introduce the student to the details of some of the methods described in the chapter.
Historical Perspective
- DM emerged at the intersection of three independently evolved research directions:
  - Classical statistics and statistical pattern recognition
  - Machine learning (from symbolic AI)
  - Neural networks
Objectives of Data Mining
- Descriptive DM seeks patterns in past actions or activities in order to affect these actions or activities
  - e.g., seek patterns indicative of fraud in past records
- Predictive DM looks at past history to predict future behavior
  - Classification: classifies a new instance into one of a set of discrete, predefined categories
  - Clustering: groups items in the data set into different categories
  - Affinity or association: finds items closely associated in the data set
Induction of symbolic rules
- Induction is "reasoning from particular facts or individual cases to a general conclusion."
- Rule induction by learning decision trees is the symbolic approach to data mining.
- Main algorithms for rule induction:
  - C5.0 and its ancestors, ID3 and CLS (from machine learning)
  - CART (Classification And Regression Trees) and CHAID, very similar algorithms for rule induction (independently developed in statistics)
Table 12.1 – decision tables (if ordered, then decision lists)
Features (attributes and values):

Name          | Outlook | Temperature | Humidity | Class
Data sample 1 | Sunny   | Mild        | Dry      | Enjoyable
Data sample 2 | Cloudy  | Cold        | Humid    | Not Enjoyable
Data sample 3 | Rainy   | Mild        | Humid    | Not Enjoyable
Data sample 4 | Sunny   | Hot         | Humid    | Not Enjoyable
Figure 12.1 – decision trees (a.k.a. classification trees)
[Figure: a decision tree for a stock-buying decision. The root node asks "Is the stock's price/earnings ratio > 5?"; subsequent nodes ask "Has the company's quarterly profit increased over the last year by 10% or more?" and "Is the company's management stable?"; the leaf nodes give the decisions Buy and Don't buy.]
Induction trees
- An induction tree is a decision tree holding the data samples (of the training set).
- It is built progressively, by gradually segregating the data samples.
Induction tree: nodes and links
- Nodes: elements that contain information
- Links (edges or arcs): connect nodes
- An induction tree is a hierarchical, acyclic network of nodes and links.
[Diagram: a parent node connected by a link to a child node (labeled "Rain").]
Figure 12.2 – simple induction tree (step 1)
[Figure: the root node holds all training samples {DS1, DS2, DS3, DS4} from Table 12.1 and is split on Outlook: Cloudy -> DS2 (Not Enjoyable); Rainy -> DS3 (Not Enjoyable); Sunny -> DS1 (Enjoyable) and DS4 (Not Enjoyable).]
Figure 12.3 – simple induction tree (step 2)
[Figure: the Sunny branch of the step-1 tree is split further on Temperature: Cold -> no samples; Mild -> DS1 (Enjoyable); Hot -> DS4 (Not Enjoyable). The Cloudy and Rainy branches remain as in step 1.]
Writing the induced tree as rules
- Rule 1. If the Outlook is cloudy, then the Weather is not enjoyable.
- Rule 2. If the Outlook is rainy, then the Weather is not enjoyable.
- Rule 3. If the Outlook is sunny and the Temperature is mild, then the Weather is enjoyable.
- Rule 4. If the Outlook is sunny and the Temperature is hot, then the Weather is not enjoyable.
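As an illustration of how these induced rules could be represented and applied, here is a minimal sketch (not from the text; the rule encoding and the classify function are illustrative):

```python
# Minimal sketch: the four induced rules as condition/conclusion pairs,
# applied to a new sample (illustrative encoding, not from the text).
rules = [
    ({"Outlook": "Cloudy"},                        "Not Enjoyable"),  # Rule 1
    ({"Outlook": "Rainy"},                         "Not Enjoyable"),  # Rule 2
    ({"Outlook": "Sunny", "Temperature": "Mild"},  "Enjoyable"),      # Rule 3
    ({"Outlook": "Sunny", "Temperature": "Hot"},   "Not Enjoyable"),  # Rule 4
]

def classify(sample, rules):
    """Return the class of the first rule whose conditions all match the sample."""
    for conditions, conclusion in rules:
        if all(sample.get(attr) == value for attr, value in conditions.items()):
            return conclusion
    return None  # no rule covers this sample

print(classify({"Outlook": "Sunny", "Temperature": "Mild", "Humidity": "Humid"}, rules))
# -> Enjoyable
```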
Quiz: Develop an induction tree starting with the feature "Humidity"

Name | Outlook | Temperature | Humidity | Class
DS1  | Sunny   | Mild        | Dry      | Enjoyable
DS2  | Cloudy  | Cold        | Humid    | Not Enjoyable
DS3  | Rainy   | Mild        | Humid    | Not Enjoyable
DS4  | Sunny   | Hot         | Humid    | Not Enjoyable

(Start with the root node {DS1, DS2, DS3, DS4} and split on Humidity.)
Learning decision trees for classification into multiple classes
- In the previous example, we were learning a function to predict a boolean (enjoyable = true/false) output.
- The same approach can be generalized to learn a function that predicts a class (when there are multiple predefined classes/categories).
- For example, suppose we are attempting to select a KBS shell for some application:
  - with the following as our (hypothetical) options:
    - ThoughtGen, OffSite, Genie, SilverWorks, XS, MilliExpert
  - using the following attributes and ranges of values:
    - Development language: { Java, C++, Lisp }
    - Reasoning method: { forward, backward }
    - External interfaces: { dBase, spreadsheetXL, ASCII file, devices }
    - Cost: any positive number
    - Memory: any positive number
Table 12.2 – collection of data samples (training set) described as vectors of attributes (feature vectors)

Language | Reasoning method | Interface method | Cost  | Memory | Classification
Java     | Backward         | SpreadsheetXL    | 250   | 128MB  | MilliExpert
Java     | Backward         | ASCII            | 250   | 128MB  | MilliExpert
Java     | Backward         | dBase            | 195   | 256MB  | ThoughtGen
Java     | *                | Devices          | 985   | 512MB  | OffSite
C++      | Forward          | *                | 6500  | 640MB  | Genie
LISP     | Forward          | *                | 15000 | 5GB    | SilverWorks
C++      | Backward         | *                | 395   | 256MB  | XS
LISP     | Backward         | *                | 395   | 256MB  | XS
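To make the feature-vector representation concrete, the following minimal sketch encodes the Table 12.2 samples as attribute-value mappings (the dictionary encoding and the grouping code are illustrative, not from the text); grouping the samples by Language reproduces the split of Figure 12.4.

```python
# Minimal sketch: the Table 12.2 training set as feature vectors
# (each sample is a mapping from attribute name to value).
training_set = [
    {"Language": "Java", "Reasoning": "Backward", "Interface": "SpreadsheetXL",
     "Cost": 250,   "Memory": "128MB", "Class": "MilliExpert"},
    {"Language": "Java", "Reasoning": "Backward", "Interface": "ASCII",
     "Cost": 250,   "Memory": "128MB", "Class": "MilliExpert"},
    {"Language": "Java", "Reasoning": "Backward", "Interface": "dBase",
     "Cost": 195,   "Memory": "256MB", "Class": "ThoughtGen"},
    {"Language": "Java", "Reasoning": "*",        "Interface": "Devices",
     "Cost": 985,   "Memory": "512MB", "Class": "OffSite"},
    {"Language": "C++",  "Reasoning": "Forward",  "Interface": "*",
     "Cost": 6500,  "Memory": "640MB", "Class": "Genie"},
    {"Language": "LISP", "Reasoning": "Forward",  "Interface": "*",
     "Cost": 15000, "Memory": "5GB",   "Class": "SilverWorks"},
    {"Language": "C++",  "Reasoning": "Backward", "Interface": "*",
     "Cost": 395,   "Memory": "256MB", "Class": "XS"},
    {"Language": "LISP", "Reasoning": "Backward", "Interface": "*",
     "Cost": 395,   "Memory": "256MB", "Class": "XS"},
]

# Grouping by the Language attribute reproduces the first split of Figure 12.4.
for lang in ("Java", "C++", "LISP"):
    classes = sorted({s["Class"] for s in training_set if s["Language"] == lang})
    print(lang, "->", classes)
```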
Figure 12.4 – decision tree resulting from selection of the Language attribute
[Figure: Java -> {MilliExpert, ThoughtGen, OffSite}; C++ -> {Genie, XS}; Lisp -> {SilverWorks, XS}.]
Figure 12.5 – decision tree resulting from addition of the Reasoning method attribute
[Figure: under Java, Backward -> {MilliExpert, ThoughtGen, OffSite} and Forward -> {OffSite}; under C++, Backward -> {XS} and Forward -> {Genie}; under Lisp, Backward -> {XS} and Forward -> {SilverWorks}.]
Figure 12.6 – final decision tree
[Figure: as in Figure 12.5, with the Java/Backward branch split further on the Interface method: SpreadsheetXL -> MilliExpert; ASCII -> MilliExpert; dBase -> ThoughtGen; Devices -> OffSite.]
Order of choosing attributes
- Note that the decision tree that is built depends greatly on which attributes you choose first.
Figure 12.7
[Figure: the induction tree built by splitting on Humidity first: the root {DS1, DS2, DS3, DS4} splits into Humid -> DS2, DS3, DS4 (Not Enjoyable) and Dry -> DS1 (Enjoyable).]
Order of choosing attributes (cont.)
- One sensible objective is to seek the minimal tree, i.e., the smallest tree required to classify all training-set samples correctly.
  - Occam's Razor principle: the simplest explanation is the best.
- In what order should you choose attributes so as to obtain the minimal tree?
  - An exhaustive search is often too complex to be feasible.
  - Heuristics are used instead.
  - Information gain, computed using information-theoretic quantities, is the best approach in practice (see the sketch below).
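As a concrete illustration, here is a minimal sketch (not from the text; the helper names are illustrative) of choosing a split attribute by information gain on the Table 12.1 samples. Note that on this tiny data set the gain criterion would actually pick Humidity first, which yields the two-leaf tree of Figure 12.7 rather than the Outlook-first tree of Figures 12.2 and 12.3.

```python
# Minimal sketch of choosing a split attribute by information gain
# (ID3-style, applied to the Table 12.1 samples; helper names are illustrative).
from collections import Counter
from math import log2

samples = [
    {"Outlook": "Sunny",  "Temperature": "Mild", "Humidity": "Dry",   "Class": "Enjoyable"},
    {"Outlook": "Cloudy", "Temperature": "Cold", "Humidity": "Humid", "Class": "Not Enjoyable"},
    {"Outlook": "Rainy",  "Temperature": "Mild", "Humidity": "Humid", "Class": "Not Enjoyable"},
    {"Outlook": "Sunny",  "Temperature": "Hot",  "Humidity": "Humid", "Class": "Not Enjoyable"},
]

def entropy(rows):
    """Shannon entropy of the class labels in a set of samples."""
    counts = Counter(r["Class"] for r in rows)
    return -sum((n / len(rows)) * log2(n / len(rows)) for n in counts.values())

def information_gain(rows, attr):
    """Entropy before the split minus the weighted entropy after it."""
    after = 0.0
    for value in {r[attr] for r in rows}:
        subset = [r for r in rows if r[attr] == value]
        after += (len(subset) / len(rows)) * entropy(subset)
    return entropy(rows) - after

for attr in ("Outlook", "Temperature", "Humidity"):
    print(attr, round(information_gain(samples, attr), 3))
# Humidity has the highest gain (about 0.811 bits), so it would be chosen as the
# root, giving the two-leaf tree of Figure 12.7; Outlook gives a larger tree.
```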
Artificial Neural Networks
Objective of this section
- Provide a detailed description of the connectionist approach to data mining: neural networks
- Present the basic neural network architecture: the multi-layer feed-forward neural network
- Present the main supervised learning algorithm: backpropagation
- Present the main unsupervised neural network architecture: the Kohonen network
Figure 12.8 – simple model of a neuron
[Figure: inputs x1, x2, ..., xk enter the neuron, each multiplied by a weight w1, w2, ..., wk; a summation junction (Σ) adds the weighted inputs and passes the result to an activation function f(), which produces the output y.]
Neuron as the basic element
- The inputs enter a neuron either from external sources or, in multi-layered networks (defined later), from the outputs of neurons in the previous layer.
- The weights are values that signify the strength of the connection.
- The inputs to each neuron are multiplied by their respective weights to form the input to the summing junction.
- The summing junction adds all the inputs into the neuron and forwards the result to the activation function.
- The neuron fires if the total strength of its inputs merits forwarding its computed value to the next neuron.
Figure 12.9 – three common activation functions
- The output is normally constrained to the range 0–1.0.
[Figure: plots of the threshold function, the piece-wise linear function, and the sigmoid function.]
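The following minimal sketch (illustrative names and values, not from the text) shows the neuron of Figure 12.8 computing a weighted sum and passing it through the sigmoid activation of Figure 12.9:

```python
# Minimal sketch of the neuron of Figure 12.8 with a sigmoid activation
# (Figure 12.9). The inputs and weights below are made-up illustrative values.
from math import exp

def sigmoid(a):
    """Squash the summed input into the range (0, 1)."""
    return 1.0 / (1.0 + exp(-a))

def neuron_output(inputs, weights):
    """Weighted sum of the inputs followed by the activation function."""
    summed = sum(x * w for x, w in zip(inputs, weights))
    return sigmoid(summed)

print(neuron_output([0.5, 0.9, -0.3], [0.8, -0.2, 0.4]))  # a value between 0 and 1
```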
Figure 12.10 – simple single-layer neural network
[Figure: the inputs feed a single layer of neurons, which produce the outputs.]
Figure 12.11 – two-layer neural network
[Figure: the inputs feed a hidden layer of neurons, whose outputs feed a second layer of neurons that produce the network outputs.]
Supervised Learning: Backpropagation
- An iterative learning algorithm with three phases (sketched below):
  1. Presentation of the examples (input patterns with expected outputs) and feed-forward execution of the network
  2. Calculation of the associated errors, obtained by comparing the output of the previous step with the expected output, and back-propagation of this error
  3. Adjustment of the weights
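A minimal sketch of the three phases on a tiny 2-2-1 feed-forward network is shown below (the network size, learning rate, and XOR training data are illustrative choices, not from the text):

```python
# Minimal sketch of backpropagation for a tiny 2-2-1 feed-forward network
# with sigmoid units (illustrative setup; convergence on XOR depends on the
# random initial weights).
import random
from math import exp

def sigmoid(a):
    return 1.0 / (1.0 + exp(-a))

random.seed(0)
# Hidden layer: 2 neurons, each with 2 input weights + bias; output: 1 neuron.
w_hidden = [[random.uniform(-1, 1) for _ in range(3)] for _ in range(2)]
w_out = [random.uniform(-1, 1) for _ in range(3)]

data = [([0, 0], 0), ([0, 1], 1), ([1, 0], 1), ([1, 1], 0)]  # XOR examples
rate = 0.5

for epoch in range(20000):
    for x, target in data:
        # Phase 1: feed-forward execution of the network.
        h = [sigmoid(w[0] * x[0] + w[1] * x[1] + w[2]) for w in w_hidden]
        y = sigmoid(w_out[0] * h[0] + w_out[1] * h[1] + w_out[2])
        # Phase 2: error at the output, propagated back to the hidden layer.
        delta_out = (target - y) * y * (1 - y)
        delta_hidden = [delta_out * w_out[i] * h[i] * (1 - h[i]) for i in range(2)]
        # Phase 3: adjustment of the weights.
        w_out[0] += rate * delta_out * h[0]
        w_out[1] += rate * delta_out * h[1]
        w_out[2] += rate * delta_out
        for i in range(2):
            w_hidden[i][0] += rate * delta_hidden[i] * x[0]
            w_hidden[i][1] += rate * delta_hidden[i] * x[1]
            w_hidden[i][2] += rate * delta_hidden[i]

for x, target in data:
    h = [sigmoid(w[0] * x[0] + w[1] * x[1] + w[2]) for w in w_hidden]
    y = sigmoid(w_out[0] * h[0] + w_out[1] * h[1] + w_out[2])
    print(x, "expected", target, "got", round(y, 2))
```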
Unsupervised Learning: Kohonen Networks
- Clustering by an iterative competitive algorithm
- Note the relation to case-based reasoning (CBR)
Figure 12.13 – Kohonen self-organizing map
[Figure: a grid of cluster units; each unit is connected to all of the inputs through a weight vector Wi.]
Kohonen self-organizing map
- A grid or map of neurons.
- Each neuron is connected to every input applied to the network, and each connection has a weight assigned to it.
- When a new data record is presented during training, the network compares the input vector to the weight vector of each cluster unit in the grid.
- The cluster unit whose weight vector most closely matches the current input vector "takes" the input and modifies its weights to better reflect this input.
- All cluster units near the winning unit likewise adjust their weights to better match the input (see the sketch below).
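Below is a minimal sketch of this competitive, neighborhood-based weight adjustment (a one-dimensional row of cluster units with made-up data; a full Kohonen map uses a two-dimensional grid and a neighborhood that shrinks over time):

```python
# Minimal sketch of Kohonen-style competitive learning on 2-D inputs
# (a 1-D row of cluster units; names and data are illustrative).
import random

random.seed(1)
n_units = 3
weights = [[random.random(), random.random()] for _ in range(n_units)]  # one weight vector per unit

def distance2(a, b):
    return sum((ai - bi) ** 2 for ai, bi in zip(a, b))

data = [[0.1, 0.2], [0.15, 0.25], [0.8, 0.9], [0.85, 0.8], [0.5, 0.1]]
rate = 0.3

for epoch in range(50):
    for x in data:
        # Competition: find the unit whose weight vector best matches the input.
        winner = min(range(n_units), key=lambda i: distance2(weights[i], x))
        # The winner and its neighbors move their weights toward the input.
        for j, unit in enumerate(weights):
            neighborhood = 1.0 if j == winner else (0.5 if abs(j - winner) == 1 else 0.0)
            for d in range(2):
                unit[d] += rate * neighborhood * (x[d] - unit[d])

for x in data:
    winner = min(range(n_units), key=lambda i: distance2(weights[i], x))
    print(x, "-> cluster unit", winner)
```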
Statistical Methods for Data Mining
- Curve fitting
- K-means clustering
- Market basket analysis
Figure 12.14 – 2-D input data plotted on a graph
[Figure: a scatter plot of the data points in the x-y plane.]
Figure 12.15 – data and deviations; curve fitting
[Figure: the same data points with the best-fitting equation drawn through them, showing the deviations of the points from the fitted curve.]
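A minimal sketch of least-squares curve fitting in the spirit of Figure 12.15 (the data points are made up; numpy's polyfit performs the fit):

```python
# Minimal sketch of least-squares curve fitting (illustrative data).
import numpy as np

x = np.array([1.0, 2.0, 3.0, 4.0, 5.0])
y = np.array([2.1, 3.9, 6.2, 7.8, 10.1])    # roughly y = 2x

slope, intercept = np.polyfit(x, y, deg=1)  # best-fitting straight line
predicted = slope * x + intercept
deviations = y - predicted                   # the vertical deviations of Figure 12.15

print(f"best fit: y = {slope:.2f}x + {intercept:.2f}")
print("sum of squared deviations:", float(np.sum(deviations ** 2)))
```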
K-means clustering
- K: the number of clusters to be defined.
- Initially, (randomly) select k centers (centroids).
- Compute the distance between each data point and each center.
- Assign each data point to the cluster whose center is nearest.
- Re-calculate the centers of the current clusters.
- Iterate until no data points are moved (a minimal sketch follows).
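A minimal sketch of the algorithm just described, on made-up 2-D points with k = 2 (function and variable names are illustrative):

```python
# Minimal sketch of k-means clustering on 2-D points (illustrative data;
# plain Python, squared Euclidean distance, k = 2 as in the sample figures).
import random

random.seed(2)
points = [(1.0, 1.2), (1.5, 0.8), (1.2, 1.0), (5.0, 5.5), (5.4, 5.1), (4.8, 5.3)]
k = 2
centers = random.sample(points, k)           # initial (random) centroids

def nearest(p, centers):
    return min(range(len(centers)),
               key=lambda i: (p[0] - centers[i][0]) ** 2 + (p[1] - centers[i][1]) ** 2)

assignment = None
while True:
    new_assignment = [nearest(p, centers) for p in points]
    if new_assignment == assignment:         # no data points moved: stable state
        break
    assignment = new_assignment
    for i in range(k):                       # re-calculate the center of each cluster
        members = [p for p, a in zip(points, assignment) if a == i]
        if members:
            centers[i] = (sum(p[0] for p in members) / len(members),
                          sum(p[1] for p in members) / len(members))

print("centers:", centers)
print("assignment:", assignment)
```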
K-means clustering sample
[Figure: a scatter plot of the data on Variable A vs. Variable B.]
K-means clustering sample (1): Initial state
[Figure: two initial cluster centers (marked x) are chosen for cluster #1 and cluster #2.]
K-means clustering sample (2): Data assignment
[Figure: each data point is assigned to the cluster whose center is nearest.]
K-means clustering sample (3): Re-calculation of centers
[Figure: each center is moved to the mean of the points currently assigned to its cluster.]
K-means clustering sample (4): Re-assignment
[Figure: data points are re-assigned to the nearest of the updated centers.]
K-means clustering sample (5): Stable state
[Figure: no data points change cluster, so the algorithm terminates.]
Figure 12.12 – clusters of related data in 2-D space
[Figure: two clusters of points (cluster #1 and cluster #2) plotted on Variable A vs. Variable B.]
Market Basket Analysis
- Determines the buying patterns of people.
- Its output is expressed as rules that describe associations between products that people purchase together.
- For example:
  - If beer is purchased, then chips and dip are also purchased 75% of the time.
- A merchandiser could promote one of the products and automatically increase the sales of the other (a minimal sketch of the underlying rule measures follows).
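A minimal sketch of the support and confidence measures behind such association rules (the transactions below are made up):

```python
# Minimal sketch of computing an association rule's support and confidence
# from market-basket data (illustrative transactions).
transactions = [
    {"beer", "chips", "dip"},
    {"beer", "chips"},
    {"beer", "chips", "dip", "salsa"},
    {"beer", "bread"},
    {"milk", "bread"},
]

def support(itemset):
    """Fraction of all transactions that contain every item in the itemset."""
    return sum(itemset <= t for t in transactions) / len(transactions)

def confidence(antecedent, consequent):
    """Of the transactions containing the antecedent, the fraction that also contain the consequent."""
    return support(antecedent | consequent) / support(antecedent)

rule_if, rule_then = {"beer"}, {"chips", "dip"}
print("support:", support(rule_if | rule_then))       # 2/5 = 0.4
print("confidence:", confidence(rule_if, rule_then))  # 0.4 / 0.8 = 0.5
```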
Table 12.7: Errors

Heart Disease Diagnostic   | Predicted No Disease | Predicted Presence of Disease
Actual No Disease          | 118 (72%)            | 46 (28%)
Actual Presence of Disease | 43 (30.9%)           | 96 (69.1%)
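A minimal sketch of the overall accuracy and error rate implied by the counts in Table 12.7:

```python
# Minimal sketch of the error rates from the Table 12.7 confusion matrix
# (counts taken directly from the table).
true_negatives, false_positives = 118, 46   # actual: no disease
false_negatives, true_positives = 43, 96    # actual: presence of disease

total = true_negatives + false_positives + false_negatives + true_positives
accuracy = (true_negatives + true_positives) / total
error_rate = (false_positives + false_negatives) / total

print(f"overall accuracy:   {accuracy:.1%}")    # about 70.6%
print(f"overall error rate: {error_rate:.1%}")  # about 29.4%
```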


When to use what
- The following tables (12.3–12.6) provide useful guidelines for determining which technique to use for specific problems.
Table 12.3 (examples from [SPSS, 2000])

Goal: Find the linear combination of predictors that best separates the population
  Input variables (predictors): Continuous
  Output variables (outcomes): Discrete
  Statistical technique: Discriminant Analysis
  Examples: predict instances of fraud; predict whether customers will remain or leave (churners or not); predict which customers will respond to a new product or offer; predict outcomes of various medical procedures

Goal: Predict the probability of the outcome being in a particular category
  Input variables (predictors): Continuous
  Output variables (outcomes): Discrete
  Statistical technique: Logistic and Multinomial Regression
  Examples: predicting insurance policy renewal; predicting fraud; predicting which product a customer will buy; predicting that a product is likely to fail
Table 12.3 (cont.) (examples from [SPSS, 2000])

Goal: Output is a linear combination of the input variables
  Input variables (predictors): Continuous
  Output variables (outcomes): Continuous
  Statistical technique: Linear Regression
  Examples: predict expected revenue in dollars from a new customer; predict sales revenue for a store; predict waiting time on hold for callers to an 800 number; predict length of stay in a hospital based on patient characteristics and medical condition

Goal: For experiments and repeated measures of the same sample
  Input variables (predictors): Most inputs must be Discrete
  Output variables (outcomes): Continuous
  Statistical technique: Analysis of Variance (ANOVA)
  Examples: predict which environmental factors are likely to cause cancer

Goal: Predict future events whose history has been collected at regular intervals
  Input variables (predictors): Continuous
  Output variables (outcomes): Continuous
  Statistical technique: Time Series Analysis
  Examples: predict future sales data from past sales records
Table 12.4 (examples from [SPSS, 2000])

Goal: Predict the outcome based on the values of the nearest neighbors
  Input variables (predictors): Continuous, Discrete, and Text
  Output variables (outcomes): Continuous or Discrete
  Statistical technique: Memory-Based Reasoning (MBR)
  Examples: predicting medical outcomes

Goal: Predict by splitting the data into subgroups (branches)
  Input variables (predictors): Continuous or Discrete (different techniques are used based on the data characteristics)
  Output variables (outcomes): Continuous or Discrete (different techniques are used based on the data characteristics)
  Statistical technique: Decision Trees
  Examples: predicting which customers will leave; predicting instances of fraud

Goal: Predict the outcome in complex non-linear environments
  Input variables (predictors): Continuous or Discrete
  Output variables (outcomes): Continuous or Discrete
  Statistical technique: Neural Networks
  Examples: predicting expected revenue; predicting credit risk
Table 12.5 (examples from [SPSS, 2000])

Goal: Predict by splitting the data into more than two subgroups (branches)
  Input variables (predictors): Continuous, Discrete, or Ordinal
  Output variables (outcomes): Discrete
  Statistical technique: Chi-square Automatic Interaction Detection (CHAID)
  Examples: predict which demographic combinations of predictors yield the highest probability of a sale; predict which factors are causing product defects in manufacturing

Goal: Predict by splitting the data into more than two subgroups (branches)
  Input variables (predictors): Continuous
  Output variables (outcomes): Discrete
  Statistical technique: C5.0
  Examples: predict which loan customers are considered a "good" risk; predict which factors are associated with a country's investment risk
Table 12.5 (cont.) (examples from [SPSS, 2000])

Goal: Predict by splitting the data into binary subgroups (branches)
  Input variables (predictors): Continuous
  Output variables (outcomes): Continuous
  Statistical technique: Classification and Regression Trees (CART)
  Examples: predict which factors are associated with a country's competitiveness; discover which variables are predictors of increased customer profitability

Goal: Predict by splitting the data into binary subgroups (branches)
  Input variables (predictors): Continuous
  Output variables (outcomes): Discrete
  Statistical technique: Quick, Unbiased, Efficient, Statistical Tree (QUEST)
  Examples: predict who needs additional care after heart surgery
Table 12.6 (examples from [SPSS, 2000])

Goal: Find large groups of cases in large data files that are similar on a small set of input characteristics
  Input variables: Continuous or Discrete
  Output variables: No outcome variable
  Statistical technique: K-means Cluster Analysis
  Examples: customer segments for marketing; groups of similar insurance claims

Goal: Create large cluster memberships
  Input variables: Continuous or Discrete (as above)
  Output variables: No outcome variable
  Statistical technique: Kohonen Neural Networks
  Examples: cluster customers into segments based on demographics and buying patterns

Goal: Create a small set of associations and look for patterns between many categories
  Input variables: Logical
  Output variables: No outcome variable
  Statistical technique: Market Basket or Association Analysis with Apriori
  Examples: identify which products are likely to be purchased together; identify which courses students are likely to take together
Conclusions
- The student should be able to use:
  - Curve-fitting algorithms
  - Statistical methods for clustering
  - The C5.0 algorithm to capture rules from examples
  - Basic feed-forward neural networks with supervised learning
  - Unsupervised learning, clustering techniques, and Kohonen networks
  - Other statistical techniques