Discovering New Knowledge - Data Mining
Data Mining
Advantage of Data Mining
The knowledge elicited from experts was
already known by humans.
However, there are considerable limitations
to human knowledge.
Enormous “gold mines” of knowledge are
buried in databases.
Therefore, knowing, understanding, and
using this knowledge can give an
organization significant competitive
advantage.
Example
[Figure: a decision node tests "ratio > 5?"; the Yes branch leads to a "Buy" leaf node, the No branch to a "Don't Buy" leaf node.]
Induction trees
Figure 12.2 – simple induction tree (a parent node with a "Rain" branch leading to a child node)
(step 1)

Name  Outlook  Temperature  Humidity  Class
DS1   Sunny    Mild         Dry       Enjoyable
DS2   Cloudy   Cold         Humid     Not Enjoyable
DS3   Rainy    Mild         Humid     Not Enjoyable
DS4   Sunny    Hot          Humid     Not Enjoyable
[Figure: the tree after splitting on Outlook — the Rain branch receives no samples (None), while the Sunny branch holds both DS1 (Enjoyable) and DS4 (Not Enjoyable) and must be split again.]
Writing the induced tree as rules
[Figure: the Sunny branch split further on Humidity, separating DS1 (Dry, Enjoyable) from DS4 (Humid, Not Enjoyable).]
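Each path from the root of the induced tree to a leaf becomes one if-then rule. A minimal Python sketch of the resulting rule set, assuming the tree splits on Outlook first and then on Humidity, as the weather example suggests:

```python
def classify(outlook, humidity):
    """Classify a day by walking the induced tree, one rule per root-to-leaf path."""
    # Rule 1: IF Outlook = Sunny AND Humidity = Dry THEN Enjoyable
    if outlook == "Sunny" and humidity == "Dry":
        return "Enjoyable"
    # Rule 2: IF Outlook = Sunny AND Humidity = Humid THEN Not Enjoyable
    if outlook == "Sunny" and humidity == "Humid":
        return "Not Enjoyable"
    # Rule 3: IF Outlook = Cloudy OR Rainy THEN Not Enjoyable
    return "Not Enjoyable"

print(classify("Sunny", "Dry"))    # DS1: Enjoyable
print(classify("Sunny", "Humid"))  # DS4: Not Enjoyable
```

All four training samples are classified correctly by these three rules.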
Learning decision trees for
classification into multiple classes
In the previous example, we were learning a function to predict a
boolean (enjoyable = true/false) output.
The same approach can be generalized to learn a function that
predicts a class (when there are multiple predefined
classes/categories).
For example, suppose we are attempting to select a KBS shell
for some application, with the following as our (hypothetical) options:
ThoughtGen, Offsite, Genie, SilverWorks, XS, MilliExpert
using the following attributes and range of values:
Development language: { java, C++, lisp }
Reasoning method: { forward, backward }
External interfaces: { dBase, spreadsheetXL, ASCII file, devices }
Cost: any positive number
Memory: any positive number
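Decision-tree learners such as ID3 rank candidate attributes by how much information a split on each would give. A hedged sketch of entropy and information gain in Python, applied to the four weather samples from the earlier example:

```python
import math
from collections import Counter

def entropy(labels):
    """Shannon entropy of a list of class labels."""
    total = len(labels)
    return -sum((n / total) * math.log2(n / total)
                for n in Counter(labels).values())

def information_gain(samples, attribute, labels):
    """Reduction in entropy obtained by splitting the samples on one attribute."""
    total = len(samples)
    remainder = 0.0
    for value in set(s[attribute] for s in samples):
        subset = [lab for s, lab in zip(samples, labels) if s[attribute] == value]
        remainder += (len(subset) / total) * entropy(subset)
    return entropy(labels) - remainder

# The four training samples from the weather example:
samples = [
    {"Outlook": "Sunny",  "Temperature": "Mild", "Humidity": "Dry"},
    {"Outlook": "Cloudy", "Temperature": "Cold", "Humidity": "Humid"},
    {"Outlook": "Rainy",  "Temperature": "Mild", "Humidity": "Humid"},
    {"Outlook": "Sunny",  "Temperature": "Hot",  "Humidity": "Humid"},
]
labels = ["Enjoyable", "Not Enjoyable", "Not Enjoyable", "Not Enjoyable"]

best = max(samples[0], key=lambda a: information_gain(samples, a, labels))
print(best)
```

On this tiny set, splitting on Humidity separates the classes completely, so it has the highest gain.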
Table 12.2 – collection of data samples (training set) described as vectors of attributes (feature vectors)
[Table content garbled in extraction; surviving fragments include the Language attribute, Forward reasoning, the ASCII, Devices, SpreadsheetXL, and dBase interfaces, and the shells MilliExpert, ThoughtGen, and OffSite.]
Order of choosing attributes
[Figure: an example split on Humidity, with Humid and Dry branches.]
[Figure: a single neuron — inputs x1 … xk arrive with weights W1 … Wn, a summing junction Σ combines them, and an activation function f() produces the output y.]
Neuron as the basic element
The inputs enter a neuron either from external
sources or, in multi-layered networks (defined later),
from the outputs of neurons in the previous layer.
The weights are values that signify the strength of
the connection.
The inputs for each neuron are multiplied by their
respective weights to form the input to the summing
junction.
The summing junction adds all the inputs into the
neuron, and forwards the result to the activation
function.
The neuron fires if the total strength of its inputs
merits forwarding its computed value to the next
neuron.
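The behaviour just described — a weighted sum at the summing junction, then a firing decision — can be sketched in a few lines of Python (the inputs, weights, and threshold below are made-up values):

```python
def neuron_fires(inputs, weights, threshold=1.0):
    """Weighted sum at the summing junction, then a simple threshold check."""
    summation = sum(x * w for x, w in zip(inputs, weights))
    return summation >= threshold  # the neuron "fires" only above threshold

neuron_fires([0.5, 0.3, 0.9], [0.4, -0.2, 0.7])  # sum is 0.77 < 1.0: does not fire
```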
Figure 12.9 – three common activation functions: the threshold function, the piece-wise linear function, and the sigmoid function

The output is normally constrained to the range 0–1.0.
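The three activation functions can be written out directly; each keeps the output within the 0–1.0 range. A sketch using their common textbook definitions:

```python
import math

def threshold(s):
    """Step function: output jumps from 0 to 1 at s = 0."""
    return 1.0 if s >= 0 else 0.0

def piecewise_linear(s):
    """Linear between 0 and 1, clipped outside that range."""
    return min(1.0, max(0.0, s))

def sigmoid(s):
    """Smooth S-shaped curve; output approaches 0 and 1 asymptotically."""
    return 1.0 / (1.0 + math.exp(-s))
```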
Figure 12.10 – simple single layer neural network (inputs feed one layer of neurons, which produce the outputs)

Figure 12.11 – two-layer neural network (inputs feed a hidden layer of neurons, whose outputs feed a second layer of neurons that produce the outputs)
Supervised Learning: Back
Propagation
An iterative learning algorithm with three
phases:
1. Presentation of the examples (input patterns with
outputs) and feed forward execution of the
network
2. Calculation of the associated errors when the
output of the previous step is compared with the
expected output and back propagation of this
error
3. Adjustment of the weights
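The three phases can be illustrated on a single sigmoid neuron (made-up data; a full multi-layer network would also propagate the error back through its hidden layers):

```python
import math

def sigmoid(s):
    return 1.0 / (1.0 + math.exp(-s))

# Made-up training example: two inputs and an expected output.
inputs, expected = [0.5, 0.9], 1.0
weights = [0.1, -0.3]
rate = 0.5  # learning rate for the weight adjustment

for _ in range(1000):
    # Phase 1: feed-forward execution of the network.
    output = sigmoid(sum(x * w for x, w in zip(inputs, weights)))
    # Phase 2: compare with the expected output; the sigmoid's
    # derivative output*(1-output) scales the propagated error.
    delta = (expected - output) * output * (1.0 - output)
    # Phase 3: adjust each weight in proportion to its input.
    weights = [w + rate * delta * x for x, w in zip(inputs, weights)]
```

After the loop, the neuron's output is close to the expected value.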
Unsupervised Learning: Kohonen
Networks
Clustering by an iterative competitive
algorithm
Note the relation to case-based reasoning (CBR)
Figure 12.13 – Kohonen self-organizing map (inputs connected to a grid of neurons through weights Wi)

Kohonen self-organizing map
A grid or map of neurons
Each neuron is connected to every input applied to
the network, and each connection has a weight
assigned to it.
When a new data record is presented during training,
the network compares the input vector to the weight
vector of each cluster unit in the grid.
The one cluster unit whose weight pattern most
closely matches the current input vector "takes" the
input and modifies its weights to better reflect this
input.
All cluster units near the winning unit likewise adjust
their weights toward the input.
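The winner-take-most update just described can be sketched as one training step in Python (toy vectors; a full Kohonen map would also update the winner's grid neighbours with a smaller rate):

```python
def closest_unit(x, units):
    """Index of the cluster unit whose weight vector best matches input x."""
    def dist2(w):
        return sum((xi - wi) ** 2 for xi, wi in zip(x, w))
    return min(range(len(units)), key=lambda i: dist2(units[i]))

def train_step(x, units, rate=0.5):
    """Move the winning unit's weights part of the way toward the input vector."""
    win = closest_unit(x, units)
    units[win] = [wi + rate * (xi - wi) for xi, wi in zip(x, units[win])]
    return win
```

For example, with two units at [0, 0] and [1, 1], the input [0.9, 0.8] is taken by the second unit, whose weights move to [0.95, 0.9].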
Statistical Methods for Data Mining
Curve fitting
K-means clustering
Market Basket Analysis
Figure 12.14 – 2-D input data plotted on a graph (x and y axes)

Figure 12.15 – data and deviations

Curve fitting
[Figure: the data points on x–y axes with the best-fitting equation drawn through them.]
K-means clustering
K: number of clusters to be defined
Initially, (randomly) select k centers
(centroids)
Compute the distance between each data point and
each center.
Assign each data point to the cluster whose
center is the nearest.
Re-calculate centers of the current clusters
Iterate until no data are moved.
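The steps above can be sketched directly in Python (plain 2-D points, random initial centroids):

```python
import random

def kmeans(points, k, iterations=100, seed=0):
    """Plain k-means: assign each point to the nearest center, then re-center."""
    rng = random.Random(seed)
    centers = rng.sample(points, k)  # randomly select k initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iterations):
        clusters = [[] for _ in range(k)]
        # Assign each data point to the cluster whose center is the nearest.
        for p in points:
            nearest = min(range(k),
                          key=lambda i: (p[0] - centers[i][0]) ** 2
                                      + (p[1] - centers[i][1]) ** 2)
            clusters[nearest].append(p)
        # Re-calculate the center of each current cluster.
        new_centers = [
            (sum(p[0] for p in cl) / len(cl), sum(p[1] for p in cl) / len(cl))
            if cl else centers[i]
            for i, cl in enumerate(clusters)
        ]
        if new_centers == centers:  # no data moved: stable state reached
            break
        centers = new_centers
    return centers, clusters

# Two obvious groups of points, clustered with k = 2.
points = [(0, 0), (0, 1), (1, 0), (9, 9), (9, 10), (10, 9)]
centers, clusters = kmeans(points, 2)
```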
K-means clustering sample
[Figures: five snapshots of k-means on two variables, Variable A (horizontal) vs. Variable B (vertical), with centers marked "x":
(1) Initial state – the initial centers of cluster #1 and cluster #2 are placed.
(2) Data assignment – each point is assigned to cluster #1 or cluster #2, whichever center is nearer.
(3) Re-calculation of centers – each center moves to the mean of its cluster.
(4) Re-assignment – points are re-assigned to the nearest updated center.
(5) Stable state – no points change cluster, so the algorithm stops.]

Figure 12.12 – clusters of related data in 2-D space (cluster #1 and cluster #2 plotted against Variable A and Variable B)
Market Basket Analysis
Heart Disease Diagnostic

                     Predicted No Disease   Predicted Presence of Disease
Actual No Disease    118 (72%)              46 (28%)
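Market basket analysis looks for association rules of the form "customers who buy X also buy Y", ranked by support and confidence. A minimal sketch over made-up transactions:

```python
def support(itemset, baskets):
    """Fraction of all baskets that contain every item in the itemset."""
    hits = sum(1 for b in baskets if itemset <= b)
    return hits / len(baskets)

def confidence(lhs, rhs, baskets):
    """Of the baskets containing lhs, the fraction that also contain rhs."""
    return support(lhs | rhs, baskets) / support(lhs, baskets)

# Made-up transactions, one set of items per basket.
baskets = [
    {"bread", "milk"},
    {"bread", "butter"},
    {"bread", "milk", "butter"},
    {"milk"},
]
# Rule {bread} -> {milk}: how often do bread buyers also take milk?
conf = confidence({"bread"}, {"milk"}, baskets)
```

Here bread appears in 3 of 4 baskets (support 0.75), and 2 of those 3 also contain milk, giving the rule a confidence of 2/3.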