Data Mining in Market Research
We'll use:
WEKA (Waikato Environment for Knowledge Analysis)
Free (GPLed) Java package with GUI
Online at www.cs.waikato.ac.nz/ml/weka
Witten, I.H. and Frank, E., 2000. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. Morgan Kaufmann.
R packages
E.g. rpart, class, tree, nnet, cclust, deal, GeneSOM, knnTree,
mlbench, randomForest, subselect
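As a quick sketch of how little code this takes (assuming the rpart package is installed; the iris data and settings are illustrative), a classification tree in R:

    # Grow a classification tree with rpart on R's built-in iris data
    library(rpart)
    fit <- rpart(Species ~ ., data = iris, method = "class")
    print(fit)                                  # text form of the fitted tree
    predict(fit, iris[1:3, ], type = "class")   # predicted classes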
Numeric prediction
Regression trees, model trees
Association rules
Meta-learning methods
Cross-validation, bagging, boosting
Classification
Methods for predicting a discrete response
One kind of supervised learning
Note: in biological and other sciences, classification
has long had a different meaning, referring to cluster
analysis
Applications include:
Identifying good prospects for specific marketing or
sales efforts
Cross-selling and up-selling: when to offer which products
Customers likely to be especially profitable
Customers likely to defect
Weather/Game-Playing Data
Small dataset
14 instances
5 attributes
Outlook - nominal
Temperature - numeric
Humidity - numeric
Wind - nominal
Play - nominal
Whether or not a certain game would be played
This is what we want to understand and predict
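For concreteness, here is a sketch of the first five of the 14 instances (values as commonly reproduced from Witten and Frank's weather data), entered as an R data frame so the attribute types are explicit:

    # First 5 of the 14 weather instances; nominal attributes as factors
    weather <- data.frame(
      outlook     = factor(c("sunny", "sunny", "overcast", "rainy", "rainy")),
      temperature = c(85, 80, 83, 70, 68),    # numeric
      humidity    = c(85, 90, 86, 96, 80),    # numeric
      wind        = factor(c(FALSE, TRUE, FALSE, FALSE, FALSE)),
      play        = factor(c("no", "no", "yes", "yes", "yes"))
    )
    str(weather)   # check the attribute types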
Classification Algorithms
Many methods available in WEKA
0R, 1R, NaiveBayes, DecisionTable, ID3, PRISM,
Instance-based learner (IB1, IBk), C4.5 (J48), PART,
Support vector machine (SMO)
1R Algorithm (continued)
1R is biased towards predictors with more categories
Such predictors can result in over-fitting to the training data
By contrast, Naive Bayes combines evidence from all the attributes \(A, B, C\), assuming they are independent given the class \(X\):
\[ P(X \mid A, B, C) = \frac{P(A \mid X)\, P(B \mid X)\, P(C \mid X)\, P(X)}{P(A, B, C)} \]
Decision Trees
Classification rules can be expressed in a tree
structure
Move from the top of the tree, down through various
nodes, to the leaves
At each node, a decision is made using a simple test
based on attribute values
The leaf you reach holds the appropriate predicted value
If x = 1 and y = 1 then class = a
If z = 1 and w = 1 then class = a
Otherwise class = b
Weather Example
First node from the outlook split is for sunny, with entropy \( -\tfrac{2}{5}\log_2\tfrac{2}{5} - \tfrac{3}{5}\log_2\tfrac{3}{5} = 0.971 \)
Average entropy of the nodes from the outlook split is \( \tfrac{5}{14} \times 0.971 + \tfrac{4}{14} \times 0 + \tfrac{5}{14} \times 0.971 = 0.693 \)
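These entropy calculations are easy to reproduce; a minimal R sketch, with the class counts and split weights taken from the weather example above:

    # Entropy (in bits) of a vector of class counts
    entropy <- function(counts) {
      p <- counts / sum(counts)
      p <- p[p > 0]              # treat 0 * log2(0) as 0
      -sum(p * log2(p))
    }
    entropy(c(2, 3))   # sunny:    0.971
    entropy(c(4, 0))   # overcast: 0
    entropy(c(3, 2))   # rainy:    0.971
    # Weighted average over the outlook split (5, 4 and 5 of 14 cases)
    (5/14) * 0.971 + (4/14) * 0 + (5/14) * 0.971   # = 0.693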
Classification Trees
Described (along with regression trees) in:
L. Breiman, J.H. Friedman, R.A. Olshen, and C.J. Stone, 1984. Classification and Regression Trees. Wadsworth.
Regression Trees
Trees can also be used to predict numeric
attributes
Predict using average value of the response in the
appropriate node
Implemented in CART and C4.5 frameworks
Another approach that often works well is to fit the tree, remove all
training cases that are not correctly predicted, and refit the tree on
the reduced dataset
Typically gives a smaller tree
This usually works almost as well on the training data
But generalises better, e.g. works better on test data
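A rough R sketch of that refit idea, assuming a classification task with rpart (the data and settings are illustrative, not the original example):

    library(rpart)
    # Grow an initial tree, then refit on only the correctly
    # predicted training cases
    fit1 <- rpart(Species ~ ., data = iris, method = "class")
    pred <- predict(fit1, iris, type = "class")
    kept <- iris[pred == iris$Species, ]    # drop misclassified cases
    fit2 <- rpart(Species ~ ., data = kept, method = "class")
    # fit2 is typically smaller than fit1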
Scheme:       weka.classifiers.j48.J48 -C 0.25 -M 2
Relation:     german_credit
Instances:    1000
Attributes:   21

Number of Leaves  : 103
Size of the tree  : 140

Correctly Classified Instances    739    73.9 %
Incorrectly Classified Instances  261    26.1 %

=== Confusion Matrix ===

   a   b   <-- classified as
 618  82 |  a = good
 179 121 |  b = bad
Cross-Validation
Due to over-fitting, cannot estimate prediction
error directly on the training dataset
Cross-validation is a simple and widely used
method for estimating prediction error
Simple approach
Set aside a test dataset
Train learner on the remainder (the training dataset)
Estimate prediction error by using the resulting
prediction model on the test dataset
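A minimal holdout sketch in R (the 70/30 split and the rpart learner are illustrative choices):

    library(rpart)
    set.seed(1)
    n    <- nrow(iris)
    test <- sample(n, size = round(0.3 * n))    # hold out 30% as the test set
    fit  <- rpart(Species ~ ., data = iris[-test, ], method = "class")
    pred <- predict(fit, iris[test, ], type = "class")
    mean(pred != iris$Species[test])            # estimated prediction error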
k-fold Cross-Validation
For smaller datasets, use k-fold cross-validation
Split dataset into k roughly equal parts
[Diagram: the dataset divided into k blocks, with one block (Test) held out and the remaining blocks (Tr) used for training]
For each part, train on the other k-1 parts and use this part as
the test dataset
Do this for each of the k parts, and average the resulting
prediction errors
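Sketched in R for a tree learner (k = 10 and the data are illustrative):

    library(rpart)
    k    <- 10
    fold <- sample(rep(1:k, length.out = nrow(iris)))   # random fold labels
    errs <- numeric(k)
    for (i in 1:k) {
      fit     <- rpart(Species ~ ., data = iris[fold != i, ], method = "class")
      pred    <- predict(fit, iris[fold == i, ], type = "class")
      errs[i] <- mean(pred != iris$Species[fold == i])
    }
    mean(errs)   # cross-validated error estimate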
[Figure: regression tree for Mileage, n = 60, root mean 24.58. Weight >= 2568: mean 22.47, n = 45; Weight < 2568: mean 30.93, n = 15. Within the heavier group, Weight >= 3088: mean 20.41, n = 22; Weight < 3088: mean 24.43, n = 23, which splits again into Weight >= 2748: mean 23.8, n = 15 and Weight < 2748: mean 25.63, n = 8.]
Call:
rpart(formula = Mileage ~ Weight, data = car.test.frame)
n = 60

      xerror       xstd
1  1.0322233  0.17981796
2  0.6081645  0.11371656
3  0.4557341  0.09178782
4  0.4659556  0.09134201

[Figure: plotcp output, cross-validated relative error (roughly 0.4 to 1.2) against cp values Inf, 0.28, 0.042, 0.011]
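The xerror column is this kind of cross-validated error (rpart cross-validates internally). A common follow-up, sketched here, is to prune at the cp value minimising xerror, or at the simplest tree within one xstd of that minimum:

    library(rpart)
    fit <- rpart(Mileage ~ Weight, data = car.test.frame)
    printcp(fit)   # the xerror/xstd table shown above
    plotcp(fit)    # xerror against cp
    # Prune at the cp with the smallest cross-validated error
    best   <- fit$cptable[which.min(fit$cptable[, "xerror"]), "CP"]
    pruned <- prune(fit, cp = best)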
Classification Methods
Project the attribute space into decision regions
Decision trees: piecewise constant approximation
Logistic regression: linear log-odds approximation
Discriminant analysis and neural nets: linear & non-linear separators
They can be useful tools for learning from examples to find patterns
in data and predict outputs
However, on their own, they tend to over-fit the training data
Meta-learning tools are needed to choose the best fit
ANNs have been applied to data editing and imputation, but not
widely
Bagging: fit the learner to each of B bootstrap resamples of the training data and average the results
E.g. for a tree learner, the bagged estimate is the average prediction from the resulting B trees (see the sketch below)
Note that this is not a tree
In general, bagging a model or learner does not produce a model or learner of the same form
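A bagging sketch in R (B = 50 and the mileage data are illustrative):

    library(rpart)
    B     <- 50
    n     <- nrow(car.test.frame)
    preds <- matrix(NA, nrow = n, ncol = B)
    for (b in 1:B) {
      boot       <- car.test.frame[sample(n, replace = TRUE), ]  # bootstrap resample
      fit        <- rpart(Mileage ~ Weight, data = boot)         # one tree per resample
      preds[, b] <- predict(fit, car.test.frame)
    }
    bagged <- rowMeans(preds)   # the bagged estimate: an average, not a tree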
Association Rules
The support/confidence approach is widely used
Efficiently implemented in the Apriori algorithm
First identify item sets with sufficient support
Then turn each item set into sets of rules with sufficient confidence
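A sketch using the arules R package (an assumption: arules is not among the packages listed earlier, and the support/confidence thresholds are illustrative):

    library(arules)
    data(Groceries)   # example transaction data shipped with arules
    rules <- apriori(Groceries,
                     parameter = list(support = 0.01, confidence = 0.5))
    inspect(head(sort(rules, by = "confidence")))   # strongest rules first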
Clustering EM Algorithm
Assume that the data is from a mixture of normal
distributions
I.e. one normal component for each cluster
Log-likelihood:
\[ l(\theta; X) = \sum_{i=1}^{n} \log\left[\, p\,\phi_1(x_i) + (1 - p)\,\phi_2(x_i) \,\right] \]
where \(\phi_1, \phi_2\) are the two normal densities and \(p\) is the mixing proportion
Clustering EM Algorithm
Think of the data as being augmented by a latent 0/1 variable \(d_i\) indicating membership of cluster 1
If the values of this variable were known, the log-likelihood would be:
\[ l(\theta; X, D) = \sum_{i=1}^{n} \log\left[\, d_i\,\phi_1(x_i) + (1 - d_i)\,\phi_2(x_i) \,\right] \]
Clustering EM Algorithm
Resulting estimates may only be a local
maximum
Run several times with different starting points
to find the global maximum (hopefully), as in the sketch below
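A self-contained EM sketch in R for the two-component univariate case, with random restarts (toy data; all settings illustrative):

    # EM for a two-component normal mixture
    em2 <- function(x, iters = 200) {
      p  <- runif(1, 0.3, 0.7)               # random starting values
      mu <- sample(x, 2)
      s  <- rep(sd(x), 2)
      for (t in 1:iters) {
        # E-step: responsibilities for cluster 1
        d1 <- p * dnorm(x, mu[1], s[1])
        d2 <- (1 - p) * dnorm(x, mu[2], s[2])
        r  <- d1 / (d1 + d2)
        # M-step: weighted updates of the proportion, means and sds
        p     <- mean(r)
        mu[1] <- sum(r * x) / sum(r)
        mu[2] <- sum((1 - r) * x) / sum(1 - r)
        s[1]  <- sqrt(sum(r * (x - mu[1])^2) / sum(r))
        s[2]  <- sqrt(sum((1 - r) * (x - mu[2])^2) / sum(1 - r))
      }
      loglik <- sum(log(p * dnorm(x, mu[1], s[1]) +
                        (1 - p) * dnorm(x, mu[2], s[2])))
      list(p = p, mu = mu, sd = s, loglik = loglik)
    }
    x    <- c(rnorm(100, 0, 1), rnorm(100, 4, 1))   # toy two-cluster data
    fits <- replicate(5, em2(x), simplify = FALSE)  # 5 random restarts
    best <- fits[[which.max(sapply(fits, `[[`, "loglik"))]]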
Clustering EM Algorithm
Extending to more latent classes is easy
Information criteria such as AIC and BIC are often used to decide
how many are appropriate
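For a fitted mixture with maximised log-likelihood \( \hat{l} \), \( k \) free parameters and \( n \) observations, \( \mathrm{AIC} = -2\hat{l} + 2k \) and \( \mathrm{BIC} = -2\hat{l} + k \log n \); the number of components minimising the criterion is chosen.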
Probabilistic/EM
Multi-Resolution kd-Tree for EM [Moore99]
Scalable EM [BRF98b]
CF Kernel Density Estimation [ZRL99]