
Data Mining in Excel Using

XLMiner

Nitin R. Patel
Cytel Software and M.I.T. Sloan

1
Contact Info
XLMiner is distributed by Resampling
Stats, Inc.
www.xlminer.net
Contact Peter Bruce: pbruce@resample.com
703-522-2713

2
What is XLMiner?
XLMiner is an affordable, easy-to-use tool for
business analysts, consultants and business
students to:
learn strengths and weaknesses of data mining methods,
prototype large scale data mining applications,
implement medium scale data mining applications.
More generally, XLMiner is a tool for data
analysis in Excel that uses classical and modern,
computationally-intensive techniques.
3
Available Data Mining Software
Application-specific: aimed at providing
solutions to end-users for common tasks
(e.g. Unica for Customer Relationship
Management, Urban Science for location
and distribution)
Technique-specific: focused on a few data
mining methods (e.g. CART from Salford
Systems, Neural Nets from HNC
Software)

4
TECHNIQUE-SPECIFIC PRODUCTS (Source: Elder Research)

[Matrix of algorithms vs. products, with "x" marking the algorithms each product covers. Algorithms: Class. & Regr. Trees, Linear Regression, Multilayer Neural Net, K-Nearest Neighbors, Radial Basis Fns., Naïve Bayes, Rule Induction, Logistic Regression, Time Series, Sequential Rules, K-Means, Association Rules, Kohonen Net. Products: CART (Salford), NeuroShell, WizWhy, Cognos, See5; each covers only one or a few of the algorithms.]
Available Data Mining Software
Horizontal products: designed for data
mining analysts (e.g. SAS Enterprise
Miner, SPSS Clementine, IBM Intelligent
Miner, NCR Teraminer, S-Plus Insightful
Miner, Darwin/Oracle)
Powerful, comprehensive, easy-to-use; but
Need substantial learning effort
Expensive
6
HORIZONTAL PRODUCTS (Source: Elder Research)

[Matrix of algorithms vs. products, with "x" marking the algorithms each product covers. Algorithms: Class. & Regr. Trees, Linear Regression, Multilayer Neural Net, K-Nearest Neighbors, Radial Basis Fns., Naïve Bayes, Rule Induction, Logistic Regression, Time Series, Sequential Rules, K-Means, Association Rules, Kohonen Net. Coverage: Enterprise Miner (SAS) 9 algorithms, Clementine (SPSS) 7, Intelligent Miner (IBM) 8, MineSet (SGI) 5, Darwin (Oracle) 4, PRW (Unica) 6.]
7
Desiderata for Data Mining and
Modern Data Analysis Software
Easy-to-use
Data import (e.g. cross-platform, various databases)
Data handling (e.g. data partitioning, scoring)
Invoking and experimenting with procedures
Comprehensive Range of Procedures:
Statistics (e.g. Regression, Multivariate procedures)
Machine learning (e.g. Neural Nets, Classification
Trees)
Database (e.g. Association Rules)
8
XLMiner is Unique
Low cost,
Comprehensive set of data mining models and
algorithms that includes statistical, machine
learning and database methods,
Based on prototype used in three years of MBA
courses on data mining at Sloan School, M.I.T.
Focus on business applications: Book of lecture
notes and cases in preparation (first draft available
for examination).

9
Why Data Mining in Excel?

Leverage the familiarity of MBA students,
managers and business analysts with the
interface and functionality of Excel to
provide them with hands-on experience in
data mining.

10
Advantages
Low learning hurdle
Promotes understanding of strengths and
weaknesses of different data mining techniques
and processes
Enables interactive analysis of data (important in
early stages of model building)
Facilitates incorporation of domain knowledge
(often key to successful applications) by
empowering end-users to participate actively in
data mining projects
Enables pre-processing of data and post-
processing of results using Excel functions,
reporting in Word, presentations in PowerPoint
11
Advantages (cont.)
Supports communication between data miners and
end-users
Supports smooth transition from prototyping to
custom solution development (VB and VBA)
Emphasizes openness
enables integration with other analytic software for
optimization (Solver), simulation (Crystal Ball),
numerical methods;
interface modifications (e.g. custom forms and outputs)
solution-specific routines (VBA)
Examples:
Boston Celtics analysis of player statistics
Clustering for improving forecasts, optimizing price
markdowns.
12
Size Limitations
An Excel worksheet cannot exceed 65,536 rows.
If data records are stored as rows in a single
worksheet this is the largest data set that can be
accommodated. The number of variables cannot
exceed 256 (the number of columns).
These limits do not apply to deployment of model
to score large databases.
If Excel is used as a view-port into a database such
as Access, MS SQL Server, Oracle or SAS, these
limits do not apply.

13
Sampling
Practical Data Mining Methodologies such
as SEMMA (SAS) and CRISP-DM (SPSS
and European Industry Standard)
recommend working with a sample
(typically 10,000 random cases) in the
model and algorithm selection phase. This
facilitates interactive development of data
mining models.
14
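The sample-then-model step above can be sketched in a few lines of plain Python. This is an illustrative sketch, not XLMiner's implementation; the function name and seed are made up:

```python
import random

def draw_sample(records, n=10_000, seed=12345):
    """Simple random sample of up to n records for the model-building phase."""
    if len(records) <= n:
        return list(records)
    rng = random.Random(seed)         # fixed seed: the sample is reproducible
    return rng.sample(records, n)     # sampling without replacement

population = list(range(100_000))     # stand-in for a large database table
sample = draw_sample(population)
```

Model and algorithm selection then runs against `sample` instead of the full table.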
XLMiner
Free 30 day trial version: limit is 200 records per
partition.
Education version: limit is 2,000 records per
partition, so maximum size for a data set is 6,000
records.
Standard version (currently in beta test; will be
available by end of August):
Up to 60,000 records obtained by drawing samples
from large databases in accordance with SAS's
SEMMA (Sample, Explore, Model, Measure, Apply)
methodology. Training data restricted to 10,000 records
Sampling from and scoring to Access databases (later
SQL Server, Oracle, SAS)
15
Data Mining Procedures in
XLMiner
Partitioning data sets (into Training, Validation,
and Test data sets)
Scoring of training, validation, test and other data
Prediction (of a continuous variable)
Classification
Data reduction and exploration
Affinity
Utilities: Sampling, graphics, missing data,
binning, creation of dummy variables
16
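Random partitioning of this kind can be sketched as follows. An illustrative Python sketch, not XLMiner's code; the 50/30/20 fractions and the rounding rule are assumptions chosen to match the Boston Housing example later in the deck:

```python
import random

def partition(rows, fractions=(0.5, 0.3, 0.2), seed=81801):
    """Randomly partition rows into training, validation, and test sets."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * fractions[0])
    n_valid = round(len(shuffled) * fractions[1])
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])

# 506 Boston Housing tracts -> 253 / 152 / 101 with a 50/30/20 split
train, valid, test = partition(list(range(506)))
```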
Prediction
Multiple Linear Regression with subset
selection, residual analysis, and collinearity
diagnostics.
K-Nearest Neighbors
Regression Tree
Neural Net

17
Classification
Logistic Regression with subset selection,
residual analysis, and collinearity diagnostics
Discriminant Analysis
K-Nearest Neighbors
Classification Tree
Naïve Bayes
Neural Networks
18
Data Reduction and Exploration
Principal Components
K-Means Clustering
Hierarchical Clustering

19
Affinity
Association Rules (Market Basket Analysis)

20
Partitioning

Aim: To construct training,
validation, and test data sets from
Boston Housing data
21
22
Boston Housing Data
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
0.00632 18 2.31 0 0.538 6.575 65.2 4.09 1 296 15.3 397 4.98 24
0.02731 0 7.07 0 0.469 6.421 78.9 4.97 2 242 17.8 397 9.14 21.6
0.02729 0 7.07 0 0.469 7.185 61.1 4.97 2 242 17.8 393 4.03 34.7
0.03237 0 2.18 0 0.458 6.998 45.8 6.06 3 222 18.7 395 2.94 33.4
0.06905 0 2.18 0 0.458 7.147 54.2 6.06 3 222 18.7 397 5.33 36.2
0.02985 0 2.18 0 0.458 6.43 58.7 6.06 3 222 18.7 394 5.21 28.7
0.08829 13 7.87 0 0.524 6.012 66.6 5.56 5 311 15.2 396 12.43 22.9
0.14455 13 7.87 0 0.524 6.172 96.1 5.95 5 311 15.2 397 19.15 27.1
0.21124 13 7.87 0 0.524 5.631 100 6.08 5 311 15.2 387 29.93 16.5
0.17004 13 7.87 0 0.524 6.004 85.9 6.59 5 311 15.2 387 17.1 18.9
0.22489 13 7.87 0 0.524 6.377 94.3 6.35 5 311 15.2 393 20.45 15
0.11747 13 7.87 0 0.524 6.009 82.9 6.23 5 311 15.2 397 13.27 18.9
0.09378 13 7.87 0 0.524 5.889 39 5.45 5 311 15.2 391 15.71 21.7
0.62976 0 8.14 0 0.538 5.949 61.8 4.71 4 307 21 397 8.26 20.4

23
XLMiner : Data Partition Sheet Date: 29-Jul-2003 13:50:09 (Ver: 1.2.0.1)

Output Navigator
Training Data Validation Data Test Data

Data
Data source housing!$A$2:$O$507

Selected variables CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B

Partitioning Method Randomly chosen


Random Seed 81801
# training rows 253
# validation rows 152
# test rows 101

Selected variables

CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
Row Id.
1 0.00632 18 2.31 0 0.538 6.575 65.2 4.09 1 296 15.3 396.9 4.98
2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.9 9.14
5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.9 5.33
6 0.02985 0 2.18 0 0.458 6.43 58.7 6.0622 3 222 18.7 394.12 5.21
7 0.08829 12.5 7.87 0 0.524 6.012 66.6 5.5605 5 311 15.2 395.6 12.43
8 0.14455 12.5 7.87 0 0.524 6.172 96.1 5.9505 5 311 15.2 396.9 19.15
10 0.17004 12.5 7.87 0 0.524 6.004 85.9 6.5921 5 311 15.2 386.71 17.1
12 0.11747 12.5 7.87 0 0.524 6.009 82.9 6.2267 5 311 15.2 396.9 13.27
14 0.62976 0 8.14 0 0.538 5.949 61.8 4.7075 4 307 21 396.9 8.26

3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
9 0.21124 12.5 7.87 0 0.524 5.631 100 6.0821 5 311 15.2 386.63 29.93
13 0.09378 12.5 7.87 0 0.524 5.889 39 5.4509 5 311 15.2 390.5 15.71

4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
11 0.22489 12.5 7.87 0 0.524 6.377 94.3 6.3467 5 311 15.2 392.52 20.45
17 1.05393 0 8.14 0 0.538 5.935 29.3 4.4986 4 307 21 386.85 6.58

24
Prediction
Multiple Linear Regression using
subset selection

Aim: To estimate median residential
property value for a census tract

25
The Regression Model

Input variables  Coefficient  Std. Error  p-value  SS
Constant term    32.677       7.444       0.000    128852
CRIM             -0.094       0.049       0.054      3566
ZN                0.055       0.020       0.007      2550
INDUS             0.030       0.091       0.742      1529
CHAS              2.836       1.199       0.019       645
NOX             -15.889       5.463       0.004       143
RM                3.872       0.597       0.000      4697
AGE               0.007       0.019       0.728         0
DIS              -1.405       0.292       0.000       938
RAD               0.358       0.097       0.000         1
TAX              -0.013       0.005       0.019       174
PTRATIO          -0.934       0.208       0.000       620
B                 0.014       0.004       0.000       502
LSTAT            -0.582       0.073       0.000      1623

Residual df 239 | Multiple R-squared 0.738 | Std. Dev. estimate 5.025 | Residual SS 6036

Training Data scoring - Summary Report
Total sum of squared errors 6036 | RMS Error 4.884 | Average Error 0.000

Validation Data scoring - Summary Report
Total sum of squared errors 2848 | RMS Error 4.329 | Average Error 0.066

Test Data scoring - Summary Report
Total sum of squared errors 2392 | RMS Error 4.866 | Average Error -1.019

# Records: training 253, validation 152, test 101
26
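The three summary reports are simple functions of the residuals. A minimal sketch (not XLMiner's code) of how the figures relate:

```python
import math

def scoring_summary(actual, predicted):
    """Total sum of squared errors, RMS error, and average error."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    sse = sum(r * r for r in residuals)
    rms = math.sqrt(sse / len(residuals))
    avg = sum(residuals) / len(residuals)   # positive: model under-predicts on balance
    return sse, rms, avg

# Toy MEDV values; a real run would score all validation records.
sse, rms, avg = scoring_summary([24, 21.6, 34.7], [25, 20.6, 33.7])
```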
Subset selection (exhaustive enumeration)

Subset size  RSS         Cp        R-Squared  Adjusted R-Squared  Prob    Models (Constant present in all models)
2            19472.3789  362.7529  0.5441     0.5432              0.0000  Constant LSTAT
3            15439.3086  185.6474  0.6386     0.6371              0.0000  Constant RM LSTAT
4            13727.9863  111.6489  0.6786     0.6767              0.0000  Constant RM PTRATIO LSTAT
5            13228.9072  91.4852   0.6903     0.6878              0.0000  Constant RM DIS PTRATIO LSTAT
6            12469.3447  59.7537   0.7081     0.7052              0.0000  Constant NOX RM DIS PTRATIO LSTAT
7            12141.0723  47.1754   0.7158     0.7123              0.0000  Constant CHAS NOX RM DIS PTRATIO LSTAT

27
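Exhaustive enumeration fits every subset of each size and keeps the one with the lowest RSS. A sketch of the enumeration skeleton, with a made-up RSS function standing in for a real regression fit (a real run refits the model for each subset):

```python
from itertools import combinations

def best_subsets(variables, rss_of):
    """For each subset size, keep the variable subset with the lowest RSS
    (the constant term is implicit in every model)."""
    return {k: min(combinations(variables, k), key=rss_of)
            for k in range(1, len(variables) + 1)}

# Toy stand-in: pretend each variable removes a fixed amount of error.
gain = {"LSTAT": 40, "RM": 25, "PTRATIO": 10, "DIS": 5, "NOX": 3}
toy_rss = lambda subset: 100 - sum(gain[v] for v in subset)

winners = best_subsets(list(gain), toy_rss)   # LSTAT enters first, then RM, ...
```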
The Regression Model

Predictor (Indep. Var.)  Coefficient  Std. Error  p-value  SS
Constant                 42.8367      7.1766      0.0000   126430.6016
NOX                     -21.7852      4.6042      0.0000     3404.4565
RM                        3.7503      0.6177      0.0000     6583.3579
DIS                      -1.4072      0.2535      0.0000      211.6853
PTRATIO                  -1.0086      0.1747      0.0000     1453.9551
LSTAT                    -0.5907      0.0696      0.0000     2060.2676

Residual df 247.0000 | Multiple R-squared 0.6601 | Std. Dev. Estimate 5.3467 | Residual SS 7061.1646

XLMiner : Multiple Linear Regression - Prediction of Validation Data
Data range Data_Partition1!$C$273:$P$424 (Back to Navigator)
RMSErr = 4.9355 | %RMSErr = 21.5% | MaxAbsErr = 20.33 | AvAbsErr = 3.57 | AvMEDV = 22.9645

SqErr      Predicted Value  Actual Value  NOX     RM      DIS     PTRATIO  LSTAT    AbsErr
0.8439637  22.0187          21.1          0.4640  5.8560  4.4290  18.6000  13.0000  0.92
0.2196854  32.8687          32.4          0.4470  6.7580  4.0776  17.6000   3.5300  0.47
0.2137043  25.4623          25            0.4890  6.1820  3.9454  18.6000   9.4700  0.46
6.6637521  31.0814          28.5          0.4110  6.8610  5.1167  19.2000   3.3300  2.58
4.0947798  22.4236          20.4          0.5470  5.8720  2.4775  17.8000  15.3700  2.02
18.224484  24.5690          20.3          0.5440  5.9720  3.1025  18.4000   9.9700  4.27
0.3253246  23.4704          22.9          0.5240  6.0120  5.5605  15.2000  12.4300  0.57
51.86411   14.6983          21.9          0.7180  4.9630  1.7523  20.2000  14.0000  7.20
28
%AvAbsErr = 15.6%

AbsErr  Freq
0        0
2       61
4       40
6       25
8       10
10       9
12       2
14       3
16       0
18       0
20       1
22       1

[Histogram: Frequency in Validation Dataset (0-70) vs. AbsErr (0-22).]
29
Prediction
K-Nearest Neighbors

Aim: To estimate median residential
property value for a census tract

30
XLMiner : K-Nearest Neighbors Prediction
Data
Source data worksheet Data_Partition1
Training data used for building the model Data_Partition1!$C$19:$Q$322
Validation data Data_Partition1!$C$323:$Q$524
# cases in the training data set 304
# cases in the validation data set 202
Normalization TRUE
# nearest neighbors (k) 1

Variables
Input variables NOX RM DIS PTRATIO LSTAT
Output variable MEDV

31
Parameters/Options
# Nearest neighbors 1

Training Data scoring - Summary Report
Total sum of squared errors 0 | RMS Error 0 | Average Error 0

Validation Data scoring - Summary Report
Total sum of squared errors 3314 | RMS Error 4.669 | Average Error 0.805

Test Data scoring - Summary Report
Total sum of squared errors 3895 | RMS Error 6.210 | Average Error -0.450

# Records: training 253, validation 152, test 101

Timings
Overall (secs) 3.00


32
Validation Data prediction details

Row Id.  Predicted Value  Actual Value  Residual  #Nearest Neighbors  CRIM     ZN    INDUS  CHAS  NOX
3        28.70            34.70          6.00     1                   0.02729   0    7.07   0     0.469
9        14.40            16.50          2.10     1                   0.21124  12.5  7.87   0     0.524
13       22.90            21.70         -1.20     1                   0.09378  12.5  7.87   0     0.524
15       19.60            18.20         -1.40     1                   0.63796   0    8.14   0     0.538
16       20.40            19.90         -0.50     1                   0.62739   0    8.14   0     0.538
20       20.40            18.20         -2.20     1                   0.7258    0    8.14   0     0.538
25       16.60            15.60         -1.00     1                   0.75026   0    8.14   0     0.538
29       19.60            18.40         -1.20     1                   0.77299   0    8.14   0     0.538

33
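The k-nearest-neighbors prediction above can be sketched as follows. Z-scoring is an assumption about what Normalization = TRUE does, and the data are toy stand-ins for the (NOX, RM, ...) inputs and the MEDV output:

```python
import math
import statistics

def normalize(train, query):
    """Z-score each column using training means/stds."""
    cols = list(zip(*train))
    mu = [statistics.mean(c) for c in cols]
    sd = [statistics.pstdev(c) or 1.0 for c in cols]   # guard constant columns
    z = lambda row: [(v - m) / s for v, m, s in zip(row, mu, sd)]
    return [z(r) for r in train], z(query)

def knn_predict(train_x, train_y, query, k=1):
    """Predict a continuous target as the mean of the k nearest neighbors."""
    tx, q = normalize(train_x, query)
    ranked = sorted(range(len(tx)), key=lambda i: math.dist(tx[i], q))
    return statistics.mean(train_y[i] for i in ranked[:k])

train_x = [[1, 1], [10, 10], [20, 20]]
train_y = [10, 20, 30]
```

Note that at k = 1 every training record is its own nearest neighbor, which is why the training summary report above shows zero error.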
Classification
Classification Tree

Aim: To classify census tracts into
high and low residential property
value classes

34
Boston Housing Data
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV HIGHCLASS
0.00632 18 2.31 0 0.54 6.575 65.2 4.09 1 296 15.3 396.9 4.98 24 0
0.02731 0 7.07 0 0.47 6.421 78.9 4.97 2 242 17.8 396.9 9.14 21.6 0
0.02729 0 7.07 0 0.47 7.185 61.1 4.97 2 242 17.8 392.83 4.03 34.7 1
0.03237 0 2.18 0 0.46 6.998 45.8 6.06 3 222 18.7 394.63 2.94 33.4 1
0.06905 0 2.18 0 0.46 7.147 54.2 6.06 3 222 18.7 396.9 5.33 36.2 1
0.02985 0 2.18 0 0.46 6.43 58.7 6.06 3 222 18.7 394.12 5.21 28.7 0
0.08829 13 7.87 0 0.52 6.012 66.6 5.56 5 311 15.2 395.6 12.43 22.9 0
0.14455 13 7.87 0 0.52 6.172 96.1 5.95 5 311 15.2 396.9 19.15 27.1 0
0.21124 13 7.87 0 0.52 5.631 100 6.08 5 311 15.2 386.63 29.93 16.5 0
0.17004 13 7.87 0 0.52 6.004 85.9 6.59 5 311 15.2 386.71 17.1 18.9 0
0.22489 13 7.87 0 0.52 6.377 94.3 6.35 5 311 15.2 392.52 20.45 15 0
0.11747 13 7.87 0 0.52 6.009 82.9 6.23 5 311 15.2 396.9 13.27 18.9 0
0.09378 13 7.87 0 0.52 5.889 39 5.45 5 311 15.2 390.5 15.71 21.7 0

35
Training Log

Growing the Tree


#Nodes Error
0 13.82
1 3.45
2 2.97
3 0.67
4 0.65
5 0.56
6 0.2
7 0.14
8 0.06
9 0.05
10 0.05
11 0.04
12 0.02
13 0.01
14 0.01
15 0

Validation Misclassification Summary

Classification Confusion Matrix
                Predicted Class
Actual Class      0    1
0               152    6
1                 8   36

Error Report
Class    # Cases  # Errors  % Error
0        158       6         3.80
1         44       8        18.18
Overall  202      14         6.93
36
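The error report is derived mechanically from the confusion matrix. A sketch (not XLMiner's code) that reproduces the validation summary above:

```python
def error_report(actual, predicted, classes=(0, 1)):
    """Confusion matrix plus per-class and overall error rates."""
    confusion = {(a, p): 0 for a in classes for p in classes}
    for a, p in zip(actual, predicted):
        confusion[(a, p)] += 1
    report = {}
    for c in classes:
        n = sum(confusion[(c, p)] for p in classes)
        errors = n - confusion[(c, c)]                 # off-diagonal = mistakes
        report[c] = (n, errors, 100.0 * errors / n if n else 0.0)
    wrong = sum(e for (a, p), e in confusion.items() if a != p)
    report["Overall"] = (len(actual), wrong, 100.0 * wrong / len(actual))
    return confusion, report

# Reproduce the slide: 6 false positives, 8 false negatives on 202 cases.
actual    = [0] * 158 + [1] * 44
predicted = [0] * 152 + [1] * 6 + [0] * 8 + [1] * 36
confusion, report = error_report(actual, predicted)
```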
XLMiner : Classification Tree - Prune Log

Back to Navigator

# Decision Nodes  Error
15  0.0792
14  0.0644
13  0.0644
12  0.0644
11  0.0644
10  0.0644  <-- Minimum Error Prune (Std. Err. 0.0172708)
 9  0.0743
 8  0.0743
 7  0.0743
 6  0.0693
 5  0.0693
 4  0.0693
 3  0.0693  <-- Best Prune
 2  0.099
 1  0.2079

37
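The two marked trees follow a minimum-error rule and a within-one-standard-error rule. A sketch under the assumption that the standard error is the binomial formula, which does reproduce the 0.0172708 in the prune log:

```python
import math

def choose_pruned_trees(errors, n_validation):
    """Return the minimum-error tree size, the 'best pruned' size (smallest
    tree within one standard error of the minimum), and the standard error."""
    min_size = min(errors, key=lambda s: (errors[s], s))   # ties -> smaller tree
    e = errors[min_size]
    se = math.sqrt(e * (1 - e) / n_validation)             # binomial std. error
    best = min(s for s in errors if errors[s] <= e + se)
    return min_size, best, se

# Validation error rates from the prune log above (202 validation cases):
log = {15: 0.0792, 14: 0.0644, 13: 0.0644, 12: 0.0644, 11: 0.0644,
       10: 0.0644, 9: 0.0743, 8: 0.0743, 7: 0.0743, 6: 0.0693,
       5: 0.0693, 4: 0.0693, 3: 0.0693, 2: 0.099, 1: 0.2079}
min_size, best_size, se = choose_pruned_trees(log, 202)
```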
Classification Tree : Full Tree

Back to Navigator

[Tree diagram: the root splits on RM at 6.5505 (228 vs. 76 cases); subsequent splits use DIS, RM, CRIM, LSTAT, PTRATIO, ZN, and TAX, with leaf nodes labeled class 0 or 1. The diagram does not survive text extraction.]
38
Classification Tree : Best Pruned Tree

Back to Navigator

[Tree diagram: the root splits on RM at 6.5505 (136 vs. 66 cases). The left branch (67.3% of cases) is class 0. The right branch splits on RM at 6.79: its left side (7.92%) is class 0, and its right side splits on PTRATIO at 19.45 into class 1 (21.7%, 44 cases) and class 0 (2.97%, 6 cases).]
39
Classification Tree : Minimum Error Tree

Back to Navigator

[Tree diagram: the root splits on RM at 6.5505 (136 vs. 66 cases); further splits use DIS, CRIM, LSTAT, RM, PTRATIO, and TAX, with leaf nodes labeled class 0 or 1. The diagram does not survive text extraction.]
40
Classification
Neural Network

Aim: To classify census tracts into
high and low residential property
value classes

41
XLMiner : Neural Network Classification

Epochs Information

Number of Epochs 30
Accumulated Trials 9120
Trials by class: Class 0 = 7860, Class 1 = 1260

Architecture
Number of hidden layers 1
Hidden Layer 1
# Nodes 25
Step size for gradient descent 0.1000
Weight change momentum 0.6000
Weight decay 0.0000
Cost Function Squared Error
Hidden layer sigmoid Standard
Output layer sigmoid Standard

42
Training Data scoring - Summary Report

Cutoff Prob. Val. for Success (Updatable) 0.5

Classification Confusion Matrix
                Predicted Class
Actual Class      1    0
1                40   11
0                 4  249

Error Report
Class    # Cases  # Errors  % Error
1         51      11        21.57
0        253       4         1.58
Overall  304      15         4.93

Validation Data scoring - Summary Report

Cutoff Prob. Val. for Success (Updatable) 0.5

Classification Confusion Matrix
                Predicted Class
Actual Class      1    0
1                26    7
0                 1  168

Error Report
Class    # Cases  # Errors  % Error
1         33       7        21.21
0        169       1         0.59
Overall  202       8         3.96
43
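The updatable cutoff simply thresholds the network's success probability. A sketch; whether the boundary case uses >= or > is an assumption:

```python
def classify_with_cutoff(probs, cutoff=0.5):
    """Predict class 1 when the success probability meets the cutoff."""
    return [1 if p >= cutoff else 0 for p in probs]

# Raising the cutoff trades class-1 errors against class-0 errors,
# which is why XLMiner lets you update it after training.
probs = [0.91, 0.40, 0.55, 0.08]
```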
Lift chart (validation dataset)

[Chart: cumulative HIGHV when cases are sorted using predicted values vs. cumulative HIGHV using the average, plotted against # cases (0-300).]

Decile-wise lift chart (validation dataset)

[Chart: decile mean / global mean for deciles 1-10.]

44
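The decile-wise lift chart above can be computed as follows. An illustrative sketch with made-up scores, not XLMiner's code:

```python
def decile_lift(actual, scores):
    """Rank cases by predicted score (best first) and compare each
    decile's mean response to the global mean."""
    ranked = [a for _, a in sorted(zip(scores, actual), key=lambda t: -t[0])]
    n = len(ranked)
    global_mean = sum(ranked) / n
    lifts = []
    for d in range(10):
        chunk = ranked[n * d // 10: n * (d + 1) // 10]
        lifts.append((sum(chunk) / len(chunk)) / global_mean)
    return lifts

# A perfect ranking: the 4 positive cases get the 4 highest scores.
actual = [1] * 4 + [0] * 16
scores = list(range(20, 0, -1))
lifts = decile_lift(actual, scores)
```

A model with no predictive power gives lift near 1 in every decile; the slide's chart shows lift concentrated in the first deciles.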
Data Reduction and Exploration
Hierarchical Clustering

Aim: To cluster electric utilities into
similar groups

45
Utilities Data
seq# x1 x2 x3 x4 x5 x6 x7 x8
Arizona 1 1.06 9.2 151 54.4 1.6 9077 0 0.628
Boston 2 0.89 10.3 202 57.9 2.2 5088 25.3 1.555
Central 3 1.43 15.4 113 53 3.4 9212 0 1.058
Common 4 1.02 11.2 168 56 0.3 6423 34.3 0.7
Consolid 5 1.49 8.8 192 51.2 1 3300 15.6 2.044
Florida 6 1.32 13.5 111 60 -2.2 11127 22.5 1.241
Hawaiian 7 1.22 12.2 175 67.6 2.2 7642 0 1.652
Idaho 8 1.1 9.2 245 57 3.3 13082 0 0.309
Kentucky 9 1.34 13 168 60.4 7.2 8406 0 0.862
Madison 10 1.12 12.4 197 53 2.7 6455 39.2 0.623
Nevada 11 0.75 7.5 173 51.5 6.5 17441 0 0.768
NewEngla 12 1.13 10.9 178 62 3.7 6154 0 1.897
Northern 13 1.15 12.7 199 53.7 6.4 7179 50.2 0.527
Oklahoma 14 1.09 12 96 49.8 1.4 9673 0 0.588
Pacific 15 0.96 7.6 164 62.2 -0.1 6468 0.9 1.4
Puget 16 1.16 9.9 252 56 9.2 15991 0 0.62
SanDiego 17 0.76 6.4 136 61.9 9 5714 8.3 1.92
Southern 18 1.05 12.6 150 56.7 2.7 10140 0 1.108
Texas 19 1.16 11.7 104 54 -2.1 13507 0 0.636
Wisconsi 20 1.2 11.8 148 59.9 3.5 7287 41.1 0.702
United 21 1.04 8.6 204 61 3.5 6650 0 2.116
Virginia 22 1.07 9.3 174 54.3 5.9 10093 26.6 1.306

46
Dendrogram (Data range: T12-5!$M$3:$U$24, Method: Single linkage)

[Dendrogram with distance (0-3.5) on the vertical axis; leaf order:
1 18 14 19 9 2 4 10 13 20 7 12 21 15 22 6 3 8 16 17 11 5]

47
Predicted Clusters Back to Navigator

Cluster id. x1 x2 x3 x4 x5 x6 x7 x8
1 1.06 9.2 151 54.4 1.6 9077 0 0.628
1 0.89 10.3 202 57.9 2.2 5088 25.3 1.555
1 1.43 15.4 113 53 3.4 9212 0 1.058
1 1.02 11.2 168 56 0.3 6423 34.3 0.7
2 1.49 8.8 192 51.2 1 3300 15.6 2.044
1 1.32 13.5 111 60 -2.2 11127 22.5 1.241
1 1.22 12.2 175 67.6 2.2 7642 0 1.652
1 1.1 9.2 245 57 3.3 13082 0 0.309
1 1.34 13 168 60.4 7.2 8406 0 0.862
1 1.12 12.4 197 53 2.7 6455 39.2 0.623
3 0.75 7.5 173 51.5 6.5 17441 0 0.768
1 1.13 10.9 178 62 3.7 6154 0 1.897
1 1.15 12.7 199 53.7 6.4 7179 50.2 0.527
1 1.09 12 96 49.8 1.4 9673 0 0.588
1 0.96 7.6 164 62.2 -0.1 6468 0.9 1.4
1 1.16 9.9 252 56 9.2 15991 0 0.62
4 0.76 6.4 136 61.9 9 5714 8.3 1.92
1 1.05 12.6 150 56.7 2.7 10140 0 1.108
1 1.16 11.7 104 54 -2.1 13507 0 0.636
1 1.2 11.8 148 59.9 3.5 7287 41.1 0.702
1 1.04 8.6 204 61 3.5 6650 0 2.116
1 1.07 9.3 174 54.3 5.9 10093 26.6 1.306

48
Dendrogram (Data range: T12-5!$M$3:$U$24, Method: Complete linkage)

[Dendrogram with distance (0-4) on the vertical axis; leaf order:
1 18 14 19 6 3 9 2 22 4 20 10 13 5 7 12 21 15 17 8 16 11]

49
Predicted Clusters Back to Navigator

Cluster id. x1 x2 x3 x4 x5 x6 x7 x8
1 1.06 9.2 151 54.4 1.6 9077 0 0.628
2 0.89 10.3 202 57.9 2.2 5088 25.3 1.555
1 1.43 15.4 113 53 3.4 9212 0 1.058
2 1.02 11.2 168 56 0.3 6423 34.3 0.7
2 1.49 8.8 192 51.2 1 3300 15.6 2.044
1 1.32 13.5 111 60 -2.2 11127 22.5 1.241
3 1.22 12.2 175 67.6 2.2 7642 0 1.652
4 1.1 9.2 245 57 3.3 13082 0 0.309
1 1.34 13 168 60.4 7.2 8406 0 0.862
2 1.12 12.4 197 53 2.7 6455 39.2 0.623
4 0.75 7.5 173 51.5 6.5 17441 0 0.768
3 1.13 10.9 178 62 3.7 6154 0 1.897
2 1.15 12.7 199 53.7 6.4 7179 50.2 0.527
1 1.09 12 96 49.8 1.4 9673 0 0.588
3 0.96 7.6 164 62.2 -0.1 6468 0.9 1.4
4 1.16 9.9 252 56 9.2 15991 0 0.62
3 0.76 6.4 136 61.9 9 5714 8.3 1.92
1 1.05 12.6 150 56.7 2.7 10140 0 1.108
1 1.16 11.7 104 54 -2.1 13507 0 0.636
2 1.2 11.8 148 59.9 3.5 7287 41.1 0.702
3 1.04 8.6 204 61 3.5 6650 0 2.116
2 1.07 9.3 174 54.3 5.9 10093 26.6 1.306

50
Predicted Clusters (sorted)

Cluster id. x1 x2 x3 x4 x5 x6 x7 x8
1 1.06 9.2 151 54.4 1.6 9077 0 0.628
1 1.43 15.4 113 53 3.4 9212 0 1.058
1 1.32 13.5 111 60 -2.2 11127 22.5 1.241
1 1.34 13 168 60.4 7.2 8406 0 0.862
1 1.09 12 96 49.8 1.4 9673 0 0.588
1 1.05 12.6 150 56.7 2.7 10140 0 1.108
1 1.16 11.7 104 54 -2.1 13507 0 0.636
2 0.89 10.3 202 57.9 2.2 5088 25.3 1.555
2 1.02 11.2 168 56 0.3 6423 34.3 0.7
2 1.49 8.8 192 51.2 1 3300 15.6 2.044
2 1.12 12.4 197 53 2.7 6455 39.2 0.623
2 1.15 12.7 199 53.7 6.4 7179 50.2 0.527
2 1.2 11.8 148 59.9 3.5 7287 41.1 0.702
2 1.07 9.3 174 54.3 5.9 10093 26.6 1.306
3 1.22 12.2 175 67.6 2.2 7642 0 1.652
3 1.13 10.9 178 62 3.7 6154 0 1.897
3 0.96 7.6 164 62.2 -0.1 6468 0.9 1.4
3 0.76 6.4 136 61.9 9 5714 8.3 1.92
3 1.04 8.6 204 61 3.5 6650 0 2.116
4 1.1 9.2 245 57 3.3 13082 0 0.309
4 0.75 7.5 173 51.5 6.5 17441 0 0.768
4 1.16 9.9 252 56 9.2 15991 0 0.62

Means
Cluster 1 1.21 12.5 128 55.5 1.7 10163 3.2 0.874
Cluster 2 1.13 10.9 183 55.1 3.1 6546 33.2 1.065
Cluster 3 1.02 9.1 171 62.9 3.7 6526 1.8 1.797
Cluster 4 1.00 8.9 223 54.8 6.3 15505 0.0 0.566

51
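Single-linkage agglomerative clustering, as used for the first dendrogram, can be sketched naively as below. The points are toy 2-D stand-ins; XLMiner presumably standardizes the x1-x8 variables first, which is omitted here:

```python
import math

def single_linkage(points, k):
    """Agglomerative clustering: repeatedly merge the two clusters whose
    closest pair of members is nearest, until k clusters remain."""
    clusters = [[i] for i in range(len(points))]
    dist = lambda i, j: math.dist(points[i], points[j])
    link = lambda a, b: min(dist(i, j) for i in a for j in b)
    while len(clusters) > k:
        ai, bi = min(((x, y) for x in range(len(clusters))
                      for y in range(x + 1, len(clusters))),
                     key=lambda p: link(clusters[p[0]], clusters[p[1]]))
        clusters[ai] += clusters.pop(bi)    # merge the closest pair
    return clusters

points = [(0, 0), (0, 1), (10, 10), (10, 11), (50, 50)]
clusters = single_linkage(points, k=3)
```

Complete linkage (the second dendrogram) differs only in using `max` instead of `min` inside `link`, which is why the two methods group the utilities differently.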
Affinity
Association Rules
(Market Basket Analysis)

Aim: to identify types of books that


are likely to be bought by customers
based on past purchases of books
52
Book purchase data: 2000 customers, one 0/1 column per category
(ChildBks, YouthBks, CookBks, DoItYBks, RefBks, ArtBks, GeogBks,
ItalCook, ItalAtlas, ItalArt, Florence)
0 1 0 1 0 0 1 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0
1 1 1 0 1 0 1 0 0 0 0
0 0 1 0 0 0 1 0 0 0 0
1 0 0 0 0 1 0 0 0 0 1
0 1 0 0 0 0 0 0 0 0 0
0 1 0 0 1 0 0 0 0 0 0
1 0 0 1 0 0 0 0 0 0 0
1 1 1 0 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0
0 0 1 0 0 0 1 0 0 0 0
1 0 0 0 0 1 0 0 0 0 1
1 1 0 1 1 1 0 0 1 1 0
1 1 1 0 0 0 0 0 0 0 0
1 1 1 0 0 0 1 0 0 0 0
0 0 1 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 0 0 0 0
1 1 1 0 0 1 0 0 0 0 1

53
XLMiner : Association Rules

Data
Input Data Sheet1!$A$1:$K$2001
Data Format Binary Matrix
Min. Support 200
Min. Conf. % 70
# Rules 19

Rule # Conf. % Antecedent (a) Consequent (c) Support(a) Support(c) Support(a U c) Lift Ratio

1 100 ItalCook => CookBks 227 862 227 2.32
2 82.19 DoItYBks, ArtBks=> CookBks 247 862 203 1.91
3 81.89 DoItYBks, GeogBks=> CookBks 265 862 217 1.90
4 80.33 CookBks, RefBks=> ChildBks 305 846 245 1.90
5 80 ArtBks, GeogBks=> ChildBks 255 846 204 1.89
6 81.18 ArtBks, GeogBks=> CookBks 255 862 207 1.88
7 79.63 YouthBks, CookBks=> ChildBks 324 846 258 1.88
8 80.86 ChildBks, RefBks=> CookBks 303 862 245 1.88
9 78.87 DoItYBks, GeogBks=> ChildBks 265 846 209 1.86
10 79.35 ChildBks, DoItYBks=> CookBks 368 862 292 1.84
11 77.87 CookBks, DoItYBks=> ChildBks 375 846 292 1.84
12 77.66 CookBks, GeogBks=> ChildBks 385 846 299 1.84
13 78.18 ChildBks, YouthBks=> CookBks 330 862 258 1.81
14 77.85 ChildBks, ArtBks=> CookBks 325 862 253 1.81
15 75.75 CookBks, ArtBks=> ChildBks 334 846 253 1.79
16 76.67 ChildBks, GeogBks=> CookBks 390 862 299 1.78
17 70.65 GeogBks=> ChildBks 552 846 390 1.67
18 70.63 RefBks=> ChildBks 429 846 303 1.67
19 71.1 RefBks=> CookBks 429 862 305 1.65
54
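Confidence and lift in the table follow directly from the support counts. A sketch, reading the Support(a U c) column as the joint support of antecedent and consequent:

```python
def rule_stats(n_transactions, support_a, support_c, support_ac):
    """Confidence (%) and lift ratio for the rule a => c."""
    confidence = support_ac / support_a
    lift = confidence / (support_c / n_transactions)   # vs. buying c at random
    return 100 * confidence, lift

# Rule 1 above, ItalCook => CookBks: every customer who bought an
# Italian cookbook also bought a cookbook, so confidence is 100%.
conf, lift = rule_stats(2000, support_a=227, support_c=862, support_ac=227)
```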
Some Utilities
Sampling from worksheets and databases
Database scoring
Graphics
Binning

55
Simple
Random
Sampling

56
Stratified
Random
Sampling

57
Scoring to
databases and
worksheets

58
Binning
continuous
variables

59
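Equal-count binning, one common way a binning utility works, can be sketched as follows. Whether XLMiner bins by rank in exactly this way is an assumption:

```python
def equal_count_bins(values, n_bins):
    """Assign each value a bin id (1..n_bins) so the bins have
    roughly equal counts, by ranking the values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values) + 1
    return bins

# e.g. binning a few TAX values into 3 equal-count bins:
binned_tax = equal_count_bins([296, 242, 242, 222, 311, 307], 3)
```

The alternative, equal-interval binning, would divide the value range rather than the ranks into n_bins pieces.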
Missing Data

60
Graphics: Boston Housing data

[Box plot and histogram of AGE: frequency (0-180) vs. AGE (0-100).]

61
[Box plot and histogram of RM: frequency (0-250) vs. RM (3-9.6).]

62
Matrix Plot

[Scatterplot matrix of RM, AGE, and TAX. High tax towns have fewer rooms on average?]
63
RM Box Plot

[Box plot of RM (0-10) by Binned_TAX (bins 1-5).]

64
Future Extensions
Cross Validation
Bootstrap, Bagging and Boosting
Error-based clustering
Time Series and Sequences
Support Vector Machines
Collaborative Filtering

65
In Conclusion
XLMiner is a modern tool-belt for data mining. It
is an affordable, easy-to-use tool for consultants,
MBAs and business analysts to learn, create and
deploy data mining methods.
More generally, XLMiner is a tool for data
analysis in Excel that uses classical and modern,
computationally-intensive techniques.

66
