
Data Mining in Excel Using

XLMiner

Nitin R. Patel
Cytel Software and M.I.T. Sloan

1
Contact Info
XLMiner is distributed by Resampling
Stats, Inc.
www.xlminer.net
Contact Peter Bruce: pbruce@resample.com
703-522-2713

2
What is XLMiner?
XLMiner is an affordable, easy-to-use tool for
business analysts, consultants and business
students to:
learn strengths and weaknesses of data mining methods,
prototype large scale data mining applications,
implement medium scale data mining applications.
More generally, XLMiner is a tool for data
analysis in Excel that uses classical and modern,
computationally-intensive techniques.
3
Available Data Mining Software
Application-specific: aimed at providing
solutions to end-users for common tasks
(e.g. Unica for Customer Relationship
Management, Urban Science for location
and distribution)
Technique-specific: focused on a few data
mining methods (e.g. CART from Salford
Systems, Neural Nets from HNC
Software)

4
TECHNIQUE-SPECIFIC PRODUCTS (Source: Elder Research)

[Matrix of algorithms vs. products, with "x" marking the algorithms each product covers. Algorithms: Class. & Regr. Trees, Linear Regression, Multilayer Neural Net, K-Nearest Neighbors, Radial Basis Fns., Naïve Bayes, Rule Induction, Logistic Regression, Time Series, Sequential Rules, K-Means, Association Rules, Kohonen Net. Products: CART (Salford), NeuroShell, WizWhy, Cognos, See5; each covers only one or a few of the algorithms.]
Available Data Mining Software
Horizontal products: designed for data
mining analysts (e.g. SAS Enterprise
Miner, SPSS Clementine, IBM Intelligent
Miner, NCR Teraminer, S-Plus Insightful
Miner, Darwin/Oracle)
Powerful, comprehensive, easy-to-use; but
Need substantial learning effort
Expensive
6
HORIZONTAL PRODUCTS (Source: Elder Research)

[Matrix of algorithms vs. products, with "x" marking the algorithms each product covers. Algorithms: Class. & Regr. Trees, Linear Regression, Multilayer Neural Net, K-Nearest Neighbors, Radial Basis Fns., Naïve Bayes, Rule Induction, Logistic Regression, Time Series, Sequential Rules, K-Means, Association Rules, Kohonen Net. Coverage: Enterprise Miner (SAS) 9 algorithms, Clementine (SPSS) 7, Intelligent Miner (IBM) 8, MineSet (SGI) 5, Darwin (Oracle) 4, PRW (Unica) 6.]
7
Desiderata for Data Mining and
Modern Data Analysis Software
Easy-to-use
Data import (e.g. cross-platform, various databases)
Data handling (e.g. data partitioning, scoring)
Invoking and experimenting with procedures
Comprehensive Range of Procedures:
Statistics (e.g. Regression, Multivariate procedures)
Machine learning (e.g. Neural Nets, Classification
Trees)
Database (e.g. Association Rules)
8
XLMiner is Unique
Low cost,
Comprehensive set of data mining models and
algorithms that includes statistical, machine
learning and database methods,
Based on prototype used in three years of MBA
courses on data mining at Sloan School, M.I.T.
Focus on business applications: Book of lecture
notes and cases in preparation (first draft available
for examination).

9
Why Data Mining in Excel?

Leverage the familiarity of MBA students,
managers and business analysts with the
interface and functionality of Excel to
provide them with hands-on experience in
data mining.

10
Advantages
Low learning hurdle
Promotes understanding of strengths and
weaknesses of different data mining techniques
and processes
Enables interactive analysis of data (important in
early stages of model building)
Facilitates incorporation of domain knowledge
(often key to successful applications) by
empowering end-users to participate actively in
data mining projects
Enables pre-processing of data and post-
processing of results using Excel functions,
reporting in Word, presentations in PowerPoint
11
Advantages (cont.)
Supports communication between data miners and
end-users
Supports smooth transition from prototyping to
custom solution development (VB and VBA)
Emphasizes openness
enables integration with other analytic software for
optimization (Solver), simulation (Crystal Ball),
numerical methods;
interface modifications (e.g. custom forms and outputs)
solution-specific routines (VBA)
Examples:
Boston Celtics analysis of player statistics
Clustering for improving forecasts, optimizing price
markdowns.
12
Size Limitations
An Excel worksheet cannot exceed 65,536 rows.
If data records are stored as rows in a single
worksheet this is the largest data set that can be
accommodated. The number of variables cannot
exceed 256 (the number of columns).
These limits do not apply to deployment of model
to score large databases.
If Excel is used as a view-port into a database such
as Access, MS SQL Server, Oracle or SAS, these
limits do not apply.

13
Sampling
Practical Data Mining Methodologies such
as SEMMA (SAS) and CRISP-DM (SPSS
and European Industry Standard)
recommend working with a sample
(typically 10,000 random cases) in the
model and algorithm selection phase. This
facilitates interactive development of data
mining models.
14
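The sample-then-model step above can be sketched in a few lines of plain Python. This is an illustrative sketch, not XLMiner's implementation; the function name and seed are made up:

```python
import random

def draw_sample(records, n=10_000, seed=12345):
    """Simple random sample of up to n records for the model-building phase."""
    if len(records) <= n:
        return list(records)
    rng = random.Random(seed)         # fixed seed: the sample is reproducible
    return rng.sample(records, n)     # sampling without replacement

population = list(range(100_000))     # stand-in for a large database table
sample = draw_sample(population)
```

Model and algorithm selection then runs against `sample` instead of the full table.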
XLMiner
Free 30 day trial version: limit is 200 records per
partition.
Education version: limit is 2,000 records per
partition, so maximum size for a data set is 6,000
records.
Standard version (currently in beta test; will be
available by end of August):
Up to 60,000 records obtained by drawing samples
from large databases in accordance with SAS's
SEMMA (Sample, Explore, Model, Measure, Apply)
methodology. Training data restricted to 10,000 records
Sampling from and scoring to Access databases (later
SQL Server, Oracle, SAS)
15
Data Mining Procedures in
XLMiner
Partitioning data sets (into Training, Validation,
and Test data sets)
Scoring of training, validation, test and other data
Prediction (of a continuous variable)
Classification
Data reduction and exploration
Affinity
Utilities: Sampling, graphics, missing data,
binning, creation of dummy variables
16
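Random partitioning of this kind can be sketched as follows. An illustrative Python sketch, not XLMiner's code; the 50/30/20 fractions and the rounding rule are assumptions chosen to match the Boston Housing example later in the deck:

```python
import random

def partition(rows, fractions=(0.5, 0.3, 0.2), seed=81801):
    """Randomly partition rows into training, validation, and test sets."""
    rng = random.Random(seed)
    shuffled = rows[:]
    rng.shuffle(shuffled)
    n_train = round(len(shuffled) * fractions[0])
    n_valid = round(len(shuffled) * fractions[1])
    return (shuffled[:n_train],
            shuffled[n_train:n_train + n_valid],
            shuffled[n_train + n_valid:])

# 506 Boston Housing tracts -> 253 / 152 / 101 with a 50/30/20 split
train, valid, test = partition(list(range(506)))
```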
Prediction
Multiple Linear Regression with subset
selection, residual analysis, and collinearity
diagnostics.
K-Nearest Neighbors
Regression Tree
Neural Net

17
Classification
Logistic Regression with subset selection,
residual analysis, and collinearity diagnostics
Discriminant Analysis
K-Nearest Neighbors
Classification Tree
Naïve Bayes
Neural Networks
18
Data Reduction and Exploration
Principal Components
K-Means Clustering
Hierarchical Clustering

19
Affinity
Association Rules (Market Basket Analysis)

20
Partitioning

Aim: To construct training,
validation, and test data sets from
Boston Housing data
21
22
Boston Housing Data
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
0.00632 18 2.31 0 0.538 6.575 65.2 4.09 1 296 15.3 397 4.98 24
0.02731 0 7.07 0 0.469 6.421 78.9 4.97 2 242 17.8 397 9.14 21.6
0.02729 0 7.07 0 0.469 7.185 61.1 4.97 2 242 17.8 393 4.03 34.7
0.03237 0 2.18 0 0.458 6.998 45.8 6.06 3 222 18.7 395 2.94 33.4
0.06905 0 2.18 0 0.458 7.147 54.2 6.06 3 222 18.7 397 5.33 36.2
0.02985 0 2.18 0 0.458 6.43 58.7 6.06 3 222 18.7 394 5.21 28.7
0.08829 13 7.87 0 0.524 6.012 66.6 5.56 5 311 15.2 396 12.43 22.9
0.14455 13 7.87 0 0.524 6.172 96.1 5.95 5 311 15.2 397 19.15 27.1
0.21124 13 7.87 0 0.524 5.631 100 6.08 5 311 15.2 387 29.93 16.5
0.17004 13 7.87 0 0.524 6.004 85.9 6.59 5 311 15.2 387 17.1 18.9
0.22489 13 7.87 0 0.524 6.377 94.3 6.35 5 311 15.2 393 20.45 15
0.11747 13 7.87 0 0.524 6.009 82.9 6.23 5 311 15.2 397 13.27 18.9
0.09378 13 7.87 0 0.524 5.889 39 5.45 5 311 15.2 391 15.71 21.7
0.62976 0 8.14 0 0.538 5.949 61.8 4.71 4 307 21 397 8.26 20.4

23
XLMiner : Data Partition Sheet Date: 29-Jul-2003 13:50:09 (Ver: 1.2.0.1)

Output Navigator
Training Data Validation Data Test Data

Data
Data source housing!$A$2:$O$507

Selected variables CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B

Partitioning Method Randomly chosen


Random Seed 81801
# training rows 253
# validation rows 152
# test rows 101

Selected variables

CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
Row Id.
1 0.00632 18 2.31 0 0.538 6.575 65.2 4.09 1 296 15.3 396.9 4.98
2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.9 9.14
5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.9 5.33
6 0.02985 0 2.18 0 0.458 6.43 58.7 6.0622 3 222 18.7 394.12 5.21
7 0.08829 12.5 7.87 0 0.524 6.012 66.6 5.5605 5 311 15.2 395.6 12.43
8 0.14455 12.5 7.87 0 0.524 6.172 96.1 5.9505 5 311 15.2 396.9 19.15
10 0.17004 12.5 7.87 0 0.524 6.004 85.9 6.5921 5 311 15.2 386.71 17.1
12 0.11747 12.5 7.87 0 0.524 6.009 82.9 6.2267 5 311 15.2 396.9 13.27
14 0.62976 0 8.14 0 0.538 5.949 61.8 4.7075 4 307 21 396.9 8.26

3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
9 0.21124 12.5 7.87 0 0.524 5.631 100 6.0821 5 311 15.2 386.63 29.93
13 0.09378 12.5 7.87 0 0.524 5.889 39 5.4509 5 311 15.2 390.5 15.71

4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
11 0.22489 12.5 7.87 0 0.524 6.377 94.3 6.3467 5 311 15.2 392.52 20.45
17 1.05393 0 8.14 0 0.538 5.935 29.3 4.4986 4 307 21 386.85 6.58

24
Prediction
Multiple Linear Regression using
subset selection

Aim: To estimate median residential
property value for a census tract

25
The Regression Model

Input variables  Coefficient  Std. Error  p-value  SS
Constant term    32.677       7.444       0.000    128852
CRIM             -0.094       0.049       0.054      3566
ZN                0.055       0.020       0.007      2550
INDUS             0.030       0.091       0.742      1529
CHAS              2.836       1.199       0.019       645
NOX             -15.889       5.463       0.004       143
RM                3.872       0.597       0.000      4697
AGE               0.007       0.019       0.728         0
DIS              -1.405       0.292       0.000       938
RAD               0.358       0.097       0.000         1
TAX              -0.013       0.005       0.019       174
PTRATIO          -0.934       0.208       0.000       620
B                 0.014       0.004       0.000       502
LSTAT            -0.582       0.073       0.000      1623

Residual df 239 | Multiple R-squared 0.738 | Std. Dev. estimate 5.025 | Residual SS 6036

Training Data scoring - Summary Report
Total sum of squared errors 6036 | RMS Error 4.884 | Average Error 0.000

Validation Data scoring - Summary Report
Total sum of squared errors 2848 | RMS Error 4.329 | Average Error 0.066

Test Data scoring - Summary Report
Total sum of squared errors 2392 | RMS Error 4.866 | Average Error -1.019

# Records: training 253, validation 152, test 101
26
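The three summary reports are simple functions of the residuals. A minimal sketch (not XLMiner's code) of how the figures relate:

```python
import math

def scoring_summary(actual, predicted):
    """Total sum of squared errors, RMS error, and average error."""
    residuals = [a - p for a, p in zip(actual, predicted)]
    sse = sum(r * r for r in residuals)
    rms = math.sqrt(sse / len(residuals))
    avg = sum(residuals) / len(residuals)   # positive: model under-predicts on balance
    return sse, rms, avg

# Toy MEDV values; a real run would score all validation records.
sse, rms, avg = scoring_summary([24, 21.6, 34.7], [25, 20.6, 33.7])
```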
Subset selection (exhaustive enumeration)

Subset size  RSS         Cp        R-Squared  Adjusted R-Squared  Prob    Models (Constant present in all models)
2            19472.3789  362.7529  0.5441     0.5432              0.0000  Constant LSTAT
3            15439.3086  185.6474  0.6386     0.6371              0.0000  Constant RM LSTAT
4            13727.9863  111.6489  0.6786     0.6767              0.0000  Constant RM PTRATIO LSTAT
5            13228.9072  91.4852   0.6903     0.6878              0.0000  Constant RM DIS PTRATIO LSTAT
6            12469.3447  59.7537   0.7081     0.7052              0.0000  Constant NOX RM DIS PTRATIO LSTAT
7            12141.0723  47.1754   0.7158     0.7123              0.0000  Constant CHAS NOX RM DIS PTRATIO LSTAT

27
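Exhaustive enumeration fits every subset of each size and keeps the one with the lowest RSS. A sketch of the enumeration skeleton, with a made-up RSS function standing in for a real regression fit (a real run refits the model for each subset):

```python
from itertools import combinations

def best_subsets(variables, rss_of):
    """For each subset size, keep the variable subset with the lowest RSS
    (the constant term is implicit in every model)."""
    return {k: min(combinations(variables, k), key=rss_of)
            for k in range(1, len(variables) + 1)}

# Toy stand-in: pretend each variable removes a fixed amount of error.
gain = {"LSTAT": 40, "RM": 25, "PTRATIO": 10, "DIS": 5, "NOX": 3}
toy_rss = lambda subset: 100 - sum(gain[v] for v in subset)

winners = best_subsets(list(gain), toy_rss)   # LSTAT enters first, then RM, ...
```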
The Regression Model

Predictor (Indep. Var.)  Coefficient  Std. Error  p-value  SS
Constant                 42.8367      7.1766      0.0000   126430.6016
NOX                     -21.7852      4.6042      0.0000     3404.4565
RM                        3.7503      0.6177      0.0000     6583.3579
DIS                      -1.4072      0.2535      0.0000      211.6853
PTRATIO                  -1.0086      0.1747      0.0000     1453.9551
LSTAT                    -0.5907      0.0696      0.0000     2060.2676

Residual df 247.0000 | Multiple R-squared 0.6601 | Std. Dev. Estimate 5.3467 | Residual SS 7061.1646

XLMiner : Multiple Linear Regression - Prediction of Validation Data
Data range Data_Partition1!$C$273:$P$424 (Back to Navigator)
RMSErr = 4.9355 | %RMSErr = 21.5% | MaxAbsErr = 20.33 | AvAbsErr = 3.57 | AvMEDV = 22.9645

SqErr      Predicted Value  Actual Value  NOX     RM      DIS     PTRATIO  LSTAT    AbsErr
0.8439637  22.0187          21.1          0.4640  5.8560  4.4290  18.6000  13.0000  0.92
0.2196854  32.8687          32.4          0.4470  6.7580  4.0776  17.6000   3.5300  0.47
0.2137043  25.4623          25            0.4890  6.1820  3.9454  18.6000   9.4700  0.46
6.6637521  31.0814          28.5          0.4110  6.8610  5.1167  19.2000   3.3300  2.58
4.0947798  22.4236          20.4          0.5470  5.8720  2.4775  17.8000  15.3700  2.02
18.224484  24.5690          20.3          0.5440  5.9720  3.1025  18.4000   9.9700  4.27
0.3253246  23.4704          22.9          0.5240  6.0120  5.5605  15.2000  12.4300  0.57
51.86411   14.6983          21.9          0.7180  4.9630  1.7523  20.2000  14.0000  7.20
28
%AvAbsErr = 15.6%

AbsErr  Freq
0        0
2       61
4       40
6       25
8       10
10       9
12       2
14       3
16       0
18       0
20       1
22       1

[Histogram: Frequency in Validation Dataset (0-70) vs. AbsErr (0-22).]
29
Prediction
K-Nearest Neighbors

Aim: To estimate median residential
property value for a census tract

30
XLMiner : K-Nearest Neighbors Prediction
Data
Source data worksheet Data_Partition1
Training data used for building the model Data_Partition1!$C$19:$Q$322
Validation data Data_Partition1!$C$323:$Q$524
# cases in the training data set 304
# cases in the validation data set 202
Normalization TRUE
# nearest neighbors (k) 1

Variables
Input variables NOX RM DIS PTRATIO LSTAT
Output variable MEDV

31
Parameters/Options
# Nearest neighbors 1

Training Data scoring - Summary Report
Total sum of squared errors 0 | RMS Error 0 | Average Error 0

Validation Data scoring - Summary Report
Total sum of squared errors 3314 | RMS Error 4.669 | Average Error 0.805

Test Data scoring - Summary Report
Total sum of squared errors 3895 | RMS Error 6.210 | Average Error -0.450

# Records: training 253, validation 152, test 101

Timings
Overall (secs) 3.00


32
Validation Data prediction details

Row Id.  Predicted Value  Actual Value  Residual  #Nearest Neighbors  CRIM     ZN    INDUS  CHAS  NOX
3        28.70            34.70          6.00     1                   0.02729   0    7.07   0     0.469
9        14.40            16.50          2.10     1                   0.21124  12.5  7.87   0     0.524
13       22.90            21.70         -1.20     1                   0.09378  12.5  7.87   0     0.524
15       19.60            18.20         -1.40     1                   0.63796   0    8.14   0     0.538
16       20.40            19.90         -0.50     1                   0.62739   0    8.14   0     0.538
20       20.40            18.20         -2.20     1                   0.7258    0    8.14   0     0.538
25       16.60            15.60         -1.00     1                   0.75026   0    8.14   0     0.538
29       19.60            18.40         -1.20     1                   0.77299   0    8.14   0     0.538

33
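The k-nearest-neighbors prediction above can be sketched as follows. Z-scoring is an assumption about what Normalization = TRUE does, and the data are toy stand-ins for the (NOX, RM, ...) inputs and the MEDV output:

```python
import math
import statistics

def normalize(train, query):
    """Z-score each column using training means/stds."""
    cols = list(zip(*train))
    mu = [statistics.mean(c) for c in cols]
    sd = [statistics.pstdev(c) or 1.0 for c in cols]   # guard constant columns
    z = lambda row: [(v - m) / s for v, m, s in zip(row, mu, sd)]
    return [z(r) for r in train], z(query)

def knn_predict(train_x, train_y, query, k=1):
    """Predict a continuous target as the mean of the k nearest neighbors."""
    tx, q = normalize(train_x, query)
    ranked = sorted(range(len(tx)), key=lambda i: math.dist(tx[i], q))
    return statistics.mean(train_y[i] for i in ranked[:k])

train_x = [[1, 1], [10, 10], [20, 20]]
train_y = [10, 20, 30]
```

Note that at k = 1 every training record is its own nearest neighbor, which is why the training summary report above shows zero error.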
Classification
Classification Tree

Aim: To classify census tracts into
high and low residential property
value classes

34
Boston Housing Data
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV HIGHCLASS
0.00632 18 2.31 0 0.54 6.575 65.2 4.09 1 296 15.3 396.9 4.98 24 0
0.02731 0 7.07 0 0.47 6.421 78.9 4.97 2 242 17.8 396.9 9.14 21.6 0
0.02729 0 7.07 0 0.47 7.185 61.1 4.97 2 242 17.8 392.83 4.03 34.7 1
0.03237 0 2.18 0 0.46 6.998 45.8 6.06 3 222 18.7 394.63 2.94 33.4 1
0.06905 0 2.18 0 0.46 7.147 54.2 6.06 3 222 18.7 396.9 5.33 36.2 1
0.02985 0 2.18 0 0.46 6.43 58.7 6.06 3 222 18.7 394.12 5.21 28.7 0
0.08829 13 7.87 0 0.52 6.012 66.6 5.56 5 311 15.2 395.6 12.43 22.9 0
0.14455 13 7.87 0 0.52 6.172 96.1 5.95 5 311 15.2 396.9 19.15 27.1 0
0.21124 13 7.87 0 0.52 5.631 100 6.08 5 311 15.2 386.63 29.93 16.5 0
0.17004 13 7.87 0 0.52 6.004 85.9 6.59 5 311 15.2 386.71 17.1 18.9 0
0.22489 13 7.87 0 0.52 6.377 94.3 6.35 5 311 15.2 392.52 20.45 15 0
0.11747 13 7.87 0 0.52 6.009 82.9 6.23 5 311 15.2 396.9 13.27 18.9 0
0.09378 13 7.87 0 0.52 5.889 39 5.45 5 311 15.2 390.5 15.71 21.7 0

35
Training Log

Growing the Tree


#Nodes Error
0 13.82
1 3.45
2 2.97
3 0.67
4 0.65
5 0.56
6 0.2
7 0.14
8 0.06
9 0.05
10 0.05
11 0.04
12 0.02
13 0.01
14 0.01
15 0

Validation Misclassification Summary

Classification Confusion Matrix
                Predicted Class
Actual Class      0    1
0               152    6
1                 8   36

Error Report
Class    # Cases  # Errors  % Error
0        158       6         3.80
1         44       8        18.18
Overall  202      14         6.93
36
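The error report is derived mechanically from the confusion matrix. A sketch (not XLMiner's code) that reproduces the validation summary above:

```python
def error_report(actual, predicted, classes=(0, 1)):
    """Confusion matrix plus per-class and overall error rates."""
    confusion = {(a, p): 0 for a in classes for p in classes}
    for a, p in zip(actual, predicted):
        confusion[(a, p)] += 1
    report = {}
    for c in classes:
        n = sum(confusion[(c, p)] for p in classes)
        errors = n - confusion[(c, c)]                 # off-diagonal = mistakes
        report[c] = (n, errors, 100.0 * errors / n if n else 0.0)
    wrong = sum(e for (a, p), e in confusion.items() if a != p)
    report["Overall"] = (len(actual), wrong, 100.0 * wrong / len(actual))
    return confusion, report

# Reproduce the slide: 6 false positives, 8 false negatives on 202 cases.
actual    = [0] * 158 + [1] * 44
predicted = [0] * 152 + [1] * 6 + [0] * 8 + [1] * 36
confusion, report = error_report(actual, predicted)
```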
XLMiner : Classification Tree - Prune Log

Back to Navigator

# Decision Nodes  Error
15  0.0792
14  0.0644
13  0.0644
12  0.0644
11  0.0644
10  0.0644  <-- Minimum Error Prune (Std. Err. 0.0172708)
 9  0.0743
 8  0.0743
 7  0.0743
 6  0.0693
 5  0.0693
 4  0.0693
 3  0.0693  <-- Best Prune
 2  0.099
 1  0.2079

37
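The two marked trees follow a minimum-error rule and a within-one-standard-error rule. A sketch under the assumption that the standard error is the binomial formula, which does reproduce the 0.0172708 in the prune log:

```python
import math

def choose_pruned_trees(errors, n_validation):
    """Return the minimum-error tree size, the 'best pruned' size (smallest
    tree within one standard error of the minimum), and the standard error."""
    min_size = min(errors, key=lambda s: (errors[s], s))   # ties -> smaller tree
    e = errors[min_size]
    se = math.sqrt(e * (1 - e) / n_validation)             # binomial std. error
    best = min(s for s in errors if errors[s] <= e + se)
    return min_size, best, se

# Validation error rates from the prune log above (202 validation cases):
log = {15: 0.0792, 14: 0.0644, 13: 0.0644, 12: 0.0644, 11: 0.0644,
       10: 0.0644, 9: 0.0743, 8: 0.0743, 7: 0.0743, 6: 0.0693,
       5: 0.0693, 4: 0.0693, 3: 0.0693, 2: 0.099, 1: 0.2079}
min_size, best_size, se = choose_pruned_trees(log, 202)
```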
Classification Tree : Full Tree

Back to Navigator

[Tree diagram: the root splits on RM at 6.5505 (228 vs. 76 cases); subsequent splits use DIS, RM, CRIM, LSTAT, PTRATIO, ZN, and TAX, with leaf nodes labeled class 0 or 1. The diagram does not survive text extraction.]
38
Classification Tree : Best Pruned Tree

Back to Navigator

[Tree diagram: the root splits on RM at 6.5505 (136 vs. 66 cases). The left branch (67.3% of cases) is class 0. The right branch splits on RM at 6.79: its left side (7.92%) is class 0, and its right side splits on PTRATIO at 19.45 into class 1 (21.7%, 44 cases) and class 0 (2.97%, 6 cases).]
39
Classification Tree : Minimum Error Tree

Back to Navigator

[Tree diagram: the root splits on RM at 6.5505 (136 vs. 66 cases); further splits use DIS, CRIM, LSTAT, RM, PTRATIO, and TAX, with leaf nodes labeled class 0 or 1. The diagram does not survive text extraction.]
40
Classification
Neural Network

Aim: To classify census tracts into
high and low residential property
value classes

41
XLMiner : Neural Network Classification

Epochs Information

Number of Epochs 30
Accumulated Trials 9120
Trials by class: Class 0 = 7860, Class 1 = 1260

Architecture
Number of hidden layers 1
Hidden Layer 1
# Nodes 25
Step size for gradient descent 0.1000
Weight change momentum 0.6000
Weight decay 0.0000
Cost Function Squared Error
Hidden layer sigmoid Standard
Output layer sigmoid Standard

42
Training Data scoring - Summary Report

Cutoff Prob. Val. for Success (Updatable) 0.5

Classification Confusion Matrix
                Predicted Class
Actual Class      1    0
1                40   11
0                 4  249

Error Report
Class    # Cases  # Errors  % Error
1         51      11        21.57
0        253       4         1.58
Overall  304      15         4.93

Validation Data scoring - Summary Report

Cutoff Prob. Val. for Success (Updatable) 0.5

Classification Confusion Matrix
                Predicted Class
Actual Class      1    0
1                26    7
0                 1  168

Error Report
Class    # Cases  # Errors  % Error
1         33       7        21.21
0        169       1         0.59
Overall  202       8         3.96
43
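The updatable cutoff simply thresholds the network's success probability. A sketch; whether the boundary case uses >= or > is an assumption:

```python
def classify_with_cutoff(probs, cutoff=0.5):
    """Predict class 1 when the success probability meets the cutoff."""
    return [1 if p >= cutoff else 0 for p in probs]

# Raising the cutoff trades class-1 errors against class-0 errors,
# which is why XLMiner lets you update it after training.
probs = [0.91, 0.40, 0.55, 0.08]
```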
Lift chart (validation dataset)

[Chart: cumulative HIGHV when cases are sorted using predicted values vs. cumulative HIGHV using the average, plotted against # cases (0-300).]

Decile-wise lift chart (validation dataset)

[Chart: decile mean / global mean for deciles 1-10.]

44
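The decile-wise lift chart above can be computed as follows. An illustrative sketch with made-up scores, not XLMiner's code:

```python
def decile_lift(actual, scores):
    """Rank cases by predicted score (best first) and compare each
    decile's mean response to the global mean."""
    ranked = [a for _, a in sorted(zip(scores, actual), key=lambda t: -t[0])]
    n = len(ranked)
    global_mean = sum(ranked) / n
    lifts = []
    for d in range(10):
        chunk = ranked[n * d // 10: n * (d + 1) // 10]
        lifts.append((sum(chunk) / len(chunk)) / global_mean)
    return lifts

# A perfect ranking: the 4 positive cases get the 4 highest scores.
actual = [1] * 4 + [0] * 16
scores = list(range(20, 0, -1))
lifts = decile_lift(actual, scores)
```

A model with no predictive power gives lift near 1 in every decile; the slide's chart shows lift concentrated in the first deciles.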
Data Reduction and Exploration
Hierarchical Clustering

Aim: To cluster electric utilities into
similar groups

45
Utilities Data
seq# x1 x2 x3 x4 x5 x6 x7 x8
Arizona 1 1.06 9.2 151 54.4 1.6 9077 0 0.628
Boston 2 0.89 10.3 202 57.9 2.2 5088 25.3 1.555
Central 3 1.43 15.4 113 53 3.4 9212 0 1.058
Common 4 1.02 11.2 168 56 0.3 6423 34.3 0.7
Consolid 5 1.49 8.8 192 51.2 1 3300 15.6 2.044
Florida 6 1.32 13.5 111 60 -2.2 11127 22.5 1.241
Hawaiian 7 1.22 12.2 175 67.6 2.2 7642 0 1.652
Idaho 8 1.1 9.2 245 57 3.3 13082 0 0.309
Kentucky 9 1.34 13 168 60.4 7.2 8406 0 0.862
Madison 10 1.12 12.4 197 53 2.7 6455 39.2 0.623
Nevada 11 0.75 7.5 173 51.5 6.5 17441 0 0.768
NewEngla 12 1.13 10.9 178 62 3.7 6154 0 1.897
Northern 13 1.15 12.7 199 53.7 6.4 7179 50.2 0.527
Oklahoma 14 1.09 12 96 49.8 1.4 9673 0 0.588
Pacific 15 0.96 7.6 164 62.2 -0.1 6468 0.9 1.4
Puget 16 1.16 9.9 252 56 9.2 15991 0 0.62
SanDiego 17 0.76 6.4 136 61.9 9 5714 8.3 1.92
Southern 18 1.05 12.6 150 56.7 2.7 10140 0 1.108
Texas 19 1.16 11.7 104 54 -2.1 13507 0 0.636
Wisconsi 20 1.2 11.8 148 59.9 3.5 7287 41.1 0.702
United 21 1.04 8.6 204 61 3.5 6650 0 2.116
Virginia 22 1.07 9.3 174 54.3 5.9 10093 26.6 1.306

46
Dendrogram (Data range: T12-5!$M$3:$U$24, Method: Single linkage)

[Dendrogram with distance (0-3.5) on the vertical axis; leaf order:
1 18 14 19 9 2 4 10 13 20 7 12 21 15 22 6 3 8 16 17 11 5]

47
Predicted Clusters Back to Navigator

Cluster id. x1 x2 x3 x4 x5 x6 x7 x8
1 1.06 9.2 151 54.4 1.6 9077 0 0.628
1 0.89 10.3 202 57.9 2.2 5088 25.3 1.555
1 1.43 15.4 113 53 3.4 9212 0 1.058
1 1.02 11.2 168 56 0.3 6423 34.3 0.7
2 1.49 8.8 192 51.2 1 3300 15.6 2.044
1 1.32 13.5 111 60 -2.2 11127 22.5 1.241
1 1.22 12.2 175 67.6 2.2 7642 0 1.652
1 1.1 9.2 245 57 3.3 13082 0 0.309
1 1.34 13 168 60.4 7.2 8406 0 0.862
1 1.12 12.4 197 53 2.7 6455 39.2 0.623
3 0.75 7.5 173 51.5 6.5 17441 0 0.768
1 1.13 10.9 178 62 3.7 6154 0 1.897
1 1.15 12.7 199 53.7 6.4 7179 50.2 0.527
1 1.09 12 96 49.8 1.4 9673 0 0.588
1 0.96 7.6 164 62.2 -0.1 6468 0.9 1.4
1 1.16 9.9 252 56 9.2 15991 0 0.62
4 0.76 6.4 136 61.9 9 5714 8.3 1.92
1 1.05 12.6 150 56.7 2.7 10140 0 1.108
1 1.16 11.7 104 54 -2.1 13507 0 0.636
1 1.2 11.8 148 59.9 3.5 7287 41.1 0.702
1 1.04 8.6 204 61 3.5 6650 0 2.116
1 1.07 9.3 174 54.3 5.9 10093 26.6 1.306

48
Dendrogram (Data range: T12-5!$M$3:$U$24, Method: Complete linkage)

[Dendrogram with distance (0-4) on the vertical axis; leaf order:
1 18 14 19 6 3 9 2 22 4 20 10 13 5 7 12 21 15 17 8 16 11]

49
Predicted Clusters Back to Navigator

Cluster id. x1 x2 x3 x4 x5 x6 x7 x8
1 1.06 9.2 151 54.4 1.6 9077 0 0.628
2 0.89 10.3 202 57.9 2.2 5088 25.3 1.555
1 1.43 15.4 113 53 3.4 9212 0 1.058
2 1.02 11.2 168 56 0.3 6423 34.3 0.7
2 1.49 8.8 192 51.2 1 3300 15.6 2.044
1 1.32 13.5 111 60 -2.2 11127 22.5 1.241
3 1.22 12.2 175 67.6 2.2 7642 0 1.652
4 1.1 9.2 245 57 3.3 13082 0 0.309
1 1.34 13 168 60.4 7.2 8406 0 0.862
2 1.12 12.4 197 53 2.7 6455 39.2 0.623
4 0.75 7.5 173 51.5 6.5 17441 0 0.768
3 1.13 10.9 178 62 3.7 6154 0 1.897
2 1.15 12.7 199 53.7 6.4 7179 50.2 0.527
1 1.09 12 96 49.8 1.4 9673 0 0.588
3 0.96 7.6 164 62.2 -0.1 6468 0.9 1.4
4 1.16 9.9 252 56 9.2 15991 0 0.62
3 0.76 6.4 136 61.9 9 5714 8.3 1.92
1 1.05 12.6 150 56.7 2.7 10140 0 1.108
1 1.16 11.7 104 54 -2.1 13507 0 0.636
2 1.2 11.8 148 59.9 3.5 7287 41.1 0.702
3 1.04 8.6 204 61 3.5 6650 0 2.116
2 1.07 9.3 174 54.3 5.9 10093 26.6 1.306

50
Predicted Clusters (sorted)

Cluster id. x1 x2 x3 x4 x5 x6 x7 x8
1 1.06 9.2 151 54.4 1.6 9077 0 0.628
1 1.43 15.4 113 53 3.4 9212 0 1.058
1 1.32 13.5 111 60 -2.2 11127 22.5 1.241
1 1.34 13 168 60.4 7.2 8406 0 0.862
1 1.09 12 96 49.8 1.4 9673 0 0.588
1 1.05 12.6 150 56.7 2.7 10140 0 1.108
1 1.16 11.7 104 54 -2.1 13507 0 0.636
2 0.89 10.3 202 57.9 2.2 5088 25.3 1.555
2 1.02 11.2 168 56 0.3 6423 34.3 0.7
2 1.49 8.8 192 51.2 1 3300 15.6 2.044
2 1.12 12.4 197 53 2.7 6455 39.2 0.623
2 1.15 12.7 199 53.7 6.4 7179 50.2 0.527
2 1.2 11.8 148 59.9 3.5 7287 41.1 0.702
2 1.07 9.3 174 54.3 5.9 10093 26.6 1.306
3 1.22 12.2 175 67.6 2.2 7642 0 1.652
3 1.13 10.9 178 62 3.7 6154 0 1.897
3 0.96 7.6 164 62.2 -0.1 6468 0.9 1.4
3 0.76 6.4 136 61.9 9 5714 8.3 1.92
3 1.04 8.6 204 61 3.5 6650 0 2.116
4 1.1 9.2 245 57 3.3 13082 0 0.309
4 0.75 7.5 173 51.5 6.5 17441 0 0.768
4 1.16 9.9 252 56 9.2 15991 0 0.62

Means
Cluster 1 1.21 12.5 128 55.5 1.7 10163 3.2 0.874
Cluster 2 1.13 10.9 183 55.1 3.1 6546 33.2 1.065
Cluster 3 1.02 9.1 171 62.9 3.7 6526 1.8 1.797
Cluster 4 1.00 8.9 223 54.8 6.3 15505 0.0 0.566

51
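Single-linkage agglomerative clustering, as used for the first dendrogram, can be sketched naively as below. The points are toy 2-D stand-ins; XLMiner presumably standardizes the x1-x8 variables first, which is omitted here:

```python
import math

def single_linkage(points, k):
    """Agglomerative clustering: repeatedly merge the two clusters whose
    closest pair of members is nearest, until k clusters remain."""
    clusters = [[i] for i in range(len(points))]
    dist = lambda i, j: math.dist(points[i], points[j])
    link = lambda a, b: min(dist(i, j) for i in a for j in b)
    while len(clusters) > k:
        ai, bi = min(((x, y) for x in range(len(clusters))
                      for y in range(x + 1, len(clusters))),
                     key=lambda p: link(clusters[p[0]], clusters[p[1]]))
        clusters[ai] += clusters.pop(bi)    # merge the closest pair
    return clusters

points = [(0, 0), (0, 1), (10, 10), (10, 11), (50, 50)]
clusters = single_linkage(points, k=3)
```

Complete linkage (the second dendrogram) differs only in using `max` instead of `min` inside `link`, which is why the two methods group the utilities differently.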
Affinity
Association Rules
(Market Basket Analysis)

Aim: to identify types of books that


are likely to be bought by customers
based on past purchases of books
52
Book purchase data: 2000 customers, one 0/1 column per category
(ChildBks, YouthBks, CookBks, DoItYBks, RefBks, ArtBks, GeogBks,
ItalCook, ItalAtlas, ItalArt, Florence)
0 1 0 1 0 0 1 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0
1 1 1 0 1 0 1 0 0 0 0
0 0 1 0 0 0 1 0 0 0 0
1 0 0 0 0 1 0 0 0 0 1
0 1 0 0 0 0 0 0 0 0 0
0 1 0 0 1 0 0 0 0 0 0
1 0 0 1 0 0 0 0 0 0 0
1 1 1 0 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0
0 0 1 0 0 0 1 0 0 0 0
1 0 0 0 0 1 0 0 0 0 1
1 1 0 1 1 1 0 0 1 1 0
1 1 1 0 0 0 0 0 0 0 0
1 1 1 0 0 0 1 0 0 0 0
0 0 1 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 0 0 0 0
1 1 1 0 0 1 0 0 0 0 1

53
XLMiner : Association Rules

Data
Input Data Sheet1!$A$1:$K$2001
Data Format Binary Matrix
Min. Support 200
Min. Conf. % 70
# Rules 19

Rule # Conf. % Antecedent (a) Consequent (c) Support(a) Support(c) Support(a U c) Lift Ratio

1 100 ItalCook => CookBks 227 862 227 2.32
2 82.19 DoItYBks, ArtBks=> CookBks 247 862 203 1.91
3 81.89 DoItYBks, GeogBks=> CookBks 265 862 217 1.90
4 80.33 CookBks, RefBks=> ChildBks 305 846 245 1.90
5 80 ArtBks, GeogBks=> ChildBks 255 846 204 1.89
6 81.18 ArtBks, GeogBks=> CookBks 255 862 207 1.88
7 79.63 YouthBks, CookBks=> ChildBks 324 846 258 1.88
8 80.86 ChildBks, RefBks=> CookBks 303 862 245 1.88
9 78.87 DoItYBks, GeogBks=> ChildBks 265 846 209 1.86
10 79.35 ChildBks, DoItYBks=> CookBks 368 862 292 1.84
11 77.87 CookBks, DoItYBks=> ChildBks 375 846 292 1.84
12 77.66 CookBks, GeogBks=> ChildBks 385 846 299 1.84
13 78.18 ChildBks, YouthBks=> CookBks 330 862 258 1.81
14 77.85 ChildBks, ArtBks=> CookBks 325 862 253 1.81
15 75.75 CookBks, ArtBks=> ChildBks 334 846 253 1.79
16 76.67 ChildBks, GeogBks=> CookBks 390 862 299 1.78
17 70.65 GeogBks=> ChildBks 552 846 390 1.67
18 70.63 RefBks=> ChildBks 429 846 303 1.67
19 71.1 RefBks=> CookBks 429 862 305 1.65
54
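Confidence and lift in the table follow directly from the support counts. A sketch, reading the Support(a U c) column as the joint support of antecedent and consequent:

```python
def rule_stats(n_transactions, support_a, support_c, support_ac):
    """Confidence (%) and lift ratio for the rule a => c."""
    confidence = support_ac / support_a
    lift = confidence / (support_c / n_transactions)   # vs. buying c at random
    return 100 * confidence, lift

# Rule 1 above, ItalCook => CookBks: every customer who bought an
# Italian cookbook also bought a cookbook, so confidence is 100%.
conf, lift = rule_stats(2000, support_a=227, support_c=862, support_ac=227)
```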
Some Utilities
Sampling from worksheets and databases
Database scoring
Graphics
Binning

55
Simple
Random
Sampling

56
Stratified
Random
Sampling

57
Scoring to
databases and
worksheets

58
Binning
continuous
variables

59
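Equal-count binning, one common way a binning utility works, can be sketched as follows. Whether XLMiner bins by rank in exactly this way is an assumption:

```python
def equal_count_bins(values, n_bins):
    """Assign each value a bin id (1..n_bins) so the bins have
    roughly equal counts, by ranking the values."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values) + 1
    return bins

# e.g. binning a few TAX values into 3 equal-count bins:
binned_tax = equal_count_bins([296, 242, 242, 222, 311, 307], 3)
```

The alternative, equal-interval binning, would divide the value range rather than the ranks into n_bins pieces.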
Missing Data

60
Graphics: Boston Housing data

[Box plot and histogram of AGE: frequency (0-180) vs. AGE (0-100).]

61
[Box plot and histogram of RM: frequency (0-250) vs. RM (3-9.6).]

62
Matrix Plot

[Scatterplot matrix of RM, AGE, and TAX. High tax towns have fewer rooms on average?]
63
RM Box Plot

[Box plot of RM (0-10) by Binned_TAX (bins 1-5).]

64
Future Extensions
Cross Validation
Bootstrap, Bagging and Boosting
Error-based clustering
Time Series and Sequences
Support Vector Machines
Collaborative Filtering

65
In Conclusion
XLMiner is a modern tool-belt for data mining. It
is an affordable, easy-to-use tool for consultants,
MBAs and business analysts to learn, create and
deploy data mining methods.
More generally, XLMiner is a tool for data
analysis in Excel that uses classical and modern,
computationally-intensive techniques.

66
