Data Mining in Excel Using XLMiner: Nitin R. Patel
XLMiner
Nitin R. Patel
Cytel Software and M.I.T. Sloan
1
Contact Info
XLMiner is distributed by Resampling Stats, Inc.
www.xlminer.net
Contact Peter Bruce: pbruce@resample.com
703-522-2713
2
What is XLMiner?
XLMiner is an affordable, easy-to-use tool for business analysts, consultants, and business students to:
learn the strengths and weaknesses of data mining methods,
prototype large-scale data mining applications,
implement medium-scale data mining applications.
More generally, XLMiner is a tool for data analysis in Excel that uses classical and modern, computationally intensive techniques.
3
Available Data Mining Software
Application-specific: aimed at providing
solutions to end-users for common tasks
(e.g. Unica for Customer Relationship
Management, Urban Science for location
and distribution)
Technique-specific: focused on a few data mining methods (e.g. CART from Salford Systems, Neural Nets from HNC Software)
4
TECHNIQUE-SPECIFIC PRODUCTS (Source: Elder Research)
[Product-by-algorithm matrix: CART (Salford), NeuroShell, WizWhy, Cognos, and See5 against Kohonen Net, Association Rules, K-Means, Sequential Rules, Time Series, Logistic Regression, Rule Induction, Naïve Bayes, Radial Basis Functions, K-Nearest Neighbors, Multilayer Neural Net, Linear Regression, and Classification & Regression Trees; each product implements only a few of the algorithms.]
5
Available Data Mining Software
Horizontal products: designed for data mining analysts (e.g. SAS Enterprise Miner, SPSS Clementine, IBM Intelligent Miner, NCR Teraminer, S-Plus Insightful Miner, Darwin/Oracle)
Powerful, comprehensive, easy-to-use; but
Need substantial learning effort
Expensive
6
HORIZONTAL PRODUCTS (Source: Elder Research)
[Product-by-algorithm matrix over K-Nearest Neighbors, Naïve Bayes, Rule Induction, Logistic Regression, Time Series, Sequential Rules, K-Means, Association Rules, and Kohonen Net:
Enterprise Miner (SAS): 9 of 9 algorithms
Intelligent Miner (IBM): 8 of 9
Clementine (SPSS): 7 of 9
PRW (Unica): 6 of 9
MineSet (SGI): 5 of 9
Darwin (Oracle): 4 of 9]
7
Desiderata for Data Mining and
Modern Data Analysis Software
Easy-to-use
Data import (e.g. cross-platform, various databases)
Data handling (e.g. data partitioning, scoring)
Invoking and experimenting with procedures
Comprehensive Range of Procedures:
Statistics (e.g. Regression, Multivariate procedures)
Machine learning (e.g. Neural Nets, Classification Trees)
Database (e.g. Association Rules)
8
XLMiner is Unique
Low cost,
Comprehensive set of data mining models and
algorithms that includes statistical, machine
learning and database methods,
Based on a prototype used in three years of MBA courses on data mining at the Sloan School, M.I.T.
Focus on business applications: a book of lecture notes and cases is in preparation (first draft available for examination).
9
Why Data Mining in Excel?
10
Advantages
Low learning hurdle
Promotes understanding of strengths and
weaknesses of different data mining techniques
and processes
Enables interactive analysis of data (important in
early stages of model building)
Facilitates incorporation of domain knowledge
(often key to successful applications) by
empowering end-users to participate actively in
data mining projects
Enables pre-processing of data and post-processing of results using Excel functions, reporting in Word, presentations in PowerPoint
11
Advantages (cont.)
Supports communication between data miners and
end-users
Supports smooth transition from prototyping to
custom solution development (VB and VBA)
Emphasizes openness:
enables integration with other analytic software for optimization (Solver), simulation (Crystal Ball), numerical methods;
interface modifications (e.g. custom forms and outputs);
solution-specific routines (VBA)
Examples:
Boston Celtics analysis of player statistics
Clustering for improving forecasts, optimizing price markdowns.
12
Size Limitations
An Excel worksheet cannot exceed 65,536 rows. If data records are stored as rows in a single worksheet, this is the largest data set that can be accommodated. The number of variables cannot exceed 256 (the number of columns).
These limits do not apply to deployment of a model to score large databases.
If Excel is used as a view-port into a database such as Access, MS SQL Server, Oracle, or SAS, these limits do not apply.
13
Sampling
Practical data mining methodologies such as SEMMA (SAS) and CRISP-DM (SPSS and European industry standard) recommend working with a sample (typically 10,000 random cases) in the model and algorithm selection phase. This facilitates interactive development of data mining models.
14
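The sampling step recommended above can be sketched in plain Python. This is a toy illustration, not XLMiner's implementation; the 100,000-row "database" is synthetic, and only the 10,000-case sample size comes from the slide:

```python
import random

def draw_sample(records, n, seed=0):
    """Draw a simple random sample of n records, without replacement."""
    rng = random.Random(seed)   # fixed seed so the sample is reproducible
    return rng.sample(records, n)

# Synthetic "large database" of 100,000 rows; sample 10,000 for modeling.
database = list(range(100_000))
sample = draw_sample(database, 10_000)
```

Modeling then proceeds interactively on `sample`, with the full database touched again only at scoring time.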
XLMiner
Free 30-day trial version: limit is 200 records per partition.
Education version: limit is 2,000 records per partition, so the maximum size for a data set is 6,000 records.
Standard version (currently in beta test; will be available by end of August):
Up to 60,000 records, obtained by drawing samples from large databases in accordance with SAS's SEMMA (Sample, Explore, Modify, Model, Assess) methodology. Training data restricted to 10,000 records.
Sampling from and scoring to Access databases (later SQL Server, Oracle, SAS)
15
Data Mining Procedures in
XLMiner
Partitioning data sets (into Training, Validation,
and Test data sets)
Scoring of training, validation, test and other data
Prediction (of a continuous variable)
Classification
Data reduction and exploration
Affinity
Utilities: Sampling, graphics, missing data,
binning, creation of dummy variables
16
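Partitioning, the first procedure listed above, randomly splits the records into training, validation, and test sets. A minimal sketch (the 50/30/20 fractions and the 506-row count, matching the Boston Housing data used later, are illustrative choices):

```python
import random

def partition(records, fractions=(0.5, 0.3, 0.2), seed=1):
    """Randomly split records into training, validation, and test sets."""
    rng = random.Random(seed)
    shuffled = records[:]
    rng.shuffle(shuffled)
    n = len(shuffled)
    n_train = int(fractions[0] * n)
    n_valid = int(fractions[1] * n)
    train = shuffled[:n_train]
    valid = shuffled[n_train:n_train + n_valid]
    test = shuffled[n_train + n_valid:]
    return train, valid, test

# e.g. a 50/30/20 split of 506 row indices (Boston Housing has 506 rows)
train, valid, test = partition(list(range(506)))
```

The three sets are disjoint, so validation and test error estimates are not biased by records the model was fit on.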
Prediction
Multiple Linear Regression with subset
selection, residual analysis, and collinearity
diagnostics.
K-Nearest Neighbors
Regression Tree
Neural Net
17
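K-Nearest Neighbors prediction averages the target value over the k closest training cases, usually after normalizing the inputs (the XLMiner run later in the deck uses normalization and k = 1). A stdlib sketch on made-up data:

```python
def normalize(rows):
    """Min-max normalize each column to [0, 1]."""
    cols = list(zip(*rows))
    lo = [min(c) for c in cols]
    hi = [max(c) for c in cols]
    return [[(v - l) / (h - l) if h > l else 0.0
             for v, l, h in zip(r, lo, hi)] for r in rows]

def knn_predict(train_x, train_y, query, k=1):
    """Predict by averaging the target over the k nearest training cases."""
    order = sorted(range(len(train_x)),
                   key=lambda i: sum((a - b) ** 2
                                     for a, b in zip(train_x[i], query)))
    return sum(train_y[i] for i in order[:k]) / k

# Toy data: two predictors, one continuous target (think NOX, RM -> MEDV).
X = normalize([[0.5, 6.5], [0.6, 7.1], [0.4, 5.9]])
y = [24.0, 34.7, 16.5]
pred = knn_predict(X, y, X[1], k=1)   # query is itself a training case
```

With k = 1, every training case is its own nearest neighbor, which is why a k = 1 model scores the training data with zero error (as in the KNN output later in the deck).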
Classification
Logistic Regression with subset selection,
residual analysis, and collinearity diagnostics
Discriminant Analysis
K-Nearest Neighbors
Classification Tree
Naïve Bayes
Neural Networks
18
Data Reduction and Exploration
Principal Components
K-Means Clustering
Hierarchical Clustering
19
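K-Means clustering, listed above, alternates between assigning each point to its nearest centroid and recomputing centroids as cluster means. A bare-bones sketch on one-dimensional toy data (not XLMiner's implementation):

```python
import random

def kmeans(points, k, iters=20, seed=0):
    """Plain k-means: assign points to nearest centroid, recompute means."""
    rng = random.Random(seed)
    centroids = rng.sample(points, k)        # random initial centroids
    clusters = [[] for _ in range(k)]
    for _ in range(iters):
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2
                                      for a, b in zip(p, centroids[c])))
            clusters[j].append(p)
        for j, members in enumerate(clusters):
            if members:                       # keep centroid if cluster empty
                centroids[j] = [sum(col) / len(members)
                                for col in zip(*members)]
    return centroids, clusters

# Two obvious groups on a line.
pts = [[0.0], [0.1], [0.2], [5.0], [5.1], [5.2]]
centroids, clusters = kmeans(pts, 2)
```

On this data the centroids settle near 0.1 and 5.1 regardless of which two points are drawn as starting centroids.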
Affinity
Association Rules (Market Basket Analysis)
20
Partitioning
21
22
Boston Housing Data
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV
0.00632 18 2.31 0 0.538 6.575 65.2 4.09 1 296 15.3 397 4.98 24
0.02731 0 7.07 0 0.469 6.421 78.9 4.97 2 242 17.8 397 9.14 21.6
0.02729 0 7.07 0 0.469 7.185 61.1 4.97 2 242 17.8 393 4.03 34.7
0.03237 0 2.18 0 0.458 6.998 45.8 6.06 3 222 18.7 395 2.94 33.4
0.06905 0 2.18 0 0.458 7.147 54.2 6.06 3 222 18.7 397 5.33 36.2
0.02985 0 2.18 0 0.458 6.43 58.7 6.06 3 222 18.7 394 5.21 28.7
0.08829 13 7.87 0 0.524 6.012 66.6 5.56 5 311 15.2 396 12.43 22.9
0.14455 13 7.87 0 0.524 6.172 96.1 5.95 5 311 15.2 397 19.15 27.1
0.21124 13 7.87 0 0.524 5.631 100 6.08 5 311 15.2 387 29.93 16.5
0.17004 13 7.87 0 0.524 6.004 85.9 6.59 5 311 15.2 387 17.1 18.9
0.22489 13 7.87 0 0.524 6.377 94.3 6.35 5 311 15.2 393 20.45 15
0.11747 13 7.87 0 0.524 6.009 82.9 6.23 5 311 15.2 397 13.27 18.9
0.09378 13 7.87 0 0.524 5.889 39 5.45 5 311 15.2 391 15.71 21.7
0.62976 0 8.14 0 0.538 5.949 61.8 4.71 4 307 21 397 8.26 20.4
23
XLMiner : Data Partition Sheet Date: 29-Jul-2003 13:50:09 (Ver: 1.2.0.1)
Output Navigator
Training Data Validation Data Test Data
Data
Data source: housing!$A$2:$O$507
Selected variables: CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT
Row Id.
1 0.00632 18 2.31 0 0.538 6.575 65.2 4.09 1 296 15.3 396.9 4.98
2 0.02731 0 7.07 0 0.469 6.421 78.9 4.9671 2 242 17.8 396.9 9.14
5 0.06905 0 2.18 0 0.458 7.147 54.2 6.0622 3 222 18.7 396.9 5.33
6 0.02985 0 2.18 0 0.458 6.43 58.7 6.0622 3 222 18.7 394.12 5.21
7 0.08829 12.5 7.87 0 0.524 6.012 66.6 5.5605 5 311 15.2 395.6 12.43
8 0.14455 12.5 7.87 0 0.524 6.172 96.1 5.9505 5 311 15.2 396.9 19.15
10 0.17004 12.5 7.87 0 0.524 6.004 85.9 6.5921 5 311 15.2 386.71 17.1
12 0.11747 12.5 7.87 0 0.524 6.009 82.9 6.2267 5 311 15.2 396.9 13.27
14 0.62976 0 8.14 0 0.538 5.949 61.8 4.7075 4 307 21 396.9 8.26
3 0.02729 0 7.07 0 0.469 7.185 61.1 4.9671 2 242 17.8 392.83 4.03
9 0.21124 12.5 7.87 0 0.524 5.631 100 6.0821 5 311 15.2 386.63 29.93
13 0.09378 12.5 7.87 0 0.524 5.889 39 5.4509 5 311 15.2 390.5 15.71
4 0.03237 0 2.18 0 0.458 6.998 45.8 6.0622 3 222 18.7 394.63 2.94
11 0.22489 12.5 7.87 0 0.524 6.377 94.3 6.3467 5 311 15.2 392.52 20.45
17 1.05393 0 8.14 0 0.538 5.935 29.3 4.4986 4 307 21 386.85 6.58
24
Prediction
Multiple Linear Regression using
subset selection
25
The Regression Model

Training Data scoring - Summary Report
# Records: 253 | Total sum of squared errors: 6036 | RMS Error: 4.884 | Average Error: 0.000

Validation Data scoring - Summary Report
# Records: 152 | Total sum of squared errors: 2848 | RMS Error: 4.329 | Average Error: 0.066

Test Data scoring - Summary Report
# Records: 101 | Total sum of squared errors: 2392 | RMS Error: 4.866 | Average Error: -1.019
26
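In these summary reports, RMS Error is the square root of the total sum of squared errors divided by the number of records scored. A quick check against the slide's figures:

```python
import math

def rms_error(total_sse, n_records):
    """Root-mean-square error from a total sum of squared errors."""
    return math.sqrt(total_sse / n_records)

# Figures from the regression summary reports.
train_rms = rms_error(6036, 253)   # ~4.884
valid_rms = rms_error(2848, 152)   # ~4.329
test_rms  = rms_error(2392, 101)   # ~4.87
```

The small residual discrepancies come from the sums of squared errors being rounded on the slide.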
Subset selection (exhaustive enumeration)
27
The Regression Model
28
%AvAbsErr = 15.6%

AbsErr  Freq
0       0
2       61
4       40
6       25
8       10
10      9
12      2
14      3
16      0
18      0
20      1
22      1

[Histogram: frequency in the validation dataset by AbsErr bin.]
29
Prediction
K-Nearest Neighbors
30
XLMiner : K-Nearest Neighbors Prediction
Data
Source data worksheet: Data_Partition1
Training data used for building the model Data_Partition1!$C$19:$Q$322
Validation data Data_Partition1!$C$323:$Q$524
# cases in the training data set 304
# cases in the validation data set 202
Normalization TRUE
# nearest neighbors (k) 1
Variables
Input variables NOX RM DIS PTRATIO LSTAT
Output variable MEDV
31
Parameters/Options
# Nearest neighbors: 1

Training Data scoring - Summary Report
Total sum of squared errors: 0 | RMS Error: 0 | Average Error: 0

Validation Data scoring - Summary Report
Total sum of squared errors: 3895 | RMS Error: 6.210 | Average Error: -0.450
Timings
33
Classification
Classification Tree
34
Boston Housing Data
CRIM ZN INDUS CHAS NOX RM AGE DIS RAD TAX PTRATIO B LSTAT MEDV HIGHCLASS
0.00632 18 2.31 0 0.54 6.575 65.2 4.09 1 296 15.3 396.9 4.98 24 0
0.02731 0 7.07 0 0.47 6.421 78.9 4.97 2 242 17.8 396.9 9.14 21.6 0
0.02729 0 7.07 0 0.47 7.185 61.1 4.97 2 242 17.8 392.83 4.03 34.7 1
0.03237 0 2.18 0 0.46 6.998 45.8 6.06 3 222 18.7 394.63 2.94 33.4 1
0.06905 0 2.18 0 0.46 7.147 54.2 6.06 3 222 18.7 396.9 5.33 36.2 1
0.02985 0 2.18 0 0.46 6.43 58.7 6.06 3 222 18.7 394.12 5.21 28.7 0
0.08829 13 7.87 0 0.52 6.012 66.6 5.56 5 311 15.2 395.6 12.43 22.9 0
0.14455 13 7.87 0 0.52 6.172 96.1 5.95 5 311 15.2 396.9 19.15 27.1 0
0.21124 13 7.87 0 0.52 5.631 100 6.08 5 311 15.2 386.63 29.93 16.5 0
0.17004 13 7.87 0 0.52 6.004 85.9 6.59 5 311 15.2 386.71 17.1 18.9 0
0.22489 13 7.87 0 0.52 6.377 94.3 6.35 5 311 15.2 392.52 20.45 15 0
0.11747 13 7.87 0 0.52 6.009 82.9 6.23 5 311 15.2 396.9 13.27 18.9 0
0.09378 13 7.87 0 0.52 5.889 39 5.45 5 311 15.2 390.5 15.71 21.7 0
35
Training Log
Error Report
Class # Cases # Errors % Error
0 158 6 3.80
1 44 8 18.18
Overall 202 14 6.93
36
XLMiner : Classification Tree - Prune Log
Back to Navigator
# Decision Nodes    Error
15 0.0792
14 0.0644
13 0.0644
12 0.0644
11 0.0644
10 0.0644 <-- Minimum Error Prune Std. Err. 0.0172708
9 0.0743
8 0.0743
7 0.0743
6 0.0693
5 0.0693
4 0.0693
3 0.0693 <-- Best Prune
2 0.099
1 0.2079
37
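The Std. Err. shown on the prune log is the binomial standard error of the minimum validation error rate, sqrt(e(1-e)/n) with e = 0.0644 and n = 202 validation cases; the "Best Prune" tree then appears to follow the usual one-standard-error rule, taking the smallest tree whose error is within one standard error of the minimum. A quick check:

```python
import math

def prune_std_err(error_rate, n_cases):
    """Binomial standard error of a validation error rate."""
    return math.sqrt(error_rate * (1 - error_rate) / n_cases)

se = prune_std_err(0.0644, 202)   # matches the 0.0172708 on the prune log
threshold = 0.0644 + se           # trees with error <= this qualify
# The 3-node tree (error 0.0693) is the smallest qualifying tree,
# while the 2-node tree (0.099) exceeds the threshold: hence "Best Prune".
```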
Classification Tree : Full Tree
Back to Navigator
[Tree diagram: root splits on RM at 6.5505 (228 cases left, 76 right); the left branch splits on DIS at 1.35929 and the right branch on RM at 6.791, giving leaves of 6, 222, 31, and 45 cases with classes 1/0 and 0/1.]
38
Classification Tree : Best Pruned Tree
Back to Navigator
[Tree diagram: root splits on RM at 6.5505 (136 cases left, 66 right); the left leaf is class 0 (67.3% of cases). The right branch splits on RM at 6.791 (class 0 leaf, 7.92%), then on PTRATIO at 19.45, giving leaves of 44 cases (21.7%, class 1) and 6 cases (2.97%, class 0).]
39
Classification Tree : Minimum Error Tree
Back to Navigator
[Tree diagram: root splits on RM at 6.5505 (136 cases left, 66 right); further splits on DIS at 1.35929, RM at 6.791, and TAX at 378; terminal nodes include a 3.46% node and leaves of 6 cases (2.97%, class 0) and 1 case (0.49%, class 1).]
40
Classification
Neural Network
41
XLMiner : Neural Network Classification
Architecture
Number of hidden layers 1
Hidden Layer 1
# Nodes 25
Step size for gradient descent 0.1000
Weight change momentum 0.6000
Weight decay 0.0000
Cost Function Squared Error
Hidden layer sigmoid Standard
Output layer sigmoid Standard
42
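The architecture parameters above (step size 0.1 for gradient descent, momentum 0.6, weight decay 0, standard sigmoid, squared-error cost) can be illustrated with a single-neuron update rule. This is a toy sketch under those settings, not XLMiner's actual training routine:

```python
import math

def sigmoid(z):
    return 1.0 / (1.0 + math.exp(-z))

def train_neuron(data, step=0.1, momentum=0.6, decay=0.0, epochs=200):
    """One sigmoid unit trained by gradient descent on squared error."""
    w = b = 0.0
    dw_prev = db_prev = 0.0
    for _ in range(epochs):
        for x, target in data:
            out = sigmoid(w * x + b)
            err = out - target                 # derivative of squared error
            grad = err * out * (1 - out)       # chain rule through sigmoid
            dw = -step * (grad * x + decay * w) + momentum * dw_prev
            db = -step * grad + momentum * db_prev
            w, b = w + dw, b + db
            dw_prev, db_prev = dw, db
    return w, b

# Learn a simple threshold: class 1 when x > 0 (synthetic data).
data = [(-2, 0), (-1, 0), (1, 1), (2, 1)]
w, b = train_neuron(data)
```

The momentum term reuses a fraction of the previous weight change, smoothing the descent; weight decay (zero here, as on the slide) would shrink weights toward zero each step.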
Training Data scoring - Summary Report
Error Report
Class   # Cases  # Errors  % Error
1       51       11        21.57
0       253      4         1.58
Overall 304      15        4.93

Validation Data scoring - Summary Report
Error Report
Class   # Cases  # Errors  % Error
1       33       7         21.21
0       169      1         0.59
Overall 202      8         3.96
43
Lift chart (validation dataset)
[Chart: cumulative HIGHV when cases are sorted by predicted values vs. cumulative HIGHV using the average, plotted against # cases (0-300).]
[Chart: decile mean / global mean (0-7), by deciles 1-10.]
44
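A lift chart plots cumulative actual positives after sorting cases by descending predicted score, against the straight line you would get by picking cases at random (the overall positive rate). A stdlib sketch of the computation on toy scores:

```python
def cumulative_lift(scores, actuals):
    """Cumulative count of positives when cases are sorted by score."""
    order = sorted(range(len(scores)), key=lambda i: -scores[i])
    cum, curve = 0, []
    for i in order:
        cum += actuals[i]
        curve.append(cum)
    return curve

# Toy predicted scores and true 0/1 outcomes.
scores  = [0.9, 0.1, 0.8, 0.4, 0.2]
actuals = [1,   0,   1,   1,   0]
curve    = cumulative_lift(scores, actuals)        # [1, 2, 3, 3, 3]
baseline = [3 / 5 * (k + 1) for k in range(5)]     # average-rate diagonal
```

The further `curve` rises above `baseline` in the early cases, the better the model is at ranking the positives first; decile-wise lift reports the same comparison per tenth of the sorted list.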
Data Reduction and Exploration
Hierarchical Clustering
45
Utilities Data
seq# x1 x2 x3 x4 x5 x6 x7 x8
Arizona 1 1.06 9.2 151 54.4 1.6 9077 0 0.628
Boston 2 0.89 10.3 202 57.9 2.2 5088 25.3 1.555
Central 3 1.43 15.4 113 53 3.4 9212 0 1.058
Common 4 1.02 11.2 168 56 0.3 6423 34.3 0.7
Consolid 5 1.49 8.8 192 51.2 1 3300 15.6 2.044
Florida 6 1.32 13.5 111 60 -2.2 11127 22.5 1.241
Hawaiian 7 1.22 12.2 175 67.6 2.2 7642 0 1.652
Idaho 8 1.1 9.2 245 57 3.3 13082 0 0.309
Kentucky 9 1.34 13 168 60.4 7.2 8406 0 0.862
Madison 10 1.12 12.4 197 53 2.7 6455 39.2 0.623
Nevada 11 0.75 7.5 173 51.5 6.5 17441 0 0.768
NewEngla 12 1.13 10.9 178 62 3.7 6154 0 1.897
Northern 13 1.15 12.7 199 53.7 6.4 7179 50.2 0.527
Oklahoma 14 1.09 12 96 49.8 1.4 9673 0 0.588
Pacific 15 0.96 7.6 164 62.2 -0.1 6468 0.9 1.4
Puget 16 1.16 9.9 252 56 9.2 15991 0 0.62
SanDiego 17 0.76 6.4 136 61.9 9 5714 8.3 1.92
Southern 18 1.05 12.6 150 56.7 2.7 10140 0 1.108
Texas 19 1.16 11.7 104 54 -2.1 13507 0 0.636
Wisconsi 20 1.2 11.8 148 59.9 3.5 7287 41.1 0.702
United 21 1.04 8.6 204 61 3.5 6650 0 2.116
Virginia 22 1.07 9.3 174 54.3 5.9 10093 26.6 1.306
46
Dendrogram (Data range: T12-5!$M$3:$U$24, Method: Single linkage)
[Dendrogram; y-axis: Distance (0-3.5). Leaf order: 1 18 14 19 9 2 4 10 13 20 7 12 21 15 22 6 3 8 16 17 11 5]
47
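Single linkage merges, at each step, the two clusters whose closest members are nearest each other. A compact sketch on one-dimensional toy data (the slides cluster the utilities on 8 variables; the mechanics are the same):

```python
def single_linkage(points, n_clusters):
    """Agglomerative clustering with single (minimum-distance) linkage."""
    clusters = [[i] for i in range(len(points))]   # start: one point each

    def dist(a, b):
        # Single linkage: distance between the closest pair of members.
        return min(abs(points[i] - points[j]) for i in a for j in b)

    while len(clusters) > n_clusters:
        # Find and merge the closest pair of clusters.
        pairs = [(dist(clusters[i], clusters[j]), i, j)
                 for i in range(len(clusters))
                 for j in range(i + 1, len(clusters))]
        _, i, j = min(pairs)
        clusters[i] = clusters[i] + clusters[j]
        del clusters[j]
    return clusters

pts = [0.0, 0.2, 0.3, 5.0, 5.4, 9.0]
groups = single_linkage(pts, 3)   # -> {0,1,2}, {3,4}, {5}
```

Cutting the dendrogram at a chosen distance is equivalent to stopping the merge loop at the corresponding number of clusters; complete linkage differs only in using `max` instead of `min` in `dist`.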
Predicted Clusters Back to Navigator
Cluster id. x1 x2 x3 x4 x5 x6 x7 x8
1 1.06 9.2 151 54.4 1.6 9077 0 0.628
1 0.89 10.3 202 57.9 2.2 5088 25.3 1.555
1 1.43 15.4 113 53 3.4 9212 0 1.058
1 1.02 11.2 168 56 0.3 6423 34.3 0.7
2 1.49 8.8 192 51.2 1 3300 15.6 2.044
1 1.32 13.5 111 60 -2.2 11127 22.5 1.241
1 1.22 12.2 175 67.6 2.2 7642 0 1.652
1 1.1 9.2 245 57 3.3 13082 0 0.309
1 1.34 13 168 60.4 7.2 8406 0 0.862
1 1.12 12.4 197 53 2.7 6455 39.2 0.623
3 0.75 7.5 173 51.5 6.5 17441 0 0.768
1 1.13 10.9 178 62 3.7 6154 0 1.897
1 1.15 12.7 199 53.7 6.4 7179 50.2 0.527
1 1.09 12 96 49.8 1.4 9673 0 0.588
1 0.96 7.6 164 62.2 -0.1 6468 0.9 1.4
1 1.16 9.9 252 56 9.2 15991 0 0.62
4 0.76 6.4 136 61.9 9 5714 8.3 1.92
1 1.05 12.6 150 56.7 2.7 10140 0 1.108
1 1.16 11.7 104 54 -2.1 13507 0 0.636
1 1.2 11.8 148 59.9 3.5 7287 41.1 0.702
1 1.04 8.6 204 61 3.5 6650 0 2.116
1 1.07 9.3 174 54.3 5.9 10093 26.6 1.306
48
Dendrogram (Data range: T12-5!$M$3:$U$24, Method: Complete linkage)
[Dendrogram; y-axis: Distance (0-4). Leaf order: 1 18 14 19 6 3 9 2 22 4 20 10 13 5 7 12 21 15 17 8 16 11]
49
Predicted Clusters Back to Navigator
Cluster id. x1 x2 x3 x4 x5 x6 x7 x8
1 1.06 9.2 151 54.4 1.6 9077 0 0.628
2 0.89 10.3 202 57.9 2.2 5088 25.3 1.555
1 1.43 15.4 113 53 3.4 9212 0 1.058
2 1.02 11.2 168 56 0.3 6423 34.3 0.7
2 1.49 8.8 192 51.2 1 3300 15.6 2.044
1 1.32 13.5 111 60 -2.2 11127 22.5 1.241
3 1.22 12.2 175 67.6 2.2 7642 0 1.652
4 1.1 9.2 245 57 3.3 13082 0 0.309
1 1.34 13 168 60.4 7.2 8406 0 0.862
2 1.12 12.4 197 53 2.7 6455 39.2 0.623
4 0.75 7.5 173 51.5 6.5 17441 0 0.768
3 1.13 10.9 178 62 3.7 6154 0 1.897
2 1.15 12.7 199 53.7 6.4 7179 50.2 0.527
1 1.09 12 96 49.8 1.4 9673 0 0.588
3 0.96 7.6 164 62.2 -0.1 6468 0.9 1.4
4 1.16 9.9 252 56 9.2 15991 0 0.62
3 0.76 6.4 136 61.9 9 5714 8.3 1.92
1 1.05 12.6 150 56.7 2.7 10140 0 1.108
1 1.16 11.7 104 54 -2.1 13507 0 0.636
2 1.2 11.8 148 59.9 3.5 7287 41.1 0.702
3 1.04 8.6 204 61 3.5 6650 0 2.116
2 1.07 9.3 174 54.3 5.9 10093 26.6 1.306
50
Predicted Clusters (sorted)
Cluster id. x1 x2 x3 x4 x5 x6 x7 x8
1 1.06 9.2 151 54.4 1.6 9077 0 0.628
1 1.43 15.4 113 53 3.4 9212 0 1.058
1 1.32 13.5 111 60 -2.2 11127 22.5 1.241
1 1.34 13 168 60.4 7.2 8406 0 0.862
1 1.09 12 96 49.8 1.4 9673 0 0.588
1 1.05 12.6 150 56.7 2.7 10140 0 1.108
1 1.16 11.7 104 54 -2.1 13507 0 0.636
2 0.89 10.3 202 57.9 2.2 5088 25.3 1.555
2 1.02 11.2 168 56 0.3 6423 34.3 0.7
2 1.49 8.8 192 51.2 1 3300 15.6 2.044
2 1.12 12.4 197 53 2.7 6455 39.2 0.623
2 1.15 12.7 199 53.7 6.4 7179 50.2 0.527
2 1.2 11.8 148 59.9 3.5 7287 41.1 0.702
2 1.07 9.3 174 54.3 5.9 10093 26.6 1.306
3 1.22 12.2 175 67.6 2.2 7642 0 1.652
3 1.13 10.9 178 62 3.7 6154 0 1.897
3 0.96 7.6 164 62.2 -0.1 6468 0.9 1.4
3 0.76 6.4 136 61.9 9 5714 8.3 1.92
3 1.04 8.6 204 61 3.5 6650 0 2.116
4 1.1 9.2 245 57 3.3 13082 0 0.309
4 0.75 7.5 173 51.5 6.5 17441 0 0.768
4 1.16 9.9 252 56 9.2 15991 0 0.62
Means
Cluster 1 1.21 12.5 128 55.5 1.7 10163 3.2 0.874
Cluster 2 1.13 10.9 183 55.1 3.1 6546 33.2 1.065
Cluster 3 1.02 9.1 171 62.9 3.7 6526 1.8 1.797
Cluster 4 1.00 8.9 223 54.8 6.3 15505 0.0 0.566
51
Affinity
Association Rules (Market Basket Analysis)

[Data: binary purchase matrix for 2000 customers; columns include ChildBks, DoItYBks, CookBks, GeogBks, RefBks, ArtBks, Florence, ItalCook, ItalArt, ItalAtlas]
0 1 0 1 0 0 1 0 0 0 0
1 0 0 0 0 0 0 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0
1 1 1 0 1 0 1 0 0 0 0
0 0 1 0 0 0 1 0 0 0 0
1 0 0 0 0 1 0 0 0 0 1
0 1 0 0 0 0 0 0 0 0 0
0 1 0 0 1 0 0 0 0 0 0
1 0 0 1 0 0 0 0 0 0 0
1 1 1 0 0 0 1 0 0 0 0
0 0 0 0 0 0 0 0 0 0 0
0 0 1 0 0 0 1 0 0 0 0
1 0 0 0 0 1 0 0 0 0 1
1 1 0 1 1 1 0 0 1 1 0
1 1 1 0 0 0 0 0 0 0 0
1 1 1 0 0 0 1 0 0 0 0
0 0 1 0 0 0 0 0 0 0 0
0 0 1 0 0 0 0 0 0 0 0
1 1 1 1 1 1 1 0 0 0 0
1 1 1 0 0 1 0 0 0 0 1
53
XLMiner : Association Rules
Data
Input Data Sheet1!$A$1:$K$2001
Data Format Binary Matrix
Min. Support 200
Min. Conf. % 70
# Rules 19
Rule # Conf. % Antecedent (a) Consequent (c) Support(a) Support(c) Support(a U c) Lift Ratio
55
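The columns of the rule report are related as follows: Conf % = support(a ∪ c) / support(a), and Lift Ratio = confidence / (support(c) / n). A sketch with hypothetical counts (the rule and all numbers below are invented for illustration, over the 2000 baskets of the example):

```python
def rule_metrics(support_a, support_c, support_ac, n_baskets):
    """Confidence and lift for the rule a => c, from raw support counts."""
    confidence = support_ac / support_a
    expected   = support_c / n_baskets     # P(c) under independence
    lift       = confidence / expected
    return confidence, lift

# Hypothetical rule over 2000 baskets: antecedent bought 400 times,
# consequent 800 times, both together 320 times.
conf, lift = rule_metrics(support_a=400, support_c=800,
                          support_ac=320, n_baskets=2000)
# conf = 0.8 (80%), lift = 0.8 / 0.4 = 2.0
```

A lift ratio above 1 means the consequent is bought more often with the antecedent than its overall frequency would predict; the run above kept only rules with confidence of at least 70% and support of at least 200 baskets.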
Simple Random Sampling
56
Stratified Random Sampling
57
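Stratified random sampling draws the same fraction from each stratum, so rare classes keep their share of the sample. A sketch on synthetic data (90 class-0 and 10 class-1 records, 20% sample):

```python
import random
from collections import defaultdict

def stratified_sample(records, key, frac, seed=0):
    """Sample the same fraction from each stratum defined by key(record)."""
    rng = random.Random(seed)
    strata = defaultdict(list)
    for r in records:
        strata[key(r)].append(r)
    out = []
    for members in strata.values():
        k = max(1, round(frac * len(members)))   # at least one per stratum
        out.extend(rng.sample(members, k))
    return out

# 90 class-0 records and 10 class-1 records; a 20% stratified sample
# keeps the 9:1 class balance.
data = ([("r%d" % i, 0) for i in range(90)] +
        [("r%d" % i, 1) for i in range(90, 100)])
sample = stratified_sample(data, key=lambda r: r[1], frac=0.2)
```

A variant often used for classification oversamples the rare class instead, taking equal counts per stratum rather than equal fractions.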
Scoring to databases and worksheets
58
Binning continuous variables
59
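One common way to bin a continuous variable (the box plots later use a Binned_TAX variable with 5 bins) is equal-frequency binning: rank the values and give each fifth of the ranks a bin number. A sketch on a few made-up TAX values:

```python
def equal_frequency_bins(values, n_bins):
    """Assign each value a bin number 1..n_bins with roughly equal counts."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    bins = [0] * len(values)
    for rank, i in enumerate(order):
        bins[i] = rank * n_bins // len(values) + 1
    return bins

# Made-up TAX values; 5 bins of 2 values each.
tax = [296, 242, 222, 311, 307, 666, 403, 437, 187, 711]
binned = equal_frequency_bins(tax, 5)
```

Equal-width binning (cutting the range into equal intervals) is the other standard choice; equal-frequency is usually preferred for skewed variables like TAX.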
Missing Data
60
Graphics: Boston Housing data
[Charts: histogram of AGE (Frequency) and box plot of AGE (Y Values), AGE 0-100.]
61
Box Plot and Histogram
[Charts: box plot of RM (Y Values 0-10) and histogram of RM (Frequency, bins 3-9.6).]
62
Matrix Plot
[Scatterplot matrix of variables including RM and AGE. Annotation: High-tax towns have fewer rooms on average?]
63
RM Box Plot
[Box plots of RM (Y Values 0-10) by Binned_TAX bin 1-5.]
64
Future Extensions
Cross Validation
Bootstrap, Bagging and Boosting
Error-based clustering
Time Series and Sequences
Support Vector Machines
Collaborative Filtering
65
In Conclusion
XLMiner is a modern tool-belt for data mining. It is an affordable, easy-to-use tool for consultants, MBAs, and business analysts to learn, create, and deploy data mining methods.
More generally, XLMiner is a tool for data analysis in Excel that uses classical and modern, computationally intensive techniques.
66