Turban ch05
Business Intelligence
Systems
(9th Ed., Prentice Hall)
Chapter 5:
Data Mining for Business
Intelligence
Learning Objectives
n Define data mining as an enabling technology
for business intelligence
n Understand the objectives and benefits of
business analytics and data mining
n Recognize the wide range of applications of
data mining
n Learn the standardized data mining processes
n CRISP-DM,
n SEMMA,
n KDD, …
5-2 Copyright © 2011 Pearson Education, Inc. Publishing as Prentice Hall
Learning Objectives
n Understand the steps involved in data
preprocessing for data mining
n Learn different methods and algorithms of data
mining
n Build awareness of the existing data mining
software tools
n Commercial versus free/open source
n Understand the pitfalls and myths of data
mining
n Problem
n Proposed solution
n Results
A Typical Classification

Dependent Variable: Range (in $Millions)
< 1 (Flop)
> 1, < 10
> 10, < 20
> 20, < 40
> 40, < 65
> 65, < 100
> 100, < 150
> 150, < 200
> 200 (Blockbuster)

Independent Variables
Independent Variable   Number of Values   Possible Values
MPAA Rating            5                  G, PG, PG-13, R, NR
Competition            3                  High, Medium, Low
Star value             3                  High, Medium, Low
Genre                  10                 Sci-Fi, Historic Epic Drama, Modern Drama, Politically Related, Thriller, Horror, Comedy, Cartoon, Action, Documentary
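As a rough illustration of how the dollar ranges above turn a numeric amount into a class label, here is a minimal Python sketch. The function name `range_class`, the 1-to-9 class numbering, and the choice to place an amount that falls exactly on a boundary into the higher range are all assumptions made for this example; the slide does not specify them.

```python
from bisect import bisect_right

# Boundaries (in $ millions) between the nine ranges listed on the slide.
BOUNDS = [1, 10, 20, 40, 65, 100, 150, 200]

LABELS = [
    "< 1 (Flop)", "> 1, < 10", "> 10, < 20", "> 20, < 40", "> 40, < 65",
    "> 65, < 100", "> 100, < 150", "> 150, < 200", "> 200 (Blockbuster)",
]

def range_class(amount_millions):
    """Return (class_number, label) for an amount given in $ millions.

    Assumption: class numbers run from 1 to 9, and an amount exactly equal
    to a boundary is assigned to the higher range.
    """
    index = bisect_right(BOUNDS, amount_millions)
    return index + 1, LABELS[index]

if __name__ == "__main__":
    for amount in (0.5, 50, 250):
        print(amount, range_class(amount))
```

Discretizing the dependent variable this way is what makes the task a classification problem rather than a numeric prediction problem.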
[Figure: The DM process map in PASW; model assessment process]
[Figure: Data mining at the intersection of statistics, artificial intelligence, pattern recognition, machine learning, mathematical modeling, and databases]
n Types of patterns
n Association
n Prediction
n Cluster (segmentation)
n Sequential (or time series) relationships
A Taxonomy for Data Mining Tasks
[Table columns: Data Mining; Learning Method; Popular Algorithms]
n Types of DM
n Hypothesis-driven data mining
n Discovery-driven data mining
n Insurance
n Forecast claim costs for better business planning
n Determine optimal rate plans
n Optimize marketing to specific customers
n Identify and prevent fraudulent claim activities
Data Mining Applications (cont.)
Highly popular application areas for data mining:
n Computer hardware and software
n Science and engineering
n Government and defense
n Homeland security and law enforcement
n Travel industry
n Healthcare
n Medicine
n Entertainment industry
n Sports
n Etc.
Data Mining Process
n A manifestation of best practices
n A systematic way to conduct DM projects
n Different groups have different versions
n Most common standard processes:
n CRISP-DM (Cross-Industry Standard Process
for Data Mining)
n SEMMA (Sample, Explore, Modify, Model,
and Assess)
n KDD (Knowledge Discovery in Databases)
Data Mining Process
[Figure: six-step process cycle arranged around the Data Sources]
1. Business Understanding
2. Data Understanding
3. Data Preparation
4. Model Building
5. Testing and Evaluation
6. Deployment
Data Consolidation
n Collect data
n Select data
n Integrate data
Data Transformation
n Normalize data
n Discretize/aggregate data
n Construct new attributes
Result: Well-formed Data
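The three transformation steps named above (normalize, discretize, construct new attributes) can be sketched in a few lines of Python. This is only an illustrative sketch: min-max scaling is one of several possible normalization methods, and the helper names, cut points, and sample values are invented for the example.

```python
from bisect import bisect_right

def min_max_normalize(values):
    """Rescale values linearly to the range [0, 1]."""
    lo, hi = min(values), max(values)
    if hi == lo:
        # All values identical: map everything to 0.0 to avoid dividing by zero.
        return [0.0 for _ in values]
    return [(v - lo) / (hi - lo) for v in values]

def discretize(values, cut_points, labels):
    """Replace each number with the label of the interval it falls in.

    cut_points must be sorted, and labels needs len(cut_points) + 1 entries.
    """
    return [labels[bisect_right(cut_points, v)] for v in values]

def construct_ratio(numerators, denominators):
    """Build a new attribute as the ratio of two existing attributes."""
    return [n / d if d else None for n, d in zip(numerators, denominators)]

if __name__ == "__main__":
    income = [20, 50, 80]   # hypothetical attribute
    spend = [10, 25, 20]    # hypothetical attribute
    print(min_max_normalize(income))
    print(discretize(income, [30, 60], ["low", "mid", "high"]))
    print(construct_ratio(spend, income))
```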
SEMMA
n Sample
n Explore (Visualization and basic description of the data)
n Modify (Select variables, transform variable representations)
n Model (Use variety of statistical and machine learning models)
n Assess (Evaluate the accuracy and usefulness of the models)
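                     Actual Positive              Actual Negative
Predicted Positive   True Positive Count (TP)     False Positive Count (FP)
Predicted Negative   False Negative Count (FN)    True Negative Count (TN)

True Positive Rate = TP / (TP + FN)
True Negative Rate = TN / (TN + FP)
Precision = TP / (TP + FP)
Recall = TP / (TP + FN)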
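The rates above follow directly from the four confusion-matrix counts. A minimal Python sketch, assuming the counts are already tallied; the function name and the example counts are made up for illustration:

```python
def classification_metrics(tp, fp, fn, tn):
    """Compute the slide's metrics from confusion-matrix counts."""
    def ratio(numerator, denominator):
        # Return NaN instead of raising when a denominator is zero.
        return numerator / denominator if denominator else float("nan")

    return {
        "true_positive_rate": ratio(tp, tp + fn),  # identical to recall
        "true_negative_rate": ratio(tn, tn + fp),
        "precision": ratio(tp, tp + fp),
        "recall": ratio(tp, tp + fn),
    }

if __name__ == "__main__":
    # Hypothetical counts, not taken from the slides.
    for name, value in classification_metrics(tp=40, fp=10, fn=20, tn=30).items():
        print(f"{name}: {value:.3f}")
```

Note that recall and the true positive rate are the same quantity under two names.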
[Figure: preprocessed data with 1/3 set aside as testing data; the classifier is applied to the testing data in model assessment (scoring) to obtain prediction accuracy]
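The split shown above, with 1/3 of the preprocessed data held out for testing, might be sketched as follows. The assumption that the remaining 2/3 is used for training, the random shuffling, and the function names are not taken from the slide; they are illustrative choices.

```python
import random

def holdout_split(records, test_fraction=1 / 3, seed=0):
    """Shuffle the records and hold out a fraction of them for testing.

    Returns (training_records, testing_records).
    """
    shuffled = list(records)
    random.Random(seed).shuffle(shuffled)  # fixed seed keeps the split repeatable
    n_test = round(len(shuffled) * test_fraction)
    return shuffled[n_test:], shuffled[:n_test]

def accuracy(predictions, actuals):
    """Fraction of testing records whose predicted label matches the actual one."""
    correct = sum(1 for p, a in zip(predictions, actuals) if p == a)
    return correct / len(actuals)

if __name__ == "__main__":
    train, test = holdout_split(range(9))
    print(len(train), len(test))
    print(accuracy(["yes", "no", "yes"], ["yes", "yes", "yes"]))
```

Scoring the classifier only on records it never saw during model building is what makes the resulting accuracy a fair estimate.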
[Figure: curves A, B, and C; vertical axis: True Positive Rate (Sensitivity); both axes run from 0 to 1]
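Raw transaction data
Transaction No   Items
1                1, 2, 3, 4
1                2, 3, 4
1                2, 3
1                1, 2, 4
1                1, 2, 3, 4
1                2, 4

One-item itemsets
Itemset   Support
1         3
2         6
3         4
4         5

Two-item itemsets
Itemset   Support
1, 2      3
1, 3      2
1, 4      3
2, 3      4
2, 4      5
3, 4      3

Three-item itemsets
Itemset   Support
1, 2, 4   3
2, 3, 4   3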
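The support counts in the itemset example above can be reproduced by counting, for each candidate itemset, how many transactions contain all of its items. The sketch below uses plain brute-force enumeration rather than an optimized algorithm, and the minimum support of 3 is an assumption inferred from which three-item itemsets appear in the example.

```python
from itertools import combinations

# The six transactions from the example (items are numbered 1 to 4).
TRANSACTIONS = [
    {1, 2, 3, 4}, {2, 3, 4}, {2, 3}, {1, 2, 4}, {1, 2, 3, 4}, {2, 4},
]

def frequent_itemsets(transactions, min_support):
    """Return {itemset_tuple: support} for itemsets meeting min_support.

    Support is the number of transactions that contain every item in the set.
    """
    items = sorted(set().union(*transactions))
    frequent = {}
    for size in range(1, len(items) + 1):
        found_any = False
        for candidate in combinations(items, size):
            support = sum(1 for t in transactions if set(candidate) <= t)
            if support >= min_support:
                frequent[candidate] = support
                found_any = True
        if not found_any:
            # No frequent itemset of this size, so no larger one can be frequent.
            break
    return frequent

if __name__ == "__main__":
    for itemset, support in frequent_itemsets(TRANSACTIONS, min_support=3).items():
        print(itemset, support)
```

With a minimum support of 3, the itemset (1, 3) drops out because only two transactions contain both items, which is why no three-item itemset containing both 1 and 3 survives.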
Software
n Commercial
n StatSoft – Statistical Data Miner
n RapidMiner…
[Figure: bar chart comparing data mining software tools, with bars for "Total (w/ others)" and "Alone" on an axis from 0 to 120. Tools listed: RapidMiner, SAS / SAS Enterprise Miner, Microsoft Excel, KXEN, MATLAB, (… Clementine), KNIME, Zementis, Statsoft Statistica, Salford CART, Mars, other, Orange, Angoss, Bayesia, Megaputer, Viscovery, Miner3D, Thinkanalytics]
Source: KDNuggets.com, May 2009
Data Mining Myths
n Data mining …
n provides instant solutions/predictions
n is not yet viable for business applications
n requires a separate, dedicated database
n can only be done by those with advanced
degrees
n is only for large firms that have lots of
customer data
n is just another name for good old statistics
Common Data Mining Mistakes
1. Selecting the wrong problem for data mining
2. Ignoring what your sponsor thinks data
mining is and what it really can/cannot do
3. Leaving insufficient time for data
acquisition, selection and preparation
4. Looking only at aggregated results and not
at individual records/predictions
5. Being sloppy about keeping track of the data
mining procedure and results
n Questions / Comments…