Data Mining vs. Statistics: Pavel Brusilovsky
Objectives
• Intro to Data Mining
Data Mining
• Data Mining
– is a cutting-edge technology for analyzing diverse,
multidisciplinary, and multidimensional complex data
What is the Taxonomy of Data Mining?
• Data mining taxonomy, based on application
– Data Mining
– Text Mining
– Web Mining
– Image Mining…
Source: http://www.knowledgetechnologies.org/proceedings/presentations/treloar/nathantreloar.ppt
Example: Amazon.com purchase suggestion
Amazon.com increased sales by 15% using purchase suggestions
generated by data/text mining
Data Mining and Related Fields
What are Data Mining Myths?
• Myth 1: Data mining automatically discovers hidden patterns in your
data
• Myth 2: Data mining is designed for business analysts who are not
professionals in quantitative fields
What are the logical steps of Data Mining?
SEMMA methodology (SAS Enterprise Miner)
• The core process of conducting a data mining study includes the following
steps (SEMMA):
– Sample
– Explore
– Modify
– Model
– Assess
• SEMMA is a logical organization of the functional tool set of SAS
Enterprise Miner for carrying out the core tasks of data mining
• SEMMA is focused on the model development aspects of data mining
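The five SEMMA steps can be illustrated with a minimal sketch in plain Python. This is not SAS Enterprise Miner code: the synthetic data, the function name `semma_demo`, and the choice of a one-input least-squares model are all assumptions made only for illustration.

```python
import math
import random
import statistics


def semma_demo(seed=0, n_population=1000, n_sample=200):
    """Walk through Sample, Explore, Modify, Model, Assess on synthetic data."""
    rng = random.Random(seed)

    # Hypothetical population (an assumption): y = 2x + 1 plus Gaussian noise.
    population = []
    for _ in range(n_population):
        x = rng.uniform(0.0, 10.0)
        population.append((x, 2.0 * x + 1.0 + rng.gauss(0.0, 1.0)))

    # Sample: draw a random subset, then split it into training and validation.
    sample = rng.sample(population, n_sample)
    split = n_sample * 7 // 10
    train, valid = sample[:split], sample[split:]

    # Explore: summary statistics of the training input.
    train_x = [x for x, _ in train]
    train_y = [y for _, y in train]
    x_mean = statistics.mean(train_x)
    x_std = statistics.stdev(train_x)

    # Modify: standardize the input using the training statistics only.
    def standardize(x):
        return (x - x_mean) / x_std

    z = [standardize(x) for x in train_x]

    # Model: closed-form ordinary least squares with a single input.
    z_bar = statistics.mean(z)
    y_bar = statistics.mean(train_y)
    sxy = sum((zi - z_bar) * (yi - y_bar) for zi, yi in zip(z, train_y))
    sxx = sum((zi - z_bar) ** 2 for zi in z)
    slope = sxy / sxx
    intercept = y_bar - slope * z_bar

    # Assess: root mean squared error on the held-out validation part.
    squared_errors = [(intercept + slope * standardize(x) - y) ** 2
                      for x, y in valid]
    rmse = math.sqrt(statistics.mean(squared_errors))

    return {"slope": slope, "intercept": intercept, "rmse": rmse}


result = semma_demo()
print(result)
```

The validation RMSE should land near the noise level built into the synthetic data, which is the kind of check the Assess step is meant to provide.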
CRoss-Industry Standard Process for Data
Mining (CRISP-DM)
SPSS Clementine
Six phases of CRISP-DM:
1. Business understanding
2. Data understanding
3. Data preparation
4. Modeling
5. Evaluation
6. Model deployment
www.crisp-dm.org
Statistics vs. Data Mining: Concepts
Feature | Statistics | Data Mining
Type of problem | Well structured | Unstructured / semi-structured
Inference role | Explicit inference plays a great role in any analysis | No explicit inference
Objective of the analysis and data collection | First objective formulation, then data collection | Data rarely collected for the objective of the analysis/modeling
Size of data set | Data set is small and hopefully homogeneous | Data set is large and heterogeneous
Paradigm/approach | Theory-based (deductive) | Synergy of theory-based and heuristic-based approaches (inductive)
Signal-to-noise ratio (STNR) | STNR > 3 | 0 < STNR <= 3
Type of analysis | Confirmative | Explorative
Number of variables | Small | Large
Statistics vs. Data Mining: Regression Modeling
Feature | Statistics | Data Mining
Number of inputs | Small | Large
Type of inputs | Interval scaled and categorical with a small number of categories (percentage of categorical variables is small) | Any mixture of interval scaled, categorical, and text variables
Multicollinearity | Wide range of degree of multicollinearity, with intolerance to multicollinearity | Severe multicollinearity is always there, with tolerance to multicollinearity
Distributional assumptions, homoscedasticity, outliers, missing values | Intolerance to violation of distributional assumptions and homoscedasticity, and to outliers/leverage points and missing values | Tolerance to distributional assumption violation, outliers/leverage points, and missing values
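The multicollinearity row can be made concrete with a small sketch. The slides do not say how multicollinearity is measured; the variance inflation factor (VIF) used here is one common diagnostic, chosen as an assumption. With exactly two inputs it reduces to 1 / (1 - r^2), where r is their Pearson correlation. The data and function names are hypothetical.

```python
import math
import random


def pearson_r(xs, ys):
    """Pearson correlation coefficient of two equal-length sequences."""
    n = len(xs)
    mx = sum(xs) / n
    my = sum(ys) / n
    sxy = sum((x - mx) * (y - my) for x, y in zip(xs, ys))
    sxx = sum((x - mx) ** 2 for x in xs)
    syy = sum((y - my) ** 2 for y in ys)
    return sxy / math.sqrt(sxx * syy)


def vif_two_inputs(x1, x2):
    """VIF for a model with exactly two inputs: 1 / (1 - r^2)."""
    r = pearson_r(x1, x2)
    return 1.0 / (1.0 - r * r)


rng = random.Random(42)
x1 = [rng.gauss(0.0, 1.0) for _ in range(500)]

# Case 1: a second input generated independently of x1.
x2_independent = [rng.gauss(0.0, 1.0) for _ in range(500)]

# Case 2: a second input that is x1 plus a little noise (nearly redundant).
x2_collinear = [a + rng.gauss(0.0, 0.1) for a in x1]

vif_low = vif_two_inputs(x1, x2_independent)
vif_high = vif_two_inputs(x1, x2_collinear)
print(f"independent inputs: VIF = {vif_low:.2f}")
print(f"nearly redundant inputs: VIF = {vif_high:.2f}")
```

A VIF close to 1 means the input carries information not found in the other input; a large VIF signals the near-redundant inputs that the Data Mining column describes as always present.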
What are the differences between Data/Text
Mining and Statistics?
• Statistical analysis is designed to deal with structured data in order to
solve structured problems:
– Results are software and researcher independent
– Inference reflects statistical hypothesis testing
• Data mining is designed to deal with structured data in order to solve
unstructured business problems
– Results are software and researcher dependent (absence of
implementation standards)
– Inference reflects the computational properties of the data mining
algorithm at hand
• Text mining is designed to deal with unstructured data in order to
solve unstructured problems
– Results are software and researcher dependent
– Inference reflects the computational properties and visualization
capability of the text mining algorithm at hand
When is data mining technology
appropriate?
• Data mining technology is appropriate if:
– The business problem is unstructured
– Accurate prediction is more important than the explanation
– The data include a mixture of interval, nominal, ordinal, count,
and text variables, and the role and the number of non-numeric
variables are essential
– Among those variables there are a lot of irrelevant and redundant
attributes
– The relationship among variables could be non-linear with
uncharacterizable nonlinearities
– The data are highly heterogeneous with a large percentage of
outliers, leverage points, and missing values
– The sample size is relatively large
What is the Breiman Uncertainty Principle?
• Breiman uncertainty principle:
Accuracy * Interpretability = Breiman’s constant
What are the great Data Mining Ideas?
• Injecting randomness into the function estimation procedure
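One well-known instance of injecting randomness into function estimation is bagging (bootstrap aggregating): fit the same simple estimator on many random resamples of the data and average the predictions. The slide does not say which procedure it has in mind, so the sketch below is an assumption, with hypothetical data and function names, using one-split regression stumps as the base estimator.

```python
import random


def fit_stump(points):
    """Fit a one-split regression stump to (x, y) pairs by least squares."""
    pts = sorted(points)
    best = None
    for i in range(1, len(pts)):
        if pts[i - 1][0] == pts[i][0]:
            continue  # no threshold can separate identical x values
        left = [y for _, y in pts[:i]]
        right = [y for _, y in pts[i:]]
        left_mean = sum(left) / len(left)
        right_mean = sum(right) / len(right)
        sse = (sum((y - left_mean) ** 2 for y in left)
               + sum((y - right_mean) ** 2 for y in right))
        if best is None or sse < best[0]:
            threshold = (pts[i - 1][0] + pts[i][0]) / 2
            best = (sse, threshold, left_mean, right_mean)
    if best is None:
        # All x values identical: fall back to predicting the overall mean.
        overall = sum(y for _, y in pts) / len(pts)
        return (float("inf"), overall, overall)
    return best[1:]


def predict_stump(stump, x):
    threshold, left_mean, right_mean = stump
    return left_mean if x < threshold else right_mean


def fit_bagged_stumps(points, n_models, rng):
    """Fit each stump on a bootstrap resample (the injected randomness)."""
    models = []
    for _ in range(n_models):
        resample = [rng.choice(points) for _ in points]
        models.append(fit_stump(resample))
    return models


def predict_bagged(models, x):
    """Average the predictions of all stumps."""
    return sum(predict_stump(m, x) for m in models) / len(models)


rng = random.Random(0)
# Hypothetical data (an assumption): y = x plus Gaussian noise on [0, 1].
data = [(x, x + rng.gauss(0.0, 0.1))
        for x in (rng.random() for _ in range(60))]

models = fit_bagged_stumps(data, n_models=25, rng=rng)
low = predict_bagged(models, 0.1)
high = predict_bagged(models, 0.9)
print(f"bagged prediction at x=0.1: {low:.3f}")
print(f"bagged prediction at x=0.9: {high:.3f}")
```

A single stump can output only two values; averaging stumps whose thresholds differ across resamples gives a smoother, more stable estimate, at the cost of a model that is harder to read.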
What are the best data mining tools?
• Salford Systems’ Tools (CART, Random Forest, MARS, TreeNet)
• SPSS Clementine
Reference (Data Mining)