DBMS Support of The Data Mining
[Figure: data flow for mining inside the DBMS. Recoverable labels: Databases; Data Cleaning; Data Integration; Warehouse; Data Selection; Task-relevant Data; Data preprocessing; Define a model; Train the model (Training Data); Data Mining Management System (DMMS); Mining Model]
[Table: algorithms (columns, names not recoverable) vs. data mining tasks (rows); a check mark means an algorithm supports the task]
Classification √ √ √ √ √ √
Regression √ √ √ √ √
Segmentation √ √ √
Assoc. Analysis √ √ √ √ √ √
Anomaly Detect. √ √ √
Seq. Analysis √
Time series √
Data Mining Language
New challenges in data mining API
Large spectrum of applications: embedded to interactive BI
Interoperability between different DM providers (engines) and DM consumers (tools)
Data independence between content representation (trees, attributes, networks, etc.) and data mining task (prediction, scoring, etc.)
Requirements:
Algorithm-neutral
Task-oriented (specification of what we need, rather than how to)
Vendor-neutral
Flexible, extensible, declarative/self-contained
Sound familiar?
Yes, SQL
DMX Approach
Data Mining Extensions (DMX) to SQL
Table vs. Mining Model
            TABLE                       MINING MODEL
schema      Column definition           Attribute (variable) definition
contains    Rows                        Patterns, knowledge, cases
operations  DDL (create, drop, alter)   Create/drop/alter a model
            DML (insert, delete)        Train (populate) a model
            Query (select)              Prediction/browsing a model
Typical DM Process Using DMX
Define a model:
CREATE MINING MODEL ….
Train a model:
INSERT INTO dmm ….
[Figure labels: Training Data; Data Mining Management System (DMMS)]
SELECT FLATTENED
  (SELECT $Sequence,
     TopCount(PredictHistogram(Page), $Probability, 5)
   FROM PredictSequence(WebSeqModel.PageSeq, 2)
  )
FROM WebSeqModel NATURAL PREDICTION JOIN
  (SELECT
     (SELECT 1 AS SeqID, 'home' AS Page UNION
      SELECT 2 AS SeqID, 'news' AS Page) AS PageSeq
  ) AS t
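The define, train, and query cycle above can be mimicked outside a DBMS to make the table analogy concrete. The following is only a rough Python sketch: `ToyMiningModel`, its methods, and the sample cases are hypothetical names and made-up data for illustration, the voting scheme is a deliberately simple stand-in for a real algorithm, and none of it is part of DMX.

```python
from collections import Counter, defaultdict


class ToyMiningModel:
    """Hypothetical stand-in for a mining model; not a DMX API."""

    def __init__(self, input_columns, predict_column):
        # Analogue of CREATE MINING MODEL: fix the attributes, hold no patterns yet.
        self.input_columns = list(input_columns)
        self.predict_column = predict_column
        self.counts = defaultdict(Counter)   # (column, value) -> label counts
        self.overall = Counter()             # label counts over all cases

    def insert(self, cases):
        # Analogue of INSERT INTO: training consumes cases and keeps only counts.
        for case in cases:
            label = case[self.predict_column]
            self.overall[label] += 1
            for column in self.input_columns:
                self.counts[(column, case[column])][label] += 1

    def predict(self, case):
        # Analogue of a prediction query: score a new case against the stored counts.
        if not self.overall:
            raise ValueError("model has not been trained")
        votes = Counter()
        for column in self.input_columns:
            votes.update(self.counts.get((column, case.get(column)), Counter()))
        if not votes:
            votes = self.overall  # unseen values: fall back to the majority label
        return votes.most_common(1)[0][0]


# Made-up training cases, for illustration only.
model = ToyMiningModel(["age_group", "owns_home"], "buys")
model.insert([
    {"age_group": "young", "owns_home": "no", "buys": "no"},
    {"age_group": "young", "owns_home": "yes", "buys": "yes"},
    {"age_group": "senior", "owns_home": "yes", "buys": "yes"},
    {"age_group": "senior", "owns_home": "no", "buys": "yes"},
])
known = model.predict({"age_group": "senior", "owns_home": "yes"})
unseen = model.predict({"age_group": "teen", "owns_home": "unknown"})
```

As in the table comparison, the model keeps patterns (here, label counts) rather than the training rows themselves.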
Time-Series Prediction
Model Definition
Mining methodology
Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
Performance: efficiency, effectiveness, and scalability
Pattern evaluation: the interestingness problem
Incorporation of background knowledge
Handling noise and incomplete data
Parallel, distributed and incremental mining methods
Integration of the discovered knowledge with existing knowledge: knowledge fusion
User interaction
Data mining query languages and ad-hoc mining
Expression and visualization of data mining results
Interactive mining of knowledge at multiple levels of abstraction
Applications and social impacts
Domain-specific data mining & invisible data mining
Protection of data security, integrity, and privacy
Data Mining Vendors
SAS (Enterprise Miner)
IBM (DB2 Intelligent Miner)
Oracle (ODM option to Oracle 10g)
SPSS (Clementine)
Insightful (Insightful Miner)
KXEN (Analytic Framework)
Prudsys (Discoverer and its family)
Microsoft (SQL Server 2005)
Angoss (KnowledgeServer and its family)
DBMiner (DBMiner)
Many others
Data Mining and Business Intelligence
[Figure: pyramid with increasing potential to support business decisions toward the top. Recoverable labels: Making Decisions (End User); Data Exploration: Statistical Analysis, Querying and Reporting]
respectively
pruning conditions (restrict by support, confidence, or size)
Stratified or correlated subqueries
MSQL
GetRules(Patients) R1
where Body has {Age = *}
and Support > .05 and Confidence > .7
and not exists ( GetRules(Patients) R2
where Support > .05 and
Confidence > .7
and R2.Body HAS R1.Body)
Retrieve all rules with descriptors of the form “Age = *” in the body, except when there is another rule with equal or greater support and confidence whose body contains a superset of those descriptors.
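The nested query can be read as a filter over an already-mined rule set. Here is a minimal Python sketch of that reading, not MSQL itself: the `Rule` class, `get_rules`, and the sample rules are hypothetical, and it assumes that `HAS` means a strict superset, so that a rule does not eliminate itself.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    body: frozenset      # set of (attribute, value) descriptors
    consequent: tuple    # a single (attribute, value) descriptor
    support: float
    confidence: float


def body_has_attribute(rule, attribute):
    """True if the body holds a descriptor of the form attribute = *."""
    return any(attr == attribute for attr, _ in rule.body)


def get_rules(rules, attribute, min_support, min_confidence):
    # Pruning conditions shared by the outer query and the subquery.
    candidates = [r for r in rules
                  if r.support > min_support and r.confidence > min_confidence]
    result = []
    for r1 in candidates:
        if not body_has_attribute(r1, attribute):
            continue
        # "not exists": drop r1 if some r2 has a strictly larger body
        # (assumption: HAS is treated as a strict superset here).
        subsumed = any(r2.body > r1.body for r2 in candidates)
        if not subsumed:
            result.append(r1)
    return result


# Made-up rules, for illustration only.
rules = [
    Rule(frozenset({("Age", "30-40")}), ("Disease", "flu"), 0.10, 0.80),
    Rule(frozenset({("Age", "30-40"), ("Sex", "F")}), ("Disease", "flu"), 0.08, 0.85),
    Rule(frozenset({("Age", "50-60")}), ("Disease", "flu"), 0.06, 0.75),
    Rule(frozenset({("Sex", "M")}), ("Disease", "cold"), 0.20, 0.90),
    Rule(frozenset({("Age", "20-30")}), ("Disease", "cold"), 0.02, 0.95),
]
selected = get_rules(rules, "Age", 0.05, 0.7)
```

With this sample, the first rule is dropped because the second rule's body is a superset of its body, and the last rule fails the support threshold.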
MSQL
Correlated:
GetRules(C) R1
where <pruning-conds>
and not exists ( GetRules(C) R2
where <same pruning-conds>
and R2.Body HAS R1.Body)
Stratified:
GetRules(C) R1
where <pruning-conds>
and consequent is {(X=*)}
and consequent in (SelectRules(R2)
where consequent is {(X=*)})
MSQL
Nested Get-Rules Queries and their optimization
Stratified (non-correlated) queries are evaluated “bottom-up”: the subquery is evaluated first and replaced with its results in the outer query.
Correlated queries are evaluated either top-down or bottom-up (like “loop-unfolding”), and there are rules for choosing between the two options.
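The difference between the two evaluation orders can be sketched in a few lines of Python. This is an illustrative sketch under stated assumptions, not the MSQL optimizer: rules are plain `(body, consequent)` pairs, the function names and sample data are hypothetical, `HAS` is assumed to mean a strict superset, and only the top-down strategy is shown for the correlated case.

```python
def stratified_bottom_up(outer, inner, outer_cond, inner_cond):
    """Evaluate the subquery once, then substitute its result into the outer query."""
    inner_consequents = {cons for body, cons in inner if inner_cond(body, cons)}
    return [(body, cons) for body, cons in outer
            if outer_cond(body, cons) and cons in inner_consequents]


def correlated_top_down(rules, cond):
    """Re-evaluate the subquery for every outer rule R1 that passes the conditions."""
    result = []
    for body1, cons1 in rules:
        if not cond(body1, cons1):
            continue
        # The subquery depends on R1 (through body1), so it runs once per outer rule.
        exists = any(cond(body2, cons2) and body2 > body1
                     for body2, cons2 in rules)
        if not exists:
            result.append((body1, cons1))
    return result


# Made-up rules, for illustration only.
rules = [
    (frozenset({("Age", "30-40")}), ("Disease", "flu")),
    (frozenset({("Age", "30-40"), ("Sex", "F")}), ("Disease", "flu")),
    (frozenset({("Sex", "M")}), ("Disease", "cold")),
]
other_rules = [(frozenset({("Smoker", "yes")}), ("Disease", "flu"))]


def is_disease(body, cons):
    return cons[0] == "Disease"


stratified_result = stratified_bottom_up(rules, other_rules, is_disease, is_disease)
correlated_result = correlated_top_down(rules, is_disease)
```

In the stratified case the inner result is computed a single time because it does not mention R1; in the correlated case the inner test refers to the current outer rule, which is why it cannot simply be precomputed in this form.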
MSQL
Top-Down Evaluation
GetRules(Patients)
where Body has {Age = *}
and Support > .05 and Confidence > .7