
DBMS Support for Data Mining

Advisor : S.-Y. Hwang Ph.D


D954020005 Tsung-Hsien Yang
D954020006 Shi-Hwao Wang
1/22/2008
Agenda
 Introduction to Data Mining
 The Promise of Data Mining
 KDD Process
 Data Mining Algorithms
 Data Mining Modeling and Language
 Conclusion
Introduction to Data Mining
 The Explosive Growth of Data: from terabytes to petabytes
 Major sources of abundant data
 Business: Web, e-commerce, transactions, stocks, …
 Science: Remote sensing, bioinformatics, scientific simulation, …
 Society and everyone: news, digital cameras, YouTube
 Data collection and data availability
 Automated data collection tools, database systems, Web, computerized society
What Is Data Mining?
 Data mining: Discovering interesting patterns from large amounts of data
 Data mining (knowledge discovery from data)
 Extraction of interesting (non-trivial, implicit, previously unknown and
potentially useful) patterns or knowledge from huge amount of data
 Alternative names
 Knowledge discovery (mining) in databases (KDD), knowledge extraction,
data/pattern analysis, data archeology, data dredging, information harvesting,
business intelligence, etc.
 Watch out: Is everything “data mining”?
 Simple search and query processing
 (Deductive) expert systems
The Promise of Data Mining
 Database analysis and decision support
 Market analysis and management
 Target marketing, customer relationship management, market basket analysis, cross selling, market segmentation
 Risk analysis and management
 Forecasting, customer retention, improved underwriting, quality control, competitive analysis
 Fraud detection and management
 Other Applications
 Text mining (news group, email, documents) and Web analysis.
Knowledge Discovery (KDD) Process

 Data mining: core of the knowledge discovery process
[Figure: the KDD process. Databases → Data Cleaning and Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation]
[Figure: mining workflow inside the Data Mining Management System (DMMS)]
Data preprocessing
Define a model
Train the model (input: Training Data)
Test the model (input: Test Data)
Prediction using the model (input: Prediction Input Data; uses the Mining Model)

Data Mining Algorithms
 Decision Trees
 Naïve Bayesian
 Clustering
 Sequence Clustering
 Association Rules
 Neural Network
 Time Series
 Support Vector Machines
 ….
Data Mining Function
 Classification (attribute)
 Estimation (regression)
 Prediction (time series)
 Association (cross selling)
 Clustering (segmentation)
Data Mining Algorithms
[Table: suitability of each algorithm (columns) for each task (rows); √ marks a first or second choice]
Columns: Decision Trees, Naïve Bayes, Clustering, Sequence Clustering, Association Rules, Neural Network, Time Series
Rows (number of marked algorithms): Classification (6), Regression (5), Segmentation (3), Assoc. Analysis (6), Anomaly Detect. (3), Seq. Analysis (1), Time series (1)
Data Mining Language
 New challenges in data mining API
 Large spectrum of applications: embedded to interactive BI
 Interoperability between different DM providers (engine) and DM
consumers (tools)
 Data independence between content representation (trees,
attributes, networks, etc) and data mining task (prediction, scoring,
etc)
 Requirements:
 Algorithm-neutral
 Task-oriented (specification of what we need, rather than how to)
 Vendor-neutral
 Flexible, extensible, declarative/self-contained
 Sound familiar?
 Yes, SQL
DMX Approach
 Data Mining Extensions (DMX) to SQL
 Table vs. Mining Model
              TABLE                        MINING MODEL
Schema        Column definition            Attribute (variable) definition
Contains      Rows                         Patterns, knowledge, cases
Operations    DDL (create, drop, alter)    Create/drop/alter a model
              DML (insert, delete)         Train (populate) a model
              Query (select)               Prediction/browsing a model
Typical DM Process Using DMX
Define a model:
CREATE MINING MODEL ….

Train a model (input: Training Data):
INSERT INTO dmm ….

Prediction using a model (input: Prediction Input Data):
SELECT …
FROM dmm PREDICTION JOIN …

[Figure: each step runs against the Data Mining Management System (DMMS), which holds the Mining Model]
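The three steps above can be read as a define/train/predict lifecycle. The following is a minimal Python sketch of that lifecycle, not the DMMS itself. It assumes a toy model that predicts the most frequent target value seen for each combination of input attributes. The class ToyMiningModel, its method names, and the sample cases are hypothetical and only mirror CREATE MINING MODEL, INSERT INTO, and PREDICTION JOIN.

```python
from collections import Counter, defaultdict


class ToyMiningModel:
    """Hypothetical stand-in for a mining model: schema first, patterns after training."""

    def __init__(self, inputs, target):
        # "CREATE MINING MODEL": only the attribute definitions exist so far.
        self.inputs = list(inputs)
        self.target = target
        self.counts = defaultdict(Counter)

    def insert(self, cases):
        # "INSERT INTO": cases are consumed to learn patterns, not stored as rows.
        for case in cases:
            key = tuple(case[a] for a in self.inputs)
            self.counts[key][case[self.target]] += 1

    def predict(self, row):
        # "PREDICTION JOIN": match an input row against the learned patterns.
        key = tuple(row[a] for a in self.inputs)
        seen = self.counts.get(key)
        if not seen:
            return None  # no pattern learned for this combination
        return seen.most_common(1)[0][0]


# Made-up training cases for illustration only.
model = ToyMiningModel(inputs=["Gender", "Encouragement"], target="CollegePlans")
model.insert([
    {"Gender": "F", "Encouragement": "High", "CollegePlans": "Yes"},
    {"Gender": "F", "Encouragement": "High", "CollegePlans": "Yes"},
    {"Gender": "M", "Encouragement": "Low", "CollegePlans": "No"},
])
prediction = model.predict({"Gender": "F", "Encouragement": "High"})
```

As with a mining model, the object holds learned patterns rather than the training rows themselves, which is the distinction the table-versus-model comparison draws.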
Defining a DM Model
 Defines
 Shape of “training cases” (top-level entity being modeled)
 Input/output attributes (variables): type, distribution
 Algorithms and parameters
 Example
CREATE MINING MODEL CollegePlanModel
(
StudentID LONG KEY,
Gender TEXT DISCRETE,
ParentIncome LONG NORMAL CONTINUOUS,
Encouragement TEXT DISCRETE,
CollegePlans TEXT DISCRETE PREDICT
) USING Microsoft_Decision_Trees
(complexity_penalty = 0.5)
Training a DM Model: Simple
INSERT INTO CollegePlanModel
(StudentID, Gender, ParentIncome,
Encouragement, CollegePlans)
OPENROWSET('<provider>', '<connection>',
'SELECT StudentID,
Gender,
ParentIncome,
Encouragement,
CollegePlans
FROM CollegePlansTrainData')
Prediction Using a DM Model
 PREDICTION JOIN
SELECT t.ID, CPModel.Plan
FROM CPModel PREDICTION JOIN OPENQUERY(…, 'SELECT * FROM NewStudents') AS t
ON CPModel.Gender = t.Gender AND
CPModel.IQ = t.IQ

[Figure: CPModel (ID, Gender, IQ, Plan) joined with NewStudents (ID, Gender, IQ)]


Classification
 Model Definition

CREATE MINING MODEL CPClass
(
StudentID LONG KEY,
Gender TEXT DISCRETE,
ParentIncome LONG CONTINUOUS,
Encouragement TEXT DISCRETE,
CollegePlans TEXT DISCRETE PREDICT
) USING Microsoft_Decision_Trees
Classification (cont)
 Find the new students whose predicted class (CollegePlans) is ‘Yes’ with confidence > 0.8
SELECT t.StudentID, PredictProbability(CPClass.CollegePlans)
FROM CPClass PREDICTION JOIN
OPENROWSET ('<provider>', '<connection>',
'SELECT * FROM NewStudents') AS t
ON t.Gender = CPClass.Gender AND
t.ParentIncome = CPClass.ParentIncome AND
t.Encouragement = CPClass.Encouragement
WHERE
CPClass.CollegePlans = 'Yes' AND
PredictProbability(CPClass.CollegePlans) > 0.8
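The WHERE clause above keeps only cases where the predicted class is ‘Yes’ and the model's probability for that prediction exceeds 0.8. A minimal Python sketch of that filter follows. It assumes the trained model has already produced a class distribution per student; the function name confident_yes and the sample distributions are hypothetical.

```python
def confident_yes(scored_rows, label="Yes", threshold=0.8):
    """Keep rows whose most likely class is `label` with probability above `threshold`."""
    result = []
    for student_id, distribution in scored_rows:
        # The predicted class is the one with the highest probability.
        predicted = max(distribution, key=distribution.get)
        probability = distribution[predicted]
        if predicted == label and probability > threshold:
            result.append((student_id, probability))
    return result


# Hypothetical class distributions, as a trained model might return them.
scored = [
    (1, {"Yes": 0.91, "No": 0.09}),
    (2, {"Yes": 0.65, "No": 0.35}),
    (3, {"Yes": 0.20, "No": 0.80}),
]
selected = confident_yes(scored)
```

Student 2 is predicted ‘Yes’ but falls below the threshold, and student 3 is predicted ‘No’, so only student 1 is returned.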
Regression
 Model Definition

CREATE MINING MODEL CustCredit
(
CustID LONG KEY,
Gender TEXT DISCRETE,
Age LONG CONTINUOUS REGRESSOR,
Income LONG CONTINUOUS REGRESSOR,
Credit DOUBLE CONTINUOUS PREDICT
) USING Microsoft_Decision_Trees
Regression (cont)
 Predict Credit score (and stdev) for the new customer data entered from the web form.
SELECT CustCredit.Credit, PredictStdev(CustCredit.Credit)
FROM CustCredit PREDICTION JOIN
(SELECT 'Female' AS Gender, 30 AS Age, 50000 AS Income) AS t
ON t.Gender = CustCredit.Gender AND
t.Age = CustCredit.Age AND
t.Income = CustCredit.Income
Segmentation
 Model Definition

CREATE MINING MODEL CPCluster
(
StudentID LONG KEY,
Gender TEXT DISCRETE,
ParentIncome LONG CONTINUOUS,
Encouragement TEXT DISCRETE,
CollegePlans TEXT DISCRETE
) USING Microsoft_Clustering
Segmentation (cont.)
 Find the cluster and its probability for each student
SELECT t.StudentID, $Cluster, ClusterProbability()
FROM CPCluster PREDICTION JOIN
OPENROWSET ('<provider>', '<connection>',
'SELECT * FROM NewStudents') AS t
ON t.Gender = CPCluster.Gender AND
t.ParentIncome = CPCluster.ParentIncome AND
t.Encouragement = CPCluster.Encouragement AND
t.CollegePlans = CPCluster.CollegePlans
Association Prediction
 Model Definition

CREATE MINING MODEL FavMovieModel (
ID LONG KEY,
MaritalStatus TEXT DISCRETE,
FavMovies TABLE PREDICT (
Title TEXT KEY
)
) USING Microsoft_Decision_Trees
Association Prediction (cont)
 As a web application, find the 5 best recommendations for a customer whose shopping cart contains ‘Star Wars’ and ‘Matrix’.

SELECT FLATTENED PredictAssociation(FavMovieModel.FavMovies, INCLUDE_STATISTICS, 5)
FROM FavMovieModel NATURAL PREDICTION JOIN
(SELECT 'Single' AS MaritalStatus,
(SELECT 'Star Wars' AS Title UNION SELECT 'Matrix' AS Title) AS FavMovies) AS t
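The query above asks the model for the top 5 titles associated with the cart contents. How the ranking is computed depends on the algorithm behind the model. The Python sketch below assumes plain co-occurrence counting, which is a simplification and not the method of any particular provider. The function recommend and the purchase history are hypothetical.

```python
from collections import Counter


def recommend(baskets, cart, n=5):
    """Rank titles by how often they co-occur with any title already in the cart."""
    cart = set(cart)
    scores = Counter()
    for basket in baskets:
        items = set(basket)
        if items & cart:
            # Credit every title in this basket that the customer does not have yet.
            for title in items - cart:
                scores[title] += 1
    # Sort by count (descending), then by title, so ties are deterministic.
    ranked = sorted(scores.items(), key=lambda kv: (-kv[1], kv[0]))
    return [title for title, _ in ranked[:n]]


# Made-up purchase history for illustration only.
history = [
    ["Star Wars", "Matrix", "Alien"],
    ["Star Wars", "Alien"],
    ["Matrix", "Blade Runner"],
    ["Titanic", "Notting Hill"],
]
top = recommend(history, ["Star Wars", "Matrix"], n=5)
```

Fewer than n titles are returned when the history offers fewer candidates, which is also why the last basket contributes nothing here.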
Sequence Prediction
 Model Definition

CREATE MINING MODEL WebSeqModel (
Session LONG KEY,
PageSeq TABLE PREDICT (
SeqID LONG KEY SEQUENCE,
Page TEXT DISCRETE
)
) USING Microsoft_Sequence_Clustering
Sequence Prediction (cont)
 Show the next 2 steps that a web visitor who visited ‘home’ → ‘news’ is going to take. For each step, show the top 5 candidate pages with the highest probability.

SELECT FLATTENED
( SELECT $Sequence,
TopCount(PredictHistogram(Page), $Probability, 5) FROM PredictSequence(WebSeqModel.PageSeq, 2)
)
FROM WebSeqModel NATURAL PREDICTION JOIN
(SELECT
(SELECT 1 AS SeqID, 'home' AS Page UNION
SELECT 2 AS SeqID, 'news' AS Page) AS PageSeq
) AS t
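The idea behind the query, ranking candidate next pages by probability, can be illustrated with a first-order Markov chain. This Python sketch assumes that simplification; it omits the clustering step and is not the Microsoft_Sequence_Clustering implementation. The function names and the sessions are hypothetical.

```python
from collections import Counter, defaultdict


def train_transitions(sessions):
    """Count page-to-page transitions (first-order Markov assumption)."""
    transitions = defaultdict(Counter)
    for pages in sessions:
        for current, following in zip(pages, pages[1:]):
            transitions[current][following] += 1
    return transitions


def next_pages(transitions, page, k=5):
    """Return up to k (page, probability) candidates for the step after `page`."""
    counts = transitions.get(page, Counter())
    total = sum(counts.values())
    ranked = sorted(counts.items(), key=lambda kv: (-kv[1], kv[0]))
    return [(p, c / total) for p, c in ranked[:k]]


# Made-up click streams for illustration only.
sessions = [
    ["home", "news", "sports"],
    ["home", "news", "weather"],
    ["home", "news", "sports", "scores"],
]
model = train_transitions(sessions)
step1 = next_pages(model, "news")          # candidates for the first predicted step
step2 = next_pages(model, step1[0][0])     # candidates after the most likely first step
```

The second step here follows only the single most likely first step; a fuller treatment would weight every first-step candidate.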
Time-Series Prediction
 Model Definition

CREATE MINING MODEL StockModel (
Symbol TEXT KEY,
DateRecorded DATE KEY TIME,
OpeningQuote DOUBLE CONTINUOUS,
ClosingQuote DOUBLE CONTINUOUS
) USING Microsoft_Time_Series
Time-Series Prediction (cont)
 Predict the next five days of MSFT stock closing quotes.

SELECT FLATTENED PredictTimeSeries(StockModel.ClosingQuote, 5)
FROM StockModel
WHERE StockModel.Symbol = 'MSFT'
Major Issues in Data Mining

 Mining methodology
 Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
 Performance: efficiency, effectiveness, and scalability
 Pattern evaluation: the interestingness problem
 Incorporation of background knowledge
 Handling noise and incomplete data
 Parallel, distributed and incremental mining methods
 Integration of the discovered knowledge with existing one: knowledge fusion
 User interaction
 Data mining query languages and ad-hoc mining
 Expression and visualization of data mining results
 Interactive mining of knowledge at multiple levels of abstraction
 Applications and social impacts
 Domain-specific data mining & invisible data mining
 Protection of data security, integrity, and privacy
Data Mining Vendors
 SAS (Enterprise Miner)
 IBM (DB2 Intelligent Miner)
 Oracle (ODM option to Oracle 10g)
 SPSS (Clementine)
 Insightful (Insightful Miner)
 KXEN (Analytic Framework)
 Prudsys (Discoverer and its family)
 Microsoft (SQL Server 2005)
 Angoss (KnowledgeServer and its family)
 DBMiner (DBMiner)
 Many others
Data Mining and Business Intelligence
[Figure: pyramid of layers, listed top to bottom, with increasing potential to support business decisions toward the top]
Making Decisions (End User)
Data Presentation: Visualization Techniques (Business Analyst)
Data Mining: Information Discovery (Data Analyst)
Data Exploration: Statistical Analysis, Querying and Reporting (Data Analyst)
Data Warehouses / Data Marts: OLAP, MDA (DBA)
Data Sources: Paper, Files, Information Providers, Database Systems, OLTP (DBA)
Data Mining Modeling and Language
 Problem Description
 Two powerful tools
 Database management systems
 Efficient and effective data mining algorithms and frameworks
 Generally, this work asks:
 “How can we merge the two?”
 “How can we integrate data mining more closely with traditional database systems, particularly querying?”
Three Different Answers
 MSQL: A Query Language for Database Mining (Imielinski & Virmani, Rutgers University)
 DMQL: A Data Mining Query Language for Relational Databases (Han et al., Simon Fraser University)
 Integrating Data Mining with SQL Databases: OLE DB for Data Mining (Netz et al., Microsoft)
MSQL
 Focus on Association Rules
 Seeks to provide a language both to selectively
generate rules, and separately to query the rule
base
 Expressive rule generation language, and
techniques for optimizing some commands
MSQL
 Get-Rules and Select-Rules Queries
 Get-Rules operator generates rules over elements of argument class C, which satisfy conditions described in the “where” clause
[Project Body, Consequent, confidence, support]
GetRules(C) [as R1]
[into <rulebase_name>]
[where <conds>]
[sql-group-by clause]
[using-clause]
MSQL
 <conds> may contain a number of conditions, including:
 restrictions on the attributes in the body or consequent
 “rule.body HAS {(Job = ‘Doctor’)}”
 “rule1.consequent IN rule2.body”
 “rule.consequent IS {Age = *}”
 in, has, and is are rule subset, superset, and equality, respectively
 pruning conditions (restrict by support, confidence, or size)
 Stratified or correlated subqueries
MSQL
GetRules(Patients) R1
where Body has {Age = *}
and Support > .05 and Confidence > .7
and not exists ( GetRules(Patients) R2
where Support > .05 and
Confidence > .7
and R2.Body HAS R1.Body)

Retrieve all rules with descriptors of the form “Age = *” in the body, except when another rule that meets the same support and confidence thresholds has a body containing a superset of those descriptors
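The semantics of this nested query can be sketched in Python over an in-memory rulebase. The sketch assumes that HAS in the subquery means a proper superset, since otherwise every rule would exclude itself. The Rule class, the helper names, and the sample rules are hypothetical and are not part of MSQL.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class Rule:
    body: frozenset        # descriptors such as ("Age", "30-40")
    consequent: tuple
    support: float
    confidence: float


def passes(rule, min_support=0.05, min_confidence=0.7):
    """Pruning conditions: Support > .05 and Confidence > .7."""
    return rule.support > min_support and rule.confidence > min_confidence


def has_attribute(body, attribute):
    """Body has {attribute = *}: some descriptor mentions the attribute."""
    return any(name == attribute for name, _ in body)


def maximal_age_rules(rulebase):
    candidates = [r for r in rulebase if passes(r)]
    return [
        r1 for r1 in candidates
        if has_attribute(r1.body, "Age")
        # not exists: no qualifying R2 whose body is a proper superset of R1's body
        and not any(r2.body > r1.body for r2 in candidates)
    ]


# Made-up rules for illustration only.
rules = [
    Rule(frozenset({("Age", "30-40")}), ("HeartDisease", "No"), 0.20, 0.80),
    Rule(frozenset({("Age", "30-40"), ("Smoker", "No")}), ("HeartDisease", "No"), 0.10, 0.90),
    Rule(frozenset({("Age", "60+"), ("Smoker", "Yes")}), ("HeartDisease", "Yes"), 0.02, 0.95),
]
result = maximal_age_rules(rules)
```

The first rule is dropped because the second one extends its body and also meets the thresholds; the third fails the support threshold.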
MSQL
Correlated:
GetRules(C) R1
where <pruning-conds>
and not exists ( GetRules(C) R2
where <same pruning-conds>
and R2.Body HAS R1.Body)

Stratified:
GetRules(C) R1
where <pruning-conds>
and consequent is {(X=*)}
and consequent in (SelectRules(R2)
where consequent is {(X=*)})
MSQL
 Nested Get-Rules Queries and their optimization
 Stratified (non-correlated) queries are evaluated “bottom-up.” The subquery is evaluated first, and replaced with its results in the outer query.
 Correlated queries are evaluated either top-down or bottom-up (like “loop-unfolding”), and there are rules for choosing between the two options
MSQL
Top-Down Evaluation
GetRules(Patients) R1
where Body has {Age = *}
and Support > .05 and Confidence > .7

For each rule produced by the outer query, evaluate the inner query:

not exists ( GetRules(Patients) R2
where Support > .05 and Confidence > .7
and R2.Body HAS R1.Body)
MSQL
Bottom-Up Evaluation
not exists ( GetRules(Patients) R2
where Support > .05 and Confidence > .7
and R2.Body HAS R1.Body)

For each rule produced by the inner query, evaluate the outer query:

GetRules(Patients) R1
where Body has {Age = *}
and Support > .05 and Confidence > .7
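The two evaluation orders differ in which query drives the loop, not in the answer. The Python sketch below illustrates this on rule bodies only. It assumes the pruning conditions have already been applied and treats HAS as a proper superset; it is not the MSQL optimizer, and the function names and sample bodies are hypothetical.

```python
def top_down(outer_rules, inner_rules):
    """For each outer rule, scan the inner rules for a disqualifying superset."""
    return [r1 for r1 in outer_rules
            if not any(r2 > r1 for r2 in inner_rules)]


def bottom_up(outer_rules, inner_rules):
    """For each inner rule, mark the outer rules it disqualifies, then keep the rest."""
    excluded = set()
    for r2 in inner_rules:
        for r1 in outer_rules:
            if r2 > r1:
                excluded.add(r1)
    return [r1 for r1 in outer_rules if r1 not in excluded]


# Rule bodies as frozensets of attribute names; thresholds assumed already applied.
inner = [frozenset({"Age", "Smoker"}), frozenset({"Age"}), frozenset({"Salary"})]
# The outer query additionally requires Age in the body.
outer = [body for body in inner if "Age" in body]

same_answer = top_down(outer, inner) == bottom_up(outer, inner)
```

Which order is cheaper depends on the relative sizes of the outer and inner rule sets, which is where rules for choosing between them come in.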
DMQL
 Commands specify the following:
 The set of data relevant to the data mining task (the training set)
 The kinds of knowledge to be discovered
 Generalized relation
 Characteristic rules
 Discriminant rules
 Classification rules
 Association rules
DMQL
 Commands Specify the following:
 Background knowledge
 Concept hierarchies based on attribute relationships,
etc.
 Various thresholds
 Minimum support, confidence, etc.
DMQL
 Syntax
use database <database_name>
{use hierarchy <hierarchy_name> for <attribute>}      (specify background knowledge)
<rule_spec>                                            (specify rules to be discovered)
related to <attr_or_agg_list>                          (relevant attributes or aggregations)
from <relation(s)>                                     (collect the set of relevant data to mine)
[where <conditions>]
[order by <order list>]
{with [<kinds of>] threshold = <threshold_value> [for <attribute(s)>]}      (specify threshold parameters)
DMQL
use database Hospital
find association rules as Heart_Health
related to Salary, Age, Smoker, Heart_Disease
from Patient_Financial f, Patient_Medical m
where f.ID = m.ID and m.age >= 18
with support threshold = .05
with confidence threshold = .7
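The two thresholds in this command bound the support and confidence of every reported rule. A minimal Python sketch of how those two measures are computed for one candidate rule follows, using the standard definitions. The patient records are made up for illustration, and whether the thresholds are inclusive is an assumption.

```python
def support(transactions, items):
    """Fraction of transactions containing every item in `items`."""
    items = set(items)
    return sum(items <= set(t) for t in transactions) / len(transactions)


def confidence(transactions, body, consequent):
    """support(body and consequent together) divided by support(body)."""
    return support(transactions, set(body) | set(consequent)) / support(transactions, body)


# Hypothetical records, one set of attribute=value descriptors per patient.
patients = [
    {"Smoker=Yes", "Heart_Disease=Yes"},
    {"Smoker=Yes", "Heart_Disease=Yes"},
    {"Smoker=Yes", "Heart_Disease=No"},
    {"Smoker=No", "Heart_Disease=No"},
]
s = support(patients, {"Smoker=Yes", "Heart_Disease=Yes"})
c = confidence(patients, {"Smoker=Yes"}, {"Heart_Disease=Yes"})
# Assumed inclusive thresholds, matching the values in the DMQL command.
keep = s >= 0.05 and c >= 0.7
```

Here the rule Smoker=Yes → Heart_Disease=Yes clears the support threshold but not the confidence threshold, so it would not be reported.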
DMQL
 DMQL provides a display in command to view
resulting rules, but no advanced way to query
them
 Suggests that a GUI interface might aid in the
presentation of these results in different forms
(charts, graphs, etc.)
OLE DB for DM
 An extension to the OLE DB interface for Microsoft
SQL Server
 Seeks to support the following ideas:
 Define a model by specifying the set of attributes to be
predicted, the attributes used for the prediction, and the
algorithm
 Populate the model using the training data
 Predict attributes for new data using the populated model
 Browse the mining model (not fully addressed because it
varies a lot by model type)
OLE DB for DM
 Defining a Mining Model
 Identify the set of data attributes to be predicted, the set of attributes to be used for prediction, and the algorithm to be used for building the model
 Populating the Model
 Pull the information into a single rowset using views, and train the model using the data and algorithm specified
OLE DB for DM
 Using the mining model to predict
 Defines a new operator prediction join. A
model may be used to make predictions on
datasets by taking the prediction join of the
mining model and the data set.
OLE DB for DM
CREATE MINING MODEL [Heart_Health Prediction]
(
ID Int Key,
Age Int,
Smoker Int,
Salary Double discretized,
HeartAttack Int PREDICT -- prediction column
)
USING Microsoft_Decision_Trees

Identifies the source columns for the training data, the column to be predicted, and the data mining algorithm.
OLE DB for DM
INSERT INTO [Heart_Health Prediction]
(Age, Smoker, Salary, HeartAttack)
OPENROWSET ('<provider>', '<connection>',
'SELECT Age, Smoker, Salary, HeartAttack
FROM Patient_Medical M, Patient_Financial F
WHERE M.ID = F.ID')

The INSERT represents using a tuple for training the model (not actually inserting it into the rowset).
OLE DB for DM
SELECT T.ID, H.HeartAttack
FROM [Heart_Health Prediction] H
PREDICTION JOIN
OPENROWSET ('<provider>', '<connection>',
'SELECT ID, Age, Smoker, Salary
FROM Patient_Medical M, Patient_Financial F
WHERE M.ID = F.ID') AS T
ON H.Age = T.Age AND H.Smoker = T.Smoker AND H.Salary = T.Salary

Prediction join connects the model and an actual data table to make predictions
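Unlike an ordinary join, a prediction join does not match rows against stored rows. The ON clause maps input columns to model attributes, and the model supplies the predicted column. The following Python sketch shows that reading. The function prediction_join, the rule inside toy_heart_model, and the patient rows are hypothetical stand-ins for a trained decision tree and real data.

```python
def prediction_join(predict, rows, on):
    """Pair each input row with the model's prediction for it.

    `on` maps model attribute names to input column names, like the ON clause.
    """
    for row in rows:
        case = {model_attr: row[column] for model_attr, column in on.items()}
        yield row, predict(case)


def toy_heart_model(case):
    # Hypothetical rule standing in for a trained decision tree.
    return 1 if case["Smoker"] == 1 and case["Age"] >= 50 else 0


# Made-up input rows for illustration only.
patients = [
    {"ID": 1, "Age": 62, "Smoker": 1, "Salary": 40000.0},
    {"ID": 2, "Age": 35, "Smoker": 0, "Salary": 55000.0},
]
mapping = {"Age": "Age", "Smoker": "Smoker", "Salary": "Salary"}
results = [(row["ID"], label)
           for row, label in prediction_join(toy_heart_model, patients, mapping)]
```

Each input row yields exactly one output row, carrying the input's ID and the model's predicted HeartAttack value.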
Key Ideas
 Important to have an API for creating and
manipulating data mining models
 The data is already in the DBMS, so it makes
sense to do the data mining where the data is
 Applications already use SQL, so a SQL
extension seems logical
Key Ideas
 Need a method for defining data mining models,
including algorithm specification, specification
of various parameters, and training set
specification (DMQL, MSQL, OLE DB for DM)
 Need a method of querying the models (MSQL)
 Need a way of using the data mining model to
interact with other data in the database, for
purposes such as prediction (OLE DB for DM)
