Professional Documents
Culture Documents
Data Warehousing and Data Mining: A.A. 04-05 Datawarehousing & Datamining 1
Data Warehousing and Data Mining: A.A. 04-05 Datawarehousing & Datamining 1
and
Data Mining
A.A. 04-05
Outline
1. Introduction and Terminology
2. Data Warehousing
3. Data Mining
Association rules
Sequential patterns
Classification
Clustering
A.A. 04-05
Few
sophisticated
users
Many unsophisticated
users
4
A.A. 04-05
Decision
Making
Data Presentation
PRESENTATION
Reporting/Visualization engines
Data Analysis
BUSINESS
LOGIC
DATA
A.A. 04-05
Outline
1. Introduction and Terminology
2. Data Warehousing
3. Data Mining
Association rules
Sequential patterns
Classification
Clustering
A.A. 04-05
Data Warehousing
DATA WAREHOUSE
Database with the following distinctive characteristics:
Separate from operational databases
Subject oriented: provides a simple, concise view on one or
more selected areas, in support of the decision process
Constructed by integrating multiple, heterogeneous data
sources
Contains historical data: spans a much longer time horizon
than operational databases
(Mostly) Read-Only access: periodic, infrequent updates
A.A. 04-05
10
Data Warehousing
A.A. 04-05
11
Data Warehousing
Multi-Tier Architecture
DB
DB
Cleaning
extraction
Data sources
A.A. 04-05
Data Warehouse
Server
Analysis
Reporting
Data Mining
Data Storage
OLAP
engine
Front-End
Tools
12
Data Warehousing
Supplier
Store
Period
Units
Sales
P1
S1
St1
1qtr
30
1500
P2
S1
St3
2qtr
100
9000
dimensions
A.A. 04-05
dependent attributes
13
Data Warehousing
Product
Code
dimension
Description
table
Name
Address
A.A. 04-05
Store
Units
Code
dimension
table
Code
Supplier
Period
supplier
Store
Sales
Fact table
Datawarehousing & Datamining
Name
dimension Address
table
Manager
STAR
SCHEMA
14
Data Warehousing
(1,1)
PRODUCT
(0,N)
(1,1)
FACTs
(1,1)
(0,N)
STORE
(1,1)
(0,N)
SUPPLIER
A.A. 04-05
15
Data Warehousing
16
Data Warehousing
MOLAP Approach
PC
DVD
CD
TV
1Qtr
2Qtr
3Qtr
4Qtr
Europe
North America
Middle East
Region
Pr
od
uc
t
Period
Far East
DATA CUBE
representing a
Fact Table
A.A. 04-05
17
Data Warehousing
A.A. 04-05
18
Data Warehousing
A.A. 04-05
19
Data Warehousing
Pr
od
uc
t
Period
PC
DVD
CD
TV
year
Europe
North America
Middle East
Far East
Region
Drill-Down = (Roll-Up)-1
A.A. 04-05
20
Data Warehousing
1Qtr
2Qtr
3Qtr
4Qtr
SUM
Europe
North America
Middle East
Far East
SUM
21
Data Warehousing
product
product, period
period
region
product, region
period, region
22
Data Warehousing
A.A. 04-05
23
Data Warehousing
Product
Period
Region
Tot. Sales
CD
1qtr
Europe
CD
1qtr
ALL
ALL
ALL
1qtr
Europe
ALL
CD
ALL
Europe
ALL
CD
ALL
ALL
ALL
1qtr
ALL
ALL
ALL
Europe
ALL
ALL
ALL
ALL
ALL
A.A. 04-05
24
Data Warehousing
A.A. 04-05
25
Data Warehousing
Product
Period
Region
Tot. Sales
CD
1qtr
Europe
CD
1qtr
ALL
ALL
CD
ALL
ALL
ALL
ALL
ALL
ALL
ALL
A.A. 04-05
26
Data Warehousing
product
product, period
period
region
product, region
period, region
A.A. 04-05
27
Outline
1. Introduction and Terminology
2. Data Warehousing
3. Data Mining
Association rules
Sequential patterns
Classification
Clustering
A.A. 04-05
28
Data Mining
29
Data Mining
KNOWLEDGE
Data Mining
Selection &
transformation
Data cleaning
& integration
DB
Evaluation &
presentation
Training
data
Data Warehouse
Domain
knowledge
DB
Information repositories
A.A. 04-05
30
Data Mining
31
Data Mining
DATA MINING
32
Data Mining
Interestingness measures
Purpose: filter irrelevant patterns to convey concise and
useful knowledge. Certain data mining tasks can produce
thousands or millions of patterns most of which are
redundant, trivial, irrelevant.
Objective measures: based on statistics and structure of
patterns (e.g., frequency counts)
Subjective measures: based on users belief about the
data. Patterns may become interesting if they confirm or
contradict a users hypothesis, depending on the context.
Interestingness measures can employed both after and during
the pattern discovery. In the latter case, they improve the
search efficiency
A.A. 04-05
33
Data Mining
HighPerformance
Computing
Data Mining
Machine
learning
A.A. 04-05
DB
Technology
Datawarehousing & Datamining
Visualization
34
Data Mining
A.A. 04-05
35
Association Rules
= Set of items
D = Set of transactions: t
For an itemset X
D then t
I,
Y, with Y
support
= support(X)
confidence
= support(X)/support(X-Y)
than min
36
Association Rules
37
Association Rules
Applications:
Cross/Up-selling (especially in e-comm., e.g., Amazon)
Cross-selling: push complemetary products
Up-selling: push similar products
Catalog design
Store layout (e.g., diapers and beer close to each other)
Financial forecast
Medical diagnosis
A.A. 04-05
38
Association Rules
min sup
2.
Observation:
Min sup and min confidence are objective measures of
interestingness. Their proper setting, however, requires
users domain knowledge. Low values may yield exponentially
(in |I|) many rules, high values may cut off interesting rules
A.A. 04-05
39
Association Rules
Example
Transaction-id
Items bought
Frequent Itemsets
Support
10
A, B, C
{A}
75%
20
A, C
{B}
50%
30
A, D
{C}
50%
40
B, E, F
{A, C}
50%
{C} :
= support({A,C})
= 50%
= support({A,C})/support({A}) = 66.6%
{A} (same support as {A}
{C}):
= support({A,C})/support({C}) = 100%
Datawarehousing & Datamining
40
Association Rules
A.A. 04-05
41
Association Rules
Notion of Closedness
I (items), D (transactions), min sup (support threshold)
I Y X}
42
Association Rules
Notion of Maximality
Maximal Frequent Itemsets =
{X I : supp(X) min sup & supp(Y)<min sup
I Y X}
information loss
A.A. 04-05
43
Association Rules
Tid
10
20
30
40
50
60
DATASET
Min sup = 3
Closed Frequent
Itemsets
B
B, A, D, F
B, L
B, G
B, H
A.A. 04-05
Maximal
Support
NO
YES
YES
YES
YES
6
4
4
3
3
Items
B, A, D, F, G, H
B, L
B, A, D, F, L, H
B, L, G, H
B, A, D, F
B, A, D, F, L, G
Supporting
Transactions
all
10, 30, 50, 60
20, 30, 40, 60
10, 40, 60
10, 30, 40
44
Association Rules
(B,6)
(A,4)
(B,4)
(D,4)
(B,4)
(A,4)
(B,4)
(F,4)
(B,4)
(L,4)
(G,3)
(H,3)
(B,3)
(B,3)
(A,4)
(D,4)
(B,4)
(B,4)
(B,4)
(A,4)
(B,4)
45
Sequential Patterns
A.A. 04-05
46
Sequential Patterns
Sequential Patterns
PROBLEM: Given a set of sequences, find the complete set
of frequent subsequences
sequence database
A sequence: <(ef)(ab)(df)(c)(b)>
SID
sequence
10
<(a)(abc)(ac)(d)(cf)>
20
<(ad)(c)(bc)(ae)>
30
<(ef)(ab)(df)(c)(b)>
40
<(e)(g)(af)(c)(b)(c)>
<(a)(bc)(d)(c)>
is a subsequence of
<(a)(abc)(ac)(d)(cf)>
<(
47
Sequential Patterns
Applications:
Marketing
Natural disaster forecast
Analysis of web log data
DNA analysis
A.A. 04-05
48
Classification
A.A. 04-05
49
Classification
k)
GOAL
From the training data find a model to describe the classes
accurately and synthetically using the datas features. The
model will then be used to assign class labels to unknown
(previously unseen) records
A.A. 04-05
50
Classification
Applications:
Classification of (potential) customers for: Credit approval,
risk prediction, selective marketing
Performance prediction based on selected indicators
Medical diagnosis based on symptoms or reactions to
therapy
A.A. 04-05
51
Classification
Classification process
BUILD
Training Set
Unseen Data
model
Test Data
VALIDATE
PREDICT
class labels
A.A. 04-05
52
Classification
Observations:
Features can be either categorical if belonging to
unordered domains (e.g., Car Type), or continuous, if
belonging to ordered domains (e.g., Age)
The class could be regarded as an additional attribute of
the examples, which we want to predict
Classification vs Regression:
Classification: builds models for categorical classes
Regression: builds models for continuous classes
Several types of models exist: decision trees, neural
networks, bayesan (statistical) classifiers.
A.A. 04-05
53
Classification
54
Classification
Example:
examples = car insurance applicants, class = insurance risk
features
class
Age < 25
Age
Car Type
Risk
23
family
high
18
sports
high
43
sports
high
68
family
low
32
truck
low
20
family
high
A.A. 04-05
Car type
{sports}
High
High
Low
Model
(decision tree)
55
Clustering
A.A. 04-05
56
Clustering
57
Clustering
Example
attribute Y
outlier
attribute Y
object
attribute X
A.A. 04-05
attribute X
58
Clustering
59
Clustering
Applications:
Marketing: identify groups of customers based on their
purchasing patterns
Biology: categorize genes with similar functionalities
Image processing: clustering of pixels
Web: clustering of documents in meta-search engines
GIS: identification of areas of similar land use
A.A. 04-05
60
Clustering
Challenges:
Scalability: strategies to deal with very large datasets
Variety of attribute types: defining a good notion of
similarity is hard in the presence of different types of
attributes and/or different scales of values
Variety of cluster shapes: common distance measures
provide only spherical clusters
Noisy data: outliers may affect the quality of the clustering
Sensitivity to input ordering: the ordering of the input data
should not affect the output (or its quality)
A.A. 04-05
61
Clustering
62