05 - dbdm2007 - Data Mining
— Chapter 5 —
Jiawei Han
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
©2006 Jiawei Han and Micheline Kamber, All rights reserved
October 22, 2007 Data Mining: Concepts and Techniques
Why Data Mining?

- Explosive growth of data: automated data collection tools, database systems, the Web, a computerized society
- Major sources of abundant data
  Business: Web, e-commerce, transactions, stocks, ...
  Science: remote sensing, bioinformatics, scientific simulation, ...
  Society and everyone: news, digital cameras, ...
- We are drowning in data, but starving for knowledge!
- "Necessity is the mother of invention": data mining, the automated analysis of massive data sets, is a natural step in the evolution of database technology

What Is Data Mining?

- Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
- Data mining: a misnomer?
- Alternative names: knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
- Watch out: is everything "data mining"?
  Simple search and query processing
  (Deductive) expert systems
Why Data Mining?—Potential Applications

Knowledge Discovery (KDD) Process

Data Mining and Business Intelligence

[Pyramid figure: increasing potential to support business decisions, from data sources up to decision making]
- Data sources: paper, files, Web documents, scientific experiments, database systems
- DBA: data preprocessing/integration, data warehouses
- Data exploration: statistical summary, querying, and reporting
- End user: decision making

Data Mining: Confluence of Multiple Disciplines

[Figure: data mining draws on database technology, statistics, pattern recognition, algorithms, and other disciplines]
Why Not Traditional Data Analysis?

- Tremendous amount of data
  Algorithms must be highly scalable to handle data such as tera-bytes of data
- High dimensionality of data
  Micro-arrays may have tens of thousands of dimensions
- High complexity of data
  Data streams and sensor data
  Time-series data, temporal data, sequence data
  Structure data, graphs, social networks and multi-linked data
  Heterogeneous databases and legacy databases
  Spatial, spatiotemporal, multimedia, text and Web data
  Software programs, scientific simulations
- New and sophisticated applications

Multi-Dimensional View of Data Mining

- Data to be mined: relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW
- Knowledge to be mined: characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.; multiple/integrated functions and mining at multiple levels
- Techniques utilized: database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
- Applications adapted: retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.
- Different views lead to different classifications
  Data view: kinds of data to be mined
  Knowledge view: kinds of knowledge to be discovered
  Method view: kinds of techniques utilized
  Application view: kinds of applications adapted

- Kinds of data that can be mined
  Time-series data, temporal data, sequence data (incl. bio-sequences)
  Structure data, graphs, social networks and multi-linked data
  Object-relational databases
  Heterogeneous databases and legacy databases
  Spatial data and spatiotemporal data
  Multimedia database
  Text databases
  The World-Wide Web
Data Mining Functionalities

- Multidimensional concept description: characterization and discrimination
  Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
- Frequent patterns, association, correlation vs. causality
  Diaper ⇒ Beer [0.5%, 75%] (correlation or causality?)
- Classification and prediction
  Construct models (functions) that describe and distinguish classes or concepts for future prediction

Data Mining Functionalities (2)

- Cluster analysis
  Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
  Maximizing intra-class similarity & minimizing interclass similarity
- Outlier analysis
  Outlier: a data object that does not comply with the general behavior of the data
  Noise or exception? Useful in fraud detection, rare-events analysis
- Trend and evolution analysis
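The [support%, confidence%] notation used in rules such as Diaper ⇒ Beer above can be sketched directly. A minimal illustration with made-up transactions (the real Diaper/Beer figures are not derivable from the slide):

```python
# Hedged sketch: support and confidence for a rule X => Y over a toy
# transaction database (the items and counts here are assumptions,
# not the slide's data).
transactions = [
    {"diaper", "beer", "nuts"},
    {"diaper", "beer", "coffee"},
    {"diaper", "milk"},
    {"milk", "coffee"},
]

def support(itemset, db):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(lhs, rhs, db):
    """P(rhs | lhs): support of the union divided by support of lhs."""
    return support(lhs | rhs, db) / support(lhs, db)

print(support({"diaper", "beer"}, transactions))       # 0.5
print(confidence({"diaper"}, {"beer"}, transactions))  # 2/3
```

So the rule "diaper ⇒ beer" here would be written [50%, 66.7%].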
Are All the "Discovered" Patterns Interesting?

- Data mining may generate thousands of patterns: not all of them are interesting
  Suggested approach: human-centered, query-based, focused mining
- Interestingness measures
  A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
- Objective vs. subjective interestingness measures
  Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
  Subjective: based on the user's belief in the data, e.g., unexpectedness, novelty, actionability, etc.

Find All and Only Interesting Patterns?

- Find all the interesting patterns: completeness
  Can a data mining system find all the interesting patterns? Do we need to find all of the interesting patterns?
  Heuristic vs. exhaustive search
  Association vs. classification vs. clustering
- Search for only interesting patterns: an optimization problem
  Can a data mining system find only the interesting patterns?
  Approaches
    First generate all the patterns and then filter out the uninteresting ones
    Generate only the interesting patterns: mining query optimization
Other Pattern Mining Issues

Why Data Mining Query Language?
Primitive 2: Types of Knowledge to Be Mined

Primitive 3: Background Knowledge
Primitive 4: Pattern Interestingness Measure

- Simplicity: e.g., (association) rule length, (decision) tree size
- Certainty: e.g., confidence, P(A|B) = #(A and B) / #(B); classification reliability or accuracy, certainty factor, rule strength, rule quality, discriminating weight, etc.
- Utility: potential usefulness, e.g., support (association), noise threshold (description)
- Novelty: not previously known, surprising (used to remove redundant rules, e.g., the Illinois vs. Champaign rule implication support ratio)

Primitive 5: Presentation of Discovered Patterns

- Different backgrounds/usages may require different forms of representation
  E.g., rules, tables, crosstabs, pie/bar charts, etc.
- Concept hierarchy is also important
  Discovered knowledge might be more understandable when represented at a high level of abstraction
  Interactive drill up/down, pivoting, slicing and dicing provide different perspectives on the data
- Different kinds of knowledge require different representations: association, classification, clustering, etc.
DMQL—A Data Mining Query Language

- Motivation
  A DMQL can provide the ability to support ad-hoc and interactive data mining
  By providing a standardized language like SQL
    Hope to achieve an effect similar to the one SQL has had on relational databases
    Foundation for system development and evolution
    Facilitate information exchange, technology transfer, commercialization and wide acceptance
- Design
  DMQL is designed with the primitives described earlier

An Example Query in DMQL
- Based on OLE, OLE DB, OLE DB for OLAP, C#
- Integrating DBMS, data warehouse and data mining
- DMML (Data Mining Mark-up Language) by DMG (www.dmg.org)
- Providing a platform and process structure for effective data mining
- Emphasizing the deployment of data mining technology to solve business problems

- Interactive mining of multi-level knowledge
  Necessity of mining knowledge and patterns at different levels of abstraction by drilling/rolling, pivoting, slicing/dicing, etc.
- Integration of multiple mining functions
  E.g., characterized classification: first clustering and then association
Coupling Data Mining with DB/DW Systems

Architecture: Typical Data Mining System
Major Issues in Data Mining

- Mining methodology
  Performance: efficiency, effectiveness, and scalability
  Pattern evaluation: the interestingness problem
  Incorporation of background knowledge
  Handling noise and incomplete data
  Parallel, distributed and incremental mining methods
  Integration of the discovered knowledge with existing knowledge: knowledge fusion
- User interaction
  Data mining query languages and ad-hoc mining
  Expression and visualization of data mining results
  Interactive mining of knowledge at multiple levels of abstraction
- Applications and social impacts
  Domain-specific data mining & invisible data mining

Summary

- Data mining: a natural evolution of database technology, in great demand, with wide applications
- A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
- Mining can be performed in a variety of information repositories
- Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
- Data mining systems and architectures
- Major issues in data mining
A Brief History of Data Mining Society

- 1989 IJCAI Workshop on Knowledge Discovery in Databases
  Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
- 1991-1994 Workshops on Knowledge Discovery in Databases
  Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
- 1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD'95-98)
  Journal of Data Mining and Knowledge Discovery (1997)
- ACM SIGKDD conferences since 1998 and SIGKDD Explorations

Conferences and Journals on Data Mining

- KDD conferences
  ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
  SIAM Data Mining Conf. (SDM)
  (IEEE) Int. Conf. on Data Mining (ICDM)
  Conf. on Principles and Practices of Knowledge Discovery and Data Mining
- Other related conferences: ACM SIGMOD, VLDB, (IEEE) ICDE, WWW, SIGIR, ICML, CVPR, NIPS
- Journals: Data Mining and Knowledge Discovery (DAMI or DMKD)
Where to Find References?

- Data mining and KDD
  Journals: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD
- Database systems (SIGMOD: ACM SIGMOD Anthology—CD ROM)
  Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
  Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.
- AI & machine learning
  Conferences: Machine Learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.
  Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-PAMI, etc.
- Web and IR
  Conferences: SIGIR, WWW, CIKM, etc.
  Journals: WWW: Internet and Web Information Systems
- Statistics
  Conferences: Joint Stat. Meeting, etc.
  Journals: Annals of Statistics, etc.
- Visualization
  Conference proceedings: CHI, ACM-SIGGraph, etc.
  Journals: IEEE Trans. Visualization and Computer Graphics, etc.

Recommended Reference Books

- R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. 2nd ed., Wiley-Interscience, 2000.
- T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003.
- U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
- U. Fayyad, G. Grinstein, and A. Wierse. Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann, 2001.
- J. Han and M. Kamber. Data Mining: Concepts and Techniques. 2nd ed., Morgan Kaufmann, 2006.
- D. J. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, 2001.
- T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, 2001.
- T. M. Mitchell. Machine Learning. McGraw Hill, 1997.
- G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991.
- P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Wiley, 2005.
- S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1998.
- I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. 2nd ed., Morgan Kaufmann, 2005.
Chapter 5: Mining Frequent Patterns,
Association and Correlations
Basic Concepts: Frequent Patterns and Association Rules

Closed Patterns and Max-Patterns
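A closed pattern is a frequent pattern with no proper superset of the same support; a max-pattern is a frequent pattern with no frequent proper superset. A brute-force sketch on assumed toy data (not an example from the slides) makes the distinction concrete:

```python
from itertools import combinations

# Brute-force illustration of closed vs. max patterns over a tiny
# assumed database; fine for illustration, hopeless at scale.
db = [frozenset("abc"), frozenset("abc"), frozenset("ab"), frozenset("cd")]
min_sup = 2

def sup(x):
    return sum(x <= t for t in db)

items = set().union(*db)
candidates = [frozenset(c) for r in range(1, len(items) + 1)
              for c in combinations(sorted(items), r)]
frequent = {x for x in candidates if sup(x) >= min_sup}

# Closed: no proper superset with the same support.
closed = {x for x in frequent
          if not any(x < y and sup(y) == sup(x) for y in frequent)}
# Maximal: no frequent proper superset at all.
maximal = {x for x in frequent if not any(x < y for y in frequent)}
```

Here `frequent` contains a, b, c, ab, ac, bc, abc; only c, ab and abc are closed, and only abc is maximal.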
Scalable Methods for Mining Frequent Patterns

- The downward closure property of frequent patterns
  Any subset of a frequent itemset must be frequent
  If {beer, diaper, nuts} is frequent, so is {beer, diaper}, i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}

Apriori: A Candidate Generation-and-Test Approach

- Apriori pruning principle: if there is any itemset which is infrequent, its superset should not be generated/tested! (Agrawal & Srikant @VLDB'94; Mannila et al. @KDD'94)
- Method: generate length-(k+1) candidate itemsets from length-k frequent itemsets, and test the generated candidates against the database
Important Details of Apriori

- How to generate candidates?
  Step 1: self-joining Lk-1
  Step 2: pruning
- How to count supports of candidates?
- Example of candidate generation
  L3 = {abc, abd, acd, ace, bcd}
  Self-joining: L3*L3
    abcd from abc and abd
    acde from acd and ace
  Pruning: acde is removed because ade is not in L3
  C4 = {abcd}

How to Generate Candidates?

Suppose the items in Lk-1 are listed in an order.

Step 1: self-joining Lk-1

  insert into Ck
  select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
  from Lk-1 p, Lk-1 q
  where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

Step 2: pruning

  forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
      if (s is not in Lk-1) then delete c from Ck
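The self-join and prune steps translate almost line for line into code. A sketch (item tuples are kept sorted so the SQL-style prefix comparison works):

```python
from itertools import combinations

def apriori_gen(L_prev):
    """Candidate generation: L_prev is a set of frozensets of size k-1;
    returns the pruned candidate k-itemsets Ck."""
    ordered = [tuple(sorted(x)) for x in L_prev]
    Ck = set()
    for p in ordered:
        for q in ordered:
            # Self-join: first k-2 items equal, last item of p < last of q.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                Ck.add(frozenset(p + (q[-1],)))
    # Prune: every (k-1)-subset of a candidate must itself be in L_prev.
    return {c for c in Ck
            if all(frozenset(s) in L_prev
                   for s in combinations(c, len(c) - 1))}

L3 = {frozenset(s) for s in ("abc", "abd", "acd", "ace", "bcd")}
print(apriori_gen(L3))  # only abcd survives; acde is pruned (ade not in L3)
```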
- Method:
  Candidate itemsets are stored in a hash-tree
  A leaf node of the hash-tree contains a list of itemsets and counts
  An interior node contains a hash table
  Subset function: finds all the candidates contained in a transaction

[Figure: hash-tree of candidate 3-itemsets (e.g., 145, 124, 457, 125, 458, 159, 136, 345, 356, 357, 689, 367, 368, 234, 567); a transaction 1 2 3 5 6 is decomposed as 1+2356, 12+356, 13+56 during subset checking]
Efficient Implementation of Apriori in SQL

Challenges of Frequent Pattern Mining
Partition: Scan Database Only Twice

- Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
- A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association in large databases. VLDB'95.

DHP: Reduce the Number of Candidates

- A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
  E.g., ab is not a candidate 2-itemset if the total count of its bucket {ab, ad, ae} is below the support threshold
- J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. SIGMOD'95.
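The DHP idea can be sketched in a few lines: while scanning for frequent 1-itemsets, hash every 2-itemset of each transaction into a small bucket table; a pair can only be a candidate if its whole bucket reaches min_sup. The bucket count of 7 and the toy data are assumptions for illustration:

```python
from collections import Counter
from itertools import combinations

def dhp_bucket_counts(db, n_buckets=7):
    """Hash every 2-itemset of every transaction into a bucket counter."""
    buckets = Counter()
    for t in db:
        for pair in combinations(sorted(t), 2):
            buckets[hash(pair) % n_buckets] += 1
    return buckets

def may_be_frequent(pair, buckets, min_sup, n_buckets=7):
    """A pair whose bucket count is below min_sup cannot be frequent;
    the converse does not hold (buckets over-count due to collisions)."""
    return buckets[hash(tuple(sorted(pair))) % n_buckets] >= min_sup

db = [{"a", "b", "c"}, {"a", "b"}, {"c", "d"}]
buckets = dhp_bucket_counts(db)
print(may_be_frequent(("a", "b"), buckets, min_sup=2))  # True
```

Note the one-sided guarantee: the filter only ever discards pairs, never admits a frequent pair wrongly.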
Sampling for Frequent Patterns

DIC: Reduce Number of Scans

- Multiple database scans are costly
- Mining long patterns needs many passes of scanning and generates lots of candidates
  To find the frequent itemset i1 i2 ... i100:
    # of scans: 100
    # of candidates: C(100,1) + C(100,2) + ... + C(100,100) = 2^100 - 1 ≈ 1.27 * 10^30!
- Bottleneck: candidate generation-and-test
- Can we avoid candidate generation?

- Grow long patterns from short ones using local frequent items
  "abc" is a frequent pattern
  Get all transactions having "abc": DB|abc
  "d" is a local frequent item in DB|abc ⇒ abcd is a frequent pattern
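The candidate-count arithmetic above is easy to verify: every nonempty subset of the 100 items would be enumerated.

```python
from math import comb

# Sanity-check of the slide's candidate count for a single frequent
# 100-itemset: the sum of all nonempty subset counts is 2^100 - 1.
n_candidates = sum(comb(100, k) for k in range(1, 101))
assert n_candidates == 2**100 - 1
print(f"{n_candidates:.3g}")  # about 1.27e+30
```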
Construct FP-tree from a Transaction Database

Benefits of the FP-tree Structure

- Completeness and non-redundancy

Find Patterns Having P From P-conditional Database

- Partition patterns and databases: patterns containing p; patterns having m but no p; ...; patterns having c but no a nor b, m, p; pattern f

[Figure: FP-tree with header table: item f (frequency 4), c (4), a (3), b (3), m (3), p (3); tree paths f:4-c:3-a:3-m:2-p:2, f:4-c:3-a:3-b:1-m:1, f:4-b:1, and c:1-b:1-p:1]

Conditional pattern bases:

  item | conditional pattern base
  c    | f:3
  a    | fc:3
  b    | fca:1, f:1, c:1
  m    | fca:2, fcab:1
  p    | fcam:2, cb:1
From Conditional Pattern-bases to Conditional FP-trees

- For each pattern-base
  Accumulate the count for each item in the base
  Construct the FP-tree for the frequent items of the pattern base
- m-conditional pattern base: fca:2, fcab:1
- m-conditional FP-tree: {} → f:3 → c:3 → a:3
- All frequent patterns relating to m: m, fm, cm, am, fcm, fam, cam, fcam

Recursion: Mining Each Conditional FP-tree

- Cond. pattern base of "am": (fc:3); am-conditional FP-tree: {} → f:3 → c:3
- Cond. pattern base of "cm": (f:3); cm-conditional FP-tree: {} → f:3
- Cond. pattern base of "cam": (f:3); cam-conditional FP-tree: {} → f:3
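The recursion above can be sketched without building a literal FP-tree, by recursing on projected (conditional) databases instead; this keeps the pattern-growth idea while staying short. The transactions below follow the f, c, a, b, m, p example run, with min_sup = 3 assumed:

```python
from collections import Counter

def pattern_growth(db, min_sup, suffix=frozenset()):
    """Pattern-growth sketch: for each frequent item x, recurse on the
    database restricted to transactions containing x (the conditional
    database DB|x), exactly the 'grow abcd from abc' idea."""
    counts = Counter(i for t in db for i in t)
    patterns = {}
    for item, n in counts.items():
        if n >= min_sup:
            pat = suffix | {item}
            patterns[pat] = n  # n is the true support of pat in the full DB
            projected = [t - {item} for t in db if item in t]
            patterns.update(pattern_growth(projected, min_sup, pat))
    return patterns

db = [frozenset("fcamp"), frozenset("fcabm"), frozenset("fb"),
      frozenset("cbp"), frozenset("fcamp")]
freq = pattern_growth(db, min_sup=3)
print(freq[frozenset("fcam")])  # 3
```

This sketch enumerates each pattern along several recursion branches (a real FP-growth avoids that), but the reported supports agree on every branch, so the result is correct.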
A Special Case: Single Prefix Path in FP-tree

[Figure: an FP-tree whose top is a single prefix path a1:n1 → a2:n2 → a3:n3, followed by a multi-path part with nodes b1:m1, C1:k1, C2:k2, C3:k3; mining is decomposed into the prefix-path part and the multi-path part]

Mining Frequent Patterns With FP-trees

- Recurse until the resulting FP-tree is empty, or it contains only one path; a single path generates all the combinations of its sub-paths, each of which is a frequent pattern
Scaling FP-growth by DB Projection

Partition-based Projection

[Figure: partition-based projection of the transaction DB; e.g., the am-projected DB is {fc, fc, fc} and the cm-projected DB is {f, f, f}]
FP-Growth vs. Apriori: Scalability With the Support Threshold

[Figure: run time (sec.) vs. support threshold (%), comparing FP-growth and Apriori as the threshold drops from 3% toward 0]

FP-Growth vs. Tree-Projection: Scalability with the Support Threshold

[Figure: run time (sec.) vs. support threshold (%), comparing FP-growth and TreeProjection for thresholds from 2% toward 0]
Why Is FP-Growth the Winner?

Implications of the Methodology
Mining max-patterns:

  Tid | Items
  10  | A, B, C, D, E
  20  | B, C, D, E
  30  | A, C, D, F

- 1st scan: find frequent items: A, B, C, D, E
- 2nd scan: find support for AB, AC, AD, AE, ABCDE; BC, BD, BE, BCDE; CD, CE, DE, CDE (potential max-patterns)
- Since BCDE is a max-pattern, there is no need to check BCD, BDE, CDE in a later scan
- R. Bayardo. Efficiently mining long patterns from databases. SIGMOD'98.

Mining frequent closed patterns (CLOSET):

  Min_sup = 2

  TID | Items
  10  | a, c, d, e, f
  20  | a, b, e
  30  | c, e, f
  40  | a, c, d, f
  50  | c, e, f

- Flist: list of all frequent items in support-ascending order: d-a-f-e-c
- Divide search space: patterns having d; patterns having d but no a; etc.
- Find frequent closed patterns recursively
  Every transaction having d also has cfa ⇒ cfad is a frequent closed pattern
- J. Pei, J. Han & R. Mao. CLOSET: An efficient algorithm for mining frequent closed itemsets. DMKD'00.
CLOSET+: Mining Closed Itemsets by Pattern-Growth

- Itemset merging: if Y appears in every occurrence of X, then Y is merged with X
- Sub-itemset pruning: if Y ⊃ X and sup(X) = sup(Y), X and all of X's descendants in the set enumeration tree can be pruned
- Hybrid tree projection
  Bottom-up physical tree-projection
  Top-down pseudo tree-projection
- Item skipping: if a local frequent item has the same support in several header tables at different levels, one can prune it from the header tables at higher levels

CHARM: Mining by Exploring Vertical Data Format

- Vertical format: t(AB) = {T11, T25, ...}
  tid-list: the list of trans.-ids containing an itemset
- Deriving closed patterns based on vertical intersections
  t(X) = t(Y): X and Y always happen together
  t(X) ⊂ t(Y): a transaction having X always has Y
- Using diffsets to accelerate mining
  Only keep track of differences of tids
  t(X) = {T1, T2, T3}, t(XY) = {T1, T3}
  Diffset(XY, X) = {T2}
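The tid-list and diffset operations above reduce to set intersection and difference. A minimal sketch using the slide's own example (with T1, T2, T3 written as plain integers):

```python
# Vertical-format sketch: itemsets as tid-lists. Intersecting tid-lists
# gives the tid-list of the combined itemset; a diffset stores only the
# tids lost when extending X to XY.
t = {
    "X": {1, 2, 3},
    "Y": {1, 3, 4},
}

t_XY = t["X"] & t["Y"]      # {1, 3}: support of XY is 2
diffset = t["X"] - t_XY     # {2}: typically far smaller than t_XY

# Support is recoverable from the diffset alone:
assert len(t["X"]) - len(diffset) == len(t_XY)
```

Storing `diffset` instead of `t_XY` is the CHARM-style saving: on dense data, the difference set shrinks as patterns grow while the tid-lists stay large.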
Visualization of Association Rules (SGI/MineSet 3.0)

Visualization of Association Rules: Rule Graph
Mining Multiple-Level Association Rules

Multi-level Association: Redundancy Filtering
Static Discretization of Quantitative Attributes

- Discretized prior to mining using a concept hierarchy
- Numeric values are replaced by ranges

Quantitative Association Rules

- Proposed by Lent, Swami and Widom, ICDE'97
- Numeric attributes are dynamically discretized such that the confidence or compactness of the rules mined is maximized

- Dynamically raise min_sup in FP-tree construction and mining, and select the most promising path to mine

Constraint-based association mining

Summary
Interestingness Measure: Correlations (Lift)

- play basketball ⇒ eat cereal [40%, 66.7%] is misleading
  The overall % of students eating cereal is 75% > 66.7%
- play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence
- Measure of dependent/correlated events: lift

  lift = P(A ∩ B) / (P(A) P(B))

                Basketball   Not basketball   Sum (row)
  Cereal        2000         1750             3750
  Not cereal    1000         250              1250
  Sum (col.)    3000         2000             5000

  lift(B, C)  = (2000/5000) / ((3000/5000) * (3750/5000)) = 0.89
  lift(B, ¬C) = (1000/5000) / ((3000/5000) * (1250/5000)) = 1.33

Are Lift and χ2 Good Measures of Correlation?

- "Buy walnuts ⇒ buy milk [1%, 80%]" is misleading if 85% of customers buy milk
- Support and confidence are not good at representing correlations
- So many interestingness measures? (Tan, Kumar, Srivastava @KDD'02)

  coh = sup(X) / |universe(X)|

       m,c    ~m,c   m,~c    ~m,~c     lift   all-conf   coh    χ2
  A1   1000   100    100     10,000    9.26   0.91       0.83   9055
  A2   100    1000   1000    100,000   8.44   0.09       0.05   670
  A3   1000   100    10000   100,000   9.18   0.09       0.09   8172
  A4   1000   1000   1000    1000      1      0.5        0.33   0
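The two lift values on the basketball/cereal table can be reproduced directly from the counts:

```python
# Lift on the basketball/cereal contingency table from the slide:
# 2000 students play basketball and eat cereal, 3000 play basketball,
# 3750 eat cereal, 1000 play basketball but eat no cereal, 1250 eat
# no cereal, 5000 students in total.
n = 5000
p_b, p_c, p_bc = 3000 / n, 3750 / n, 2000 / n

lift_bc = p_bc / (p_b * p_c)                    # 0.89 < 1: negatively correlated
lift_b_notc = (1000 / n) / (p_b * (1250 / n))   # 1.33 > 1: positively correlated
```

Lift below 1 means the events occur together less often than independence predicts, which is exactly why the high-confidence rule basketball ⇒ cereal is misleading.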
Constraint-based (Query-Directed) Mining

Constraints in Data Mining

Constrained mining vs. constraint-based search in AI
- Both are aimed at reducing the search space
  Finding all patterns satisfying constraints vs. finding some (or one) answer in constraint-based search in AI
  Constraint-pushing vs. heuristic search
  It is an interesting research problem how to integrate them
Constrained mining vs. query processing in DBMS
- Database query processing requires finding all answers
- Constrained pattern mining shares a similar philosophy with pushing selections deeply into query processing

Anti-monotonicity:

  TDB (min_sup = 2)

  TID | Transaction
  10  | a, b, c, d, f
  20  | b, c, d, f, g, h
  30  | a, c, d, e, f
  40  | c, e, f, g

  Item | Profit
  a    | 40
  b    | 0
  c    | -20
  d    | 10
  e    | -30
  f    | 30
  g    | 20
  h    | -10

- When an itemset S violates the constraint, so does any of its supersets
- sum(S.Price) ≤ v is anti-monotone
- sum(S.Price) ≥ v is not anti-monotone
- Example. C: range(S.profit) ≤ 15 is anti-monotone
  Itemset ab violates C
  So does every superset of ab
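The range constraint on the item-profit table is easy to check mechanically; once ab violates C, every superset must as well, which is the anti-monotone property that lets Apriori prune:

```python
# Checking the example constraint C: range(S.profit) <= 15 on the
# item-profit table from the slide.
profit = {"a": 40, "b": 0, "c": -20, "d": 10,
          "e": -30, "f": 30, "g": 20, "h": -10}

def profit_range(itemset):
    vals = [profit[i] for i in itemset]
    return max(vals) - min(vals)

def satisfies_C(itemset):
    return profit_range(itemset) <= 15

assert not satisfies_C({"a", "b"})       # range 40 - 0 = 40 > 15: ab violates C
assert not satisfies_C({"a", "b", "d"})  # supersets of ab violate too (anti-monotone)
assert satisfies_C({"a", "f"})           # range 40 - 30 = 10 <= 15
```

Adding an item can only widen (never shrink) the range, so a violation can never be repaired by extension; that is all the anti-monotone pruning relies on.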
Monotonicity for Constraint Pushing

- Monotonicity: when an itemset S satisfies the constraint, so does any of its supersets
- sum(S.Price) ≥ v is monotone
- min(S.Price) ≤ v is monotone
- Example. C: range(S.profit) ≥ 15
  Itemset ab satisfies C
  So does every superset of ab

  TDB (min_sup = 2)

  TID | Transaction
  10  | a, b, c, d, f
  20  | b, c, d, f, g, h
  30  | a, c, d, e, f
  40  | c, e, f, g

  Item | Profit
  a    | 40
  b    | 0
  c    | -20
  d    | 10
  e    | -30
  f    | 30
  g    | 20
  h    | -10

Succinctness

- Given A1, the set of items satisfying a succinctness constraint C, any set S satisfying C is based on A1, i.e., S contains a subset belonging to A1
- Idea: without looking at the transaction database, whether an itemset S satisfies constraint C can be determined based on the selection of items
- min(S.Price) ≤ v is succinct
- sum(S.Price) ≥ v is not succinct
- Optimization: if C is succinct, C is pre-counting pushable
The Constrained Apriori Algorithm: Push an Anti-monotone Constraint Deep

The Constrained Apriori Algorithm: Push a Succinct Constraint Deep

[Figure: the standard Apriori trace on Database D (TID 100: 1 3 4; 200: 2 3 5; 300: 1 2 3 5; 400: 2 5).
 Scan D for C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3; L1 = {1}, {2}, {3}, {5}.
 C2 = {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}; scan D: {1 2}:1, {1 3}:2, {1 5}:1, {2 3}:2, {2 5}:3, {3 5}:2; L2 = {1 3}, {2 3}, {2 5}, {3 5}.
 C3 = {2 3 5}; scan D: L3 = {2 3 5}:2.
 With the anti-monotone constraint Sum{S.price} < 5, violating candidates are pruned before counting; with the succinct constraint min{S.price} <= 1, item selection is done up front and some candidates are "not immediately to be used".]
Can Apriori Handle Convertible Constraints?

- A convertible constraint that is neither monotone nor anti-monotone nor succinct cannot be pushed deep into an Apriori mining algorithm
- Different constraints may require different or even conflicting item orderings
- If there is a conflict between the orders of items
  Try to satisfy one constraint first
  Then use the order for the other constraint to mine frequent itemsets in the corresponding projected database

Mining With Convertible Constraints

- C: avg(X) ≥ 25, min_sup = 2
- List items in every transaction in value-descending order R: <a, f, g, d, b, h, c, e>
  C is convertible anti-monotone w.r.t. R

  Item | Value
  a    | 40
  f    | 30
  g    | 20
  d    | 10
  b    | 0
  h    | -10
  c    | -20
  e    | -30

  Constraint                                        | Convertible anti-monotone | Convertible monotone | Strongly convertible
  sum(S) ≤ v (items could be of any value, v ≤ 0)   | No                        | Yes                  | No
  sum(S) ≥ v (items could be of any value, v ≥ 0)   | No                        | Yes                  | No
  sum(S) ≥ v (items could be of any value, v ≤ 0)   | Yes                       | No                   | No
  ...
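Why avg(X) ≥ 25 becomes anti-monotone under the value-descending order R can be checked numerically: extending an R-prefix only appends smaller values, so once the running average drops below 25 it can never recover.

```python
# Convertible-constraint sketch on the slide's item-value table:
# under the value-descending order R, avg(S) >= 25 behaves
# anti-monotonically for prefix extensions.
value = {"a": 40, "f": 30, "g": 20, "d": 10,
         "b": 0, "h": -10, "c": -20, "e": -30}
R = sorted(value, key=value.get, reverse=True)  # ['a','f','g','d','b','h','c','e']

def avg(items):
    return sum(value[i] for i in items) / len(items)

assert avg("af") == 35     # satisfies C: avg >= 25
assert avg("afg") == 30    # still satisfies
assert avg("afgd") == 25   # still satisfies (boundary)
assert avg("afgdb") == 20  # fails; every further R-extension also fails
```

The same constraint is not anti-monotone without the ordering: adding a large-value item to an arbitrary itemset can raise its average back above 25.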
Constraint-Based Mining—A General Picture

A Classification of Constraints

  Constraint                      | Anti-monotone | Monotone    | Succinct
  sum(S) ≤ v ( a ∈ S, a ≥ 0 )     | yes           | no          | no
  sum(S) ≥ v ( a ∈ S, a ≥ 0 )     | no            | yes         | no
  range(S) ≤ v                    | yes           | no          | no
  range(S) ≥ v                    | no            | yes         | no
  avg(S) θ v, θ ∈ { =, ≤, ≥ }     | convertible   | convertible | no
  support(S) ≥ ξ                  | yes           | no          | no
  support(S) ≤ ξ                  | no            | yes         | no

[Figure: classification diagram relating the anti-monotone, monotone, convertible anti-monotone, convertible monotone, and inconvertible constraint classes]
Frequent-Pattern Mining: Research Problems

- Mining fault-tolerant frequent, sequential and structured patterns
  Patterns allow limited faults (insertion, deletion, mutation)
- Mining truly interesting patterns
  Surprising, novel, concise, ...
- Application exploration
  E.g., DNA sequence analysis and bio-pattern classification
- "Invisible" data mining

Ref: Basic Concepts of Frequent Pattern Mining

- (Association rules) R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD'93.
- (Max-pattern) R. J. Bayardo. Efficiently mining long patterns from databases. SIGMOD'98.
- (Closed-pattern) N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. ICDT'99.
- (Sequential pattern) R. Agrawal and R. Srikant. Mining sequential patterns. ICDE'95.
Ref: Apriori and Its Improvements

- R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94.
- H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association rules. KDD'94.
- A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. VLDB'95.
- J. S. Park, M. S. Chen, and P. S. Yu. An effective hash-based algorithm for mining association rules. SIGMOD'95.
- H. Toivonen. Sampling large databases for association rules. VLDB'96.
- S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket analysis. SIGMOD'97.
- S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. SIGMOD'98.

Ref: Depth-First, Projection-Based Frequent-Pattern Mining

- R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for generation of frequent itemsets. J. Parallel and Distributed Computing, 2002.
- J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD'00.
- J. Pei, J. Han, and R. Mao. CLOSET: An efficient algorithm for mining frequent closed itemsets. DMKD'00.
- J. Liu, Y. Pan, K. Wang, and J. Han. Mining frequent item sets by opportunistic projection. KDD'02.
- J. Han, J. Wang, Y. Lu, and P. Tzvetkov. Mining top-k frequent closed patterns without minimum support. ICDM'02.
- J. Wang, J. Han, and J. Pei. CLOSET+: Searching for the best strategies for mining frequent closed itemsets. KDD'03.
- G. Liu, H. Lu, W. Lou, and J. X. Yu. On computing, storing and querying frequent patterns. KDD'03.
Ref: Vertical Format and Row Enumeration Methods

- M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithms for discovery of association rules. DAMI, 1997.
- M. J. Zaki and C.-J. Hsiao. CHARM: An efficient algorithm for closed itemset mining. SDM'02.
- C. Bucila, J. Gehrke, D. Kifer, and W. White. DualMiner: A dual-pruning algorithm for itemsets with constraints. KDD'02.
- F. Pan, G. Cong, A. K. H. Tung, J. Yang, and M. Zaki. CARPENTER: Finding closed patterns in long biological datasets. KDD'03.

Ref: Mining Multi-Level and Quantitative Rules

- R. Srikant and R. Agrawal. Mining generalized association rules. VLDB'95.
- J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. VLDB'95.
- R. Srikant and R. Agrawal. Mining quantitative association rules in large relational tables. SIGMOD'96.
- T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Data mining using two-dimensional optimized association rules: Scheme, algorithms, and visualization. SIGMOD'96.
- K. Yoda, T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Computing optimized rectilinear regions for association rules. KDD'97.
- R. J. Miller and Y. Yang. Association rules over interval data. SIGMOD'97.
- Y. Aumann and Y. Lindell. A statistical theory for quantitative association rules. KDD'99.
October 22, 2007 Data Mining: Concepts and Techniques 121 October 22, 2007 Data Mining: Concepts and Techniques 122
Ref: Mining Correlations and Interesting Rules

M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM'94.
S. Brin, R. Motwani, and C. Silverstein. Beyond market baskets: Generalizing association rules to correlations. SIGMOD'97.
C. Silverstein, S. Brin, R. Motwani, and J. Ullman. Scalable techniques for mining causal structures. VLDB'98.
P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the Right Interestingness Measure for Association Patterns. KDD'02.
E. Omiecinski. Alternative Interest Measures for Mining Associations. TKDE'03.
Y. K. Lee, W. Y. Kim, Y. D. Cai, and J. Han. CoMine: Efficient Mining of Correlated Patterns. ICDM'03.

Ref: Mining Other Kinds of Rules

R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining association rules. VLDB'96.
B. Lent, A. Swami, and J. Widom. Clustering association rules. ICDE'97.
A. Savasere, E. Omiecinski, and S. Navathe. Mining for strong negative associations in a large database of customer transactions. ICDE'98.
D. Tsur, J. D. Ullman, S. Abiteboul, C. Clifton, R. Motwani, and S. Nestorov. Query flocks: A generalization of association-rule mining. SIGMOD'98.
F. Korn, A. Labrinidis, Y. Kotidis, and C. Faloutsos. Ratio rules: A new paradigm for fast, quantifiable data mining. VLDB'98.
K. Wang, S. Zhou, and J. Han. Profit Mining: From Patterns to Actions. EDBT'02.
Ref: Constraint-Based Pattern Mining

R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item constraints. KDD'97.
R. Ng, L. V. S. Lakshmanan, J. Han, and A. Pang. Exploratory mining and pruning optimizations of constrained association rules. SIGMOD'98.
M. N. Garofalakis, R. Rastogi, and K. Shim. SPIRIT: Sequential Pattern Mining with Regular Expression Constraints. VLDB'99.
G. Grahne, L. Lakshmanan, and X. Wang. Efficient mining of constrained correlated sets. ICDE'00.
J. Pei, J. Han, and L. V. S. Lakshmanan. Mining Frequent Itemsets with Convertible Constraints. ICDE'01.
J. Pei, J. Han, and W. Wang. Mining Sequential Patterns with Constraints in Large Databases. CIKM'02.

Ref: Mining Sequential and Structured Patterns

R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT'96.
H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. DAMI:97.
M. Zaki. SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine Learning:01.
J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. ICDE'01.
M. Kuramochi and G. Karypis. Frequent Subgraph Discovery. ICDM'01.
X. Yan, J. Han, and R. Afshar. CloSpan: Mining Closed Sequential Patterns in Large Datasets. SDM'03.
X. Yan and J. Han. CloseGraph: Mining Closed Frequent Graph Patterns. KDD'03.
Ref: Mining Spatial, Multimedia, and Web Data

K. Koperski and J. Han. Discovery of Spatial Association Rules in Geographic Information Databases. SSD'95.
O. R. Zaiane, M. Xin, and J. Han. Discovering Web Access Patterns and Trends by Applying OLAP and Data Mining Technology on Web Logs. ADL'98.
O. R. Zaiane, J. Han, and H. Zhu. Mining Recurrent Items in Multimedia with Progressive Resolution Refinement. ICDE'00.
D. Gunopulos and I. Tsoukatos. Efficient Mining of Spatiotemporal Patterns. SSTD'01.

Ref: Mining Frequent Patterns in Time-Series Data

B. Ozden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules. ICDE'98.
J. Han, G. Dong, and Y. Yin. Efficient Mining of Partial Periodic Patterns in Time Series Database. ICDE'99.
H. Lu, L. Feng, and J. Han. Beyond Intra-Transaction Association Analysis: Mining Multi-Dimensional Inter-Transaction Association Rules. TOIS:00.
B.-K. Yi, N. Sidiropoulos, T. Johnson, H. V. Jagadish, C. Faloutsos, and A. Biliris. Online Data Mining for Co-Evolving Time Sequences. ICDE'00.
W. Wang, J. Yang, and R. Muntz. TAR: Temporal Association Rules on Evolving Numerical Attributes. ICDE'01.
J. Yang, W. Wang, and P. S. Yu. Mining Asynchronous Periodic Patterns in Time Series Data. TKDE'03.
Ref: Iceberg Cube and Cube Computation

S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S. Sarawagi. On the computation of multidimensional aggregates. VLDB'96.
Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous multidimensional aggregates. SIGMOD'97.
J. Gray, et al. Data cube: A relational aggregation operator generalizing group-by, cross-tab and sub-totals. DAMI:97.
M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ullman. Computing iceberg queries efficiently. VLDB'98.
S. Sarawagi, R. Agrawal, and N. Megiddo. Discovery-driven exploration of OLAP data cubes. EDBT'98.
K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and iceberg cubes. SIGMOD'99.

Ref: Iceberg Cube and Cube Exploration

J. Han, J. Pei, G. Dong, and K. Wang. Computing Iceberg Data Cubes with Complex Measures. SIGMOD'01.
W. Wang, H. Lu, J. Feng, and J. X. Yu. Condensed Cube: An Effective Approach to Reducing Data Cube Size. ICDE'02.
G. Dong, J. Han, J. Lam, J. Pei, and K. Wang. Mining Multi-Dimensional Constrained Gradients in Data Cubes. VLDB'01.
T. Imielinski, L. Khachiyan, and A. Abdulghani. Cubegrades: Generalizing association rules. DAMI:02.
L. V. S. Lakshmanan, J. Pei, and J. Han. Quotient Cube: How to Summarize the Semantics of a Data Cube. VLDB'02.
D. Xin, J. Han, X. Li, and B. W. Wah. Star-Cubing: Computing Iceberg Cubes by Top-Down and Bottom-Up Integration. VLDB'03.
Ref: FP for Classification and Clustering

G. Dong and J. Li. Efficient mining of emerging patterns: Discovering trends and differences. KDD'99.
B. Liu, W. Hsu, and Y. Ma. Integrating Classification and Association Rule Mining. KDD'98.
W. Li, J. Han, and J. Pei. CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules. ICDM'01.
H. Wang, W. Wang, J. Yang, and P. S. Yu. Clustering by pattern similarity in large data sets. SIGMOD'02.
J. Yang and W. Wang. CLUSEQ: Efficient and effective sequence clustering. ICDE'03.
B. Fung, K. Wang, and M. Ester. Large Hierarchical Document Clustering Using Frequent Itemsets. SDM'03.
X. Yin and J. Han. CPAR: Classification based on Predictive Association Rules. SDM'03.

Ref: Stream and Privacy-Preserving FP Mining

A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke. Privacy Preserving Mining of Association Rules. KDD'02.
J. Vaidya and C. Clifton. Privacy Preserving Association Rule Mining in Vertically Partitioned Data. KDD'02.
G. Manku and R. Motwani. Approximate Frequency Counts over Data Streams. VLDB'02.
Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang. Multi-Dimensional Regression Analysis of Time-Series Data Streams. VLDB'02.
C. Giannella, J. Han, J. Pei, X. Yan, and P. S. Yu. Mining Frequent Patterns in Data Streams at Multiple Time Granularities. Next Generation Data Mining:03.
A. Evfimievski, J. Gehrke, and R. Srikant. Limiting Privacy Breaches in Privacy Preserving Data Mining. PODS'03.
Ref: Other Freq. Pattern Mining Applications