05 - dbdm2007 - Data Mining
— Chapter 5 —
Jiawei Han
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
©2006 Jiawei Han and Micheline Kamber, All rights reserved
October 22, 2007 Data Mining: Concepts and Techniques
Why Data Mining?

- Explosive growth of data: automated data collection tools, database systems, the Web, a computerized society
- Major sources of abundant data
  Business: Web, e-commerce, transactions, stocks, ...
  Science: remote sensing, bioinformatics, scientific simulation, ...
  Society and everyone: news, digital cameras, ...
- We are drowning in data, but starving for knowledge!
- "Necessity is the mother of invention": data mining, the automated analysis of massive data sets, is a natural step in the evolution of database technology

What Is Data Mining?

- Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
- Data mining: a misnomer?
- Alternative names: knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
- Watch out: is everything "data mining"?
  Simple search and query processing
  (Deductive) expert systems
Why Data Mining?—Potential Applications

Knowledge Discovery (KDD) Process

Data Mining and Business Intelligence

[Pyramid figure: increasing potential to support business decisions, from data sources up to decision making]
- Data sources: paper, files, Web documents, scientific experiments, database systems
- DBA: data preprocessing/integration, data warehouses
- Data exploration: statistical summary, querying, and reporting
- End user: decision making

Data Mining: Confluence of Multiple Disciplines

[Figure: data mining draws on database technology, statistics, pattern recognition, algorithms, and other disciplines]
Why Not Traditional Data Analysis?

- Tremendous amount of data
  Algorithms must be highly scalable to handle data such as tera-bytes of data
- High dimensionality of data
  Micro-arrays may have tens of thousands of dimensions
- High complexity of data
  Data streams and sensor data
  Time-series data, temporal data, sequence data
  Structure data, graphs, social networks and multi-linked data
  Heterogeneous databases and legacy databases
  Spatial, spatiotemporal, multimedia, text and Web data
  Software programs, scientific simulations
- New and sophisticated applications

Multi-Dimensional View of Data Mining

- Data to be mined: relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multi-media, heterogeneous, legacy, WWW
- Knowledge to be mined: characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.; multiple/integrated functions and mining at multiple levels
- Techniques utilized: database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
- Applications adapted: retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.
- Different views lead to different classifications
  Data view: kinds of data to be mined
  Knowledge view: kinds of knowledge to be discovered
  Method view: kinds of techniques utilized
  Application view: kinds of applications adapted

- Kinds of data that can be mined
  Time-series data, temporal data, sequence data (incl. bio-sequences)
  Structure data, graphs, social networks and multi-linked data
  Object-relational databases
  Heterogeneous databases and legacy databases
  Spatial data and spatiotemporal data
  Multimedia database
  Text databases
  The World-Wide Web
Data Mining Functionalities

- Multidimensional concept description: characterization and discrimination
  Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
- Frequent patterns, association, correlation vs. causality
  Diaper ⇒ Beer [0.5%, 75%] (correlation or causality?)
- Classification and prediction
  Construct models (functions) that describe and distinguish classes or concepts for future prediction

Data Mining Functionalities (2)

- Cluster analysis
  Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
  Maximizing intra-class similarity & minimizing interclass similarity
- Outlier analysis
  Outlier: a data object that does not comply with the general behavior of the data
  Noise or exception? Useful in fraud detection, rare-events analysis
- Trend and evolution analysis
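The [support%, confidence%] notation used in rules such as Diaper ⇒ Beer above can be sketched directly. A minimal illustration with made-up transactions (the real Diaper/Beer figures are not derivable from the slide):

```python
# Hedged sketch: support and confidence for a rule X => Y over a toy
# transaction database (the items and counts here are assumptions,
# not the slide's data).
transactions = [
    {"diaper", "beer", "nuts"},
    {"diaper", "beer", "coffee"},
    {"diaper", "milk"},
    {"milk", "coffee"},
]

def support(itemset, db):
    """Fraction of transactions containing every item in `itemset`."""
    return sum(itemset <= t for t in db) / len(db)

def confidence(lhs, rhs, db):
    """P(rhs | lhs): support of the union divided by support of lhs."""
    return support(lhs | rhs, db) / support(lhs, db)

print(support({"diaper", "beer"}, transactions))       # 0.5
print(confidence({"diaper"}, {"beer"}, transactions))  # 2/3
```

So the rule "diaper ⇒ beer" here would be written [50%, 66.7%].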
Are All the "Discovered" Patterns Interesting?

- Data mining may generate thousands of patterns: not all of them are interesting
  Suggested approach: human-centered, query-based, focused mining
- Interestingness measures
  A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
- Objective vs. subjective interestingness measures
  Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
  Subjective: based on the user's belief in the data, e.g., unexpectedness, novelty, actionability, etc.

Find All and Only Interesting Patterns?

- Find all the interesting patterns: completeness
  Can a data mining system find all the interesting patterns? Do we need to find all of the interesting patterns?
  Heuristic vs. exhaustive search
  Association vs. classification vs. clustering
- Search for only interesting patterns: an optimization problem
  Can a data mining system find only the interesting patterns?
  Approaches
    First generate all the patterns and then filter out the uninteresting ones
    Generate only the interesting patterns: mining query optimization
Other Pattern Mining Issues

Why Data Mining Query Language?
Primitive 2: Types of Knowledge to Be Mined

Primitive 3: Background Knowledge
Primitive 4: Pattern Interestingness Measure

- Simplicity: e.g., (association) rule length, (decision) tree size
- Certainty: e.g., confidence, P(A|B) = #(A and B) / #(B); classification reliability or accuracy, certainty factor, rule strength, rule quality, discriminating weight, etc.
- Utility: potential usefulness, e.g., support (association), noise threshold (description)
- Novelty: not previously known, surprising (used to remove redundant rules, e.g., the Illinois vs. Champaign rule implication support ratio)

Primitive 5: Presentation of Discovered Patterns

- Different backgrounds/usages may require different forms of representation
  E.g., rules, tables, crosstabs, pie/bar charts, etc.
- Concept hierarchy is also important
  Discovered knowledge might be more understandable when represented at a high level of abstraction
  Interactive drill up/down, pivoting, slicing and dicing provide different perspectives on the data
- Different kinds of knowledge require different representations: association, classification, clustering, etc.
DMQL—A Data Mining Query Language

- Motivation
  A DMQL can provide the ability to support ad-hoc and interactive data mining
  By providing a standardized language like SQL
    Hope to achieve an effect similar to the one SQL has had on relational databases
    Foundation for system development and evolution
    Facilitate information exchange, technology transfer, commercialization and wide acceptance
- Design
  DMQL is designed with the primitives described earlier

An Example Query in DMQL
- Based on OLE, OLE DB, OLE DB for OLAP, C#
- Integrating DBMS, data warehouse and data mining
- DMML (Data Mining Mark-up Language) by DMG (www.dmg.org)
- Providing a platform and process structure for effective data mining
- Emphasizing the deployment of data mining technology to solve business problems

- Interactive mining of multi-level knowledge
  Necessity of mining knowledge and patterns at different levels of abstraction by drilling/rolling, pivoting, slicing/dicing, etc.
- Integration of multiple mining functions
  E.g., characterized classification: first clustering and then association
Coupling Data Mining with DB/DW Systems

Architecture: Typical Data Mining System
Major Issues in Data Mining

- Mining methodology
  Performance: efficiency, effectiveness, and scalability
  Pattern evaluation: the interestingness problem
  Incorporation of background knowledge
  Handling noise and incomplete data
  Parallel, distributed and incremental mining methods
  Integration of the discovered knowledge with existing knowledge: knowledge fusion
- User interaction
  Data mining query languages and ad-hoc mining
  Expression and visualization of data mining results
  Interactive mining of knowledge at multiple levels of abstraction
- Applications and social impacts
  Domain-specific data mining & invisible data mining

Summary

- Data mining: a natural evolution of database technology, in great demand, with wide applications
- A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
- Mining can be performed in a variety of information repositories
- Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
- Data mining systems and architectures
- Major issues in data mining
A Brief History of Data Mining Society

- 1989 IJCAI Workshop on Knowledge Discovery in Databases
  Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
- 1991-1994 Workshops on Knowledge Discovery in Databases
  Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
- 1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD'95-98)
  Journal of Data Mining and Knowledge Discovery (1997)
- ACM SIGKDD conferences since 1998 and SIGKDD Explorations

Conferences and Journals on Data Mining

- KDD conferences
  ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
  SIAM Data Mining Conf. (SDM)
  (IEEE) Int. Conf. on Data Mining (ICDM)
  Conf. on Principles and Practices of Knowledge Discovery and Data Mining
- Other related conferences: ACM SIGMOD, VLDB, (IEEE) ICDE, WWW, SIGIR, ICML, CVPR, NIPS
- Journals: Data Mining and Knowledge Discovery (DAMI or DMKD)
Where to Find References?

- Data mining and KDD
  Journals: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD
- Database systems (SIGMOD: ACM SIGMOD Anthology—CD ROM)
  Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
  Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.
- AI & machine learning
  Conferences: Machine Learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.
  Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-PAMI, etc.
- Web and IR
  Conferences: SIGIR, WWW, CIKM, etc.
  Journals: WWW: Internet and Web Information Systems
- Statistics
  Conferences: Joint Stat. Meeting, etc.
  Journals: Annals of Statistics, etc.
- Visualization
  Conference proceedings: CHI, ACM-SIGGraph, etc.
  Journals: IEEE Trans. Visualization and Computer Graphics, etc.

Recommended Reference Books

- R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification. 2nd ed., Wiley-Interscience, 2000.
- T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003.
- U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996.
- U. Fayyad, G. Grinstein, and A. Wierse. Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann, 2001.
- J. Han and M. Kamber. Data Mining: Concepts and Techniques. 2nd ed., Morgan Kaufmann, 2006.
- D. J. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, 2001.
- T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, 2001.
- T. M. Mitchell. Machine Learning. McGraw Hill, 1997.
- G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991.
- P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Wiley, 2005.
- S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1998.
- I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations. 2nd ed., Morgan Kaufmann, 2005.
Chapter 5: Mining Frequent Patterns,
Association and Correlations
Basic Concepts: Frequent Patterns and Association Rules

Closed Patterns and Max-Patterns
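A closed pattern is a frequent pattern with no proper superset of the same support; a max-pattern is a frequent pattern with no frequent proper superset. A brute-force sketch on assumed toy data (not an example from the slides) makes the distinction concrete:

```python
from itertools import combinations

# Brute-force illustration of closed vs. max patterns over a tiny
# assumed database; fine for illustration, hopeless at scale.
db = [frozenset("abc"), frozenset("abc"), frozenset("ab"), frozenset("cd")]
min_sup = 2

def sup(x):
    return sum(x <= t for t in db)

items = set().union(*db)
candidates = [frozenset(c) for r in range(1, len(items) + 1)
              for c in combinations(sorted(items), r)]
frequent = {x for x in candidates if sup(x) >= min_sup}

# Closed: no proper superset with the same support.
closed = {x for x in frequent
          if not any(x < y and sup(y) == sup(x) for y in frequent)}
# Maximal: no frequent proper superset at all.
maximal = {x for x in frequent if not any(x < y for y in frequent)}
```

Here `frequent` contains a, b, c, ab, ac, bc, abc; only c, ab and abc are closed, and only abc is maximal.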
Scalable Methods for Mining Frequent Patterns

- The downward closure property of frequent patterns
  Any subset of a frequent itemset must be frequent
  If {beer, diaper, nuts} is frequent, so is {beer, diaper}, i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}

Apriori: A Candidate Generation-and-Test Approach

- Apriori pruning principle: if there is any itemset which is infrequent, its superset should not be generated/tested! (Agrawal & Srikant @VLDB'94; Mannila et al. @KDD'94)
- Method: generate length-(k+1) candidate itemsets from length-k frequent itemsets, and test the generated candidates against the database
Important Details of Apriori

- How to generate candidates?
  Step 1: self-joining Lk-1
  Step 2: pruning
- How to count supports of candidates?
- Example of candidate generation
  L3 = {abc, abd, acd, ace, bcd}
  Self-joining: L3*L3
    abcd from abc and abd
    acde from acd and ace
  Pruning: acde is removed because ade is not in L3
  C4 = {abcd}

How to Generate Candidates?

Suppose the items in Lk-1 are listed in an order.

Step 1: self-joining Lk-1

  insert into Ck
  select p.item1, p.item2, ..., p.itemk-1, q.itemk-1
  from Lk-1 p, Lk-1 q
  where p.item1 = q.item1, ..., p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

Step 2: pruning

  forall itemsets c in Ck do
    forall (k-1)-subsets s of c do
      if (s is not in Lk-1) then delete c from Ck
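The self-join and prune steps translate almost line for line into code. A sketch (item tuples are kept sorted so the SQL-style prefix comparison works):

```python
from itertools import combinations

def apriori_gen(L_prev):
    """Candidate generation: L_prev is a set of frozensets of size k-1;
    returns the pruned candidate k-itemsets Ck."""
    ordered = [tuple(sorted(x)) for x in L_prev]
    Ck = set()
    for p in ordered:
        for q in ordered:
            # Self-join: first k-2 items equal, last item of p < last of q.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                Ck.add(frozenset(p + (q[-1],)))
    # Prune: every (k-1)-subset of a candidate must itself be in L_prev.
    return {c for c in Ck
            if all(frozenset(s) in L_prev
                   for s in combinations(c, len(c) - 1))}

L3 = {frozenset(s) for s in ("abc", "abd", "acd", "ace", "bcd")}
print(apriori_gen(L3))  # only abcd survives; acde is pruned (ade not in L3)
```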
- Method:
  Candidate itemsets are stored in a hash-tree
  A leaf node of the hash-tree contains a list of itemsets and counts
  An interior node contains a hash table
  Subset function: finds all the candidates contained in a transaction

[Figure: hash-tree of candidate 3-itemsets (e.g., 145, 124, 457, 125, 458, 159, 136, 345, 356, 357, 689, 367, 368, 234, 567); a transaction 1 2 3 5 6 is decomposed as 1+2356, 12+356, 13+56 during subset checking]
Efficient Implementation of Apriori in SQL

Challenges of Frequent Pattern Mining
Partition: Scan Database Only Twice

- Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
- A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association in large databases. VLDB'95.

DHP: Reduce the Number of Candidates

- A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
  E.g., ab is not a candidate 2-itemset if the total count of its bucket {ab, ad, ae} is below the support threshold
- J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. SIGMOD'95.
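The DHP idea can be sketched in a few lines: while scanning for frequent 1-itemsets, hash every 2-itemset of each transaction into a small bucket table; a pair can only be a candidate if its whole bucket reaches min_sup. The bucket count of 7 and the toy data are assumptions for illustration:

```python
from collections import Counter
from itertools import combinations

def dhp_bucket_counts(db, n_buckets=7):
    """Hash every 2-itemset of every transaction into a bucket counter."""
    buckets = Counter()
    for t in db:
        for pair in combinations(sorted(t), 2):
            buckets[hash(pair) % n_buckets] += 1
    return buckets

def may_be_frequent(pair, buckets, min_sup, n_buckets=7):
    """A pair whose bucket count is below min_sup cannot be frequent;
    the converse does not hold (buckets over-count due to collisions)."""
    return buckets[hash(tuple(sorted(pair))) % n_buckets] >= min_sup

db = [{"a", "b", "c"}, {"a", "b"}, {"c", "d"}]
buckets = dhp_bucket_counts(db)
print(may_be_frequent(("a", "b"), buckets, min_sup=2))  # True
```

Note the one-sided guarantee: the filter only ever discards pairs, never admits a frequent pair wrongly.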
Sampling for Frequent Patterns

DIC: Reduce Number of Scans

- Multiple database scans are costly
- Mining long patterns needs many passes of scanning and generates lots of candidates
  To find the frequent itemset i1 i2 ... i100:
    # of scans: 100
    # of candidates: C(100,1) + C(100,2) + ... + C(100,100) = 2^100 - 1 ≈ 1.27 * 10^30!
- Bottleneck: candidate generation-and-test
- Can we avoid candidate generation?

- Grow long patterns from short ones using local frequent items
  "abc" is a frequent pattern
  Get all transactions having "abc": DB|abc
  "d" is a local frequent item in DB|abc ⇒ abcd is a frequent pattern
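The candidate-count arithmetic above is easy to verify: every nonempty subset of the 100 items would be enumerated.

```python
from math import comb

# Sanity-check of the slide's candidate count for a single frequent
# 100-itemset: the sum of all nonempty subset counts is 2^100 - 1.
n_candidates = sum(comb(100, k) for k in range(1, 101))
assert n_candidates == 2**100 - 1
print(f"{n_candidates:.3g}")  # about 1.27e+30
```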
Construct FP-tree from a Transaction Database

Benefits of the FP-tree Structure

- Completeness and non-redundancy

Find Patterns Having P From P-conditional Database

- Partition patterns and databases: patterns containing p; patterns having m but no p; ...; patterns having c but no a nor b, m, p; pattern f

[Figure: FP-tree with header table: item f (frequency 4), c (4), a (3), b (3), m (3), p (3); tree paths f:4-c:3-a:3-m:2-p:2, f:4-c:3-a:3-b:1-m:1, f:4-b:1, and c:1-b:1-p:1]

Conditional pattern bases:

  item | conditional pattern base
  c    | f:3
  a    | fc:3
  b    | fca:1, f:1, c:1
  m    | fca:2, fcab:1
  p    | fcam:2, cb:1
From Conditional Pattern-bases to Conditional FP-trees

- For each pattern-base
  Accumulate the count for each item in the base
  Construct the FP-tree for the frequent items of the pattern base
- m-conditional pattern base: fca:2, fcab:1
- m-conditional FP-tree: {} → f:3 → c:3 → a:3
- All frequent patterns relating to m: m, fm, cm, am, fcm, fam, cam, fcam

Recursion: Mining Each Conditional FP-tree

- Cond. pattern base of "am": (fc:3); am-conditional FP-tree: {} → f:3 → c:3
- Cond. pattern base of "cm": (f:3); cm-conditional FP-tree: {} → f:3
- Cond. pattern base of "cam": (f:3); cam-conditional FP-tree: {} → f:3
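The recursion above can be sketched without building a literal FP-tree, by recursing on projected (conditional) databases instead; this keeps the pattern-growth idea while staying short. The transactions below follow the f, c, a, b, m, p example run, with min_sup = 3 assumed:

```python
from collections import Counter

def pattern_growth(db, min_sup, suffix=frozenset()):
    """Pattern-growth sketch: for each frequent item x, recurse on the
    database restricted to transactions containing x (the conditional
    database DB|x), exactly the 'grow abcd from abc' idea."""
    counts = Counter(i for t in db for i in t)
    patterns = {}
    for item, n in counts.items():
        if n >= min_sup:
            pat = suffix | {item}
            patterns[pat] = n  # n is the true support of pat in the full DB
            projected = [t - {item} for t in db if item in t]
            patterns.update(pattern_growth(projected, min_sup, pat))
    return patterns

db = [frozenset("fcamp"), frozenset("fcabm"), frozenset("fb"),
      frozenset("cbp"), frozenset("fcamp")]
freq = pattern_growth(db, min_sup=3)
print(freq[frozenset("fcam")])  # 3
```

This sketch enumerates each pattern along several recursion branches (a real FP-growth avoids that), but the reported supports agree on every branch, so the result is correct.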
A Special Case: Single Prefix Path in FP-tree

[Figure: an FP-tree whose top is a single prefix path a1:n1 → a2:n2 → a3:n3, followed by a multi-path part with nodes b1:m1, C1:k1, C2:k2, C3:k3; mining is decomposed into the prefix-path part and the multi-path part]

Mining Frequent Patterns With FP-trees

- Recurse until the resulting FP-tree is empty, or it contains only one path; a single path generates all the combinations of its sub-paths, each of which is a frequent pattern
Scaling FP-growth by DB Projection

Partition-based Projection

[Figure: partition-based projection of the transaction DB; e.g., the am-projected DB is {fc, fc, fc} and the cm-projected DB is {f, f, f}]
FP-Growth vs. Apriori: Scalability With the Support Threshold

[Figure: run time (sec.) vs. support threshold (%), comparing FP-growth and Apriori as the threshold drops from 3% toward 0]

FP-Growth vs. Tree-Projection: Scalability with the Support Threshold

[Figure: run time (sec.) vs. support threshold (%), comparing FP-growth and TreeProjection for thresholds from 2% toward 0]
Why Is FP-Growth the Winner?

Implications of the Methodology
Mining max-patterns:

  Tid | Items
  10  | A, B, C, D, E
  20  | B, C, D, E
  30  | A, C, D, F

- 1st scan: find frequent items: A, B, C, D, E
- 2nd scan: find support for AB, AC, AD, AE, ABCDE; BC, BD, BE, BCDE; CD, CE, DE, CDE (potential max-patterns)
- Since BCDE is a max-pattern, there is no need to check BCD, BDE, CDE in a later scan
- R. Bayardo. Efficiently mining long patterns from databases. SIGMOD'98.

Mining frequent closed patterns (CLOSET):

  Min_sup = 2

  TID | Items
  10  | a, c, d, e, f
  20  | a, b, e
  30  | c, e, f
  40  | a, c, d, f
  50  | c, e, f

- Flist: list of all frequent items in support-ascending order: d-a-f-e-c
- Divide search space: patterns having d; patterns having d but no a; etc.
- Find frequent closed patterns recursively
  Every transaction having d also has cfa ⇒ cfad is a frequent closed pattern
- J. Pei, J. Han & R. Mao. CLOSET: An efficient algorithm for mining frequent closed itemsets. DMKD'00.
CLOSET+: Mining Closed Itemsets by Pattern-Growth

- Itemset merging: if Y appears in every occurrence of X, then Y is merged with X
- Sub-itemset pruning: if Y ⊃ X and sup(X) = sup(Y), X and all of X's descendants in the set enumeration tree can be pruned
- Hybrid tree projection
  Bottom-up physical tree-projection
  Top-down pseudo tree-projection
- Item skipping: if a local frequent item has the same support in several header tables at different levels, one can prune it from the header tables at higher levels

CHARM: Mining by Exploring Vertical Data Format

- Vertical format: t(AB) = {T11, T25, ...}
  tid-list: the list of trans.-ids containing an itemset
- Deriving closed patterns based on vertical intersections
  t(X) = t(Y): X and Y always happen together
  t(X) ⊂ t(Y): a transaction having X always has Y
- Using diffsets to accelerate mining
  Only keep track of differences of tids
  t(X) = {T1, T2, T3}, t(XY) = {T1, T3}
  Diffset(XY, X) = {T2}
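The tid-list and diffset operations above reduce to set intersection and difference. A minimal sketch using the slide's own example (with T1, T2, T3 written as plain integers):

```python
# Vertical-format sketch: itemsets as tid-lists. Intersecting tid-lists
# gives the tid-list of the combined itemset; a diffset stores only the
# tids lost when extending X to XY.
t = {
    "X": {1, 2, 3},
    "Y": {1, 3, 4},
}

t_XY = t["X"] & t["Y"]      # {1, 3}: support of XY is 2
diffset = t["X"] - t_XY     # {2}: typically far smaller than t_XY

# Support is recoverable from the diffset alone:
assert len(t["X"]) - len(diffset) == len(t_XY)
```

Storing `diffset` instead of `t_XY` is the CHARM-style saving: on dense data, the difference set shrinks as patterns grow while the tid-lists stay large.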
Visualization of Association Rules (SGI/MineSet 3.0)

Visualization of Association Rules: Rule Graph
Mining Multiple-Level Association Rules

Multi-level Association: Redundancy Filtering
Static Discretization of Quantitative Attributes

- Discretized prior to mining using a concept hierarchy
- Numeric values are replaced by ranges

Quantitative Association Rules

- Proposed by Lent, Swami and Widom, ICDE'97
- Numeric attributes are dynamically discretized such that the confidence or compactness of the rules mined is maximized

- Dynamically raise min_sup in FP-tree construction and mining, and select the most promising path to mine

Constraint-based association mining

Summary
Interestingness Measure: Correlations (Lift)

- play basketball ⇒ eat cereal [40%, 66.7%] is misleading
  The overall % of students eating cereal is 75% > 66.7%
- play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence
- Measure of dependent/correlated events: lift

  lift = P(A ∩ B) / (P(A) P(B))

                Basketball   Not basketball   Sum (row)
  Cereal        2000         1750             3750
  Not cereal    1000         250              1250
  Sum (col.)    3000         2000             5000

  lift(B, C)  = (2000/5000) / ((3000/5000) * (3750/5000)) = 0.89
  lift(B, ¬C) = (1000/5000) / ((3000/5000) * (1250/5000)) = 1.33

Are Lift and χ2 Good Measures of Correlation?

- "Buy walnuts ⇒ buy milk [1%, 80%]" is misleading if 85% of customers buy milk
- Support and confidence are not good at representing correlations
- So many interestingness measures? (Tan, Kumar, Srivastava @KDD'02)

  coh = sup(X) / |universe(X)|

       m,c    ~m,c   m,~c    ~m,~c     lift   all-conf   coh    χ2
  A1   1000   100    100     10,000    9.26   0.91       0.83   9055
  A2   100    1000   1000    100,000   8.44   0.09       0.05   670
  A3   1000   100    10000   100,000   9.18   0.09       0.09   8172
  A4   1000   1000   1000    1000      1      0.5        0.33   0
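The two lift values on the basketball/cereal table can be reproduced directly from the counts:

```python
# Lift on the basketball/cereal contingency table from the slide:
# 2000 students play basketball and eat cereal, 3000 play basketball,
# 3750 eat cereal, 1000 play basketball but eat no cereal, 1250 eat
# no cereal, 5000 students in total.
n = 5000
p_b, p_c, p_bc = 3000 / n, 3750 / n, 2000 / n

lift_bc = p_bc / (p_b * p_c)                    # 0.89 < 1: negatively correlated
lift_b_notc = (1000 / n) / (p_b * (1250 / n))   # 1.33 > 1: positively correlated
```

Lift below 1 means the events occur together less often than independence predicts, which is exactly why the high-confidence rule basketball ⇒ cereal is misleading.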
Constraint-based (Query-Directed) Mining

Constraints in Data Mining

Constrained mining vs. constraint-based search in AI
- Both are aimed at reducing the search space
  Finding all patterns satisfying constraints vs. finding some (or one) answer in constraint-based search in AI
  Constraint-pushing vs. heuristic search
  It is an interesting research problem how to integrate them
Constrained mining vs. query processing in DBMS
- Database query processing requires finding all answers
- Constrained pattern mining shares a similar philosophy with pushing selections deeply into query processing

Anti-monotonicity:

  TDB (min_sup = 2)

  TID | Transaction
  10  | a, b, c, d, f
  20  | b, c, d, f, g, h
  30  | a, c, d, e, f
  40  | c, e, f, g

  Item | Profit
  a    | 40
  b    | 0
  c    | -20
  d    | 10
  e    | -30
  f    | 30
  g    | 20
  h    | -10

- When an itemset S violates the constraint, so does any of its supersets
- sum(S.Price) ≤ v is anti-monotone
- sum(S.Price) ≥ v is not anti-monotone
- Example. C: range(S.profit) ≤ 15 is anti-monotone
  Itemset ab violates C
  So does every superset of ab
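The range constraint on the item-profit table is easy to check mechanically; once ab violates C, every superset must as well, which is the anti-monotone property that lets Apriori prune:

```python
# Checking the example constraint C: range(S.profit) <= 15 on the
# item-profit table from the slide.
profit = {"a": 40, "b": 0, "c": -20, "d": 10,
          "e": -30, "f": 30, "g": 20, "h": -10}

def profit_range(itemset):
    vals = [profit[i] for i in itemset]
    return max(vals) - min(vals)

def satisfies_C(itemset):
    return profit_range(itemset) <= 15

assert not satisfies_C({"a", "b"})       # range 40 - 0 = 40 > 15: ab violates C
assert not satisfies_C({"a", "b", "d"})  # supersets of ab violate too (anti-monotone)
assert satisfies_C({"a", "f"})           # range 40 - 30 = 10 <= 15
```

Adding an item can only widen (never shrink) the range, so a violation can never be repaired by extension; that is all the anti-monotone pruning relies on.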
Monotonicity for Constraint Pushing

- Monotonicity: when an itemset S satisfies the constraint, so does any of its supersets
- sum(S.Price) ≥ v is monotone
- min(S.Price) ≤ v is monotone
- Example. C: range(S.profit) ≥ 15
  Itemset ab satisfies C
  So does every superset of ab

  TDB (min_sup = 2)

  TID | Transaction
  10  | a, b, c, d, f
  20  | b, c, d, f, g, h
  30  | a, c, d, e, f
  40  | c, e, f, g

  Item | Profit
  a    | 40
  b    | 0
  c    | -20
  d    | 10
  e    | -30
  f    | 30
  g    | 20
  h    | -10

Succinctness

- Given A1, the set of items satisfying a succinctness constraint C, any set S satisfying C is based on A1, i.e., S contains a subset belonging to A1
- Idea: without looking at the transaction database, whether an itemset S satisfies constraint C can be determined based on the selection of items
- min(S.Price) ≤ v is succinct
- sum(S.Price) ≥ v is not succinct
- Optimization: if C is succinct, C is pre-counting pushable
The Constrained Apriori Algorithm: Push an Anti-monotone Constraint Deep

The Constrained Apriori Algorithm: Push a Succinct Constraint Deep

[Figure: the standard Apriori trace on Database D (TID 100: 1 3 4; 200: 2 3 5; 300: 1 2 3 5; 400: 2 5).
 Scan D for C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3; L1 = {1}, {2}, {3}, {5}.
 C2 = {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}; scan D: {1 2}:1, {1 3}:2, {1 5}:1, {2 3}:2, {2 5}:3, {3 5}:2; L2 = {1 3}, {2 3}, {2 5}, {3 5}.
 C3 = {2 3 5}; scan D: L3 = {2 3 5}:2.
 With the anti-monotone constraint Sum{S.price} < 5, violating candidates are pruned before counting; with the succinct constraint min{S.price} <= 1, item selection is done up front and some candidates are "not immediately to be used".]
Can Apriori Handle Convertible Constraints?

- A convertible constraint that is neither monotone nor anti-monotone nor succinct cannot be pushed deep into an Apriori mining algorithm
- Different constraints may require different or even conflicting item orderings
- If there is a conflict between the orders of items
  Try to satisfy one constraint first
  Then use the order for the other constraint to mine frequent itemsets in the corresponding projected database

Mining With Convertible Constraints

- C: avg(X) ≥ 25, min_sup = 2
- List items in every transaction in value-descending order R: <a, f, g, d, b, h, c, e>
  C is convertible anti-monotone w.r.t. R

  Item | Value
  a    | 40
  f    | 30
  g    | 20
  d    | 10
  b    | 0
  h    | -10
  c    | -20
  e    | -30

  Constraint                                        | Convertible anti-monotone | Convertible monotone | Strongly convertible
  sum(S) ≤ v (items could be of any value, v ≤ 0)   | No                        | Yes                  | No
  sum(S) ≥ v (items could be of any value, v ≥ 0)   | No                        | Yes                  | No
  sum(S) ≥ v (items could be of any value, v ≤ 0)   | Yes                       | No                   | No
  ...
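Why avg(X) ≥ 25 becomes anti-monotone under the value-descending order R can be checked numerically: extending an R-prefix only appends smaller values, so once the running average drops below 25 it can never recover.

```python
# Convertible-constraint sketch on the slide's item-value table:
# under the value-descending order R, avg(S) >= 25 behaves
# anti-monotonically for prefix extensions.
value = {"a": 40, "f": 30, "g": 20, "d": 10,
         "b": 0, "h": -10, "c": -20, "e": -30}
R = sorted(value, key=value.get, reverse=True)  # ['a','f','g','d','b','h','c','e']

def avg(items):
    return sum(value[i] for i in items) / len(items)

assert avg("af") == 35     # satisfies C: avg >= 25
assert avg("afg") == 30    # still satisfies
assert avg("afgd") == 25   # still satisfies (boundary)
assert avg("afgdb") == 20  # fails; every further R-extension also fails
```

The same constraint is not anti-monotone without the ordering: adding a large-value item to an arbitrary itemset can raise its average back above 25.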
Constraint-Based Mining—A General Picture

A Classification of Constraints

  Constraint                      | Anti-monotone | Monotone    | Succinct
  sum(S) ≤ v ( a ∈ S, a ≥ 0 )     | yes           | no          | no
  sum(S) ≥ v ( a ∈ S, a ≥ 0 )     | no            | yes         | no
  range(S) ≤ v                    | yes           | no          | no
  range(S) ≥ v                    | no            | yes         | no
  avg(S) θ v, θ ∈ { =, ≤, ≥ }     | convertible   | convertible | no
  support(S) ≥ ξ                  | yes           | no          | no
  support(S) ≤ ξ                  | no            | yes         | no

[Figure: classification diagram relating the anti-monotone, monotone, convertible anti-monotone, convertible monotone, and inconvertible constraint classes]
Frequent-Pattern Mining: Research Problems

- Mining fault-tolerant frequent, sequential and structured patterns
  Patterns allow limited faults (insertion, deletion, mutation)
- Mining truly interesting patterns
  Surprising, novel, concise, ...
- Application exploration
  E.g., DNA sequence analysis and bio-pattern classification
- "Invisible" data mining

Ref: Basic Concepts of Frequent Pattern Mining

- (Association rules) R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD'93.
- (Max-pattern) R. J. Bayardo. Efficiently mining long patterns from databases. SIGMOD'98.
- (Closed-pattern) N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. ICDT'99.
- (Sequential pattern) R. Agrawal and R. Srikant. Mining sequential patterns. ICDE'95.
Ref: Apriori and Its Improvements

- R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94.
- H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association rules. KDD'94.
- A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. VLDB'95.
- J. S. Park, M. S. Chen, and P. S. Yu. An effective hash-based algorithm for mining association rules. SIGMOD'95.
- H. Toivonen. Sampling large databases for association rules. VLDB'96.
- S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket analysis. SIGMOD'97.
- S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. SIGMOD'98.

Ref: Depth-First, Projection-Based Frequent-Pattern Mining

- R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for generation of frequent itemsets. J. Parallel and Distributed Computing, 2002.
- J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD'00.
- J. Pei, J. Han, and R. Mao. CLOSET: An efficient algorithm for mining frequent closed itemsets. DMKD'00.
- J. Liu, Y. Pan, K. Wang, and J. Han. Mining frequent item sets by opportunistic projection. KDD'02.
- J. Han, J. Wang, Y. Lu, and P. Tzvetkov. Mining top-k frequent closed patterns without minimum support. ICDM'02.
- J. Wang, J. Han, and J. Pei. CLOSET+: Searching for the best strategies for mining frequent closed itemsets. KDD'03.
- G. Liu, H. Lu, W. Lou, and J. X. Yu. On computing, storing and querying frequent patterns. KDD'03.
Ref: Vertical Format and Row Enumeration Methods

- M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithms for discovery of association rules. DAMI, 1997.
- M. J. Zaki and C.-J. Hsiao. CHARM: An efficient algorithm for closed itemset mining. SDM'02.
- C. Bucila, J. Gehrke, D. Kifer, and W. White. DualMiner: A dual-pruning algorithm for itemsets with constraints. KDD'02.
- F. Pan, G. Cong, A. K. H. Tung, J. Yang, and M. Zaki. CARPENTER: Finding closed patterns in long biological datasets. KDD'03.

Ref: Mining Multi-Level and Quantitative Rules

- R. Srikant and R. Agrawal. Mining generalized association rules. VLDB'95.
- J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. VLDB'95.
- R. Srikant and R. Agrawal. Mining quantitative association rules in large relational tables. SIGMOD'96.
- T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Data mining using two-dimensional optimized association rules: Scheme, algorithms, and visualization. SIGMOD'96.
- K. Yoda, T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Computing optimized rectilinear regions for association rules. KDD'97.
- R. J. Miller and Y. Yang. Association rules over interval data. SIGMOD'97.
- Y. Aumann and Y. Lindell. A statistical theory for quantitative association rules. KDD'99.
October 22, 2007 Data Mining: Concepts and Techniques 121 October 22, 2007 Data Mining: Concepts and Techniques 122
Ref: Mining Correlations and Interesting Rules

M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM'94.
S. Brin, R. Motwani, and C. Silverstein. Beyond market baskets: Generalizing association rules to correlations. SIGMOD'97.
C. Silverstein, S. Brin, R. Motwani, and J. Ullman. Scalable techniques for mining causal structures. VLDB'98.
P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the Right Interestingness Measure for Association Patterns. KDD'02.
E. Omiecinski. Alternative Interest Measures for Mining Associations. TKDE'03.
Y. K. Lee, W. Y. Kim, Y. D. Cai, and J. Han. CoMine: Efficient Mining of Correlated Patterns. ICDM'03.

Ref: Mining Other Kinds of Rules

R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining association rules. VLDB'96.
B. Lent, A. Swami, and J. Widom. Clustering association rules. ICDE'97.
A. Savasere, E. Omiecinski, and S. Navathe. Mining for strong negative associations in a large database of customer transactions. ICDE'98.
D. Tsur, J. D. Ullman, S. Abiteboul, C. Clifton, R. Motwani, and S. Nestorov. Query flocks: A generalization of association-rule mining. SIGMOD'98.
F. Korn, A. Labrinidis, Y. Kotidis, and C. Faloutsos. Ratio rules: A new paradigm for fast, quantifiable data mining. VLDB'98.
K. Wang, S. Zhou, and J. Han. Profit Mining: From Patterns to Actions. EDBT'02.
Ref: Constraint-Based Pattern Mining

R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item constraints. KDD'97.
R. Ng, L. V. S. Lakshmanan, J. Han, and A. Pang. Exploratory mining and pruning optimizations of constrained association rules. SIGMOD'98.
M. N. Garofalakis, R. Rastogi, and K. Shim. SPIRIT: Sequential Pattern Mining with Regular Expression Constraints. VLDB'99.
G. Grahne, L. Lakshmanan, and X. Wang. Efficient mining of constrained correlated sets. ICDE'00.
J. Pei, J. Han, and L. V. S. Lakshmanan. Mining Frequent Itemsets with Convertible Constraints. ICDE'01.
J. Pei, J. Han, and W. Wang. Mining Sequential Patterns with Constraints in Large Databases. CIKM'02.

Ref: Mining Sequential and Structured Patterns

R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT'96.
H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. DAMI:97.
M. Zaki. SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine Learning:01.
J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. ICDE'01.
M. Kuramochi and G. Karypis. Frequent Subgraph Discovery. ICDM'01.
X. Yan, J. Han, and R. Afshar. CloSpan: Mining Closed Sequential Patterns in Large Datasets. SDM'03.
X. Yan and J. Han. CloseGraph: Mining Closed Frequent Graph Patterns. KDD'03.
Ref: Mining Spatial, Multimedia, and Web Data

K. Koperski and J. Han. Discovery of Spatial Association Rules in Geographic Information Databases. SSD'95.
O. R. Zaiane, M. Xin, and J. Han. Discovering Web Access Patterns and Trends by Applying OLAP and Data Mining Technology on Web Logs. ADL'98.
O. R. Zaiane, J. Han, and H. Zhu. Mining Recurrent Items in Multimedia with Progressive Resolution Refinement. ICDE'00.
D. Gunopulos and I. Tsoukatos. Efficient Mining of Spatiotemporal Patterns. SSTD'01.

Ref: Mining Frequent Patterns in Time-Series Data

B. Ozden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules. ICDE'98.
J. Han, G. Dong, and Y. Yin. Efficient Mining of Partial Periodic Patterns in Time Series Database. ICDE'99.
H. Lu, L. Feng, and J. Han. Beyond Intra-Transaction Association Analysis: Mining Multi-Dimensional Inter-Transaction Association Rules. TOIS:00.
B.-K. Yi, N. Sidiropoulos, T. Johnson, H. V. Jagadish, C. Faloutsos, and A. Biliris. Online Data Mining for Co-Evolving Time Sequences. ICDE'00.
W. Wang, J. Yang, and R. Muntz. TAR: Temporal Association Rules on Evolving Numerical Attributes. ICDE'01.
J. Yang, W. Wang, and P. S. Yu. Mining Asynchronous Periodic Patterns in Time Series Data. TKDE'03.
Ref: Iceberg Cube and Cube Computation

S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S. Sarawagi. On the computation of multidimensional aggregates. VLDB'96.
Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous multidimensional aggregates. SIGMOD'97.
J. Gray, et al. Data cube: A relational aggregation operator generalizing group-by, cross-tab and sub-totals. DAMI:97.
M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ullman. Computing iceberg queries efficiently. VLDB'98.
S. Sarawagi, R. Agrawal, and N. Megiddo. Discovery-driven exploration of OLAP data cubes. EDBT'98.
K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and iceberg cubes. SIGMOD'99.

Ref: Iceberg Cube and Cube Exploration

J. Han, J. Pei, G. Dong, and K. Wang. Computing Iceberg Data Cubes with Complex Measures. SIGMOD'01.
W. Wang, H. Lu, J. Feng, and J. X. Yu. Condensed Cube: An Effective Approach to Reducing Data Cube Size. ICDE'02.
G. Dong, J. Han, J. Lam, J. Pei, and K. Wang. Mining Multi-Dimensional Constrained Gradients in Data Cubes. VLDB'01.
T. Imielinski, L. Khachiyan, and A. Abdulghani. Cubegrades: Generalizing association rules. DAMI:02.
L. V. S. Lakshmanan, J. Pei, and J. Han. Quotient Cube: How to Summarize the Semantics of a Data Cube. VLDB'02.
D. Xin, J. Han, X. Li, and B. W. Wah. Star-Cubing: Computing Iceberg Cubes by Top-Down and Bottom-Up Integration. VLDB'03.
Ref: FP for Classification and Clustering

G. Dong and J. Li. Efficient mining of emerging patterns: Discovering trends and differences. KDD'99.
B. Liu, W. Hsu, and Y. Ma. Integrating Classification and Association Rule Mining. KDD'98.
W. Li, J. Han, and J. Pei. CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules. ICDM'01.
H. Wang, W. Wang, J. Yang, and P. S. Yu. Clustering by pattern similarity in large data sets. SIGMOD'02.
J. Yang and W. Wang. CLUSEQ: Efficient and effective sequence clustering. ICDE'03.
B. Fung, K. Wang, and M. Ester. Large Hierarchical Document Clustering Using Frequent Itemsets. SDM'03.
X. Yin and J. Han. CPAR: Classification based on Predictive Association Rules. SDM'03.

Ref: Stream and Privacy-Preserving FP Mining

A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke. Privacy Preserving Mining of Association Rules. KDD'02.
J. Vaidya and C. Clifton. Privacy Preserving Association Rule Mining in Vertically Partitioned Data. KDD'02.
G. Manku and R. Motwani. Approximate Frequency Counts over Data Streams. VLDB'02.
Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang. Multi-Dimensional Regression Analysis of Time-Series Data Streams. VLDB'02.
C. Giannella, J. Han, J. Pei, X. Yan, and P. S. Yu. Mining Frequent Patterns in Data Streams at Multiple Time Granularities. Next Generation Data Mining:03.
A. Evfimievski, J. Gehrke, and R. Srikant. Limiting Privacy Breaches in Privacy Preserving Data Mining. PODS'03.
Ref: Other Freq. Pattern Mining Applications