05 - dbdm2007 - Data Mining


Data Mining:

Concepts and Techniques

— Chapter 5 —

Jiawei Han
Department of Computer Science
University of Illinois at Urbana-Champaign
www.cs.uiuc.edu/~hanj
©2006 Jiawei Han and Micheline Kamber, All rights reserved

October 22, 2007 Data Mining: Concepts and Techniques

Why Data Mining?

- The explosive growth of data: from terabytes to petabytes
- Data collection and data availability
  - Automated data collection tools, database systems, Web, computerized society
- Major sources of abundant data
  - Business: Web, e-commerce, transactions, stocks, …
  - Science: remote sensing, bioinformatics, scientific simulation, …
  - Society and everyone: news, digital cameras, …
- We are drowning in data, but starving for knowledge!
- "Necessity is the mother of invention": data mining, the automated analysis of massive data sets, arose naturally from the evolution of database technology

What Is Data Mining?

- Data mining (knowledge discovery from data)
  - Extraction of interesting (non-trivial, implicit, previously unknown and potentially useful) patterns or knowledge from huge amounts of data
  - Data mining: a misnomer?
- Alternative names
  - Knowledge discovery (mining) in databases (KDD), knowledge extraction, data/pattern analysis, data archeology, data dredging, information harvesting, business intelligence, etc.
- Watch out: is everything "data mining"?
  - Simple search and query processing
  - (Deductive) expert systems

Why Data Mining?—Potential Applications

- Data analysis and decision support
  - Market analysis and management
    - Target marketing, customer relationship management (CRM), market basket analysis, cross selling, market segmentation
  - Risk analysis and management
    - Forecasting, customer retention, improved underwriting, quality control, competitive analysis
  - Fraud detection and detection of unusual patterns (outliers)
- Other applications
  - Text mining (news group, email, documents) and Web mining
  - Stream data mining

Knowledge Discovery (KDD) Process

- Data mining—core of the knowledge discovery process
- [Figure] The KDD pipeline: Databases → Data Cleaning → Data Integration → Data Warehouse → Selection → Task-relevant Data → Data Mining → Pattern Evaluation

Data Mining and Business Intelligence

- [Figure] A pyramid of increasing potential to support business decisions, bottom to top:
  - Data sources: paper, files, Web documents, scientific experiments, database systems (DBA)
  - Data preprocessing/integration, data warehouses (DBA)
  - Data exploration: statistical summary, querying, and reporting (data analyst)
  - Data mining: information discovery (data analyst)
  - Data presentation: visualization techniques (business analyst)
  - Decision making (end user)

Data Mining: Confluence of Multiple Disciplines

- [Figure] Data mining draws on database technology, statistics, machine learning, pattern recognition, visualization, algorithms, and other disciplines

Why Not Traditional Data Analysis?

- Tremendous amount of data
  - Algorithms must be highly scalable to handle terabytes of data
- High dimensionality of data
  - Micro-arrays may have tens of thousands of dimensions
- High complexity of data
  - Data streams and sensor data
  - Time-series data, temporal data, sequence data
  - Structured data, graphs, social networks and multi-linked data
  - Heterogeneous databases and legacy databases
  - Spatial, spatiotemporal, multimedia, text and Web data
  - Software programs, scientific simulations
- New and sophisticated applications

Multi-Dimensional View of Data Mining

- Data to be mined
  - Relational, data warehouse, transactional, stream, object-oriented/relational, active, spatial, time-series, text, multimedia, heterogeneous, legacy, WWW
- Knowledge to be mined
  - Characterization, discrimination, association, classification, clustering, trend/deviation, outlier analysis, etc.
  - Multiple/integrated functions and mining at multiple levels
- Techniques utilized
  - Database-oriented, data warehouse (OLAP), machine learning, statistics, visualization, etc.
- Applications adapted
  - Retail, telecommunication, banking, fraud analysis, bio-data mining, stock market analysis, text mining, Web mining, etc.

Data Mining: Classification Schemes

- General functionality
  - Descriptive data mining
  - Predictive data mining
- Different views lead to different classifications
  - Data view: kinds of data to be mined
  - Knowledge view: kinds of knowledge to be discovered
  - Method view: kinds of techniques utilized
  - Application view: kinds of applications adapted

Data Mining: On What Kinds of Data?

- Database-oriented data sets and applications
  - Relational database, data warehouse, transactional database
- Advanced data sets and advanced applications
  - Data streams and sensor data
  - Time-series data, temporal data, sequence data (incl. bio-sequences)
  - Structured data, graphs, social networks and multi-linked data
  - Object-relational databases
  - Heterogeneous databases and legacy databases
  - Spatial data and spatiotemporal data
  - Multimedia databases
  - Text databases
  - The World-Wide Web

Data Mining Functionalities

- Multidimensional concept description: characterization and discrimination
  - Generalize, summarize, and contrast data characteristics, e.g., dry vs. wet regions
- Frequent patterns, association, correlation vs. causality
  - Diaper → Beer [0.5%, 75%] (correlation or causality?)
- Classification and prediction
  - Construct models (functions) that describe and distinguish classes or concepts for future prediction
  - E.g., classify countries based on climate, or classify cars based on gas mileage
  - Predict some unknown or missing numerical values

Data Mining Functionalities (2)

- Cluster analysis
  - Class label is unknown: group data to form new classes, e.g., cluster houses to find distribution patterns
  - Maximize intra-class similarity and minimize inter-class similarity
- Outlier analysis
  - Outlier: a data object that does not comply with the general behavior of the data
  - Noise or exception? Useful in fraud detection and rare-event analysis
- Trend and evolution analysis
  - Trend and deviation: e.g., regression analysis
  - Sequential pattern mining: e.g., digital camera → large SD memory
  - Periodicity analysis
  - Similarity-based analysis
- Other pattern-directed or statistical analyses

Are All the "Discovered" Patterns Interesting?

- Data mining may generate thousands of patterns: not all of them are interesting
  - Suggested approach: human-centered, query-based, focused mining
- Interestingness measures
  - A pattern is interesting if it is easily understood by humans, valid on new or test data with some degree of certainty, potentially useful, novel, or validates some hypothesis that a user seeks to confirm
- Objective vs. subjective interestingness measures
  - Objective: based on statistics and structures of patterns, e.g., support, confidence, etc.
  - Subjective: based on the user's belief in the data, e.g., unexpectedness, novelty, actionability, etc.

Find All and Only Interesting Patterns?

- Find all the interesting patterns: completeness
  - Can a data mining system find all the interesting patterns? Do we need to find all of the interesting patterns?
  - Heuristic vs. exhaustive search
  - Association vs. classification vs. clustering
- Search for only interesting patterns: an optimization problem
  - Can a data mining system find only the interesting patterns?
  - Approaches
    - First generate all the patterns and then filter out the uninteresting ones
    - Generate only the interesting patterns: mining query optimization

Other Pattern Mining Issues

- Precise patterns vs. approximate patterns
  - Association and correlation mining can find sets of precise patterns
  - But approximate patterns can be more compact and sufficient
  - How to find high-quality approximate patterns?
  - Gene sequence mining: approximate patterns are inherent
  - How to derive efficient approximate pattern mining algorithms?
- Constrained vs. non-constrained patterns
  - Why constraint-based mining?
  - What are the possible kinds of constraints? How to push constraints into the mining process?

Why Data Mining Query Language?

- Automated vs. query-driven?
  - Finding all the patterns autonomously in a database is unrealistic: the patterns could be too many yet uninteresting
- Data mining should be an interactive process
  - The user directs what is to be mined
- Users must be provided with a set of primitives to communicate with the data mining system
- Incorporating these primitives in a data mining query language
  - More flexible user interaction
  - Foundation for the design of graphical user interfaces
  - Standardization of the data mining industry and practice

Primitives that Define a Data Mining Task

- Task-relevant data
- Type of knowledge to be mined
- Background knowledge
- Pattern interestingness measurements
- Visualization/presentation of discovered patterns

Primitive 1: Task-Relevant Data

- Database or data warehouse name
- Database tables or data warehouse cubes
- Condition for data selection
- Relevant attributes or dimensions
- Data grouping criteria

Primitive 2: Types of Knowledge to Be Mined

- Characterization
- Discrimination
- Association
- Classification/prediction
- Clustering
- Outlier analysis
- Other data mining tasks

Primitive 3: Background Knowledge

- A typical kind of background knowledge: concept hierarchies
- Schema hierarchy
  - E.g., street < city < province_or_state < country
- Set-grouping hierarchy
  - E.g., {20-39} = young, {40-59} = middle_aged
- Operation-derived hierarchy
  - E.g., for the email address hagonzal@cs.uiuc.edu: login-name < department < university < country
- Rule-based hierarchy
  - E.g., low_profit_margin(X) <= price(X, P1) and cost(X, P2) and (P1 - P2) < $50

Primitive 4: Pattern Interestingness Measure

- Simplicity: e.g., (association) rule length, (decision) tree size
- Certainty: e.g., confidence, P(A|B) = #(A and B) / #(B), classification reliability or accuracy, certainty factor, rule strength, rule quality, discriminating weight, etc.
- Utility: potential usefulness, e.g., support (association), noise threshold (description)
- Novelty: not previously known, surprising (used to remove redundant rules, e.g., the Illinois vs. Champaign rule implication support ratio)

Primitive 5: Presentation of Discovered Patterns

- Different backgrounds/usages may require different forms of representation
  - E.g., rules, tables, crosstabs, pie/bar charts, etc.
- Concept hierarchies are also important
  - Discovered knowledge may be more understandable when represented at a high level of abstraction
  - Interactive drill up/down, pivoting, slicing and dicing provide different perspectives on the data
- Different kinds of knowledge require different representations: association, classification, clustering, etc.

DMQL—A Data Mining Query Language

- Motivation
  - A DMQL can provide the ability to support ad-hoc and interactive data mining
  - By providing a standardized language like SQL
    - Hope to achieve an effect similar to that which SQL has had on relational databases
    - Foundation for system development and evolution
    - Facilitate information exchange, technology transfer, commercialization and wide acceptance
- Design
  - DMQL is designed with the primitives described earlier

An Example Query in DMQL

Other Data Mining Languages & Standardization Efforts

- Association rule language specifications
  - MSQL (Imielinski & Virmani '99)
  - MineRule (Meo, Psaila and Ceri '96)
  - Query flocks based on Datalog syntax (Tsur et al. '98)
- OLE DB for DM (Microsoft 2000) and recently DMX (Microsoft SQL Server 2005)
  - Based on OLE, OLE DB, OLE DB for OLAP, C#
  - Integrating DBMS, data warehouse and data mining
- DMML (Data Mining Mark-up Language) by DMG (www.dmg.org)
  - Providing a platform and process structure for effective data mining
  - Emphasizing deploying data mining technology to solve business problems

Integration of Data Mining and Data Warehousing

- Coupling of data mining systems, DBMS, and data warehouse systems
  - No coupling, loose coupling, semi-tight coupling, tight coupling
- On-line analytical mining of data
  - Integration of mining and OLAP technologies
- Interactive mining of multi-level knowledge
  - Necessity of mining knowledge and patterns at different levels of abstraction by drilling/rolling, pivoting, slicing/dicing, etc.
- Integration of multiple mining functions
  - E.g., characterized classification, first clustering and then association

Coupling Data Mining with DB/DW Systems

- No coupling—flat file processing, not recommended
- Loose coupling
  - Fetching data from the DB/DW
- Semi-tight coupling—enhanced DM performance
  - Provide efficient implementations of a few data mining primitives in the DB/DW system, e.g., sorting, indexing, aggregation, histogram analysis, multiway join, precomputation of some statistical functions
- Tight coupling—a uniform information processing environment
  - DM is smoothly integrated into the DB/DW system; mining queries are optimized based on mining query analysis, indexing, query processing methods, etc.

Architecture: Typical Data Mining System

- [Figure] Layered architecture, bottom to top: data sources (database, data warehouse, World-Wide Web, other info repositories) feed, via data cleaning, integration, and selection, a database or data warehouse server; above it sit the data mining engine and pattern evaluation module, both consulting a knowledge base; a graphical user interface is on top

Major Issues in Data Mining

- Mining methodology
  - Mining different kinds of knowledge from diverse data types, e.g., bio, stream, Web
  - Performance: efficiency, effectiveness, and scalability
  - Pattern evaluation: the interestingness problem
  - Incorporation of background knowledge
  - Handling noise and incomplete data
  - Parallel, distributed and incremental mining methods
  - Integration of the discovered knowledge with existing knowledge: knowledge fusion
- User interaction
  - Data mining query languages and ad-hoc mining
  - Expression and visualization of data mining results
  - Interactive mining of knowledge at multiple levels of abstraction
- Applications and social impacts
  - Domain-specific data mining & invisible data mining

Summary

- Data mining: discovering interesting patterns from large amounts of data
- A natural evolution of database technology, in great demand, with wide applications
- A KDD process includes data cleaning, data integration, data selection, transformation, data mining, pattern evaluation, and knowledge presentation
- Mining can be performed in a variety of information repositories
- Data mining functionalities: characterization, discrimination, association, classification, clustering, outlier and trend analysis, etc.
- Data mining systems and architectures
- Major issues in data mining

A Brief History of Data Mining Society

- 1989 IJCAI Workshop on Knowledge Discovery in Databases
  - Knowledge Discovery in Databases (G. Piatetsky-Shapiro and W. Frawley, 1991)
- 1991-1994 Workshops on Knowledge Discovery in Databases
  - Advances in Knowledge Discovery and Data Mining (U. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy, 1996)
- 1995-1998 International Conferences on Knowledge Discovery in Databases and Data Mining (KDD'95-98)
  - Journal of Data Mining and Knowledge Discovery (1997)
- ACM SIGKDD conferences since 1998 and SIGKDD Explorations
- More conferences on data mining
  - PAKDD (1997), PKDD (1997), SIAM-Data Mining (2001), (IEEE) ICDM (2001), etc.
- ACM Transactions on KDD starting in 2007

Conferences and Journals on Data Mining

- KDD conferences
  - ACM SIGKDD Int. Conf. on Knowledge Discovery in Databases and Data Mining (KDD)
  - SIAM Data Mining Conf. (SDM)
  - (IEEE) Int. Conf. on Data Mining (ICDM)
  - Conf. on Principles and Practices of Knowledge Discovery and Data Mining (PKDD)
  - Pacific-Asia Conf. on Knowledge Discovery and Data Mining (PAKDD)
- Other related conferences
  - ACM SIGMOD
  - VLDB
  - (IEEE) ICDE
  - WWW, SIGIR
  - ICML, CVPR, NIPS
- Journals
  - Data Mining and Knowledge Discovery (DAMI or DMKD)
  - IEEE Trans. on Knowledge and Data Eng. (TKDE)
  - KDD Explorations
  - ACM Trans. on KDD

Where to Find References? DBLP, CiteSeer, Google

- Data mining and KDD (SIGKDD: CDROM)
  - Conferences: ACM-SIGKDD, IEEE-ICDM, SIAM-DM, PKDD, PAKDD, etc.
  - Journals: Data Mining and Knowledge Discovery, KDD Explorations, ACM TKDD
- Database systems (SIGMOD: ACM SIGMOD Anthology—CD ROM)
  - Conferences: ACM-SIGMOD, ACM-PODS, VLDB, IEEE-ICDE, EDBT, ICDT, DASFAA
  - Journals: IEEE-TKDE, ACM-TODS/TOIS, JIIS, J. ACM, VLDB J., Info. Sys., etc.
- AI & machine learning
  - Conferences: Machine Learning (ML), AAAI, IJCAI, COLT (Learning Theory), CVPR, NIPS, etc.
  - Journals: Machine Learning, Artificial Intelligence, Knowledge and Information Systems, IEEE-PAMI, etc.
- Web and IR
  - Conferences: SIGIR, WWW, CIKM, etc.
  - Journals: WWW: Internet and Web Information Systems
- Statistics
  - Conferences: Joint Stat. Meeting, etc.
  - Journals: Annals of Statistics, etc.
- Visualization
  - Conference proceedings: CHI, ACM-SIGGraph, etc.
  - Journals: IEEE Trans. Visualization and Computer Graphics, etc.

Recommended Reference Books

- S. Chakrabarti. Mining the Web: Statistical Analysis of Hypertext and Semi-Structured Data. Morgan Kaufmann, 2002
- R. O. Duda, P. E. Hart, and D. G. Stork. Pattern Classification, 2nd ed. Wiley-Interscience, 2000
- T. Dasu and T. Johnson. Exploratory Data Mining and Data Cleaning. John Wiley & Sons, 2003
- U. M. Fayyad, G. Piatetsky-Shapiro, P. Smyth, and R. Uthurusamy. Advances in Knowledge Discovery and Data Mining. AAAI/MIT Press, 1996
- U. Fayyad, G. Grinstein, and A. Wierse. Information Visualization in Data Mining and Knowledge Discovery. Morgan Kaufmann, 2001
- J. Han and M. Kamber. Data Mining: Concepts and Techniques, 2nd ed. Morgan Kaufmann, 2006
- D. J. Hand, H. Mannila, and P. Smyth. Principles of Data Mining. MIT Press, 2001
- T. Hastie, R. Tibshirani, and J. Friedman. The Elements of Statistical Learning: Data Mining, Inference, and Prediction. Springer-Verlag, 2001
- T. M. Mitchell. Machine Learning. McGraw-Hill, 1997
- G. Piatetsky-Shapiro and W. J. Frawley. Knowledge Discovery in Databases. AAAI/MIT Press, 1991
- P.-N. Tan, M. Steinbach, and V. Kumar. Introduction to Data Mining. Wiley, 2005
- S. M. Weiss and N. Indurkhya. Predictive Data Mining. Morgan Kaufmann, 1998
- I. H. Witten and E. Frank. Data Mining: Practical Machine Learning Tools and Techniques with Java Implementations, 2nd ed. Morgan Kaufmann, 2005

Chapter 5: Mining Frequent Patterns, Association and Correlations

- Basic concepts and a road map
- Efficient and scalable frequent itemset mining methods
- Mining various kinds of association rules
- From association mining to correlation analysis
- Constraint-based association mining
- Summary

What Is Frequent Pattern Analysis?

- Frequent pattern: a pattern (a set of items, subsequences, substructures, etc.) that occurs frequently in a data set
- First proposed by Agrawal, Imielinski, and Swami [AIS93] in the context of frequent itemsets and association rule mining
- Motivation: finding inherent regularities in data
  - What products were often purchased together?—Beer and diapers?!
  - What are the subsequent purchases after buying a PC?
  - What kinds of DNA are sensitive to this new drug?
  - Can we automatically classify web documents?
- Applications
  - Basket data analysis, cross-marketing, catalog design, sale campaign analysis, Web log (click stream) analysis, and DNA sequence analysis

Why Is Freq. Pattern Mining Important?

- Discloses an intrinsic and important property of data sets
- Forms the foundation for many essential data mining tasks
  - Association, correlation, and causality analysis
  - Sequential, structural (e.g., sub-graph) patterns
  - Pattern analysis in spatiotemporal, multimedia, time-series, and stream data
  - Classification: associative classification
  - Cluster analysis: frequent pattern-based clustering
  - Data warehousing: iceberg cube and cube-gradient
  - Semantic data compression: fascicles
  - Broad applications

Basic Concepts: Frequent Patterns and Association Rules

  Transaction-id   Items bought
  10               A, B, D
  20               A, C, D
  30               A, D, E
  40               B, E, F
  50               B, C, D, E, F

- Itemset X = {x1, …, xk}
- Find all the rules X → Y with minimum support and confidence
  - support, s: the probability that a transaction contains X ∪ Y
  - confidence, c: the conditional probability that a transaction having X also contains Y
- Let sup_min = 50%, conf_min = 50%
  - Frequent patterns: {A:3, B:3, D:4, E:3, AD:3}
  - Association rules: A → D (60%, 100%), D → A (60%, 75%)

Closed Patterns and Max-Patterns

- A long pattern contains a combinatorial number of sub-patterns; e.g., {a1, …, a100} contains (100 choose 1) + (100 choose 2) + … + (100 choose 100) = 2^100 − 1 ≈ 1.27 × 10^30 sub-patterns!
- Solution: mine closed patterns and max-patterns instead
- An itemset X is closed if X is frequent and there exists no super-pattern Y ⊃ X with the same support as X (proposed by Pasquier et al. @ ICDT'99)
- An itemset X is a max-pattern if X is frequent and there exists no frequent super-pattern Y ⊃ X (proposed by Bayardo @ SIGMOD'98)
- A closed pattern set is a lossless compression of the frequent patterns
  - Reduces the number of patterns and rules

Closed Patterns and Max-Patterns

- Exercise. DB = {<a1, …, a100>, <a1, …, a50>}, min_sup = 1
- What is the set of closed itemsets?
  - <a1, …, a100>: 1
  - <a1, …, a50>: 2
- What is the set of max-patterns?
  - <a1, …, a100>: 1
- What is the set of all patterns?
  - !!

Chapter 5: Mining Frequent Patterns, Association and Correlations

- Basic concepts and a road map
- Efficient and scalable frequent itemset mining methods
- Mining various kinds of association rules
- From association mining to correlation analysis
- Constraint-based association mining
- Summary
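The closed/max-pattern definitions can be checked by brute force on a scaled-down stand-in for the exercise (a1, …, a100 is far too large to enumerate, so two items here play the role of the shared prefix and the extra suffix; the variable names are ours):

```python
from itertools import combinations

# Two transactions, mirroring the exercise's <a1..a100> and <a1..a50>.
db = [frozenset("abc"), frozenset("ab")]
min_sup = 1

def support(itemset):
    return sum(itemset <= t for t in db)

# Enumerate every frequent itemset with its support.
items = sorted(set().union(*db))
frequent = {}
for k in range(1, len(items) + 1):
    for combo in combinations(items, k):
        s = support(frozenset(combo))
        if s >= min_sup:
            frequent[frozenset(combo)] = s

# X is closed iff no proper superset has the same support;
# X is maximal iff no proper frequent superset exists at all.
closed = {x for x, s in frequent.items()
          if not any(x < y and frequent[y] == s for y in frequent)}
maximal = {x for x in frequent if not any(x < y for y in frequent)}

print(sorted("".join(sorted(x)) for x in closed))   # ['ab', 'abc']
print(sorted("".join(sorted(x)) for x in maximal))  # ['abc']
```

As in the exercise, the closed itemsets are exactly the two transactions themselves, while only the longer one is maximal.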

Scalable Methods for Mining Frequent Patterns

- The downward closure property of frequent patterns
  - Any subset of a frequent itemset must be frequent
  - If {beer, diaper, nuts} is frequent, so is {beer, diaper}
  - i.e., every transaction having {beer, diaper, nuts} also contains {beer, diaper}
- Scalable mining methods: three major approaches
  - Apriori (Agrawal & Srikant @ VLDB'94)
  - Frequent pattern growth (FPgrowth—Han, Pei & Yin @ SIGMOD'00)
  - Vertical data format approach (Charm—Zaki & Hsiao @ SDM'02)

Apriori: A Candidate Generation-and-Test Approach

- Apriori pruning principle: if there is any itemset which is infrequent, its superset should not be generated/tested! (Agrawal & Srikant @ VLDB'94; Mannila et al. @ KDD'94)
- Method:
  - Initially, scan the DB once to get the frequent 1-itemsets
  - Generate length-(k+1) candidate itemsets from length-k frequent itemsets
  - Test the candidates against the DB
  - Terminate when no frequent or candidate set can be generated

The Apriori Algorithm—An Example

sup_min = 2

  Database TDB:
  Tid  Items
  10   A, C, D
  20   B, C, E
  30   A, B, C, E
  40   B, E

- 1st scan: C1 = {A}:2, {B}:3, {C}:3, {D}:1, {E}:3 → L1 = {A}:2, {B}:3, {C}:3, {E}:3
- C2 from L1: {A,B}, {A,C}, {A,E}, {B,C}, {B,E}, {C,E}
- 2nd scan: {A,B}:1, {A,C}:2, {A,E}:1, {B,C}:2, {B,E}:3, {C,E}:2 → L2 = {A,C}:2, {B,C}:2, {B,E}:3, {C,E}:2
- C3 from L2: {B,C,E}
- 3rd scan: L3 = {B,C,E}:2

The Apriori Algorithm

- Pseudo-code:

  Ck: candidate itemset of size k
  Lk: frequent itemset of size k

  L1 = {frequent items};
  for (k = 1; Lk != ∅; k++) do begin
      Ck+1 = candidates generated from Lk;
      for each transaction t in database do
          increment the count of all candidates in Ck+1 that are contained in t;
      Lk+1 = candidates in Ck+1 with min_support;
  end
  return ∪k Lk;
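The pseudo-code above translates almost line for line into a runnable sketch; run on the example database TDB it reproduces L1, L2, and L3. This is a plain level-wise implementation for illustration (function and variable names are ours), not an optimized one:

```python
from itertools import combinations

def apriori(db, min_sup):
    """Level-wise Apriori over a list of transaction frozensets.
    Returns a dict mapping each frequent itemset to its support count."""
    def count(candidates):
        counts = {c: 0 for c in candidates}
        for t in db:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        return {c: n for c, n in counts.items() if n >= min_sup}

    # L1: frequent 1-itemsets from one scan of the database.
    L = count({frozenset([i]) for t in db for i in t})
    result = dict(L)
    k = 1
    while L:
        # Self-join Lk to form C(k+1), then prune any candidate with an
        # infrequent k-subset (downward closure).
        prev = set(L)
        candidates = {a | b for a in prev for b in prev if len(a | b) == k + 1}
        candidates = {c for c in candidates
                      if all(frozenset(s) in prev for s in combinations(c, k))}
        L = count(candidates)
        result.update(L)
        k += 1
    return result

tdb = [{"A", "C", "D"}, {"B", "C", "E"}, {"A", "B", "C", "E"}, {"B", "E"}]
freq = apriori([frozenset(t) for t in tdb], min_sup=2)
print(freq[frozenset("BCE")])  # 2 -- the L3 entry from the example
```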

Important Details of Apriori

- How to generate candidates?
  - Step 1: self-joining Lk
  - Step 2: pruning
- How to count supports of candidates?
- Example of candidate generation
  - L3 = {abc, abd, acd, ace, bcd}
  - Self-joining: L3 * L3
    - abcd from abc and abd
    - acde from acd and ace
  - Pruning:
    - acde is removed because ade is not in L3
  - C4 = {abcd}

How to Generate Candidates?

- Suppose the items in Lk-1 are listed in an order
- Step 1: self-joining Lk-1

  insert into Ck
  select p.item1, p.item2, …, p.itemk-1, q.itemk-1
  from Lk-1 p, Lk-1 q
  where p.item1 = q.item1, …, p.itemk-2 = q.itemk-2, p.itemk-1 < q.itemk-1

- Step 2: pruning

  forall itemsets c in Ck do
      forall (k-1)-subsets s of c do
          if (s is not in Lk-1) then delete c from Ck
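The ordered self-join plus prune can be sketched directly, reproducing the L3 example: abcd survives, acde is pruned because ade is not in L3 (the function name is ours):

```python
from itertools import combinations

def gen_candidates(L_prev, k):
    """Apriori-gen: self-join (k-1)-itemsets that share a (k-2)-prefix,
    then prune candidates with an infrequent (k-1)-subset."""
    L_prev = {tuple(sorted(x)) for x in L_prev}
    joined = set()
    for p in L_prev:
        for q in L_prev:
            # Join condition: equal on the first k-2 items, p's last < q's last.
            if p[:-1] == q[:-1] and p[-1] < q[-1]:
                joined.add(p + (q[-1],))
    # Prune: every (k-1)-subset of a candidate must itself be frequent.
    return {c for c in joined
            if all(s in L_prev for s in combinations(c, k - 1))}

L3 = [("a","b","c"), ("a","b","d"), ("a","c","d"), ("a","c","e"), ("b","c","d")]
C4 = gen_candidates(L3, 4)
print(C4)  # {('a', 'b', 'c', 'd')} -- acde was pruned (ade not in L3)
```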

How to Count Supports of Candidates?

- Why is counting supports of candidates a problem?
  - The total number of candidates can be very huge
  - One transaction may contain many candidates
- Method:
  - Candidate itemsets are stored in a hash-tree
  - A leaf node of the hash-tree contains a list of itemsets and counts
  - An interior node contains a hash table
  - Subset function: finds all the candidates contained in a transaction

Example: Counting Supports of Candidates

- [Figure] A hash-tree over candidate 3-itemsets, with interior nodes hashing items into branches {1,4,7}, {2,5,8}, {3,6,9} and leaves holding candidates such as 234, 567, 145, 136, 345, 356, 367, 357, 368, 124, 125, 159, 457, 458, 689; the subset function matches transaction 1 2 3 5 6 by recursively expanding prefixes (1+2356, 12+356, 13+56, …)

Efficient Implementation of Apriori in SQL

- Hard to get good performance out of pure SQL (SQL-92) based approaches alone
- Make use of object-relational extensions like UDFs, BLOBs, table functions, etc.
  - Get orders-of-magnitude improvement
- S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. In SIGMOD'98

Challenges of Frequent Pattern Mining

- Challenges
  - Multiple scans of the transaction database
  - Huge number of candidates
  - Tedious workload of support counting for candidates
- Improving Apriori: general ideas
  - Reduce passes of transaction database scans
  - Shrink the number of candidates
  - Facilitate support counting of candidates

Partition: Scan Database Only Twice

- Any itemset that is potentially frequent in DB must be frequent in at least one of the partitions of DB
  - Scan 1: partition the database and find local frequent patterns
  - Scan 2: consolidate global frequent patterns
- A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. In VLDB'95

DHP: Reduce the Number of Candidates

- A k-itemset whose corresponding hashing bucket count is below the threshold cannot be frequent
  - Candidates: a, b, c, d, e
  - Hash entries: {ab, ad, ae}, {bd, be, de}, …
  - Frequent 1-itemsets: a, b, d, e
  - ab is not a candidate 2-itemset if the sum of the counts of {ab, ad, ae} is below the support threshold
- J. Park, M. Chen, and P. Yu. An effective hash-based algorithm for mining association rules. In SIGMOD'95
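The DHP bucket test can be sketched on toy data. The transactions, min_sup, and the bucket hash below are illustrative choices of ours, not the ones from the paper; the key invariant is that a pair's own count never exceeds its bucket's total, so the filter can only discard pairs, never a truly frequent one:

```python
from itertools import combinations
from collections import Counter

db = [{"a", "b", "d"}, {"a", "b", "e"}, {"b", "d", "e"}, {"a", "c"}]
min_sup = 2
n_buckets = 3

def bucket(pair):
    # A deterministic toy hash over a sorted item pair.
    return (ord(pair[0]) * 7 + ord(pair[1])) % n_buckets

# While scanning for 1-itemsets, also hash every 2-itemset of each
# transaction into a bucket and accumulate bucket totals.
bucket_counts = Counter()
for t in db:
    for pair in combinations(sorted(t), 2):
        bucket_counts[bucket(pair)] += 1

freq1 = {i for i in "abcde" if sum(i in t for t in db) >= min_sup}

# DHP filter: keep a pair as a candidate only if its bucket total
# reaches min_sup.
candidates = {p for p in combinations(sorted(freq1), 2)
              if bucket_counts[bucket(p)] >= min_sup}

true_freq2 = {p for p in combinations(sorted(freq1), 2)
              if sum(set(p) <= t for t in db) >= min_sup}
print(true_freq2 <= candidates)  # True: the bucket test never drops a frequent pair
```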

Sampling for Frequent Patterns

- Select a sample of the original database; mine frequent patterns within the sample using Apriori
- Scan the database once to verify the frequent itemsets found in the sample; only the borders of the closure of the frequent patterns are checked
  - Example: check abcd instead of ab, ac, …, etc.
- Scan the database again to find missed frequent patterns
- H. Toivonen. Sampling large databases for association rules. In VLDB'96

DIC: Reduce Number of Scans

- Once both A and D are determined frequent, the counting of AD begins
- Once all length-2 subsets of BCD are determined frequent, the counting of BCD begins
- [Figure] The itemset lattice over {A, B, C, D}: Apriori counts one level of the lattice (1-itemsets, then 2-itemsets, …) per scan, while DIC starts counting an itemset as soon as all of its subsets are known to be frequent
- S. Brin, R. Motwani, J. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket data. In SIGMOD'97

Bottleneck of Frequent-pattern Mining

- Multiple database scans are costly
- Mining long patterns needs many passes of scanning and generates lots of candidates
  - To find the frequent itemset i1i2…i100:
    - Number of scans: 100
    - Number of candidates: (100 choose 1) + (100 choose 2) + … + (100 choose 100) = 2^100 − 1 ≈ 1.27 × 10^30!
- Bottleneck: candidate generation and test
- Can we avoid candidate generation?

Mining Frequent Patterns Without Candidate Generation

- Grow long patterns from short ones using local frequent items
  - "abc" is a frequent pattern
  - Get all transactions having "abc": DB|abc
  - "d" is a local frequent item in DB|abc → abcd is a frequent pattern

Construct FP-tree from a Transaction Database

min_support = 3

  TID  Items bought               (ordered) frequent items
  100  {f, a, c, d, g, i, m, p}   {f, c, a, m, p}
  200  {a, b, c, f, l, m, o}      {f, c, a, b, m}
  300  {b, f, h, j, o, w}         {f, b}
  400  {b, c, k, s, p}            {c, b, p}
  500  {a, f, c, e, l, p, m, n}   {f, c, a, m, p}

1. Scan DB once, find the frequent 1-itemsets (single-item patterns): f:4, c:4, a:3, b:3, m:3, p:3
2. Sort frequent items in frequency-descending order: F-list = f-c-a-b-m-p
3. Scan DB again, construct the FP-tree

[Figure] The resulting FP-tree: a root {} with a main path f:4 → c:3 → a:3 → m:2 → p:2, side branches a → b:1 → m:1 and f → b:1, a second path c:1 → b:1 → p:1, and a header table linking all nodes of each item

Benefits of the FP-tree Structure

- Completeness
  - Preserves complete information for frequent pattern mining
  - Never breaks a long pattern of any transaction
- Compactness
  - Reduces irrelevant info: infrequent items are gone
  - Items in frequency-descending order: the more frequently occurring, the more likely to be shared
  - Never larger than the original database (not counting node-links and the count field)
  - For the Connect-4 DB, the compression ratio can be over 100

Partition Patterns and Databases

• Frequent patterns can be partitioned into subsets according to the f-list
  • F-list = f-c-a-b-m-p
  • Patterns containing p
  • Patterns having m but no p
  • …
  • Patterns having c but no a, b, m, or p
  • Pattern f
• Completeness and non-redundancy

Find Patterns Having P From P-conditional Database

• Starting at the frequent-item header table in the FP-tree
• Traverse the FP-tree by following the link of each frequent item p
• Accumulate all of the transformed prefix paths of item p to form p's conditional pattern base

Conditional pattern bases:
item  cond. pattern base
c     f:3
a     fc:3
b     fca:1, f:1, c:1
m     fca:2, fcab:1
p     fcam:2, cb:1

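The two DB scans and the conditional pattern bases above can be sketched in a few dozen lines. This is a minimal illustration on the slide's example (class and function names such as `Node` and `cond_pattern_base` are made up for this sketch, not from any library; the f-list tie-breaks follow the slide):

```python
from collections import defaultdict

# A minimal FP-tree sketch for the slide's example (min_support = 3).

class Node:
    def __init__(self, item, parent):
        self.item, self.parent, self.count = item, parent, 0
        self.children = {}

transactions = [
    {'f', 'a', 'c', 'd', 'g', 'i', 'm', 'p'},
    {'a', 'b', 'c', 'f', 'l', 'm', 'o'},
    {'b', 'f', 'h', 'j', 'o', 'w'},
    {'b', 'c', 'k', 's', 'p'},
    {'a', 'f', 'c', 'e', 'l', 'p', 'm', 'n'},
]
# f-list from pass 1 (frequency-descending; ties broken as on the slide)
flist = ['f', 'c', 'a', 'b', 'm', 'p']

# Pass 2: insert each transaction's frequent items in f-list order.
root = Node(None, None)
header = defaultdict(list)            # item -> node-links
for t in transactions:
    node = root
    for item in (i for i in flist if i in t):
        child = node.children.get(item)
        if child is None:
            child = node.children[item] = Node(item, node)
            header[item].append(child)
        child.count += 1
        node = child

def cond_pattern_base(item):
    """Prefix path (with count) leading to each node of `item`."""
    base = []
    for node in header[item]:
        path, p = [], node.parent
        while p.item is not None:
            path.append(p.item)
            p = p.parent
        if path:
            base.append((frozenset(path), node.count))
    return base
```

Running `cond_pattern_base('m')` reproduces the slide's m-conditional pattern base fca:2, fcab:1.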
From Conditional Pattern-bases to Conditional FP-trees

• For each pattern-base
  • Accumulate the count for each item in the base
  • Construct the FP-tree for the frequent items of the pattern base
• m-conditional pattern base: fca:2, fcab:1
• m-conditional FP-tree: {} → f:3 → c:3 → a:3
• All frequent patterns relating to m: m, fm, cm, am, fcm, fam, cam, fcam

Recursion: Mining Each Conditional FP-tree

• Cond. pattern base of "am": (fc:3); am-conditional FP-tree: {} → f:3 → c:3
• Cond. pattern base of "cm": (f:3); cm-conditional FP-tree: {} → f:3
• Cond. pattern base of "cam": (f:3); cam-conditional FP-tree: {} → f:3

A Special Case: Single Prefix Path in FP-tree

• Suppose a (conditional) FP-tree T has a shared single prefix path P
• Mining can be decomposed into two parts
  • Reduction of the single prefix path into one node
  • Concatenation of the mining results of the two parts
[Figure: a tree with prefix path a1:n1 → a2:n2 → a3:n3 branching into subtrees b1:m1, C1:k1, C2:k2, C3:k3 is split into the single path r1 = a1:n1 → a2:n2 → a3:n3 plus the branching part]

Mining Frequent Patterns With FP-trees

• Idea: frequent-pattern growth
  • Recursively grow frequent patterns by pattern and database partition
• Method
  • For each frequent item, construct its conditional pattern-base, and then its conditional FP-tree
  • Repeat the process on each newly created conditional FP-tree
  • Until the resulting FP-tree is empty, or it contains only one path—a single path will generate all the combinations of its sub-paths, each of which is a frequent pattern
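The recursion above can be sketched without an explicit tree by recursing on conditional (projected) databases, which is the same pattern-growth idea in its simplest form. A minimal sketch on the slide's example (the function name `pattern_growth` is illustrative; transactions are assumed pre-sorted in f-list order f-c-a-b-m-p):

```python
from collections import Counter

# Projection-based pattern growth, min_support = 3: for each frequent
# item, recurse on its conditional database (prefixes before that item).

db = [['f', 'c', 'a', 'm', 'p'], ['f', 'c', 'a', 'b', 'm'], ['f', 'b'],
      ['c', 'b', 'p'], ['f', 'c', 'a', 'm', 'p']]

def pattern_growth(db, suffix, min_sup, out):
    counts = Counter(i for t in db for i in t)
    for item, sup in counts.items():
        if sup < min_sup:
            continue
        out[frozenset({item, *suffix})] = sup
        # conditional DB of `item`: the prefix of each transaction before it
        cond = [t[:t.index(item)] for t in db if item in t]
        pattern_growth([t for t in cond if t], {item, *suffix}, min_sup, out)

patterns = {}
pattern_growth(db, set(), 3, patterns)
```

The result contains exactly the slide's m-related patterns (m, fm, cm, am, fcm, fam, cam, fcam, each with support 3), while bm is absent because b is not locally frequent in DB|m.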
Scaling FP-growth by DB Projection

• FP-tree cannot fit in memory?—DB projection
• First partition a database into a set of projected DBs
• Then construct and mine an FP-tree for each projected DB
• Parallel projection vs. partition projection techniques
  • Parallel projection is space-costly

Partition-based Projection

• Parallel projection needs a lot of disk space
• Partition projection saves it

Tran. DB: fcamp, fcabm, fb, cbp, fcamp
p-proj DB: fcam, cb, fcam
m-proj DB: fcab, fca, fca
b-proj DB: f, cb, …
a-proj DB: fc, …
c-proj DB: f, …
f-proj DB: …
am-proj DB: fc, fc, fc
cm-proj DB: f, f, f

FP-Growth vs. Apriori: Scalability With the Support Threshold

[Figure: run time (sec.) vs. support threshold (%) on data set T25I20D10K — D1 FP-growth runtime stays low while D1 Apriori runtime grows sharply as the threshold drops toward 0]

FP-Growth vs. Tree-Projection: Scalability with the Support Threshold

[Figure: runtime (sec.) vs. support threshold (%) on data set T25I20D100K — D2 FP-growth scales better than D2 TreeProjection at low thresholds]
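The space saving of partition projection comes from writing each ordered transaction into only one projected DB—the one for its last item—rather than into the projected DB of every item it contains. A small sketch of that first projection step on the slide's data (the dict name `proj` is illustrative; the shorter projections such as m's fca entries are produced later, while mining the p partition):

```python
from collections import defaultdict

# Partition-based projection: each (f-list-ordered) transaction goes
# only to the partition of its last item; the prefix is what is stored.

db = [['f', 'c', 'a', 'm', 'p'], ['f', 'c', 'a', 'b', 'm'], ['f', 'b'],
      ['c', 'b', 'p'], ['f', 'c', 'a', 'm', 'p']]

proj = defaultdict(list)
for t in db:
    proj[t[-1]].append(t[:-1])   # prefix goes to the last item's partition
```

Compare with parallel projection, which would store every transaction containing an item in that item's projected DB, duplicating each transaction up to once per item.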
Why Is FP-Growth the Winner?

• Divide-and-conquer:
  • Decompose both the mining task and the DB according to the frequent patterns obtained so far
  • Leads to focused search of smaller databases
• Other factors
  • No candidate generation, no candidate test
  • Compressed database: FP-tree structure
  • No repeated scan of the entire database
  • Basic ops—counting local frequent items and building sub-FP-trees; no pattern search and matching

Implications of the Methodology

• Mining closed frequent itemsets and max-patterns
  • CLOSET (DMKD'00)
• Mining sequential patterns
  • FreeSpan (KDD'00), PrefixSpan (ICDE'01)
• Constraint-based mining of frequent patterns
  • Convertible constraints (KDD'00, ICDE'01)
• Computing iceberg data cubes with complex measures
  • H-tree and H-cubing algorithm (SIGMOD'01)

MaxMiner: Mining Max-patterns

Tid  Items
10   A, B, C, D, E
20   B, C, D, E
30   A, C, D, F

• 1st scan: find frequent items: A, B, C, D, E
• 2nd scan: find support for
  • AB, AC, AD, AE, ABCDE
  • BC, BD, BE, BCDE
  • CD, CE, CDE, DE          (potential max-patterns)
• Since BCDE is a max-pattern, there is no need to check BCD, BDE, CDE in a later scan
• R. Bayardo. Efficiently mining long patterns from databases. In SIGMOD'98

Mining Frequent Closed Patterns: CLOSET

• Flist: list of all frequent items in support-ascending order
  • Flist: d-a-f-e-c (min_sup = 2)

TID  Items
10   a, c, d, e, f
20   a, b, e
30   c, e, f
40   a, c, d, f
50   c, e, f

• Divide the search space
  • Patterns having d
  • Patterns having d but no a, etc.
• Find frequent closed patterns recursively
  • Every transaction having d also has cfa → cfad is a frequent closed pattern
• J. Pei, J. Han & R. Mao. CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets. DMKD'00.
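The max-pattern and closed-pattern definitions behind MaxMiner and CLOSET are easy to state as predicates. A toy illustration (the itemsets and supports below are invented for the example, not taken from the slides):

```python
# Closed vs. max frequent itemsets: closed = no proper superset with the
# same support; max = no proper frequent superset at all.

freq = {
    frozenset('A'): 3, frozenset('B'): 3, frozenset('AB'): 3,
    frozenset('C'): 2, frozenset('AC'): 2, frozenset('ABC'): 2,
}

def is_closed(x):
    return all(not x < y or freq[y] < freq[x] for y in freq)

def is_max(x):
    return all(not x < y for y in freq)

closed = {x for x in freq if is_closed(x)}
maxpat = {x for x in freq if is_max(x)}
```

Here only AB and ABC are closed, and only ABC is max—so every max-pattern is closed, but not vice versa.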
CLOSET+: Mining Closed Itemsets by Pattern-Growth

• Itemset merging: if Y appears in every occurrence of X, then Y is merged with X
• Sub-itemset pruning: if Y ⊃ X and sup(X) = sup(Y), X and all of X's descendants in the set-enumeration tree can be pruned
• Hybrid tree projection
  • Bottom-up physical tree-projection
  • Top-down pseudo tree-projection
• Item skipping: if a local frequent item has the same support in several header tables at different levels, one can prune it from the header tables at higher levels
• Efficient subset checking

CHARM: Mining by Exploring Vertical Data Format

• Vertical format: t(AB) = {T11, T25, …}
  • tid-list: list of trans.-ids containing an itemset
• Deriving closed patterns based on vertical intersections
  • t(X) = t(Y): X and Y always happen together
  • t(X) ⊂ t(Y): a transaction having X always has Y
• Using diffsets to accelerate mining
  • Only keep track of differences of tids
  • t(X) = {T1, T2, T3}, t(XY) = {T1, T3}
  • Diffset(XY, X) = {T2}
• Eclat/MaxEclat (Zaki et al. @KDD'97), VIPER (P. Shenoy et al. @SIGMOD'00), CHARM (Zaki & Hsiao @SDM'02)
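The tid-list and diffset operations above map directly onto set operations. A minimal sketch using the slide's tids (the dict name `t` is illustrative):

```python
# Vertical data format: tid-lists and diffsets (Eclat/CHARM style).
# t(X) = {T1, T2, T3}, t(Y) chosen so that t(XY) = {T1, T3}.

t = {
    'X': {1, 2, 3},
    'Y': {1, 3, 4},
}

t_XY = t['X'] & t['Y']            # support(XY) = |t(X) ∩ t(Y)|
diffset_XY_X = t['X'] - t_XY      # Diffset(XY, X): tids of X lacking Y

# support can be recovered from the parent's support minus the diffset size
assert len(t_XY) == len(t['X']) - len(diffset_XY_X)
```

Diffsets pay off because in dense data t(XY) is nearly as large as t(X), while Diffset(XY, X) is tiny.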

Further Improvements of Mining Methods

• AFOPT (Liu, et al. @ KDD'03)
  • A "push-right" method for mining condensed frequent-pattern (CFP) trees
• Carpenter (Pan, et al. @ KDD'03)
  • Mines data sets with a small number of rows but numerous columns
  • Constructs a row-enumeration tree for efficient mining

Visualization of Association Rules: Plane Graph
[Figure: plane-graph visualization of association rules]

Visualization of Association Rules: Rule Graph
[Figure: rule-graph visualization of association rules]

Visualization of Association Rules (SGI/MineSet 3.0)
[Figure: SGI/MineSet 3.0 association-rule visualization]

Chapter 5: Mining Frequent Patterns, Associations and Correlations

• Basic concepts and a road map
• Efficient and scalable frequent itemset mining methods
• Mining various kinds of association rules
• From association mining to correlation analysis
• Constraint-based association mining
• Summary

Mining Various Kinds of Association Rules

• Mining multilevel association
• Mining multidimensional association
• Mining quantitative association
• Mining interesting correlation patterns
Mining Multiple-Level Association Rules

• Items often form hierarchies
• Flexible support settings
  • Items at the lower levels are expected to have lower support
• Exploration of shared multi-level mining (Agrawal & Srikant @VLDB'95, Han & Fu @VLDB'95)

Uniform support: Level 1 min_sup = 5%; Level 2 min_sup = 5%
Reduced support: Level 1 min_sup = 5%; Level 2 min_sup = 3%
Example: Milk [support = 10%]; 2% Milk [support = 6%]; Skim Milk [support = 4%]

Multi-level Association: Redundancy Filtering

• Some rules may be redundant due to "ancestor" relationships between items
• Example
  • milk ⇒ wheat bread [support = 8%, confidence = 70%]
  • 2% milk ⇒ wheat bread [support = 2%, confidence = 72%]
• We say the first rule is an ancestor of the second rule
• A rule is redundant if its support is close to the "expected" value, based on the rule's ancestor

Mining Multi-Dimensional Association

• Single-dimensional rules:
  buys(X, "milk") ⇒ buys(X, "bread")
• Multi-dimensional rules: ≥ 2 dimensions or predicates
  • Inter-dimension assoc. rules (no repeated predicates)
    age(X, "19-25") ∧ occupation(X, "student") ⇒ buys(X, "coke")
  • Hybrid-dimension assoc. rules (repeated predicates)
    age(X, "19-25") ∧ buys(X, "popcorn") ⇒ buys(X, "coke")
• Categorical attributes: finite number of possible values, no ordering among values—data cube approach
• Quantitative attributes: numeric, implicit ordering among values—discretization, clustering, and gradient approaches

Mining Quantitative Associations

Techniques can be categorized by how numerical attributes, such as age or salary, are treated:
1. Static discretization based on predefined concept hierarchies (data cube methods)
2. Dynamic discretization based on data distribution (quantitative rules, e.g., Agrawal & Srikant @SIGMOD'96)
3. Clustering: distance-based association (e.g., Yang & Miller @SIGMOD'97)
   • One-dimensional clustering, then association
4. Deviation (such as Aumann and Lindell @KDD'99)
   Sex = female ⇒ Wage: mean = $7/hr (overall mean = $9)
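The redundancy-filtering idea above can be made concrete with a one-line expectation. In the hedged example below, the assumption that 2% milk accounts for a quarter of milk purchases is an illustrative figure, not given on the slide:

```python
# Expected support of a descendant rule = ancestor rule's support times
# the descendant item's share of the ancestor item's occurrences.

def expected_support(ancestor_rule_sup, child_share):
    return ancestor_rule_sup * child_share

# milk => wheat bread has support 8%; if 2% milk were a quarter of milk
# purchases (assumed), the expected support of "2% milk => wheat bread"
# would be 8% * 0.25 = 2% -- matching its actual 2%, so by this
# criterion the descendant rule is flagged as redundant.
exp = expected_support(0.08, 0.25)
```

Under this criterion a descendant rule is kept only when its actual support (or confidence) deviates noticeably from the expectation derived from its ancestor.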
Static Discretization of Quantitative Attributes

• Discretized prior to mining using concept hierarchies
• Numeric values are replaced by ranges
• In a relational database, finding all frequent k-predicate sets will require k or k+1 table scans
• A data cube is well suited for mining
  • The cells of an n-dimensional cuboid correspond to the predicate sets
  • Mining from data cubes can be much faster
[Figure: cuboid lattice — (), (age), (income), (buys), (age, income), (age, buys), (income, buys), (age, income, buys)]

Quantitative Association Rules

• Proposed by Lent, Swami and Widom, ICDE'97
• Numeric attributes are dynamically discretized
  • Such that the confidence or compactness of the rules mined is maximized
• 2-D quantitative association rules: Aquan1 ∧ Aquan2 ⇒ Acat
• Cluster adjacent association rules to form general rules using a 2-D grid
• Example:
  age(X, "34-35") ∧ income(X, "30-50K") ⇒ buys(X, "high resolution TV")

Mining Other Interesting Patterns

• Flexible support constraints (Wang et al. @ VLDB'02)
  • Some items (e.g., diamond) may occur rarely but are valuable
  • Customized sup_min specification and application
• Top-K closed frequent patterns (Han, et al. @ ICDM'02)
  • Hard to specify sup_min, but top-k with length_min is more desirable
  • Dynamically raise sup_min in FP-tree construction and mining, and select the most promising path to mine

Chapter 5: Mining Frequent Patterns, Associations and Correlations

• Basic concepts and a road map
• Efficient and scalable frequent itemset mining methods
• Mining various kinds of association rules
• From association mining to correlation analysis
• Constraint-based association mining
• Summary
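Static discretization (method 1 on the earlier slide) is the simplest of these treatments: numeric values are replaced by predefined range labels before any mining starts. A minimal sketch (the bin boundaries and label strings are illustrative):

```python
# Static discretization: map a numeric attribute onto predefined ranges
# from a concept hierarchy, so standard itemset mining can treat the
# labels as ordinary categorical items.

def discretize_age(age):
    bins = [(0, 18, 'age:0-18'), (19, 25, 'age:19-25'),
            (26, 40, 'age:26-40'), (41, 200, 'age:41+')]
    for lo, hi, label in bins:
        if lo <= age <= hi:
            return label
    raise ValueError(f"age out of range: {age}")
```

Dynamic discretization differs in that the bin boundaries are chosen from the data during mining rather than fixed up front.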
Interestingness Measure: Correlations (Lift)

• play basketball ⇒ eat cereal [40%, 66.7%] is misleading
  • The overall % of students eating cereal is 75% > 66.7%
• play basketball ⇒ not eat cereal [20%, 33.3%] is more accurate, although with lower support and confidence
• Measure of dependent/correlated events: lift

  lift = P(A∪B) / (P(A) P(B))

             Basketball   Not basketball   Sum (row)
Cereal       2000         1750             3750
Not cereal   1000         250              1250
Sum (col.)   3000         2000             5000

  lift(B, C)  = (2000/5000) / ((3000/5000) × (3750/5000)) = 0.89
  lift(B, ¬C) = (1000/5000) / ((3000/5000) × (1250/5000)) = 1.33

Are lift and χ2 Good Measures of Correlation?

• "Buy walnuts ⇒ buy milk [1%, 80%]" is misleading
  • if 85% of customers buy milk
• Support and confidence are not good to represent correlations
• So many interestingness measures? (Tan, Kumar, Srivastava @KDD'02)

  all_conf = sup(X) / max_item_sup(X)
  coh = sup(X) / |universe(X)|

            Milk    No Milk   Sum (row)
Coffee      m, c    ~m, c     c
No Coffee   m, ~c   ~m, ~c    ~c
Sum (col.)  m       ~m        Σ

DB   m, c    ~m, c   m, ~c    ~m, ~c     lift   all-conf   coh    χ2
A1   1000    100     100      10,000     9.26   0.91       0.83   9055
A2   100     1000    1000     100,000    8.44   0.09       0.05   670
A3   1000    100     10000    100,000    9.18   0.09       0.09   8172
A4   1000    1000    1000     1000       1      0.5        0.33   0

Which Measures Should Be Used?

• lift and χ2 are not good measures for correlations in large transactional DBs
• all-conf or coherence could be good measures (Omiecinski @TKDE'03)
• Both all-conf and coherence have the downward closure property
• Efficient algorithms can be derived for mining (Lee et al. @ICDM'03sub)

Chapter 5: Mining Frequent Patterns, Associations and Correlations

• Basic concepts and a road map
• Efficient and scalable frequent itemset mining methods
• Mining various kinds of association rules
• From association mining to correlation analysis
• Constraint-based association mining
• Summary
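The lift figures on the basketball/cereal slide come straight out of the contingency table. A quick computation (the function name `lift` is illustrative):

```python
# 5000 students: 3000 play basketball (B), 3750 eat cereal (C),
# 2000 do both -- the contingency table from the slide.

n, n_b, n_c, n_bc = 5000, 3000, 3750, 2000

def lift(n_ab, n_a, n_b, n):
    """lift(A, B) = P(A and B) / (P(A) * P(B))."""
    return (n_ab / n) / ((n_a / n) * (n_b / n))

lift_bc  = lift(n_bc, n_b, n_c, n)              # basketball & cereal
lift_bnc = lift(n_b - n_bc, n_b, n - n_c, n)    # basketball & not-cereal

# all-confidence of {B, C}: sup(X) / max item support
all_conf_bc = (n_bc / n) / max(n_b / n, n_c / n)
```

`lift_bc` ≈ 0.89 < 1 (negative correlation) while `lift_bnc` ≈ 1.33 > 1, matching the slide's conclusion that the "not cereal" rule is the more accurate one.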
Constraint-based (Query-Directed) Mining

• Finding all the patterns in a database autonomously?—unrealistic!
  • The patterns could be too many but not focused!
• Data mining should be an interactive process
  • The user directs what is to be mined using a data mining query language (or a graphical user interface)
• Constraint-based mining
  • User flexibility: provides constraints on what is to be mined
  • System optimization: explores such constraints for efficient mining—constraint-based mining

Constraints in Data Mining

• Knowledge type constraint:
  • classification, association, etc.
• Data constraint—using SQL-like queries
  • find product pairs sold together in stores in Chicago in Dec. '02
• Dimension/level constraint
  • in relevance to region, price, brand, customer category
• Rule (or pattern) constraint
  • small sales (price < $10) triggers big sales (sum > $200)
• Interestingness constraint
  • strong rules: min_support ≥ 3%, min_confidence ≥ 60%

Constrained Mining vs. Constraint-Based Search

• Constrained mining vs. constraint-based search/reasoning
  • Both are aimed at reducing the search space
  • Finding all patterns satisfying constraints vs. finding some (or one) answer in constraint-based search in AI
  • Constraint-pushing vs. heuristic search
  • It is an interesting research problem how to integrate them
• Constrained mining vs. query processing in DBMS
  • Database query processing requires finding all answers
  • Constrained pattern mining shares a similar philosophy with pushing selections deeply into query processing

Anti-Monotonicity in Constraint Pushing

TDB (min_sup = 2)
TID  Transaction
10   a, b, c, d, f
20   b, c, d, f, g, h
30   a, c, d, e, f
40   c, e, f, g

Item  Profit
a     40
b     0
c     -20
d     10
e     -30
f     30
g     20
h     -10

• Anti-monotonicity
  • When an itemset S violates the constraint, so does any of its supersets
  • sum(S.Price) ≤ v is anti-monotone
  • sum(S.Price) ≥ v is not anti-monotone
• Example: C: range(S.profit) ≤ 15 is anti-monotone
  • Itemset ab violates C
  • So does every superset of ab
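The ab example can be checked mechanically against the slide's profit table. A minimal sketch (the function name `satisfies` is illustrative):

```python
# C: range(S.profit) <= 15, using the slide's profit table. Once an
# itemset violates an anti-monotone constraint, every superset does too,
# so the whole branch can be pruned.

profit = {'a': 40, 'b': 0, 'c': -20, 'd': 10,
          'e': -30, 'f': 30, 'g': 20, 'h': -10}

def satisfies(S):
    vals = [profit[i] for i in S]
    return max(vals) - min(vals) <= 15

# ab has range 40 - 0 = 40 > 15: violated, and so is every superset.
```

A mining algorithm exploits this by never extending `{'a','b'}`: no candidate containing both a and b needs to be counted.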
Monotonicity for Constraint Pushing

TDB and profit table as on the previous slide (min_sup = 2)

• Monotonicity
  • When an itemset S satisfies the constraint, so does any of its supersets
  • sum(S.Price) ≥ v is monotone
  • min(S.Price) ≤ v is monotone
• Example: C: range(S.profit) ≥ 15
  • Itemset ab satisfies C
  • So does every superset of ab

Succinctness

• Succinctness:
  • Given A1, the set of items satisfying a succinctness constraint C, any set S satisfying C is based on A1, i.e., S contains a subset belonging to A1
  • Idea: whether an itemset S satisfies constraint C can be determined based on the selection of items, without looking at the transaction database
  • min(S.Price) ≤ v is succinct
  • sum(S.Price) ≥ v is not succinct
• Optimization: if C is succinct, C is pre-counting pushable

The Apriori Algorithm — Example

Database D (min_sup = 2):
TID  Items
100  1 3 4
200  2 3 5
300  1 2 3 5
400  2 5

Scan D → C1: {1}:2, {2}:3, {3}:3, {4}:1, {5}:3
L1: {1}:2, {2}:3, {3}:3, {5}:3
C2: {1 2}, {1 3}, {1 5}, {2 3}, {2 5}, {3 5}
Scan D → C2 counts: {1 2}:1, {1 3}:2, {1 5}:1, {2 3}:2, {2 5}:3, {3 5}:2
L2: {1 3}:2, {2 3}:2, {2 5}:3, {3 5}:2
C3: {2 3 5}; Scan D → L3: {2 3 5}:2

Naïve Algorithm: Apriori + Constraint

The same run on database D with constraint Sum{S.price} < 5: plain Apriori generates and counts every candidate as above, and the constraint is only applied to the final answers.
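The worked example above can be reproduced with a compact Apriori implementation (a minimal sketch; the function name `apriori` and the level-list representation are illustrative):

```python
from itertools import combinations
from collections import Counter

# Apriori on the slide's example database, min_support = 2.

db = [{1, 3, 4}, {2, 3, 5}, {1, 2, 3, 5}, {2, 5}]

def apriori(db, min_sup):
    levels = []
    # L1: frequent single items
    c = Counter(i for t in db for i in t)
    Lk = {frozenset([i]): s for i, s in c.items() if s >= min_sup}
    k = 1
    while Lk:
        levels.append(Lk)
        k += 1
        # join step + prune step: keep size-k sets whose (k-1)-subsets
        # are all frequent, then count them with one scan of db
        items = set().union(*Lk)
        cands = {frozenset(s) for s in combinations(sorted(items), k)
                 if all(frozenset(sub) in Lk for sub in combinations(s, k - 1))}
        counts = {cand: sum(cand <= t for t in db) for cand in cands}
        Lk = {cand: s for cand, s in counts.items() if s >= min_sup}
    return levels

levels = apriori(db, 2)
```

The run matches the slide: L1 drops {4}, L2 drops {1 2} and {1 5} after counting, and L3 = {2 3 5} with support 2.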
The Constrained Apriori Algorithm: Push an Anti-monotone Constraint Deep

The same run on database D with constraint Sum{S.price} < 5: a candidate that violates the anti-monotone constraint is pruned as soon as it is generated, since no superset of it can satisfy the constraint, so it is never counted.

The Constrained Apriori Algorithm: Push a Succinct Constraint Deep

The same run with constraint min{S.price} ≤ 1: since the constraint is succinct, candidates violating it are "not immediately to be used"—they can be filtered by item selection before any counting scan.

Converting "Tough" Constraints

TDB (min_sup = 2)
TID  Transaction
10   a, b, c, d, f
20   b, c, d, f, g, h
30   a, c, d, e, f
40   c, e, f, g

Item profits: a 40, b 0, c -20, d 10, e -30, f 30, g 20, h -10

• Convert tough constraints into anti-monotone or monotone ones by properly ordering items
• Examine C: avg(S.profit) ≥ 25
  • Order items in value-descending order: <a, f, g, d, b, h, c, e>
  • If an itemset afb violates C
    • So does afbh, afb*
    • It becomes anti-monotone!

Strongly Convertible Constraints

• avg(X) ≥ 25 is convertible anti-monotone w.r.t. the item-value-descending order R: <a, f, g, d, b, h, c, e>
  • If an itemset af violates a constraint C, so does every itemset with af as a prefix, such as afd
• avg(X) ≥ 25 is convertible monotone w.r.t. the item-value-ascending order R⁻¹: <e, c, h, b, d, g, f, a>
  • If an itemset d satisfies a constraint C, so do itemsets df and dfa, which have d as a prefix
• Thus, avg(X) ≥ 25 is strongly convertible
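The convertible-constraint claim above is easy to sanity-check numerically: under the descending order R, extending a prefix can only pull the average down. A minimal sketch on the slide's profit table (the helper name `avg` is illustrative):

```python
# C: avg(S.profit) >= 25 with items listed in value-descending order
# R = a, f, g, d, b, h, c, e. Once a prefix violates C, every itemset
# extending that prefix also violates it (convertible anti-monotone).

profit = {'a': 40, 'f': 30, 'g': 20, 'd': 10,
          'b': 0, 'h': -10, 'c': -20, 'e': -30}
R = ['a', 'f', 'g', 'd', 'b', 'h', 'c', 'e']

def avg(S):
    return sum(profit[i] for i in S) / len(S)

# af satisfies C (avg 35); afb violates it (avg ~23.3), and so does
# every further extension in R order, e.g. afbh (avg 15).
```

Under the ascending order R⁻¹ the same constraint behaves monotonely instead, which is why avg(X) ≥ 25 is called strongly convertible.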
Can Apriori Handle Convertible Constraints?

• A convertible constraint that is neither monotone, anti-monotone, nor succinct cannot be pushed deep into an Apriori mining algorithm
  • Within the level-wise framework, no direct pruning based on the constraint can be made
  • Itemset df violates constraint C: avg(X) ≥ 25
  • Since adf satisfies C, Apriori needs df to assemble adf, so df cannot be pruned
• But it can be pushed into the frequent-pattern growth framework!

Mining With Convertible Constraints

• C: avg(X) ≥ 25, min_sup = 2
• List items in every transaction in value-descending order R: <a, f, g, d, b, h, c, e>
  • C is convertible anti-monotone w.r.t. R
• Scan TDB once
  • Remove infrequent items
    • Item h is dropped
  • Itemsets a and f are good, …
• Projection-based mining
  • Impose an appropriate order on item projection
  • Many tough constraints can be converted into (anti-)monotone ones

TDB (min_sup = 2), transactions in R order (item values as before: a 40, f 30, g 20, d 10, b 0, h -10, c -20, e -30):
10  a, f, d, b, c
20  f, g, d, b, c
30  a, f, d, c, e
40  f, g, h, c, e

Handling Multiple Constraints

• Different constraints may require different or even conflicting item orderings
• If there exists an order R such that both C1 and C2 are convertible w.r.t. R, then there is no conflict between the two convertible constraints
• If there exists a conflict in the order of items
  • Try to satisfy one constraint first
  • Then use the order for the other constraint to mine frequent itemsets in the corresponding projected database

What Constraints Are Convertible?

Constraint                                        Convertible anti-monotone   Convertible monotone   Strongly convertible
avg(S) ≤ v, ≥ v                                   Yes                         Yes                    Yes
median(S) ≤ v, ≥ v                                Yes                         Yes                    Yes
sum(S) ≤ v (items could be of any value, v ≥ 0)   Yes                         No                     No
sum(S) ≤ v (items could be of any value, v ≤ 0)   No                          Yes                    No
sum(S) ≥ v (items could be of any value, v ≥ 0)   No                          Yes                    No
sum(S) ≥ v (items could be of any value, v ≤ 0)   Yes                         No                     No
……
Constraint-Based Mining—A General Picture

Constraint                     Antimonotone   Monotone      Succinct
v ∈ S                          no             yes           yes
S ⊇ V                          no             yes           yes
S ⊆ V                          yes            no            yes
min(S) ≤ v                     no             yes           yes
min(S) ≥ v                     yes            no            yes
max(S) ≤ v                     yes            no            yes
max(S) ≥ v                     no             yes           yes
count(S) ≤ v                   yes            no            weakly
count(S) ≥ v                   no             yes           weakly
sum(S) ≤ v (a ∈ S, a ≥ 0)      yes            no            no
sum(S) ≥ v (a ∈ S, a ≥ 0)      no             yes           no
range(S) ≤ v                   yes            no            no
range(S) ≥ v                   no             yes           no
avg(S) θ v, θ ∈ {=, ≤, ≥}      convertible    convertible   no
support(S) ≥ ξ                 yes            no            no
support(S) ≤ ξ                 no             yes           no

A Classification of Constraints

[Figure: Venn-style diagram — the Antimonotone and Monotone classes, with Succinct overlapping both; Convertible anti-monotone and Convertible monotone overlap in Strongly convertible; Inconvertible constraints lie outside all classes]

Chapter 5: Mining Frequent Patterns, Associations and Correlations

• Basic concepts and a road map
• Efficient and scalable frequent itemset mining methods
• Mining various kinds of association rules
• From association mining to correlation analysis
• Constraint-based association mining
• Summary

Frequent-Pattern Mining: Summary

• Frequent pattern mining—an important task in data mining
• Scalable frequent pattern mining methods
  • Apriori (candidate generation & test)
  • Projection-based (FP-growth, CLOSET+, ...)
  • Vertical format approach (CHARM, ...)
• Mining a variety of rules and interesting patterns
• Constraint-based mining
• Mining sequential and structured patterns
• Extensions and applications
Frequent-Pattern Mining: Research Problems

• Mining fault-tolerant frequent, sequential and structured patterns
  • Patterns allow limited faults (insertion, deletion, mutation)
• Mining truly interesting patterns
  • Surprising, novel, concise, …
• Application exploration
  • E.g., DNA sequence analysis and bio-pattern classification
  • "Invisible" data mining

Ref: Basic Concepts of Frequent Pattern Mining

• (Association Rules) R. Agrawal, T. Imielinski, and A. Swami. Mining association rules between sets of items in large databases. SIGMOD'93.
• (Max-pattern) R. J. Bayardo. Efficiently mining long patterns from databases. SIGMOD'98.
• (Closed-pattern) N. Pasquier, Y. Bastide, R. Taouil, and L. Lakhal. Discovering frequent closed itemsets for association rules. ICDT'99.
• (Sequential pattern) R. Agrawal and R. Srikant. Mining sequential patterns. ICDE'95.

Ref: Apriori and Its Improvements

• R. Agrawal and R. Srikant. Fast algorithms for mining association rules. VLDB'94.
• H. Mannila, H. Toivonen, and A. I. Verkamo. Efficient algorithms for discovering association rules. KDD'94.
• A. Savasere, E. Omiecinski, and S. Navathe. An efficient algorithm for mining association rules in large databases. VLDB'95.
• J. S. Park, M. S. Chen, and P. S. Yu. An effective hash-based algorithm for mining association rules. SIGMOD'95.
• H. Toivonen. Sampling large databases for association rules. VLDB'96.
• S. Brin, R. Motwani, J. D. Ullman, and S. Tsur. Dynamic itemset counting and implication rules for market basket analysis. SIGMOD'97.
• S. Sarawagi, S. Thomas, and R. Agrawal. Integrating association rule mining with relational database systems: Alternatives and implications. SIGMOD'98.

Ref: Depth-First, Projection-Based FP Mining

• R. Agarwal, C. Aggarwal, and V. V. V. Prasad. A tree projection algorithm for generation of frequent itemsets. J. Parallel and Distributed Computing:02.
• J. Han, J. Pei, and Y. Yin. Mining frequent patterns without candidate generation. SIGMOD'00.
• J. Pei, J. Han, and R. Mao. CLOSET: An Efficient Algorithm for Mining Frequent Closed Itemsets. DMKD'00.
• J. Liu, Y. Pan, K. Wang, and J. Han. Mining Frequent Item Sets by Opportunistic Projection. KDD'02.
• J. Han, J. Wang, Y. Lu, and P. Tzvetkov. Mining Top-K Frequent Closed Patterns without Minimum Support. ICDM'02.
• J. Wang, J. Han, and J. Pei. CLOSET+: Searching for the Best Strategies for Mining Frequent Closed Itemsets. KDD'03.
• G. Liu, H. Lu, W. Lou, and J. X. Yu. On Computing, Storing and Querying Frequent Patterns. KDD'03.
Ref: Vertical Format and Row Enumeration Methods

• M. J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithms for discovery of association rules. DAMI:97.
• M. J. Zaki and C.-J. Hsiao. CHARM: An Efficient Algorithm for Closed Itemset Mining. SDM'02.
• C. Bucila, J. Gehrke, D. Kifer, and W. White. DualMiner: A Dual-Pruning Algorithm for Itemsets with Constraints. KDD'02.
• F. Pan, G. Cong, A. K. H. Tung, J. Yang, and M. Zaki. CARPENTER: Finding Closed Patterns in Long Biological Datasets. KDD'03.

Ref: Mining Multi-Level and Quantitative Rules

• R. Srikant and R. Agrawal. Mining generalized association rules. VLDB'95.
• J. Han and Y. Fu. Discovery of multiple-level association rules from large databases. VLDB'95.
• R. Srikant and R. Agrawal. Mining quantitative association rules in large relational tables. SIGMOD'96.
• T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Data mining using two-dimensional optimized association rules: Scheme, algorithms, and visualization. SIGMOD'96.
• K. Yoda, T. Fukuda, Y. Morimoto, S. Morishita, and T. Tokuyama. Computing optimized rectilinear regions for association rules. KDD'97.
• R. J. Miller and Y. Yang. Association rules over interval data. SIGMOD'97.
• Y. Aumann and Y. Lindell. A Statistical Theory for Quantitative Association Rules. KDD'99.

Ref: Mining Correlations and Interesting Rules

• M. Klemettinen, H. Mannila, P. Ronkainen, H. Toivonen, and A. I. Verkamo. Finding interesting rules from large sets of discovered association rules. CIKM'94.
• S. Brin, R. Motwani, and C. Silverstein. Beyond market basket: Generalizing association rules to correlations. SIGMOD'97.
• C. Silverstein, S. Brin, R. Motwani, and J. Ullman. Scalable techniques for mining causal structures. VLDB'98.
• P.-N. Tan, V. Kumar, and J. Srivastava. Selecting the Right Interestingness Measure for Association Patterns. KDD'02.
• E. Omiecinski. Alternative Interest Measures for Mining Associations. TKDE'03.
• Y. K. Lee, W. Y. Kim, Y. D. Cai, and J. Han. CoMine: Efficient Mining of Correlated Patterns. ICDM'03.

Ref: Mining Other Kinds of Rules

• R. Meo, G. Psaila, and S. Ceri. A new SQL-like operator for mining association rules. VLDB'96.
• B. Lent, A. Swami, and J. Widom. Clustering association rules. ICDE'97.
• A. Savasere, E. Omiecinski, and S. Navathe. Mining for strong negative associations in a large database of customer transactions. ICDE'98.
• D. Tsur, J. D. Ullman, S. Abitboul, C. Clifton, R. Motwani, and S. Nestorov. Query flocks: A generalization of association-rule mining. SIGMOD'98.
• F. Korn, A. Labrinidis, Y. Kotidis, and C. Faloutsos. Ratio rules: A new paradigm for fast, quantifiable data mining. VLDB'98.
• K. Wang, S. Zhou, and J. Han. Profit Mining: From Patterns to Actions. EDBT'02.
Ref: Constraint-Based Pattern Mining

„ R. Srikant, Q. Vu, and R. Agrawal. Mining association rules with item constraints. KDD'97.
„ R. Ng, L. V. S. Lakshmanan, J. Han, and A. Pang. Exploratory mining and pruning optimizations of constrained association rules. SIGMOD'98.
„ M. N. Garofalakis, R. Rastogi, and K. Shim. SPIRIT: Sequential Pattern Mining with Regular Expression Constraints. VLDB'99.
„ G. Grahne, L. Lakshmanan, and X. Wang. Efficient mining of constrained correlated sets. ICDE'00.
„ J. Pei, J. Han, and L. V. S. Lakshmanan. Mining Frequent Itemsets with Convertible Constraints. ICDE'01.
„ J. Pei, J. Han, and W. Wang. Mining Sequential Patterns with Constraints in Large Databases. CIKM'02.

Ref: Mining Sequential and Structured Patterns

„ R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. EDBT'96.
„ H. Mannila, H. Toivonen, and A. I. Verkamo. Discovery of frequent episodes in event sequences. DAMI'97.
„ M. Zaki. SPADE: An Efficient Algorithm for Mining Frequent Sequences. Machine Learning'01.
„ J. Pei, J. Han, H. Pinto, Q. Chen, U. Dayal, and M.-C. Hsu. PrefixSpan: Mining Sequential Patterns Efficiently by Prefix-Projected Pattern Growth. ICDE'01.
„ M. Kuramochi and G. Karypis. Frequent Subgraph Discovery. ICDM'01.
„ X. Yan, J. Han, and R. Afshar. CloSpan: Mining Closed Sequential Patterns in Large Datasets. SDM'03.
„ X. Yan and J. Han. CloseGraph: Mining Closed Frequent Graph Patterns. KDD'03.

Ref: Mining Spatial, Multimedia, and Web Data

„ K. Koperski and J. Han. Discovery of Spatial Association Rules in Geographic Information Databases. SSD'95.
„ O. R. Zaiane, M. Xin, and J. Han. Discovering Web Access Patterns and Trends by Applying OLAP and Data Mining Technology on Web Logs. ADL'98.
„ O. R. Zaiane, J. Han, and H. Zhu. Mining Recurrent Items in Multimedia with Progressive Resolution Refinement. ICDE'00.
„ D. Gunopulos and I. Tsoukatos. Efficient Mining of Spatiotemporal Patterns. SSTD'01.

Ref: Mining Frequent Patterns in Time-Series Data

„ B. Ozden, S. Ramaswamy, and A. Silberschatz. Cyclic association rules. ICDE'98.
„ J. Han, G. Dong, and Y. Yin. Efficient Mining of Partial Periodic Patterns in Time Series Database. ICDE'99.
„ H. Lu, L. Feng, and J. Han. Beyond Intra-Transaction Association Analysis: Mining Multi-Dimensional Inter-Transaction Association Rules. TOIS'00.
„ B.-K. Yi, N. Sidiropoulos, T. Johnson, H. V. Jagadish, C. Faloutsos, and A. Biliris. Online Data Mining for Co-Evolving Time Sequences. ICDE'00.
„ W. Wang, J. Yang, and R. Muntz. TAR: Temporal Association Rules on Evolving Numerical Attributes. ICDE'01.
„ J. Yang, W. Wang, and P. S. Yu. Mining Asynchronous Periodic Patterns in Time Series Data. TKDE'03.

Ref: Iceberg Cube and Cube Computation

„ S. Agarwal, R. Agrawal, P. M. Deshpande, A. Gupta, J. F. Naughton, R. Ramakrishnan, and S. Sarawagi. On the computation of multidimensional aggregates. VLDB'96.
„ Y. Zhao, P. M. Deshpande, and J. F. Naughton. An array-based algorithm for simultaneous multidimensional aggregates. SIGMOD'97.
„ J. Gray, et al. Data cube: A relational aggregation operator generalizing group-by, cross-tab and sub-totals. DAMI'97.
„ M. Fang, N. Shivakumar, H. Garcia-Molina, R. Motwani, and J. D. Ullman. Computing iceberg queries efficiently. VLDB'98.
„ S. Sarawagi, R. Agrawal, and N. Megiddo. Discovery-driven exploration of OLAP data cubes. EDBT'98.
„ K. Beyer and R. Ramakrishnan. Bottom-up computation of sparse and iceberg cubes. SIGMOD'99.

Ref: Iceberg Cube and Cube Exploration

„ J. Han, J. Pei, G. Dong, and K. Wang. Computing Iceberg Data Cubes with Complex Measures. SIGMOD'01.
„ W. Wang, H. Lu, J. Feng, and J. X. Yu. Condensed Cube: An Effective Approach to Reducing Data Cube Size. ICDE'02.
„ G. Dong, J. Han, J. Lam, J. Pei, and K. Wang. Mining Multi-Dimensional Constrained Gradients in Data Cubes. VLDB'01.
„ T. Imielinski, L. Khachiyan, and A. Abdulghani. Cubegrades: Generalizing association rules. DAMI'02.
„ L. V. S. Lakshmanan, J. Pei, and J. Han. Quotient Cube: How to Summarize the Semantics of a Data Cube. VLDB'02.
„ D. Xin, J. Han, X. Li, and B. W. Wah. Star-Cubing: Computing Iceberg Cubes by Top-Down and Bottom-Up Integration. VLDB'03.

Ref: FP for Classification and Clustering

„ G. Dong and J. Li. Efficient mining of emerging patterns: Discovering trends and differences. KDD'99.
„ B. Liu, W. Hsu, and Y. Ma. Integrating Classification and Association Rule Mining. KDD'98.
„ W. Li, J. Han, and J. Pei. CMAR: Accurate and Efficient Classification Based on Multiple Class-Association Rules. ICDM'01.
„ H. Wang, W. Wang, J. Yang, and P. S. Yu. Clustering by pattern similarity in large data sets. SIGMOD'02.
„ J. Yang and W. Wang. CLUSEQ: Efficient and effective sequence clustering. ICDE'03.
„ B. Fung, K. Wang, and M. Ester. Large Hierarchical Document Clustering Using Frequent Itemsets. SDM'03.
„ X. Yin and J. Han. CPAR: Classification Based on Predictive Association Rules. SDM'03.

Ref: Stream and Privacy-Preserving FP Mining

„ A. Evfimievski, R. Srikant, R. Agrawal, and J. Gehrke. Privacy Preserving Mining of Association Rules. KDD'02.
„ J. Vaidya and C. Clifton. Privacy Preserving Association Rule Mining in Vertically Partitioned Data. KDD'02.
„ G. Manku and R. Motwani. Approximate Frequency Counts over Data Streams. VLDB'02.
„ Y. Chen, G. Dong, J. Han, B. W. Wah, and J. Wang. Multi-Dimensional Regression Analysis of Time-Series Data Streams. VLDB'02.
„ C. Giannella, J. Han, J. Pei, X. Yan, and P. S. Yu. Mining Frequent Patterns in Data Streams at Multiple Time Granularities. Next Generation Data Mining'03.
„ A. Evfimievski, J. Gehrke, and R. Srikant. Limiting Privacy Breaches in Privacy Preserving Data Mining. PODS'03.

Ref: Other Freq. Pattern Mining Applications

„ Y. Huhtala, J. Kärkkäinen, P. Porkka, and H. Toivonen. Efficient Discovery of Functional and Approximate Dependencies Using Partitions. ICDE'98.
„ H. V. Jagadish, J. Madar, and R. Ng. Semantic Compression and Pattern Extraction with Fascicles. VLDB'99.
„ T. Dasu, T. Johnson, S. Muthukrishnan, and V. Shkapenyuk. Mining Database Structure; or How to Build a Data Quality Browser. SIGMOD'02.


