
Tutorial PM-3

High Performance Data Mining

Vipin Kumar (University of Minnesota)


Mohammed Zaki (Rensselaer Polytechnic)

High Performance Data Mining

Vipin Kumar Mohammed J. Zaki


Computer Science Dept. Computer Science Dept.
University of Minnesota Rensselaer Polytechnic Institute
Minneapolis, MN, USA. Troy, NY, USA.
kumar@cs.umn.edu zaki@cs.rpi.edu
www.cs.umn.edu/~kumar www.cs.rpi.edu/~zaki

High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 11

Tutorial overview
• Overview of KDD and data mining
• Parallel design space
• Classification
• Associations
• Sequences
• Clustering
• Future directions and summary
Overview of data mining

• What is Data Mining/KDD?


• Why is KDD necessary?
• The KDD process
• Mining operations and methods
• Core issues in KDD


What is data mining?

The iterative and interactive process of discovering valid, novel,
useful, and understandable patterns or models in massive databases.

What is data mining?

• Valid: generalize to the future


• Novel: what we don't know
• Useful: be able to take some action
• Understandable: leading to insight
• Iterative: takes multiple passes
• Interactive: human in the loop

Data mining goals

• Prediction
- What?

- Opaque

• Description
- Why?

- Transparent

Data mining operations

• Verification driven
- Validating hypothesis
- Querying and reporting (spreadsheets, pivot tables)
- Multidimensional analysis (dimensional summaries); On-Line Analytical Processing
- Statistical analysis


Data mining operations

• Discovery driven
- Exploratory data analysis
- Predictive modeling
- Database segmentation
- Link analysis
- Deviation detection

KDD Process

[Figure: the KDD process pipeline — data selection, cleaning and transformation, mining, and interpretation/evaluation, iterating as needed]

Data mining process

• Understand application domain


- Prior knowledge, user goals
• Create target dataset
- Select data, focus on subsets
• Data cleaning and transformation
- Remove noise and outliers, handle missing values
- Select features, reduce dimensionality
Data mining process
• Apply data mining algorithm
-Associations, sequences, classification,
clustering, etc.
• Interpret, evaluate and visualize patterns
- What's new and interesting?
- Iterate if needed
• Manage discovered knowledge
- Close the loop

Why Mine Data? Commercial Viewpoint

• Lots of data is being collected and warehoused.
• Computing has become affordable.
• Competitive pressure is strong.
- Provide better, customized services for an edge.
- Information is becoming a product in its own right.
Why Mine Data?
Scientific Viewpoint...
• Data collected and stored at enormous speeds (Gbyte/hour)
- remote sensors on a satellite
- telescopes scanning the skies
- microarrays generating gene expression data
- scientific simulations generating terabytes of data
• Traditional techniques are infeasible for raw data
• Data mining for data reduction
- cataloging, classifying, segmenting data
- Helps scientists in hypothesis formation


Origins of Data Mining

• Draws ideas from machine learning/AI,


pattern recognition, statistics, database
systems, and data visualization.
• Traditional techniques may be unsuitable
- Enormity of data
- High dimensionality of data
- Heterogeneous, distributed nature of data
Data mining methods

• Predictive modeling (classification,


regression)
• Segmentation (clustering)
• Dependency modeling (graphical
models, density estimation)
• Summarization (associations)
• Change and deviation detection

Data mining techniques

• Association rules: detect sets of attributes that


frequently co-occur, and rules among them, e.g.,
90% of the people who buy cookies also buy
milk (60% of all grocery shoppers buy both)
• Sequence mining (categorical): discover
sequences of events that commonly occur
together, e.g., in a set of DNA sequences
ACGTC is followed by GTCA after a gap of 9,
with 30% probability
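Support and confidence, the two numbers quoted in the rule above, are simple to compute. The following is an illustrative sketch (not from the tutorial) over a toy transaction list; the items and data are invented for the example.

```python
# Hypothetical sketch: computing support and confidence for one
# association rule (cookies -> milk) over a toy transaction list.

def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) for antecedent -> consequent."""
    n = len(transactions)
    both = sum(1 for t in transactions
               if antecedent <= t and consequent <= t)   # <= is subset test
    ante = sum(1 for t in transactions if antecedent <= t)
    support = both / n                  # fraction of all baskets with both
    confidence = both / ante if ante else 0.0   # of antecedent baskets
    return support, confidence

transactions = [
    {"cookies", "milk"}, {"cookies", "milk", "bread"},
    {"cookies"}, {"milk"}, {"bread", "milk"},
]
sup, conf = rule_stats(transactions, {"cookies"}, {"milk"})
```

Here 2 of 5 baskets contain both items (support 0.4), and 2 of the 3 cookie baskets also contain milk (confidence 2/3).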
Data mining techniques

• Classification and regression: assign a new data


record to one of several predefined categories or
classes. Regression deals with predicting
real-valued fields. Also called supervised learning.
• Clustering: partition the dataset into subsets or
groups such that elements of a group share a
common set of properties, with high within-group
similarity and low inter-group similarity. Also
called unsupervised learning.


Data mining techniques

• Similarity search: given a database of objects,


and a "query" object, find the object(s) that are
within a user-defined distance of the queried
object, or find all pairs within some distance of
each other.
• Deviation detection: find the record(s) that is
(are) the most different from the other records,
i.e., find all outliers. These may be thrown away
as noise or may be the "interesting" ones.
Data mining techniques
Many other methods, such as
- Neural networks
- Genetic algorithms
- Hidden Markov models
- Time series
- Bayesian networks
- Soft computing: rough and fuzzy sets


Main challenges for KDD

• Scalability
- Efficient and sufficient sampling
- In-memory vs. disk-based processing
- High performance computing
• Automation
- Ease of use
- Using prior knowledge

Tutorial overview
• Overview of KDD and data mining
• Parallel design space
• Classification

• Associations
• Sequences
• Clustering
• Future directions and summary

Speeding up data mining

• Data oriented approach


- Discretization
- Feature selection
- Feature construction (PCA)
- Sampling

• Methods oriented approach


- Efficient and scalable algorithms

Speeding up data mining (contd.)

• Methods oriented approach (contd.)


- Parallel data mining
• Task or control parallelism
• Data parallelism
• Hybrid parallelism
- Distributed data mining
• Voting
• Meta-learning, etc.


Need for Parallel Formulations

• Need to handle very large datasets.


• Memory limitations of sequential computers
cause sequential algorithms to make multiple
expensive I/O passes over data.
• Need for scalable, efficient (fast) data mining
computations
- gain competitive advantage.
- Handle larger data for greater accuracy in shorter
times.
Parallel Design Space
• Parallel architectures
- Distributed memory
- Shared disk
- Shared memory
- Hybrid cluster of SMPs
• Task and data parallelism
• Static and dynamic load balancing
• Horizontal and vertical data layout
• Data and concept partitioning

Parallel Hardware
• Distributed-memory machines
- Each processor has local memory and disk
- Communication via message-passing
- Hard to program: explicit data distribution
- Goal: minimize communication
• Shared-memory machines
- Shared global address space and disk
- Communication via shared memory variables
- Ease of programming
- Goal: maximize locality, minimize false sharing
• Current trend: Cluster of SMPs
Distributed Memory Architecture
(Shared Nothing)
[Figure: distributed-memory (shared-nothing) architecture — processors, each with local memory and disk, connected by an interconnection network]

DMM: Shared Disk Architecture

[Figure: shared-disk architecture — processors with local memory sharing the disks through the interconnection network]
Shared Memory Architecture
(Shared Everything)
[Figure: shared-memory (shared-everything) architecture — processors sharing a global memory and disks through the interconnection network]

Cluster of SMPs
[Figure: cluster of SMPs — SMP nodes connected by an interconnection network]

Task vs. Data Parallelism
• Data Parallelism
- Data partitioned among P processors
- Each processor performs the same work
on local partition
• Task Parallelism
- Each processor performs different
computation
- Data may be (selectively) replicated or
partitioned
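The contrast above can be sketched in a few lines. This is an illustrative toy, using Python's `concurrent.futures` thread pool as a stand-in for the processors (the data, worker counts, and functions are all invented for the example).

```python
# Data parallelism: every worker runs the SAME function on its own
# partition of the data. Task parallelism: workers run DIFFERENT
# computations, here over the same (replicated) data.

from concurrent.futures import ThreadPoolExecutor

data = list(range(100))

def data_parallel(p=4):
    parts = [data[i::p] for i in range(p)]   # partition the data p ways
    with ThreadPoolExecutor(p) as ex:
        return sum(ex.map(sum, parts))       # same work per partition

def task_parallel():
    tasks = [min, max, sum]                  # different computations
    with ThreadPoolExecutor(len(tasks)) as ex:
        futures = [ex.submit(t, data) for t in tasks]
        return [f.result() for f in futures]

total = data_parallel()
stats = task_parallel()
```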

Task vs. Data Parallelism


[Figure: task parallelism assigns different computations to P0, P1, P2; data parallelism runs the uniprocessor computation on partitioned data]
Static vs. Dynamic Load Balance
• Static Load Balancing
- Work is initially divided (heuristically)
- No subsequent computation or data movement
• Dynamic Load Balancing
- Steal work from heavily loaded processors
- Reassign to lightly loaded processors
- Important for irregular computations,
heterogeneous environments, etc.

Horizontal/Vertical Data Format


Horizontal Vertical

[Figure: the same records (Refund, Marital Status, Taxable Income, Cheat) stored in horizontal (row-wise) and vertical (column-wise) layout]
Data and concept partitioning

• Shared
- SMP or shared disk architectures

• Replicated
- Partially or totally

• Partitioned
- Round-robin partitioning
- Hash partitioning
- Range partitioning
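The three partitioning schemes above can be sketched directly. This is an illustrative example only; the choice of the first record field as the partitioning key is an assumption made for the demo.

```python
# Round-robin, hash, and range partitioning of records across p parts.
import bisect

def round_robin(records, p):
    parts = [[] for _ in range(p)]
    for i, r in enumerate(records):
        parts[i % p].append(r)               # cycle through partitions
    return parts

def hash_partition(records, p, key=lambda r: r[0]):
    parts = [[] for _ in range(p)]
    for r in records:
        parts[hash(key(r)) % p].append(r)    # owner chosen by hash of key
    return parts

def range_partition(records, p, bounds, key=lambda r: r[0]):
    # bounds: sorted upper bounds; keys <= bounds[i] land in partition i
    parts = [[] for _ in range(p)]
    for r in records:
        parts[bisect.bisect_left(bounds, key(r))].append(r)
    return parts

records = [(i, f"rec{i}") for i in range(10)]
rr = round_robin(records, 3)
rg = range_partition(records, 3, bounds=[4, 8])
```

Round-robin balances record counts; hash partitioning routes equal keys to the same owner; range partitioning keeps key order but depends on well-chosen bounds.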

Tutorial overview
• Overview of KDD and data mining
• Parallel design space
• Classification
• Associations

• Sequences
• Clustering
• Future directions and summary
What is Classification?

Classification is the process of assigning


new objects to predefined categories or
classes

• Given a set of labeled records


• Build a model (decision tree)
• Predict labels for future unlabeled
records

Classification learning

• Supervised learning (labels known)


• Example described in terms of attributes
- Categorical (unordered symbolic values)
- Numeric (integers, reals)
• Class (output/predicted attribute):
categorical for classification, numeric for
regression

Classification learning

• Training set: set of examples, where


each example is a feature vector (i.e., a
set of (attribute,value) pairs) with its
associated class. Model built on this set.
• Test set: a set of examples disjoint from
the training set, used for testing the
accuracy of a model.


Classification Models
• Some models are better than others
- Accuracy

- Understandability

• Models range from easy to understand to incomprehensible
- Decision trees (easier)
- Rule induction
- Regression models
- Neural networks (harder)
Decision Tree Classification

Splitting attributes: Refund, MarSt, TaxInc.

[Figure: decision tree with Refund at the root (Yes --> NO), MarSt under the No branch (Married --> NO), and a TaxInc test (< 80K --> NO, >= 80K --> YES) under Single/Divorced]

The splitting attribute at a node is determined based on the Gini index.


From Tree to Rules

1) Refund = Yes => NO
2) Refund = No and MarSt in {Single, Divorced} and TaxInc < 80K => NO
3) Refund = No and MarSt in {Single, Divorced} and TaxInc >= 80K => YES
4) Refund = No and MarSt in {Married} => NO

[Figure: the decision tree from the previous slide, annotated with these four root-to-leaf rules]
Classification algorithm

• Build tree
- Start with the data at the root node
- Select an attribute and formulate a logical test on that attribute
- Branch on each outcome of the test, and move the subset of examples satisfying that outcome to the corresponding child node

Classification algorithm
- Recurse on each child node
- Repeat until leaves are "pure", i.e., have
example from a single class, or "nearly
pure", i.e., majority of examples are from
the same class
• Prune tree
- Remove subtrees that do not improve classification accuracy
- Avoid over-fitting, i.e., training-set-specific artifacts
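The build/recurse procedure described above fits in a short function. The following is a minimal sketch, not a real library: the split-function interface and the trivial mean-based splitter are assumptions made so the demo runs end to end.

```python
# Grow a binary tree until leaves are pure (or nearly pure, via min_size),
# delegating split selection to split_fn (the Gini machinery of the
# following slides would plug in there).
from collections import Counter

def build_tree(rows, labels, split_fn, min_size=1):
    counts = Counter(labels)
    majority = counts.most_common(1)[0][0]
    if len(counts) == 1 or len(rows) <= min_size:
        return majority                      # pure / nearly pure leaf
    split = split_fn(rows, labels)           # (attr index, threshold)
    if split is None:
        return majority
    attr, thr = split
    left = [i for i, r in enumerate(rows) if r[attr] < thr]
    right = [i for i, r in enumerate(rows) if r[attr] >= thr]
    if not left or not right:
        return majority                      # degenerate split: stop
    return (attr, thr,
            build_tree([rows[i] for i in left], [labels[i] for i in left],
                       split_fn, min_size),
            build_tree([rows[i] for i in right], [labels[i] for i in right],
                       split_fn, min_size))

def mean_split(rows, labels):
    # toy splitter for the demo: always split attribute 0 at its mean
    vals = [r[0] for r in rows]
    return (0, sum(vals) / len(vals))

tree = build_tree([[1], [2], [10], [11]], ["a", "a", "b", "b"], mean_split)
```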
Build tree
• Evaluate split-points for all attributes
• Select the "best" point and the "winning"
attribute
• Split the data into two
• Breadth/depth-first construction

• CRITICAL STEPS:
- Formulation of good split tests
- Selection measure for attributes


How to capture good splits?

• Occam's razor: Prefer the simplest


hypothesis that fits the data
• Minimum message/description length
- Dataset D
- Hypotheses H1, H2, ..., Hk describing D
- MML(Hi) = Mlength(Hi) + Mlength(D|Hi)
- Pick the Hi with minimum MML
• Mlength given by Gini index, Gain, etc.
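The selection rule above is just an argmin over model-plus-data code lengths. A toy illustration (all bit counts are invented numbers, not real encodings):

```python
# Among candidate hypotheses, pick the one minimizing
# Mlength(H) + Mlength(D|H): a tiny tree with many errors and a huge
# tree with none can both lose to a moderate tree with few errors.

candidates = {
    "H1": {"model_bits": 10, "data_bits": 40},   # simple, many errors
    "H2": {"model_bits": 25, "data_bits": 12},   # moderate, few errors
    "H3": {"model_bits": 60, "data_bits": 2},    # huge, near-zero errors
}

def mml(h):
    return candidates[h]["model_bits"] + candidates[h]["data_bits"]

best = min(candidates, key=mml)
```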
Tree pruning using MDL
• Data encoding: sum classification errors
• Model encoding:
- Encode the tree structure
- Encode the split points
• Pruning: choose the smallest-length option
- Convert to leaf
- Prune left or right child
- Do nothing


Hunt's Method
• Attributes: Refund (Yes, No), Marital Status
(Single, Married, Divorced), Taxable Income
• Class: Cheat, Don't Cheat

What's really happening?
[Figure: scatter plot of the training records in the (Marital Status, Income) plane, with Cheat and Don't Cheat points; the tree's tests — Married vs. not, Income < 80K — carve the plane into class regions]

Finding good split points

• Use Gini index for partition purity

    Gini(S) = 1 - sum(p_i^2, i = 1..c)

    Gini(S1, S2) = (n1/n) * Gini(S1) + (n2/n) * Gini(S2)

• If S is pure, Gini(S) = 0
• Find the split point with minimum Gini
• Only need class distributions
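The last bullet is the key implementation point: Gini needs only the class counts, not the records. A minimal sketch of both formulas, checked against the worked example on the next slide (class counts 14/9 vs. 1/18):

```python
# Gini impurity of a node and weighted Gini of a binary split,
# computed purely from class-count vectors.

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(left_counts, right_counts):
    n1, n2 = sum(left_counts), sum(right_counts)
    n = n1 + n2
    return (n1 / n) * gini(left_counts) + (n2 / n) * gini(right_counts)

pure = gini([10, 0])                     # a pure node: Gini = 0
split = gini_split([14, 9], [1, 18])     # slide's example: about 0.31
```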
Finding good split points

[Figure: two candidate splits of the (Marital Status, Income) scatter plot — a vertical split on Income (Left/Right) and a horizontal split on Marital Status (Top/Bottom) — with the resulting class counts]

            NO   YES                  NO   YES
    Left    14     9       Top         5    23
    Right    1    18       Bottom     10     4

    Gini(split) = 0.31     Gini(split) = 0.34



Categorical Attributes:
Computing Gini Index
• For each distinct value, gather counts for each class in the dataset
• Use the count matrix to make decisions

[Figure: CarType count matrices for a multi-way split (one column per value) and for a two-way split (find the best binary partition of the values), each with its Gini index]
Continuous Attributes:
Computing Gini Index...
• For efficient computation: for each attribute,
- Sort the attribute on values
- Linearly scan these values, each time updating the count
matrix and computing gini index
- Choose the split position that has the least gini index
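The sort-then-scan procedure above can be written compactly: each step of the sweep moves one record from the right counts to the left counts, so every candidate Gini costs O(number of classes). A sketch under the assumption of integer class labels:

```python
# Sort one continuous attribute, sweep the candidate split positions
# while incrementally updating class counts, and keep the position
# with the lowest weighted Gini index.

def gini(counts):
    n = sum(counts)
    return (1.0 - sum((c / n) ** 2 for c in counts)) if n else 0.0

def best_split(values, labels, classes=(0, 1)):
    pairs = sorted(zip(values, labels))
    left = {c: 0 for c in classes}
    right = {c: 0 for c in classes}
    for _, y in pairs:
        right[y] += 1
    n = len(pairs)
    best = (float("inf"), None)
    for i in range(1, n):                    # split between i-1 and i
        y = pairs[i - 1][1]
        left[y] += 1
        right[y] -= 1
        if pairs[i - 1][0] == pairs[i][0]:
            continue                         # no split between equal values
        g = (i / n) * gini(list(left.values())) + \
            ((n - i) / n) * gini(list(right.values()))
        if g < best[0]:
            best = (g, (pairs[i - 1][0] + pairs[i][0]) / 2)
    return best

g, threshold = best_split([60, 70, 75, 85, 90, 95], [0, 0, 0, 1, 1, 1])
```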

[Figure: a sorted attribute with candidate split positions between consecutive values, and the count matrix updated at each position]

C4.5
• Simple depth-first construction.
• Sorts Continuous Attributes at each node.
• Needs entire data to fit in memory.
• Unsuitable for Large Datasets.
- Needs out-of-core sorting.
• Classification Accuracy shown to
improve when entire datasets are
used!

SPRINT [Shafer,Agrawal,Mehta]

Attribute lists:

[Figure: SPRINT's per-attribute lists of (value, rid, class) entries, with continuous lists kept in sorted order]

SPRINT (contd.)
• The arrays of the continuous attributes are pre-sorted.
- The sorted order is maintained during each split.
• The classification tree is grown in a breadth-first fashion.
• Class information is clubbed with each attribute list.
• Attribute lists are physically split among the tree nodes.
• The split-determining phase is just a linear scan of the lists at each node.
• A hashing scheme is used in the splitting phase.
- tids of the splitting attribute are hashed with the tree node as the key.
- The remaining attribute arrays are split by querying this hash structure (lookup table).

SPRINT Disadvantages

• Size of hash table is O(N) for top levels


of the tree.
• If hash table does not fit in memory
(mostly true for large datasets), then
build in parts so that each part fits.
- Multiple expensive I/O passes over the
entire dataset.


Constructing a Decision Tree in


Parallel
• Partitioning of data only (categorical attributes)
- a global reduction per node is required
- a large number of classification tree nodes gives high communication cost

[Figure: each processor holds a horizontal partition of the data; per-node count matrices (Good/Bad) are combined by a global reduction]

Constructing a Decision Tree in
Parallel
• Partitioning of classification tree nodes
- natural concurrency
- load imbalance, as the amount of work associated with each node varies
- child nodes use the same data as used by the parent node
- loss of locality
- high data movement cost

[Figure: 10,000 training records distributed among classification tree nodes]


Challenges in Constructing a Parallel Classifier
• Partitioning of data only
- large number of classification tree nodes gives high communication
cost

• Partitioning of classification tree nodes


- natural concurrency
- load imbalance as the amount of work associated with each node
varies
- child nodes use the same data as used by parent node
- loss of locality
- high data movement cost

• How do we efficiently perform the computation in parallel?
Synchronous Tree Construction Approach

[Figure: the data partitioned across Proc 0-3; at each tree node, class distribution information is exchanged among all processors]

+ No data movement is required
- Load imbalance (can be eliminated by breadth-first expansion)
- High communication cost (becomes too high in lower parts of the tree)


Partitioned Tree Construction Approach


[Figure: both the data and the tree nodes are partitioned across Proc 0-3; after a few levels each group of processors works independently on its own subtree]

+ Highly concurrent
- High communication cost due to excessive data movements
- Load imbalance
Hybrid Parallel Formulation

[Figure: the synchronous tree construction approach is used above the computation frontier (here at depth 3); below it, processor partitions 1 and 2 independently apply the partitioned tree construction approach]

Load Balancing

[Figure: Step 1 — exchange data between processor partitions; Step 2 — load balance within processor partitions]

Splitting Criterion

• Switch to the partitioned tree construction approach when

    2 * Communication Cost > Moving Cost + Load Balancing Cost

• The splitting criterion ensures

    Communication Cost < 2 * Communication Cost(optimal)

G. Karypis and V. Kumar, IEEE Transactions on Parallel and Distributed Systems, Oct. 1994


Experimental Results

• Data set
- function 2 data set discussed in SLIQ paper
(Mehta, Agrawal and Rissanen, EDBT'96)
- 2 class labels, 3 categorical and 6 continuous
attributes
• IBM SP2 with 128 processors
- 66.7 MHz CPU with 256 MB real memory
- AIX version 4
- high performance switch

Speedup Comparison of the Three
Parallel Algorithms

[Figure: speedup curves of the partitioned, hybrid, and synchronous algorithms on 0.8 million and 1.6 million examples, for up to 16 processors]

Splitting Criterion Verification in the


Hybrid Algorithm
    Splitting Criterion Ratio = 2 * Communication Cost / (Moving Cost + Load Balancing Cost)

[Figure: runtimes when switching at different values of the splitting criterion ratio (log2 scale, x = 0 corresponds to ratio = 1) — 0.8 million examples on 8 processors and 1.6 million examples on 16 processors]

Speedup of the Hybrid Algorithm with
Different Size Data Sets
[Figure: speedup curves of the hybrid algorithm for 0.8, 1.6, 3.2, 6.4, 12.8, and 25.6 million examples, for up to 140 processors]


Scaleup of the Hybrid Algorithm


[Figure: scaleup of the hybrid algorithm — runtimes for 50K examples per processor as the number of processors grows]

Summary of Algorithms for
Categorical Attributes
• Synchronous Tree Construction Approach
- no data movement required
- high communication cost as tree becomes bushy
• Partitioned Tree Construction Approach
- processors work independently once partitioned completely
- load imbalance and high cost of data movement
• Hybrid Algorithm
- combines good features of two approaches
- adapts dynamically according to the size and shape of trees


Handling Continuous Attributes

• Sort continuous attributes at each node of the


tree (as in C4.5). Expensive, hence
Undesirable!
• Discretize continuous attributes
- CLOUDS (Alsabti, Ranka, and Singh, 1998)
- SPEC (Srivastava, Han, Kumar, and Singh, 1997)
• Use a pre-sorted list for each continuous attribute
- SPRINT(Shafer, Agrawal, and Mehta, VLDB'96)
- ScalParC (Joshi, Karypis, and Kumar, IPPS'98)
Design Approach
• Goal: Scalability in both Runtime and
Memory Requirements.
• Parallelization overhead: To = p * Tp - Ts
• Need To = O(Ts) for scalability; the per-processor
overhead should not exceed O(Ts/p).
• Two components of To:
- Load Imbalance.

- Communication time.
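A quick numeric illustration of the overhead definition above (the times are invented for the example): with Ts = 100 s serial and Tp = 15 s on p = 8 processors, the overhead is To = 8 * 15 - 100 = 20 s, i.e., efficiency 100/120.

```python
# To = p * Tp - Ts: total processor-time spent beyond the serial work
# (load imbalance + communication). Efficiency = Ts / (p * Tp).

def overhead(p, t_parallel, t_serial):
    return p * t_parallel - t_serial

def efficiency(p, t_parallel, t_serial):
    return t_serial / (p * t_parallel)

To = overhead(8, 15.0, 100.0)
eff = efficiency(8, 15.0, 100.0)
```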

Load Balancing
[Figure: two continuous attribute lists B1 and B2 — (value, rid, cid) entries — partitioned between processors P0 and P1; after the split point SP is applied, the lists are repartitioned so that each processor again holds an equal share of each child's records. SP: split point, rid: record id, cid: class label]
Parallel SPRINT

• Attribute lists are partitioned among processors
- Each processor gets N/p records of each attribute list
• Attribute lists of continuous attributes are pre-sorted
• Split Determining Phase
- Categorical attributes
  • Local construction of count matrices, followed by a reduction to add them up
- Continuous attributes
  • Prefix-scan of class counts, followed by local Gini computations and a Gini-index reduction to find the best split point
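The prefix-scan step can be simulated in a few lines. This is a single-process sketch of the idea only — each "processor" is a list chunk, and the exclusive prefix of class counts is what real message passing (e.g., an MPI scan) would deliver:

```python
# Each (simulated) processor holds a chunk of a pre-sorted attribute
# list and needs the class counts of all records BEFORE its chunk, so
# its local Gini computations see globally correct left/right counts.

def prefix_counts(chunks, n_classes=2):
    """chunks: per-processor lists of class labels in attribute-list order.
    Returns, per processor, the exclusive prefix of class counts."""
    running = [0] * n_classes
    before = []
    for chunk in chunks:
        before.append(list(running))     # counts preceding this chunk
        for y in chunk:
            running[y] += 1
    return before

chunks = [[0, 0, 1], [0, 1, 1], [1, 1, 0]]   # 3 simulated processors
before = prefix_counts(chunks)
```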


Parallelizing the Split Determination Phase

• Easy.

[Figure: for a categorical attribute (CarType), each processor builds local count matrices that a parallel reduction combines into global ones; for a continuous attribute (Age), a parallel prefix scan gives each processor the class counts preceding its chunk of the sorted list]
Parallelizing Splitting Phase..

• SPRINT Approach:
- Builds the hash table on all processors by an all-to-all broadcast operation.
- So, O(N) data is sent and received by each processor --> not scalable in runtime or in memory requirements.


Scalable Parallel Formulation of


Splitting Phase [ScalParC]
• Challenging!!
- Requirement for consistent splitting, given
the relatively random sorted order among
attributes,
• Four cases:

Categorical/Categorical
Splitting a categorical attribute list requires access
exactly to those entries of the hash-table that are
created by the same processor


Continuous/Continuous

[Figure: attribute lists spread across processors P1-P3 — the hash-table entries a processor needs lie on other processors]

• The hash-table entries required by a processor are generated by other processors
- Need to develop a scheme to transfer these entries to the processors that need them

350
Getting the required entries of the
hash table
The required entries are transferred in two steps
- From splitting-attribute order to tid-sorted order
- From tid-sorted order to attribute-list order


This Design is Inspired by...

the communication structure of parallel sparse matrix-vector multiplication algorithms.

[Figure: a 9x9 sparse matrix distributed over P0-P2; X marks Salary entries, O marks Age entries, and the circled entries are the node-table entries each processor must fetch]
Parallel Hashing Paradigm
• Distributed hash table.
• Hash function: h(k) = (Pi, I) — owner processor and local index.
• Construction:
- (k,v) --> (Pi, I) --> form send buffers {(I,v)} for each Pi --> all-to-all personalized communication.
• Enquiry:
- (k) --> (Pi, I) --> form {(I)} buffers for each Pi --> all-to-all personalized comm --> each Pi replaces received {(I)} with {(v)} --> all-to-all personalized comm.
• If each processor has m keys to hash, the time is O(m) if m is Omega(p), i.e., Omega(p^2) keys overall.
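The paradigm can be simulated in a single process to make the data flow concrete. This is an illustrative sketch only — the "processors" are list indices, the per-owner send buffers stand in for the all-to-all personalized communication, and the hash function is an invented example:

```python
# Single-process simulation of the distributed hash table: each key maps
# to (owner processor, local index); updates are batched into per-owner
# buffers and then "delivered", mimicking the all-to-all exchange.

P = 4  # number of simulated processors

def h(key, table_size_per_proc=100):
    """h(k) = (owner processor, local index)."""
    return key % P, (key // P) % table_size_per_proc

def build(pairs):
    tables = [dict() for _ in range(P)]
    buffers = [[] for _ in range(P)]          # send buffers, one per owner
    for k, v in pairs:
        owner, idx = h(k)
        buffers[owner].append((idx, v))
    for owner, buf in enumerate(buffers):     # "all-to-all" delivery
        for idx, v in buf:
            tables[owner][idx] = v
    return tables

def enquire(tables, keys):
    return [tables[h(k)[0]][h(k)[1]] for k in keys]

tables = build([(10, "L"), (21, "R"), (7, "L")])
vals = enquire(tables, [21, 7])
```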


Applying the Paradigm to ScalParC: Update

[Figure: (a) the input records (rid, Salary, Age, cid); (b) the Salary and Age attribute lists partitioned across P0-P2, with the split point SP on Salary; (c) each processor forms hash buffers of (local index, child L/R) pairs addressed to the Node Table owners, and one all-to-all personalized communication updates the distributed Node Table mapping every rid to L or R]
Applying the Paradigm to ScalParC: Enquiry

[Figure: to split the Age attribute list, each processor forms enquiry buffers of the rids it needs, an all-to-all personalized communication delivers them to the Node Table owners, the owners substitute the child labels (L/R), and a second all-to-all returns the result buffers]

Verifying Scalability
• Updating the distributed hash table
- Each processor receives at most N/p entries
- Each processor can send at most Nk/p entries (k is the number of
continuous attributes)
• This is a worst case situation that does not happen frequently
• Enquiring the distributed hash table
- Each processor performs k enquiries, one for each continuous attribute
- Each processor receives at most N/p entries
- Each processor sends at most N/p entries
• The per-processor communication overhead is in general O(N/p)
• Each processor requires only O(N/p) memory, making ScalParC memory efficient
A worst case scenario for Updating
Hash Table
One processor might need to send O(kN/p) data! Happens infrequently.

[Figure: a worst case for updating the hash table — after the split point is applied to attribute lists B1 and B2, almost all the hash-table entries one processor needs were created on a single other processor]

Categorical/Continuous and
Continuous/Categorical
• Special cases of the Continuous/Continuous
- Categorical/Continuous does not require communication for loading the hash table
- Continuous/Categorical does not require communication for enquiring the hash table

Algorithm: Level-wise
Communications
• Tree is built in a breadth-first manner.
• At each level of the decision tree
- Count matrices for all attributes for all nodes are reduced in one single communication operation
- Loading the hash table for all the nodes is combined into a single communication operation
- Enquiring the hash table to split a particular attribute list is combined into a single communication operation
- A total of k+1 all-to-all personalized communication operations are performed for each level of the tree
High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 89 I

Structure of the Algorithm

• Sort continuous attributes (Pre-Sort)
• do level-wise while (at least one node requiring split)
- Compute count matrices (Find-Split I)
- Compute best gini index for nodes requiring split (Find-Split II)
- Partition splitting attributes and update Node Table (Perform-Split I)
- Partition non-splitting attributes by inquiring Node Table (Perform-Split II)
• end do
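The Find-Split phases above boil down to scanning each sorted continuous attribute while maintaining class counts below and above a candidate split point, and keeping the split with the lowest weighted gini index. A minimal serial sketch of that scan (an illustration, not the ScalParC code; `best_split` and its signature are assumptions):

```python
def gini(counts):
    """Gini impurity of a class-count vector."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def best_split(values, labels, nclasses=2):
    """Scan a continuous attribute in sorted order and return the
    (weighted gini, split point) pair with the lowest gini index."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    below = [0] * nclasses
    above = [0] * nclasses
    for i in order:
        above[labels[i]] += 1
    n, best = len(values), (float("inf"), None)
    for rank, i in enumerate(order[:-1], start=1):
        below[labels[i]] += 1   # move one record below the split
        above[labels[i]] -= 1
        g = (rank / n) * gini(below) + ((n - rank) / n) * gini(above)
        if g < best[0]:
            best = (g, (values[i] + values[order[rank]]) / 2)
    return best

# A perfectly separable attribute splits with gini 0 at midpoint 6.5.
print(best_split([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1]))  # (0.0, 6.5)
```

In the parallel algorithm the `below`/`above` count matrices are what gets sum-reduced across processors at each level.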
Example
[Figure: count matrices (local and global) for Salary and Age, and the Node Table before and after the split, distributed across processors P0-P2.]
Experimental Results
• Experiments were performed on a 128-processor Cray T3D
• Training sets were synthetically generated
- Each contained only continuous attributes (5-9)
- There were two possible class labels
Parallel Runtime

Constant size/processor

Different Number of Attributes

SMP Parallel Design Space


• Data parallelism: within a tree node
- Split attributes: vertical
• BASIC
• Fixed Window
• Moving Window
- Split data (attribute lists): horizontal
• Task parallelism: between tree nodes
- SUBTREE
- Static vs. dynamic load balancing
SPRINT: Attribute Lists
[Figure: a training set (Tid, Age, CarType, Class) split into per-attribute lists; continuous attributes are kept sorted, categorical attributes in original order.]
Splitting Attribute Lists


[Figure: attribute lists for node 0 are partitioned into lists for nodes 1 and 2 by the split Age < 27.5; a hash table maps each Tid to its child node.]
SPRINT: File per Attribute

[Figure: file layout with one file per attribute per leaf.]
TOTAL FILES PER ATTRIBUTE: 32
Optimized File Usage

[Figure: optimized file layout with files reused across leaves.]
TOTAL FILES PER ATTRIBUTE: 4
BASIC: Data Parallel
• Given current leaf frontier
• Atomically acquire an attribute and
evaluate it for all leaves
• Master finds winning attribute and
constructs hash table
• Atomically acquire an attribute and split
its list for all leaves
• Barrier synchronization between phases
BASIC (Tree View)


[Figure: all processors P = {0,1,2,3} work on every node of the tree, with a barrier between phases at each level.]
BASIC (Level View, A=3, P=4)
[Figure: level view for A=3 attributes on P=4 processors — P0(4), P1(4), P2(4), P3(0); one processor is left idle.]
Fixed Window: Data Parallel


• Partition leaves into blocks/window of K
• Dynamically acquire any attribute for
any leaf within current block and
evaluate it
• Last processor to work on a leaf notes
the winning attribute and builds the
hash table
• Split data as in BASIC
• Barrier synchronization between each
block of K leaves
FWK (Tree View)
[Figure: fixed-window schedule over the tree for P = {0,1,2,3}.]
FW2 (Level View)


[Figure: level view with window = 2 — P0(4), P1(4), P2(2), P3(2).]
363
Moving Window: Data Parallel
• Partition leaves into blocks/window of K
• Dynamically acquire any attribute for
any leaf, say i, within current block
• Wait for last block's i-th leaf
• Last processor to work on a leaf notes
the winning attribute and builds the
hash table
• Split data as in BASIC
• No barriers, only conditional wait
MWK (Tree Level)


[Figure: moving-window schedule for P = {0,1,2,3}; conditional synchronization between overlapping leaf blocks replaces the full barrier.]
MW2 (Level View)
[Figure: level view with window = 2 — P0(3), P1(3), P2(3), P3(3).]
SUBTREE: Task Parallel


• Processor group (P) and leaf frontier (L)
[] Apply BASIC, get new leaf frontier (NL)
[] if NL is empty, put processors in FreeQ,
else new group, NP = P + FreeQ
[] Master divides NP and NL into two parts
[] NP1 works on NL1, NP2 works on NL2
[] Initially all processors work on root
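The splitting step in SUBTREE can be sketched as a toy recursion: halve the processor group and the new leaf frontier, and recurse until either is of unit size. `subtree_assign` is an illustrative stand-in (an even split; the actual algorithm also recycles processors through the FreeQ):

```python
def subtree_assign(procs, leaves):
    """Recursively split the processor group and the leaf frontier;
    returns (leaf-subfrontier, processor-subgroup) pairs."""
    if len(procs) <= 1 or len(leaves) <= 1:
        return [(leaves, procs)]
    half, mid = len(procs) // 2, len(leaves) // 2
    # NP1 works on NL1, NP2 works on NL2
    return (subtree_assign(procs[:half], leaves[:mid]) +
            subtree_assign(procs[half:], leaves[mid:]))

print(subtree_assign([0, 1, 2, 3], ["a", "b", "c", "d"]))
```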

SUBTREE (4 procs)

[Figure: processors P0-P3 recursively split across subtrees; finished processors enter the FreeQ.]
Experimental Setup
• SMP Machines
- 112 MHz PowerPC-604 processors
- 16KB L1-cache, 1MB L2-cache
- Machine: 4-way SMP
• 128MB memory, 300MB disk
• Synthetic datasets: simple (F2) and complex (F7) decision trees
Experimental Datasets
Dataset      Function  #Attr  Num Records  DBSize  Num Levels  Max Nodes
F2A8D1M      F2        8      1000K        192MB   4           2
F2A32D250K   F2        32     250K         192MB   4           2
F2A64D125K   F2        64     125K         192MB   4           2
F7A8D1M      F7        8      1000K        192MB   60          4662
F7A32D250K   F7        32     250K         192MB   59          802
F7A64D125K   F7        64     125K         192MB   55          384
Setup and Sort Time


Dataset      Setup (sec)  Sort (sec)  Total (sec)  Setup%  Sort%
F2A8D1M      721          633         3597         20%     18%
F2A32D250K   685          598         3584         19%     17%
F2A64D125K   705          626         3665         19%     17%
F7A8D1M      989          817         23360        4%      4%
F7A32D250K   838          780         24706        3%      3%
F7A64D125K   672          636         22664        3%      3%
Parallel Performance
[Figure: execution times of MW4 vs. SUBTREE on F2A64D125K and F7A64D125K for 1-4 processors.]
Parallel Performance
[Figure: speedups of MW4 and SUBTREE on the F2 and F7 variants of A64D125K for 1-4 processors.]
Tutorial overview
• Overview of KDD and data mining
• Parallel design space
• Classification
• [ Associations ]
• Sequences
• Clustering
• Future directions and summary
What is association mining?

• Given a set of items/attributes, and a set


of objects containing a subset of the
items
• Find rules: if I1 then I2 (sup, conf)
• I1, I2 are sets of items
• I1, I2 have sufficient support: P(I1 + I2)
• Rule has sufficient confidence: P(I2 | I1)

Association mining
• User specifies "interestingness"
-Minimum support (minsup)
- Minimum confidence (minconf)
• Find all frequent itemsets (> minsup)
-Exponential Search Space
-Computation and I/O Intensive
• Generate strong rules (> minconf)
- Relatively cheap

Association Rule Discovery: Support and Confidence

Example: {Diaper, Milk} => Beer

s = σ(Diaper, Milk, Beer) / (Total Number of Transactions) = 2/5 = 0.4

c = σ(Diaper, Milk, Beer) / σ(Diaper, Milk) = 2/3 ≈ 0.66
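The two measures are direct to compute; the sketch below reconstructs a 5-transaction market-basket table consistent with the example's counts (the table itself is garbled in the source, so its exact contents are an assumption):

```python
# Hypothetical 5-transaction database matching the example's counts.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    """Number of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(antecedent, consequent):
    return sigma(antecedent | consequent) / len(transactions)

def confidence(antecedent, consequent):
    return sigma(antecedent | consequent) / sigma(antecedent)

print(support({"Diaper", "Milk"}, {"Beer"}))     # 0.4
print(confidence({"Diaper", "Milk"}, {"Beer"}))  # 0.666...
```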
Handling Exponential Complexity

• Given n transactions and m different items:
- number of possible association rules: O(m·2^(m-1))
- computation complexity: O(n·m·2^m)
• Systematic search for all patterns, based on support constraint [Agrawal & Srikant]:
- If {A,B} has support at least α, then both A and B have support at least α.
- If either A or B has support less than α, then {A,B} has support less than α.
- Use patterns of k-1 items to find patterns of k items.
Apriori Principle
• Collect single item counts. Find large items.
• Find candidate pairs, count them => large
pairs of items.
• Find candidate triplets, count them => large
triplets of items, And so on...
• Guiding Principle: Every subset of a
frequent itemset has to be frequent.
- Used for pruning many candidates.

Illustrating Apriori Principle
[Figure: items (1-itemsets) → pairs (2-itemsets) → triplets (3-itemsets), minimum support = 3.]
If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 41 candidates.
With support-based pruning: 6 + 6 + 2 = 14.
Counting Candidates
• Frequent Itemsets are found by counting
candidates.
• Simple way:
- Search for each candidate in each transaction. Expensive!
[Figure: N transactions matched against M candidates.]
Association Rule Discovery: Hash
tree for fast access.
[Figure: candidate hash tree — a hash function on items directs each candidate to a leaf.]
Association Rule Discovery: Subset


Operation
[Figure: the items of a transaction hashed at the root of the candidate hash tree.]
Association Rule Discovery: Subset
Operation (contd.)
[Figure: recursive descent of the transaction through the hash tree down to the candidate leaves.]
Parallel Formulation of Association


Rules
• Need:
- Huge transaction datasets (10s of TB)
- Large number of candidates.

• Data Distribution:
-Partition the Transaction Database, or
- Partition the Candidates, or
- Both
Parallel Association Rules: Count
Distribution (CD)
• Each Processor has complete candidate
hash tree.
• Each Processor updates its hash tree with
local data.
• Each Processor participates in global
reduction to get global counts of candidates
in the hash tree.
• Multiple database scans per iteration are
required if hash tree too big for memory.
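The CD scheme can be simulated in a few lines: each "processor" counts candidates over its own partition, and a global sum-reduction (an MPI Allreduce in a real implementation) merges the local counts. The partitioning and data below are illustrative:

```python
from collections import Counter
from itertools import combinations

def local_pair_counts(partition):
    """One processor's phase: count candidate pairs over local data."""
    c = Counter()
    for t in partition:
        c.update(combinations(sorted(t), 2))
    return c

# Two simulated processors over a partitioned database; the sum below
# plays the role of the global reduction of counts.
db = [{"A", "B", "C"}, {"A", "B"}, {"B", "C"}, {"A", "C"}]
parts = [db[:2], db[2:]]
global_counts = sum((local_pair_counts(p) for p in parts), Counter())
print(global_counts)
```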

CD: Illustration

[Figure: each processor counts candidates on its local partition; a global reduction of counts follows.]
Parallel Association Rules: Data
Distribution (DD)
• Candidate set is partitioned among the
processors.
• Once local data has been partitioned, it is
broadcast to all other processors.
• High Communication Cost due to data
movement.
• Redundant work due to multiple traversals
of the hash trees.

DD: Illustration
[Figure: candidates partitioned among processors; each processor's local data is broadcast all-to-all so every processor sees every transaction.]
Parallel Association Rules: Intelligent
Data Distribution (IDD)
• Data Distribution using point-to-point
communication.
• Intelligent partitioning of candidate sets.
- Partitioning based on the first item of candidates.
- Bitmap to keep track of local candidate items.
• Pruning at the root of candidate hash tree using the
bitmap.
• Suitable for single data source such as database
server.
• With smaller candidate set, load balancing is
difficult.
IDD: Illustration
[Figure: data shifted point-to-point around the processors; each processor counts only its own candidate partition.]
Filtering Transactions in IDD
[Figure: a bitmask of local candidate first-items filters each transaction before the hash-tree traversal.]
Parallel Association Rules: Hybrid


Distribution (HD)
• Candidate set is partitioned into G groups to
just fit in main memory
- Ensures Good load balance with smaller
candidate set.
• Logical processor mesh G x P/G is formed.
• Perform IDD along the column processors
- Data movement among processors is
minimized.
• Perform CD along the row processors
- Smaller number of processors in the global reduction operation.
HD: Illustration
[Figure: logical mesh with P/G processors per group; IDD within each group, CD along the rows across groups.]
Parallel Association Rules:


Experimental Setup
• 128-processor Cray T3D
- 150 MHz DEC Alpha (EV4)
- 64 MB of main memory per processor
- 3-D torus interconnection network with peak unidirectional
bandwidth of 150 MB/sec.
• MPI used for communications.
• Synthetic data set: avg transaction size 15 and 1000
distinct items.
• For larger data sets, multiple read of transactions in
blocks of 1000.
• HD switches to CD after 90.7% of the total computation is done.
Parallel Association Rules: Scaleup
Results (100K, 0.25%)
[Figure: scaleup of CD, IDD, HD, and DD as the number of processors grows to 140.]
Parallel Association Rules: Sizeup


Results (np=16, 0.25%)
[Figure: sizeup of CD, IDD, and HD as transactions per processor grow to 1000K.]
Parallel Association Rules: Response
Time (np=16, 50K)
[Figure: response times of CD, HD, IDD, and simple hybrid as minimum support drops from 0.5% to 0.06%.]
Parallel Association Rules: Response


Time (np=64, 50K)
[Figure: response times of CD, HD, IDD, and simple hybrid as minimum support drops from 0.5% to 0.04%.]
Parallel Association Rules:
Minimum Support Reachable

[Figure: minimum supports reachable — 0.25, 0.15, 0.06, 0.03 vs. 0.2, 0.1, 0.04, 0.02.]
Parallel Association Rules:


Processor Configuration in HD

64 processors and 0.04% minimum support
[Figure: response time of HD for different processor-mesh configurations.]
Parallel Association Rules:
Summary of Experiments
[] HD shows the same linear speedup and
sizeup behavior as that of CD.
[] HD Exploits Total Aggregate Main
Memory, while CD does not.
[] IDD has much better scaleup behavior
than DD

Eclat Approach

[] Frequent itemset lattice


[] Vertical or "inverted" tid-list format
[] Support counting via intersections
[] Lattice decomposition: break into
subproblems
[] Efficient search strategies
[] Independent solution of subproblems
Example Database

Distinct database items: A (Jane Austen), C (Agatha Christie), D (Sir Arthur Conan Doyle), T (Mark Twain), W (P. G. Wodehouse)

Database:
Transaction  Items
1            A C T W
2            C D W
3            A C T W
4            A C D W
5            A C D T W
6            C D T

All frequent item-sets:
Support   Item-sets
100% (6)  C
83% (5)   W, CW
67% (4)   A, D, T, AC, AW, CD, CT, ACW
50% (3)   AT, DW, TW, ACT, ATW, CDW, CTW, ACTW

MAXIMAL ITEM-SETS (MIN SUPPORT = 50%): ACTW, CDW
Frequent Itemset Lattice
[Figure: the itemset lattice over {A, C, D, T, W}; frequency is downward closed on item-set support. Maximal frequent item-sets: ACTW, CDW.]
Eclat: Support Counting
[Figure: per-item tid-lists; intersecting the tid-lists of C and D yields the tid-list for CD, and intersecting CD and CW yields CDW.]

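The vertical format makes support counting a set intersection; the sketch below uses the tutorial's own example database (transactions 1:ACTW, 2:CDW, 3:ACTW, 4:ACDW, 5:ACDTW, 6:CDT):

```python
# Vertical ("inverted") tid-lists for the example database.
tidlists = {
    "A": {1, 3, 4, 5},
    "C": {1, 2, 3, 4, 5, 6},
    "D": {2, 4, 5, 6},
    "T": {1, 3, 5, 6},
    "W": {1, 2, 3, 4, 5},
}

def support(itemset):
    """Support of an itemset via tid-list intersection."""
    return len(set.intersection(*(tidlists[i] for i in itemset)))

# CD = C & D -> {2, 4, 5, 6}; CDW = CD & CW -> {2, 4, 5}, i.e. 50% of 6.
print(support("CD"), support("CDW"), support("ACTW"))
```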
Eclat: Lattice Decomposition


[Figure: the frequent-itemset lattice decomposed into prefix-based equivalence classes.]
Equivalence classes:
[A] = { A, AC, AT, AW, ACT, ACW, ATW, ACTW }
[C] = { C, CD, CT, CW, CDW, CTW }
[D] = { D, DW }   [T] = { T, TW }   [W] = { W }
Cross-class links used for pruning.
Lattice Search Strategies

• Bottom-up
-Level-wise like Apriori
• Top-down
- Start with largest element. If frequent,
done! Else look at subsets
• Hybrid
- Find long frequent/maximal itemsets
- Find remaining using bottom-up search
Bottom-up Lattice Search
[Figure: bottom-up expansion within class [A].]
New equivalence classes:
[AC] = { AC, ACT, ACW, ACTW }
[AT] = { AT, ATW }
[AW] = { AW }
Known maximal frequent itemsets: CDW, CTW
Top-down Lattice Search
[Figure: top-down search from ACDTW.]
Minimal infrequent item-set: AD
Maximal frequent item-set: ACTW
Parallel Eclat/MaxEclat
• Partition tidlists among processors
• Schedule equivalence classes
• Selectively replicate database
• Solve Asynchronously/Independently
• Algorithm Space
- Eclat (Bottom-up)
- MaxEclat (Hybrid)

• Effective aggregate memory utilization


Count Distribution
[Figure: three processors over a partitioned database — count items; get frequent items; form candidate pairs; count pairs; get frequent pairs; form candidate triples; count triples; ...]
Parallel Eclat
[Figure: the database partitioned across three processors, then selectively replicated so each processor holds the tid-lists its assigned classes need.]
Parallel Eclat Algorithm
Item-set lattice class partitioning — equivalence classes:
[A] = { AC, AT, AW }
[C] = { CD, CT, CW }
[D] = { DW }
[T] = { TW }

Class schedule: Processor 0 (P0): [A], [T]; Processor 1 (P1): [C], [D]

[Figure: tid-list communication — the original database, the partitioned database, and the layout after the tid-list exchange.]
Eclat Experiments
[] Database: T20.I6.D4550K
- 1000 items
- 4,550,000 transactions
[] Machine: eight 4-way SMP nodes
- Digital Memory Channel (5µs, 30MB/s)
- 233 MHz, 256 MB, 1MB L2 cache
[] Hierarchical parallelization
- Message passing + SMP
Parallel Performance
[Figure: parallel performance and speedup with 1 process per host, for 1-8 hosts (H1-H8).]
Parallel Performance
[Figure: parallel performance and speedup with 2 processes per host, for 1-8 hosts (H1-H8).]
Tutorial overview
• Overview of KDD and data mining
• Parallel design space
• Classification
• Associations
• [ Sequences ]
• Clustering
• Future directions and summary
Discovering Sequential Associations


Given:
A set of objects with associated event occurrences.
[Figure: event timelines (0-50) for Object 1 and Object 2.]
Sequential Pattern Discovery:
Definition
Given a set of objects, each associated with its own timeline of events, find rules that predict strong sequential dependencies among different events.

{ (A B) (C) } => (D E)

Rules are formed by first discovering patterns. Event occurrences in the patterns are governed by timing constraints.

[Figure: pattern (A B) (C) (D E) annotated with maximum gap (xg), window size (ws), and span (ms) constraints.]
Sequential Patterns: Complexity


Much higher computational complexity than association rule discovery.
- O(m^k · 2^(k-1)) possible sequential patterns having k events, where m is the total number of possible events.
Time constraints offer some pruning. Further use of support-based pruning contains complexity.
- A subsequence of a sequence occurs at least as many times as the sequence.
- A sequence has no more occurrences than any of its subsequences.
- Build sequences in increasing number of events. [GSP algorithm by Srikant & Agrawal]
Apriori-like Approach

• Make multiple scans


• Join frequent patterns to generate new
candidate sequences
• Example:
-(2 3) (4 6) (7)joins with (3) (4 6) (7) (8) to
create (2 3) (4 6) (7) (8)
-(1 2) with (2 3) creates (1 2 3)
-(1) with (2) creates (1) (2) and (1 2)
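The join rule in these examples can be sketched as code. A sequence is represented here as a tuple of sorted item-tuples; two frequent sequences join when dropping the first item of one equals dropping the last item of the other (a simplified GSP-style join, with the special case for 1-sequences):

```python
def drop_first(seq):
    """Remove the first item of the first element of a sequence."""
    head = seq[0][1:]
    return ((head,) if head else ()) + seq[1:]

def drop_last(seq):
    """Remove the last item of the last element of a sequence."""
    tail = seq[-1][:-1]
    return seq[:-1] + ((tail,) if tail else ())

def join(s1, s2):
    """GSP-style candidate join; a sequence is a tuple of item-tuples."""
    if len(s1) == 1 and len(s1[0]) == 1 and len(s2) == 1 and len(s2[0]) == 1:
        a, b = s1[0][0], s2[0][0]
        out = [(s1[0], s2[0])]          # (a)(b)
        if a < b:
            out.append(((a, b),))       # (a b)
        return out
    if drop_first(s1) != drop_last(s2):
        return []
    last = s2[-1]
    if len(last) == 1:
        return [s1 + (last,)]           # append as a new element
    merged = tuple(sorted(s1[-1] + (last[-1],)))
    return [s1[:-1] + (merged,)]        # extend s1's last element

# (2 3)(4 6)(7) joined with (3)(4 6)(7)(8) gives (2 3)(4 6)(7)(8)
print(join(((2, 3), (4, 6), (7,)), ((3,), (4, 6), (7,), (8,))))
```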
Sequential Apriori:
Count operation
• Various possibilities for counting occurrences
of a sequence in an object's timeline
- Count only one occurrence per object.
- Count the number of span-size windows the sequence occurs in.
- Count the number of distinct occurrences of a sequence:
• Each event-timestamp pair considered at most once.
• Each counted occurrence has at least one new event-timestamp pair.

Sequential Apriori:
Count operation (contd..)
• Hash Tree used for fast search of candidate
occurrences. Similar to association rule
discovery, except for following differences.
- Every event-timestamp pair in the timeline is hashed at the root.
- Events eligible for hashing at the next level are determined by the maximum gap (xg), window size (ws), and span (ms) constraints.
Sequence Hash Tree for Fast Access


[Figure: a candidate sequence hash tree probed with the object timeline 1@0, 2@5, 3@5, 4@0, 5@5, 1@12, 1@15 (ms = 20); four interesting paths: 1@0 2@5 3@5 · 1@0 3@5 · 4@0 5@5 · 1@12 1@15.]
Counting Leaf Level Candidates
[Figure: part of the candidate sequence hash tree probed with Object 2's timeline; Count = 12.]
Parallel Sequential Associations


• Need:
- Enormity of data.
- Memory and disk limitations of serial algorithms running on a single processor.
• Can algorithms for non-sequential associations be extended easily?
- The sequential nature gives rise to complex issues:
• Longer timelines
• Large span values
• Large number of candidates
Parallel Sequential Associations:
Event Distribution (EVE)
[Figure: events partitioned across P0-P2; an all-to-all broadcast of support counts follows local counting.]
EVE Algorithm: Challenging Case


[Figure: sequences crossing partition boundaries require transferring hash-tree states and partial counts between processors.]
Event and Candidate Distribution
(EVECAN)
[Figure: candidates rotated among processors in round-robin manner while events stay in place.]
EVECAN: Parallelizing Candidate


Generation
• Candidates stored in a distributed hash table
• Hash function: candidate sequence S => h(S) = (Pi, l)
• Scalable communication mechanism to construct and probe
The SPADE Approach
• Search space is a lattice
• Use "inverted" vertical lists
• Find support via list intersections
• Break large search space into small
independent suffix-based chunks
• Process each chunk recursively using
breadth/depth-first search
• Takes only a few data scans
Example Database
[Table: three customers with time-stamped April transactions over items A, C, D, T, W, and the frequent sequences at 75% support — e.g., 100% (3): C, D, T, W, C->D, C->T, C->W, D->T, D->W, TW, C->TW, D->TW.]

Maximal frequent sequences: AC->D, AC->TW, C->D->TW
Frequent sequence lattice
[Figure: lattice induced by the maximal frequent sequences AC->D, AC->TW, C->D->TW.]
Sequence lattice decomposition

[Figure: the sequence lattice decomposed into suffix-based classes.]
Temporal Joins
[Figure: customer/transaction id-list intersection — joining the id-lists of P->X and P->Y yields id-lists for P->X->Y, P->Y->X, and P->XY.]
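The temporal join itself is a small comparison loop over (cid, tid) pairs; the sketch below is illustrative (a real implementation would merge sorted lists rather than use the quadratic loop):

```python
def temporal_join(xs, ys):
    """Join the id-lists of P->X and P->Y (entries are (cid, tid) pairs)
    into id-lists for P->X->Y, P->Y->X, and the event atom P->XY."""
    seq_xy, seq_yx, event_xy = set(), set(), set()
    for cx, tx in xs:
        for cy, ty in ys:
            if cx != cy:
                continue
            if tx < ty:
                seq_xy.add((cy, ty))    # X strictly before Y
            elif ty < tx:
                seq_yx.add((cx, tx))    # Y strictly before X
            else:
                event_xy.add((cx, tx))  # same transaction: XY together
    return sorted(seq_xy), sorted(seq_yx), sorted(event_xy)

print(temporal_join([(1, 10), (1, 30), (2, 20)], [(1, 20), (2, 20)]))
```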
Dynamic Computation Tree


Parallel Design Space

• Data parallelism: within a class
- Idlist parallelism
• Single idlist join
• Level-wise idlist join
- Join parallelism
• Task parallelism: between classes
- pSPADE
- Static vs. dynamic load balancing
Idlist Data Parallelism
• Single idlist join
- Split idlists on cid range (P0: cid in range 1-500; P1: cid in range 501-1000)
- Each proc intersects over its cid range
- Barrier synchronization for each join
[Figure: parallel idlist intersection.]
Idlist Data Parallelism
• Level-wise idlist join (P0: cid in range 1-500; P1: cid in range 501-1000)
- Process all classes at current level
- Each proc still intersects over its local DB
- Barrier synchronization and sum-reduction for each level
[Figure: parallel pair-wise intersections.]
Join Data Parallelism
• Each processor performs different intersections within a class (each idlist assigned to a different processor)
• Barrier after self-joining within a class, before processing classes at next level
• # synchronizations is between Single and Level-wise parallelism
Task Parallelism: Static Load Balancing
• Given level 1 classes, C1, C2, C3, ...
• Assign weights W1, W2, W3, ... based on the class size (# sequences in the class)
• Greedy scheduling of entire classes
- Sort classes on weight (descending)
- Assign class to proc with least total weight
• Each proc solves assigned classes asynchronously
• Hard to get accurate work estimate
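The greedy scheduling step can be sketched with a min-heap of processor loads (an illustrative stand-in; names and data are assumptions):

```python
import heapq

def greedy_schedule(weights, nprocs):
    """Assign classes to processors: sort by weight (descending), then
    give each class to the currently least-loaded processor."""
    heap = [(0.0, p) for p in range(nprocs)]
    heapq.heapify(heap)
    assignment = {p: [] for p in range(nprocs)}
    for cls, w in sorted(weights.items(), key=lambda kv: -kv[1]):
        load, p = heapq.heappop(heap)
        assignment[p].append(cls)
        heapq.heappush(heap, (load + w, p))
    return assignment

print(greedy_schedule({"A": 9, "B": 5, "C": 4, "D": 1}, nprocs=2))
```

Each processor then solves its assigned classes asynchronously; the scheme's weakness is exactly the bullet above — the weights are only rough estimates of actual work.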

Static and Dynamic Load Balancing
[Figure omitted.]
Task Parallelism:
Dynamic Load Balancing
• Given level 1 classes, C1, C2, C3, ... and
their weights
• Sort classes on weight (descending) and
insert in a logical task queue
• A proc atomically grabs the first available
class and solves it completely
• Repeat until no more classes in queue
• Uses only inter-class parallelism
• Queue contention negligible since classes
are coarse-grained
Task Parallelism:
Recursive DLB (pSPADE)
• Process available classes in parallel
• Worst case: P-1 procs free, 1 proc busy
• Classes sorted on weight, last class
usually small (but it could be large)
• Provide mechanism for free procs to
join busy group
• At each level, get free procs, insert the
classes into a shared task queue;
process available classes in parallel
pSPADE: Recursive Dynamic Load Balancing
[Figure omitted.]
Experimental Setup
• SGI Origin 2000
- 12 195MHz R10K MIPS processors
- 4MB cache, 2GB memory
• Synthetic datasets
- C: number of transactions/customer
- T: average transaction size
- S/I: average sequence/itemset size
- D: number of customers
Data vs. Task Parallelism
[Figure: execution times of Data vs. Task parallelism on C10-T5-S4-I1.25-D1M and C20-T5-S8-I1.25-D1M for 1, 2, and 4 processors.]
Static vs. Dynamic DLB
[Figure: execution times of Static, Dynamic, and Recursive on C10T5S4I1.25, C20T5S8I1.25, and C20T5S8I2.]
• Dynamic is up to 38% better than Static
• Recursive is up to 12% better than Dynamic
• Overall, Recursive is 44% better than Static
Parallel Performance
[Figure: execution time and speedup on C10-T5-S4-I1.25-D1M for 1-8 processors.]
Parallel Performance
[Figure: execution time and speedup on C20-T5-S8-I2-D1M for 1-8 processors.]
Scaleup and Support
[Figure: scaleup on C5-T2.5-S4-I1.25 up to 10 million customers, and response time on C5-T2.5-S4-I1.25-D1M as minimum support drops from 0.10% to 0.01%.]
pSPADE Summary
• Task parallel better than data parallel
• Dynamic load balancing
• Asynchronous algorithms (independent
classes)
• Good locality (uses only intersections)
• Good speedup and scaleup
• What next? Gigabyte databases
1
High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 196 J

Tutorial overview
• Overview of KDD and data mining
• Parallel design space
• Classification
• Associations
• Sequences
• [ Clustering ]
• Future directions and summary
High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 197

What is clustering?

Given N k-dimensional feature vectors,
find a "meaningful" partition of the N
examples into c subsets or groups
• Discover the "labels" automatically
• c may be given, or "discovered"
• Much more difficult than classification,
since in the latter the groups are given,
and we seek a compact description
High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 198

Clustering Illustration
[Figure: Euclidean distance based clustering in 3-D space.]
High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 199

Clustering

• Have to define some notion of
"similarity" between examples
• Goal: maximize intra-cluster similarity
and minimize inter-cluster similarity
• Feature vectors may be
- All numeric (well-defined distances)
- All categorical, or mixed (harder to define
similarity; geometric notions don't work)
High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 200

Clustering schemes

• Distance-based
- Numeric
• Euclidean distance (root of sum of squared
differences along each dimension)
• Angle between two vectors
- Categorical
• Number of common features (categorical)
• Partition-based
- Enumerate partitions and score each
High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 201 I
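The distance measures listed above can be sketched directly; the function names here are illustrative, not from the tutorial:

```python
import math

def euclidean(x, y):
    # Root of the sum of squared differences along each dimension.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def angle(x, y):
    # Angle between two numeric vectors, via the cosine identity.
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return math.acos(dot / norms)

def common_features(x, y):
    # Categorical similarity: number of attributes with the same value.
    return sum(1 for a, b in zip(x, y) if a == b)
```

For categorical or mixed data, `common_features` counts matching attributes instead of relying on a geometric distance.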

Clustering schemes

• Model-based
- Estimate a density (e.g., a mixture of
Gaussians)
- Go bump-hunting
- Compute P(Feature Vector i | Cluster j)
- Finds overlapping clusters too
- Example: Bayesian clustering
High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 202

Before clustering

• Normalization:
- Given three attributes
• A -- micro-seconds
• B -- milli-seconds
• C -- seconds
- Can't treat differences as the same in all
dimensions or attributes
- Need to scale or normalize for comparison
- Can assign weights for more importance
High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 203
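As a concrete (assumed) example of the scaling step, z-score normalization rescales each attribute to zero mean and unit variance; the optional weight implements the "more importance" bullet. The tutorial does not prescribe a particular formula:

```python
def zscore_normalize(column):
    # Rescale one attribute so that values measured in micro-seconds,
    # milli-seconds or seconds all contribute comparably to a distance.
    n = len(column)
    mean = sum(column) / n
    std = (sum((v - mean) ** 2 for v in column) / n) ** 0.5
    return [(v - mean) / std for v in column]

def apply_weight(column, weight):
    # Optionally scale a normalized attribute to give it more importance.
    return [v * weight for v in column]
```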

The k-means algorithm

• Specify 'k', the number of clusters
• Guess 'k' seed cluster centers
• 1) Look at each example and assign it
to the center that is closest
• 2) Recalculate the centers
• Iterate on steps 1 and 2 till the centers
converge, or for a fixed number of times
High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 204
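The steps above can be sketched as follows; a minimal illustration, not the tutorial's own code, in which the k seeds are sampled from the data:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    # Guess k seed cluster centers by sampling from the data.
    centers = random.Random(seed).sample(points, k)
    for _ in range(iters):
        # Step 1: assign each example to the closest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # Step 2: recalculate each center as the mean of its cluster.
        new_centers = [
            tuple(sum(vals) / len(cl) for vals in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centers == centers:  # centers converged
            break
        centers = new_centers
    return centers
```

With well-separated groups this converges in a few iterations; production implementations add better seeding and empty-cluster handling.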

K-means algorithm

[Figure: data points with the initial seed centers marked.]
High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 205

K-means algorithm

[Figure: points reassigned and the new centers computed.]
High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 206

K-means algorithm

[Figure: the final converged centers.]
High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 207

Operations in k-means
• Main Operation: Calculate distance to
all k means or centroids
• Other operations:
- Find the closest centroid for each point
- Calculate mean squared error (MSE) for
all points
- Recalculate centroids
High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 208

Parallel k-means
• Divide N points among P processors
• Replicate the k centroids
• Each processor computes distance of
each local point to the centroids
• Assign points to closest centroid and
compute local MSE
• Perform reduction for global centroids
and global MSE value
I High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 209 I
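One iteration of the above can be sketched as follows. For clarity the P local computations are written as a plain loop standing in for concurrent execution, and the final reduction sums the partial centroid contributions and squared errors; all names are illustrative:

```python
def kmeans_step_parallel(points, centers, nprocs=4):
    # Divide the N points among the P processors; centroids are replicated.
    chunks = [points[i::nprocs] for i in range(nprocs)]
    k, dim = len(centers), len(centers[0])

    def local_work(chunk):
        # Each processor: distance of every local point to all k centroids,
        # assignment to the closest one, and the local squared error.
        sums = [[0.0] * dim for _ in range(k)]
        counts = [0] * k
        sse = 0.0
        for p in chunk:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            j = dists.index(min(dists))
            counts[j] += 1
            sse += dists[j]
            for d in range(dim):
                sums[j][d] += p[d]
        return sums, counts, sse

    # Reduction: combine partial sums, counts and errors for the
    # global centroids and global MSE.
    partials = [local_work(ch) for ch in chunks]  # would run concurrently
    tot_counts = [sum(r[1][j] for r in partials) for j in range(k)]
    new_centers = [
        tuple(sum(r[0][j][d] for r in partials) / tot_counts[j] for d in range(dim))
        if tot_counts[j] else centers[j]
        for j in range(k)
    ]
    mse = sum(r[2] for r in partials) / len(points)
    return new_centers, mse
```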

Gaussian mixture models

• Drawbacks of K-means
- Doesn't do well with overlapping clusters
- Clusters easily pulled off-center by outliers
- Each record is either in or out of a cluster;
no notion of some records being more or
less likely than others to really belong to
the cluster they have been assigned to
• GMM: probabilistic variant of K-means
High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 210

Estimation-maximization

• Choose K seeds: means of a Gaussian
distribution
• Estimation: calculate probability of
belonging to a cluster based on distance
• Maximization: move mean of Gaussian
to centroid of data set, weighted by the
contribution of each point
• Repeat till means don't move
High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 211
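For a one-dimensional mixture the two steps can be sketched as below. This simplified version assumes equal component weights and a shared fixed variance (assumptions made for brevity, not from the tutorial), so only the means are re-estimated:

```python
import math

def em_means(data, means, var=1.0, iters=100):
    # Simplified EM for a mixture of equal-weight Gaussians with a
    # shared fixed variance: only the means move.
    for _ in range(iters):
        # E-step: responsibility of each component for each point,
        # based on its distance from each mean.
        resp = []
        for x in data:
            w = [math.exp(-((x - m) ** 2) / (2.0 * var)) for m in means]
            total = sum(w)
            resp.append([wi / total for wi in w])
        # M-step: move each mean to the responsibility-weighted centroid.
        new_means = []
        for j in range(len(means)):
            num = sum(r[j] * x for r, x in zip(resp, data))
            den = sum(r[j] for r in resp)
            new_means.append(num / den)
        if all(abs(a - b) < 1e-9 for a, b in zip(new_means, means)):
            break  # means stopped moving
        means = new_means
    return means
```

Unlike k-means, each point contributes fractionally to every mean through its responsibilities, which is what lets GMMs handle overlapping clusters.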

Deviation/outlier detection

• Find points that are very different from
the other points in the dataset
• Could be "noise", that causes problems
for classification or clustering
• Could be the really "interesting" points,
for example, in fraud detection, we are
mainly interested in finding the
deviations from the norm
High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 212 I
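One simple (assumed) way to flag points that are very different from the rest is a z-score rule; the threshold is a tunable parameter, not a value from the tutorial:

```python
def deviations(values, threshold=3.0):
    # Flag values lying more than `threshold` standard deviations
    # from the mean -- a simple notion of "very different".
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [v for v in values if abs(v - mean) > threshold * std]
```

Whether such points are discarded as noise or examined as the interesting cases (e.g., in fraud detection) depends on the application.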

Deviation detection

[Figure: scatter of points with one outlier marked.]
High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 213

K-nearest neighbors

• Classification technique to assign a
class to a new example
• Find k-nearest neighbors, i.e., most
similar points in the dataset (compare
against all points!)
• Assign the new case to the same class
to which most of its neighbors belong

High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 214 I
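The technique can be sketched as follows (illustrative names; note the comparison against all points):

```python
def knn_classify(dataset, query, k=3):
    # dataset: list of (feature_vector, class_label) pairs.
    # Full scan: sort all points by squared Euclidean distance to the
    # query and keep the k most similar ones.
    nearest = sorted(
        dataset,
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], query)),
    )[:k]
    labels = [label for _, label in nearest]
    # Assign the class most of the neighbors belong to.
    return max(set(labels), key=labels.count)
```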

K-nearest neighbors

[Figure: a new example whose neighborhood contains 5 points of
class 0 and 3 of class +, so it is assigned class 0.]
High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 215

Tutorial overview
• Overview of KDD and data mining
• Parallel design space
• Classification
• Associations
• Sequences
• Clustering
• [ Future directions and summary ]
High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 216
Large-scale Parallel KDD Systems

• Terabyte-sized datasets
• Centralized or distributed datasets
• Incremental changes
• Heterogeneous data sources
• Pre-processing, mining, post-processing
• Modular (rapid development)
• Benchmarking (algorithm comparison)
I High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 217 ]

Research Directions
• Fast algorithms: different mining tasks
- Classification, clustering, associations, etc.
- Incorporating concept hierarchies
• Parallelism and scalability
- Millions of records
- Thousands of attributes/dimensions
- Single pass algorithms
- Sampling
- Parallel I/O and file systems
High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 218 I

Research Directions (contd.)

• Data locality and type
- Distributed data sources (WWW)
- Text and multimedia mining
- Spatial data mining
• Incremental mining: refine knowledge
as data changes
• Interactivity: anytime mining

High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 219 I

Research Directions (contd.)

• Tight database integration
- Push common primitives inside the DBMS
- Use multiple tables
- Use efficient indexing techniques
- Caching strategies for a sequence of data
mining operations
- Data mining query language and parallel
query optimization
High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 220

Research Directions (contd.)

• Understandability: too many patterns
- Incorporate background knowledge
- Integrate constraints
- Meta-level mining
- Visualization
• Usability: build a complete system
- Pre-processing, mining, post-processing,
persistent management of mined results
High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 221

Summary Of the Tutorial


• Data mining is a rapidly growing field
- Fueled by enormous data collection rates, and
need for intelligent analysis for business and
scientific gains.
• The large, high-dimensional nature of the
data requires new analysis techniques
and algorithms.
• Scalable, fast parallel algorithms are
becoming indispensable.
• Many research and commercial
opportunities!!!
I High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 222I
Resources
Workshops
• HiPC Special Session on Large-Scale Data Mining, 2000.
http://www.cs.rpi.edu/~zaki/LSDM/
• ACM SIGKDD Workshop on Distributed Data Mining, 2000.
http://www.eecs.wsu.edu/~hillol/DKD/dpkd2000.html
• 3rd IEEE IPDPS Workshop on High Performance Data Mining, 2000.
http://www.cs.rpi.edu/~zaki/HPDM/
• ACM SIGKDD Workshop on Large-Scale Parallel KDD Systems, 1999.
http://www.cs.rpi.edu/~zaki/WKDD99/
• ACM SIGKDD Workshop on Distributed Data Mining, 1998.
http://www.eecs.wsu.edu/~hillol/DDMWS/papers.html
• 1st IEEE IPPS Workshop on High Performance Data Mining, 1998.
http://www.cise.ufl.edu/~ranka/

Books
• A. Freitas and S. Lavington. Mining very large databases with parallel processing. Kluwer Academic
Pub., Boston, MA, 1998.
• M.J. Zaki and C.-T. Ho (eds). Large-Scale Parallel Data Mining. LNAI State-of-the-Art Survey,
Volume 1759, Springer-Verlag, 2000.
• H. Kargupta and P. Chan (eds). Advances in Distributed and Parallel Knowledge Discovery, AAAI
Press, Summer 2000.

High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 223

Resources (contd.)
Journal Special Issues
• P. Stolorz and R. Musick (eds.). Scalable High-Performance Computing for KDD, Data Mining and
Knowledge Discovery: An International Journal, Vol. 1, No. 4, December 1997.
• Y. Guo and R. Grossman (eds.). Scalable Parallel and Distributed Data Mining, Data Mining and
Knowledge Discovery: An International Journal, Vol. 3, No. 3, September 1999.
• V. Kumar, S. Ranka and V. Singh. High Performance Data Mining, Journal of Parallel and Distributed
Computing, forthcoming, 2000.

Survey Articles
• F. Provost and V. Kolluri. A survey of methods for scaling up inductive algorithms. Data Mining and
Knowledge Discovery: An International Journal, 3(2):131--169, 1999.
• A. Srivastava, E.-H. Han, V. Kumar and V. Singh. Parallel formulations of decision-tree classification
algorithms. Data Mining and Knowledge Discovery: An International Journal, 3(3):237--262, 1999.
• M.J. Zaki. Parallel and distributed association mining: A survey. In IEEE Concurrency special issue
on Parallel Data Mining, 7(4):14-25, Oct-Dec 1999.
• D. Skillicorn. Strategies for parallel data mining. IEEE Concurrency, 7(4):26--35, Oct-Dec 1999.
• M.V. Joshi, E.-H. Han, G. Karypis and V. Kumar. Efficient parallel algorithm for mining associations.
In Zaki and Ho (eds.), Large-Scale Parallel Data Mining, LNAI 1759, Springer-Verlag 2000.
• M.J. Zaki. Parallel and distributed data mining: An introduction. In Zaki and Ho (eds.), Large-Scale
Parallel Data Mining, LNAI 1759, Springer-Verlag 2000.

High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 224

References: Classification
• J. Chattratichat, J. Darlington, M. Ghanem, Y. Guo, H. Huning, M. Kohler, J. Sutiwaraphun, H. W. To, and
Y. Dan. Large scale data mining: Challenges and responses. In 3rd Intl. Conf. on Knowledge Discovery and
Data Mining, August 1997.
• S. Goil and A. Choudhary. Efficient parallel classification using dimensional aggregates. In Zaki and Ho
(eds), Large-Scale Parallel Data Mining, LNAI Vol. 1759, Springer-Verlag 2000.
• M. Holsheimer, M. L. Kersten, and A. Siebes. Data Surveyor: Searching the nuggets in parallel. In Fayyad
et al.(eds.), Advances in KDD, AAAI Press, 1998.
• M. Joshi, G. Karypis, and V. Kumar. ScalParC: A scalable and parallel classification algorithm for mining
large datasets. In Intl. Parallel Processing Symposium, 1998.
• R. Kufrin. Decision trees on parallel processors. In J. Geller, H. Kitano, and C. Suttner, editors, Parallel
Processing for Artificial Intelligence 3. Elsevier-Science, 1997.
• S. Lavington, N. Dewhurst, E. Wilkins, and A. Freitas. Interfacing knowledge discovery algorithms to large
databases management systems. Information and Software Technology, 41:605--617, 1999.
• F. Provost and J. Aronis. Scaling up inductive learning with massive parallelism. Machine Learning, 23(1),
April 1996.
• F. Provost and V. Kolluri. A survey of methods for scaling up inductive algorithms. Data Mining and
Knowledge Discovery: An International Journal, 3(2):131--169, 1999.
• John Shafer, Rakesh Agrawal, and Manish Mehta. SPRINT: A scalable parallel classifier for data mining. In
Proc. of the 22nd Int'l Conference on Very Large Databases, Bombay, India, September 1996.
• M. Sreenivas, K. Alsabti, and S. Ranka. Parallel out-of-core divide and conquer techniques with application
to classification trees. In 13th International Parallel Processing Symposium, April 1999.
• A. Srivastava, E-H. Han, V. Kumar, and V. Singh. Parallel formulations of decision-tree classification
algorithms. Data Mining and Knowledge Discovery: An International Journal, 3(3):237--261, 1999.
• M.J. Zaki, C.-T. Ho, and R. Agrawal.Parallel classification for data mining on shared-memory
multiprocessors. In 15th IEEE Intl. Conf. on Data Engineering, March 1999.

High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 225

References: Associations
• R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. Inkeri Verkamo. Fast discovery of association
rules. In U. Fayyad and et al, editors, Advances in Knowledge Discovery and Data Mining, pages 307--328.
AAAI Press, Menlo Park, CA, 1996.
• R. Agrawal and J. Shafer. Parallel mining of association rules. IEEE Trans. on Knowledge and Data Engg.,
8(6):962--989, December 1996.
• D. Cheung, J. Han, V. Ng, A. Fu, and Y. Fu. A fast distributed algorithm for mining association rules. In 4th
Intl. Conf. Parallel and Distributed Info. Systems, December 1996.
• D. Cheung, V. Ng, A. Fu, and Y. Fu. Efficient mining of association rules in distributed databases. IEEE
Transactions on Knowledge and Data Engg., 8(6):911-922, December 1996.
• D. Cheung, K. Hu, and S. Xia. Asynchronous parallel algorithm for mining association rules on shared-
memory multi-processors. In 10th ACM Symp. Parallel Algorithms and Architectures, June 1998.
• D. Cheung and Y. Xiao. Effect of data distribution in parallel mining of associations. Data Mining and
Knowledge Discovery: An International Journal. 3(3):291--314, 1999.
• E-H. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for association rules. In ACM SIGMOD
Conf. Management of Data, May 1997.
• M. Joshi, E.-H. Han, G. Karypis, and V. Kumar. Efficient parallel algorithms for mining associations. In M.
Zaki and C.-T. Ho (eds), Large-Scale Parallel Data Mining, LNAI State-of-the-Art Survey, Volume 1759,
Springer-Verlag, 2000.
• S. Morishita and A. Nakaya. Parallel branch-and-bound graph search for correlated association rules. In
Zaki and Ho (eds), Large-Scale Parallel Data Mining, LNAI Vol. 1759, Springer-Verlag 2000.

High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 226

References: Associations (contd.)
• A. Mueller. Fast sequential and parallel algorithms for association rule mining: A comparison. Technical
Report CS-TR-3515, University of Maryland, College Park, August 1995.
• J.S. Park, M. Chen, and P. S. Yu. Efficient parallel data mining for association rules. In ACM Intl. Conf.
Information and Knowledge Management, November 1995.
• T. Shintani and M. Kitsuregawa. Hash based parallel algorithms for mining association rules. In 4th Intl.
Conf. Parallel and Distributed Info. Systems, December 1996.
• T. Shintani and M. Kitsuregawa. Parallel algorithms for mining generalized association rules with
classification hierarchy. In ACM SIGMOD International Conference on Management of Data, May 1998.
• M. Tamura and M. Kitsuregawa. Dynamic load balancing for parallel association rule mining on
heterogeneous PC cluster systems. In 25th Int'l Conf. on Very Large Data Bases, September 1999.
• M.J. Zaki. Parallel and distributed association mining: A survey. IEEE Concurrency, 7(4):14--25, October-
December 1999.
• M.J. Zaki, M. Ogihara, S. Parthasarathy, and W. Li. Parallel data mining for association rules on shared-
memory multi-processors. In Supercomputing'96, November 1996.
• M.J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New algorithms for fast discovery of association rules.
In 3rd Intl. Conf. on Knowledge Discovery and Data Mining, August 1997.
• M.J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithms for fast discovery of association
rules. Data Mining and Knowledge Discovery: An International Journal, 1(4):343-373, December 1997.
• M.J. Zaki, S. Parthasarathy, W. Li, A Localized Algorithm for Parallel Association Mining, 9th Annual ACM
Symposium on Parallel Algorithms and Architectures (SPAA), June, 1997.

High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 227

References: Sequences

• R. Agrawal and R. Srikant. Mining sequential patterns. In 11th Intl. Conf. on Data Engg., 1995.
• H. Mannila, H. Toivonen and I. Verkamo. Discovery of frequent episodes in event sequences. Data
Mining and Knowledge Discovery: An International Journal. 1(3):259-289, 1997.
• T. Oates, M. D. Schmill, and P. R. Cohen. Parallel and distributed search for structure in multivariate time
series. In 9th European Conference on Machine Learning, 1997.
• T. Oates, M. D. Schmill, D. Jansen, and P. R. Cohen. A family of algorithms for finding temporal structure in
data. In 6th Intl. Workshop on AI and Statistics, March 1997.
• T. Shintani and M. Kitsuregawa. Mining algorithms for sequential patterns in parallel: Hash based approach.
In 2nd Pacific-Asia Conf. on Knowledge Discovery and Data Mining, April 1998.
• R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. In
5th Intl. Conf. Extending Database Technology, March 1998.
• M.J. Zaki. Efficient enumeration of frequent sequences. In 7th Intl. Conf. on Information and Knowledge
Management, November 1998.
• Mohammed J. Zaki, Parallel Sequence Mining on SMP Machines, ACM SIGKDD Workshop on Large-Scale
Parallel KDD Systems, 1999 (LNAI Vol. 1759).

High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 228

References: Clustering
• K. Alsabti, S. Ranka, V. Singh. An Efficient K-Means Clustering Algorithm. 1st IPPS Workshop on High
Performance Data Mining, March 1998.
• I. Dhillon and D. Modha. A data clustering algorithm on distributed memory machines. In Zaki and Ho (eds),
Large-Scale Parallel Data Mining, LNAI Vol. 1759, Springer-Verlag 2000.
• L. Iyer and J. Aronson. A parallel branch-and-bound algorithm for cluster analysis. Annals of Operations
Research Vol. 90, pp 65-86, 1999.
• E. Johnson and H. Kargupta. Collective hierarchical clustering from distributed heterogeneous data. In Zaki
and Ho (eds), Large-Scale Parallel Data Mining, LNAI Vol. 1759, Springer-Verlag 2000.
• D. Judd, P. McKinley, and A. Jain. Large-scale parallel data clustering. In Int'l Conf. Pattern Recognition,
August 1996.
• X. Li and Z. Fang. Parallel clustering algorithms. Parallel Computing, 11:270--290, 1989.
• C.F. Olson. Parallel algorithms for hierarchical clustering. Parallel Computing, 21:1313--1325, 1995.
• S. Ranka and S. Sahni. Clustering on a hypercube multicomputer. IEEE Trans. on Parallel and Distributed
Systems, 2(2):129--137, 1991.
• F. Rivera, M. Ismail, and E. Zapata. Parallel squared error clustering on hypercube arrays. Journal of
Parallel and Distributed Computing, 8:292--299, 1990.
• G. Rudolph. Parallel clustering on a unidirectional ring. In R. Grebe et al., editor, Transputer Applications
and Systems'93: Volume 1, pages 487--493. IOS Press, Amsterdam, 1993.
• H. Nagesh, S. Goil and A. Choudhary. MAFIA: Efficient and scalable subspace clustering for very large
data sets. Technical Report 9906-010, Center for Parallel and Distributed Computing, Northwestern
University, June 1999.
• X. Xu, J. Jager and H.-P. Kriegel. A fast parallel clustering algorithm for large spatial databases. Data
Mining and Knowledge Discovery: An International Journal. 3(3):263--290, 1999.

High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 229

References: Distributed DM
• J. Aronis, V. Kolluri, F. Provost, and B. Buchanan. The WORLD: Knowledge discovery from multiple
distributed databases. In Florida Artificial Intelligence Research Symposium, May 1997.
• R. Bhatnagar and S. Srinivasan. Pattern discovery in distributed databases. In AAAI National Conference
on Artificial Intelligence, July 1997.
• J. Chattratichat, J. Darlington, Y. Guo, S. Hedvall, M. Kohler, and J. Syed. An architecture for distributed
enterprise data mining. In 7th Intl. Conf. High-Performance Computing and Networking, April 1999.
• R.L. Grossman, S. M. Bailey, H. Sivakumar, and A. L. Turinsky. Papyrus: A system for data mining over
local and wide area clusters and super-clusters. In Supercomputing'99, November 1999.
• Y. Guo and J. Sutiwaraphun. Knowledge probing in distributed data mining. In 3rd Pacific-Asia Conference
on Knowledge Discovery and Data Mining, August 1999.
• L. Hall, N. Chawla, K. Bowyer and W. P. Kegelmeyer. Learning rules from distributed data. In Zaki and Ho
(eds), Large-Scale Parallel Data Mining, LNAI Vol. 1759, Springer-Verlag 2000.
• H. Kargupta, I. Hamzaoglu, and B. Stafford. Scalable, distributed data mining using an agent based
architecture. In 3rd Intl. Conf. on Knowledge Discovery and Data Mining, August 1997.
• H. Kargupta, B-H. Park, D. Hershberger, and E. Johnson. Collective data mining: A new perspective toward
distributed data mining. In Kargupta and Chan (eds), Advances in Distributed DM, AAAI Press, 2000.
• S. Parthasarathy and R. Subramonian. Facilitating data mining on a network of workstations. In Kargupta
and Chan (eds), Advances in Distributed DM, AAAI Press, 2000.
• A. Prodromidis, S. Stolfo, and P. Chan. Meta-learning in distributed data mining systems: Issues and
approaches. In Kargupta and Chan (eds), Advances in Distributed DM, AAAI Press, 2000.
• S. Stolfo, A. Prodromidis, S. Tselepis, W. Lee, W. Fan, and P. Chan. JAM: Java agents for meta-learning
over distributed databases. In 3rd Intl. Conf. on Knowledge Discovery and Data Mining, August 1997.

High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 230
