Tutorial PM-3

High Performance Data Mining

Vipin Kumar (University of Minnesota)

Mohammed Zaki (Rensselaer Polytechnic)

High Performance Data Mining

Vipin Kumar
Computer Science Dept.
University of Minnesota
Minneapolis, MN, USA.

Mohammed J. Zaki
Computer Science Dept.
Rensselaer Polytechnic Institute
Troy, NY, USA.

Tutorial overview
• Overview of KDD and data mining
• Parallel design space
• Classification
• Associations
• Sequences
• Clustering
• Future directions and summary
Overview of data mining

• What is Data Mining/KDD?

• Why is KDD necessary
• The KDD process
• Mining operations and methods
• Core issues in KDD

What is data mining?

The iterative and interactive process of

discovering valid, novel, useful, and
understandable patterns or models in

Massive databases

What is data mining?

• Valid: generalize to the future

• Novel: what we don't know
• Useful: be able to take some action
• Understandable: leading to insight
• Iterative: takes multiple passes
• Interactive: human in the loop
Data mining goals

• Prediction
- What?

- Opaque

• Description
- Why?

- Transparent

Data mining operations

• Verification driven
- Validating hypothesis
- Querying and reporting (spreadsheets, pivot tables)
- Multidimensional analysis (dimensional summaries); On Line Analytical Processing

- Statistical analysis

Data mining operations

• Discovery driven
-Exploratory data analysis
- Predictive modeling
- Database segmentation
- Link analysis
- Deviation detection

KDD Process

Data mining process

• Understand application domain

- Prior knowledge, user goals
• Create target dataset
- Select data, focus on subsets
• Data cleaning and transformation
-Remove noise, outliers, missing values
- Select features, reduce dimensions
Data mining process
• Apply data mining algorithm
- Associations, sequences, classification, clustering, etc.
• Interpret, evaluate and visualize patterns
- What's new and interesting?
- Iterate if needed
• Manage discovered knowledge
- Close the loop
Why Mine Data? Commercial

• Lots of data is being collected and
• Computing has become affordable.
• Competitive Pressure is Strong
- Provide better, customized services for an
- Information is becoming a product in its own right
Why Mine Data?
Scientific Viewpoint...
• Data collected and stored at enormous speeds
- remote sensor on a satellite
- telescope scanning the skies
- microarrays generating gene expression data
- scientific simulations generating terabytes of data
• Traditional techniques are infeasible for raw data
• Data mining for data reduction
- cataloging, classifying, segmenting data
- Helps scientists in Hypothesis Formation

Origins of Data Mining

• Draws ideas from machine learning/AI,

pattern recognition, statistics, database
systems, and data visualization.
• Traditional techniques may be unsuitable due to
- Enormity of data
- High dimensionality of data
- Heterogeneous, distributed nature of data
Data mining methods

• Predictive modeling (classification, regression)

• Segmentation (clustering)
• Dependency modeling (graphical
models, density estimation)
• Summarization (associations)
• Change and deviation detection
Data mining techniques

• Association rules: detect sets of attributes that

frequently co-occur, and rules among them, e.g.
90% of the people who buy cookies, also buy
milk (60% of all grocery shoppers buy both)
• Sequence mining (categorical): discover
sequences of events that commonly occur
together, e.g., in a set of DNA sequences
ACGTC is followed by GTCA after a gap of 9,
with 30% probability
Data mining techniques

• Classification and regression: assign a new data

record to one of several predefined categories or
classes. Regression deals with predicting real-
valued fields. Also called supervised learning.
• Clustering: partition the dataset into subsets or
groups such that elements of a group share a
common set of properties, with high within group
similarity and small inter-group similarity. Also
called unsupervised learning.

Data mining techniques

• Similarity search: given a database of objects,

and a "query" object, find the object(s) that are
within a user-defined distance of the queried
object, or find all pairs within some distance of
each other.
• Deviation detection: find the record(s) that is
(are) the most different from the other records,
i.e., find all outliers. These may be thrown away
as noise or may be the "interesting" ones.
Data mining techniques
Many other methods, such as
- Neural networks
- Genetic algorithms
- Hidden Markov models
- Time series
- Bayesian networks
- Soft computing: rough and fuzzy sets

Main challenges for KDD

• Scalability
- Efficient and sufficient sampling
-In-memory vs. disk-based processing
- High performance computing
• Automation
- Ease of use
- Using prior knowledge

Tutorial overview
• Overview of KDD and data mining
• Parallel design space
• Classification

• Associations
• Sequences
• Clustering
• Future directions and summary
Speeding up data mining

• Data oriented approach

- Discretization
- Feature selection
- Feature construction (PCA)
- Sampling

• Methods oriented approach

- Efficient and scalable algorithms

Speeding up data mining (contd.)

• Methods oriented approach (contd.)

-Parallel data mining
• Task or control parallelism
• Data parallelism
• Hybrid parallelism
-Distributed data mining
• Voting
• Meta-learning, etc.

Need for Parallel Formulations

• Need to handle very large datasets.

• Memory limitations of sequential computers
cause sequential algorithms to make multiple
expensive I/O passes over data.
• Need for scalable, efficient (fast) data mining
- Gain competitive advantage.
- Handle larger data for greater accuracy in shorter time.
Parallel Design Space
• Parallel architectures
- Distributed memory
- Shared disk
- Shared memory
-Hybrid cluster of SMPs
• Task and data parallelism
• Static and dynamic load balancing
• Horizontal and vertical data layout
• Data and concept partitioning
Parallel Hardware
• Distributed-memory machines
- Each processor has local memory and disk
- Communication via message-passing
- Hard to program: explicit data distribution
- Goal: minimize communication
• Shared-memory machines
- Shared global address space and disk
- Communication via shared memory variables
- Ease of programming
- Goal: maximize locality, minimize false sharing
• Current trend: Cluster of SMPs
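Distributed Memory Architecture
(Shared Nothing)
[Figure: processors, each with local memory and disk, connected by an interconnection network]

DMM: Shared Disk Architecture
[Figure: processors connected by an interconnection network, sharing disks]

Shared Memory Architecture
(Shared Everything)
[Figure: processors sharing memory and disk through an interconnection network]

Cluster of SMPs
[Figure: SMP nodes connected by an interconnection network]
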
Task vs. Data Parallelism
• Data Parallelism
-Data partitioned among P processors
- Each processor performs the same work
on local partition
• Task Parallelism
- Each processor performs different computation
- Data may be (selectively) replicated or partitioned
Task vs. Data Parallelism
[Figure: a uniprocessor workload divided under task parallelism and under data parallelism]
Static vs. Dynamic Load Balance
• Static Load Balancing
- Work is initially divided (heuristically)
- No subsequent computation or data movement
• Dynamic Load Balancing
- Steal work from heavily loaded processors
- Reassign to lightly loaded processors
-Important for irregular computation,
heterogeneous environments, etc.
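The difference between the two policies can be sketched with a small simulation. This is only an illustration under stated assumptions: the task costs are hypothetical, static balancing is modeled as a round-robin assignment fixed up front, and dynamic balancing is modeled as a shared work queue in which the next task goes to the least loaded processor (a simplification of work stealing).

```python
import heapq

def static_makespan(costs, p):
    """Static: task i is assigned to processor i % p up front, never moved."""
    loads = [0] * p
    for i, cost in enumerate(costs):
        loads[i % p] += cost
    return max(loads)

def dynamic_makespan(costs, p):
    """Dynamic (work-queue model): each task goes to the least loaded processor."""
    loads = [0] * p
    heapq.heapify(loads)
    for cost in costs:
        lightest = heapq.heappop(loads)
        heapq.heappush(loads, lightest + cost)
    return max(loads)

# Hypothetical irregular workload: two expensive tasks among cheap ones.
costs = [8, 1, 1, 1, 8, 1, 1, 1]
print(static_makespan(costs, 2))   # 18
print(dynamic_makespan(costs, 2))  # 11
```

With this irregular workload the static assignment leaves one processor with both expensive tasks, while the dynamic policy reaches the even split of 11 units per processor.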
Horizontal/Vertical Data Format

[Figure: the same records (Tids 2, 4, 6, 8, 10) stored in horizontal (row-wise) format and in vertical (column-wise) format]


Data and concept partitioning

• Shared
- SMP or shared disk architectures

• Replicated
- Partially or totally

• Partitioned
- Round-robin partitioning
- Hash partitioning
- Range partitioning
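The three partitioning schemes can be sketched as follows. The function names, the choice of key, and the range boundaries are assumptions made for illustration; the (rid, Age) pairs reuse the ages from the ScalParC example that appears later in these slides.

```python
from bisect import bisect_right

def round_robin(records, p):
    """Record i goes to partition i % p."""
    parts = [[] for _ in range(p)]
    for i, rec in enumerate(records):
        parts[i % p].append(rec)
    return parts

def hash_partition(records, p, key):
    """A record goes to partition hash(key) % p."""
    parts = [[] for _ in range(p)]
    for rec in records:
        parts[hash(key(rec)) % p].append(rec)
    return parts

def range_partition(records, boundaries, key):
    """Partition i holds keys in [boundaries[i-1], boundaries[i])."""
    parts = [[] for _ in range(len(boundaries) + 1)]
    for rec in records:
        parts[bisect_right(boundaries, key(rec))].append(rec)
    return parts

# (rid, Age) pairs
records = [(0, 70), (1, 33), (2, 19), (3, 38), (4, 50),
           (5, 24), (6, 40), (7, 58), (8, 28)]
by_age = lambda rec: rec[1]

rr = round_robin(records, 3)
hp = hash_partition(records, 3, by_age)
rp = range_partition(records, [30, 50], by_age)   # hypothetical boundaries
print([[rid for rid, _ in part] for part in rp])  # [[2, 5, 8], [1, 3, 6], [0, 4, 7]]
```

Round-robin gives equal-sized partitions regardless of content, hash partitioning co-locates equal keys, and range partitioning keeps each partition contiguous in key order.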
Tutorial overview
• Overview of KDD and data mining
• Parallel design space
• Classification
• Associations

• Sequences
• Clustering
• Future directions and summary
What is Classification?

Classification is the process of assigning
new objects to predefined categories or classes

• Given a set of labeled records
• Build a model (decision tree)
• Predict labels for future unlabeled records
Classification learning

• Supervised learning (labels known)

• Example described in terms of attributes
-Categorical (unordered symbolic values)
-Numeric (integers, reals)
• Class (output/predicted attribute):
categorical for classification, numeric for regression

Classification learning

• Training set: set of examples, where

each example is a feature vector (i.e., a
set of (attribute,value) pairs) with its
associated class. Model built on this set.
• Test set: a set of examples disjoint from
the training set, used for testing the
accuracy of a model.

Classification Models
• Some models are better than others
- Accuracy

- Understandability

• Models range from easy to understand to incomprehensible
- Decision trees (easier)
- Rule induction
- Regression models
- Neural networks (harder)
Decision Tree Classification

Splitting Attributes

[Figure: decision tree that splits on Refund (Yes / No), then MarSt, then TaxInc (< 80K / >= 80K)]

The splitting attribute at a node is

determined based on the Gini index.

From Tree to Rules

[Figure: the decision tree on Refund, MarSt, and TaxInc, shown next to the rules]

1) Refund = Yes => NO
2) Refund = No and MarSt in {Single, Divorced} and TaxInc < 80K => NO
3) Refund = No and MarSt in {Single, Divorced} and TaxInc >= 80K => YES
4) Refund = No and MarSt in {Married} => NO
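
The four rules translate directly into a nested conditional. A minimal sketch follows; the function name, the string encodings, and expressing TaxInc in dollars (80K as 80,000) are assumptions made for illustration.

```python
def classify(refund: str, marst: str, taxinc: float) -> str:
    """Apply rules 1 to 4; returns the predicted class label (YES or NO)."""
    if refund == "Yes":
        return "NO"                      # rule 1
    if marst == "Married":
        return "NO"                      # rule 4
    # Remaining case: Refund = No and MarSt in {Single, Divorced}
    if taxinc < 80_000:
        return "NO"                      # rule 2
    return "YES"                         # rule 3

print(classify("No", "Single", 85_000))   # YES
print(classify("No", "Married", 100_000)) # NO
```

Each root-to-leaf path of the tree corresponds to exactly one rule, so the rules are mutually exclusive and cover every record.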

Classification algorithm

• Build tree
-Start with data at root node
-Select an attribute and formulate a
logical test on attribute
-Branch on each outcome of the test,
and move subset of examples
satisfying that outcome to
corresponding child node
Classification algorithm
- Recurse on each child node
- Repeat until leaves are "pure", i.e., have
example from a single class, or "nearly
pure", i.e., majority of examples are from
the same class
• Prune tree
- Remove subtrees that do not improve
classification accuracy
- Avoid over-fitting, i.e., training set specific artifacts
Build tree
• Evaluate split-points for all attributes
• Select the "best" point and the "winning" attribute
• Split the data into two
- Formulation of good split tests
- Selection measure for attributes

High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 45 [

How to capture good splits?

• Occam's razor: Prefer the simplest

hypothesis that fits the data
• Minimum message/description length
- Dataset D
- Hypotheses H1, H2, ..., Hx describing D
- MML(Hi) = Mlength(Hi) + Mlength(D|Hi)
- Pick Hk with minimum MML
• Mlength given by Gini index, Gain, etc.
Tree pruning using MDL
• Data encoding: sum classification errors
• Model encoding:
- Encode the tree structure
- Encode the split points
• Pruning: choose smallest length option
-Convert to leaf
- Prune left or right child
- Do nothing

Hunt's Method
• Attributes: Refund (Yes, No), Marital Status
(Single, Married, Divorced), Taxable Income
• Class: Cheat, Don't Cheat

What's really happening?
[Figure: scatter plot of Cheat (+) and Don't Cheat (o) records over Marital Status and Income, partitioned at Married and at Income < 80K]
Finding good split points

• Use Gini index for partition purity

$$\mathrm{Gini}(S_1, S_2) = \frac{n_1}{n}\,\mathrm{Gini}(S_1) + \frac{n_2}{n}\,\mathrm{Gini}(S_2)$$

• If S is pure, Gini(S) = 0
• Find split point with minimum Gini
• Only need class distributions
Finding good split points
[Figure: two candidate splits of the Cheat (+) / Don't Cheat (o) scatter plot over Marital Status and Income]

Left/right split          Top/bottom split
Left     14    9          Top       5   23
Right     1   18          Bottom   10    4
Gini(split) = 0.31        Gini(split) = 0.34
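
The weighted Gini of a split needs only the class counts in each partition. The sketch below assumes the usual definition Gini(S) = 1 - sum of squared class fractions, which the slides do not spell out, and applies it to the left/right counts from the figure; the function names are illustrative.

```python
def gini(counts):
    """Gini index of one partition, given its per-class record counts."""
    counts = list(counts)
    n = sum(counts)
    if n == 0:
        return 0.0
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(partitions):
    """Size-weighted Gini over the partitions produced by a split."""
    n = sum(sum(counts) for counts in partitions)
    return sum(sum(counts) / n * gini(counts) for counts in partitions)

# Left/right split: class counts (14, 9) on the left and (1, 18) on the right.
print(round(gini_split([[14, 9], [1, 18]]), 2))  # 0.31
print(gini([10, 0]))                             # 0.0 for a pure partition
```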

Categorical Attributes:
Computing Gini Index
• For each distinct value, gather counts for
each class in the dataset
• Use the count matrix to make decisions
[Figure: count matrices and Gini values for a multi-way split and for a two-way split (find best partition of values)]
Continuous Attributes:
Computing Gini Index...
• For efficient computation: for each attribute,
- Sort the attribute on values
- Linearly scan these values, each time updating the count
matrix and computing gini index
- Choose the split position that has the least gini index
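
The sort-then-scan procedure can be sketched as below. It is a minimal illustration, assuming two classes labeled 0 and 1, the usual Gini definition (1 minus the sum of squared class fractions), and a threshold placed midway between adjacent distinct values. The data are the Salary values and class labels from the ScalParC example later in these slides.

```python
def gini(counts):
    counts = list(counts)
    n = sum(counts)
    return 0.0 if n == 0 else 1.0 - sum((c / n) ** 2 for c in counts)

def best_split(values, labels, classes=(0, 1)):
    """Sort once, then scan linearly while updating the count matrix."""
    order = sorted(range(len(values)), key=values.__getitem__)
    below = {c: 0 for c in classes}          # counts for value <= threshold
    above = {c: 0 for c in classes}          # counts for value > threshold
    for y in labels:
        above[y] += 1
    n = len(values)
    best_gini, best_threshold = float("inf"), None
    for pos in range(n - 1):
        i, j = order[pos], order[pos + 1]
        below[labels[i]] += 1                # move one record across the split
        above[labels[i]] -= 1
        if values[i] == values[j]:
            continue                         # no split between equal values
        n_below = pos + 1
        g = (n_below / n) * gini(below.values()) \
            + ((n - n_below) / n) * gini(above.values())
        if g < best_gini:
            best_gini, best_threshold = g, (values[i] + values[j]) / 2
    return best_gini, best_threshold

salary = [24542, 98816, 49241, 126146, 94766, 97672, 136838, 153032, 64911]
cid    = [0,     0,     1,     1,      0,     0,     1,      1,      0]
print(best_split(salary, cid))
```

On this data the scan picks the position between 98816 and 126146, which is where the ScalParC example marks its split point.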

[Figure: sorted values with candidate split positions]

• Sorts continuous attributes at each node.
• Needs entire data to fit in memory.
• Unsuitable for large datasets.
- Needs out-of-core sorting.
• Classification accuracy shown to
improve when entire datasets are used
SPRINT [Shafer, Agrawal, Mehta]

Attribute Lists:


SPRINT (contd.)
• The arrays of the continuous attributes are pre-sorted.
- The sorted order is maintained during each split.
• The classification tree is grown in a breadth-first fashion.
• Class information is clubbed with each attribute list.
• Attribute lists are physically split among nodes.
• Split determining phase is just a linear scan of lists at each node.
• Hashing scheme used in splitting phase.
- tids of the splitting attribute are hashed with the tree node as the
• lookup table
- remaining attribute arrays are split by querying this hash structure.

SPRINT Disadvantages

• Size of hash table is O(N) for top levels

of the tree.
• If hash table does not fit in memory
(mostly true for large datasets), then
build in parts so that each part fits.
- Multiple expensive I/O passes over the
entire dataset.

Constructing a Decision Tree in Parallel
[Figure: records with m categorical attributes partitioned across processors]
• Partitioning of data only
- global reduction per node is required
- large number of classification tree nodes gives high communication cost

Constructing a Decision Tree in Parallel
[Figure: 10,000 training records divided among classification tree nodes]
• Partitioning of classification tree nodes
- natural concurrency
- load imbalance as the amount of work associated with each node varies
- child nodes use the same data as used by parent node
- loss of locality
- high data movement cost

Challenges in Constructing Parallel

• Partitioning of data only
- large number of classification tree nodes gives high communication cost

• Partitioning of classification tree nodes

- natural concurrency
- load imbalance as the amount of work associated with each node varies
- child nodes use the same data as used by parent node
- loss of locality
- high data movement cost

• How do we efficiently perform the computation in parallel?

Synchronous Tree Construction Approach

Partition Data Across Processors
+ No data movement is required
- Load imbalance
  can be eliminated by breadth-first expansion
- High communication cost
  becomes too high in lower parts of the tree
[Figure: Proc 0 to Proc 3 exchanging class distribution information at each tree node]

Partitioned Tree Construction Approach

Partition Data and Nodes
+ Highly concurrent
- High communication cost due to excessive data movement
- Load imbalance
[Figure: Proc 0 to Proc 3 assigned to progressively smaller groups of tree nodes]

Hybrid Parallel Formulation
[Figure: computation frontier at depth 3; processors divided into Partition 1 and Partition 2, with panels labeled Synchronous Tree Construction Approach and Partitioned Tree Construction Approach]

Load Balancing
[Figure: Partition 1 and Partition 2. Step 1: exchange data between processor partitions. Step 2: load balance within processor partitions]

Splitting Criterion

• Switch to Partitioned Tree Construction when

$$\sum(\text{Communication Cost}) > \text{Moving Cost} + \text{Load Balancing}$$

• Splitting criterion ensures

$$\text{Communication Cost} < 2 \times \text{Communication Cost}_{\text{optimal}}$$

G. Karypis and V. Kumar, IEEE Transactions on Parallel and
Distributed Systems, Oct. 1994

Experimental Results

• Data set
- function 2 data set discussed in SLIQ paper
(Mehta, Agrawal and Rissanen, EDBT'96)
- 2 class labels, 3 categorical and 6 continuous attributes
• IBM SP2 with 128 processors
- 66.7 MHz CPU with 256 MB real memory
- AIX version 4
- high performance switch

Speedup Comparison of the Three
Parallel Algorithms
[Figure: speedup vs. number of processors (2 to 16) for the partitioned, hybrid, and synchronous algorithms; panels for 0.8 million examples and 1.6 million examples]

Splitting Criterion Verification in the
Hybrid Algorithm

$$\text{Splitting Criterion Ratio} = \frac{\sum(\text{Communication Cost})}{\text{Moving Cost} + \text{Load Balancing}}$$

[Figure: runtimes for splitting at different values of the ratio, plotted against log2(Splitting Criterion Ratio); panels for 0.8 million examples on 8 processors and 1.6 million examples on 16 processors]

Speedup of the Hybrid Algorithm with
Different Size Data Sets
[Figure: speedup curves vs. number of processors (axis 20 to 140) for data sets of 0.8, 1.6, 3.2, 6.4, 12.8, and 25.6 million examples]

Scaleup of the Hybrid Algorithm
[Figure: runtimes of our algorithm for 50K examples at each processor, plotted against number of processors]

Summary of Algorithms for
Categorical Attributes
• Synchronous Tree Construction Approach
- no data movement required
- high communication cost as tree becomes bushy
• Partitioned Tree Construction Approach
- processors work independently once partitioned completely
- load imbalance and high cost of data movement
• Hybrid Algorithm
- combines good features of two approaches
- adapts dynamically according to the size and shape of trees

Handling Continuous Attributes

• Sort continuous attributes at each node of the
tree (as in C4.5). Expensive, hence
• Discretize continuous attributes
- CLOUDS (Alsabti, Ranka, and Singh, 1998)
- SPEC (Srivastava, Han, Kumar, and Singh, 1997)
• Use a pre-sorted list for each continuous attribute
- SPRINT (Shafer, Agrawal, and Mehta, VLDB'96)
- ScalParC (Joshi, Karypis, and Kumar, IPPS'98)
Design Approach
• Goal: Scalability in both Runtime and
Memory Requirements.
• Parallelization overhead: To = P Tp - Ts
• To = O(Ts) for scalability. Per processor
overhead should not exceed O(Ts/P).
• Two components of To:
- Load Imbalance.

- Communication time.
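
The overhead definition can be turned into a small calculation. The timings below are hypothetical, and the efficiency formula E = Ts / (P Tp) = Ts / (Ts + To) is the standard definition rather than something stated on the slide.

```python
def parallel_overhead(t_serial, t_parallel, p):
    """To = P * Tp - Ts: total processor time not spent on useful serial work."""
    return p * t_parallel - t_serial

def efficiency(t_serial, t_parallel, p):
    """E = Ts / (P * Tp), equivalently Ts / (Ts + To)."""
    return t_serial / (p * t_parallel)

# Hypothetical timings: 100 s serially, 30 s on 4 processors.
ts, tp, p = 100.0, 30.0, 4
to = parallel_overhead(ts, tp, p)
print(to)                                   # 20.0
print(round(efficiency(ts, tp, p), 3))      # 0.833
print(to / p)                               # per-processor overhead: 5.0
```

Keeping To within O(Ts) bounds the efficiency away from zero as P grows, which is the sense in which the per-processor overhead should stay within O(Ts/P).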
Load Balancing
[Figure: sorted attribute lists (value, rid, cid) on processors P0 and P1, before and after the split at SP. SP: split point; rid: record id; cid: class label]

Parallel SPRINT

• Attribute lists are partitioned among processors

- Each processor gets N/p records of each attribute list

• Attribute lists of continuous attributes are pre-sorted

• Split Determining Phase
- Categorical attributes
• Local construction of count matrices followed by a reduction to
add them up
- Continuous attributes
• Prefix-scan followed by local Gini computations followed by a
Gini index reduction to find the minimum (best) split point
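
For a categorical attribute, the local-count-then-reduce step can be sketched with processors modeled as plain lists. This is a sequential stand-in for the real reduction; the CarType values and class labels are hypothetical.

```python
from collections import Counter
from functools import reduce

def local_count_matrix(partition):
    """Count (attribute value, class) pairs in one processor's partition."""
    return Counter((value, cid) for value, cid in partition)

def global_count_matrix(partitions):
    """Stand-in for the reduction: element-wise sum of the local matrices."""
    return reduce(lambda a, b: a + b,
                  (local_count_matrix(part) for part in partitions),
                  Counter())

# Hypothetical CarType attribute list split over three processors.
partitions = [
    [("sports", 1), ("family", 0), ("family", 1)],   # P0
    [("family", 0), ("sports", 1), ("sports", 0)],   # P1
    [("sports", 1), ("family", 1), ("family", 0)],   # P2
]
matrix = global_count_matrix(partitions)
print(matrix[("sports", 1)], matrix[("family", 0)])  # 3 3
```

Every processor ends up with the same global matrix after the reduction, so each can evaluate the split for this attribute without seeing any other processor's records.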

Parallelizing Split Determination

• Easy.
[Figure: (value, rid, cid) attribute lists on P0, P1, and P2 with their local count matrices, for a categorical attribute (CarType) and a continuous attribute (Age), combined by parallel operations]

Parallelizing Splitting Phase..

• SPRINT Approach:
- Builds hash table on all processors by an
all-to-all broadcast operation.
- So, O(N) data sent and received by each
processor --> not scalable in runtime as
well as memory requirements.


Scalable Parallel Formulation of

Splitting Phase [ScalParC]
• Challenging!!
- Requirement for consistent splitting, given
the relatively random sorted order among
• Four cases:


Splitting a categorical attribute list requires access
exactly to those entries of the hash-table that are
created by the same processor


• The hash-table entries required by a processor are

generated by other processors
- Need to develop a scheme to transfer these entries to the
processors that need them

Getting the required entries of the
hash table
The required entries are transferred in two steps
- From splitting attribute order to tid sorted order
- From tid sorted order to attribute list order

This Design is Inspired by..

Communication Structure of Parallel Sparse
Matrix-Vector Algorithms.
[Figure: 9 x 9 sparse matrix with rows distributed over P0, P1, and P2. X: Salary; O: Age; remaining marker: Node Table Entries]
Parallel Hashing Paradigm
• Distributed Hash Table.
• Hash Function: h(k) = (Pi, l)
• Construction:
- (k,v) --> (Pi,l) --> form send buffers {(l,v)} for each Pi --> all-
to-all-personalized communication.
• Enquiry:
- (k) --> (Pi,l) --> form {(l)} buffers for each Pi --> all-to-all-
personalized comm --> each Pi replaces received {(l)} with
{(v)} --> all-to-all-personalized comm.

• If each processor has m keys to hash, then
time is O(m) if m is Ω(p); i.e. Ω(p^2) overall
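
The construction and enquiry steps can be simulated sequentially, with nested lists standing in for processors and their send buffers. This is a sketch under stated assumptions: the block-style hash function, the helper names, and the list-based `all_to_all` are illustrative, not the authors' implementation. The keys are record ids and the values are child labels (L or R) taken from the ScalParC example that follows.

```python
def h(k, block):
    """h(k) = (processor, local slot), here a simple block distribution."""
    return k // block, k % block

def all_to_all(send):
    """send[src][dst] -> recv[dst][src]; stands in for the personalized exchange."""
    p = len(send)
    return [[send[src][dst] for src in range(p)] for dst in range(p)]

def construct(pairs_per_proc, block):
    p = len(pairs_per_proc)
    send = [[[] for _ in range(p)] for _ in range(p)]
    for src, pairs in enumerate(pairs_per_proc):
        for k, v in pairs:
            dst, slot = h(k, block)
            send[src][dst].append((slot, v))          # send buffers {(l, v)}
    tables = [[None] * block for _ in range(p)]
    for dst, received in enumerate(all_to_all(send)):
        for buf in received:
            for slot, v in buf:
                tables[dst][slot] = v
    return tables

def enquire(keys_per_proc, tables, block):
    p = len(tables)
    send = [[[] for _ in range(p)] for _ in range(p)]
    for src, keys in enumerate(keys_per_proc):
        for k in keys:
            dst, slot = h(k, block)
            send[src][dst].append(slot)               # buffers {(l)}
    requests = all_to_all(send)                       # first exchange
    replies = [[[tables[dst][slot] for slot in requests[dst][src]]
                for src in range(p)] for dst in range(p)]
    answers = all_to_all(replies)                     # second exchange
    results = []
    for src, keys in enumerate(keys_per_proc):
        streams = [iter(answers[src][dst]) for dst in range(p)]
        results.append([next(streams[h(k, block)[0]]) for k in keys])
    return results

# Salary list order (rid, child) per processor; split sends rids 3, 6, 7 right.
update = [[(0, "L"), (2, "L"), (8, "L")],
          [(4, "L"), (5, "L"), (1, "L")],
          [(3, "R"), (6, "R"), (7, "R")]]
tables = construct(update, block=3)
# Age list order (rids) per processor.
result = enquire([[2, 5, 8], [1, 3, 6], [4, 7, 0]], tables, block=3)
print(tables)
print(result)
```

Each processor stores only its own block of the table and exchanges one buffer per destination, which is what keeps memory at O(N/p) per processor in the analysis later in the slides.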

Applying the paradigm to
ScalParC: Update

(a) Training set
rid  Salary  Age  cid
0     24542   70   0
1     98816   33   0
2     49241   19   1
3    126146   38   1
4     94766   50   0
5     97672   24   0
6    136838   40   1
7    153032   58   1
8     64911   28   0

(b) Sorted attribute lists (value, rid, cid), three entries per processor
      Salary            Age
P0     24542  0  0      19  2  1
       49241  2  1      24  5  0
       64911  8  0      28  8  0
P1     94766  4  0      33  1  0
       97672  5  0      38  3  1
       98816  1  0      40  6  1
P2    126146  3  1      50  4  0
      136838  6  1      58  7  1
      153032  7  1      70  0  0

SP: the split point falls between Salary 98816 and 126146
Node Table (rid, kid): entries for rids 0 to 8, initially empty, distributed over P0, P1, P2

[Figure: hash buffers on P0, P1, P2 before and after communication, followed by the Node Table update]
Applying the Paradigm to
ScalParC: Enquiry
[Figure: the Age attribute list on P0, P1, P2 with its enquiry buffers, intermediate index buffers, intermediate value buffers, and result buffers]

Node Table
rid  0 1 2 3 4 5 6 7 8
kid  L L L R L L R R L

Verifying Scalability
• Updating the distributed hash table
- Each processor receives at most N/p entries
- Each processor can send at most Nk/p entries (k is the number of
continuous attributes)
• This is a worst case situation that does not happen frequently
• Inquiring the distributed hash table
- Each processor will perform k inquiries, one for each continuous attribute
- Each processor receives at most N/p entries
- Each processor sends at most N/p entries
• The per-processor communication overhead is in general
• Each processor requires only O(N/p) memory making ScalParC
memory efficient
A worst case scenario for Updating
Hash Table
One processor might need to send O(kN/p) data! Happens
[Figure: two attribute lists (value, rid, cid) on P0 and P1 before and after a split at SP, with send buffers B1 and B2]

Categorical/Continuous and
Continuous/Categorical
• Special cases of the Continuous/Continuous case
- Categorical/Continuous does not require
communication for loading the hash table
- Continuous/Categorical does not require
communication for inquiring the hash table

Algorithm: Level-wise
• Tree is built in a breadth-first manner.
• At each level of the decision tree
- Count Matrices for all attributes for all nodes are
reduced in one single communication approach.
- Loading the hash-table for all the nodes is
combined into a single communication operation
- Inquiring the hash-table to split a particular
attribute list is combined in a single
communication operation
• A total of k+1 All-to-All personalized
communication operations are performed for each
level of the tree
Structure of the Algorithm

• Sort Continuous Attributes (Pre-Sort)

• do level-wise while (at least one node requiring split)
- Compute count matrices (Find-Split I)
- Compute best gini index for nodes requiring split
(Find-Split II)
- Partition splitting attributes and update Node
Table (Perform-Split I)
- Partition non-splitting attributes by inquiring Node
Table (Perform-Split II)

• end do
[Figure: one level of the algorithm on the Salary and Age attribute lists, showing local and global count matrices and the Node Table (rid, kid) across processors]

Experimental Results
• Experiments were performed on a 128-
processor Cray T3D
• Training sets were synthetically generated
- Each contained only continuous attributes (5-9)
- There were two possible class labels

Parallel Runtime


Constant size/processor


Different Number of Attributes

SMP Parallel Design Space

• Data parallelism: within a tree node
- Split attributes: vertical
• Fixed Window
• Moving Window

- Split data (attribute lists): horizontal

• Task parallelism: between tree nodes
• Static vs. dynamic load balancing
SPRINT: Attribute Lists
[Figure: training set (Tid, Age, CarType, Class) and its attribute lists (Age, Class, Tid) and (CarType, Class, Tid); the continuous list is sorted, the categorical list keeps the original order]

Splitting Attribute Lists
[Figure: attribute lists for node 0 are split on Age < 27.5, using a hash table, into attribute lists for node 1 and node 2]

SPRINT: File per Attribute


Optimized File Usage

BASIC: Data Parallel
• Given current leaf frontier
• Atomically acquire an attribute and
evaluate it for all leaves
• Master finds winning attribute and
constructs hash table
• Atomically acquire an attribute and split
its list for all leaves

BASIC (Tree View)


[Figure: tree view of BASIC]

BASIC (Level View, A=3, P=4)
[Figure: level view of BASIC on processors P0 to P3]

Fixed Window: Data Parallel

• Partition leaves into blocks/window of K
• Dynamically acquire any attribute for
any leaf within current block and
evaluate it
• Last processor to work on a leaf notes
the winning attribute and builds the
hash table
• Split data as in BASIC
• Barrier synchronization between each
block of K leaves
FWK (Tree View)
[Figure: tree view of Fixed Window with window size K]

FW2 (Level View)
[Figure: level view of Fixed Window with window size 2 on processors P0 to P3]
Moving Window: Data Parallel
• Partition leaves into blocks/window of K
• Dynamically acquire any attribute for
any leaf, say i, within current block
• Wait for last block's i-th leaf
• Last processor to work on a leaf notes
the winning attribute and builds the
hash table
• Split data as in BASIC
• No barriers, only conditional wait
MWK (Tree Level)
[Figure: tree view of Moving Window with window size K]

MW2 (Level View)
[Figure: level view of Moving Window with window size 2 on processors P0 to P3]
SUBTREE: Task Parallel

• Processor group (P) and leaf frontier (L)
• Apply BASIC, get new leaf frontier (NL)
• If NL is empty, put processors in FreeQ,
else new group, NP = P + FreeQ
• Master divides NP and NL into two parts
• NP1 works on NL1, NP2 works on NL2
• Initially all processors work on root

SUBTREE (4 procs)



Experimental Setup
• SMP Machines
- 112 MHz PowerPC-604 processors
- 16KB L1-cache, 1MB L2-cache
- Machine: 4-way SMP
• 128MB memory, 300MB disk

• Synthetic Datasets: simple (F2) and
complex (F7) decision tree

Experimental Datasets
Dataset      Function  #Attr  Num Records  DBSize  Num Levels  Max Nodes
F2A8D1M      F2        8      1000K        192MB   4           2
F2A32D250K   F2        32     250K         192MB   4           2
F2A64D125K   F2        64     125K         192MB   4           2
F7A8D1M      F7        8      1000K        192MB   60          4662
F7A32D250K   F7        32     250K         192MB   59          802
F7A64D125K   F7        64     125K         192MB   55          384

Setup and Sort Time

Dataset      Setup (sec)  Sort (sec)  Total (sec)  Setup%  Sort%
F2A8D1M      721          633         3597         20%     18%
F2A32D250K   685          598         3584         19%     17%
F2A64D125K   705          626         3665         19%     17%
F7A8D1M      989          817         23360        4%      4%
F7A32D250K   838          780         24706        3%      3%
F7A64D125K   672          636         22664        3%      3%

Parallel Performance
[Figure: results for F2A64D125K and F7A64D125K plotted against number of processors (1 to 4)]

Parallel Performance
[Figure: results on the A64D125K datasets for F2 and F7, including SUBTREE, plotted against number of processors (1 to 4)]
Tutorial overview
• Overview of KDD and data mining
• Parallel design space
• Classification
• Associations
• Sequences
• Clustering
• Future directions and summary
What is association mining?

• Given a set of items/attributes, and a set
of objects containing a subset of the items
• Find rules: if I1 then I2 (sup, conf)
• I1, I2 are sets of items
• I1, I2 have sufficient support: P(I1 + I2)
• Rule has sufficient confidence: P(I2 | I1)

Association mining
• User specifies "interestingness"
-Minimum support (minsup)
- Minimum confidence (minconf)
• Find all frequent itemsets (> minsup)
-Exponential Search Space
-Computation and I/O Intensive
• Generate strong rules (> minconf)
- Relatively cheap

Association Rule Discovery: Support

and Confidence

[Figure: table of five example transactions]

Example: association rule X => Y with {Diaper, Milk} => Beer

$$s = \frac{\sigma(\text{Diaper, Milk, Beer})}{\text{Total Number of Transactions}} = \frac{2}{5} = 0.4$$

$$\text{confidence} = \frac{\sigma(\text{Diaper, Milk, Beer})}{\sigma(\text{Diaper, Milk})} = 0.66$$
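
Support and confidence reduce to counting transactions that contain an itemset. The transaction table on the slide did not survive extraction, so the five transactions below are hypothetical, chosen so that the rule {Diaper, Milk} => Beer has the same support count (2 of 5) as the slide's example.

```python
def support_count(itemset, transactions):
    """sigma(itemset): number of transactions containing every item."""
    return sum(1 for t in transactions if itemset <= t)

def support(lhs, rhs, transactions):
    return support_count(lhs | rhs, transactions) / len(transactions)

def confidence(lhs, rhs, transactions):
    return support_count(lhs | rhs, transactions) / support_count(lhs, transactions)

# Hypothetical transactions (not recovered from the slide).
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
lhs, rhs = {"Diaper", "Milk"}, {"Beer"}
print(support(lhs, rhs, transactions))     # 0.4
print(confidence(lhs, rhs, transactions))  # 2/3, which the slide truncates to 0.66
```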
Handling Exponential Complexity

• Given n transactions and m different items:

- number of possible association rules: $O(m\,2^{m-1})$
- computation complexity: $O(n\,m\,2^{m})$
• Systematic search for all patterns, based on
support constraint [Agrawal & Srikant]:
- If {A,B} has support at least α, then both A and B
have support at least α.
- If either A or B has support less than α, then {A,B}
has support less than α.
- Use patterns of k-1 items to find patterns of k items.
Apriori Principle
• Collect single item counts. Find large items.
• Find candidate pairs, count them => large
pairs of items.
• Find candidate triplets, count them => large
triplets of items, And so on...
• Guiding Principle: Every subset of a
frequent itemset has to be frequent.
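The level-wise procedure above can be sketched as follows. This is a minimal, unoptimized illustration (a linear scan instead of a hash tree), and the market-basket database is hypothetical rather than the one pictured on the slides. `minsup` is an absolute count.

```python
from itertools import combinations


def apriori(db, minsup):
    """Return {frozenset: count} for every itemset with count >= minsup."""
    # Level 1: collect single item counts and keep the large items.
    counts = {}
    for t in db:
        for item in t:
            key = frozenset([item])
            counts[key] = counts.get(key, 0) + 1
    frequent = {s: c for s, c in counts.items() if c >= minsup}
    result = dict(frequent)
    k = 2
    while frequent:
        # Candidate generation: join frequent (k-1)-itemsets, then prune any
        # candidate that has an infrequent (k-1)-subset (guiding principle).
        candidates = set()
        for a, b in combinations(list(frequent), 2):
            union = a | b
            if len(union) == k and all(
                frozenset(sub) in frequent for sub in combinations(union, k - 1)
            ):
                candidates.add(union)
        # Counting pass over the database.
        counts = {c: 0 for c in candidates}
        for t in db:
            for c in candidates:
                if c <= t:
                    counts[c] += 1
        frequent = {s: c for s, c in counts.items() if c >= minsup}
        result.update(frequent)
        k += 1
    return result


# Hypothetical database for illustration only.
baskets = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]
large = apriori(baskets, minsup=3)
```

On this data only one triplet, {Bread, Milk, Diaper}, survives pruning as a candidate, and it is then rejected by counting.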

Illustrating Apriori Principle
Items (1-itemsets)

Pairs (2-itemsets)

Minimum Support = 3
Triplets (3-itemsets)
If every subset is considered,
6C1 + 6C2 + 6C3 = 41
With support-based pruning,
6+6+2=14
Counting Candidates
• Frequent Itemsets are found by counting
• Simple way:
- Search for each candidate in each transaction.



Association Rule Discovery: Hash
tree for fast access.
[Figure: candidate hash tree and its hash function]


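Association Rule Discovery: Subset
Operation
[Figure: hash function applied to a transaction to reach candidate leaves]

Association Rule Discovery: Subset
Operation (contd.)
[Figure: legible leaf candidates include 124, 155, 457, 458]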
Parallel Formulation of Association
Rules

• Need:
- Huge Transaction Datasets (10s of TB)
- Large Number of Candidates.

• Data Distribution:
-Partition the Transaction Database, or
- Partition the Candidates, or
- Both
Parallel Association Rules: Count
Distribution (CD)
• Each Processor has complete candidate
hash tree.
• Each Processor updates its hash tree with
local data.
• Each Processor participates in global
reduction to get global counts of candidates
in the hash tree.
• Multiple database scans per iteration are
required if the hash tree is too big for memory.
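The Count Distribution steps above can be simulated in a single process. This is only a sketch: each "processor" is a list slice, the candidate hash tree is replaced by a plain `Counter`, and the global reduction is an element-wise sum (on a real machine this would be a reduction over the interconnect). The function names and the round-robin partitioning are assumptions made for illustration.

```python
from collections import Counter


def local_counts(candidates, local_db):
    """Counts of every candidate over one processor's local transactions."""
    counts = Counter({c: 0 for c in candidates})
    for t in local_db:
        for c in candidates:
            if c <= t:
                counts[c] += 1
    return counts


def count_distribution(candidates, db, num_procs):
    """Simulate CD: replicate candidates, partition data, sum-reduce counts."""
    # Each processor gets a disjoint slice of the transactions.
    partitions = [db[p::num_procs] for p in range(num_procs)]
    # Every processor counts the complete candidate set on its local data.
    per_proc = [local_counts(candidates, part) for part in partitions]
    # Global reduction: element-wise sum of the local count vectors.
    total = Counter()
    for counts in per_proc:
        total.update(counts)
    return total


db = [frozenset(t) for t in ("ACTW", "CDW", "ACTW", "ACDW", "ACDTW", "CDT")]
candidates = [frozenset(c) for c in ("AC", "AD", "CW", "DT")]
global_counts = count_distribution(candidates, db, num_procs=3)
```

Only the count vector is communicated, which is why CD scales well as long as the candidate set fits in each processor's memory.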

CD: Illustration

[Figure: global reduction of counts across processors]

Parallel Association Rules: Data
Distribution (DD)
• Candidate set is partitioned among the
processors.
• Once local data has been partitioned, it is
broadcast to all other processors.
• High Communication Cost due to data
movement.
• Redundant work due to multiple traversals
of the hash trees.

DD: Illustration
[Figure: data broadcast between processors; all-to-all broadcast of candidates]
Parallel Association Rules: Intelligent
Data Distribution (IDD)
• Data Distribution using point-to-point
communication.
• Intelligent partitioning of candidate sets.
- Partitioning based on the first item of candidates.
- Bitmap to keep track of local candidate items.
• Pruning at the root of candidate hash tree using the
bitmap.
• Suitable for single data source such as database
• With smaller candidate set, load balancing is
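The first-item partitioning and bitmap pruning described above can be sketched as a single-process simulation. Candidates are sorted tuples; the round-robin assignment of first items to processors is a simplification assumed here (it ignores how many candidates each first item carries), and a Python set stands in for the bitmap.

```python
from itertools import combinations


def partition_by_first_item(candidates, num_procs):
    """Assign each candidate (a sorted tuple) to a processor by its first item."""
    first_items = sorted({c[0] for c in candidates})
    owner = {item: i % num_procs for i, item in enumerate(first_items)}
    parts = [set() for _ in range(num_procs)]
    for c in candidates:
        parts[owner[c[0]]].add(c)
    return parts


def count_local(part, db, k):
    """Count one processor's candidates of size k against every transaction."""
    bitmap = {c[0] for c in part}        # first items owned by this processor
    counts = dict.fromkeys(part, 0)
    for t in db:
        items = sorted(t)
        for i, first in enumerate(items):
            if first not in bitmap:
                continue                 # pruned at the root, nothing to probe
            for rest in combinations(items[i + 1:], k - 1):
                cand = (first,) + rest
                if cand in counts:
                    counts[cand] += 1
    return counts


db = [{1, 2, 3}, {1, 3, 4}, {2, 3, 4}, {1, 2, 4}, {1, 2}]
candidates = [(1, 2), (1, 3), (2, 3), (2, 4), (3, 4)]
parts = partition_by_first_item(candidates, num_procs=2)
merged = {}
for part in parts:                       # each processor sees all the data
    merged.update(count_local(part, db, k=2))
```

Because the candidate partitions are disjoint, the per-processor counts are already global counts and can simply be merged.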
IDD: Illustration



[Figure: all-to-all broadcast of candidates]
Filtering Transactions in IDD
[Figure: a transaction is filtered against the processor's bitmask]


Parallel Association Rules: Hybrid
Distribution (HD)
• Candidate set is partitioned into G groups to
just fit in main memory
- Ensures Good load balance with smaller
candidate set.
• Logical processor mesh G x P/G is formed.
• Perform IDD along the column processors
- Data movement among processors is
• Perform CD along the row processors
- Smaller number of processors in global
reduction operation.
HD: Illustration
[Figure: processor mesh with P/G processors per group]


Parallel Association Rules:
Experimental Setup
• 128-processor Cray T3D
- 150 MHz DEC Alpha (EV4)
- 64 MB of main memory per processor
- 3-D torus interconnection network with peak unidirectional
bandwidth of 150 MB/sec.
• MPI used for communications.
• Synthetic data set: avg transaction size 15 and 1000
distinct items.
• For larger data sets, multiple reads of transactions in
blocks of 1000.
• HD switches to CD after 90.7% of the total computation
is done.
Parallel Association Rules: Scaleup
Results (100K, 0.25%)
[Figure: curves for CD, IDD, HD and DD; x-axis: Number of processors]
Parallel Association Rules: Sizeup
Results (np=16, 0.25%)
[Figure: curves for CD, IDD and HD; x-axis: Number of transactions per processor (K)]
Parallel Association Rules: Response
Time (np=16, 50K)
[Figure: curves for CD, HD, IDD and simple hybrid; x-axis: Minimum support (%), from 0.5 down to 0.06]
Parallel Association Rules: Response
Time (np=64, 50K)
[Figure: curves for CD, HD, IDD and simple hybrid; x-axis: Minimum support (%), from 0.5 down to 0.04]
Parallel Association Rules:
Minimum Support Reachable

0.25 ~ 0.15 0.06 0.03

0.2 0.1 0.04 0.02


Parallel Association Rules:
Processor Configuration in HD

64 Processors and 0.04 minimum support
[Figure: processor configurations in HD]


Parallel Association Rules:
Summary of Experiments
• HD shows the same linear speedup and
sizeup behavior as that of CD.
• HD exploits Total Aggregate Main
Memory, while CD does not.
• IDD has much better scaleup behavior
than DD

Eclat Approach

• Frequent itemset lattice

• Vertical or "inverted" tid-list format
• Support counting via intersections
• Lattice decomposition: break into
• Efficient search strategies
• Independent solution of subproblems
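The vertical format and intersection-based counting above can be sketched as a short recursive search. The six-transaction database is an assumption: it is not fully legible in the source, and was chosen to be consistent with the supports that are legible (C in 6 transactions, ACTW in 3). The function name `eclat` and the depth-first recursion are illustrative, not the authors' code.

```python
def eclat(prefix, items, minsup, out):
    """Depth-first search over prefix-based equivalence classes.

    items: list of (item, tidset) pairs that are frequent under `prefix`.
    out:   dict filled with {itemset tuple: support count}.
    """
    for i, (item, tids) in enumerate(items):
        itemset = prefix + (item,)
        out[itemset] = len(tids)
        # Build the class of `itemset` by intersecting with later siblings.
        suffix = []
        for other, other_tids in items[i + 1:]:
            common = tids & other_tids          # support via intersection
            if len(common) >= minsup:
                suffix.append((other, common))
        if suffix:
            eclat(itemset, suffix, minsup, out)  # independent subproblem


# Assumed horizontal database (tid -> items), then inverted to tid-lists.
horizontal = {1: "ACTW", 2: "CDW", 3: "ACTW", 4: "ACDW", 5: "ACDTW", 6: "CDT"}
vertical = {}
for tid, items in horizontal.items():
    for item in items:
        vertical.setdefault(item, set()).add(tid)

minsup = 3
start = sorted(
    ((item, tids) for item, tids in vertical.items() if len(tids) >= minsup),
    key=lambda pair: pair[0],
)
frequent = {}
eclat((), start, minsup, frequent)
```

Each top-level call handles one class ([A], [C], [D], [T], [W]) and touches only the tid-lists of that class, which is what makes the classes independent units of work.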

Items: A = Jane Austen, C = Agatha Christie, D = Sir Arthur Conan Doyle, T = Mark Twain, W = P. G. Wodehouse

Transaction   Items
1             ACTW
3             ACTW
4             ACDW

Support     Itemsets
100% (6)    C
67% (4)     A, D, T, CD, CT, ACW
50% (3)     AT, DW, TW, ACT, ATW, CDW, CTW, ACTW



Frequent Itemset Lattice
[Figure: lattice of frequent itemsets]

[Figure: tid-list intersections in the lattice, with legible labels C & D and CD & CW]



Eclat: Lattice Decomposition

EQUIVALENCE CLASSES
[A] = { A, AC, AT, AW, ACT, ACW, ATW, ACTW }
[C] = { C, CD, CT, CW, CDW, CTW }
[D] = { D, DW }
[T] = { T, TW }
[W] = { W }
Lattice Search Strategies

• Bottom-up
-Level-wise like Apriori
• Top-down
- Start with largest element. If frequent,
done! Else look at subsets
• Hybrid
- Find long frequent/maximal itemsets
- Find remaining using bottom-up search
Bottom-up Lattice Search
[Figure: bottom-up search within the lattice]
EQUIVALENCE CLASSES
[AT] = { AT, ATW }
[AW] = { AW }

Top-down Lattice Search

Parallel Eclat/MaxEclat
• Partition tidlists among processors
• Schedule equivalence classes
• Selectively replicate database
• Solve Asynchronously/Independently
• Algorithm Space
- Eclat (Bottom-up)
- MaxEclat (Hybrid)

• Effective aggregate memory utilization

Count Distribution









Parallel Eclat



Parallel Eclat Algorithm

[A] = { AC, AT, AW }
[C] = { CD, CT, CW }
[D] = { DW }
[T] = { TW }

Processor 0 (P0)

Processor 1 (P1)
[C] = { CD, CT, CW }
[D] = { DW }

[Figure: Original Database, Partitioned Database, and the database After Tid-List Exchange]

• Database: T20.I6.D4550K
- 1000 Items
- 4,550,000 Transactions
• Machine: Eight 4-way SMP nodes
- Digital's Memory Channel (5 µs, 30Mb/s)
- 233 MHz, 256 MB, 1MB L2 Cache
• Hierarchical Parallelization
- Message passing + SMP
Parallel Performance
[Figure: two panels, Parallel Performance (1 Proc/Host) and Parallel Speedup (1 Proc/Host); x-axis: Number of Hosts (H1, H2, H4, H8)]

Parallel Performance
[Figure: two panels, Parallel Performance (2 Proc/Host) and Parallel Speedup (2 Proc/Host); x-axis: Number of Hosts (H1, H2, H4, H8)]
Tutorial overview
• Overview of KDD and data mining
• Parallel design space
• Classification
• Associations
• Sequences
• Clustering
• Future directions and summary
Discovering Sequential Associations

A set of objects with associated event occurrences.

[Figure: event timelines for Object 1 and Object 2; time axis marked 10 to 50]


Sequential Pattern Discovery:
Given a set of objects, with each object associated with its own
timeline of events, find rules that predict strong sequential
dependencies among different events.

Rules are formed by first discovering patterns. Event occurrences in
the patterns are governed by timing constraints.

(A B) (C) (D E)
[Figure: timing constraints between the elements of the pattern]
Sequential Patterns: Complexity

• Much higher computational complexity than
association rule discovery.
- O(m^k 2^(k-1)) number of possible sequential patterns having
k events, where m is the total number of possible events.
• Time constraints offer some pruning. Further use of
support based pruning contains complexity.
- A subsequence of a sequence occurs at least as many times
as the sequence.
- A sequence has no more occurrences than any of its
subsequences.
- Build sequences in increasing number of events. [GSP algorithm
by Agrawal & Srikant]
Apriori-like Approach

• Make multiple scans

• Join frequent patterns to generate new
candidate sequences
• Example:
- (2 3) (4 6) (7) joins with (3) (4 6) (7) (8) to
create (2 3) (4 6) (7) (8)
- (1 2) with (2 3) creates (1 2 3)
- (1) with (2) creates (1) (2) and (1 2)
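The three join examples above follow one rule: s1 joins with s2 when dropping the first item of s1 gives the same sequence as dropping the last item of s2. A minimal sketch of that rule is below; sequences are tuples of sorted item tuples, and the helper names are hypothetical. Pruning of candidates with infrequent subsequences is left out.

```python
def drop_first(seq):
    """Remove the first item of the first element."""
    head = seq[0][1:]
    return ((head,) if head else ()) + seq[1:]


def drop_last(seq):
    """Remove the last item of the last element."""
    tail = seq[-1][:-1]
    return seq[:-1] + ((tail,) if tail else ())


def size(seq):
    """Total number of items in the sequence."""
    return sum(len(element) for element in seq)


def join(s1, s2):
    """Return the candidate sequences produced by joining s1 with s2."""
    if size(s1) == 1 and size(s2) == 1:
        # Two single items x, y: y may follow x, or occur together with x.
        x, y = s1[0][0], s2[0][0]
        out = [((x,), (y,))]
        if x < y:
            out.append(((x, y),))
        return out
    if drop_first(s1) != drop_last(s2):
        return []
    last = s2[-1][-1]
    if len(s2[-1]) == 1:
        # The last item of s2 was its own element: append a new element.
        return [s1 + ((last,),)]
    # Otherwise it extends the last element of s1.
    return [s1[:-1] + (s1[-1] + (last,),)]


a = join(((2, 3), (4, 6), (7,)), ((3,), (4, 6), (7,), (8,)))
b = join(((1, 2),), ((2, 3),))
c = join(((1,),), ((2,),))
```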
Sequential Apriori:
Count operation
• Various possibilities for counting occurrences
of a sequence in an object's timeline
- Count only one occurrence per object.
- Count the number of span-size windows the
sequence occurs in.
- Count the number of distinct occurrences of a
sequence.
• Each event-timestamp pair considered at most once.
• Each counted occurrence has at least one new event-
timestamp pair.

Sequential Apriori:
Count operation (contd..)
• Hash Tree used for fast search of candidate
occurrences. Similar to association rule
discovery, except for following differences.
- Every event-timestamp pair in the timeline
is hashed at the root.
- Events eligible for hashing at the next level
are determined by the maximum gap (xg),
window size (ws), and span (ms)
Sequence Hash Tree for Fast Access

[Figure: candidate sequence hash tree with its hash function, probed with an object's timeline of event@timestamp pairs (time axis 0 to 20, span ms). Legible candidates include (1)(2,5) and (1)(2,3). Four Interesting Paths; the legible ones include 1@0, 2@5, 3@5; 1@0, 3@5; 4@0, 5@5; 1@12, 1@15]
Counting Leaf Level Candidates
[Figure: part of the candidate sequence hash tree matched against Object 2; Count = 12]


• Need:
- Enormity of Data.
- Memory and Disk limitations of serial algorithms
running on single processor.
• Can algorithms for non-sequential
associations be extended easily?
- Sequential Nature gives rise to complex issues:
• Longer Timelines
• Large Span values
• Large Number of Candidates
Parallel Sequential Associations:
Event Distribution (EVE)
[Figure: processors P0, P1, P2; all-to-all broadcast of support counts]

EVE Algorithm: Challenging Case

[Figure: processors P0, P1, P2; transfer hash tree states between processors, then transfer partial counts]

Event and Candidate Distribution
Rotate Candidates in
Round-Robin Manner

or in
EVECAN: Parallelizing Candidate

Candidates Stored in a Distributed Hash Table
Hash Function:
Candidate Sequence, S => h(S) = (Pi, I)

Scalable Communication Mechanism to construct and probe

The SPADE Approach
• Search space is a lattice
• Use "inverted" vertical lists
• Find support via list intersections
• Break large search space into small
independent suffix-based chunks
• Process each chunk recursively using
breadth/depth-first search
• Takes only a few data scans
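Support counting via list intersections can be sketched for the simplest case, joining the id-lists of two single items. An id-list here is a list of (customer id, time) pairs; the data below is hypothetical, and the full approach applies the same joins to sequences sharing a common part rather than only to single items.

```python
def sequence_join(idlist_x, idlist_y):
    """Id-list of X->Y: occurrences of Y after some X for the same customer."""
    earliest_x = {}
    for cid, t in idlist_x:
        earliest_x[cid] = min(t, earliest_x.get(cid, t))
    return sorted(
        {(cid, t) for cid, t in idlist_y
         if cid in earliest_x and t > earliest_x[cid]}
    )


def itemset_join(idlist_x, idlist_y):
    """Id-list of XY: X and Y in the same transaction (same cid and time)."""
    return sorted(set(idlist_x) & set(idlist_y))


def support(idlist):
    """Number of distinct customers in the id-list."""
    return len({cid for cid, _ in idlist})


# Hypothetical id-lists; times are day numbers.
D = [(1, 7), (2, 5), (3, 15), (3, 27)]
T = [(1, 25), (2, 20), (3, 27)]
W = [(1, 25), (2, 20), (3, 27)]

d_then_t = sequence_join(D, T)   # D->T
t_with_w = itemset_join(T, W)    # TW
t_then_d = sequence_join(T, D)   # T->D
```

Only the two id-lists being joined need to be in memory, which is what lets each chunk be processed independently.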
Example Database
Customer-Id   Transaction-Time   Items
1             April ...          C ...
1             April 7            ... D
1             April 25           T W
2             April 2            A C
2             April 5            D
2             April 20           C T W
3             April 15           A C D
3             April 27           D T W

Frequent sequences
Support 100% (3): C, D, T, W, C->D, C->T, C->W, D->T, D->W, TW, C->TW, D->TW
Other legible frequent sequences: A->W, AC->T, AC->D, AC->TW, C->D->TW
Frequent sequence lattice
AC->D, AC->TW, C->D->TW


Sequence lattice decomposition

Temporal Joins
[Figure: Customer-Transaction List Intersection. The (CID, TID) id-lists of P->X and P->Y are intersected to form the id-lists of P->Y->X and P->XY]


Parallel Design Space

• Data parallelism: within a class

- Idlist parallelism
• Single idlist join
• Level-wise idlist join
- Join parallelism
• Task parallelism: between classes
• Static vs. dynamic load balancing
Idlist Data Parallelism

• Single idlist join
- Split idlists on cid
- Each proc intersects over its cid range
- Barrier synchronization for each join
[Figure: Parallel Idlist Intersection; P0: cid in range 1-500, P1: cid in range 501-1000]

Idlist Data Parallelism

• Level-wise idlist join
- Process all classes at current level
- Each proc still intersects over its local DB
- Barrier synchronization and sum-reduction for each level
[Figure: Parallel Pair-Wise Intersections; P0: cid in range 1-500, P1: cid in range 501-1000]
Join Data Parallelism

• Each processor performs different
intersections within a class
• Barrier after self-joining
within a class, before
processing classes at
next level
• # synchronizations is
between Single and
Level-wise parallelism
[Figure: assign each idlist to a different processor]

• Given level 1 classes, C1, C2, C3, ...
• Assign weights W1, W2, W3, ... based on the class size (#
sequences in the class)
• Greedy scheduling of entire classes
- Sort classes on weight (descending)
- Assign class to proc with least total weight
• Each proc solves assigned classes asynchronously
• Hard to get accurate work estimate
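The greedy scheduling steps above can be sketched with a min-heap of processor loads. The class names and weights below are made up for illustration; in practice the weight is only an estimate of the work in a class, which is the weakness the last bullet points out.

```python
import heapq


def greedy_schedule(weights, num_procs):
    """Assign classes to processors, heaviest first, least-loaded first.

    weights: {class_name: weight}. Returns (assignment, loads).
    """
    heap = [(0, p) for p in range(num_procs)]   # (current load, processor)
    heapq.heapify(heap)
    assignment = {p: [] for p in range(num_procs)}
    # Sort classes on weight, descending.
    for name, w in sorted(weights.items(), key=lambda kv: kv[1], reverse=True):
        load, p = heapq.heappop(heap)           # processor with least total weight
        assignment[p].append(name)
        heapq.heappush(heap, (load + w, p))
    loads = {p: sum(weights[n] for n in names) for p, names in assignment.items()}
    return assignment, loads


weights = {"C1": 8, "C2": 6, "C3": 5, "C4": 3, "C5": 2}
assignment, loads = greedy_schedule(weights, num_procs=2)
```

The schedule is fixed before any class is solved, so a class that turns out to be much more expensive than its weight suggests cannot be rebalanced afterwards.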

Static and Dynamic
Load Balancing


Task Parallelism:
Dynamic Load Balancing
• Given level 1 classes, C1, C2, C3, ... and
their weights
• Sort classes on weight (descending) and
insert in a logical task queue
• A proc atomically grabs the first available
class and solves it completely
• Repeat until no more classes in queue
• Uses only inter-class parallelism
• Queue contention negligible since classes
are coarse-grained
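The task-queue discipline above can be sketched with Python's standard `queue` and `threading` modules. This shows the scheduling pattern only: `solve` is a placeholder for processing one class completely, the class names and weights are hypothetical, and Python threads are used here for the queue semantics rather than for real speedup.

```python
import queue
import threading


def run_dynamic(classes, num_workers, solve):
    """classes: list of (name, weight). Returns (worker_id, name) pairs."""
    tasks = queue.Queue()
    # Sort classes on weight (descending) and insert in the task queue.
    for name, _ in sorted(classes, key=lambda c: c[1], reverse=True):
        tasks.put(name)
    done = []
    lock = threading.Lock()

    def worker(wid):
        while True:
            try:
                name = tasks.get_nowait()   # atomically grab the next class
            except queue.Empty:
                return                      # no more classes in the queue
            solve(name)                     # solve the class completely
            with lock:
                done.append((wid, name))

    threads = [threading.Thread(target=worker, args=(w,)) for w in range(num_workers)]
    for t in threads:
        t.start()
    for t in threads:
        t.join()
    return done


finished = run_dynamic([("C1", 5), ("C2", 9), ("C3", 1)], 2, lambda name: None)
```

Which worker ends up with which class depends on timing, but every class is taken exactly once.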
Task Parallelism:
Recursive DLB (pSPADE)
• Process available classes in parallel
• Worst case: P-1 procs free, 1 proc busy
• Classes sorted on weight, last class
usually small (but it could be large)
• Provide mechanism for free procs to
join busy group
• At each level, get free procs, insert the
classes into a shared task queue;
process available classes in parallel
pSPADE: Recursive
Dynamic Load Balancing

Experimental Setup
• SGI Origin 2000
- 12 195 MHz R10K MIPS processors
- 4MB cache, 2GB memory
• Synthetic datasets
- C: number of transactions/customer
- T: average transaction size
- S/I: average sequence/itemset size
- D: number of customers
Data vs. Task Parallelism
[Figure: two panels, C10-T5-S4-I1.25-D1M and C20-T5-S8-I1.25-D1M, comparing Data and Task parallelism; x-axis: Number of Processors (1, 2, 4)]

Static vs. Dynamic DLB
[Figure: bars for Static, Dynamic and Recursive on C10T5S4I1.25, C20T5S8I1.25 and C20T5S8I2]
• Dynamic is up to 38% better than Static
• Recursive is up to 12% better than Dynamic
• Overall, Recursive is 44% better than Static
Parallel Performance
[Figure: two panels for C10-T5-S4-I1.25-D1M; x-axis: Number of Processors (1, 2, 4, 8)]
Parallel Performance
[Figure: two panels for C20-T5-S8-I2-D1M; x-axis: Number of Processors (1, 2, 4, 8)]
Scaleup and Support
[Figure: two panels, C5-T2.5-S4-I1.25 and C5-T2.5-S4-I1.25-D1M; left x-axis: Millions of Customers (2, 4, 8, 10); right x-axis ticks: 0.10%, 0.08%, 0.05%, 0.03%, 0.01%]
pSPADE Summary
• Task parallel better than data parallel
• Dynamic load balancing
• Asynchronous algorithms (independent
• Good locality (uses only intersections)
• Good speedup and scaleup
• What next? Gigabyte databases
Tutorial overview
• Overview of KDD and data mining
• Parallel design space
• Classification
• Associations
• Sequences
• Clustering
• Future directions and summary
What is clustering?

Given N k-dimensional feature vectors,

find a "meaningful" partition of the N
examples into c subsets or groups
• Discover the "labels" automatically
• c may be given, or "discovered"
• much more difficult than classification,
since in the latter the groups are given,
and we seek a compact description
Clustering Illustration
[Figure: Euclidean distance based clustering in 3-D space]

• Have to define some notion of

"similarity" between examples
• Goal: maximize intra-cluster similarity
and minimize inter-cluster similarity
• Feature vectors may be
- All numeric (well defined distances)
- All categorical or mixed (harder to define
similarity; geometric notions don't work)
Clustering schemes

• Distance-based
- Numeric
• Euclidean distance (root of sum of squared
differences along each dimension)
• Angle between two vectors
- Categorical
• Number of common features (categorical)
• Partition-based
- Enumerate partitions and score each
Clustering schemes

• Model-based
- Estimate a density (e.g., a mixture of
gaussians)
- Go bump-hunting
- Compute P(Feature Vector i | Cluster j)
- Finds overlapping clusters too
- Example: Bayesian clustering

Before clustering

• Normalization:
- Given three attributes
• A -- micro-seconds
• B -- milli-seconds
• C -- seconds
- Can't treat differences as the same in all
dimensions or attributes
- Need to scale or normalize for comparison
- Can assign weight for more importance
The k-means algorithm

• Specify 'k', the number of clusters

• Guess 'k' seed cluster centers
• 1) Look at each example and assign it
to the center that is closest
• 2) Recalculate the center
• Iterate on steps 1 and 2 till centers
converge or for a fixed number of times
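The two-step loop above can be sketched in plain Python. Points and centers are tuples of numbers, distance is Euclidean (`math.dist`, Python 3.8+), and the seeds are passed in by the caller. Keeping an empty cluster's old center is a choice made here for the sketch, not something the slides specify.

```python
import math


def closest(point, centers):
    """Index of the center nearest to `point`."""
    return min(range(len(centers)), key=lambda j: math.dist(point, centers[j]))


def kmeans(points, seeds, max_iter=100):
    """Return (centers, labels) after convergence or max_iter iterations."""
    centers = [tuple(c) for c in seeds]
    labels = []
    for _ in range(max_iter):
        # Step 1: assign each example to the closest center.
        labels = [closest(p, centers) for p in points]
        # Step 2: recalculate each center as the mean of its members.
        new_centers = []
        for j, old in enumerate(centers):
            members = [p for p, label in zip(points, labels) if label == j]
            if members:
                new_centers.append(tuple(sum(x) / len(members) for x in zip(*members)))
            else:
                new_centers.append(old)   # empty cluster keeps its old center
        if new_centers == centers:        # centers converged
            break
        centers = new_centers
    return centers, labels


points = [(0, 0), (0, 1), (1, 0), (10, 10), (10, 11), (11, 10)]
centers, labels = kmeans(points, seeds=[(0, 0), (10, 10)])
```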
K-means algorithm
[Figure: data points with initial seeds]
K-means algorithm
[Figure: data points with new centers]

[Figure: data points with final centers]


Operations in k-means
• Main Operation: Calculate distance to
all k means or centroids
• Other operations:
- Find the closest centroid for each point
- Calculate mean squared error (MSE) for all
points
- Recalculate centroids

Parallel k-means
• Divide N points among P processors
• Replicate the k centroids
• Each processor computes distance of
each local point to the centroids
• Assign points to closest centroid and
compute local MSE
• Perform reduction for global centroids
and global MSE value
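One iteration of the parallel formulation above can be simulated in a single process. Each list in `partitions` stands for the points held by one processor; the "reduction" is an ordinary sum over the per-processor results. The function names are illustrative, and a real implementation would use a message-passing reduction instead of the list comprehension.

```python
import math


def local_step(points, centers):
    """Per-processor work: partial sums, counts and squared error."""
    k, dim = len(centers), len(centers[0])
    sums = [[0.0] * dim for _ in range(k)]
    counts = [0] * k
    sq_err = 0.0
    for p in points:
        # Distance of the local point to every replicated centroid.
        j = min(range(k), key=lambda c: math.dist(p, centers[c]))
        counts[j] += 1
        for d in range(dim):
            sums[j][d] += p[d]
        sq_err += math.dist(p, centers[j]) ** 2
    return sums, counts, sq_err


def parallel_iteration(partitions, centers):
    """Combine local results into global centroids and global MSE."""
    k, dim = len(centers), len(centers[0])
    results = [local_step(part, centers) for part in partitions]
    # Reduction: add up partial sums, counts and squared errors.
    sums = [[sum(r[0][j][d] for r in results) for d in range(dim)] for j in range(k)]
    counts = [sum(r[1][j] for r in results) for j in range(k)]
    mse = sum(r[2] for r in results) / sum(counts)
    new_centers = []
    for j in range(k):
        if counts[j]:
            new_centers.append(tuple(s / counts[j] for s in sums[j]))
        else:
            new_centers.append(tuple(centers[j]))
    return new_centers, mse


partitions = [[(0.0, 0.0), (10.0, 10.0)], [(0.0, 2.0), (10.0, 12.0)]]
new_centers, mse = parallel_iteration(partitions, [(0.0, 0.0), (10.0, 10.0)])
```

Only k partial sums, k counts and one error value leave each processor, regardless of how many points it holds.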
Gaussian mixture models

• Drawbacks of K-means
- Doesn't do well with overlapping clusters
- Clusters easily pulled off center by outliers
- Each record is either in or out of a cluster;
no notion of some records being more or
less likely than others to really belong to
the cluster they have been assigned
• GMM: probabilistic variant of K-means
• Choose K seeds: means of a gaussian

• Estimation: calculate probability of
belonging to a cluster based on distance
• Maximization: move mean of gaussian
to centroid of data set, weighted by the
contribution of each point
• Repeat till means don't move
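The estimation and maximization steps above can be sketched for one-dimensional data. This is a minimal illustration under assumptions not stated on the slide: each cluster also carries a variance and a mixing weight, variances are floored at a small constant to avoid division by zero, and the data and starting values are made up.

```python
import math


def gaussian(x, mean, var):
    """Density of a 1-D gaussian at x."""
    return math.exp(-((x - mean) ** 2) / (2 * var)) / math.sqrt(2 * math.pi * var)


def em_step(data, means, variances, weights):
    """One estimation + maximization pass; returns updated parameters."""
    k = len(means)
    # Estimation: probability that each point belongs to each cluster.
    resp = []
    for x in data:
        dens = [weights[j] * gaussian(x, means[j], variances[j]) for j in range(k)]
        total = sum(dens)
        resp.append([d / total for d in dens])
    # Maximization: move each mean to the weighted centroid of the data.
    new_means, new_vars, new_weights = [], [], []
    for j in range(k):
        nj = sum(r[j] for r in resp)
        mean = sum(r[j] * x for r, x in zip(resp, data)) / nj
        var = sum(r[j] * (x - mean) ** 2 for r, x in zip(resp, data)) / nj
        new_means.append(mean)
        new_vars.append(max(var, 1e-6))   # floor keeps the density finite
        new_weights.append(nj / len(data))
    return new_means, new_vars, new_weights, resp


data = [0.0, 0.2, -0.2, 5.0, 5.2, 4.8]
means, variances, weights = [1.0, 4.0], [1.0, 1.0], [0.5, 0.5]
for _ in range(20):                        # repeat; here a fixed number of times
    means, variances, weights, resp = em_step(data, means, variances, weights)
```

Unlike k-means, every point contributes to every mean, in proportion to its membership probability.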
Deviation/outlier detection

• Find points that are very different from

the other points in the dataset
• Could be "noise", that causes problems
for classification or clustering
• Could be the really "interesting" points,
for example, in fraud detection, we are
mainly interested in finding the
deviations from the norm
Deviation detection


K-nearest neighbors

• Classification technique to assign a

class to a new example
• Find k-nearest neighbors, i.e., most
similar points in the dataset (compare
against all points!)
• Assign the new case to the same class
to which most of its neighbors belong
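The procedure above amounts to a sort by distance followed by a majority vote. A minimal sketch, with hypothetical labelled points and Euclidean distance assumed as the similarity measure:

```python
import math
from collections import Counter


def knn_classify(query, examples, k):
    """examples: list of (point, label). Returns the majority label."""
    # Compare against all points and keep the k most similar ones.
    neighbors = sorted(examples, key=lambda e: math.dist(query, e[0]))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]


examples = [
    ((0, 0), "0"), ((0, 1), "0"), ((1, 0), "0"),
    ((5, 5), "+"), ((5, 6), "+"),
]
near_origin = knn_classify((0.5, 0.5), examples, k=3)
near_plus = knn_classify((5, 5.5), examples, k=3)
```

The full scan over the dataset for every query is the cost that the "compare against all points!" remark refers to.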

K-nearest neighbors
[Figure: a new example among its nearest neighbors: 5 of class 0, 3 of class +]

Tutorial overview
• Overview of KDD and data mining
• Parallel design space
• Classification
• Associations
• Sequences
• Clustering
• Future directions and summary
Large-scale Parallel KDD Systems

• Terabyte-sized datasets
• Centralized or distributed datasets
• Incremental changes
• Heterogeneous data sources
• Pre-processing, mining, post-processing
• Modular (rapid development)
• Benchmarking (algorithm comparison)
Research Directions
• Fast algorithms: different mining tasks
-Classification, clustering, associations, etc.
-Incorporating concept hierarchies
• Parallelism and scalability
- Millions of records
- Thousands of attributes/dimensions
- Single pass algorithms
- Sampling
- Parallel I/O and file systems
Research Directions (contd.)

• Data locality and type

-Distributed data sources (www)
- Text and multimedia mining
- Spatial data mining
• Incremental mining: refine knowledge
as data changes
• Interactivity: anytime mining

Research Directions (contd.)

• Tight database integration

- Push common primitives inside DBMS
-Use multiple tables
- Use efficient indexing techniques

- Caching strategies for sequence of data
mining operations
- Data mining query language and parallel
query optimization
Research Directions (contd.)

• Understandability: Too many patterns

-Incorporate background knowledge
-Integrate constraints
- Meta-level mining
- Visualization

• Usability: build a complete system

-Pre-processing, mining, post-processing,
persistent management of mined results
Summary Of the Tutorial

• Data mining is a rapidly growing field
- Fueled by enormous data collection rates, and
need for intelligent analysis for business and
scientific gains.
• The large and high-dimensional nature of the data
requires new analysis techniques and
• Scalable, fast parallel algorithms are
becoming indispensable.
• Many research and commercial