
Tutorial PM-3

High Performance Data Mining

Vipin Kumar (University of Minnesota)


Mohammed Zaki (Rensselaer Polytechnic)

High Performance Data Mining

Vipin Kumar Mohammed J. Zaki


Computer Science Dept. Computer Science Dept.
University of Minnesota Rensselaer Polytechnic Institute
Minneapolis, MN, USA. Troy, NY, USA.
kumar@cs.umn.edu zaki@cs.rpi.edu
www.cs.umn.edu/~kumar www.cs.rpi.edu/~zaki

High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 11

Tutorial overview
• Overview of KDD and data mining
• Parallel design space
• Classification
• Associations
• Sequences
• Clustering
• Future directions and summary
Overview of data mining

• What is Data Mining/KDD?


• Why is KDD necessary?
• The KDD process
• Mining operations and methods
• Core issues in KDD


What is data mining?

The iterative and interactive process of discovering valid, novel,
useful, and understandable patterns or models in massive databases.

What is data mining?

• Valid: generalize to the future


• Novel: what we don't know
• Useful: be able to take some action
• Understandable: leading to insight
• Iterative: takes multiple passes
• Interactive: human in the loop

Data mining goals

• Prediction
- What?

- Opaque

• Description
- Why?

- Transparent

Data mining operations

• Verification driven
- Validating hypothesis
- Querying and reporting (spreadsheets, pivot tables)
- Multidimensional analysis (dimensional summaries); On-Line Analytical Processing
- Statistical analysis


Data mining operations

• Discovery driven
- Exploratory data analysis
- Predictive modeling
- Database segmentation
- Link analysis
- Deviation detection

KDD Process

[Figure: the KDD process pipeline — data selection, cleaning and transformation, mining, and interpretation/evaluation, iterating as needed]

Data mining process

• Understand application domain


- Prior knowledge, user goals
• Create target dataset
- Select data, focus on subsets
• Data cleaning and transformation
- Remove noise and outliers, handle missing values
- Select features, reduce dimensionality
Data mining process
• Apply data mining algorithm
-Associations, sequences, classification,
clustering, etc.
• Interpret, evaluate and visualize patterns
- What's new and interesting?
- Iterate if needed
• Manage discovered knowledge
- Close the loop

Why Mine Data? Commercial Viewpoint

• Lots of data is being collected and warehoused.
• Computing has become affordable.
• Competitive pressure is strong.
- Provide better, customized services for an edge.
- Information is becoming a product in its own right.
Why Mine Data?
Scientific Viewpoint...
• Data collected and stored at enormous speeds (Gbyte/hour)
- remote sensors on a satellite
- telescopes scanning the skies
- microarrays generating gene expression data
- scientific simulations generating terabytes of data
• Traditional techniques are infeasible for raw data
• Data mining for data reduction
- cataloging, classifying, segmenting data
- Helps scientists in hypothesis formation


Origins of Data Mining

• Draws ideas from machine learning/AI,


pattern recognition, statistics, database
systems, and data visualization.
• Traditional techniques may be unsuitable
- Enormity of data
- High dimensionality of data
- Heterogeneous, distributed nature of data
Data mining methods

• Predictive modeling (classification,


regression)
• Segmentation (clustering)
• Dependency modeling (graphical
models, density estimation)
• Summarization (associations)
• Change and deviation detection

Data mining techniques

• Association rules: detect sets of attributes that


frequently co-occur, and rules among them, e.g.,
90% of the people who buy cookies also buy
milk (60% of all grocery shoppers buy both)
• Sequence mining (categorical): discover
sequences of events that commonly occur
together, e.g., in a set of DNA sequences
ACGTC is followed by GTCA after a gap of 9,
with 30% probability
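Support and confidence, the two numbers quoted in the rule above, are simple to compute. The following is an illustrative sketch (not from the tutorial) over a toy transaction list; the items and data are invented for the example.

```python
# Hypothetical sketch: computing support and confidence for one
# association rule (cookies -> milk) over a toy transaction list.

def rule_stats(transactions, antecedent, consequent):
    """Return (support, confidence) for antecedent -> consequent."""
    n = len(transactions)
    both = sum(1 for t in transactions
               if antecedent <= t and consequent <= t)   # <= is subset test
    ante = sum(1 for t in transactions if antecedent <= t)
    support = both / n                  # fraction of all baskets with both
    confidence = both / ante if ante else 0.0   # of antecedent baskets
    return support, confidence

transactions = [
    {"cookies", "milk"}, {"cookies", "milk", "bread"},
    {"cookies"}, {"milk"}, {"bread", "milk"},
]
sup, conf = rule_stats(transactions, {"cookies"}, {"milk"})
```

Here 2 of 5 baskets contain both items (support 0.4), and 2 of the 3 cookie baskets also contain milk (confidence 2/3).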
Data mining techniques

• Classification and regression: assign a new data


record to one of several predefined categories or
classes. Regression deals with predicting
real-valued fields. Also called supervised learning.
• Clustering: partition the dataset into subsets or
groups such that elements of a group share a
common set of properties, with high within-group
similarity and low inter-group similarity. Also
called unsupervised learning.


Data mining techniques

• Similarity search: given a database of objects,


and a "query" object, find the object(s) that are
within a user-defined distance of the queried
object, or find all pairs within some distance of
each other.
• Deviation detection: find the record(s) that is
(are) the most different from the other records,
i.e., find all outliers. These may be thrown away
as noise or may be the "interesting" ones.
Data mining techniques
Many other methods, such as
- Neural networks
- Genetic algorithms
- Hidden Markov models
- Time series
- Bayesian networks
- Soft computing: rough and fuzzy sets


Main challenges for KDD

• Scalability
- Efficient and sufficient sampling
- In-memory vs. disk-based processing
- High performance computing
• Automation
- Ease of use
- Using prior knowledge

Tutorial overview
• Overview of KDD and data mining
• Parallel design space
• Classification

• Associations
• Sequences
• Clustering
• Future directions and summary

Speeding up data mining

• Data oriented approach


- Discretization
- Feature selection
- Feature construction (PCA)
- Sampling

• Methods oriented approach


- Efficient and scalable algorithms

Speeding up data mining (contd.)

• Methods oriented approach (contd.)


- Parallel data mining
• Task or control parallelism
• Data parallelism
• Hybrid parallelism
- Distributed data mining
• Voting
• Meta-learning, etc.


Need for Parallel Formulations

• Need to handle very large datasets.


• Memory limitations of sequential computers
cause sequential algorithms to make multiple
expensive I/O passes over data.
• Need for scalable, efficient (fast) data mining
computations
- gain competitive advantage.
- Handle larger data for greater accuracy in shorter
times.
Parallel Design Space
• Parallel architectures
- Distributed memory
- Shared disk
- Shared memory
- Hybrid cluster of SMPs
• Task and data parallelism
• Static and dynamic load balancing
• Horizontal and vertical data layout
• Data and concept partitioning

Parallel Hardware
• Distributed-memory machines
- Each processor has local memory and disk
- Communication via message-passing
- Hard to program: explicit data distribution
- Goal: minimize communication
• Shared-memory machines
- Shared global address space and disk
- Communication via shared memory variables
- Ease of programming
- Goal: maximize locality, minimize false sharing
• Current trend: Cluster of SMPs
Distributed Memory Architecture
(Shared Nothing)
[Figure: distributed-memory (shared-nothing) architecture — processors, each with local memory and disk, connected by an interconnection network]

DMM: Shared Disk Architecture

[Figure: shared-disk architecture — processors with local memory sharing the disks through the interconnection network]
Shared Memory Architecture
(Shared Everything)
[Figure: shared-memory (shared-everything) architecture — processors sharing a global memory and disks through the interconnection network]

Cluster of SMPs
[Figure: cluster of SMPs — SMP nodes connected by an interconnection network]

Task vs. Data Parallelism
• Data Parallelism
- Data partitioned among P processors
- Each processor performs the same work
on local partition
• Task Parallelism
- Each processor performs different
computation
- Data may be (selectively) replicated or
partitioned
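The contrast above can be sketched in a few lines. This is an illustrative toy, using Python's `concurrent.futures` thread pool as a stand-in for the processors (the data, worker counts, and functions are all invented for the example).

```python
# Data parallelism: every worker runs the SAME function on its own
# partition of the data. Task parallelism: workers run DIFFERENT
# computations, here over the same (replicated) data.

from concurrent.futures import ThreadPoolExecutor

data = list(range(100))

def data_parallel(p=4):
    parts = [data[i::p] for i in range(p)]   # partition the data p ways
    with ThreadPoolExecutor(p) as ex:
        return sum(ex.map(sum, parts))       # same work per partition

def task_parallel():
    tasks = [min, max, sum]                  # different computations
    with ThreadPoolExecutor(len(tasks)) as ex:
        futures = [ex.submit(t, data) for t in tasks]
        return [f.result() for f in futures]

total = data_parallel()
stats = task_parallel()
```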

Task vs. Data Parallelism


[Figure: task parallelism assigns different computations to P0, P1, P2; data parallelism runs the uniprocessor computation on partitioned data]
Static vs. Dynamic Load Balance
• Static Load Balancing
- Work is initially divided (heuristically)
- No subsequent computation or data movement
• Dynamic Load Balancing
- Steal work from heavily loaded processors
- Reassign to lightly loaded processors
- Important for irregular computations,
heterogeneous environments, etc.

Horizontal/Vertical Data Format


Horizontal Vertical

[Figure: the same records (Refund, Marital Status, Taxable Income, Cheat) stored in horizontal (row-wise) and vertical (column-wise) layout]
Data and concept partitioning

• Shared
- SMP or shared disk architectures

• Replicated
- Partially or totally

• Partitioned
- Round-robin partitioning
- Hash partitioning
- Range partitioning
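The three partitioning schemes above can be sketched directly. This is an illustrative example only; the choice of the first record field as the partitioning key is an assumption made for the demo.

```python
# Round-robin, hash, and range partitioning of records across p parts.
import bisect

def round_robin(records, p):
    parts = [[] for _ in range(p)]
    for i, r in enumerate(records):
        parts[i % p].append(r)               # cycle through partitions
    return parts

def hash_partition(records, p, key=lambda r: r[0]):
    parts = [[] for _ in range(p)]
    for r in records:
        parts[hash(key(r)) % p].append(r)    # owner chosen by hash of key
    return parts

def range_partition(records, p, bounds, key=lambda r: r[0]):
    # bounds: sorted upper bounds; keys <= bounds[i] land in partition i
    parts = [[] for _ in range(p)]
    for r in records:
        parts[bisect.bisect_left(bounds, key(r))].append(r)
    return parts

records = [(i, f"rec{i}") for i in range(10)]
rr = round_robin(records, 3)
rg = range_partition(records, 3, bounds=[4, 8])
```

Round-robin balances record counts; hash partitioning routes equal keys to the same owner; range partitioning keeps key order but depends on well-chosen bounds.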

Tutorial overview
• Overview of KDD and data mining
• Parallel design space
• Classification
• Associations

• Sequences
• Clustering
• Future directions and summary
What is Classification?

Classification is the process of assigning


new objects to predefined categories or
classes

• Given a set of labeled records


• Build a model (decision tree)
• Predict labels for future unlabeled
records

Classification learning

• Supervised learning (labels known)


• Example described in terms of attributes
- Categorical (unordered symbolic values)
- Numeric (integers, reals)
• Class (output/predicted attribute):
categorical for classification, numeric for
regression

Classification learning

• Training set: set of examples, where


each example is a feature vector (i.e., a
set of (attribute,value) pairs) with its
associated class. Model built on this set.
• Test set: a set of examples disjoint from
the training set, used for testing the
accuracy of a model.


Classification Models
• Some models are better than others
- Accuracy

- Understandability

• Models range from easy to understand to incomprehensible
- Decision trees (easier)
- Rule induction
- Regression models
- Neural networks (harder)
Decision Tree Classification

Splitting attributes: Refund, MarSt, TaxInc.

[Figure: decision tree with Refund at the root (Yes --> NO), MarSt under the No branch (Married --> NO), and a TaxInc test (< 80K --> NO, >= 80K --> YES) under Single/Divorced]

The splitting attribute at a node is determined based on the Gini index.


From Tree to Rules

1) Refund = Yes => NO
2) Refund = No and MarSt in {Single, Divorced} and TaxInc < 80K => NO
3) Refund = No and MarSt in {Single, Divorced} and TaxInc >= 80K => YES
4) Refund = No and MarSt in {Married} => NO

[Figure: the decision tree from the previous slide, annotated with these four root-to-leaf rules]
Classification algorithm

• Build tree
- Start with the data at the root node
- Select an attribute and formulate a logical test on that attribute
- Branch on each outcome of the test, and move the subset of examples satisfying that outcome to the corresponding child node

Classification algorithm
- Recurse on each child node
- Repeat until leaves are "pure", i.e., have
example from a single class, or "nearly
pure", i.e., majority of examples are from
the same class
• Prune tree
- Remove subtrees that do not improve classification accuracy
- Avoid over-fitting, i.e., training-set-specific artifacts
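The build/recurse procedure described above fits in a short function. The following is a minimal sketch, not a real library: the split-function interface and the trivial mean-based splitter are assumptions made so the demo runs end to end.

```python
# Grow a binary tree until leaves are pure (or nearly pure, via min_size),
# delegating split selection to split_fn (the Gini machinery of the
# following slides would plug in there).
from collections import Counter

def build_tree(rows, labels, split_fn, min_size=1):
    counts = Counter(labels)
    majority = counts.most_common(1)[0][0]
    if len(counts) == 1 or len(rows) <= min_size:
        return majority                      # pure / nearly pure leaf
    split = split_fn(rows, labels)           # (attr index, threshold)
    if split is None:
        return majority
    attr, thr = split
    left = [i for i, r in enumerate(rows) if r[attr] < thr]
    right = [i for i, r in enumerate(rows) if r[attr] >= thr]
    if not left or not right:
        return majority                      # degenerate split: stop
    return (attr, thr,
            build_tree([rows[i] for i in left], [labels[i] for i in left],
                       split_fn, min_size),
            build_tree([rows[i] for i in right], [labels[i] for i in right],
                       split_fn, min_size))

def mean_split(rows, labels):
    # toy splitter for the demo: always split attribute 0 at its mean
    vals = [r[0] for r in rows]
    return (0, sum(vals) / len(vals))

tree = build_tree([[1], [2], [10], [11]], ["a", "a", "b", "b"], mean_split)
```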
Build tree
• Evaluate split-points for all attributes
• Select the "best" point and the "winning"
attribute
• Split the data into two
• Breadth/depth-first construction

• CRITICAL STEPS:
- Formulation of good split tests
- Selection measure for attributes


How to capture good splits?

• Occam's razor: Prefer the simplest


hypothesis that fits the data
• Minimum message/description length
- Dataset D
- Hypotheses H1, H2, ..., Hk describing D
- MML(Hi) = Mlength(Hi) + Mlength(D|Hi)
- Pick the Hi with minimum MML
• Mlength given by Gini index, Gain, etc.
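The selection rule above is just an argmin over model-plus-data code lengths. A toy illustration (all bit counts are invented numbers, not real encodings):

```python
# Among candidate hypotheses, pick the one minimizing
# Mlength(H) + Mlength(D|H): a tiny tree with many errors and a huge
# tree with none can both lose to a moderate tree with few errors.

candidates = {
    "H1": {"model_bits": 10, "data_bits": 40},   # simple, many errors
    "H2": {"model_bits": 25, "data_bits": 12},   # moderate, few errors
    "H3": {"model_bits": 60, "data_bits": 2},    # huge, near-zero errors
}

def mml(h):
    return candidates[h]["model_bits"] + candidates[h]["data_bits"]

best = min(candidates, key=mml)
```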
Tree pruning using MDL
• Data encoding: sum classification errors
• Model encoding:
- Encode the tree structure
- Encode the split points
• Pruning: choose the smallest-length option
- Convert to leaf
- Prune left or right child
- Do nothing


Hunt's Method
• Attributes: Refund (Yes, No), Marital Status
(Single, Married, Divorced), Taxable Income
• Class: Cheat, Don't Cheat

What's really happening?
[Figure: scatter plot of the training records in the (Marital Status, Income) plane, with Cheat and Don't Cheat points; the tree's tests — Married vs. not, Income < 80K — carve the plane into class regions]

Finding good split points

• Use Gini index for partition purity

    Gini(S) = 1 - sum(p_i^2, i = 1..c)

    Gini(S1, S2) = (n1/n) * Gini(S1) + (n2/n) * Gini(S2)

• If S is pure, Gini(S) = 0
• Find the split point with minimum Gini
• Only need class distributions
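The last bullet is the key implementation point: Gini needs only the class counts, not the records. A minimal sketch of both formulas, checked against the worked example on the next slide (class counts 14/9 vs. 1/18):

```python
# Gini impurity of a node and weighted Gini of a binary split,
# computed purely from class-count vectors.

def gini(counts):
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts)

def gini_split(left_counts, right_counts):
    n1, n2 = sum(left_counts), sum(right_counts)
    n = n1 + n2
    return (n1 / n) * gini(left_counts) + (n2 / n) * gini(right_counts)

pure = gini([10, 0])                     # a pure node: Gini = 0
split = gini_split([14, 9], [1, 18])     # slide's example: about 0.31
```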
Finding good split points

[Figure: two candidate splits of the (Marital Status, Income) scatter plot — a vertical split on Income (Left/Right) and a horizontal split on Marital Status (Top/Bottom) — with the resulting class counts]

            NO   YES                  NO   YES
    Left    14     9       Top         5    23
    Right    1    18       Bottom     10     4

    Gini(split) = 0.31     Gini(split) = 0.34



Categorical Attributes:
Computing Gini Index
• For each distinct value, gather counts for each class in the dataset
• Use the count matrix to make decisions

[Figure: CarType count matrices for a multi-way split (one column per value) and for a two-way split (find the best binary partition of the values), each with its Gini index]
Continuous Attributes:
Computing Gini Index...
• For efficient computation: for each attribute,
- Sort the attribute on values
- Linearly scan these values, each time updating the count
matrix and computing gini index
- Choose the split position that has the least gini index
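The sort-then-scan procedure above can be written compactly: each step of the sweep moves one record from the right counts to the left counts, so every candidate Gini costs O(number of classes). A sketch under the assumption of integer class labels:

```python
# Sort one continuous attribute, sweep the candidate split positions
# while incrementally updating class counts, and keep the position
# with the lowest weighted Gini index.

def gini(counts):
    n = sum(counts)
    return (1.0 - sum((c / n) ** 2 for c in counts)) if n else 0.0

def best_split(values, labels, classes=(0, 1)):
    pairs = sorted(zip(values, labels))
    left = {c: 0 for c in classes}
    right = {c: 0 for c in classes}
    for _, y in pairs:
        right[y] += 1
    n = len(pairs)
    best = (float("inf"), None)
    for i in range(1, n):                    # split between i-1 and i
        y = pairs[i - 1][1]
        left[y] += 1
        right[y] -= 1
        if pairs[i - 1][0] == pairs[i][0]:
            continue                         # no split between equal values
        g = (i / n) * gini(list(left.values())) + \
            ((n - i) / n) * gini(list(right.values()))
        if g < best[0]:
            best = (g, (pairs[i - 1][0] + pairs[i][0]) / 2)
    return best

g, threshold = best_split([60, 70, 75, 85, 90, 95], [0, 0, 0, 1, 1, 1])
```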

[Figure: a sorted attribute with candidate split positions between consecutive values, and the count matrix updated at each position]

C4.5
• Simple depth-first construction.
• Sorts Continuous Attributes at each node.
• Needs entire data to fit in memory.
• Unsuitable for Large Datasets.
- Needs out-of-core sorting.
• Classification Accuracy shown to
improve when entire datasets are
used!

SPRINT [Shafer,Agrawal,Mehta]

Attribute lists:

[Figure: SPRINT's per-attribute lists of (value, rid, class) entries, with continuous lists kept in sorted order]

SPRINT (contd.)
• The arrays of the continuous attributes are pre-sorted.
- The sorted order is maintained during each split.
• The classification tree is grown in a breadth-first fashion.
• Class information is clubbed with each attribute list.
• Attribute lists are physically split among the tree nodes.
• The split-determining phase is just a linear scan of the lists at each node.
• A hashing scheme is used in the splitting phase.
- tids of the splitting attribute are hashed with the tree node as the key.
- The remaining attribute arrays are split by querying this hash structure (lookup table).

SPRINT Disadvantages

• Size of hash table is O(N) for top levels


of the tree.
• If hash table does not fit in memory
(mostly true for large datasets), then
build in parts so that each part fits.
- Multiple expensive I/O passes over the
entire dataset.


Constructing a Decision Tree in


Parallel
• Partitioning of data only (categorical attributes)
- a global reduction per node is required
- a large number of classification tree nodes gives high communication cost

[Figure: each processor holds a horizontal partition of the data; per-node count matrices (Good/Bad) are combined by a global reduction]

Constructing a Decision Tree in
Parallel
• Partitioning of classification tree nodes
- natural concurrency
- load imbalance, as the amount of work associated with each node varies
- child nodes use the same data as used by the parent node
- loss of locality
- high data movement cost

[Figure: 10,000 training records distributed among classification tree nodes]


Challenges in Constructing a Parallel Classifier
• Partitioning of data only
- large number of classification tree nodes gives high communication
cost

• Partitioning of classification tree nodes


- natural concurrency
- load imbalance as the amount of work associated with each node
varies
- child nodes use the same data as used by parent node
- loss of locality
- high data movement cost

• How do we efficiently perform the computation in parallel?
Synchronous Tree Construction Approach

[Figure: the data partitioned across Proc 0-3; at each tree node, class distribution information is exchanged among all processors]

+ No data movement is required
- Load imbalance (can be eliminated by breadth-first expansion)
- High communication cost (becomes too high in lower parts of the tree)


Partitioned Tree Construction Approach


[Figure: both the data and the tree nodes are partitioned across Proc 0-3; after a few levels each group of processors works independently on its own subtree]

+ Highly concurrent
- High communication cost due to excessive data movements
- Load imbalance
Hybrid Parallel Formulation

[Figure: the synchronous tree construction approach is used above the computation frontier (here at depth 3); below it, processor partitions 1 and 2 independently apply the partitioned tree construction approach]

Load Balancing

[Figure: Step 1 — exchange data between processor partitions; Step 2 — load balance within processor partitions]

Splitting Criterion

• Switch to the partitioned tree construction approach when

    2 * Communication Cost > Moving Cost + Load Balancing Cost

• The splitting criterion ensures

    Communication Cost < 2 * Communication Cost(optimal)

G. Karypis and V. Kumar, IEEE Transactions on Parallel and Distributed Systems, Oct. 1994


Experimental Results

• Data set
- function 2 data set discussed in SLIQ paper
(Mehta, Agrawal and Rissanen, EDBT'96)
- 2 class labels, 3 categorical and 6 continuous
attributes
• IBM SP2 with 128 processors
- 66.7 MHz CPU with 256 MB real memory
- AIX version 4
- high performance switch

Speedup Comparison of the Three
Parallel Algorithms

[Figure: speedup curves of the partitioned, hybrid, and synchronous algorithms on 0.8 million and 1.6 million examples, for up to 16 processors]

Splitting Criterion Verification in the


Hybrid Algorithm
    Splitting Criterion Ratio = 2 * Communication Cost / (Moving Cost + Load Balancing Cost)

[Figure: runtimes when switching at different values of the splitting criterion ratio (log2 scale, x = 0 corresponds to ratio = 1) — 0.8 million examples on 8 processors and 1.6 million examples on 16 processors]

Speedup of the Hybrid Algorithm with
Different Size Data Sets
[Figure: speedup curves of the hybrid algorithm for 0.8, 1.6, 3.2, 6.4, 12.8, and 25.6 million examples, for up to 140 processors]


Scaleup of the Hybrid Algorithm


[Figure: scaleup of the hybrid algorithm — runtimes for 50K examples per processor as the number of processors grows]

Summary of Algorithms for
Categorical Attributes
• Synchronous Tree Construction Approach
- no data movement required
- high communication cost as tree becomes bushy
• Partitioned Tree Construction Approach
- processors work independently once partitioned completely
- load imbalance and high cost of data movement
• Hybrid Algorithm
- combines good features of two approaches
- adapts dynamically according to the size and shape of trees


Handling Continuous Attributes

• Sort continuous attributes at each node of the


tree (as in C4.5). Expensive, hence
Undesirable!
• Discretize continuous attributes
- CLOUDS (Alsabti, Ranka, and Singh, 1998)
- SPEC (Srivastava, Han, Kumar, and Singh, 1997)
• Use a pre-sorted list for each continuous attribute
- SPRINT(Shafer, Agrawal, and Mehta, VLDB'96)
- ScalParC (Joshi, Karypis, and Kumar, IPPS'98)
Design Approach
• Goal: Scalability in both Runtime and
Memory Requirements.
• Parallelization overhead: To = p * Tp - Ts
• Need To = O(Ts) for scalability; the per-processor
overhead should not exceed O(Ts/p).
• Two components of To:
- Load Imbalance.

- Communication time.
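A quick numeric illustration of the overhead definition above (the times are invented for the example): with Ts = 100 s serial and Tp = 15 s on p = 8 processors, the overhead is To = 8 * 15 - 100 = 20 s, i.e., efficiency 100/120.

```python
# To = p * Tp - Ts: total processor-time spent beyond the serial work
# (load imbalance + communication). Efficiency = Ts / (p * Tp).

def overhead(p, t_parallel, t_serial):
    return p * t_parallel - t_serial

def efficiency(p, t_parallel, t_serial):
    return t_serial / (p * t_parallel)

To = overhead(8, 15.0, 100.0)
eff = efficiency(8, 15.0, 100.0)
```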

Load Balancing
[Figure: two continuous attribute lists B1 and B2 — (value, rid, cid) entries — partitioned between processors P0 and P1; after the split point SP is applied, the lists are repartitioned so that each processor again holds an equal share of each child's records. SP: split point, rid: record id, cid: class label]
Parallel SPRINT

• Attribute lists are partitioned among processors
- Each processor gets N/p records of each attribute list
• Attribute lists of continuous attributes are pre-sorted
• Split Determining Phase
- Categorical attributes
  • Local construction of count matrices, followed by a reduction to add them up
- Continuous attributes
  • Prefix-scan of class counts, followed by local Gini computations and a Gini-index reduction to find the best split point
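The prefix-scan step can be simulated in a few lines. This is a single-process sketch of the idea only — each "processor" is a list chunk, and the exclusive prefix of class counts is what real message passing (e.g., an MPI scan) would deliver:

```python
# Each (simulated) processor holds a chunk of a pre-sorted attribute
# list and needs the class counts of all records BEFORE its chunk, so
# its local Gini computations see globally correct left/right counts.

def prefix_counts(chunks, n_classes=2):
    """chunks: per-processor lists of class labels in attribute-list order.
    Returns, per processor, the exclusive prefix of class counts."""
    running = [0] * n_classes
    before = []
    for chunk in chunks:
        before.append(list(running))     # counts preceding this chunk
        for y in chunk:
            running[y] += 1
    return before

chunks = [[0, 0, 1], [0, 1, 1], [1, 1, 0]]   # 3 simulated processors
before = prefix_counts(chunks)
```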


Parallelizing the Split Determination Phase

• Easy.

[Figure: for a categorical attribute (CarType), each processor builds local count matrices that a parallel reduction combines into global ones; for a continuous attribute (Age), a parallel prefix scan gives each processor the class counts preceding its chunk of the sorted list]
Parallelizing Splitting Phase..

• SPRINT Approach:
- Builds the hash table on all processors by an all-to-all broadcast operation.
- So, O(N) data is sent and received by each processor --> not scalable in runtime or in memory requirements.


Scalable Parallel Formulation of


Splitting Phase [ScalParC]
• Challenging!!
- Requirement for consistent splitting, given
the relatively random sorted order among
attributes,
• Four cases:

Categorical/Categorical
Splitting a categorical attribute list requires access
exactly to those entries of the hash-table that are
created by the same processor


Continuous/Continuous

[Figure: attribute lists spread across processors P1-P3 — the hash-table entries a processor needs lie on other processors]

• The hash-table entries required by a processor are generated by other processors
- Need to develop a scheme to transfer these entries to the processors that need them

350
Getting the required entries of the
hash table
The required entries are transferred in two steps
- From splitting-attribute order to tid-sorted order
- From tid-sorted order to attribute-list order


This Design is Inspired by...

the communication structure of parallel sparse matrix-vector multiplication algorithms.

[Figure: a 9x9 sparse matrix distributed over P0-P2; X marks Salary entries, O marks Age entries, and the circled entries are the node-table entries each processor must fetch]
Parallel Hashing Paradigm
• Distributed hash table.
• Hash function: h(k) = (Pi, I) — owner processor and local index.
• Construction:
- (k,v) --> (Pi, I) --> form send buffers {(I,v)} for each Pi --> all-to-all personalized communication.
• Enquiry:
- (k) --> (Pi, I) --> form {(I)} buffers for each Pi --> all-to-all personalized comm --> each Pi replaces received {(I)} with {(v)} --> all-to-all personalized comm.
• If each processor has m keys to hash, the time is O(m) if m is Omega(p), i.e., Omega(p^2) keys overall.
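The paradigm can be simulated in a single process to make the data flow concrete. This is an illustrative sketch only — the "processors" are list indices, the per-owner send buffers stand in for the all-to-all personalized communication, and the hash function is an invented example:

```python
# Single-process simulation of the distributed hash table: each key maps
# to (owner processor, local index); updates are batched into per-owner
# buffers and then "delivered", mimicking the all-to-all exchange.

P = 4  # number of simulated processors

def h(key, table_size_per_proc=100):
    """h(k) = (owner processor, local index)."""
    return key % P, (key // P) % table_size_per_proc

def build(pairs):
    tables = [dict() for _ in range(P)]
    buffers = [[] for _ in range(P)]          # send buffers, one per owner
    for k, v in pairs:
        owner, idx = h(k)
        buffers[owner].append((idx, v))
    for owner, buf in enumerate(buffers):     # "all-to-all" delivery
        for idx, v in buf:
            tables[owner][idx] = v
    return tables

def enquire(tables, keys):
    return [tables[h(k)[0]][h(k)[1]] for k in keys]

tables = build([(10, "L"), (21, "R"), (7, "L")])
vals = enquire(tables, [21, 7])
```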


Applying the Paradigm to ScalParC: Update

[Figure: (a) the input records (rid, Salary, Age, cid); (b) the Salary and Age attribute lists partitioned across P0-P2, with the split point SP on Salary; (c) each processor forms hash buffers of (local index, child L/R) pairs addressed to the Node Table owners, and one all-to-all personalized communication updates the distributed Node Table mapping every rid to L or R]
Applying the Paradigm to ScalParC: Enquiry

[Figure: to split the Age attribute list, each processor forms enquiry buffers of the rids it needs, an all-to-all personalized communication delivers them to the Node Table owners, the owners substitute the child labels (L/R), and a second all-to-all returns the result buffers]

Verifying Scalability
• Updating the distributed hash table
- Each processor receives at most N/p entries
- Each processor can send at most Nk/p entries (k is the number of
continuous attributes)
• This is a worst case situation that does not happen frequently
• Enquiring the distributed hash table
- Each processor performs k enquiries, one for each continuous attribute
- Each processor receives at most N/p entries
- Each processor sends at most N/p entries
• The per-processor communication overhead is in general O(N/p)
• Each processor requires only O(N/p) memory, making ScalParC memory efficient
A worst case scenario for Updating
Hash Table
One processor might need to send O(kN/p) data! Happens infrequently.

[Figure: a worst case for updating the hash table — after the split point is applied to attribute lists B1 and B2, almost all the hash-table entries one processor needs were created on a single other processor]

Categorical/Continuous and
Continuous/Categorical
• Special cases of the Continuous/Continuous
- Categorical/Continuous does not require communication for loading the hash table
- Continuous/Categorical does not require communication for enquiring the hash table

Algorithm: Level-wise
Communications
• Tree is built in a breadth-first manner.
• At each level of the decision tree
- Count matrices for all attributes for all nodes are reduced in one single communication operation
- Loading the hash table for all the nodes is combined into a single communication operation
- Enquiring the hash table to split a particular attribute list is combined into a single communication operation
- A total of k+1 all-to-all personalized communication operations are performed for each level of the tree
High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 89 I

Structure of the Algorithm

• Sort continuous attributes (Pre-Sort)
• do level-wise while (at least one node requiring split)
- Compute count matrices (Find-Split I)
- Compute best gini index for nodes requiring split (Find-Split II)
- Partition splitting attributes and update Node Table (Perform-Split I)
- Partition non-splitting attributes by inquiring Node Table (Perform-Split II)
• end do
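The Find-Split phases above boil down to scanning each sorted continuous attribute while maintaining class counts below and above a candidate split point, and keeping the split with the lowest weighted gini index. A minimal serial sketch of that scan (an illustration, not the ScalParC code; `best_split` and its signature are assumptions):

```python
def gini(counts):
    """Gini impurity of a class-count vector."""
    n = sum(counts)
    return 1.0 - sum((c / n) ** 2 for c in counts) if n else 0.0

def best_split(values, labels, nclasses=2):
    """Scan a continuous attribute in sorted order and return the
    (weighted gini, split point) pair with the lowest gini index."""
    order = sorted(range(len(values)), key=lambda i: values[i])
    below = [0] * nclasses
    above = [0] * nclasses
    for i in order:
        above[labels[i]] += 1
    n, best = len(values), (float("inf"), None)
    for rank, i in enumerate(order[:-1], start=1):
        below[labels[i]] += 1   # move one record below the split
        above[labels[i]] -= 1
        g = (rank / n) * gini(below) + ((n - rank) / n) * gini(above)
        if g < best[0]:
            best = (g, (values[i] + values[order[rank]]) / 2)
    return best

# A perfectly separable attribute splits with gini 0 at midpoint 6.5.
print(best_split([1, 2, 3, 10, 11, 12], [0, 0, 0, 1, 1, 1]))  # (0.0, 6.5)
```

In the parallel algorithm the `below`/`above` count matrices are what gets sum-reduced across processors at each level.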
Example
[Figure: count matrices (local and global) for Salary and Age, and the Node Table before and after the split, distributed across processors P0-P2.]
Experimental Results
• Experiments were performed on a 128-processor Cray T3D
• Training sets were synthetically generated
- Each contained only continuous attributes (5-9)
- There were two possible class labels
Parallel Runtime

Constant size/processor

Different Number of Attributes

SMP Parallel Design Space


• Data parallelism: within a tree node
- Split attributes: vertical
• BASIC
• Fixed Window
• Moving Window
- Split data (attribute lists): horizontal
• Task parallelism: between tree nodes
- SUBTREE
- Static vs. dynamic load balancing
SPRINT: Attribute Lists
[Figure: a training set (Tid, Age, CarType, Class) split into per-attribute lists; continuous attributes are kept sorted, categorical attributes in original order.]
Splitting Attribute Lists


[Figure: attribute lists for node 0 are partitioned into lists for nodes 1 and 2 by the split Age < 27.5; a hash table maps each Tid to its child node.]
SPRINT: File per Attribute

[Figure: file layout with one file per attribute per leaf.]
TOTAL FILES PER ATTRIBUTE: 32
Optimized File Usage

[Figure: optimized file layout with files reused across leaves.]
TOTAL FILES PER ATTRIBUTE: 4
BASIC: Data Parallel
• Given current leaf frontier
• Atomically acquire an attribute and
evaluate it for all leaves
• Master finds winning attribute and
constructs hash table
• Atomically acquire an attribute and split
its list for all leaves
• Barrier synchronization between phases
BASIC (Tree View)


[Figure: all processors P = {0,1,2,3} work on every node of the tree, with a barrier between phases at each level.]
BASIC (Level View, A=3, P=4)
[Figure: level view for A=3 attributes on P=4 processors — P0(4), P1(4), P2(4), P3(0); one processor is left idle.]
Fixed Window: Data Parallel


• Partition leaves into blocks/window of K
• Dynamically acquire any attribute for
any leaf within current block and
evaluate it
• Last processor to work on a leaf notes
the winning attribute and builds the
hash table
• Split data as in BASIC
• Barrier synchronization between each
block of K leaves
FWK (Tree View)
[Figure: fixed-window schedule over the tree for P = {0,1,2,3}.]
FW2 (Level View)


[Figure: level view with window = 2 — P0(4), P1(4), P2(2), P3(2).]
363
Moving Window: Data Parallel
• Partition leaves into blocks/window of K
• Dynamically acquire any attribute for
any leaf, say i, within current block
• Wait for last block's i-th leaf
• Last processor to work on a leaf notes
the winning attribute and builds the
hash table
• Split data as in BASIC
• No barriers, only conditional wait
MWK (Tree Level)


[Figure: moving-window schedule for P = {0,1,2,3}; conditional synchronization between overlapping leaf blocks replaces the full barrier.]
MW2 (Level View)
[Figure: level view with window = 2 — P0(3), P1(3), P2(3), P3(3).]
SUBTREE: Task Parallel


• Processor group (P) and leaf frontier (L)
[] Apply BASIC, get new leaf frontier (NL)
[] if NL is empty, put processors in FreeQ,
else new group, NP = P + FreeQ
[] Master divides NP and NL into two parts
[] NP1 works on NL1, NP2 works on NL2
[] Initially all processors work on root
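The splitting step in SUBTREE can be sketched as a toy recursion: halve the processor group and the new leaf frontier, and recurse until either is of unit size. `subtree_assign` is an illustrative stand-in (an even split; the actual algorithm also recycles processors through the FreeQ):

```python
def subtree_assign(procs, leaves):
    """Recursively split the processor group and the leaf frontier;
    returns (leaf-subfrontier, processor-subgroup) pairs."""
    if len(procs) <= 1 or len(leaves) <= 1:
        return [(leaves, procs)]
    half, mid = len(procs) // 2, len(leaves) // 2
    # NP1 works on NL1, NP2 works on NL2
    return (subtree_assign(procs[:half], leaves[:mid]) +
            subtree_assign(procs[half:], leaves[mid:]))

print(subtree_assign([0, 1, 2, 3], ["a", "b", "c", "d"]))
```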

SUBTREE (4 procs)

[Figure: processors P0-P3 recursively split across subtrees; finished processors enter the FreeQ.]
Experimental Setup
• SMP Machines
- 112 MHz PowerPC-604 processors
- 16KB L1-cache, 1MB L2-cache
- Machine: 4-way SMP
• 128MB memory, 300MB disk
• Synthetic datasets: simple (F2) and complex (F7) decision trees
Experimental Datasets
Dataset      Function  #Attr  Num Records  DBSize  Num Levels  Max Nodes
F2A8D1M      F2        8      1000K        192MB   4           2
F2A32D250K   F2        32     250K         192MB   4           2
F2A64D125K   F2        64     125K         192MB   4           2
F7A8D1M      F7        8      1000K        192MB   60          4662
F7A32D250K   F7        32     250K         192MB   59          802
F7A64D125K   F7        64     125K         192MB   55          384
Setup and Sort Time


Dataset      Setup (sec)  Sort (sec)  Total (sec)  Setup%  Sort%
F2A8D1M      721          633         3597         20%     18%
F2A32D250K   685          598         3584         19%     17%
F2A64D125K   705          626         3665         19%     17%
F7A8D1M      989          817         23360        4%      4%
F7A32D250K   838          780         24706        3%      3%
F7A64D125K   672          636         22664        3%      3%
Parallel Performance
[Figure: execution times of MW4 vs. SUBTREE on F2A64D125K and F7A64D125K for 1-4 processors.]
Parallel Performance
[Figure: speedups of MW4 and SUBTREE on the F2 and F7 variants of A64D125K for 1-4 processors.]
Tutorial overview
• Overview of KDD and data mining
• Parallel design space
• Classification
• [ Associations ]
• Sequences
• Clustering
• Future directions and summary
What is association mining?

• Given a set of items/attributes, and a set


of objects containing a subset of the
items
• Find rules: if I1 then I2 (sup, conf)
• I1, I2 are sets of items
• I1, I2 have sufficient support: P(I1 + I2)
• Rule has sufficient confidence: P(I2 | I1)

Association mining
• User specifies "interestingness"
-Minimum support (minsup)
- Minimum confidence (minconf)
• Find all frequent itemsets (> minsup)
-Exponential Search Space
-Computation and I/O Intensive
• Generate strong rules (> minconf)
- Relatively cheap

Association Rule Discovery: Support and Confidence

Example: {Diaper, Milk} => Beer

s = σ(Diaper, Milk, Beer) / (Total Number of Transactions) = 2/5 = 0.4

c = σ(Diaper, Milk, Beer) / σ(Diaper, Milk) = 2/3 ≈ 0.66
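The two measures are direct to compute; the sketch below reconstructs a 5-transaction market-basket table consistent with the example's counts (the table itself is garbled in the source, so its exact contents are an assumption):

```python
# Hypothetical 5-transaction database matching the example's counts.
transactions = [
    {"Bread", "Milk"},
    {"Bread", "Diaper", "Beer", "Eggs"},
    {"Milk", "Diaper", "Beer", "Coke"},
    {"Bread", "Milk", "Diaper", "Beer"},
    {"Bread", "Milk", "Diaper", "Coke"},
]

def sigma(itemset):
    """Number of transactions containing the itemset."""
    return sum(1 for t in transactions if itemset <= t)

def support(antecedent, consequent):
    return sigma(antecedent | consequent) / len(transactions)

def confidence(antecedent, consequent):
    return sigma(antecedent | consequent) / sigma(antecedent)

print(support({"Diaper", "Milk"}, {"Beer"}))     # 0.4
print(confidence({"Diaper", "Milk"}, {"Beer"}))  # 0.666...
```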
Handling Exponential Complexity

• Given n transactions and m different items:
- number of possible association rules: O(m·2^(m-1))
- computation complexity: O(n·m·2^m)
• Systematic search for all patterns, based on support constraint [Agrawal & Srikant]:
- If {A,B} has support at least α, then both A and B have support at least α.
- If either A or B has support less than α, then {A,B} has support less than α.
- Use patterns of k-1 items to find patterns of k items.
Apriori Principle
• Collect single item counts. Find large items.
• Find candidate pairs, count them => large
pairs of items.
• Find candidate triplets, count them => large
triplets of items, And so on...
• Guiding Principle: Every subset of a
frequent itemset has to be frequent.
- Used for pruning many candidates.

Illustrating Apriori Principle
[Figure: items (1-itemsets) → pairs (2-itemsets) → triplets (3-itemsets), minimum support = 3.]
If every subset is considered: C(6,1) + C(6,2) + C(6,3) = 41 candidates.
With support-based pruning: 6 + 6 + 2 = 14.
Counting Candidates
• Frequent Itemsets are found by counting
candidates.
• Simple way:
- Search for each candidate in each transaction. Expensive!
[Figure: N transactions matched against M candidates.]
Association Rule Discovery: Hash
tree for fast access.
[Figure: candidate hash tree — a hash function on items directs each candidate to a leaf.]
Association Rule Discovery: Subset


Operation
[Figure: the items of a transaction hashed at the root of the candidate hash tree.]
Association Rule Discovery: Subset
Operation (contd.)
[Figure: recursive descent of the transaction through the hash tree down to the candidate leaves.]
Parallel Formulation of Association


Rules
• Need:
- Huge transaction datasets (10s of TB)
- Large number of candidates.

• Data Distribution:
-Partition the Transaction Database, or
- Partition the Candidates, or
- Both
Parallel Association Rules: Count
Distribution (CD)
• Each Processor has complete candidate
hash tree.
• Each Processor updates its hash tree with
local data.
• Each Processor participates in global
reduction to get global counts of candidates
in the hash tree.
• Multiple database scans per iteration are
required if hash tree too big for memory.
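The CD scheme can be simulated in a few lines: each "processor" counts candidates over its own partition, and a global sum-reduction (an MPI Allreduce in a real implementation) merges the local counts. The partitioning and data below are illustrative:

```python
from collections import Counter
from itertools import combinations

def local_pair_counts(partition):
    """One processor's phase: count candidate pairs over local data."""
    c = Counter()
    for t in partition:
        c.update(combinations(sorted(t), 2))
    return c

# Two simulated processors over a partitioned database; the sum below
# plays the role of the global reduction of counts.
db = [{"A", "B", "C"}, {"A", "B"}, {"B", "C"}, {"A", "C"}]
parts = [db[:2], db[2:]]
global_counts = sum((local_pair_counts(p) for p in parts), Counter())
print(global_counts)
```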

CD: Illustration

[Figure: each processor counts candidates on its local partition; a global reduction of counts follows.]
Parallel Association Rules: Data
Distribution (DD)
• Candidate set is partitioned among the
processors.
• Once local data has been partitioned, it is
broadcast to all other processors.
• High Communication Cost due to data
movement.
• Redundant work due to multiple traversals
of the hash trees.

DD: Illustration
[Figure: candidates partitioned among processors; each processor's local data is broadcast all-to-all so every processor sees every transaction.]
Parallel Association Rules: Intelligent
Data Distribution (IDD)
• Data Distribution using point-to-point
communication.
• Intelligent partitioning of candidate sets.
- Partitioning based on the first item of candidates.
- Bitmap to keep track of local candidate items.
• Pruning at the root of candidate hash tree using the
bitmap.
• Suitable for single data source such as database
server.
• With smaller candidate set, load balancing is
difficult.
IDD: Illustration
[Figure: data shifted point-to-point around the processors; each processor counts only its own candidate partition.]
Filtering Transactions in IDD
[Figure: a bitmask of local candidate first-items filters each transaction before the hash-tree traversal.]
Parallel Association Rules: Hybrid


Distribution (HD)
• Candidate set is partitioned into G groups to
just fit in main memory
- Ensures Good load balance with smaller
candidate set.
• Logical processor mesh G x P/G is formed.
• Perform IDD along the column processors
- Data movement among processors is
minimized.
• Perform CD along the row processors
- Smaller number of processors in the global reduction operation.
HD: Illustration
[Figure: logical mesh with P/G processors per group; IDD within each group, CD along the rows across groups.]
Parallel Association Rules:


Experimental Setup
• 128-processor Cray T3D
- 150 MHz DEC Alpha (EV4)
- 64 MB of main memory per processor
- 3-D torus interconnection network with peak unidirectional
bandwidth of 150 MB/sec.
• MPI used for communications.
• Synthetic data set: avg transaction size 15 and 1000
distinct items.
• For larger data sets, multiple read of transactions in
blocks of 1000.
• HD switches to CD after 90.7% of the total computation is done.
Parallel Association Rules: Scaleup
Results (100K, 0.25%)
[Figure: scaleup of CD, IDD, HD, and DD as the number of processors grows to 140.]
Parallel Association Rules: Sizeup


Results (np=16, 0.25%)
[Figure: sizeup of CD, IDD, and HD as transactions per processor grow to 1000K.]
Parallel Association Rules: Response
Time (np=16, 50K)
[Figure: response times of CD, HD, IDD, and simple hybrid as minimum support drops from 0.5% to 0.06%.]
Parallel Association Rules: Response


Time (np=64, 50K)
[Figure: response times of CD, HD, IDD, and simple hybrid as minimum support drops from 0.5% to 0.04%.]
Parallel Association Rules:
Minimum Support Reachable

[Figure: minimum supports reachable — 0.25, 0.15, 0.06, 0.03 vs. 0.2, 0.1, 0.04, 0.02.]
Parallel Association Rules:


Processor Configuration in HD

64 processors and 0.04% minimum support
[Figure: response time of HD for different processor-mesh configurations.]
Parallel Association Rules:
Summary of Experiments
[] HD shows the same linear speedup and
sizeup behavior as that of CD.
[] HD Exploits Total Aggregate Main
Memory, while CD does not.
[] IDD has much better scaleup behavior
than DD

Eclat Approach

[] Frequent itemset lattice


[] Vertical or "inverted" tid-list format
[] Support counting via intersections
[] Lattice decomposition: break into
subproblems
[] Efficient search strategies
[] Independent solution of subproblems
Example Database

Distinct database items: A (Jane Austen), C (Agatha Christie), D (Sir Arthur Conan Doyle), T (Mark Twain), W (P. G. Wodehouse)

Database:
Transaction  Items
1            A C T W
2            C D W
3            A C T W
4            A C D W
5            A C D T W
6            C D T

All frequent item-sets:
Support   Item-sets
100% (6)  C
83% (5)   W, CW
67% (4)   A, D, T, AC, AW, CD, CT, ACW
50% (3)   AT, DW, TW, ACT, ATW, CDW, CTW, ACTW

MAXIMAL ITEM-SETS (MIN SUPPORT = 50%): ACTW, CDW
Frequent Itemset Lattice
[Figure: the itemset lattice over {A, C, D, T, W}; frequency is downward closed on item-set support. Maximal frequent item-sets: ACTW, CDW.]
Eclat: Support Counting
[Figure: per-item tid-lists; intersecting the tid-lists of C and D yields the tid-list for CD, and intersecting CD and CW yields CDW.]

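The vertical format makes support counting a set intersection; the sketch below uses the tutorial's own example database (transactions 1:ACTW, 2:CDW, 3:ACTW, 4:ACDW, 5:ACDTW, 6:CDT):

```python
# Vertical ("inverted") tid-lists for the example database.
tidlists = {
    "A": {1, 3, 4, 5},
    "C": {1, 2, 3, 4, 5, 6},
    "D": {2, 4, 5, 6},
    "T": {1, 3, 5, 6},
    "W": {1, 2, 3, 4, 5},
}

def support(itemset):
    """Support of an itemset via tid-list intersection."""
    return len(set.intersection(*(tidlists[i] for i in itemset)))

# CD = C & D -> {2, 4, 5, 6}; CDW = CD & CW -> {2, 4, 5}, i.e. 50% of 6.
print(support("CD"), support("CDW"), support("ACTW"))
```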
Eclat: Lattice Decomposition


[Figure: the frequent-itemset lattice decomposed into prefix-based equivalence classes.]
Equivalence classes:
[A] = { A, AC, AT, AW, ACT, ACW, ATW, ACTW }
[C] = { C, CD, CT, CW, CDW, CTW }
[D] = { D, DW }   [T] = { T, TW }   [W] = { W }
Cross-class links used for pruning.
Lattice Search Strategies

• Bottom-up
-Level-wise like Apriori
• Top-down
- Start with largest element. If frequent,
done! Else look at subsets
• Hybrid
- Find long frequent/maximal itemsets
- Find remaining using bottom-up search
Bottom-up Lattice Search
[Figure: bottom-up expansion within class [A].]
New equivalence classes:
[AC] = { AC, ACT, ACW, ACTW }
[AT] = { AT, ATW }
[AW] = { AW }
Known maximal frequent itemsets: CDW, CTW
Top-down Lattice Search
[Figure: top-down search from ACDTW.]
Minimal infrequent item-set: AD
Maximal frequent item-set: ACTW
Parallel Eclat/MaxEclat
• Partition tidlists among processors
• Schedule equivalence classes
• Selectively replicate database
• Solve Asynchronously/Independently
• Algorithm Space
- Eclat (Bottom-up)
- MaxEclat (Hybrid)

• Effective aggregate memory utilization


Count Distribution
[Figure: three processors over a partitioned database — count items; get frequent items; form candidate pairs; count pairs; get frequent pairs; form candidate triples; count triples; ...]
Parallel Eclat
[Figure: the database partitioned across three processors, then selectively replicated so each processor holds the tid-lists its assigned classes need.]
Parallel Eclat Algorithm
Item-set lattice class partitioning — equivalence classes:
[A] = { AC, AT, AW }
[C] = { CD, CT, CW }
[D] = { DW }
[T] = { TW }

Class schedule: Processor 0 (P0): [A], [T]; Processor 1 (P1): [C], [D]

[Figure: tid-list communication — the original database, the partitioned database, and the layout after the tid-list exchange.]
Eclat Experiments
[] Database: T20.I6.D4550K
- 1000 items
- 4,550,000 transactions
[] Machine: eight 4-way SMP nodes
- Digital Memory Channel (5µs, 30MB/s)
- 233 MHz, 256 MB, 1MB L2 cache
[] Hierarchical parallelization
- Message passing + SMP
Parallel Performance
[Figure: parallel performance and speedup with 1 process per host, for 1-8 hosts (H1-H8).]
Parallel Performance
[Figure: parallel performance and speedup with 2 processes per host, for 1-8 hosts (H1-H8).]
Tutorial overview
• Overview of KDD and data mining
• Parallel design space
• Classification
• Associations
• [ Sequences ]
• Clustering
• Future directions and summary
Discovering Sequential Associations


Given:
A set of objects with associated event occurrences.
[Figure: event timelines (0-50) for Object 1 and Object 2.]
Sequential Pattern Discovery:
Definition
Given a set of objects, each associated with its own timeline of events, find rules that predict strong sequential dependencies among different events.

{ (A B) (C) } => (D E)

Rules are formed by first discovering patterns. Event occurrences in the patterns are governed by timing constraints.

[Figure: pattern (A B) (C) (D E) annotated with maximum gap (xg), window size (ws), and span (ms) constraints.]
Sequential Patterns: Complexity


Much higher computational complexity than association rule discovery.
- O(m^k · 2^(k-1)) possible sequential patterns having k events, where m is the total number of possible events.
Time constraints offer some pruning. Further use of support-based pruning contains complexity.
- A subsequence of a sequence occurs at least as many times as the sequence.
- A sequence has no more occurrences than any of its subsequences.
- Build sequences in increasing number of events. [GSP algorithm by Srikant & Agrawal]
Apriori-like Approach

• Make multiple scans


• Join frequent patterns to generate new
candidate sequences
• Example:
-(2 3) (4 6) (7)joins with (3) (4 6) (7) (8) to
create (2 3) (4 6) (7) (8)
-(1 2) with (2 3) creates (1 2 3)
-(1) with (2) creates (1) (2) and (1 2)
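The join rule in these examples can be sketched as code. A sequence is represented here as a tuple of sorted item-tuples; two frequent sequences join when dropping the first item of one equals dropping the last item of the other (a simplified GSP-style join, with the special case for 1-sequences):

```python
def drop_first(seq):
    """Remove the first item of the first element of a sequence."""
    head = seq[0][1:]
    return ((head,) if head else ()) + seq[1:]

def drop_last(seq):
    """Remove the last item of the last element of a sequence."""
    tail = seq[-1][:-1]
    return seq[:-1] + ((tail,) if tail else ())

def join(s1, s2):
    """GSP-style candidate join; a sequence is a tuple of item-tuples."""
    if len(s1) == 1 and len(s1[0]) == 1 and len(s2) == 1 and len(s2[0]) == 1:
        a, b = s1[0][0], s2[0][0]
        out = [(s1[0], s2[0])]          # (a)(b)
        if a < b:
            out.append(((a, b),))       # (a b)
        return out
    if drop_first(s1) != drop_last(s2):
        return []
    last = s2[-1]
    if len(last) == 1:
        return [s1 + (last,)]           # append as a new element
    merged = tuple(sorted(s1[-1] + (last[-1],)))
    return [s1[:-1] + (merged,)]        # extend s1's last element

# (2 3)(4 6)(7) joined with (3)(4 6)(7)(8) gives (2 3)(4 6)(7)(8)
print(join(((2, 3), (4, 6), (7,)), ((3,), (4, 6), (7,), (8,))))
```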
Sequential Apriori:
Count operation
• Various possibilities for counting occurrences
of a sequence in an object's timeline
- Count only one occurrence per object.
- Count the number of span-size windows the sequence occurs in.
- Count the number of distinct occurrences of a sequence:
• Each event-timestamp pair considered at most once.
• Each counted occurrence has at least one new event-timestamp pair.

Sequential Apriori:
Count operation (contd..)
• Hash Tree used for fast search of candidate
occurrences. Similar to association rule
discovery, except for following differences.
- Every event-timestamp pair in the timeline is hashed at the root.
- Events eligible for hashing at the next level are determined by the maximum gap (xg), window size (ws), and span (ms) constraints.
Sequence Hash Tree for Fast Access


[Figure: a candidate sequence hash tree probed with the object timeline 1@0, 2@5, 3@5, 4@0, 5@5, 1@12, 1@15 (ms = 20); four interesting paths: 1@0 2@5 3@5 · 1@0 3@5 · 4@0 5@5 · 1@12 1@15.]
Counting Leaf Level Candidates
[Figure: part of the candidate sequence hash tree probed with Object 2's timeline; Count = 12.]
Parallel Sequential Associations


• Need:
- Enormity of data.
- Memory and disk limitations of serial algorithms running on a single processor.
• Can algorithms for non-sequential associations be extended easily?
- The sequential nature gives rise to complex issues:
• Longer timelines
• Large span values
• Large number of candidates
Parallel Sequential Associations:
Event Distribution (EVE)
[Figure: events partitioned across P0-P2; an all-to-all broadcast of support counts follows local counting.]
EVE Algorithm: Challenging Case


[Figure: sequences crossing partition boundaries require transferring hash-tree states and partial counts between processors.]
Event and Candidate Distribution
(EVECAN)
[Figure: candidates rotated among processors in round-robin manner while events stay in place.]
EVECAN: Parallelizing Candidate


Generation
• Candidates stored in a distributed hash table
• Hash function: candidate sequence S => h(S) = (Pi, l)
• Scalable communication mechanism to construct and probe
The SPADE Approach
• Search space is a lattice
• Use "inverted" vertical lists
• Find support via list intersections
• Break large search space into small
independent suffix-based chunks
• Process each chunk recursively using
breadth/depth-first search
• Takes only a few data scans
Example Database
[Table: three customers with time-stamped April transactions over items A, C, D, T, W, and the frequent sequences at 75% support — e.g., 100% (3): C, D, T, W, C->D, C->T, C->W, D->T, D->W, TW, C->TW, D->TW.]

Maximal frequent sequences: AC->D, AC->TW, C->D->TW
Frequent sequence lattice
[Figure: lattice induced by the maximal frequent sequences AC->D, AC->TW, C->D->TW.]
Sequence lattice decomposition

[Figure: the sequence lattice decomposed into suffix-based classes.]
Temporal Joins
[Figure: customer/transaction id-list intersection — joining the id-lists of P->X and P->Y yields id-lists for P->X->Y, P->Y->X, and P->XY.]
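The temporal join itself is a small comparison loop over (cid, tid) pairs; the sketch below is illustrative (a real implementation would merge sorted lists rather than use the quadratic loop):

```python
def temporal_join(xs, ys):
    """Join the id-lists of P->X and P->Y (entries are (cid, tid) pairs)
    into id-lists for P->X->Y, P->Y->X, and the event atom P->XY."""
    seq_xy, seq_yx, event_xy = set(), set(), set()
    for cx, tx in xs:
        for cy, ty in ys:
            if cx != cy:
                continue
            if tx < ty:
                seq_xy.add((cy, ty))    # X strictly before Y
            elif ty < tx:
                seq_yx.add((cx, tx))    # Y strictly before X
            else:
                event_xy.add((cx, tx))  # same transaction: XY together
    return sorted(seq_xy), sorted(seq_yx), sorted(event_xy)

print(temporal_join([(1, 10), (1, 30), (2, 20)], [(1, 20), (2, 20)]))
```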
Dynamic Computation Tree


Parallel Design Space

• Data parallelism: within a class
- Idlist parallelism
• Single idlist join
• Level-wise idlist join
- Join parallelism
• Task parallelism: between classes
- pSPADE
- Static vs. dynamic load balancing
Idlist Data Parallelism
• Single idlist join
- Split idlists on cid range (P0: cid in range 1-500; P1: cid in range 501-1000)
- Each proc intersects over its cid range
- Barrier synchronization for each join
[Figure: parallel idlist intersection.]
Idlist Data Parallelism
• Level-wise idlist join (P0: cid in range 1-500; P1: cid in range 501-1000)
- Process all classes at current level
- Each proc still intersects over its local DB
- Barrier synchronization and sum-reduction for each level
[Figure: parallel pair-wise intersections.]
Join Data Parallelism
• Each processor performs different intersections within a class (each idlist assigned to a different processor)
• Barrier after self-joining within a class, before processing classes at next level
• # synchronizations is between Single and Level-wise parallelism
Task Parallelism: Static Load Balancing
• Given level 1 classes, C1, C2, C3, ...
• Assign weights W1, W2, W3, ... based on the class size (# sequences in the class)
• Greedy scheduling of entire classes
- Sort classes on weight (descending)
- Assign class to proc with least total weight
• Each proc solves assigned classes asynchronously
• Hard to get accurate work estimate
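The greedy scheduling step can be sketched with a min-heap of processor loads (an illustrative stand-in; names and data are assumptions):

```python
import heapq

def greedy_schedule(weights, nprocs):
    """Assign classes to processors: sort by weight (descending), then
    give each class to the currently least-loaded processor."""
    heap = [(0.0, p) for p in range(nprocs)]
    heapq.heapify(heap)
    assignment = {p: [] for p in range(nprocs)}
    for cls, w in sorted(weights.items(), key=lambda kv: -kv[1]):
        load, p = heapq.heappop(heap)
        assignment[p].append(cls)
        heapq.heappush(heap, (load + w, p))
    return assignment

print(greedy_schedule({"A": 9, "B": 5, "C": 4, "D": 1}, nprocs=2))
```

Each processor then solves its assigned classes asynchronously; the scheme's weakness is exactly the bullet above — the weights are only rough estimates of actual work.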

Static and Dynamic Load Balancing
[Figure omitted.]
Task Parallelism:
Dynamic Load Balancing
• Given level 1 classes, C1, C2, C3, ... and
their weights
• Sort classes on weight (descending) and
insert in a logical task queue
• A proc atomically grabs the first available
class and solves it completely
• Repeat until no more classes in queue
• Uses only inter-class parallelism
• Queue contention negligible since classes
are coarse-grained
Task Parallelism:
Recursive DLB (pSPADE)
• Process available classes in parallel
• Worst case: P-1 procs free, 1 proc busy
• Classes sorted on weight, last class
usually small (but it could be large)
• Provide mechanism for free procs to
join busy group
• At each level, get free procs, insert the
classes into a shared task queue;
process available classes in parallel
pSPADE: Recursive Dynamic Load Balancing
[Figure omitted.]
Experimental Setup
• SGI Origin 2000
- 12 195MHz R10K MIPS processors
- 4MB cache, 2GB memory
• Synthetic datasets
- C: number of transactions/customer
- T: average transaction size
- S/I: average sequence/itemset size
- D: number of customers
Data vs. Task Parallelism
[Figure: execution times of Data vs. Task parallelism on C10-T5-S4-I1.25-D1M and C20-T5-S8-I1.25-D1M for 1, 2, and 4 processors.]
Static vs. Dynamic DLB
[Figure: execution times of Static, Dynamic, and Recursive on C10T5S4I1.25, C20T5S8I1.25, and C20T5S8I2.]
• Dynamic is up to 38% better than Static
• Recursive is up to 12% better than Dynamic
• Overall, Recursive is 44% better than Static
Parallel Performance
[Figure: execution time and speedup on C10-T5-S4-I1.25-D1M for 1-8 processors.]
Parallel Performance
[Figure: execution time and speedup on C20-T5-S8-I2-D1M for 1-8 processors.]
Scaleup and Support
[Figure: scaleup on C5-T2.5-S4-I1.25 up to 10 million customers, and response time on C5-T2.5-S4-I1.25-D1M as minimum support drops from 0.10% to 0.01%.]
pSPADE Summary
• Task parallel better than data parallel
• Dynamic load balancing
• Asynchronous algorithms (independent
classes)
• Good locality (uses only intersections)
• Good speedup and scaleup
• What next? Gigabyte databases
1
High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 196 J

Tutorial overview
• Overview of KDD and data mining
• Parallel design space
• Classification
• Associations
• Sequences
• [ Clustering ]
• Future directions and summary
High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 197

What is clustering?

Given N k-dimensional feature vectors,
find a "meaningful" partition of the N
examples into c subsets or groups
• Discover the "labels" automatically
• c may be given, or "discovered"
• Much more difficult than classification,
since in the latter the groups are given,
and we seek a compact description
High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 198

Clustering Illustration
[Figure: Euclidean distance based clustering in 3-D space.]
High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 199

Clustering

• Have to define some notion of
"similarity" between examples
• Goal: maximize intra-cluster similarity
and minimize inter-cluster similarity
• Feature vectors may be
- All numeric (well-defined distances)
- All categorical, or mixed (harder to define
similarity; geometric notions don't work)
High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 200

Clustering schemes

• Distance-based
- Numeric
• Euclidean distance (root of sum of squared
differences along each dimension)
• Angle between two vectors
- Categorical
• Number of common features (categorical)
• Partition-based
- Enumerate partitions and score each
High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 201 I
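The distance measures listed above can be sketched directly; the function names here are illustrative, not from the tutorial:

```python
import math

def euclidean(x, y):
    # Root of the sum of squared differences along each dimension.
    return math.sqrt(sum((a - b) ** 2 for a, b in zip(x, y)))

def angle(x, y):
    # Angle between two numeric vectors, via the cosine identity.
    dot = sum(a * b for a, b in zip(x, y))
    norms = math.sqrt(sum(a * a for a in x)) * math.sqrt(sum(b * b for b in y))
    return math.acos(dot / norms)

def common_features(x, y):
    # Categorical similarity: number of attributes with the same value.
    return sum(1 for a, b in zip(x, y) if a == b)
```

For categorical or mixed data, `common_features` counts matching attributes instead of relying on a geometric distance.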

Clustering schemes

• Model-based
- Estimate a density (e.g., a mixture of
Gaussians)
- Go bump-hunting
- Compute P(Feature Vector i | Cluster j)
- Finds overlapping clusters too
- Example: Bayesian clustering
High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 202

Before clustering

• Normalization:
- Given three attributes
• A -- micro-seconds
• B -- milli-seconds
• C -- seconds
- Can't treat differences as the same in all
dimensions or attributes
- Need to scale or normalize for comparison
- Can assign weights for more importance
High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 203
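As a concrete (assumed) example of the scaling step, z-score normalization rescales each attribute to zero mean and unit variance; the optional weight implements the "more importance" bullet. The tutorial does not prescribe a particular formula:

```python
def zscore_normalize(column):
    # Rescale one attribute so that values measured in micro-seconds,
    # milli-seconds or seconds all contribute comparably to a distance.
    n = len(column)
    mean = sum(column) / n
    std = (sum((v - mean) ** 2 for v in column) / n) ** 0.5
    return [(v - mean) / std for v in column]

def apply_weight(column, weight):
    # Optionally scale a normalized attribute to give it more importance.
    return [v * weight for v in column]
```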

The k-means algorithm

• Specify 'k', the number of clusters
• Guess 'k' seed cluster centers
• 1) Look at each example and assign it
to the center that is closest
• 2) Recalculate the centers
• Iterate on steps 1 and 2 till the centers
converge, or for a fixed number of times
High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 204
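The steps above can be sketched as follows; a minimal illustration, not the tutorial's own code, in which the k seeds are sampled from the data:

```python
import random

def kmeans(points, k, iters=100, seed=0):
    # Guess k seed cluster centers by sampling from the data.
    centers = random.Random(seed).sample(points, k)
    for _ in range(iters):
        # Step 1: assign each example to the closest center.
        clusters = [[] for _ in range(k)]
        for p in points:
            j = min(range(k),
                    key=lambda c: sum((a - b) ** 2 for a, b in zip(p, centers[c])))
            clusters[j].append(p)
        # Step 2: recalculate each center as the mean of its cluster.
        new_centers = [
            tuple(sum(vals) / len(cl) for vals in zip(*cl)) if cl else centers[j]
            for j, cl in enumerate(clusters)
        ]
        if new_centers == centers:  # centers converged
            break
        centers = new_centers
    return centers
```

With well-separated groups this converges in a few iterations; production implementations add better seeding and empty-cluster handling.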

K-means algorithm

[Figure: data points with the initial seed centers marked.]
High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 205

K-means algorithm

[Figure: points reassigned and the new centers computed.]
High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 206

K-means algorithm

[Figure: the final converged centers.]
High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 207

Operations in k-means
• Main Operation: Calculate distance to
all k means or centroids
• Other operations:
- Find the closest centroid for each point
- Calculate mean squared error (MSE) for
all points
- Recalculate centroids
High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 208

Parallel k-means
• Divide N points among P processors
• Replicate the k centroids
• Each processor computes distance of
each local point to the centroids
• Assign points to closest centroid and
compute local MSE
• Perform reduction for global centroids
and global MSE value
I High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 209 I
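One iteration of the above can be sketched as follows. For clarity the P local computations are written as a plain loop standing in for concurrent execution, and the final reduction sums the partial centroid contributions and squared errors; all names are illustrative:

```python
def kmeans_step_parallel(points, centers, nprocs=4):
    # Divide the N points among the P processors; centroids are replicated.
    chunks = [points[i::nprocs] for i in range(nprocs)]
    k, dim = len(centers), len(centers[0])

    def local_work(chunk):
        # Each processor: distance of every local point to all k centroids,
        # assignment to the closest one, and the local squared error.
        sums = [[0.0] * dim for _ in range(k)]
        counts = [0] * k
        sse = 0.0
        for p in chunk:
            dists = [sum((a - b) ** 2 for a, b in zip(p, c)) for c in centers]
            j = dists.index(min(dists))
            counts[j] += 1
            sse += dists[j]
            for d in range(dim):
                sums[j][d] += p[d]
        return sums, counts, sse

    # Reduction: combine partial sums, counts and errors for the
    # global centroids and global MSE.
    partials = [local_work(ch) for ch in chunks]  # would run concurrently
    tot_counts = [sum(r[1][j] for r in partials) for j in range(k)]
    new_centers = [
        tuple(sum(r[0][j][d] for r in partials) / tot_counts[j] for d in range(dim))
        if tot_counts[j] else centers[j]
        for j in range(k)
    ]
    mse = sum(r[2] for r in partials) / len(points)
    return new_centers, mse
```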

Gaussian mixture models

• Drawbacks of K-means
- Doesn't do well with overlapping clusters
- Clusters easily pulled off-center by outliers
- Each record is either in or out of a cluster;
no notion of some records being more or
less likely than others to really belong to
the cluster they have been assigned to
• GMM: probabilistic variant of K-means
High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 210

Estimation-maximization

• Choose K seeds: means of a Gaussian
distribution
• Estimation: calculate probability of
belonging to a cluster based on distance
• Maximization: move mean of Gaussian
to centroid of data set, weighted by the
contribution of each point
• Repeat till means don't move
High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 211
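For a one-dimensional mixture the two steps can be sketched as below. This simplified version assumes equal component weights and a shared fixed variance (assumptions made for brevity, not from the tutorial), so only the means are re-estimated:

```python
import math

def em_means(data, means, var=1.0, iters=100):
    # Simplified EM for a mixture of equal-weight Gaussians with a
    # shared fixed variance: only the means move.
    for _ in range(iters):
        # E-step: responsibility of each component for each point,
        # based on its distance from each mean.
        resp = []
        for x in data:
            w = [math.exp(-((x - m) ** 2) / (2.0 * var)) for m in means]
            total = sum(w)
            resp.append([wi / total for wi in w])
        # M-step: move each mean to the responsibility-weighted centroid.
        new_means = []
        for j in range(len(means)):
            num = sum(r[j] * x for r, x in zip(resp, data))
            den = sum(r[j] for r in resp)
            new_means.append(num / den)
        if all(abs(a - b) < 1e-9 for a, b in zip(new_means, means)):
            break  # means stopped moving
        means = new_means
    return means
```

Unlike k-means, each point contributes fractionally to every mean through its responsibilities, which is what lets GMMs handle overlapping clusters.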

Deviation/outlier detection

• Find points that are very different from
the other points in the dataset
• Could be "noise", that causes problems
for classification or clustering
• Could be the really "interesting" points,
for example, in fraud detection, we are
mainly interested in finding the
deviations from the norm
High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 212 I
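One simple (assumed) way to flag points that are very different from the rest is a z-score rule; the threshold is a tunable parameter, not a value from the tutorial:

```python
def deviations(values, threshold=3.0):
    # Flag values lying more than `threshold` standard deviations
    # from the mean -- a simple notion of "very different".
    n = len(values)
    mean = sum(values) / n
    std = (sum((v - mean) ** 2 for v in values) / n) ** 0.5
    return [v for v in values if abs(v - mean) > threshold * std]
```

Whether such points are discarded as noise or examined as the interesting cases (e.g., in fraud detection) depends on the application.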

Deviation detection

[Figure: scatter of points with one outlier marked.]
High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 213

K-nearest neighbors

• Classification technique to assign a
class to a new example
• Find k-nearest neighbors, i.e., most
similar points in the dataset (compare
against all points!)
• Assign the new case to the same class
to which most of its neighbors belong

High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 214 I
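The technique can be sketched as follows (illustrative names; note the comparison against all points):

```python
def knn_classify(dataset, query, k=3):
    # dataset: list of (feature_vector, class_label) pairs.
    # Full scan: sort all points by squared Euclidean distance to the
    # query and keep the k most similar ones.
    nearest = sorted(
        dataset,
        key=lambda pair: sum((a - b) ** 2 for a, b in zip(pair[0], query)),
    )[:k]
    labels = [label for _, label in nearest]
    # Assign the class most of the neighbors belong to.
    return max(set(labels), key=labels.count)
```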

K-nearest neighbors

[Figure: a new example whose neighborhood contains 5 points of
class 0 and 3 of class +, so it is assigned class 0.]
High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 215

Tutorial overview
• Overview of KDD and data mining
• Parallel design space
• Classification
• Associations
• Sequences
• Clustering
• [ Future directions and summary ]
High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 216
Large-scale Parallel KDD Systems

• Terabyte-sized datasets
• Centralized or distributed datasets
• Incremental changes
• Heterogeneous data sources
• Pre-processing, mining, post-processing
• Modular (rapid development)
• Benchmarking (algorithm comparison)
I High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 217 ]

Research Directions
• Fast algorithms: different mining tasks
- Classification, clustering, associations, etc.
- Incorporating concept hierarchies
• Parallelism and scalability
- Millions of records
- Thousands of attributes/dimensions
- Single pass algorithms
- Sampling
- Parallel I/O and file systems
High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 218 I

Research Directions (contd.)

• Data locality and type
- Distributed data sources (WWW)
- Text and multimedia mining
- Spatial data mining
• Incremental mining: refine knowledge
as data changes
• Interactivity: anytime mining

High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 219 I

Research Directions (contd.)

• Tight database integration
- Push common primitives inside the DBMS
- Use multiple tables
- Use efficient indexing techniques
- Caching strategies for a sequence of data
mining operations
- Data mining query language and parallel
query optimization
High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 220

Research Directions (contd.)

• Understandability: too many patterns
- Incorporate background knowledge
- Integrate constraints
- Meta-level mining
- Visualization
• Usability: build a complete system
- Pre-processing, mining, post-processing,
persistent management of mined results
High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 221

Summary Of the Tutorial


• Data mining is a rapidly growing field
- Fueled by enormous data collection rates, and
need for intelligent analysis for business and
scientific gains.
• The large, high-dimensional nature of the
data requires new analysis techniques
and algorithms.
• Scalable, fast parallel algorithms are
becoming indispensable.
• Many research and commercial
opportunities!!!
I High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 222I
Resources
Workshops
• HiPC Special Session on Large-Scale Data Mining, 2000.
http://www.cs.rpi.edu/~zaki/LSDM/
• ACM SIGKDD Workshop on Distributed Data Mining, 2000.
http://www.eecs.wsu.edu/~hillol/DKD/dpkd2000.html
• 3rd IEEE IPDPS Workshop on High Performance Data Mining, 2000.
http://www.cs.rpi.edu/~zaki/HPDM/
• ACM SIGKDD Workshop on Large-Scale Parallel KDD Systems, 1999.
http://www.cs.rpi.edu/~zaki/WKDD99/
• ACM SIGKDD Workshop on Distributed Data Mining, 1998.
http://www.eecs.wsu.edu/~hillol/DDMWS/papers.html
• 1st IEEE IPPS Workshop on High Performance Data Mining, 1998.
http://www.cise.ufl.edu/~ranka/

Books
• A. Freitas and S. Lavington. Mining very large databases with parallel processing. Kluwer Academic
Pub., Boston, MA, 1998.
• M.J. Zaki and C.-T. Ho (eds). Large-Scale Parallel Data Mining. LNAI State-of-the-Art Survey,
Volume 1759, Springer-Verlag, 2000.
• H. Kargupta and P. Chan (eds). Advances in Distributed and Parallel Knowledge Discovery, AAAI
Press, Summer 2000.

High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 223

Resources (contd.)
Journal Special Issues
• P. Stolorz and R. Musick (eds.). Scalable High-Performance Computing for KDD, Data Mining and
Knowledge Discovery: An International Journal, Vol. 1, No. 4, December 1997.
• Y. Guo and R. Grossman (eds.). Scalable Parallel and Distributed Data Mining, Data Mining and
Knowledge Discovery: An International Journal, Vol. 3, No. 3, September 1999.
• V. Kumar, S. Ranka and V. Singh. High Performance Data Mining, Journal of Parallel and Distributed
Computing, forthcoming, 2000.

Survey Articles
• F. Provost and V. Kolluri. A survey of methods for scaling up inductive algorithms. Data Mining and
Knowledge Discovery: An International Journal, 3(2):131--169, 1999.
• A. Srivastava, E.-H. Han, V. Kumar and V. Singh. Parallel formulations of decision-tree classification
algorithms. Data Mining and Knowledge Discovery: An International Journal, 3(3):237--262, 1999.
• M.J. Zaki. Parallel and distributed association mining: A survey. In IEEE Concurrency special issue
on Parallel Data Mining, 7(4):14-25, Oct-Dec 1999.
• D. Skillicorn. Strategies for parallel data mining. IEEE Concurrency, 7(4):26--35, Oct-Dec 1999.
• M.V. Joshi, E.-H. Han, G. Karypis and V. Kumar. Efficient parallel algorithm for mining associations.
In Zaki and Ho (eds.), Large-Scale Parallel Data Mining, LNAI 1759, Springer-Verlag 2000.
• M.J. Zaki. Parallel and distributed data mining: An introduction. In Zaki and Ho (eds.), Large-Scale
Parallel Data Mining, LNAI 1759, Springer-Verlag 2000.

High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 224

References: Classification
• J. Chattratichat, J. Darlington, M. Ghanem, Y. Guo, H. Huning, M. Kohler, J. Sutiwaraphun, H. W. To, and
Y. Dan. Large scale data mining: Challenges and responses. In 3rd Intl. Conf. on Knowledge Discovery and
Data Mining, August 1997.
• S. Goil and A. Choudhary. Efficient parallel classification using dimensional aggregates. In Zaki and Ho
(eds), Large-Scale Parallel Data Mining, LNAI Vol. 1759, Springer-Verlag 2000.
• M. Holsheimer, M. L. Kersten, and A. Siebes. Data Surveyor: Searching the nuggets in parallel. In Fayyad
et al.(eds.), Advances in KDD, AAAI Press, 1998.
• M. Joshi, G. Karypis, and V. Kumar. ScalParC: A scalable and parallel classification algorithm for mining
large datasets. In Intl. Parallel Processing Symposium, 1998.
• R. Kufrin. Decision trees on parallel processors. In J. Geller, H. Kitano, and C. Suttner, editors, Parallel
Processing for Artificial Intelligence 3. Elsevier-Science, 1997.
• S. Lavington, N. Dewhurst, E. Wilkins, and A. Freitas. Interfacing knowledge discovery algorithms to large
databases management systems. Information and Software Technology, 41:605--617, 1999.
• F. Provost and J. Aronis. Scaling up inductive learning with massive parallelism. Machine Learning, 23(1),
April 1996.
• F. Provost and V. Kolluri. A survey of methods for scaling up inductive algorithms. Data Mining and
Knowledge Discovery: An International Journal, 3(2):131--169, 1999.
• John Shafer, Rakesh Agrawal, and Manish Mehta. SPRINT: A scalable parallel classifier for data mining. In
Proc. of the 22nd Int'l Conference on Very Large Databases, Bombay, India, September 1996.
• M. Sreenivas, K. Alsabti, and S. Ranka. Parallel out-of-core divide and conquer techniques with application
to classification trees. In 13th International Parallel Processing Symposium, April 1999.
• A. Srivastava, E-H. Han, V. Kumar, and V. Singh. Parallel formulations of decision-tree classification
algorithms. Data Mining and Knowledge Discovery: An International Journal, 3(3):237--261, 1999.
• M.J. Zaki, C.-T. Ho, and R. Agrawal.Parallel classification for data mining on shared-memory
multiprocessors. In 15th IEEE Intl. Conf. on Data Engineering, March 1999.

High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 225

References: Associations
• R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. Inkeri Verkamo. Fast discovery of association
rules. In U. Fayyad and et al, editors, Advances in Knowledge Discovery and Data Mining, pages 307--328.
AAAI Press, Menlo Park, CA, 1996.
• R. Agrawal and J. Shafer. Parallel mining of association rules. IEEE Trans. on Knowledge and Data Engg.,
8(6):962--989, December 1996.
• D. Cheung, J. Han, V. Ng, A. Fu, and Y. Fu. A fast distributed algorithm for mining association rules. In 4th
Intl. Conf. Parallel and Distributed Info. Systems, December 1996.
• D. Cheung, V. Ng, A. Fu, and Y. Fu. Efficient mining of association rules in distributed databases. IEEE
Transactions on Knowledge and Data Engg., 8(6):911-922, December 1996.
• D. Cheung, K. Hu, and S. Xia. Asynchronous parallel algorithm for mining association rules on shared-
memory multi-processors. In 10th ACM Symp. Parallel Algorithms and Architectures, June 1998.
• D. Cheung and Y. Xiao. Effect of data distribution in parallel mining of associations. Data Mining and
Knowledge Discovery: An International Journal. 3(3):291--314, 1999.
• E-H. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for association rules. In ACM SIGMOD
Conf. Management of Data, May 1997.
• M. Joshi, E.-H. Han, G. Karypis, and V. Kumar. Efficient parallel algorithms for mining associations. In M.
Zaki and C.-T. Ho (eds), Large-Scale Parallel Data Mining, LNAI State-of-the-Art Survey, Volume 1759,
Springer-Verlag, 2000.
• S. Morishita and A. Nakaya. Parallel branch-and-bound graph search for correlated association rules. In
Zaki and Ho (eds), Large-Scale Parallel Data Mining, LNAI Vol. 1759, Springer-Verlag 2000.

High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 226

References: Associations (contd.)
• A. Mueller. Fast sequential and parallel algorithms for association rule mining: A comparison. Technical
Report CS-TR-3515, University of Maryland, College Park, August 1995.
• J.S. Park, M. Chen, and P. S. Yu. Efficient parallel data mining for association rules. In ACM Intl. Conf.
Information and Knowledge Management, November 1995.
• T. Shintani and M. Kitsuregawa. Hash based parallel algorithms for mining association rules. In 4th Intl.
Conf. Parallel and Distributed Info. Systems, December 1996.
• T. Shintani and M. Kitsuregawa. Parallel algorithms for mining generalized association rules with
classification hierarchy. In ACM SIGMOD International Conference on Management of Data, May 1998.
• M. Tamura and M. Kitsuregawa. Dynamic load balancing for parallel association rule mining on
heterogeneous PC cluster systems. In 25th Int'l Conf. on Very Large Data Bases, September 1999.
• M.J. Zaki. Parallel and distributed association mining: A survey. IEEE Concurrency, 7(4):14--25, October-
December 1999.
• M.J. Zaki, M. Ogihara, S. Parthasarathy, and W. Li. Parallel data mining for association rules on shared-
memory multi-processors. In Supercomputing'96, November 1996.
• M.J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New algorithms for fast discovery of association rules.
In 3rd Intl. Conf. on Knowledge Discovery and Data Mining, August 1997.
• M.J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithms for fast discovery of association
rules. Data Mining and Knowledge Discovery: An International Journal, 1(4):343-373, December 1997.
• M.J. Zaki, S. Parthasarathy, W. Li, A Localized Algorithm for Parallel Association Mining, 9th Annual ACM
Symposium on Parallel Algorithms and Architectures (SPAA), June, 1997.

High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 227

References: Sequences

• R. Agrawal and R. Srikant. Mining sequential patterns. In 11th Intl. Conf. on Data Engg., 1995.
• H. Mannila, H. Toivonen and I. Verkamo. Discovery of frequent episodes in event sequences. Data
Mining and Knowledge Discovery: An International Journal. 1(3):259-289, 1997.
• T. Oates, M. D. Schmill, and P. R. Cohen. Parallel and distributed search for structure in multivariate time
series. In 9th European Conference on Machine Learning, 1997.
• T. Oates, M. D. Schmill, D. Jansen, and P. R. Cohen. A family of algorithms for finding temporal structure in
data. In 6th Intl. Workshop on AI and Statistics, March 1997.
• T. Shintani and M. Kitsuregawa. Mining algorithms for sequential patterns in parallel: Hash based approach.
In 2nd Pacific-Asia Conf. on Knowledge Discovery and Data Mining, April 1998.
• R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. In
5th Intl. Conf. Extending Database Technology, March 1998.
• M.J. Zaki. Efficient enumeration of frequent sequences. In 7th Intl. Conf. on Information and Knowledge
Management, November 1998.
• Mohammed J. Zaki, Parallel Sequence Mining on SMP Machines, ACM SIGKDD Workshop on Large-Scale
Parallel KDD Systems, 1999 (LNAI Vol. 1759).

High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 228

References: Clustering
• K. Alsabti, S. Ranka, V. Singh. An Efficient K-Means Clustering Algorithm. 1st IPPS Workshop on High
Performance Data Mining, March 1998.
• I. Dhillon and D. Modha. A data clustering algorithm on distributed memory machines. In Zaki and Ho (eds),
Large-Scale Parallel Data Mining, LNAI Vol. 1759, Springer-Verlag 2000.
• L. Iyer and J. Aronson. A parallel branch-and-bound algorithm for cluster analysis. Annals of Operations
Research Vol. 90, pp 65-86, 1999.
• E. Johnson and H. Kargupta. Collective hierarchical clustering from distributed heterogeneous data. In Zaki
and Ho (eds), Large-Scale Parallel Data Mining, LNAI Vol. 1759, Springer-Verlag 2000.
• D. Judd, P. McKinley, and A. Jain. Large-scale parallel data clustering. In Int'l Conf. Pattern Recognition,
August 1996.
• X. Li and Z. Fang. Parallel clustering algorithms. Parallel Computing, 11:270--290, 1989.
• C.F. Olson. Parallel algorithms for hierarchical clustering. Parallel Computing, 21:1313--1325, 1995.
• S. Ranka and S. Sahni. Clustering on a hypercube multicomputer. IEEE Trans. on Parallel and Distributed
Systems, 2(2):129--137, 1991.
• F. Rivera, M. Ismail, and E. Zapata. Parallel squared error clustering on hypercube arrays. Journal of
Parallel and Distributed Computing, 8:292--299, 1990.
• G. Rudolph. Parallel clustering on a unidirectional ring. In R. Grebe et al., editor, Transputer Applications
and Systems'93: Volume 1, pages 487--493. IOS Press, Amsterdam, 1993.
• H. Nagesh, S. Goil and A. Choudhary. MAFIA: Efficient and scalable subspace clustering for very large
data sets. Technical Report 9906-010, Center for Parallel and Distributed Computing, Northwestern
University, June 1999.
• X. Xu, J. Jager and H.-P. Kriegel. A fast parallel clustering algorithm for large spatial databases. Data
Mining and Knowledge Discovery: An International Journal. 3(3):263--290, 1999.

High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 229

References: Distributed DM
• J. Aronis, V. Kolluri, F. Provost, and B. Buchanan. The WORLD: Knowledge discovery from multiple
distributed databases. In Florida Artificial Intelligence Research Symposium, May 1997.
• R. Bhatnagar and S. Srinivasan. Pattern discovery in distributed databases. In AAAI National Conference
on Artificial Intelligence, July 1997.
• J. Chattratichat, J. Darlington, Y. Guo, S. Hedvall, M. Kohler, and J. Syed. An architecture for distributed
enterprise data mining. In 7th Intl. Conf. High-Performance Computing and Networking, April 1999.
• R.L. Grossman, S. M. Bailey, H. Sivakumar, and A. L. Turinsky. Papyrus: A system for data mining over
local and wide area clusters and super-clusters. In Supercomputing'99, November 1999.
• Y. Guo and J. Sutiwaraphun. Knowledge probing in distributed data mining. In 3rd Pacific-Asia Conference
on Knowledge Discovery and Data Mining, August 1999.
• L. Hall, N. Chawla, K. Bowyer and W. P. Kegelmeyer. Learning rules from distributed data. In Zaki and Ho
(eds), Large-Scale Parallel Data Mining, LNAI Vol. 1759, Springer-Verlag 2000.
• H. Kargupta, I. Hamzaoglu, and B. Stafford. Scalable, distributed data mining using an agent based
architecture. In 3rd Intl. Conf. on Knowledge Discovery and Data Mining, August 1997.
• H. Kargupta, B-H. Park, D. Hershberger, and E. Johnson. Collective data mining: A new perspective toward
distributed data mining. In Kargupta and Chan (eds), Advances in Distributed DM, AAAI Press, 2000.
• S. Parthasarathy and R. Subramonian. Facilitating data mining on a network of workstations. In Kargupta
and Chan (eds), Advances in Distributed DM, AAAI Press, 2000.
• A. Prodromidis, S. Stolfo, and P. Chan. Meta-learning in distributed data mining systems: Issues and
approaches. In Kargupta and Chan (eds), Advances in Distributed DM, AAAI Press, 2000.
• S. Stolfo, A. Prodromidis, S. Tselepis, W. Lee, W. Fan, and P. Chan. JAM: Java agents for meta-learning
over distributed databases. In 3rd Intl. Conf. on Knowledge Discovery and Data Mining, August 1997.

High Performance Data Mining (Vipin Kumar and Mohammed Zaki) 230
