High Performance Data Mining
Tutorial overview
• Overview of KDD and data mining
• Parallel design space
• Classification
• Associations
• Sequences
• Clustering
• Future directions and summary
Overview of data mining
Massive databases
What is data mining?
• Prediction
- What?
- Opaque
• Description
- Why?
- Transparent
Data mining operations
• Verification driven
- Validating hypothesis
- Querying and reporting (spreadsheets, pivot tables)
- Multidimensional analysis (dimensional ...)
• Discovery driven
-Exploratory data analysis
- Predictive modeling
- Database segmentation
- Link analysis
- Deviation detection
KDD Process
Data mining process
• Apply data mining algorithm
-Associations, sequences, classification,
clustering, etc.
• Interpret, evaluate and visualize patterns
- What's new and interesting?
- Iterate if needed
• Manage discovered knowledge
- Close the loop
Why Mine Data?
Scientific Viewpoint...
• Data collected and stored at enormous speeds
(Gbyte/hour)
- remote sensor on a satellite
- telescope scanning the skies
- microarrays generating gene expression data
- scientific simulations generating terabytes of data
• Traditional techniques are infeasible for raw data
• Data mining for data reduction...
- cataloging, classifying, segmenting data
- Helps scientists in Hypothesis Formation
Data mining methods
Data mining techniques
Data mining techniques
Many other methods, such as
- Neural networks
- Genetic algorithms
- Hidden Markov models
- Time series
- Bayesian networks
- Soft computing: rough and fuzzy sets
• Scalability
- Efficient and sufficient sampling
-In-memory vs. disk-based processing
- High performance computing
• Automation
- Ease of use
- Using prior knowledge
Tutorial overview
• Overview of KDD and data mining
• Parallel design space
• Classification
• Associations
• Sequences
• Clustering
• Future directions and summary
Speeding up data mining (contd.)
Parallel Design Space
• Parallel architectures
- Distributed memory
- Shared disk
- Shared memory
-Hybrid cluster of SMPs
• Task and data parallelism
• Static and dynamic load balancing
• Horizontal and vertical data layout
• Data and concept partitioning
Parallel Hardware
• Distributed-memory machines
- Each processor has local memory and disk
- Communication via message-passing
- Hard to program: explicit data distribution
- Goal: minimize communication
• Shared-memory machines
- Shared global address space and disk
- Communication via shared memory variables
- Ease of programming
- Goal: maximize locality, minimize false sharing
• Current trend: Cluster of SMPs
Distributed Memory Architecture
(Shared Nothing)
[Figure: processors, each with local memory and disk, connected by an interconnection network]
Shared Memory Architecture
(Shared Everything)
[Figure: processors sharing a global memory and disks through an interconnection network]
Cluster of SMPs
[Figure: cluster of SMP nodes connected by an interconnection network]
Task vs. Data Parallelism
• Data Parallelism
-Data partitioned among P processors
- Each processor performs the same work
on local partition
• Task Parallelism
- Each processor performs different
computation
- Data may be (selectively) replicated or
partitioned
[Figure: uniprocessor execution compared with task-parallel and data-parallel decompositions]
Static vs. Dynamic Load Balance
• Static Load Balancing
- Work is initially divided (heuristically)
- N o subsequent computation or data
movement
• Dynamic Load Balancing
- Steal work from heavily loaded processors
- Reassign to lightly loaded processors
-Important for irregular computation,
heterogeneous environments, etc.
Data and concept partitioning
• Shared
- SMP or shared disk architectures
• Replicated
- Partially or totally
• Partitioned
- Round-robin partitioning
- Hash partitioning
- Range partitioning
Tutorial overview
• Overview of KDD and data mining
• Parallel design space
• Classification
• Associations
• Sequences
• Clustering
• Future directions and summary
What is Classification?
Classification learning
Classification learning
Classification Models
• Some models are better than others
- Accuracy
- Understandability
- Rule induction
- Regression models
- Neural networks Harder
Decision Tree Classification
[Figure: a decision tree with splitting attributes Refund, Marital Status, and Taxable Income]
Rules read off the tree include:
1) Refund = Yes -> NO
4) Refund = No and MarSt in {Married} -> NO
Classification algorithm
• Build tree
-Start with data at root node
-Select an attribute and formulate a
logical test on attribute
-Branch on each outcome of the test,
and move subset of examples
satisfying that outcome to
corresponding child node
Classification algorithm
- Recurse on each child node
- Repeat until leaves are "pure", i.e., have
example from a single class, or "nearly
pure", i.e., majority of examples are from
the same class
• Prune tree
- Remove subtrees that do not improve classification accuracy
- Avoid over-fitting, i.e., training-set-specific artifacts (see the sketch below)
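To make the build-and-prune recursion concrete, here is a minimal sketch in Python. It is illustrative only, not the tutorial's implementation: the names (build_tree, choose_split) are hypothetical, and the split-selection measure is left abstract (e.g., a minimum-Gini split as discussed next).

```python
from collections import Counter

def build_tree(examples, choose_split, min_purity=0.95):
    """examples: list of (attribute_dict, class_label) pairs."""
    labels = [c for _, c in examples]
    majority, count = Counter(labels).most_common(1)[0]
    if count / len(labels) >= min_purity:        # leaf is "pure" or "nearly pure"
        return ("leaf", majority)
    split = choose_split(examples)               # e.g., (attribute, threshold) with minimum Gini
    if split is None:                            # no useful test found
        return ("leaf", majority)
    attr, threshold = split
    left  = [(x, c) for x, c in examples if x[attr] <= threshold]
    right = [(x, c) for x, c in examples if x[attr] >  threshold]
    if not left or not right:                    # degenerate split: stop
        return ("leaf", majority)
    return ("node", attr, threshold,
            build_tree(left, choose_split, min_purity),
            build_tree(right, choose_split, min_purity))
```

Pruning would then walk the finished tree and replace subtrees that do not improve accuracy on held-out data with leaves.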
Build tree
• Evaluate split-points for all attributes
• Select the "best" point and the "winning"
attribute
• Split the data into two
• Breadth/depth-first construction
• CRITICAL STEPS:
- Formulation of good split tests
- Selection measure for attributes
Hunt's Method
• Attributes: Refund (Yes, No), Marital Status
(Single, Married, Divorced), Taxable Income
• Class: Cheat, Don't Cheat
What's really happening?
[Figure: training examples plotted by Marital Status and Income; the splits "Married" and "Income < 80K" separate Cheat from Don't Cheat]
• If S is pure, Gini(S) = 0
• Find split point with minimum Gini
• Only need class distributions
Finding good split points
[Figure: the same examples split two ways, by Marital Status and by Income, with class counts:]

            NO   YES                NO   YES
  Left      14     9      Top        5    23
  Right      1    18      Bottom    10     4
Categorical Attributes:
Computing Gini Index
• For each distinct value, gather counts for each class in the dataset
• Use the count matrix to make decisions
Continuous Attributes:
Computing Gini Index...
• For efficient computation: for each attribute,
- Sort the attribute on values
- Linearly scan these values, each time updating the count
matrix and computing gini index
- Choose the split position that has the least gini index
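As a concrete illustration of that sorted linear scan, the sketch below evaluates every candidate split point of one continuous attribute while incrementally updating the class-count matrix. It is an assumed, simplified rendering (names best_split, gini are hypothetical), not the SPRINT/ScalParC code.

```python
def gini(counts):
    """Gini(S) = 1 - sum_j p_j^2; equals 0 when S is pure."""
    n = sum(counts.values())
    return 0.0 if n == 0 else 1.0 - sum((c / n) ** 2 for c in counts.values())

def best_split(values, labels):
    """Sort one continuous attribute, scan linearly, return (weighted Gini, split value)."""
    pairs = sorted(zip(values, labels))
    total, left = {}, {}
    for _, c in pairs:
        total[c] = total.get(c, 0) + 1
    n, best = len(pairs), (float("inf"), None)
    for i, (v, c) in enumerate(pairs[:-1]):
        left[c] = left.get(c, 0) + 1                       # update count matrix incrementally
        right = {k: total[k] - left.get(k, 0) for k in total}
        w = (i + 1) / n * gini(left) + (n - i - 1) / n * gini(right)
        if w < best[0]:
            best = (w, (v + pairs[i + 1][0]) / 2)          # midpoint between adjacent values
    return best
```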
C4.5
• Simple depth-first construction.
• Sorts Continuous Attributes at each node.
• Needs entire data to fit in memory.
• Unsuitable for Large Datasets.
- Needs out-of-core sorting.
• Classification Accuracy shown to
improve when entire datasets are
used!
SPRINT [Shafer,Agrawal,Mehta]
Attribute Lists:
SPRINT (contd.)
• The arrays of the continuous attributes are pre-sorted.
- The sorted order is maintained during each split.
• The classification tree is grown in a breadth-first fashion.
• Class information is clubbed with each attribute list.
• Attribute lists are physically split among nodes.
• Split determining phase is just a linear scan of lists at each
node.
• Hashing scheme used in splitting phase.
- tids of the splitting attribute are hashed with the tree node as the key.
• Lookup table
- remaining attribute arrays are split by querying this hash structure.
SPRINT Disadvantages
Constructing a Decision Tree in
Parallel
[Figure: partitioning the nodes of a classification tree built from 10,000 training records among processors]
+ natural concurrency
- load imbalance, as the amount of work associated with each node varies
- child nodes use the same data as used by the parent node
- loss of locality
- high data movement cost
Synchronous Tree Construction Approach
[Figure: all processors (Proc 0 - Proc 3) expand the same tree nodes, exchanging class distribution information]
+ no data movement required
- load imbalance; can be eliminated by breadth-first expansion

Partitioned Tree Construction Approach
[Figure: tree nodes divided among Proc 0 - Proc 3, each processor working on its own subtree]
- high communication cost due to excessive data movements
- load imbalance
Hybrid Parallel Formulation
[Figure: computation frontier at depth 3 split into Partition 1 and Partition 2]
Load Balancing
[Figure: Step 1: exchange data between processor partitions; Step 2: load balance within processor partitions]
Splitting Criterion
Experimental Results
• Data set
- function 2 data set discussed in SLIQ paper
(Mehta, Agrawal and Rissanen, EDBT'96)
- 2 class labels, 3 categorical and 6 continuous
attributes
• IBM SP2 with 128 processors
- 66.7 MHz CPU with 256 MB real memory
- AIX version 4
- high performance switch
Speedup Comparison of the Three
Parallel Algorithms
[Plots: speedup of the partitioned, hybrid, and synchronous algorithms for 0.8 million and 1.6 million examples versus number of processors, and runtime versus log2 of the splitting criterion ratio]
Speedup of the Hybrid Algorithm with
Different Size Data Sets
[Plot: speedup curves of the hybrid algorithm for data sets of 0.8, 1.6, 3.2, 6.4, 12.8, and 25.6 million examples versus number of processors]
Summary of Algorithms for
Categorical Attributes
• Synchronous Tree Construction Approach
- no data movement required
- high communication cost as tree becomes bushy
• Partitioned Tree Construction Approach
- processors work independently once partitioned completely
- load imbalance and high cost of data movement
• Hybrid Algorithm
- combines good features of two approaches
- adapts dynamically according to the size and shape of trees
Design Approach
• Goal: Scalability in both Runtime and
Memory Requirements.
• Parallelization overhead: To = P Tp - Ts
• To must be O(Ts) for scalability. Per-processor overhead should not exceed O(Ts/P).
• Two components of To:
- Load Imbalance.
- Communication time.
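Restated in symbols (T_S = serial runtime, T_P = parallel runtime on P processors); this is only a rewriting of the condition above:

```latex
T_o = P\,T_P - T_S, \qquad \text{scalable} \iff T_o = O(T_S) \iff \frac{T_o}{P} = O\!\left(\frac{T_S}{P}\right)
```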
Load Balancing
SP : split point, rid : record id, cid : class label
[Figure: two sorted attribute lists B1 and B2 (value, rid, cid) distributed across processors P0 and P1, before and after the split at SP; the redistributed sub-lists can leave P0 and P1 with unequal amounts of work]
Parallel SPRINT
[Figure: parallel SPRINT count collection; processors scan their portions of a categorical attribute list (values Family, Sports) and combine local class counts through a parallel reduction]
Parallelizing Splitting Phase..
• SPRINT Approach:
- Builds hash table on all processors by an
all-to-all broadcast operation.
- So, O(N) data is sent and received by each processor --> not scalable in runtime or in memory requirements.
Categorical/Categorical
Splitting a categorical attribute list requires access to exactly those entries of the hash table that were created by the same processor
Continuous/Continuous
[Figure: splitting a continuous attribute list; the entries needed by processors P1, P2, P3 may lie in hash-table portions owned by other processors]
Getting the required entries of the
hash table
The required entries are transferred in two steps
- From splitting attribute order to tid sorted order
- From tid sorted order to attribute list order
[Figure: the two-step transfer; X marks Salary entries, O marks Age entries, shaded cells are Node Table entries]
Parallel Hashing Paradigm
• Distributed Hash Table
• Hash function: h(k) = (Pi, I)
• Construction:
- (k,v) --> (Pi,I) --> form send buffers {(I,v)} for each Pi --> all-to-all personalized communication
• Enquiry:
- (k) --> (Pi,I) --> form {(I)} buffers for each Pi --> all-to-all personalized comm --> each Pi replaces received {(I)} with {(v)} --> all-to-all personalized comm (a toy sketch follows below)
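The following is a toy, single-process simulation of this paradigm, with plain Python lists standing in for the per-processor tables and send buffers; the class name and methods are hypothetical, and the all-to-all personalized communication is only mimicked, not performed.

```python
class DistributedHashTable:
    """Toy simulation of the parallel hashing paradigm: h(k) -> (processor, local index)."""
    def __init__(self, nprocs, slots_per_proc):
        self.nprocs, self.slots = nprocs, slots_per_proc
        self.tables = [[None] * slots_per_proc for _ in range(nprocs)]

    def h(self, key):
        slot = hash(key) % (self.nprocs * self.slots)
        return divmod(slot, self.slots)              # (processor id, local index)

    def construct(self, pairs):
        """Scatter (key, value) pairs; one all-to-all step in the real algorithm."""
        buffers = [[] for _ in range(self.nprocs)]   # per-processor send buffers {(I, v)}
        for k, v in pairs:
            p, i = self.h(k)
            buffers[p].append((i, v))
        for p, buf in enumerate(buffers):            # "communication": deliver each buffer
            for i, v in buf:
                self.tables[p][i] = v                # collisions ignored in this toy sketch

    def enquire(self, keys):
        """Ship out {(I)} buffers, get {(v)} back; simulated here as direct lookups."""
        return [self.tables[p][i] for p, i in map(self.h, keys)]
```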
[Figure: (a) hash send buffers formed on each processor, (b) all-to-all personalized communication, (c) Node Table entries and rid-to-child (L/R) assignments after the update]
Applying the Paradigm to
ScalParC: Enquiry
[Figure: ScalParC enquiry for the Age attribute list on processors P0-P2: enquiry buffers of record ids, intermediate index and value buffers, and the L/R child assignments returned from the distributed Node Table]
Verifying Scalability
• Updating the distributed hash table
- Each processor receives at most N/p entries
- Each processor can send at most Nk/p entries (k is the number of
continuous attributes)
• This is a worst case situation that does not happen frequently
• Inquiring the distributed hash table
- Each processor will perform k inquiries, one for each continuous attribute
- Each processor receives at most N/p entries
- Each processor sends at most N/p entries
• The per-processor communication overhead is in general O(N/p)
• Each processor requires only O(N/p) memory, making ScalParC memory efficient
A worst case scenario for Updating
Hash Table
One processor might need to send O(kN/p) data! Happens infrequently.
[Figure: attribute lists B1 and B2 (value, rid, cid) on processors P0 and P1 before and after the split, illustrating a case where nearly all of one processor's entries must move]
Categorical/Continuous and
Continuous/Categorical
• Special cases of the Continuous/Continuous
- Categorical/Continuous does not require
communication for loading the hash table
- Continuous/Categorical does not require
communication for inquiring the hash table
Algorithm: Level-wise
Communications
• Tree is built in a breadth-first manner.
• At each level of the decision tree
- Count matrices for all attributes for all nodes are reduced in one single communication operation
- Loading the hash table for all the nodes is combined into a single communication operation
- Inquiring the hash table to split a particular node (Find-Split II)
- Partition splitting attributes and update Node Table (Perform-Split I)
- Partition non-splitting attributes by inquiring the Node Table
Example
[Figure: worked example of the level-wise phases on the Salary and Age attribute lists, showing local and global count matrices and the Node Table across the processors]
Experimental Results
• Experiments were performed on a 128-processor Cray T3D
• Training sets were synthetically generated
- Each contained only continuous attributes (5-9)
- There were two possible class labels
Parallel Runtime
[Plot: parallel runtime with a constant problem size per processor]
Different Number of Attributes
[Plot: runtime for different numbers of attributes]
SPRINT: Attribute Lists
[Figure: a small training set (Tid, Age, CarType, Class) split into per-attribute lists, each entry carrying the class label and Tid, plus the hash table mapping each Tid to its child node]
SPRINT: File per Attribute
[Figure: files created per attribute as the tree grows; 32 files per attribute in one layout versus 4 in the other]
BASIC: Data Parallel
• Given current leaf frontier
• Atomically acquire an attribute and
evaluate it for all leaves
• Master finds winning attribute and
constructs hash table
• Atomically acquire an attribute and split
its list for all leaves
• Barrier synchronization between phases
[Figure: BASIC schedule with barrier synchronization between the evaluation and split phases]
BASIC (Level View, A=3, P=4)
[Figure: per-leaf attribute work assigned as P0(4), P1(4), P2(4), P3(0)]
FWK (Tree View)
[Figure: FWK tree view with processor set P = {0,1,2,3}; leaves processed with Window = 2]
Moving Window: Data Parallel
• Partition leaves into blocks/window of K
• Dynamically acquire any attribute for
any leaf, say i, within current block
• Wait for the last block's i-th leaf
• Last processor to work on a leaf notes
the winning attribute and builds the
hash table
• Split data as in BASIC
• No barriers, only conditional wait
MW2 (Level View)
[Figure: MW2 level view; per-leaf work assigned as P0(3), P1(3), P2(3), P3(3) with Window = 2]
SUBTREE (4 procs)
[Figure: processors P0-P3 working on separate subtrees, coordinated through a queue of free processors (FreeQ)]
Experimental Setup
• SMP Machines
- 112 MHz PowerPC-604 processors
- 16KB L1-cache, 1MB L2-cache
- Machine: 4-way SMP
• 128 MB memory, 300 MB disk
Experimental Datasets
Dataset    Function  #Attr  Num Records  DBSize  Num Levels  Max Nodes
F2A8D1M    F2        8      1000K        192MB   4           2
Parallel Performance
[Charts: execution time and speedup of MW4 and SUBTREE on the F2A64D125K and F7A64D125K datasets (functions F2 and F7) for 1-4 processors]
Tutorial overview
• Overview of KDD and data mining
• Parallel design space
• Classification
• Associations
• Sequences
• Clustering
• Future directions and summary
Association mining
• User specifies "interestingness"
-Minimum support (minsup)
- Minimum confidence (minconf)
• Find all frequent itemsets (> minsup)
-Exponential Search Space
-Computation and I/O Intensive
• Generate strong rules (> minconf)
- Relatively cheap
Handling Exponential Complexity
Apriori Principle
• Collect single item counts. Find large items.
• Find candidate pairs, count them => large
pairs of items.
• Find candidate triplets, count them => large triplets of items. And so on...
• Guiding Principle: Every subset of a frequent itemset has to be frequent.
- Used for pruning many candidates (see the sketch below).
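A compact sketch of the level-wise idea follows. It is illustrative only: candidates are kept in plain Python sets and counted by brute force rather than with the hash tree described later, and minsup is an absolute count.

```python
from itertools import combinations

def apriori(transactions, minsup):
    """Level-wise search: a k-itemset is a candidate only if all its (k-1)-subsets are frequent."""
    items = {i for t in transactions for i in t}
    freq = {frozenset([i]) for i in items
            if sum(i in t for t in transactions) >= minsup}
    all_frequent, k = set(freq), 2
    while freq:
        cands = {a | b for a in freq for b in freq if len(a | b) == k}
        # Apriori pruning: drop candidates with an infrequent (k-1)-subset.
        cands = {c for c in cands
                 if all(frozenset(s) in freq for s in combinations(c, k - 1))}
        freq = {c for c in cands
                if sum(c <= set(t) for t in transactions) >= minsup}
        all_frequent |= freq
        k += 1
    return all_frequent
```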
371
Illustrating Apriori Principle
Items (1-itemsets), Pairs (2-itemsets), Triplets (3-itemsets); Minimum Support = 3
If every subset is considered: 6C1 + 6C2 + 6C3 = 41 candidates
With support-based pruning: 6 + 6 + 2 = 14 candidates
Counting Candidates
• Frequent Itemsets are found by counting
candidates.
• Simple way:
- Search for each candidate in each transaction. Expensive!
[Figure: N transactions matched against M candidates]
Association Rule Discovery: Hash
tree for fast access.
Candidate Hash Tree
[Figure: candidate hash tree; a hash function on item ids routes each candidate to a leaf]
Association Rule Discovery: Subset Operation (contd.)
[Figure: the items of a transaction are hashed down the candidate hash tree to reach the leaves holding matching candidates]
• Data Distribution:
-Partition the Transaction Database, or
- Partition the Candidates, or
- Both
Parallel Association Rules: Count
Distribution (CD)
• Each Processor has complete candidate
hash tree.
• Each Processor updates its hash tree with
local data.
• Each Processor participates in global
reduction to get global counts of candidates
in the hash tree.
• Multiple database scans per iteration are required if the hash tree is too big for memory (a toy sketch follows below).
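A minimal simulation of Count Distribution, with Python multiprocessing standing in for the message-passing layer and a plain candidate list standing in for the hash tree; the global reduction is the element-wise sum of the local count vectors. Names and data are made up for illustration.

```python
from multiprocessing import Pool

def local_counts(args):
    """One 'processor': count every candidate against the local partition only."""
    partition, candidates = args
    return [sum(c <= t for t in partition) for c in candidates]

def count_distribution(partitions, candidates, nprocs):
    with Pool(nprocs) as pool:
        per_proc = pool.map(local_counts, [(p, candidates) for p in partitions])
    return [sum(col) for col in zip(*per_proc)]      # global reduction of counts

if __name__ == "__main__":
    parts = [[{"A", "B"}, {"A", "C"}], [{"B", "C"}, {"A", "B", "C"}]]
    cands = [frozenset({"A", "B"}), frozenset({"B", "C"})]
    print(count_distribution(parts, cands, nprocs=2))   # -> [2, 2]
```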
CD: Illustration
[Figure: each processor counts candidates over its local data partition, followed by a global reduction of counts]
Parallel Association Rules: Data
Distribution (DD)
• Candidate set is partitioned among the
processors.
• Once local data has been partitioned, it is
broadcast to all other processors.
• High Communication Cost due to data
movement.
• Redundant work due to multiple traversals
of the hash trees.
DD: Illustration
[Figure: each processor's local data is broadcast to all other processors; all-to-all broadcast of candidates]
Parallel Association Rules: Intelligent
Data Distribution (IDD)
• Data Distribution using point-to-point
communication.
• Intelligent partitioning of candidate sets.
- Partitioning based on the first item of candidates.
- Bitmap to keep track of local candidate items.
• Pruning at the root of candidate hash tree using the
bitmap.
• Suitable for single data source such as database
server.
• With smaller candidate set, load balancing is
difficult.
IDD: Illustration
[Figure: data is shifted among processors point-to-point while each processor counts only its own candidate partition; all-to-all broadcast of candidates]
Filtering Transactions in IDD
[Figure: a bitmask of locally assigned candidate items filters each transaction before it is checked against the hash tree]
HD: Illustration
[Figure: Hybrid Distribution (HD) arranges the processors in a grid with P/G processors per group; Count Distribution is applied along the rows]
Parallel Association Rules: Scaleup
Results (100K, 0.25%)
[Plots: scaleup of CD, IDD, HD, and DD with increasing numbers of processors, and response time versus number of transactions per processor (K)]
Parallel Association Rules: Response
Time (np = 16, 50K)
[Plots: response time of CD, HD, IDD, and a simple hybrid as minimum support drops from 0.5% to 0.06%, with the number of candidates (K) marked at each point]
Parallel Association Rules:
Minimum Support Reachable
[Plot: minimum support reachable by each algorithm as the number of processors grows]
Parallel Association Rules:
Summary of Experiments
• HD shows the same linear speedup and sizeup behavior as CD.
• HD exploits the total aggregate main memory, while CD does not.
• IDD has much better scaleup behavior than DD.
Eclat Approach
Example Database

DISTINCT DATABASE ITEMS
A = Jane Austen, C = Agatha Christie, D = Sir Arthur Conan Doyle, T = Mark Twain, W = P. G. Wodehouse

DATABASE
Transaction   Items
1             A C T W
2             C D W
3             A C T W
4             A C D W
5             A C D T W
6             C D T

ALL FREQUENT ITEM-SETS
Support     Item-Sets
100% (6)    C
83%  (5)    W, CW
67%  (4)    A, D, T, AC, AW, CD, CT, ACW
50%  (3)    AT, DW, TW, ACT, ATW, CDW, CTW, ACTW
Frequent Itemset Lattice
[Figure: lattice of the frequent itemsets from the example database]
DOWNWARD CLOSED ON ITEM-SET SUPPORT
MAXIMAL FREQUENT ITEM-SETS: ACTW, CDW
Eclat: Support Counting
[Figure: tid-lists for the items; intersecting the lists of C and D gives the tid-list of CD, and intersecting CD with CW gives CDW]
EQUIVALENCE CLASSES:
[A] = { A, AC, AT, AW, ACT, ACW, ATW, ACTW }
[C] = { C, CD, CT, CW, CDW, CTW }
[D] = { D, DW }   [T] = { T, TW }   [W] = { W }
CROSS-CLASS LINKS USED FOR PRUNING
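Using the example database above, support counting in the vertical format is just tid-list intersection; the snippet below (illustrative names) reproduces the supports shown on the slide.

```python
# Vertical tid-lists for the example database (transactions 1-6).
tidlists = {
    "A": {1, 3, 4, 5},
    "C": {1, 2, 3, 4, 5, 6},
    "D": {2, 4, 5, 6},
    "T": {1, 3, 5, 6},
    "W": {1, 2, 3, 4, 5},
}

def support(itemset):
    """Support = size of the intersection of the member items' tid-lists."""
    return len(set.intersection(*(tidlists[i] for i in itemset)))

print(support("CD"))    # tid-list {2, 4, 5, 6} -> 4 (67%)
print(support("CDW"))   # tid-list {2, 4, 5}    -> 3 (50%)
```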
Lattice Search Strategies
• Bottom-up
-Level-wise like Apriori
• Top-down
- Start with largest element. If frequent,
done! Else look at subsets
• Hybrid
- Find long frequent/maximal itemsets
- F i n d remaining using bottom-up search
Top-down Lattice Search
[Figure: top-down search of the itemset lattice]
MINIMAL INFREQUENT ITEM-SET: AD
MAXIMAL FREQUENT ITEM-SET: ACTW
Parallel Eclat/MaxEclat
• Partition tidlists among processors
• Schedule equivalence classes
• Selectively replicate database
• Solve Asynchronously/Independently
• Algorithm Space
- Eclat (Bottom-up)
- MaxEclat (Hybrid)
Count Distribution
[Figure: Count Distribution over a database partitioned across Processors 0-2: count items, get frequent items, form candidate pairs, count pairs, count triples]
Parallel Eclat
[Figure: Parallel Eclat over the same database partitioned across Processors 0-2]
Parallel Eclat Algorithm
ITEM-SET LATTICE CLASS PARTITIONING

EQUIVALENCE CLASSES
[A] = { AC, AT, AW }
[C] = { CD, CT, CW }
[D] = { DW }
[T] = { TW }

CLASS SCHEDULE
[Entire classes are assigned to Processor 0 (P0) and Processor 1 (P1), e.g. [C] = { CD, CT, CW } and [D] = { DW } to one processor and the remaining classes to the other]

TID-LIST COMMUNICATION
Original Database, Partitioned Database, After Tid-List Exchange
[Figure: tid-lists for A, C, D, T, W in the original database, horizontally partitioned across P0 and P1, and selectively replicated after the exchange]
Eclat Experiments
• Database: T20.I6.D4550K
- 1000 items
- 4,550,000 transactions
• Machine: eight 4-way SMP nodes
- Digital Memory Channel (5 µs, 30 Mb/s)
- 233 MHz CPUs, 256 MB memory, 1 MB L2 cache
• Hierarchical parallelization
- Message passing + SMP
Parallel Performance
[Charts: execution time and speedup of the hierarchical parallel Eclat with 1 process per host and with 2 processes per host, for 1, 2, 4, and 8 hosts]
Tutorial overview
• Overview of KDD and data mining
• Parallel design space
• Classification
• Associations
• Sequences
• Clustering
• Future directions and summary
[Figure: timelines of events for Object 1 and Object 2, with events occurring at times 10 through 50]
Sequential Pattern Discovery:
Definition
Given is a set of objects, with each object associated with its own
timeline of events, find rules that predict strong sequential
dependencies among different events.
(A B) (C) --> (D E)
Rules are formed by first discovering patterns. Event occurrences in the patterns are governed by timing constraints.
[Figure: the pattern (A B) (C) (D E) annotated with its timing constraints: maximum gap (xg), window size (ws), and span (ms)]
Apriori-like Approach
Sequential Apriori:
Count operation
• Various possibilities for counting occurrences
of a sequence in an object's timeline
- Count only one occurrence per object.
- Count the number of span-size windows the sequence occurs in.
- Count the number of distinct occurrences of a sequence:
• Each event-timestamp pair considered at most once.
• Each counted occurrence has at least one new event-timestamp pair.
Sequential Apriori:
Count operation (contd..)
• Hash Tree used for fast search of candidate
occurrences. Similar to association rule
discovery, except for following differences.
- E v e r y event-timestamp pair in the timeline
is hashed at the root.
-Events eligible for hashing at the next level
are determined by the maximum gap (xg),
window size (ws), and span (ms)
constraints.
[Figure: candidate sequence hash tree with leaves such as (1)(2,5), (4,5,8), (1)(5)(9), (1)(2,3), (4)(8,9); an object's timeline of event@time pairs, e.g. 1@0, 3@5, 4@0, 5@5, 1@12, 1@15, is hashed down the tree]
Counting Leaf Level Candidates
[Figure: part of the candidate sequence hash tree matched against Object 2's timeline; Count = 12]
Parallel Sequential Associations:
Event Distribution (EVE)
[Figure: the objects' timelines partitioned across processors P0, P1, P2]
Event and Candidate Distribution
(EVECAN)
Rotate candidates among processors, in either a round-robin manner or in ...
[Figure: events partitioned across processors while blocks of candidates rotate among them]
The SPADE Approach
• Search space is a lattice
• Use "inverted" vertical lists
• Find support via list intersections
• Break large search space into small
independent suffix-based chunks
• Process each chunk recursively using
breadth/depth-first search
• Takes only a few data scans
Example Database
DATABASE
[Table: customer transactions (Customer-Id, Transaction-Time, Items), e.g. customer 2 buys A C on April 2, D on April 5, and C T W on April 20; customer 3 buys A C D on April 15 and D T W on April 27]

FREQUENT SEQUENCES (75%)
100% (3):  C, D, T, W, C->D, C->T, C->W, D->T, D->W, TW, C->TW, D->TW
Also frequent: A->W, AC->T, AC->D, AC->W, A->TW, AC->TW, ...

MAXIMAL FREQUENT SEQUENCES
AC->D, AC->TW, C->D->TW
Frequent sequence lattice
LATTICE INDUCED BY MAXIMAL FREQUENT SEQUENCES
AC->D, AC->TW, C->D->TW
[Figure: the lattice induced by the maximal frequent sequences, down to the empty sequence {}]
Temporal Joins
[Figure: the customer-transaction (cid, time) lists of P->X and P->Y are intersected to produce the lists of P->X->Y, P->Y->X, and the event co-occurrence P->XY]
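A rough sketch of such a temporal join follows; it treats each id-list as a list of (customer id, transaction time) pairs and keeps the later timestamp for the sequence joins. The names and tuple layout are assumptions, not SPADE's actual data structures.

```python
def temporal_join(list_x, list_y):
    """Join the (cid, time) lists of P->X and P->Y into P->X->Y, P->Y->X, and P->XY."""
    xy_seq, yx_seq, xy_event = [], [], []
    for cid_x, t_x in list_x:
        for cid_y, t_y in list_y:
            if cid_x != cid_y:
                continue                        # joins happen within a single customer
            if t_x < t_y:
                xy_seq.append((cid_x, t_y))     # X strictly before Y: P->X->Y
            elif t_y < t_x:
                yx_seq.append((cid_x, t_x))     # Y strictly before X: P->Y->X
            else:
                xy_event.append((cid_x, t_x))   # same transaction: event set P->XY
    return xy_seq, yx_seq, xy_event
```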
Parallel Design Space
Idlist Data Parallelism
Task Parallelism:
Static Load Balancing
• Given level 1 classes, C1, C2, C3, ...
• Assign weights W1, W2, W3, ... based on the class size (# sequences in the class)
• Greedy scheduling of entire classes
- Sort classes on weight (descending)
- Assign class to proc with least total weight
• Each proc solves assigned classes asynchronously
• Hard to get an accurate work estimate (a greedy-schedule sketch follows below)
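The greedy schedule can be written in a few lines; the sketch below uses a min-heap of processor loads. Function and variable names are illustrative.

```python
import heapq

def greedy_schedule(class_weights, nprocs):
    """class_weights: {class_name: weight}. Returns {processor id: [assigned classes]}."""
    assignment = {p: [] for p in range(nprocs)}
    loads = [(0.0, p) for p in range(nprocs)]        # (total weight, processor)
    heapq.heapify(loads)
    for cls, w in sorted(class_weights.items(), key=lambda kv: -kv[1]):
        load, p = heapq.heappop(loads)               # least-loaded processor so far
        assignment[p].append(cls)
        heapq.heappush(loads, (load + w, p))
    return assignment
```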
Static and Dynamic Load Balancing
[Figure: comparison of static and dynamic load balancing]
Task Parallelism:
Dynamic Load Balancing
• Given level 1 classes, C1, C2, C3, ... and
their weights
• Sort classes on weight (descending) and
insert in a logical task queue
• A proc atomically grabs the first available
class and solves it completely
• Repeat until no more classes in queue
• Uses only inter-class parallelism
• Queue contention negligible since classes
are coarse-grained
Task Parallelism:
Recursive DLB (pSPADE)
• Process available classes in parallel
• Worst case: P-1 procs free, 1 proc busy
• Classes sorted on weight, last class
usually small (but it could be large)
• Provide mechanism for free procs to
join busy group
• At each level, get free procs, insert the
classes into a shared task queue;
process available classes in parallel
pSPADE: Recursive Dynamic Load Balancing
[Figure: free processors join the group working on the remaining classes at each level]
Experimental Setup
• SGI Origin 2000
- 12 195 MHz R10K MIPS processors
- 4 MB cache, 2 GB memory
• Synthetic datasets
- C : number of transactions/customer
-T: average transaction size
- S/I: average sequence/itemset size
- D : number of customers
Data vs. Task Parallelism
[Charts: execution time of the data-parallel and task-parallel formulations on C10-T5-S4-I1.25-D1M and C20-T5-S8-I1.25-D1M for 1, 2, and 4 processors]
Static vs. Dynamic DLB
[Chart: execution times on C10T5S4I1.25, C20T5S8I1.25, and C20T5S8I2; recursive DLB is 12% better than dynamic and, overall, 44% better than static]
Parallel Performance
[Charts: execution time and speedup of pSPADE on C10-T5-S4-I1.25-D1M and on C20-T5-S8-I2-D1M for 1, 2, 4, and 8 processors]
Scaleup and Support
[Charts: scaleup on C5-T2.5-S4-I1.25 as the number of customers grows from 1 to 10 million, and execution time on C5-T2.5-S4-I1.25-D1M as minimum support drops from 0.10% to 0.01%]
pSPADE Summary
• Task parallel better than data parallel
• Dynamic load balancing
• Asynchronous algorithms (independent
classes)
• Good locality (uses only intersections)
• Good speedup and scaleup
• What next? Gigabyte databases
Tutorial overview
• Overview of KDD and data mining
• Parallel design space
• Classification
• Associations
• Sequences
• Clustering
• Future directions and summary
What is clustering?
Clustering Illustration
Euclidean distance based clustering in 3-D space
[Figure: points in three dimensions grouped into clusters]
Clustering
Clustering schemes
• Distance-based
- Numeric
• Euclidean distance (root of sum of squared
differences along each dimension)
• Angle between two vectors
- Categorical
• Number of common features (categorical)
• Partition-based
- Enumerate partitions and score each
Clustering schemes
• Model-based
- Estimate a density (e.g., a mixture of
gaussians)
- Go bump-hunting
- Compute P(feature vector i | cluster j)
- Finds overlapping clusters too
- Example: Bayesian clustering
Before clustering
• Normalization:
- Given three attributes
• A -- micro-seconds
• B -- milli-seconds
• C -- seconds
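The point of the slide is that attributes on wildly different scales (micro-seconds versus seconds) would otherwise dominate any distance computation. One common fix, shown as an assumed example rather than the slide's prescription, is z-score normalization:

```python
import numpy as np

def zscore(X):
    """Rescale each column to zero mean and unit variance before distance-based clustering."""
    X = np.asarray(X, dtype=float)
    mean, std = X.mean(axis=0), X.std(axis=0)
    std[std == 0] = 1.0            # leave constant columns unchanged
    return (X - mean) / std
```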
K-means algorithm
[Figure: data points with the initial seeds marked]
K-means algorithm
[Figure: points assigned to the nearest seed and the new centers computed]
K-means algorithm
Operations in k-means
• Main Operation: Calculate distance to
all k means or centroids
• Other operations:
- Find the closest centroid for each point
- Calculate the mean squared error (MSE) for all points
- Recalculate centroids
Parallel k-means
• Divide N points among P processors
• Replicate the k centroids
• Each processor computes distance of
each local point to the centroids
• Assign points to closest centroid and
compute local MSE
• Perform a reduction to obtain the global centroids and the global MSE value (a sketch follows below)
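A sketch of one parallel k-means iteration under that scheme, with NumPy arrays standing in for the per-processor partitions and plain sums standing in for the global reduction (a real implementation would use an MPI-style allreduce); names are illustrative.

```python
import numpy as np

def local_step(points, centroids):
    """One processor: assign local points to the nearest centroid, return partial results."""
    d = ((points[:, None, :] - centroids[None, :, :]) ** 2).sum(axis=2)
    nearest = d.argmin(axis=1)
    k, dim = centroids.shape
    sums, counts = np.zeros((k, dim)), np.zeros(k)
    for j in range(k):
        mask = nearest == j
        sums[j], counts[j] = points[mask].sum(axis=0), mask.sum()
    mse = d[np.arange(len(points)), nearest].sum()
    return sums, counts, mse

def parallel_kmeans_iteration(partitions, centroids):
    """Reduce the per-processor partial sums into global centroids and global MSE."""
    results = [local_step(p, centroids) for p in partitions]   # would run in parallel
    sums   = sum(r[0] for r in results)
    counts = sum(r[1] for r in results)
    mse    = sum(r[2] for r in results)
    counts[counts == 0] = 1                                    # avoid division by zero for empty clusters
    return sums / counts[:, None], mse
```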
• Drawbacks of K-means
- Doesn't do well with overlapping clusters
- Clusters easily pulled off-center by outliers
- Each record is either in or out of a cluster; no notion of some records being more or less likely than others to really belong to the cluster they have been assigned to
• GMM: probabilistic variant of K-means
Expectation-maximization (EM)
Deviation/outlier detection
Deviation detection
[Figure: a data set with one point flagged as an outlier]
K-nearest neighbors
K-nearest neighbors
[Figure: the neighborhood of a query point contains 5 points of class 0 and 3 points of class +]
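The figure's 5-versus-3 vote is the whole algorithm; a minimal sketch (illustrative names, plain Euclidean distance) follows.

```python
from collections import Counter

def knn_classify(query, labeled_points, k=8):
    """Vote among the k nearest labeled points; labeled_points is a list of (point, label)."""
    def dist2(p, q):
        return sum((a - b) ** 2 for a, b in zip(p, q))
    neighbors = sorted(labeled_points, key=lambda pl: dist2(pl[0], query))[:k]
    votes = Counter(label for _, label in neighbors)
    return votes.most_common(1)[0][0]   # e.g., 5 votes for class '0' vs 3 for '+' -> '0'
```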
Tutorial overview
• Overview of KDD and data mining
• Parallel design space
• Classification
• Associations
• Sequences
• Clustering
• Future directions and summary
Large-scale Parallel KDD Systems
• Terabyte-sized datasets
• Centralized or distributed datasets
• Incremental changes
• Heterogeneous data sources
• Pre-processing, mining, post-processing
• Modular (rapid development)
• Benchmarking (algorithm comparison)
Research Directions
• Fast algorithms: different mining tasks
-Classification, clustering, associations, etc.
-Incorporating concept hierarchies
• Parallelism and scalability
- Millions of records
- T h o u s a n d s of attributes/dimensions
- Single pass algorithms
- Sampling
- Parallel I/O and file systems
Research Directions (contd.)
Research Directions (contd.)
Books
• A. Freitas and S. Lavington. Mining very large databases with parallel processing. Kluwer Academic
Pub., Boston, MA, 1998.
• M.J. Zaki and C.-T. Ho (eds). Large-Scale Parallel Data Mining. LNAI State-of-the-Art Survey,
Volume 1759, Springer-Verlag, 2000.
• H. Kargupta and P. Chan (eds). Advances in Distributed and Parallel Knowledge Discovery, AAAI
Press, Summer 2000.
Resources (contd.)
Journal Special Issues
• P. Stolorz and R. Musick (eds.). Scalable High-Performance Computing for KDD, Data Mining and
Knowledge Discovery: An International Journal, Vol. 1, No. 4, December 1997.
• Y. Guo and R. Grossman (eds.). Scalable Parallel and Distributed Data Mining, Data Mining and
Knowledge Discovery: An International Journal, Vol. 3, No. 3, September 1999.
• V. Kumar, S. Ranka and V. Singh. High Performance Data Mining, Journal of Parallel and Distributed
Computing, forthcoming, 2000.
Survey Articles
• F. Provost and V. Kolluri. A survey of methods for scaling up inductive algorithms. Data Mining and
Knowledge Discovery: An International Journal, 3(2): 131--169, 1999.
• A. Srivastava, E.-H. Han, V. Kumar and V. Singh. Parallel formulations of decision-tree classification
algorithms. Data Mining and Knowledge Discovery: An International Journal, 3(3):237--262, 1999.
• M.J. Zaki. Parallel and distributed association mining: A survey. In IEEE Concurrency special issue
on Parallel Data Mining, 7(4):14-25, Oct-Dec 1999.
• D. Skillicorn. Strategies for parallel data mining. IEEE Concurrency, 7(4):26--35, Oct-Dec 1999.
• M.V. Joshi, E.-H. Han, G. Karypis and V. Kumar. Efficient parallel algorithms for mining associations.
In Zaki and Ho (eds.), Large-Scale Parallel Data Mining, LNAI 1759, Springer-Verlag 2000.
• M.J. Zaki. Parallel and distributed data mining: An introduction. In Zaki and Ho (eds.), Large-Scale
Parallel Data Mining, LNAI 1759, Springer-Verlag 2000.
References: Classification
• J. Chattratichat, J. Darlington, M. Ghanem, Y. Guo, H. Huning, M. Kohler, J. Sutiwaraphun, H. W. To, and
Y. Dan. Large scale data mining: Challenges and responses. In 3rd Intl. Conf. on Knowledge Discovery and
Data Mining, August 1997.
• S. Goil and A. Choudhary. Efficient parallel classification using dimensional aggregates. In Zaki and Ho
(eds), Large-Scale Parallel Data Mining, LNAI Vol. 1759, Springer-Verlag 2000.
• M. Holsheimer, M. L. Kersten, and A. Siebes. Data Surveyor: Searching the nuggets in parallel. In Fayyad
et al.(eds.), Advances in KDD, AAAI Press, 1998.
• M. Joshi, G. Karypis, and V. Kumar. ScalParC: A scalable and parallel classification algorithm for mining
large datasets. In Intl. Parallel Processing Symposium, 1998.
• R. Kufrin. Decision trees on parallel processors. In J. Geller, H. Kitano, and C. Suttner, editors, Parallel
Processing for Artificial Intelligence 3. Elsevier-Science, 1997.
• S. Lavington, N. Dewhurst, E. Wilkins, and A. Freitas. Interfacing knowledge discovery algorithms to large
databases management systems. Information and Software Technology, 41:605--617, 1999.
• F. Provost and J. Aronis. Scaling up inductive learning with massive parallelism. Machine Learning, 23(1),
April 1996.
• F. Provost and V. Kolluri. A survey of methods for scaling up inductive algorithms. Data Mining and
Knowledge Discovery: An International Journal, 3(2): 131-- 169, 1999.
• John Shafer, Rakesh Agrawal, and Manish Mehta. SPRINT: A scalable parallel classifier for data mining. In
Proc. of the 22nd Int'l Conference on Very Large Databases, Bombay, India, September 1996.
• M. Sreenivas, K. Alsabti, and S. Ranka. Parallel out-of-core divide and conquer techniques with application
to classification trees. In 13th International Parallel Processing Symposium, April 1999.
• A. Srivastava, E-H. Han, V. Kumar, and V. Singh. Parallel formulations of decision-tree classification
algorithms. Data Mining and Knowledge Discovery: An International Journal, 3(3):237--261, 1999.
• M.J. Zaki, C.-T. Ho, and R. Agrawal.Parallel classification for data mining on shared-memory
multiprocessors. In 15th IEEE Intl. Conf. on Data Engineering, March 1999.
References: Associations
• R. Agrawal, H. Mannila, R. Srikant, H. Toivonen, and A. Inkeri Verkamo. Fast discovery of association
rules. In U. Fayyad and et al, editors, Advances in Knowledge Discovery and Data Mining, pages 307--328.
AAAI Press, Menlo Park, CA, 1996.
• R. Agrawal and J. Shafer. Parallel mining of association rules. IEEE Trans. on Knowledge and Data Engg.,
8(6):962--989, December 1996.
• D. Cheung, J. Han, V. Ng, A. Fu, and Y. Fu. A fast distributed algorithm for mining association rules. In 4th
Intl. Conf. Parallel and Distributed Info. Systems, December 1996.
• D. Cheung, V. Ng, A. Fu, and Y. Fu. Efficient mining of association rules in distributed databases. IEEE
Transactions on Knowledge and Data Engg., 8(6):911-922, December 1996.
• D. Cheung, K. Hu, and S. Xia. Asynchronous parallel algorithm for mining association rules on shared-
memory multi-processors. In 10th ACM Symp. Parallel Algorithms and Architectures, June 1998.
• D. Cheung and Y. Xiao. Effect of data distribution in parallel mining of associations. Data Mining and
Knowledge Discovery: An International Journal. 3(3):291--314, 1999.
• E-H. Han, G. Karypis, and V. Kumar. Scalable parallel data mining for association rules. In ACM SIGMOD
Conf. Management of Data, May 1997.
• M. Joshi, E.-H. Han, G. Karypis, and V. Kumar. Efficient parallel algorithms for mining associations. In M.
Zaki and C.-T. Ho (eds), Large-Scale Parallel Data Mining, LNAI State-of-the-Art Survey, Volume 1759,
Springer-Verlag, 2000.
• S. Morishita and A. Nakaya. Parallel branch-and-bound graph search for correlated association rules. In
Zaki and Ho (eds), Large-Scale Parallel Data Mining, LNAI Vol. 1759, Springer-Verlag 2000.
References: Associations (contd.)
• A. Mueller. Fast sequential and parallel algorithms for association rule mining: A comparison. Technical
Report CS-TR-3515, University of Maryland, College Park, August 1995.
• J.S. Park, M. Chen, and P. S. Yu. Efficient parallel data mining for association rules. In ACM Intl. Conf.
Information and Knowledge Management, November 1995.
• T. Shintani and M. Kitsuregawa. Hash based parallel algorithms for mining association rules. In 4th Intl.
Conf. Parallel and Distributed Info. Systems, December 1996.
• T. Shintani and M. Kitsuregawa. Parallel algorithms for mining generalized association rules with
classification hierarchy. In ACM SIGMOD International Conference on Management of Data, May 1998.
• M. Tamura and M. Kitsuregawa. Dynamic load balancing for parallel association rule mining on
heterogeneous PC cluster systems. In 25th Int'l Conf. on Very Large Data Bases, September 1999.
• M.J. Zaki. Parallel and distributed association mining: A survey. IEEE Concurrency, 7(4):14--25, October-
December 1999.
• M.J. Zaki, M. Ogihara, S. Parthasarathy, and W. Li. Parallel data mining for association rules on shared-
memory multi-processors. In Supercomputing'96, November 1996.
• M.J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. New algorithms for fast discovery of association rules.
In 3rd Intl. Conf. on Knowledge Discovery and Data Mining, August 1997.
• M.J. Zaki, S. Parthasarathy, M. Ogihara, and W. Li. Parallel algorithms for fast discovery of association
rules. Data Mining and Knowledge Discovery: An International Journal, 1(4):343--373, December 1997.
• M.J. Zaki, S. Parthasarathy, W. Li, A Localized Algorithm for Parallel Association Mining, 9th Annual ACM
Symposium on Parallel Algorithms and Architectures (SPAA), June, 1997.
References: Sequences
• R. Agrawal and R. Srikant. Mining sequential patterns. In 11th Intl. Conf. on Data Engg., 1995.
• H. Mannila, H. Toivonen and I. Verkamo. Discovery of frequent episodes in event sequences. Data
Mining and Knowledge Discovery: An International Journal, 1(3):259--289, 1997.
• T. Oates, M. D. Schmill, and P. R. Cohen. Parallel and distributed search for structure in multivariate time
series. In 9th European Conference on Machine Learning, 1997.
• T. Oates, M. D. Schmill, D. Jansen, and P. R. Cohen. A family of algorithms for finding temporal structure in
data. In 6th Intl. Workshop on AI and Statistics, March 1997.
• T. Shintani and M. Kitsuregawa. Mining algorithms for sequential patterns in parallel: Hash based approach.
In 2nd Pacific-Asia Conf. on Knowledge Discovery and Data Mining, April 1998.
• R. Srikant and R. Agrawal. Mining sequential patterns: Generalizations and performance improvements. In
5th Intl. Conf. Extending Database Technology, March 1998.
• M.J. Zaki. Efficient enumeration of frequent sequences. In 7th Intl. Conf. on Information and Knowledge
Management, November 1998.
• Mohammed J. Zaki, Parallel Sequence Mining on SMP Machines, ACM SIGKDD Workshop on Large-Scale
Parallel KDD Systems, 1999 (LNAI Vol. 1759).
References: Clustering
• K. Alsabti, S. Ranka, V. Singh. An Efficient K-Means Clustering Algorithm. 1st IPPS Workshop on High
Performance Data Mining, March 1998.
• I. Dhillon and D. Modha. A data clustering algorithm on distributed memory machines. In Zaki and Ho (eds),
Large-Scale Parallel Data Mining, LNAI Vol. 1759, Springer-Verlag 2000.
• L. Iyer and J. Aronson. A parallel branch-and-bound algorithm for cluster analysis. Annals of Operations
Research Vol. 90, pp 65-86, 1999.
• E. Johnson and H. Kargupta. Collective hierarchical clustering from distributed heterogeneous data. In Zaki
and Ho (eds), Large-Scale Parallel Data Mining, LNAI Vol. 1759, Springer-Verlag 2000.
• D. Judd, P. McKinley, and A. Jain. Large-scale parallel data clustering. In Int'l Conf. Pattern Recognition,
August 1996.
• X. Li and Z. Fang. Parallel clustering algorithms. Parallel Computing, 11:270--290, 1989.
• C.F. Olson. Parallel algorithms for hierarchical clustering. Parallel Computing, 21:1313--1325, 1995.
• S. Ranka and S. Sahni. Clustering on a hypercube multicomputer. IEEE Trans. on Parallel and Distributed
Systems, 2(2): 129-- 137, 1991.
• F. Rivera, M. Ismail, and E. Zapata. Parallel squared error clustering on hypercube arrays. Journal of
Parallel and Distributed Computing, 8:292--299, 1990.
• G. Rudolph. Parallel clustering on a unidirectional ring. In R. Grebe et al., editor, Transputer Applications
and Systems'93: Volume 1, pages 487--493. IOS Press, Amsterdam, 1993.
• H. Nagesh, S. Goil and A. Choudhary. MAFIA: Efficient and scalable subspace clustering for very large
data sets. Technical Report 9906-010, Center for Parallel and Distributed Computing, Northwestern
University, June 1999.
• X. Xu, J. Jager and H.-P. Kriegel. A fast parallel clustering algorithm for large spatial databases. Data
Mining and Knowledge Discovery: An International Journal. 3(3):263--290, 1999.
References: Distributed DM
• J. Aronis, V. Kolluri, F. Provost, and B. Buchanan. The WORLD: Knowledge discovery from multiple
distributed databases. In Florida Artificial Intelligence Research Symposium, May 1997.
• R. Bhatnagar and S. Srinivasan. Pattern discovery in distributed databases. In AAAI National Conference
on Artificial Intelligence, July 1997.
• J. Chattratichat, J. Darlington, Y. Guo, S. Hedvall, M. Kohler, and J. Syed. An architecture for distributed
enterprise data mining. In 7th Intl. Conf. High-Performance Computing and Networking, April 1999.
• R.L. Grossman, S. M. Bailey, H. Sivakumar, and A. L. Turinsky. Papyrus: A system for data mining over
local and wide area clusters and super-clusters. In Supercomputing'99, November 1999.
• Y. Guo and J. Sutiwaraphun. Knowledge probing in distributed data mining. In 3rd Pacific-Asia Conference
on Knowledge Discovery and Data Mining, August 1999.
• L. Hall, N. Chawla, K. Bowyer and W. P. Kegelmeyer. Learning rules from distributed data. In Zaki and Ho
(eds), Large-Scale Parallel Data Mining, LNAI Vol. 1759, Springer-Verlag 2000.
• H. Kargupta, I. Hamzaoglu, and B. Stafford. Scalable, distributed data mining using an agent based
architecture. In 3rd Intl. Conf. on Knowledge Discovery and Data Mining, August 1997.
• H. Kargupta, B-H. Park, D. Hershberger, and E. Johnson. Collective data mining: A new perspective toward
distributed data mining. In Kargupta and Chan (eds), Advances in Distributed DM, AAAI Press, 2000.
• S. Parthasarathy and R. Subramonian. Facilitating data mining on a network of workstations. In Kargupta
and Chan (eds), Advances in Distributed DM, AAAI Press, 2000.
• A. Prodromidis, S. Stolfo, and P. Chan. Meta-learning in distributed data mining systems: Issues and
approaches. In Kargupta and Chan (eds), Advances in Distributed DM, AAAI Press, 2000.
• S. Stolfo, A. Prodromidis, S. Tselepis, W. Lee, W. Fan, and P. Chan. JAM: Java agents for meta-learning
over distributed databases. In 3rd Intl. Conf. on Knowledge Discovery and Data Mining, August 1997.