
Opportunities for XXL Datamining

Marc Snir

www.informatics.uiuc.edu
Outline
 What a supercomputer will look like beyond 2010
 What it takes to run a DM code on such a platform
 How DM can help supercomputing
Large NSF Funded Supercomputers beyond 2010
 One petascale platform -- Blue Waters at NCSA, U Illinois
– Sustained performance: petaflop range
– Memory: petabyte range
– Disk: 10’s of petabytes
– Archival storage: exabyte range
 Multiple 1/4-scale platforms at various universities
 Available to NSF-funded “grand challenge” teams on a competitive basis
 My talk: what it takes to mine data at such scale
 Your job: think big
The Uniprocessor Crisis
 Manufacturers cannot increase clock rates anymore (power problem)
 Computer architects have run out of productive ideas on how to use more transistors to increase single-thread performance
– Diminishing returns on caches
– Diminishing returns on instruction-level parallelism
 Increased processor performance will come only from an increase in the number of cores per chip

Petascale = 250K -- 1M threads

Need algorithms with massive levels of parallelism
Average # Processors, Top 500 Systems

[Chart: average number of processors per Top 500 system, 1993–2007; y-axis 0–3000]
Mileage is Less than Advertised

[Chart: instructions per cycle for the frequent item mining codes LCM, Eclat, and FP-Growth, all well below the nominal IPC of 4]

Instructions per cycle, frequent item mining (M. Wei)
It’s the Memory, Stupid

[Chart: PC balance (word operands from memory per flop) over time; y-axis 0–0.25]

Seems stuck at a ~1:10 ratio

(source: J. McCalpin)
The Memory Wall and Palliatives
 The problem
– Memory bandwidth is limited (cost)
– Compilers cannot issue enough concurrent loads to fill the memory pipe
– Compilers cannot issue loads early enough to avoid stalls
 Solutions
– Multicore and vector operations -- to fill the pipe
– Simultaneous multithreading -- to tolerate latency
– Need even higher levels of parallelism!
Solutions to the Memory Wall
 Caching and locality
– Need algorithms with good locality
 Split communication
– Memory prefetch (local memory)
– Put/get (remote memory)
– Need programmed communication (not necessarily message-passing)
 N.B.: Compute power is essentially free; you pay for storing and moving data
– Accelerators (GPUs, FPGAs, Cell processors) enhance a non-critical resource, and will often have a negligible impact on overall performance
Load Balancing
 Problem: the amount of computation in DM kernels is heavily data dependent -- work partitioning results in load imbalances
 Hard solution: develop good work predictors and do explicit, static load balancing
 Easy solution: use a system with task virtualization and dynamic task migration (see the sketch after this list)
– E.g., AMPI (Kale, http://charm.cs.uiuc.edu/) -- scalable, negligible (often negative) overheads
– Overhead of task migration is a few seconds at worst (the parallel file system is shared by all)
– Task virtualization essential for modularity and ease of programming
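This is not AMPI/Charm++, which migrates tasks transparently at the runtime level; as a hedged illustration, the Python sketch below shows why over-decomposition plus dynamic assignment beats a static even split when per-task work is data dependent. `mine_partition` is a hypothetical stand-in for a DM kernel.

```python
from multiprocessing import Pool
import random

def mine_partition(seed):
    """Hypothetical stand-in for a DM kernel whose cost varies with the data."""
    random.seed(seed)
    n = random.randint(10**4, 10**6)   # highly uneven work per task
    return sum(i % 7 for i in range(n))

if __name__ == "__main__":
    tasks = range(256)                 # many more tasks than workers
    with Pool(processes=8) as pool:
        # chunksize=1 ~ dynamic assignment: an idle worker grabs the next
        # task, so data-heavy tasks do not stall a statically assigned peer
        results = pool.map(mine_partition, tasks, chunksize=1)
    print(len(results), "tasks done")
```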
Code Tuning
 Is essential when using a Petascale system
– 1 hour = $5K - $10K
 Is data dependent (more so with Datamining than with
many applications)
 Is platform dependent

Relative Performance of Frequent Item Mining Codes is Input Dependent

[Chart: normalized execution time of three algorithms (LCM, FP-Growth, Eclat) on six real-world datasets, including chess (s=0.18), pumsb (s=0.45), bms-wv-1 (s=0.0005), accidents (s=0.16), and webdocs (s=0.1); the relative ranking changes across inputs]

FIMI workshop needs some thinking…

(C. Jiang, PhD thesis)
A Schematic View of Performance Tuning

1. Algorithm selection (LCM, FP-Growth, Eclat, …)
2. Implementation selection (choice of data structures, …)
3. Automatic tuning (compiler, runtime)

All three should be platform and data dependent
Algorithm Selection is a Classification Problem
• Can be solved using supervised ML
• Smart part: choice of a feature vector that can be computed fast and “works”

[Diagram -- training stage: a dataset generator produces generated datasets; empirical evaluation against the algorithm pool labels them; SVM learning over platform-specific features yields an SVM model. Execution stage: the features of the input set are fed to the model, which predicts the algorithm to run.]
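A minimal sketch of the two stages using scikit-learn, assuming the four features described on the next slide (size, density, height, similarity); all training values and labels below are hypothetical placeholders for the empirical timings of LCM, FP-Growth, and Eclat on generated datasets.

```python
import numpy as np
from sklearn.svm import SVC

# Training stage: one feature row per generated dataset, labeled with the
# algorithm that empirical evaluation found fastest on this platform.
# (Hypothetical values -- real labels come from timing runs.)
X_train = np.array([[1.8e6, 0.60, 0.67, 0.40],
                    [9.0e5, 0.05, 0.90, 0.10],
                    [2.5e6, 0.30, 0.33, 0.70]])
y_train = np.array(["LCM", "Eclat", "FP-Growth"])
model = SVC(kernel="rbf").fit(X_train, y_train)

# Execution stage: compute the (cheap) features of the input set and run
# whichever algorithm the model predicts.
x_input = np.array([[1.2e6, 0.45, 0.50, 0.50]])
print(model.predict(x_input))
```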
Results: Average execution time

[Chart: average execution time in seconds for Optimal, Predicted, LCM, FP-Growth, and Eclat; recoverable bar values: Optimal 9.34, Predicted 10.51, LCM 17.4, and 20.63 (FP-Growth or Eclat); y-axis 0–30]

• The predicted algorithm is close to optimal (12.5% worse)
• The predicted algorithm is significantly better than LCM (65.3%)

(C. Jiang, PhD thesis)
Selected features
 Size
– The number of ‘1’s in the bit matrix
 Density
– Number of ‘1’s divided by number of cells
 Height
– 1 – support threshold / density
– An estimate of how much room the support has to decrease to the threshold
 Similarity
– How similar transactions are to each other (defined in the backup slides)

Example bit matrix:
1 1 1 0 0 1
1 1 1 0 0 1
1 1 0 1 0 1
1 1 0 1 0 0
0 1 0 0 1 1

Size = 18
Density = 18/30 = 0.6
Height = 1 – s/density = 1 – 0.2/0.6 = 2/3
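A sketch of the first three features over a 0/1 transaction matrix (rows = transactions, columns = items); the similarity feature needs the average-linkage definition given in the backup slides.

```python
import numpy as np

def dataset_features(matrix, support_threshold):
    m = np.asarray(matrix)
    size = int(m.sum())                        # number of 1s in the bit matrix
    density = size / m.size                    # 1s divided by cells
    height = 1 - support_threshold / density   # room for support to decrease
    return size, density, height

example = [[1, 1, 1, 0, 0, 1],
           [1, 1, 1, 0, 0, 1],
           [1, 1, 0, 1, 0, 1],
           [1, 1, 0, 1, 0, 0],
           [0, 1, 0, 0, 1, 1]]
print(dataset_features(example, 0.2))   # (18, 0.6, 0.666...)
```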
Implementation Selection
 We represent implementation choices via tuning patterns -- descriptions of solutions to common software performance optimization problems that are applicable to multiple algorithms
– Lexicographic ordering
– Aggregation
– Compaction
– Wave-front prefetch
– Tiling for sparse arrays
– SIMDization
 Probably need a richer ontology (relations, constraints, expert knowledge)
 Classification problem: select the best set of tuning patterns (see the sketch below)
– Used SVM; a GA is probably more appropriate
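One way to cast the set-selection problem, sketched under a simplifying assumption that is not the thesis's exact formulation: train one binary SVM per tuning pattern and take the union of positive predictions. Feature values and labels below are hypothetical.

```python
import numpy as np
from sklearn.svm import SVC

PATTERNS = ["lex", "aggregation", "compaction", "prefetch", "tile", "simd"]

# Hypothetical training data: dataset/platform features -> which patterns
# were empirically beneficial (1) or not (0).
X = np.array([[0.60, 0.67], [0.05, 0.90], [0.30, 0.20], [0.45, 0.50]])
Y = np.array([[1, 0, 0, 1, 1, 0],
              [0, 1, 1, 0, 0, 1],
              [1, 1, 0, 0, 1, 0],
              [1, 0, 0, 1, 0, 1]])

# One binary classifier per tuning pattern.
classifiers = [SVC().fit(X, Y[:, j]) for j in range(len(PATTERNS))]

x_new = np.array([[0.50, 0.60]])
chosen = [p for p, c in zip(PATTERNS, classifiers)
          if c.predict(x_new)[0] == 1]
print(chosen)   # predicted set of tuning patterns to apply
```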
Speedups of LCM

[Charts: speedups of tuned LCM over the baseline on Intel and AMD for datasets DS1–DS4 (baseline seconds in parentheses: 77/169/90/36 on one machine, 73/159/93/35 on the other); bars for lex, pref, reorg, tile, all four combined, and the best subset; the winning subset varies (e.g., “all”, lex+tile, pref+reorg, lex+pref+reorg), with speedups ranging from about 0.8 to 2.2]

 Good speedup (up to 2.1)
 ALL does not always win!
 Optimal set of tuning patterns is machine and data dependent
Prediction results – LCM

[Charts: number of times each code version (optimal, predicted, original, tile, all) is the fastest, and average execution time in seconds (y-axis 40–70, with off-scale bars labeled 119.6 and 87.7)]

 Prediction close to “optimal” (oracle)
 Prediction overhead is negligible

(M. Wei, PhD thesis, ICDE ’07)
Summary
 Main obstacle to petascale datamining is dreaming of grand challenges that need it
 Petascale datamining requires tuned code
– Node performance (locality) + scalability
 Should develop tunable code generators to adapt to platform and data
– Need good training sets!
 Code tuning is a very interesting classification problem
Questions?

Similarity definition
 “Similarity”: how similar transactions are to each other
 “Normalized Hamming distance” (pair-wise similarity):
– Given two transactions, their normalized Hamming distance is the number of differences divided by the total number of unique ones.
– Example:

T1 = 1 1 0 1 0 1
T2 = 0 1 0 0 1 1

Difference = 3, and the number of unique ones is 5; therefore distance(T1, T2) = 3/5 = 0.6
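A direct sketch of the definition over 0/1 vectors:

```python
def normalized_hamming(t1, t2):
    """Differences divided by the number of positions where either has a 1."""
    diff = sum(a != b for a, b in zip(t1, t2))
    union_ones = sum(1 for a, b in zip(t1, t2) if a or b)
    return diff / union_ones

# T1 = 110101, T2 = 010011: 3 differences, 5 unique ones -> 0.6
print(normalized_hamming([1, 1, 0, 1, 0, 1], [0, 1, 0, 0, 1, 1]))
```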
Similarity feature definition
 Normalized Hamming distance defines a pair-wise distance, but we need a global measure of similarity among all transactions.
 Approach -- “average linkage clustering”:
– Start with n transactions, each as a cluster
– Merge the two closest into one new cluster
– Repeat merging until one cluster is left
 “Similarity” = average value of the n–1 clustering distances
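A sketch using SciPy's average-linkage clustering over the pairwise normalized Hamming distances (reusing `normalized_hamming` from the previous sketch); `linkage` returns the n–1 merge distances in its third column.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

def similarity_feature(transactions):
    n = len(transactions)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d[i, j] = d[j, i] = normalized_hamming(transactions[i],
                                                   transactions[j])
    merges = linkage(squareform(d), method="average")
    return merges[:, 2].mean()   # average of the n-1 clustering distances

ts = [[1, 1, 0, 1, 0, 1], [0, 1, 0, 0, 1, 1], [1, 1, 1, 0, 0, 1]]
print(similarity_feature(ts))
```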
Prediction results on real-world datasets

[Charts: per-dataset prediction results for accidents, chess, webdocs, pumsb, pumsb_star, and connect]
Prediction results on real-world datasets (cont.)

[Charts: per-dataset prediction results for mushroom, retail, kosarak, BMS-POS, BMS-WebView1, and BMS-WebView2]
Using synthetic data for training
 IBM Quest dataset generator
– Widely used in data mining research
 Problem: the generated dataset is not representative of real-world data

[Pie charts: which of LCM, FP-Growth, and Eclat is best -- on synthetic datasets one algorithm wins 99.53% of the time (the others 0.38% and 0.08%); on real-world datasets the wins split 50% / 38.89% / 11.11%]
Item frequency curve

[Chart: item frequency curve]
Using modified IBM generator to produce algorithm variability

[Pie charts: best-algorithm shares -- modified synthetic datasets: 56.58% / 23.39% / 20.03%; real-world datasets: 50% / 38.89% / 11.11%]
Lexicographic ordering of transactions
 Preprocess the original database by reordering transactions in lexicographic order (a minimal sketch follows)
– Alphabet: items in descending frequency order
 Improves locality of accesses (LCM & FP-Growth); reduces computation (Eclat)
 Overhead of lexicographic ordering is amortized over multiple traversals
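A minimal sketch of this preprocessing step: relabel items by descending frequency, then sort the transactions lexicographically under that alphabet.

```python
from collections import Counter

def lex_order(transactions):
    freq = Counter(item for t in transactions for item in t)
    # rank 0 = most frequent item (ties broken arbitrarily)
    rank = {item: r for r, (item, _) in enumerate(freq.most_common())}
    keyed = [sorted(rank[item] for item in t) for t in transactions]
    order = sorted(range(len(transactions)), key=lambda k: keyed[k])
    return [transactions[k] for k in order]

ts = [["c", "f", "a"], ["d", "e"], ["c", "f", "b"],
      ["c", "f", "a", "b", "d", "e"]]
print(lex_order(ts))   # transactions sharing frequent prefixes end up adjacent
```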
Lexicographic ordering in LCM
 Spatial locality of traversal is improved (fewer jumps)
– Locality improved for most frequent items
– Order mostly preserved for projected databases
– Ordering overhead amortized over multiple traversals

[Diagram: per-item occurrence lists (items c f a b d e) over five transactions, before and after reordering; after reordering, the occurrence lists of the most frequent items reference consecutive transaction ids]
Lexicographic ordering in Eclat
 Range reduction reduces computation

[Diagram: bit-matrix columns for items c f a b d e over five transactions; after lexicographic ordering the rows are switched so the 1s in each column become contiguous, and marking the first and last 1 bounds the work for intersections such as {a} ∩ {c} → transactions for {a,c}]
Lexicographic ordering – project() in FP-Growth
 Tree is constructed by inserting transactions from the original database one by one
 Lexicographic ordering improves the temporal locality of insertions

[Diagram: FP-tree with a pseudo root node and path c–f–a–b–d–e; a consecutive insertion reuses the prefix nodes already in the cache]
Lexicographic ordering – project() in FP-Growth (cont.)
 Access pattern: from an intermediate node to the root
 More (parent, child) pairs are contiguous in memory – better spatial locality

[Diagram: FP-trees built without and with lexicographic ordering, assuming nodes allocated consecutively are contiguous in memory; with ordering, more (parent, child) pairs end up adjacent]
Wave-front prefetch
 Prefetch pointers from different linked lists in one iteration
 Hides memory latency
 Increases register pressure
 Can be used even if lists are of different length

[Diagram: an array of short linked lists, with two iterations between prefetch and use]
Tiling (LCM)
 Improves temporal locality
 Slightly increases instruction count and memory pressure
Programming patterns applied