
Opportunities for XXL Datamining

Marc Snir

www.informatics.uiuc.edu
Outline
 What a supercomputer will look like beyond 2010
 What it takes to run a DM code on such a platform
 How DM can help supercomputing
Large NSF Funded Supercomputers beyond 2010
 One petascale platform -- Blue Waters at NCSA, U Illinois
– Sustained performance: petaflop range
– Memory: petabyte range
– Disk: 10’s of petabytes
– Archival storage: exabyte range
 Multiple 1/4-scale platforms at various universities
 Available to NSF-funded “grand challenge” teams on a competitive basis
 My talk: what it takes to mine data at such scale
 Your job: think big
The Uniprocessor Crisis
 Manufacturers cannot increase clock rates anymore (power problem)
 Computer architects have run out of productive ideas on how to use more transistors to increase single-thread performance
– Diminishing returns on caches
– Diminishing returns on instruction-level parallelism
 Increased processor performance will come only from an increase in the number of cores per chip

Petascale = 250K -- 1M threads

Need algorithms with massive levels of parallelism
Average # Processors, Top 500 Systems

[Chart: average number of processors per Top 500 system, 1993–2007; y-axis 0–3000]
Mileage is Less than Advertised

[Chart: instructions per cycle for the frequent item mining codes LCM, Eclat, and FP-Growth, all well below the nominal IPC of 4]

Instructions per cycle, frequent item mining (M. Wei)
It’s the Memory, Stupid

[Chart: PC balance (word operands from memory per flop) over time; y-axis 0–0.25]

Seems stuck at a ~1:10 ratio

(source: J. McCalpin)
The Memory Wall and Palliatives
 The problem
– Memory bandwidth is limited (cost)
– Compilers cannot issue enough concurrent loads to fill the memory pipe
– Compilers cannot issue loads early enough to avoid stalls
 Solutions
– Multicore and vector operations -- to fill the pipe
– Simultaneous multithreading -- to tolerate latency
– Need even higher levels of parallelism!
Solutions to the Memory Wall
 Caching and locality
– Need algorithms with good locality
 Split communication
– Memory prefetch (local memory)
– Put/get (remote memory)
– Need programmed communication (not necessarily message-passing)
 N.B.: Compute power is essentially free; you pay for storing and moving data
– Accelerators (GPUs, FPGAs, Cell processors) enhance a non-critical resource, and will often have a negligible impact on overall performance
Load Balancing
 Problem: the amount of computation in DM kernels is heavily data dependent -- work partitioning results in load imbalances
 Hard solution: develop good work predictors and do explicit, static load balancing
 Easy solution: use a system with task virtualization and dynamic task migration (see the sketch after this list)
– E.g., AMPI (Kale, http://charm.cs.uiuc.edu/) -- scalable, negligible (often negative) overheads
– Overhead of task migration is a few seconds at worst (the parallel file system is shared by all)
– Task virtualization essential for modularity and ease of programming
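This is not AMPI/Charm++, which migrates tasks transparently at the runtime level; as a hedged illustration, the Python sketch below shows why over-decomposition plus dynamic assignment beats a static even split when per-task work is data dependent. `mine_partition` is a hypothetical stand-in for a DM kernel.

```python
from multiprocessing import Pool
import random

def mine_partition(seed):
    """Hypothetical stand-in for a DM kernel whose cost varies with the data."""
    random.seed(seed)
    n = random.randint(10**4, 10**6)   # highly uneven work per task
    return sum(i % 7 for i in range(n))

if __name__ == "__main__":
    tasks = range(256)                 # many more tasks than workers
    with Pool(processes=8) as pool:
        # chunksize=1 ~ dynamic assignment: an idle worker grabs the next
        # task, so data-heavy tasks do not stall a statically assigned peer
        results = pool.map(mine_partition, tasks, chunksize=1)
    print(len(results), "tasks done")
```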
Code Tuning
 Is essential when using a Petascale system
– 1 hour = $5K - $10K
 Is data dependent (more so with Datamining than with
many applications)
 Is platform dependent

Relative Performance of Frequent Item Mining Codes is Input Dependent

[Chart: normalized execution time of three algorithms (LCM, FP-Growth, Eclat) on six real-world datasets, including chess (s=0.18), pumsb (s=0.45), bms-wv-1 (s=0.0005), accidents (s=0.16), and webdocs (s=0.1); the relative ranking changes across inputs]

FIMI workshop needs some thinking…

(C. Jiang, PhD thesis)
A Schematic View of Performance Tuning

1. Algorithm selection (LCM, FP-Growth, Eclat, …)
2. Implementation selection (choice of data structures, …)
3. Automatic tuning (compiler, runtime)

All three should be platform and data dependent
Algorithm Selection is a Classification Problem
• Can be solved using supervised ML
• Smart part: choice of a feature vector that can be computed fast and “works”

[Diagram -- training stage: a dataset generator produces generated datasets; empirical evaluation against the algorithm pool labels them; SVM learning over platform-specific features yields an SVM model. Execution stage: the features of the input set are fed to the model, which predicts the algorithm to run.]
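A minimal sketch of the two stages using scikit-learn, assuming the four features described on the next slide (size, density, height, similarity); all training values and labels below are hypothetical placeholders for the empirical timings of LCM, FP-Growth, and Eclat on generated datasets.

```python
import numpy as np
from sklearn.svm import SVC

# Training stage: one feature row per generated dataset, labeled with the
# algorithm that empirical evaluation found fastest on this platform.
# (Hypothetical values -- real labels come from timing runs.)
X_train = np.array([[1.8e6, 0.60, 0.67, 0.40],
                    [9.0e5, 0.05, 0.90, 0.10],
                    [2.5e6, 0.30, 0.33, 0.70]])
y_train = np.array(["LCM", "Eclat", "FP-Growth"])
model = SVC(kernel="rbf").fit(X_train, y_train)

# Execution stage: compute the (cheap) features of the input set and run
# whichever algorithm the model predicts.
x_input = np.array([[1.2e6, 0.45, 0.50, 0.50]])
print(model.predict(x_input))
```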
Results: Average execution time

[Chart: average execution time in seconds for Optimal, Predicted, LCM, FP-Growth, and Eclat; recoverable bar values: Optimal 9.34, Predicted 10.51, LCM 17.4, and 20.63 (FP-Growth or Eclat); y-axis 0–30]

• The predicted algorithm is close to optimal (12.5% worse)
• The predicted algorithm is significantly better than LCM (65.3%)

(C. Jiang, PhD thesis)
Selected features
 Size
– The number of ‘1’s in the bit matrix
 Density
– Number of ‘1’s divided by number of cells
 Height
– 1 – support threshold / density
– An estimate of how much room the support has to decrease to the threshold
 Similarity
– How similar transactions are to each other (defined in the backup slides)

Example bit matrix:
1 1 1 0 0 1
1 1 1 0 0 1
1 1 0 1 0 1
1 1 0 1 0 0
0 1 0 0 1 1

Size = 18
Density = 18/30 = 0.6
Height = 1 – s/density = 1 – 0.2/0.6 = 2/3
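A sketch of the first three features over a 0/1 transaction matrix (rows = transactions, columns = items); the similarity feature needs the average-linkage definition given in the backup slides.

```python
import numpy as np

def dataset_features(matrix, support_threshold):
    m = np.asarray(matrix)
    size = int(m.sum())                        # number of 1s in the bit matrix
    density = size / m.size                    # 1s divided by cells
    height = 1 - support_threshold / density   # room for support to decrease
    return size, density, height

example = [[1, 1, 1, 0, 0, 1],
           [1, 1, 1, 0, 0, 1],
           [1, 1, 0, 1, 0, 1],
           [1, 1, 0, 1, 0, 0],
           [0, 1, 0, 0, 1, 1]]
print(dataset_features(example, 0.2))   # (18, 0.6, 0.666...)
```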
Implementation Selection
 We represent implementation choices via tuning patterns -- descriptions of solutions to common software performance optimization problems that are applicable to multiple algorithms
– Lexicographic ordering
– Aggregation
– Compaction
– Wave-front prefetch
– Tiling for sparse arrays
– SIMDization
 Probably need a richer ontology (relations, constraints, expert knowledge)
 Classification problem: select the best set of tuning patterns (see the sketch below)
– Used SVM; a GA is probably more appropriate
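One way to cast the set-selection problem, sketched under a simplifying assumption that is not the thesis's exact formulation: train one binary SVM per tuning pattern and take the union of positive predictions. Feature values and labels below are hypothetical.

```python
import numpy as np
from sklearn.svm import SVC

PATTERNS = ["lex", "aggregation", "compaction", "prefetch", "tile", "simd"]

# Hypothetical training data: dataset/platform features -> which patterns
# were empirically beneficial (1) or not (0).
X = np.array([[0.60, 0.67], [0.05, 0.90], [0.30, 0.20], [0.45, 0.50]])
Y = np.array([[1, 0, 0, 1, 1, 0],
              [0, 1, 1, 0, 0, 1],
              [1, 1, 0, 0, 1, 0],
              [1, 0, 0, 1, 0, 1]])

# One binary classifier per tuning pattern.
classifiers = [SVC().fit(X, Y[:, j]) for j in range(len(PATTERNS))]

x_new = np.array([[0.50, 0.60]])
chosen = [p for p, c in zip(PATTERNS, classifiers)
          if c.predict(x_new)[0] == 1]
print(chosen)   # predicted set of tuning patterns to apply
```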
Speedups of LCM

[Charts: speedups of tuned LCM over the baseline on Intel and AMD for datasets DS1–DS4 (baseline seconds in parentheses: 77/169/90/36 on one machine, 73/159/93/35 on the other); bars for lex, pref, reorg, tile, all four combined, and the best subset; the winning subset varies (e.g., “all”, lex+tile, pref+reorg, lex+pref+reorg), with speedups ranging from about 0.8 to 2.2]

 Good speedup (up to 2.1)
 ALL does not always win!
 Optimal set of tuning patterns is machine and data dependent
Prediction results – LCM

[Charts: number of times each code version (optimal, predicted, original, tile, all) is the fastest, and average execution time in seconds (y-axis 40–70, with off-scale bars labeled 119.6 and 87.7)]

 Prediction close to “optimal” (oracle)
 Prediction overhead is negligible

(M. Wei, PhD thesis, ICDE ’07)
Summary
 Main obstacle to petascale datamining is dreaming of grand challenges that need it
 Petascale datamining requires tuned code
– Node performance (locality) + scalability
 Should develop tunable code generators to adapt to platform and data
– Need good training sets!
 Code tuning is a very interesting classification problem
Questions?

Similarity definition
 “Similarity”: how similar transactions are to each other
 “Normalized Hamming distance” (pair-wise similarity):
– Given two transactions, their normalized Hamming distance is the number of differences divided by the total number of unique ones.
– Example:

T1 = 1 1 0 1 0 1
T2 = 0 1 0 0 1 1

Difference = 3, and the number of unique ones is 5; therefore distance(T1, T2) = 3/5 = 0.6
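A direct sketch of the definition over 0/1 vectors:

```python
def normalized_hamming(t1, t2):
    """Differences divided by the number of positions where either has a 1."""
    diff = sum(a != b for a, b in zip(t1, t2))
    union_ones = sum(1 for a, b in zip(t1, t2) if a or b)
    return diff / union_ones

# T1 = 110101, T2 = 010011: 3 differences, 5 unique ones -> 0.6
print(normalized_hamming([1, 1, 0, 1, 0, 1], [0, 1, 0, 0, 1, 1]))
```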
Similarity feature definition
 Normalized Hamming distance defines a pair-wise distance, but we need a global measure of similarity among all transactions.
 Approach -- “average linkage clustering”:
– Start with n transactions, each as a cluster
– Merge the two closest into one new cluster
– Repeat merging until one cluster is left
 “Similarity” = average value of the n–1 clustering distances
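A sketch using SciPy's average-linkage clustering over the pairwise normalized Hamming distances (reusing `normalized_hamming` from the previous sketch); `linkage` returns the n–1 merge distances in its third column.

```python
import numpy as np
from scipy.cluster.hierarchy import linkage
from scipy.spatial.distance import squareform

def similarity_feature(transactions):
    n = len(transactions)
    d = np.zeros((n, n))
    for i in range(n):
        for j in range(i + 1, n):
            d[i, j] = d[j, i] = normalized_hamming(transactions[i],
                                                   transactions[j])
    merges = linkage(squareform(d), method="average")
    return merges[:, 2].mean()   # average of the n-1 clustering distances

ts = [[1, 1, 0, 1, 0, 1], [0, 1, 0, 0, 1, 1], [1, 1, 1, 0, 0, 1]]
print(similarity_feature(ts))
```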
Prediction results on real-world datasets

[Charts: per-dataset prediction results for accidents, chess, webdocs, pumsb, pumsb_star, and connect]
Prediction results on real-world datasets (cont.)

[Charts: per-dataset prediction results for mushroom, retail, kosarak, BMS-POS, BMS-WebView1, and BMS-WebView2]
Using synthetic data for training
 IBM Quest dataset generator
– Widely used in data mining research
 Problem: the generated dataset is not representative of real-world data

[Pie charts: which of LCM, FP-Growth, and Eclat is best -- on synthetic datasets one algorithm wins 99.53% of the time (the others 0.38% and 0.08%); on real-world datasets the wins split 50% / 38.89% / 11.11%]
Item frequency curve

[Chart: item frequency curve]
Using modified IBM generator to produce algorithm variability

[Pie charts: best-algorithm shares -- modified synthetic datasets: 56.58% / 23.39% / 20.03%; real-world datasets: 50% / 38.89% / 11.11%]
Lexicographic ordering of transactions
 Preprocess the original database by reordering transactions in lexicographic order (a minimal sketch follows)
– Alphabet: items in descending frequency order
 Improves locality of accesses (LCM & FP-Growth); reduces computation (Eclat)
 Overhead of lexicographic ordering is amortized over multiple traversals
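A minimal sketch of this preprocessing step: relabel items by descending frequency, then sort the transactions lexicographically under that alphabet.

```python
from collections import Counter

def lex_order(transactions):
    freq = Counter(item for t in transactions for item in t)
    # rank 0 = most frequent item (ties broken arbitrarily)
    rank = {item: r for r, (item, _) in enumerate(freq.most_common())}
    keyed = [sorted(rank[item] for item in t) for t in transactions]
    order = sorted(range(len(transactions)), key=lambda k: keyed[k])
    return [transactions[k] for k in order]

ts = [["c", "f", "a"], ["d", "e"], ["c", "f", "b"],
      ["c", "f", "a", "b", "d", "e"]]
print(lex_order(ts))   # transactions sharing frequent prefixes end up adjacent
```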
Lexicographic ordering in LCM
 Spatial locality of traversal is improved (fewer jumps)
– Locality improved for most frequent items
– Order mostly preserved for projected databases
– Ordering overhead amortized over multiple traversals

[Diagram: per-item occurrence lists (items c f a b d e) over five transactions, before and after reordering; after reordering, the occurrence lists of the most frequent items reference consecutive transaction ids]
Lexicographic ordering in Eclat
 Range reduction reduces computation

[Diagram: bit-matrix columns for items c f a b d e over five transactions; after lexicographic ordering the rows are switched so the 1s in each column become contiguous, and marking the first and last 1 bounds the work for intersections such as {a} ∩ {c} → transactions for {a,c}]
Lexicographic ordering – project() in FP-Growth
 Tree is constructed by inserting transactions from the original database one by one
 Lexicographic ordering improves the temporal locality of insertions

[Diagram: FP-tree with a pseudo root node and path c–f–a–b–d–e; a consecutive insertion reuses the prefix nodes already in the cache]
Lexicographic ordering – project() in FP-Growth (cont.)
 Access pattern: from an intermediate node to the root
 More (parent, child) pairs are contiguous in memory – better spatial locality

[Diagram: FP-trees built without and with lexicographic ordering, assuming nodes allocated consecutively are contiguous in memory; with ordering, more (parent, child) pairs end up adjacent]
Wave-front prefetch
 Prefetch pointers from different linked lists in one iteration
 Hides memory latency
 Increases register pressure
 Can be used even if lists are of different length

[Diagram: an array of short linked lists, with two iterations between prefetch and use]
Tiling (LCM)
 Improves temporal locality
 Slightly increases instruction count and memory pressure
Programming patterns applied