Atc17 Slides Ma
[Figure: system architecture. Labels: Input; Secondary Storages (Edges); Host Memory (Edges, Vertices); Dispatcher; CPU Kernel; GPU Kernel with Streaming Multiprocessor, L1 Cache/Shared Memory, and Global Memory; PCIe; Device Memory (Edges, Vertices); steps Init, Activate, Apply, Sync.]
- Conflicts
- Linear penalty
- Intra-warp >> Inter-warp
- Customized replication
- O(N) -> O(log N), N ≤ 32
- Modeling: balance profits and costs
[Figure: replication-based gather and apply. Labels: Threads 0 to p-1; Edges split at 0, r0, r1, ..., r_{p-1}, n; step 1, Gather, into LocalVertices replicas (Rep) in Shared Memory; Aggregation; step 2, Apply, to GlobalVertices.]
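The replication bullets above can be illustrated with a small sequential simulation. This is a sketch under assumptions, not the system's GPU code: the function names `gather_with_replicas` and `tree_aggregate`, the round-robin choice of replica, and the use of summation as the gather operator are all hypothetical. It shows one reading of "O(N) -> O(log N)": writes that would collide on one accumulator are spread over replicas, and the replicas are then merged pairwise in about log2(R) rounds.

```python
def gather_with_replicas(edges, values, num_vertices, num_replicas):
    """Sum values[src] into dst, spreading writes over several replicas.

    edges: list of (src, dst) pairs. The edge index stands in for a thread
    id (assumption); thread i writes to replica i % num_replicas, so fewer
    threads hit the same accumulator slot.
    """
    replicas = [[0.0] * num_vertices for _ in range(num_replicas)]
    for thread_id, (src, dst) in enumerate(edges):
        replicas[thread_id % num_replicas][dst] += values[src]
    return replicas


def tree_aggregate(replicas):
    """Merge replicas pairwise; returns (totals, number_of_rounds)."""
    rows = [list(r) for r in replicas]  # copy so the input is untouched
    n = len(rows)
    stride, rounds = 1, 0
    while stride < n:
        # In each round, row i absorbs row i + stride.
        for i in range(0, n - stride, 2 * stride):
            rows[i] = [a + b for a, b in zip(rows[i], rows[i + stride])]
        stride *= 2
        rounds += 1
    return rows[0], rounds


# Eight edges that all target vertex 0: the worst case for conflicts.
values = [float(i) for i in range(9)]      # values[1..8] = 1.0 .. 8.0
edges = [(src, 0) for src in range(1, 9)]
replicas = gather_with_replicas(edges, values, num_vertices=9, num_replicas=4)
totals, rounds = tree_aggregate(replicas)
```

With four replicas, each replica receives two of the eight writes instead of one slot receiving all eight, and the merge finishes in two rounds.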
- CPU
  - Pros: sequential processing within each thread
  - Suited to: pull/notify-pull dual-mode processing
- GPU
  - Pros: SIMD parallel processing
  - Suited to: replication-based gather processing (pull only)
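The difference between the two CPU modes named above can be sketched with a toy hop-count computation. The semantics here are an assumption drawn from the mode names: in "pull" every vertex scans all of its in-edges each iteration, while in "notify-pull" only vertices notified by a neighbor that changed in the previous iteration do so. The function `run` and its edge counter are hypothetical and only count pulled in-edges, not the cost of sending notifications.

```python
INF = float("inf")


def run(num_vertices, edges, source, mode):
    """Synchronous hop-count from source; returns (dist, in_edges_scanned)."""
    in_nbrs = [[] for _ in range(num_vertices)]
    out_nbrs = [[] for _ in range(num_vertices)]
    for u, v in edges:
        out_nbrs[u].append(v)
        in_nbrs[v].append(u)

    dist = [INF] * num_vertices
    dist[source] = 0
    active = {source}          # vertices whose value changed last iteration
    scanned = 0
    while active:
        if mode == "pull":
            targets = range(num_vertices)            # everyone pulls
        else:  # "notify-pull": only notified vertices pull
            targets = sorted({w for u in active for w in out_nbrs[u]})
        new_dist = list(dist)
        next_active = set()
        for v in targets:
            scanned += len(in_nbrs[v])
            best = min([dist[v]] + [dist[u] + 1 for u in in_nbrs[v]])
            if best < dist[v]:
                new_dist[v] = best
                next_active.add(v)
        dist, active = new_dist, next_active
    return dist, scanned


chain = [(0, 1), (1, 2), (2, 3)]
pull_dist, pull_scanned = run(4, chain, source=0, mode="pull")
np_dist, np_scanned = run(4, chain, source=0, mode="notify-pull")
```

Both modes reach the same distances. On this chain, where few vertices are active at a time, notify-pull scans far fewer in-edges. When most vertices are active the notification overhead can outweigh the savings, which is one plausible reason to keep both modes available.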
[Figure: Runtime (s) of CuSha, Ligra, Gemini, Garaph-C, Garaph-G, and Garaph-H on uk-2007-05@1M, uk-2014-host, and enwiki-2013.]
[Figure: Runtime (s) of CuSha, Ligra, Gemini, Garaph-C, Garaph-G, and Garaph-H on gsh-2015-tpd, twitter-2010, sk-2005, and renren-2010.]
[Figure: Runtime (s) of CuSha, Ligra, Gemini, Garaph-C, Garaph-G, and Garaph-H on uk-2007-05@1M, uk-2014-host, and enwiki-2013.]
[Figure: Runtime (s); series and datasets not recoverable.]
Plot annotations: 4.05x, 32.15x
- GPU is much slower than CPU without the activation scheme