Atc17 Slides Ma

USENIX ATC’17, Santa Clara, CA, USA
Garaph: Efficient GPU-accelerated Graph Processing on a Single Machine with Balanced Replication
Lingxiao Ma †, Zhi Yang †, Han Chen †, Jilong Xue ‡, Yafei Dai *
† Computer Science Department, Peking University

‡ Microsoft Research
* SPCCTA, Peking University
Large-Scale Graph Processing
10^10 pages, 10^12 tokens; 10^9 nodes, 10^12 edges
Peking University, Microsoft Research 2

Powerful Storage & Computation Technologies
* Figure from Internet

Our Goal
- Large Memory + Fast Secondary Storages
- CPU+GPUs
[Figure: host (CPU) main memory connected over the PCIe bus to device (GPU) global memory; input is loaded from secondary storage]

Architecture
[Figure: edges are loaded from secondary storage into memory; a dispatcher passes edges and vertices to the CPU kernel and the GPU kernel]

Graph Representation for Hybrid CPU and GPU
- CSC & CSR representation
- Shard: vertex interval
- Page: batched shards

CSC (incoming edges)
  Idx     0 2 4 5 6 9 9
  Nbr     3 4 0 2 5 4 1 2 5
  Edge    1 2 1 3 5 4 2 5 1
  IdxOff  0 0 1 1 2 3

CSR (outgoing edges)
  Idx     0 1 2 4 5 7 9
  Nbr     1 4 1 4 0 0 3 2 4

[Figure: example graph with 6 vertices and 9 weighted edges, with the CSC arrays split into Shard 0 and Shard 1]
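The Idx and Nbr arrays on this slide can be reproduced from a plain edge list. The sketch below is illustrative only: `build_compressed` is a hypothetical helper, not part of Garaph, and the edge list is read off the CSC arrays on the slide (vertex v's in-neighbors are Nbr[Idx[v] : Idx[v+1]]).

```python
def build_compressed(num_vertices, edges, by_destination):
    """Return (idx, nbr): CSC if by_destination is True, CSR otherwise."""
    # Key each edge by the vertex that owns its slot in the compressed layout.
    keyed = sorted((dst, src) if by_destination else (src, dst)
                   for src, dst in edges)
    # Count edges per owner, then prefix-sum the counts into offsets.
    idx = [0] * (num_vertices + 1)
    for owner, _ in keyed:
        idx[owner + 1] += 1
    for v in range(num_vertices):
        idx[v + 1] += idx[v]
    nbr = [other for _, other in keyed]
    return idx, nbr


# (src, dst) pairs recovered from the CSC arrays on the slide.
edges = [(3, 0), (4, 0), (0, 1), (2, 1), (5, 2),
         (4, 3), (1, 4), (2, 4), (5, 4)]
csc_idx, csc_nbr = build_compressed(6, edges, by_destination=True)
csr_idx, csr_nbr = build_compressed(6, edges, by_destination=False)
```

Both layouts describe the same 9 edges; CSC groups them by destination (useful for pulling along incoming edges), CSR by source (useful for following outgoing edges).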
Programming APIs
- GAS Decomposition
- One program for both CPU and GPU
Activate
*GAS figure from PowerGraph slides
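As a rough illustration of a gather/apply/activate style vertex program, here is a minimal PageRank sketch. The class, the method names (`gather`, `combine`, `apply`, `activate`) and the sequential `run` loop are assumptions made for this example; they are not Garaph's actual API, and the loop stands in for both the CPU and GPU kernels.

```python
class PageRankProgram:
    """Hypothetical vertex program; names are illustrative only."""
    damping = 0.85

    def gather(self, src_rank, src_out_degree):
        # Contribution pulled along one incoming edge.
        return src_rank / src_out_degree

    def combine(self, a, b):
        # Commutative, associative merge of gathered values.
        return a + b

    def apply(self, acc, num_vertices):
        return (1.0 - self.damping) / num_vertices + self.damping * acc

    def activate(self, old, new, tol=1e-9):
        # A vertex stays active while its value is still changing.
        return abs(new - old) > tol


def run(program, csc_idx, csc_nbr, out_degree, iterations):
    """Pull-style engine: every vertex gathers over its incoming edges."""
    n = len(csc_idx) - 1
    ranks = [1.0 / n] * n
    active = [True] * n
    for _ in range(iterations):
        new_ranks = []
        for v in range(n):
            acc = 0.0
            for e in range(csc_idx[v], csc_idx[v + 1]):
                u = csc_nbr[e]
                acc = program.combine(
                    acc, program.gather(ranks[u], out_degree[u]))
            new_ranks.append(program.apply(acc, n))
        active = [program.activate(o, nw) for o, nw in zip(ranks, new_ranks)]
        ranks = new_ranks
    return ranks, active


# Small 6-vertex example graph in CSC form, with per-vertex out-degrees.
csc_idx = [0, 2, 4, 5, 6, 9, 9]
csc_nbr = [3, 4, 0, 2, 5, 4, 1, 2, 5]
out_degree = [1, 1, 2, 1, 2, 2]
ranks, active = run(PageRankProgram(), csc_idx, csc_nbr, out_degree, 10)
```

Because the user code only touches one edge or one vertex at a time, the same program could in principle be driven by different engines, which is the point of the "one program for both CPU and GPU" bullet.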

GPU Computation Kernel
[Figure: GPU computation kernel. Labels: host memory, PCIe, device memory, streaming multiprocessor, L1 cache/shared memory, edges, vertices; steps: init, apply, sync]

Gather in GPU Computation Kernel
[Figure: gather between global memory and shared memory]

Problems in Gather
- Conflicts
- Linear penalty
- Intra-warp >> Inter-warp

[Figure: conflicting updates from global memory into shared memory]

*Gomez-Luna, Juan, et al. "Performance modeling of atomic additions on GPU scratchpad memory." TPDS 24.11 (2013): 2273-2282.

Replication-Based Gather
- Customized replication
- O(N) -> O(log N), N ≤ 32
- Modeling: balance profits and costs

[Figure: mapping from global memory to replicas in shared memory, followed by aggregation]
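One way to read the O(N) -> O(log N) bullet is as a step count: N conflicting updates to a single slot serialize, while N replicas can each be written without conflict and then merged pairwise. The toy model below counts steps under that reading. It is an assumption made for illustration, not a GPU timing model and not Garaph's kernel; both function names are hypothetical, and `values` must be non-empty.

```python
def gather_without_replication(values):
    """All threads add into one slot; conflicting updates serialize."""
    total, steps = 0, 0
    for v in values:
        total += v
        steps += 1          # one serialized update per conflicting thread
    return total, steps


def gather_with_replication(values):
    """Each thread writes its own replica, then replicas merge pairwise."""
    replicas = list(values)  # conflict-free: one replica per thread
    steps, stride = 0, 1
    while stride < len(replicas):
        # All merges within one stride are independent of each other.
        for i in range(0, len(replicas) - stride, 2 * stride):
            replicas[i] += replicas[i + stride]
        stride *= 2
        steps += 1           # one merge round
    return replicas[0], steps


values = list(range(1, 33))  # N = 32, the size of one warp
serial_total, serial_steps = gather_without_replication(values)
merged_total, merge_rounds = gather_with_replication(values)
```

With N = 32 the serialized version takes 32 steps and the replicated version takes 5 merge rounds. Replicas cost shared-memory space, which is presumably why the slide speaks of balancing profits and costs.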

CPU Computation Kernel
- Sequential memory access & lock-free & load balance
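[Figure: CPU computation kernel. 1. Gather: threads 0 to p-1 each process a contiguous range of edges (0 ... r0, r0 ... r1, ..., r_{p-1} ... n) into LocalVertices, with replicas (Rep) per thread. 2. Apply: aggregation of the replicas into GlobalVertices]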
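The three properties in the bullet above (sequential access, lock-free, load balance) can be illustrated with an edge-balanced partition. In the sketch below the threads are simulated sequentially, the gather is a plain sum of edge values, and `gather_partitioned` is a hypothetical name; none of this is taken from Garaph's source.

```python
def gather_partitioned(csc_idx, edge_values, num_threads):
    """Split the edge array evenly; each 'thread' fills a private buffer."""
    n = len(csc_idx) - 1
    m = csc_idx[-1]
    # Equal-sized contiguous edge ranges give each thread the same work.
    bounds = [m * t // num_threads for t in range(num_threads + 1)]

    partials = []
    for t in range(num_threads):
        local = {}           # private to this thread, so no locks are needed
        v = 0
        for e in range(bounds[t], bounds[t + 1]):
            # Advance to the destination vertex that owns edge e.
            while csc_idx[v + 1] <= e:
                v += 1
            local[v] = local.get(v, 0) + edge_values[e]
        partials.append(local)

    # Aggregation: a vertex whose edges straddle a range boundary appears
    # in two local buffers (its replicas) and is merged here.
    result = [0] * n
    for local in partials:
        for v, partial_sum in local.items():
            result[v] += partial_sum
    return result, bounds


csc_idx = [0, 2, 4, 5, 6, 9, 9]
edge_values = [1, 2, 1, 3, 5, 4, 2, 5, 1]   # assumed to be in CSC order
sums, bounds = gather_partitioned(csc_idx, edge_values, num_threads=3)
```

With three threads, vertex 1 is split across the first boundary (edge 2 goes to thread 0, edge 3 to thread 1), so it is the one vertex that needs aggregation in this example.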

Dual Modes Processing Engine
- Pull & Notify-pull
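The slide only names the two modes. The sketch below assumes the following semantics: in pull mode every vertex pulls from its neighbors in every round, while in notify-pull mode vertices that changed in the previous round notify their neighbors and only notified vertices pull. It uses min-label propagation for connected components on a small undirected graph; all function names are hypothetical.

```python
def pull_label(labels, adj, v):
    """Smallest label among v and its neighbors."""
    return min([labels[v]] + [labels[u] for u in adj[v]])


def cc_pull(adj):
    labels = list(range(len(adj)))
    pulls, changed = 0, True
    while changed:
        changed = False
        new_labels = labels[:]
        for v in range(len(adj)):        # every vertex pulls every round
            pulls += 1
            best = pull_label(labels, adj, v)
            if best < labels[v]:
                new_labels[v] = best
                changed = True
        labels = new_labels
    return labels, pulls


def cc_notify_pull(adj):
    labels = list(range(len(adj)))
    pulls = 0
    active = set(range(len(adj)))        # initially everything is active
    while active:
        notified = set()
        for u in active:                 # active vertices notify neighbors
            notified.update(adj[u])
        new_labels = labels[:]
        next_active = set()
        for v in notified:               # only notified vertices pull
            pulls += 1
            best = pull_label(labels, adj, v)
            if best < labels[v]:
                new_labels[v] = best
                next_active.add(v)
        labels = new_labels
        active = next_active
    return labels, pulls


# A path 0-1-2-3-4 and a triangle 5-6-7.
adj = [[1], [0, 2], [1, 3], [2, 4], [3], [6, 7], [5, 7], [5, 6]]
labels_pull, pulls_pull = cc_pull(adj)
labels_np, pulls_np = cc_notify_pull(adj)
```

Both modes reach the same labels. The triangle converges after one round, so notify-pull stops visiting it while the path is still propagating; the pull count here ignores the cost of sending notifications.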

Hybrid CPU-GPU Scheduling
[Figure: schedule of pages across devices. The CPU takes fewer pages, each with a longer time per page; GPU0 and GPU1 each take more pages with a shorter time per page]

- CPU
  - Pros: thread sequential processing
  - Suit: pull/notify-pull dual-mode processing
- GPU
  - Pros: SIMD parallel processing
  - Suit: replication-based gather processing (only pull)
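A simple way to picture page scheduling across devices of different speeds is a greedy earliest-finish assignment. The heuristic, the `schedule_pages` name, and the per-page times below are all assumptions made for illustration; the slides do not specify Garaph's scheduling policy at this level of detail.

```python
def schedule_pages(num_pages, time_per_page):
    """Give each page to the device that would finish it earliest.

    time_per_page maps a device name to its (assumed constant) seconds
    per page. Returns the assignment and the overall finish time.
    """
    finish = {dev: 0.0 for dev in time_per_page}
    assignment = {dev: [] for dev in time_per_page}
    for page in range(num_pages):
        dev = min(finish, key=lambda d: finish[d] + time_per_page[d])
        finish[dev] += time_per_page[dev]
        assignment[dev].append(page)
    return assignment, max(finish.values())


# Made-up timings: each GPU is twice as fast per page as the CPU.
times = {"CPU": 2.0, "GPU0": 1.0, "GPU1": 1.0}
assignment, makespan = schedule_pages(10, times)
```

With these made-up timings the CPU ends up with 2 pages and each GPU with 4, and all three devices finish at the same time, which is the balance a hybrid schedule aims for.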

Experiment Setup
- Machine information
  - CPU: Intel Xeon E5-2650 v3
    - 10 cores, 20 threads, 2.3-3.0GHz
  - Memory: 64GB dual-channel DDR4 2133MHz
  - GPU: NVIDIA GeForce GTX 1070
    - 1920 cores, 15 SMs, 8GB memory
- Typical graph applications
  - PR, CC, SSSP, NN, HS, CS
- Datasets
  - 7 real-world datasets
- Compare
  - CuSha (HPDC'14), Ligra (PPoPP'13), Gemini (OSDI'16)

Graph          |V|    |E|    Max in-deg  Avg deg  Size
uk-2007@1M     1M     41M    0.4M        41       0.6GB
uk-2014-host   4.8M   51M    0.7M        11       0.8GB
enwiki-2013    4.2M   0.1B   0.4M        24       1.7GB
gsh-2015-tpd   31M    0.6B   2.2M        20       10GB
twitter-2010   42M    1.5B   0.8M        35       27GB
sk-2005        51M    1.9B   8.6M        39       35GB
renren-2010    58M    2.8B   0.3M        48       44GB

Evaluation: Overall Performance
- Run 10 iterations of PR
- Up to 4.05x faster than the fastest of the compared systems

[Figure: PR runtime (s) for CuSha, Ligra, Gemini, Garaph-C, Garaph-G and Garaph-H on uk-2007-05@1M, uk-2014-host, enwiki-2013, gsh-2015-tpd, twitter-2010, sk-2005 and renren-2010]

Evaluation: Overall Performance
- Run CC to convergence
- GPU is much slower than CPU without activation scheme
- Up to 5.36x faster than the fastest of the compared systems

[Figure: CC runtime (s) on uk-2007-05@1M, uk-2014-host, enwiki-2013, gsh-2015-tpd, twitter-2010, sk-2005 and renren-2010]

Evaluation: Customized Replication
- SK-2005 dataset
  - Slowest is 45.17x slower than the fastest one
  - Correlation: 0.9853
  - => vertices of high degree
- Customized replication
  - Up to 32.15x speedup

Evaluation: Hybrid CPU-GPU Scheduling

Conclusions
Garaph: efficient GPU-accelerated graph processing on a single machine

- Replication-based GPU computation kernel.
- Dual modes replication-based CPU computation kernel.
- Scheduler for hybrid CPU and GPU.

Q&A
