
Mapping OpenMP to the Stream Programming Model

Hu Ming, Zhang Fangzhou, Yue Kun

Objective
1. Study how the parallel mechanisms in OpenMP map to the stream programming model (CUDA).
2. Identify which parts are suitable for translation.
3. Analyze typical scientific applications.

Outline
OpenMP vs CUDA: Execution model

OpenMP vs CUDA: Semantics


OpenMP vs CUDA: Performance Analysis of Benchmarks

OpenMP vs CUDA Execution Model


OpenMP vs CUDA Semantics


Parallel Construct
parallel

Worksharing Construct
loop, sections, single

Master and Synchronization Construct


critical, barrier, taskwait, atomic, flush, ordered

Data Environment
shared, private, firstprivate, lastprivate, reduction, copyin, copyprivate

OpenMP vs CUDA Semantics


#include <omp.h>

int main(void)
{
    int x;
    x = 0;
    #pragma omp parallel shared(x)
    {
        #pragma omp critical
        x = x + 1;
    } /* end of parallel section */
    return 0;
}
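The critical-section update above can be made checkable with a small sketch. The helper names `count_with_critical` and `count_with_atomic` are hypothetical and not from the slides; the sketch assumes the shared update is a single-word increment, in which case the `atomic` construct listed earlier gives the same result. On the CUDA side, the closest counterpart for such an update would likely be an atomic operation such as `atomicAdd`, though the slides do not state which mapping they use.

```c
#include <assert.h>

/* Hypothetical helper: performs n increments of a shared counter,
   each one protected by a critical section. */
int count_with_critical(int n)
{
    assert(n >= 0);
    int x = 0;
    #pragma omp parallel for shared(x)
    for (int i = 0; i < n; i++) {
        #pragma omp critical
        x = x + 1;              /* one thread at a time */
    }
    return x;
}

/* Same count, using the atomic construct for the single-word update. */
int count_with_atomic(int n)
{
    assert(n >= 0);
    int x = 0;
    #pragma omp parallel for shared(x)
    for (int i = 0; i < n; i++) {
        #pragma omp atomic
        x += 1;                 /* atomic read-modify-write */
    }
    return x;
}
```

Both functions return `n` regardless of the number of threads, and the code still compiles and gives the same result when OpenMP is disabled and the pragmas are ignored.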

OpenMP vs CUDA Semantics

#pragma omp for ordered [clauses...]
    (loop region)
    #pragma omp ordered
        structured_block
    (end of loop region)
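A concrete instance of the ordered template may help. This is a minimal sketch, not taken from the slides: the function name `squares_in_order` and the `schedule(dynamic)` clause are assumptions chosen for illustration. The computation of each square may run concurrently, while the block under `ordered` executes in sequential iteration order.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical helper: appends i*i for i = 0..n-1 to out, in iteration
   order, and returns the number of values written. */
int squares_in_order(int n, int *out)
{
    assert(n >= 0 && out != NULL);
    int pos = 0;                 /* shared; touched only inside ordered */
    #pragma omp parallel for ordered schedule(dynamic)
    for (int i = 0; i < n; i++) {
        int sq = i * i;          /* may execute concurrently */
        #pragma omp ordered
        {
            out[pos] = sq;       /* executes in loop order */
            pos++;
        }
    }
    return pos;
}
```

Because the ordered block serializes part of every iteration, it is the kind of construct whose cost depends heavily on how much work remains outside the block.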

OpenMP vs CUDA Semantics

Most directives and clauses can be mapped onto stream programs.

OpenMP vs CUDA Performance


OpenMP: OS-level threads; thread-centric parallel processing model; per-thread logic can be complicated
CUDA: lightweight hardware threads; data-centric processing model; simple control logic; inefficient at handling branches

Map those constructs that have large parallelism and uniform processing among threads
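As an illustration of a construct with large parallelism and uniform processing, consider a worksharing loop in which every iteration does the same work on different data. The sketch below is an assumed example (the `saxpy` routine is not from the slides). Under a CUDA mapping, each iteration would plausibly become one lightweight thread, with the index derived from the block and thread IDs.

```c
#include <assert.h>
#include <stddef.h>

/* Hypothetical example: y[i] = a * x[i] + y[i] for every i.
   No branches and no cross-iteration dependences, so all threads
   perform identical processing. */
void saxpy(int n, float a, const float *x, float *y)
{
    assert(n >= 0 && x != NULL && y != NULL);
    #pragma omp parallel for
    for (int i = 0; i < n; i++) {
        y[i] = a * x[i] + y[i];
    }
}
```

A call such as `saxpy(4, 2.0f, x, y)` updates `y` in place; the iterations can be distributed in any order because none reads a value another writes.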

OpenMP vs CUDA Performance


Not suitable:

single, sections: small parallelism and different processing among threads
master: parallelism is 1
barrier, taskwait: require all threads to be grouped into one block
lastprivate: processing is not uniform among threads
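The lastprivate case can be sketched as follows; the function `last_value` is a hypothetical example, not from the slides. Every thread works on a private copy of `last`, but only the thread that executes the sequentially last iteration copies its value back to the original variable, so that one thread does work the others do not.

```c
#include <assert.h>

/* Hypothetical helper: returns the value assigned in the sequentially
   last iteration, i.e. 2 * (n - 1). Requires n > 0, since the result
   is not defined when no iteration runs. */
int last_value(int n)
{
    assert(n > 0);
    int last = -1;
    #pragma omp parallel for lastprivate(last)
    for (int i = 0; i < n; i++) {
        last = 2 * i;            /* private copy in each thread */
    }
    return last;                 /* written back from iteration n-1 */
}
```

In a data-parallel mapping, this write-back would need a per-thread test for "am I the last iteration", which is the kind of non-uniform branch the slide above flags.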

OpenMP vs CUDA
To understand whether it is reasonable to translate an OpenMP program into a CUDA program, we should analyze the application's patterns.

Conclusion
1. A majority of scientific applications are suitable for mapping to the stream programming model.
2. Heterogeneous architectures using both CPU and GPU will become more common.

Comments:
1. This paper's work is mainly analysis.
2. We think more real applications should be considered, not just benchmarks.
3. Automatically translating OpenMP programs to CUDA programs may be possible.
