
Intel Open-source SYCL compiler project

Konstantin Bobrovskii, Intel

2019 Khronos Embedded Vision Summit


Agenda

SYCL intro
Intel Open-Source SYCL project
Q&A

SYCL Intro

What is SYCL?

Single-source heterogeneous programming using STANDARD C++


• Use C++ templates and lambda functions for host & device code
• Aligns the hardware acceleration of OpenCL with the direction of the C++ standard

Developer Choice

The development of the two specifications is aligned, so code can be easily shared between the two approaches:

• C++ kernel language: low-level control; 'GPGPU'-style separation of device-side kernel source code and host code
• Single-source C++: programmer familiarity; the approach also taken by C++ AMP and OpenMP

Intel confidential 4
Why SYCL? Reactive and Proactive Motivation

Reactive to OpenCL pros and cons:
• OpenCL has a well-defined, portable execution model.
• OpenCL is too verbose for many application developers.
• OpenCL remains a C API and only recently supported C++ kernels.
• The just-in-time compilation model and disjoint source code are awkward and contrary to HPC usage models.

Proactive about future C++:
• SYCL is based on purely modern C++ and should feel familiar to C++11 users.
• SYCL is expected to run ahead of C++Next regarding heterogeneity and parallelism; ISO C++ of tomorrow may look a lot like SYCL.
• Not held back by C99 or C++03 compatibility goals.

OpenCL Example (w/ C++ Wrappers):

nstream.cl:

__kernel void nstream(int length,
                      double s,
                      __global double * A,
                      __global double * B,
                      __global double * C)
{
  int i = get_global_id(0);
  A[i] += B[i] + s * C[i];
}

nstream.cpp:

cl::Context gpu(CL_DEVICE_TYPE_GPU, &err);
cl::CommandQueue queue(gpu);

cl::Program program(gpu,
    prk::opencl::loadProgram("nstream.cl"), true);

auto kernel = cl::make_kernel<int, double, cl::Buffer,
    cl::Buffer, cl::Buffer>(program, "nstream", &err);

auto d_a = cl::Buffer(gpu, begin(h_a), end(h_a));
auto d_b = cl::Buffer(gpu, begin(h_b), end(h_b));
auto d_c = cl::Buffer(gpu, begin(h_c), end(h_c));

kernel(cl::EnqueueArgs(queue, cl::NDRange(length)),
       length, scalar, d_a, d_b, d_c);
queue.finish();

SYCL Example
sycl::gpu_selector device_selector;
sycl::queue q(device_selector);

sycl::buffer<double> d_A { h_A.data(), h_A.size() };
sycl::buffer<double> d_B { h_B.data(), h_B.size() };
sycl::buffer<double> d_C { h_C.data(), h_C.size() };

q.submit([&](sycl::handler& h)
{
  auto A = d_A.get_access<sycl::access::mode::read_write>(h);
  auto B = d_B.get_access<sycl::access::mode::read>(h);
  auto C = d_C.get_access<sycl::access::mode::read>(h);

  h.parallel_for<class nstream>(sycl::range<1>{length}, [=](sycl::item<1> i) {
    A[i] += B[i] + s * C[i];
  });
});
q.wait();

• Retains OpenCL's ability to easily target different devices in the same thread.
• Accessors create a DAG that triggers data movement and represents execution dependencies.
• Parallel loops are explicit like C++ vs. implicit in OpenCL.
• Kernel code does not need to live in a separate part of the program.

SYCL Example: Graph of Asynchronous Executions
[Diagram: dependency graph — add1 reads A, B and writes C; add2 reads A, C and writes D; add3 reads A, D and writes E, so the three kernels form a chain.]

myQueue.submit([&](handler& cgh) {
  auto A = a.get_access<access::mode::read>(cgh);
  auto B = b.get_access<access::mode::read>(cgh);
  auto C = c.get_access<access::mode::discard_write>(cgh);
  cgh.parallel_for<class add1>(range<2>{N, M},
    [=](id<2> index) { C[index] = A[index] + B[index]; });
});

myQueue.submit([&](handler& cgh) {
  auto A = a.get_access<access::mode::read>(cgh);
  auto C = c.get_access<access::mode::read>(cgh);
  auto D = d.get_access<access::mode::write>(cgh);
  cgh.parallel_for<class add2>(range<2>{P, Q},
    [=](id<2> index) { D[index] = A[index] + C[index]; });
});

myQueue.submit([&](handler& cgh) {
  auto A = a.get_access<access::mode::read>(cgh);
  auto D = d.get_access<access::mode::read>(cgh);
  auto E = e.get_access<access::mode::write>(cgh);
  cgh.parallel_for<class add3>(range<2>{S, T},
    [=](id<2> index) { E[index] = A[index] + D[index]; });
});
SYCL Example: Graph of Asynchronous Executions
[Diagram: dependency graph — add1 reads a, b and writes c; add2 reads a, d and writes e; the two are independent and can run concurrently; add3 reads c, e and writes f, so it waits for both.]

myQueue.submit([&](handler& cgh) {
  auto A = a.get_access<access::mode::read>(cgh);
  auto B = b.get_access<access::mode::read>(cgh);
  auto C = c.get_access<access::mode::discard_write>(cgh);
  cgh.parallel_for<class add1>(range<2>{N, M},
    [=](id<2> index) { C[index] = A[index] + B[index]; });
});

myQueue.submit([&](handler& cgh) {
  auto A = a.get_access<access::mode::read>(cgh);
  auto D = d.get_access<access::mode::read>(cgh);
  auto E = e.get_access<access::mode::discard_write>(cgh);
  cgh.parallel_for<class add2>(range<2>{P, Q},
    [=](id<2> index) { E[index] = A[index] + D[index]; });
});

myQueue.submit([&](handler& cgh) {
  auto C = c.get_access<access::mode::read>(cgh);
  auto E = e.get_access<access::mode::read>(cgh);
  auto F = f.get_access<access::mode::discard_write>(cgh);
  cgh.parallel_for<class add3>(range<2>{S, T},
    [=](id<2> index) { F[index] = C[index] + E[index]; });
});

• SYCL queues are out-of-order by default – data dependencies order kernel executions
• Will also be able to use in-order queue policies to simplify porting
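To make the dependence rule concrete, here is a hypothetical plain-C++ sketch (not the SYCL API; `Scheduler`, `Mode`, and `submit` are invented names) of how a runtime can derive kernel ordering from accessor modes: each submitted kernel gains an edge to the last writer of every buffer it accesses.

```cpp
#include <cassert>
#include <map>
#include <string>
#include <utility>
#include <vector>

// Illustrative sketch only: which kernel must wait for which, derived
// from accessor modes. A kernel depends on the last kernel that wrote
// each buffer it touches (covers read-after-write and write-after-write;
// anti-dependences are omitted for brevity).
enum class Mode { read, write };

struct Scheduler {
    std::map<std::string, int> last_writer;  // buffer name -> kernel id
    std::vector<std::vector<int>> deps;      // kernel id -> prerequisite ids

    int submit(const std::vector<std::pair<std::string, Mode>>& accessors) {
        int id = static_cast<int>(deps.size());
        deps.emplace_back();
        for (const auto& [buf, mode] : accessors) {
            auto it = last_writer.find(buf);
            if (it != last_writer.end())
                deps[id].push_back(it->second);  // edge to the buffer's last writer
            if (mode == Mode::write)
                last_writer[buf] = id;           // this kernel is now the last writer
        }
        return id;
    }
};
```

Submitting the three kernels of the diamond example gives add1 and add2 no prerequisites, while add3 depends on both, matching the graph on the slide.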

Hierarchical parallelism (logical view)

parallel_for_work_group (…) {

    parallel_for_work_item (…) {

    }
}

• Fundamentally a top-down expression of parallelism
• Many embedded features and details, not covered here
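A plain C++ analog (illustrative only; `hierarchical_sum` is an invented name, not a SYCL API) of the two nested levels: the outer loop plays the role of `parallel_for_work_group` over work-groups, the inner loop the role of `parallel_for_work_item` over work-items within a group.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

// Sequential stand-in for the hierarchical form: the outer loop iterates
// over work-groups, the inner loop over the work-items of one group.
// Each group produces one partial sum of its slice of the input.
void hierarchical_sum(const std::vector<int>& in, std::vector<int>& group_sums,
                      std::size_t group_size) {
    std::size_t num_groups = in.size() / group_size;
    group_sums.assign(num_groups, 0);
    for (std::size_t g = 0; g < num_groups; ++g) {      // "parallel_for_work_group"
        for (std::size_t i = 0; i < group_size; ++i) {  // "parallel_for_work_item"
            group_sums[g] += in[g * group_size + i];
        }
    }
}
```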

Intel Open-Source SYCL Project
Motivation, location, architecture, plugin interface, future directions

Motivation
Why SYCL
• SYCL is an open standard
• Can target any offload device
• Expressive and highly productive

Why open source
• Promote SYCL as the offload programming model
• Foster collaboration and innovation in the industry

Location and information

GitHub: https://github.com/intel/llvm/tree/sycl – a fork of the llvm repo

The primary project goal is contribution to llvm.org:
• RFC: https://lists.llvm.org/pipermail/cfe-dev/2019-January/060811.html
• First changes to the clang driver are already committed: https://reviews.llvm.org/D57768
• Detailed plan for upstreaming: https://github.com/intel/llvm/issues/49

Overview of Changes atop LLVM/Clang

• SYCL runtime: core runtime components, SYCL API definition and implementation.
• Compilation driver: two-step host + device compilation, separate compilation and linking, AOT compilation.
• Front end: device compiler diagnostics, device code marking and outlining, address space handling, integration header generation.
• Tools:
  – clang-offload-bundler: generates "fat objects"; updated to support the SYCL offload kind and partial linking to enable "fat static libraries".
  – clang-offload-wrapper: new tool to create fat binaries (host + device executables in one file); a portable analog of the OpenMP offload linker script.
• LLVM: new sycldevice environment to customize the spir target.
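To illustrate the fat-binary idea, here is a hedged sketch (the layout below is hypothetical, not the actual clang-offload-wrapper format; `DeviceImage` and `FatBinary` are invented names): device images are bundled with a target tag, and a runtime selects the image matching the device it offloads to.

```cpp
#include <cassert>
#include <string>
#include <vector>

// Hypothetical layout, for illustration only: a fat binary carries several
// device images, each tagged with its target, next to the host executable.
struct DeviceImage {
    std::string target;               // e.g. "spir64" or a native device tag
    std::vector<unsigned char> bits;  // the device code blob
};

struct FatBinary {
    std::vector<DeviceImage> images;

    // The runtime picks the first image whose tag matches the offload device;
    // no match means this binary cannot offload to that device.
    const DeviceImage* find(const std::string& target) const {
        for (const auto& img : images)
            if (img.target == target) return &img;
        return nullptr;
    }
};
```

Keeping multiple tagged images in one file is also what makes the device-code-versioning goal possible: several binaries for the same kernel can coexist and be chosen at run time.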

SYCL Compiler Architecture
Single host + multiple device compilers
• A 3rd-party host compiler can be used
  – Integration header with kernel details
• Support for JIT and AOT compilation
• Support for separate compilation and linking
[Diagram: SYCL sources enter the device compiler (SYCL FE, LoopNest materializer, scalar opt), whose LLVM IR is translated to SPIR-V code for JIT compilation over OpenCL*, or compiled ahead of time to native device code via SPIR-V back ends. The host compiler consumes the SYCL headers plus the generated integration header, and everything is linked into a single fat binary used by the SYCL offload runtime. A separate path builds the SYCL headers over OpenMP through a single host/device back end with vectorizer and loop optimizations, producing a CPU binary for the OpenMP-based SYCL host device runtime (the device compiler MT+SIMD path).]

* OpenCL dependence is to be moved under the Plugin Interface

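The integration header mentioned above can be pictured with a sketch like this (illustrative only; `kernel_info`, `lookup_name`, and the mangled string are invented, not the actual generated code): the device compiler emits a header that lets the host compiler map each kernel name type to the string the runtime uses to locate the kernel in the device image, so the host compiler never needs to see the kernel body.

```cpp
#include <cassert>
#include <string>

// Kernel name type as it appears in the application source.
class nstream;

// Primary template: the host side knows nothing about kernels by default.
template <typename KernelName> struct kernel_info;

// --- what a generated integration header might contain (hypothetical) ---
template <> struct kernel_info<nstream> {
    static constexpr const char* name = "_ZTS7nstream";  // invented mangled name
};
// ------------------------------------------------------------------------

// The runtime uses the recorded string to find the kernel in the device
// image; the C++ type never crosses the host/device boundary.
template <typename KernelName> std::string lookup_name() {
    return kernel_info<KernelName>::name;
}
```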
Simplified Compilation Flow for JIT scenario
[Diagram: src1.cpp and src2.cpp (with the SYCL RT headers) go through two compilers. The device compiler middle end produces a .bc file per source, and the integration header generator produces header1.h and header2.h. llvm-link links the .bc files with the device library (.bc_lib) into a single linked .bc, llvm-spirv translates it to a .spirv image, and clang-offload-wrapper wraps that into a device-binary .o. The host compiler compiles each source (including its generated integration header) into a host .o; ld then links the host objects, the host library (.lib), and the device binary into one fat binary.]

SYCL Runtime Architecture
• Modular architecture
• Plugin Interface (PI) to support multiple back-ends
• Support for concurrent offload to multiple devices
• Support for device code versioning
  – Multiple device binaries in a single fat binary

[Diagram: a SYCL application (fat Device X exe + SPIR-V) runs on the SYCL runtime library, whose SYCL API layer contains the SYCL scheduler, PI discovery & plugin infrastructure, memory manager, program & kernel manager, device manager, and a host device RT interface backed by the host device RT (SYCL host / host device). Below it, the SYCL Runtime Plugin Interface (PI types & services, device binary management) dispatches to plugins — an L1RI plugin, a PI/X RT plugin, and a PI/OpenCL plugin — which drive the native runtimes & drivers (Device X native RT, OpenCL runtime, plus TBB and OpenMP runtimes and other layers) for devices X, Y, Z and the CPU.]

SYCL Runtime Plugin Interface
• Defines a programming model atop OpenCL concepts & model
• Abstracts the device layer away from the SYCL runtime
• OpenCL-ICD-like discovery logic
  – Allows implementations for multiple devices to co-exist

[Diagram: libsycl.so dlopens PI_deviceX_plugin.so and PI_OpenCL_plugin.so. PI_deviceX_plugin.so dlopens libdeviceX_rt.so (a non-OpenCL runtime); PI_OpenCL_plugin.so dlopens libOpenCL.so, which dispatches either to libOCL_Y_rt.so and libOCL_CPU_rt.so (a Khronos ICD-compatible OpenCL installation) or to libOCL_Z_rt.so (a "custom" OpenCL installation).]

18
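The dlopen-based discovery can be sketched as follows (a minimal sketch; `discover_plugins` and any plugin names are hypothetical, and real plugins would additionally be queried for their PI entry points): each candidate shared library is tried in turn, and a plugin that fails to load simply means that back-end is unavailable rather than being an error.

```cpp
#include <cassert>
#include <dlfcn.h>
#include <string>
#include <vector>

// ICD-like discovery sketch: try each candidate plugin library and keep
// the handles of those that load. Missing plugins are silently skipped,
// which is what lets multiple device back-ends co-exist optionally.
std::vector<void*> discover_plugins(const std::vector<std::string>& candidates) {
    std::vector<void*> loaded;
    for (const auto& name : candidates) {
        if (void* handle = dlopen(name.c_str(), RTLD_NOW | RTLD_LOCAL))
            loaded.push_back(handle);  // back-end available; query entry points here
    }
    return loaded;
}
```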
Future plans

• NDRange subgroups
• Ordered queue
• Ahead-of-time compilation
• Scheduler improvements
• Unified Shared Memory
• Hierarchical parallelism
• Generic address space
• Windows support
• Specialization constants
• Documentation update
• Fat static libraries
• Device code distribution per cl::sycl::program
Q&A

