
Introduction to OpenCL’s Concurrency and

Execution Model
The goal of heterogeneous computing, which has emerged in recent years, is to combine the best features
of central processing units (CPUs) and graphics processing units (GPUs) in a single device.
Heterogeneous machines are being designed in a wider variety than ever before, and hardware
suppliers are making them widely accessible. The new hardware provides excellent platforms for
innovative new applications. However, due to the differing designs, traditional programming models do
not function very well, thus it is crucial to become familiar with new models like those in OpenCL.

When the development of OpenCL began, its designers observed that developers frequently wrote
C or C++ code for a class of latency-focused algorithms (such as spreadsheets) on CPUs, while they
used CUDA on GPUs for a class of throughput-focused algorithms (such as matrix multiplication). Each
of these approaches, however, supported only one type of processor: C++ did not target GPUs,
and CUDA did not target CPUs. Developers were forced to focus on one and disregard the other. The
true strength of a heterogeneous device is its ability to run programs that combine both types of
algorithms effectively. The question was: how do you program such devices?

One answer is to extend the current platforms with new capabilities; C++ and CUDA are
continuously evolving in response to new hardware. Another answer was a new set of programming
abstractions tailored expressly for heterogeneous computing. Apple made the original proposal
for such a paradigm, and technical teams from several companies refined that idea into OpenCL.

That industry standard is OpenCL. OpenCL was developed through a partnership between software
providers, computer system designers (including designers of mobile platforms), and microprocessor
(embedded, accelerator, CPU, and GPU) manufacturers inside the Khronos Group, which is well-known
for OpenGL and other standards.

OpenCL, which was created in 2008, is currently accessible from several sources and on a variety of
devices. It is always changing to keep up with the most recent advancements in microprocessor
technology. This study focuses on the concurrency and execution model of OpenCL.

OpenCL is an open, royalty-free standard: a framework for programming machines built from a mix of
CPUs, GPUs, and other processors. This class of platforms, known as "heterogeneous systems," has grown
in importance, and OpenCL is the first industry standard that specifically addresses their requirements.

Heterogeneous systems allow a broad range of processing elements to be optimised for particular
workloads and settings. They also let the programmer choose the best architecture for the task at
hand, or the best task for a given architecture. Building heterogeneous systems has seen a recent
surge of experimentation in the computer design field.

With OpenCL, you can create a single application that can operate on a variety of devices, including
laptops, cell phones, and the nodes of enormous supercomputers. There is no other parallel
programming standard with as broad a scope. This is one of the key factors influencing OpenCL's
significance and its ability to revolutionise the software sector.

The new standard represents a significant advancement that enables programmers to utilise the new
combined GPU/CPU processors.

The following are the significant changes:

• Shared virtual memory, which eliminates the need for expensive data transfers between the host and
devices by allowing host and device code to share sophisticated pointer-based structures like trees and
linked lists.

• Dynamic parallelism, which eliminates substantial inefficiencies by allowing device kernels to launch
work to the same device without interacting with the host.

• Generic address spaces, which make programming simpler by allowing single functions to act on data
from either the GPU or the CPU.

• Atomics in the C++ style, which enable a larger class of algorithms to be implemented in OpenCL by
allowing work-items to share data between work-groups and devices.

Performance is provided via parallel hardware, which executes several processes concurrently. Software
that executes as several concurrent streams of operations is necessary for parallel hardware to function
properly; in other words, parallel software is required.

We need to start with the more broad concept of concurrency in order to understand parallel software.
A well-established and well-known notion in computer science is concurrency. A software system is
considered concurrent if it has many active streams of operations that can advance simultaneously. Any
contemporary operating system must support concurrency. By allowing some streams of operations (or
threads) to proceed while others are stopped while waiting for resources, it maximises resource
consumption. It creates the impression that a user is interacting with the system continuously and
almost instantly.

Parallel computation is when concurrent software operates on a computer with numerous processing
elements so that threads really run simultaneously. Parallelism is concurrency that hardware supports.

Programmers face the challenge of finding the concurrency in a problem, expressing it in software,
and running the resulting programme so that the concurrency delivers the required performance.
Identifying the concurrency in a problem can be as straightforward as executing a separate stream of
operations for every pixel in an image, or exceedingly difficult, with numerous operations sharing
data and requiring careful orchestration of their execution.

Programmers must represent concurrency in their source code after it has been discovered in a
situation. In particular, it is necessary to specify the streams of operations that will run concurrently,
associate the data they operate on with them, and manage the relationships between them in order for
the proper answer to be provided when they run simultaneously. The core of the parallel programming
issue is this.
Most programmers cannot work with a parallel computer at its lowest level. Managing every memory
conflict or scheduling individual threads by hand would overwhelm even seasoned parallel
programmers. A high-level abstraction, or model, is therefore essential to make the problem of
parallel programming manageable.

There are far too many programming models, categorised into overlapping groups with names that are
frequently unclear. Task parallelism and data parallelism are the two parallel programming models that
we shall focus on for our needs. The fundamental concepts underlying these two models are simple.

In a data-parallel programming model, programmers think of their problems in terms of collections
of data elements that can be updated concurrently. Parallelism is expressed by concurrently applying
the same stream of instructions (a task) to every data element. The parallelism resides in the data.
A task-parallel programming architecture allows for direct definition and control of concurrent tasks by
programmers. Problems are broken down into concurrently executable tasks, which are then assigned
to processing elements (PEs) of a parallel computer. Although it works best when each task is
autonomous in every way, this programming style can also be applied to activities that share data. When
the last job is finished, the computation involving a group of tasks is also finished. It might be challenging
to distribute jobs so that they all finish around the same time because of the large variation in
computational demands among them. This is the load balance issue.

The requirements of the problem being solved determine whether data parallelism or task parallelism
is the better fit. Problems expressed as concurrent updates to the points on a grid, for example, map
naturally onto data parallelism, whereas problems formulated as traversals across graphs are readily
expressed in terms of task parallelism. A proficient parallel programmer must therefore be at ease
with both programming paradigms, and a general-purpose programming framework (like OpenCL) must
support both.

The next step in the parallel programming process, regardless of the programming model, is to map the
programme onto actual hardware. Here, heterogeneous computers pose particular difficulties. The
system's computational components may have different instruction sets, memory hierarchies, and
operating speeds. An effective programme must account for these variations and map the parallel
software correctly onto the most suitable OpenCL devices.

Programmers have traditionally approached this issue by conceptualizing their software as a collection
of modules that implement various aspects of their issue. The modules have a clear connection to the
heterogeneous platform's parts. Graphics software, for instance, runs on the GPU, while other
software runs on the CPU.

This approach was broken by general-purpose GPU (GPGPU) programming. Outside of graphics, other
algorithms were altered to fit on the GPU. The GPU handles all of the "interesting" computing, whereas
the CPU only sets up the computation and controls I/O. In essence, the heterogeneous platform is
disregarded, and the GPU is the only system component that is given attention.

OpenCL discourages this strategy. A user effectively "pays" for all the OpenCL devices in a system,
so software should take advantage of them all. This is exactly what a programming environment
designed for heterogeneous platforms should encourage, and it is what OpenCL does.
Hardware diversity is challenging. High-level abstractions, which mask the complexity of the hardware,
have become a staple of programming. A heterogeneous programming language exposes that diversity
and runs against the long-standing trend toward increasing abstraction.

And that is fine. A single language does not need to meet the needs of every programming community.
High-level frameworks that simplify the programming problem can map high-level languages onto a
low-level hardware abstraction layer that provides portability.

OpenCL's conceptual basis

There are numerous applications that OpenCL supports. It's challenging to generalise these
applications in a broad sense. However, a heterogeneous platform application must always do the
following actions:

1. Identify the elements of the heterogeneous system.

2. Examine these components' qualities so that the programme can adjust to the unique properties
of various hardware components.

3. Create the kernels, or blocks of code, that the platform will use.

4. Configure and work with the memory items needed for the computation.

5. Run the kernels on the appropriate system components and in the correct sequence.

6. Collect the final results.

These actions are carried out using a number of OpenCL APIs and a kernel programming
environment. We will use a "divide and conquer" technique to explain how all of this operates.
We will divide the problem into the following models:

• Platform model: a high-level description of the heterogeneous system.

• Execution model: an abstract representation of how streams of instructions run on the heterogeneous platform.

• Memory model: the set of memory regions within OpenCL and how they interact during an OpenCL computation.

• Programming models: the high-level abstractions programmers use to design algorithms for applications.

Platform Model

The OpenCL platform model defines, at a high level, any heterogeneous platform used with OpenCL. An OpenCL platform always includes a single host. The host interacts with the environment outside the OpenCL programme, whether that is I/O or interaction with a program's user.

Figure 1.1 The OpenCL platform model with one host and one or more OpenCL devices. Each OpenCL device has one or more compute units, each of which has one or more processing elements.

One or more OpenCL devices are attached to the host. The device is where the streams of instructions (or kernels) execute; for this reason, an OpenCL device is often referred to as a compute device. A device can be a CPU, GPU, DSP, or any other processor provided by the hardware and supported by the OpenCL vendor.

The OpenCL devices are organised into compute units, which are further divided into one or more processing elements (PEs). A device's computations take place within its PEs. In a later section, when we discuss work-groups and the OpenCL memory model, it will become clear why an OpenCL device is divided into compute units and processing elements.
The Contents of OpenCL
We have been concentrating on the principles of OpenCL so far. We now change topics and
discuss how the OpenCL framework supports these concepts. The following elements make up
the OpenCL framework:
• The OpenCL platform API describes the operations carried out by the host programme to
identify OpenCL devices and their capabilities as well as to establish the environment in which
the OpenCL application will run.
• The OpenCL runtime API: this API manipulates the context to create command-queues and carries out
other actions that occur at runtime. For instance, the runtime API provides the functions that submit
commands to a command-queue.
• The OpenCL programming language: this is the language used to write the kernel-level code. It is
based on a subset of ISO C99, extended with features for parallelism.
