Parallelism Joins Concurrency For Multicore Embedded Computing

S. Tucker Taft, AdaCore - February 18, 2014

For many of us, the terms parallelism and concurrency are near synonyms, and even if we feel they represent somewhat different concepts, we don't all agree on what makes one construct parallel and the other concurrent. But as we enter the era of multicore, manycore, and GPU-based computing, it is useful to distinguish between the two, and to see how the distinction affects the embedded programmer's world.

Concurrent programming has a long history in embedded systems, and in that context it represents the approach
of using separate threads of control to manage the multiple activities taking place. Often these threads of control
have different relative priorities, with interrupt handlers preempting non-interrupt code, and code requiring a
higher invocation frequency or an earlier deadline preempting code with lower frequency or later deadlines. For
threads of control managing activities with equivalent priorities, often a round-robin scheduling approach is used
to ensure no activity gets ignored for too long. None of this requires multiple physical processors, and it can be
seen overall as a way to share limited resources in an appropriate way when many activities need to be
managed. It can also be seen as a way to simplify the logic of the programming, by allowing any given thread of
control to be restricted to managing a single activity. Because introducing concurrent threads of control can
simplify the logic of the overall program, it is all right if the language constructs to support this approach are
themselves of somewhat heavier weight.

In contrast with concurrent programming, parallel programming is often more about solving a computationally-
intensive problem by splitting it up into pieces, using an overall divide-and-conquer strategy, to take better
advantage of multiple processing resources to solve a single problem. Rather than simplifying the logic,
introducing parallel programming can often make the logic more complex, and therefore it is important that the
parallel programming constructs be very lightweight both syntactically and at run-time, or else the complexity
and run-time overhead may outweigh the potential speedup.

Scheduling the threads


For both concurrent and parallel programming, the actual number of physical processors may vary from one run
of the program to the next, so it is typical to provide a level of abstraction on top of the physical processors,
typically provided by a scheduler. For concurrent programming, preemption is often used by the scheduler,
where one thread is interrupted in the middle of its execution to allow a higher priority thread to take over the
shared processing resource. For parallel programming, more often the goal is simply to get all of the work done
in the shortest possible time, so preemption is not used as often, while overall throughput becomes the critical
criterion of success.

For concurrent programming used to manage external activities of varying criticalities, a real-time scheduler often
relies on priorities assigned using some sort of rate-monotonic or deadline-monotonic analysis, to ensure that all
threads will meet their deadlines [3]. In contrast, for parallel programming, an approach called work stealing
(Figure 1) has emerged as a robust approach to balancing the load across multiple physical processors, while
providing good locality of reference for a given processor, and good separation of access between processors.

Figure 1: Work stealing approach using double-ended queues

Work stealing [2] is based on some relatively simple concepts, but often requires very careful coding of the
underlying scheduler to achieve the required low overhead. The basic idea is that computations are broken up by
the compiler into very lightweight threads (which we will call picothreads), and in the scheduler each physical
processor has a server with its own double-ended queue (often called a deque) of picothreads (Figure 1).

A typical picothread might represent the execution of a single iteration of a loop, or the evaluation of a single
subexpression. Picothreads are automatically placed on the tail of the deque of the server that spawned them,
and when a server finishes working on a given picothread, it removes the picothread it most recently added to its
own deque (from the tail) and starts running that one. This last-in-first-out (LIFO) discipline effectively uses the deque as a stack of work to do.

At some point a server’s deque becomes empty, the server having finished the overall computation it was performing. At that point the server steals a picothread from one of the other servers, but in this case it removes the oldest picothread, at the head of the other server’s deque. That is, for stealing, a first-in-first-out (FIFO) discipline is used. The effect is that when serving its own deque, a server picks up a picothread that will likely be manipulating data the associated processor was recently using. When stealing from another deque, it picks up a picothread that has been languishing there, which will likely be manipulating data that is not sitting in any processor’s cache, and is likely not in close physical proximity to the data being manipulated by any other processor.
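
To make this concrete, here is a minimal sketch in Java of a per-server deque of picothreads, represented simply as Runnables. The class name and the locking are illustrative only: production work-stealing schedulers use carefully engineered lock-free deques rather than a single lock per deque.

    import java.util.ArrayDeque;
    import java.util.Deque;

    // Minimal sketch of a per-server deque of picothreads (here just
    // Runnables). A single lock keeps the example short; real schedulers
    // use lock-free deques instead.
    final class WorkStealingDeque {
        private final Deque<Runnable> deque = new ArrayDeque<>();

        // The owning server pushes newly spawned picothreads on the tail...
        synchronized void push(Runnable picothread) {
            deque.addLast(picothread);
        }

        // ...and pops from the tail as well (LIFO), picking up work whose
        // data is most likely still warm in the local processor's cache.
        synchronized Runnable pop() {
            return deque.pollLast();
        }

        // A thief takes from the head (FIFO), stealing the oldest, coldest
        // picothread and minimizing cache interference with the victim.
        synchronized Runnable steal() {
            return deque.pollFirst();
        }
    }

A server loop would call pop() until its own deque is empty, and only then call steal() on the other servers' deques.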

Individual picothreads are small enough that there is generally no need to preempt them before they complete,
which means they don’t need a complete stack of their own – they can share a stack provided by the
server. To terminate an overall parallel computation prematurely it is often adequate to prevent new picothreads
from starting, without trying to interrupt a given picothread in the middle of its execution. In fact, it is sharing a
stack and using a run-to-completion approach that makes these picothreads so lightweight relative to the typical
preemptible heavier-weight threads used in concurrent programming.

There are many subtleties to work stealing, and it remains an active area of research: deciding which server to steal from, how many picothreads to steal at once, how to implement an efficient double-ended queue that supports exclusive access from one end and shared access from the other, how to handle any synchronization between the picothreads, and so on. Nevertheless, the basic approach of work stealing has been adopted very widely, including by various parallel languages (Cilk+, Go, Rust, ParaSail, etc.) and various libraries (the Java fork/join library, OpenMP, and Intel's Threading Building Blocks, among others).

So how does this all pertain to the mobile and embedded programming world? The fact is that multicore
hardware has arrived in the mobile and embedded worlds as well, in part because a multicore architecture can
often provide the best performance per watt, and power is almost always a major concern in these resource-
constrained environments. But most of these environments will still have hard or soft real-time requirements, which work stealing by itself will not satisfy. What is needed is a careful integration of more traditional concurrent programming approaches with some aspects of work-stealing-based scheduling.

Combining real-time and work stealing

Combining real-time with work-stealing is a new research area, and there are only a few academic papers
focused on this issue so far. Nevertheless, it is clearly becoming more important. Standards such as ARINC
653, which defines a strongly-partitioned architecture for systems of mixed criticality, are being updated to
accommodate multicore hardware. Because of concerns that processors sharing a single chip are not as
independent as required for strong partitioning, one approach being adopted involves assigning multiple cores to
a single partition for its time slice, and then reassigning them all to a different partition when the time slice is
done. This would allow the use of a hybrid work-stealing approach, where each partition has its own set of
server processes, each with its own double-ended queue (Figure 2). When a partition’s time slice ends, all of the
server processes associated with that partition are suspended, and the server processes for the next partition to
execute are resumed. So here we have preemption happening to server processes, while the individual
picothreads can still use a run-to-completion model by treating the server process like a kind of virtual
processor.

Figure 2: Combining real-time and work stealing with a strongly partitioned architecture
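
The gate below sketches in Java how this hybrid might look at user level. It is only an approximation of the scheme just described: in a real ARINC 653 system the partition scheduler would suspend the server processes themselves, whereas here each server voluntarily checks a gate between picothreads, exploiting the fact that picothreads run to completion. The names are hypothetical.

    // Sketch of a per-partition gate. A (hypothetical) partition scheduler
    // opens one partition's gate at the start of its time slice and closes
    // the others. Servers check their gate between picothreads, so the
    // picothreads themselves still run to completion.
    final class PartitionGate {
        private boolean isOpen = false;

        synchronized void open()  { isOpen = true; notifyAll(); }
        synchronized void close() { isOpen = false; }

        // Called by a server thread between picothreads; blocks while the
        // server's partition is not scheduled on the cores.
        synchronized void awaitOpen() throws InterruptedException {
            while (!isOpen) wait();
        }
    }

Each server would call awaitOpen() before dequeuing its next picothread, giving the behavior described above: the partition's servers are effectively suspended at the slice boundary, while no picothread is ever interrupted mid-execution.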

In a similar fashion, prioritized scheduling can be accommodated by creating separate server processes for each
real-time priority. Each has its own dedicated stack, with a lower-priority server process running on a core only
when all higher-priority server processes on the core have nothing to do. An alternative, when the real-time
requirements are softer, is to use only one server process per core, but with separate deques for different
priorities. With this approach, preemption of a running picothread does not occur. However, when the server
process chooses a new picothread to execute, priorities would be obeyed: the server would select first from its
own highest priority non-empty deque, but steal from another server if the latter had a non-empty deque of
higher priority than any of the server’s own non-empty deques.
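
The following Java sketch illustrates this second, softer variant: one server per core, an array of deques indexed by priority, and a selection rule that serves the server's own best deque unless another server has strictly higher-priority work pending. The names are illustrative, and the synchronization a real scheduler would need around the deques is omitted for brevity.

    import java.util.ArrayDeque;
    import java.util.Deque;

    // One server per core, with one deque per priority level (0 = highest).
    // Synchronization around the deques is omitted to keep the sketch short.
    final class PrioritizedServer {
        final Deque<Runnable>[] deques;

        @SuppressWarnings("unchecked")
        PrioritizedServer(int priorities) {
            deques = new Deque[priorities];
            for (int p = 0; p < priorities; p++)
                deques[p] = new ArrayDeque<>();
        }

        // Highest priority level with pending work (deques.length if none).
        int highestPending() {
            for (int p = 0; p < deques.length; p++)
                if (!deques[p].isEmpty()) return p;
            return deques.length;
        }

        // Serve our own best deque from the tail (LIFO), unless some other
        // server has pending work at a strictly higher priority, in which
        // case steal from the head (FIFO) of that server's best deque.
        Runnable next(PrioritizedServer[] others) {
            int own = highestPending();
            for (PrioritizedServer other : others) {
                int theirs = other.highestPending();
                if (theirs < own)
                    return other.deques[theirs].pollFirst();  // steal
            }
            return own < deques.length ? deques[own].pollLast() : null;
        }
    }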

Programming language constructs for concurrency and parallelism


Many programming languages incorporate some notion of concurrent threads, mutual-exclusion locks, and
synchronizing signals and waiting. Per Brinch Hansen’s Concurrent Pascal [1] was one of the first languages to
incorporate many of these concepts into the language itself. Ada and Java also included concurrent programming
concepts from their inception, and many other languages now include these concepts as well. Generally the
execution of a concurrent thread corresponds to the asynchronous execution of something approximating a
named function or procedure. In Ada, this is the task body for the associated task. In Java, it is the run method
of the associated Runnable object. Locks are often associated with some sort of synchronizing object (often
called a monitor), where some or all operations on the object automatically acquire a lock on starting the
operation and automatically release the lock on completion, thereby ensuring that locks and unlocks are always
balanced. In Ada, these are called protected objects and operations, while in Java they are the synchronized
methods of a class.
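
In Java, for example, the balanced locking described above falls out of the synchronized keyword, as in this minimal monitor:

    // A minimal Java monitor: each synchronized method acquires the
    // object's lock on entry and releases it on return, so locks and
    // unlocks are always balanced, even on exception paths.
    final class Counter {
        private long count = 0;

        synchronized void increment() { count++; }
        synchronized long get()       { return count; }
    }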

Signaling and waiting are used to handle cases where concurrent threads need to communicate or otherwise
cooperate, and one thread must wait for one or more other threads to take some action, or some external event
to occur, before it can proceed further. Signaling and waiting are also often mediated by a synchronizing object,
with a thread awaiting some change in the state of the object, and a signal being used to indicate that the state has
changed and some number of waiting threads should recheck to see whether the synchronizing object is now in
the desired state. Conditional critical regions suggested by Hoare and Brinch Hansen represented one of the first
language constructs providing this kind of waiting and signaling implicitly based on a Boolean expression. More
commonly this is provided by explicit Wait and Signal operations on an object or a condition queue (in Java
signaling uses notify or notifyAll). Ada combines the notions of conditional critical regions and monitors by
incorporating entries with entry barriers into the protected object construct, eliminating the need for explicit
Signal and Wait calls. All of these notions represent what we mean by concurrent programming constructs.
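
A classic Java illustration of this wait-then-recheck pattern is a bounded buffer: each thread waits in a loop until the buffer is in the desired state, and every state change notifies the waiters so they can recheck their condition. The while loop plays the role that an entry barrier plays declaratively in Ada.

    import java.util.ArrayDeque;
    import java.util.Deque;

    // A bounded buffer using the signal-and-wait pattern on a monitor.
    final class BoundedBuffer<T> {
        private final Deque<T> items = new ArrayDeque<>();
        private final int capacity;

        BoundedBuffer(int capacity) { this.capacity = capacity; }

        synchronized void put(T item) throws InterruptedException {
            while (items.size() == capacity) wait();  // await space
            items.addLast(item);
            notifyAll();                              // state changed: recheck
        }

        synchronized T take() throws InterruptedException {
            while (items.isEmpty()) wait();           // await an item
            T item = items.removeFirst();
            notifyAll();
            return item;
        }
    }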

By contrast, a smaller number of languages thus far incorporate what could be considered parallel programming
constructs, though that is changing rapidly. As with concurrent programming, parallel programming can be
supported by explicit language extensions, standard libraries, or some mixture of these two. A third option with
parallel programming is the use of program annotations, such as pragmas, providing direction to the compiler to
allow it to automatically parallelize an originally sequential algorithm.

One characteristic that distinguishes parallel programming is that the unit of parallel computation can often be less
than the execution of an entire function or procedure, but instead might represent one or more iterations of a
loop, or the evaluation of one part of a larger expression. Furthermore, the compiler and the underlying run-time
system are more involved in determining what portions of the code can actually run in parallel. This is quite
different from traditional concurrent programming constructs, which rely on explicit programmer decisions to
determine where the thread boundaries lie.

One of the first widely used languages with general-purpose parallel programming constructs was Cilk, designed by Charles Leiserson [2] at MIT, and now supported by Intel as part of their Intel Parallel Studio. Cilk allows the programmer to insert directives such as cilk_spawn and cilk_sync at strategic points in an algorithm, with cilk_spawn causing the evaluation of an expression to be forked off into a separate lightweight thread, and cilk_sync causing the program to wait for locally spawned parallel threads, so the results of their execution can be used. Furthermore, Cilk provides the ability to use cilk_for rather than simply for to indicate that the iterations of the given for-loop are candidates for parallel execution. Other languages now providing similar capabilities include OpenMP, which uses pragmas rather than language extensions to direct the insertion of parallel execution; the language Go from Google, which includes lightweight goroutines for parallel execution, with channels for communication; the language Rust from Mozilla Research, which supports large numbers of lightweight tasks communicating using ownership transfer to avoid race conditions; and the language ParaSail from AdaCore, which uses safe automatic parallelization based on a pointer-free, alias-free approach that simplifies divide-and-conquer algorithms.
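
As a concrete taste of this style, here is the spawn/sync pattern expressed with the Java fork/join library mentioned earlier, which schedules its lightweight tasks by work stealing; fork() plays roughly the role of cilk_spawn and join() that of cilk_sync. The task, cutoff, and data here are illustrative.

    import java.util.concurrent.ForkJoinPool;
    import java.util.concurrent.RecursiveTask;

    // Divide-and-conquer array sum: each task either sums a small slice
    // sequentially, or splits the slice in half, forking one half as a
    // separate lightweight task and computing the other half itself.
    final class SumTask extends RecursiveTask<Long> {
        private static final int CUTOFF = 1_000;
        private final long[] data;
        private final int lo, hi;

        SumTask(long[] data, int lo, int hi) {
            this.data = data; this.lo = lo; this.hi = hi;
        }

        @Override protected Long compute() {
            if (hi - lo <= CUTOFF) {            // small piece: just sum it
                long sum = 0;
                for (int i = lo; i < hi; i++) sum += data[i];
                return sum;
            }
            int mid = (lo + hi) / 2;
            SumTask left = new SumTask(data, lo, mid);
            left.fork();                        // like cilk_spawn
            long right = new SumTask(data, mid, hi).compute();
            return left.join() + right;         // like cilk_sync
        }

        public static void main(String[] args) {
            long[] data = new long[1_000_000];
            java.util.Arrays.fill(data, 1L);
            System.out.println(
                new ForkJoinPool().invoke(new SumTask(data, 0, data.length)));
        }
    }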

All of these parallel languages or extensions have adopted some variant of work stealing for the scheduling of
their lightweight threads. And all of these languages make it easier to move from a sequentially-oriented mindset
to a parallel-oriented one. Embedded and mobile programmers should begin experimenting with these languages
now, to be prepared as real-time prioritized capabilities are merged with work-stealing schedulers, to provide
the combination of reactivity and throughput needed for the advanced embedded and mobile applications on the
drawing boards for the near future.

S. Tucker Taft is VP and Director of Language Research at AdaCore. He joined AdaCore in 2011 as
part of a merger with SofCheck, which he had founded in 2002 to develop advanced static analysis
technology. Prior to that he was a Chief Scientist at Intermetrics, Inc. and its follow-ons for 22 years,
where in 1990-1995 he led the design of Ada 95. He holds an A.B. summa cum laude from Harvard University, where he has more recently taught compiler construction and programming
language design.

Further Reading:
1. P. Brinch Hansen (editor), The Origin of Concurrent Programming: From Semaphores to Remote Procedure Calls, Springer, June 2002.

2. R. D. Blumofe and C. E. Leiserson, Scheduling Multithreaded Computations by Work Stealing, Journal of the ACM, 46(5):720–748, September 1999.

3. C. Maia, L. Nogueira, and L. M. Pinho, Scheduling parallel real-time tasks using a fixed-priority work-stealing algorithm on multiprocessors, 8th IEEE Symposium on Industrial Embedded Systems (SIES), June 2013.

4. S. T. Taft, Systems Programming with Go, Rust, and ParaSail.
