Professional Documents
Culture Documents
Composable Scheduler Activations For Haskell
Composable Scheduler Activations For Haskell
Purdue e-Pubs
2014
Tim Harris
Oracle Labs, timothy.l.harris@oracle.com
Simon Marlow
Facebook UK Ltd., smarlow@fb.com
Report Number:
14-004
Sivaramakrishnan, KC; Harris, Tim; Marlow, Simon; and Peyton Jones, Simon, "Composable Scheduler
Activations for Haskell" (2014). Department of Computer Science Technical Reports. Paper 1774.
https://docs.lib.purdue.edu/cstech/1774
This document has been made available through Purdue e-Pubs, a service of the Purdue University Libraries.
Please contact epubs@purdue.edu for additional information.
Composable Scheduler Activations for Haskell
1 2014/3/26
Written by: Written in: with different scheduling or work-stealing algorithms. In-
deed, we would like the ability to combine multiple ULS’s
Application Concurrent Application Haskell in the same program. For example, in order to utilise the best
Developer
scheduling strategy, a program could dynamically switch
Runtime System from a priority-based scheduler to gang scheduling when
switching from general purpose computation to data-parallel
MVar Safe FFI Scheduler computation. Applications might also combine the sched-
Language
C ulers in a hierarchical fashion; a scheduler receives compu-
Developer
STM GC Async tational resources from its parent, and divides them among
Exception its children.
This goal is not not easy to achieve. The scheduler inter-
acts intimately with other RTS components including
Figure 1. The anatomy of the Glasgow Haskell Compiler
runtime system • MVars and transactional memory [10] allow Haskell
threads to communicate and synchronise; they may cause
performance of ULS’s is comparable to the highly opti- threads to block or unblock.
mised default scheduler of GHC (Section 7).
• The garbage collector must somehow know about the
2. Background run-queue on each HEC, so that it can use it as a root
for garbage collection.
To understand the design of the new concurrency substrate
for Haskell, we must first give some background on the • Lazy evaluation means that if a Haskell thread tries to
existing RTS support for concurrency in our target platform evaluate a thunk that is already under evaluation by an-
– the Glasgow Haskell Compiler (GHC). We then articulate other thread (it is a “black hole”), the former must block
the goals of our concurrency substrate. until the thunk’s evaluation is complete [9]. Matters are
made more complicated by asynchronous exceptions,
2.1 The GHC runtime system which may cause a thread to abandon evaluation of a
GHC has a sophisticated, highly tuned RTS that has a rich thunk, replacing the thunk with a “resumable black hole”.
support for concurrency with advanced features such as • A foreign-function call may block (e.g. when doing I/O).
software transactional memory [10], asynchronous excep- GHC’s RTS has can schedule a fresh task (OS thread)
tions [16], safe foreign function interface [17], and transpar- to re-animate the HEC, blocking the in-flight Haskell
ent scaling on multicores [9]. The Haskell programmer can thread, and scheduling a new one [17].
use very lightweight Haskell threads, which are executed
by a fixed number of Haskell execution contexts, or HECs. All of these components do things like “block a thread”
Each HEC is in turn animated by an operating system thread; or ”unblock a thread” that require interaction with the sched-
in this paper we use the term tasks for these OS threads, to uler. One possible response, taken by Li et al [14] is to pro-
distinguish them from Haskell threads. The choice of which gram these components, too, into Haskell. The difficulty is
Haskell thread is executed by which HEC is made by the that all they are intricate and highly-optimised. Moreover,
scheduler. unlike scheduling, there is no call from Haskell’s users for
GHC’s current scheduler is written in C, and is hard- them to be user-programmable.
wired into the RTS (Figure 1). It uses a single run-queue Instead, our goal is to tease out the scheduler implemen-
per processor, and has a single, fixed notion of work-sharing tation from rest of the RTS, establishing a clear API between
to move work from one processor to another. There is no the two, and leaving unchanged the existing implementation
notion of thread priority; nor is there support for advanced of MVars, STM, black holes, FFI, and so on.
scheduling policies such as gang or spatial scheduling. From Lastly, schedulers are themselves concurrent programs,
an application developer’s perspective, the lack of flexibility and they are particularly devious ones. Using the facilities
hinders deployment of new programming models on top of available in C, they are extremely hard to get right. Given
GHC such as data-parallel computations [4, 15], and appli- that the ULS will be implemented in Haskell, we would like
cations such as virtual machines [7] and web-servers [11] to utilise the concurrency control abstractions provided by
that can benefit from the ability to define custom scheduling Haskell (notably transactional memory) to simplify the task
policies. of scheduler implementation.
2.2 The challenge
Because there is such a rich design space for schedulers, our 3. Design
goal is to allow a user-level scheduler (ULS) to be written In this section, we describe the design of our concurrency
in Haskell, giving programmers the freedom to experiment substrate and present the concurrency substrate API. Along
2 2014/3/26
the way, we will describe how our design achieves the goals Written by: Written in:
put forth in the previous section.
Application Concurrent Application Haskell
3.1 Scheduler activation Developer
3 2014/3/26
its scheduler. Since t’s scheduler is already running, t will Since the activations are under STM monad, we have the
eventually be scheduled again. assurance that the ULS’ cannot be built with low-level un-
safe components such as locks and condition variables. Such
3.2 Software transactional memory low-level operations would be under IO monad, which can-
Since Haskell computations can run in parallel on different not be part of an STM transaction. Thus, our concurrency sub-
HECs, the substrate must provide a method for safely coordi- strate statically prevents the implementation of potentially
nating activities across multiple HECs. Similar to Li’s sub- unsafe schedulers.
strate design [14], we adopt transactional memory (STM),
as the sole multiprocessor synchronisation mechanism ex- 3.3.2 SCont management
posed by the substrate. Using transactional memory, rather The substrate offers primitives for creating, constructing and
than locks and condition variables make complex concurrent transferring control between SConts. The call (newSCont M )
programs much more modular and less error-prone [10] – creates a new SCont that, when scheduled, executes M . By
and schedulers are prime candidates, because they are prone default, the newly created SCont is associated with the ULS
to subtle concurrency bugs. of the invoking thread. This is done by copying the invoking
SCont’s activations.
3.3 Concurrency substrate An SCont is scheduled (i.e. is given control of a HEC) by
Now that we have motivated our design decisions, we will the switch primitive. The call (switch M ) applies M to the
present the API for the concurrency substrate. The con- current continuation s. Notice that (M s) is an STM compu-
currency substrate includes the primitives for instantiating tation. In a single atomic transaction switch performs the
and switching between language level threads, manipulating computation (M s), yielding an SCont s0 , and switches con-
thread local state, and an abstraction for scheduler activa- trol to s0 . Thus, the computation encapsulated by s0 becomes
tions. The API is presented below: the currently running computation on this HEC.
data SCont
Since our continuations are one-shot, capturing a contin-
type DequeueAct = SCont -> STM SCont uation simply fetches the reference to the underlying TSO
type EnqueueAct = SCont -> STM ()
object. Hence, continuation capture involves no copying, and
-- a c t i v a t i o n i n t e r f a c e is cheap. Using the SCont interface, a cooperative scheduler
dequeueAct :: DequeueAct
enqueueAct :: EnqueueAct
can be built as follows:
-- SCont m a n i p u l a t i o n yield :: IO ()
newSCont :: IO () -> IO SCont yield = switch (\ s -> enqueueAct s >> dequeueAct s )
switch :: ( SCont -> STM SCont ) -> IO ()
runOnIdleHEC :: SCont -> IO ()
3.4 Parallel SCont execution
-- M a n i p u l a t i n g local state
setDequeueAct :: DequeueAct -> IO () When the program begins execution, a fixed number of
setEnqueueAct :: EnqueueAct -> IO ()
getAux :: SCont -> STM Dynamic HECs (N) is provided to it by the environment. This sig-
setAux :: SCont -> Dynamic -> STM () nifies the maximum number of parallel computations in
the program. Of these, one of the HEC runs the main IO
3.3.1 Activation interface computation. All other HECs are in idle state. The call
runOnIdleHEC s initiates parallel execution of SCont s on
Rather than directly exposing the notion of a “thread”, the
an idle HEC. Once the SCont running on a HEC finishes
substrate offers one-shot continuations [3], which is of type
evaluation, the HEC moves back to the idle state.
SCont. An SCont is a heap-allocated object representing the
Notice that the upcall from the RTS to the dequeue acti-
current state of a Haskell computation. In the RTS, SConts
vation as well as the body of the switch primitive return an
are represented quite conventionally by a heap-allocated
SCont. This is the SCont to which the control would switch
Thread Storage Object (TSO), which includes the compu-
to subsequently. But what if such an SCont cannot be found?
tations stack and local state, saved registers, and program
This situation can occur during multicore execution, when
counter. Unreachable SConts are garbage collected.
the number of available threads is less than the number of
The call (dequeueAct s) invokes s’s dequeue activa-
HECs. If a HEC does not have any work to do, it better be
tion, passing s to it like a “self” parameter. The return type
put to sleep.
of dequeueAct indicates that the computation encapsulated
Notice that the result of the dequeue activation and the
in the dequeueAct is transactional (under STM monad4 ),
body of the switch primitive are STM transactions. GHC
which when discharged, returns an SCont. Similarly, the
today supports blocking operations under STM. When the
call (enqueueAct s) invokes the enqueue activation trans-
programmer invokes retry inside a transaction, the RTS
actionally, which enqueues s to its ULS.
blocks the thread until another thread writes to any of the
4 http://hackage.haskell.org/package/stm-2.1.1.0/docs/ transactional variables read by the transaction; then the
Control-Concurrent-STM.html thread is re-awoken, and retries the transaction [10]. This
4 2014/3/26
is entirely transparent to the programmer. Along the same in this case, one could imagine manipulating the blocked
lines, we interpret the use of retry within a switch or de- SCont’s aux state for accounting information such as time
queue activation transaction as putting the whole HEC to slices consumed for fair-share scheduling. Enqueue activa-
sleep. We use the existing RTS mechanism to resume the tion (enqueueActivation) finds the SCont’s HEC number
thread when work becomes available on the scheduler. by querying its stack-local state (the details of which is pre-
sented along with the next primitive). This HEC number
3.5 SCont local state (hec) is used to fetch the correct runqueue, to which the
The activations of an SCont can be read by dequeueAct SCont is appended to.
and enqueueAct primitives. In effect, they constitute the The next step is to initialise the scheduler. This involves
SCont-local state. Local state is often convenient for other two steps: (1) allocating the scheduler (newScheduler) and
purposes, so we also provide a single dynamically-typed5 initialising the main thread and (2) spinning up additional
field, the “aux-field”, for arbitrary user purposes. The aux- HECs (newHEC). We assume that the Haskell program wish-
field can be read from and written to using the primitives ing to utilise the ULS performs these two steps at the start
getAux and setAux. The API additionally allows an SCont of the main IO computation. The implementation of these
to change its own scheduler through setDequeueAct and primitives are given below:
setEnqueueAct primitives. newScheduler :: IO ()
newScheduler = do
-- I n i t i a l i s e A u x i l i a r y state
4. Developing concurrency libraries switch $ \ s -> do
counter <- newTVar (0:: Int )
In this section, we will utilise the concurrency substrate to setAux s $ toDyn $ (0:: Int , counter )
implement a multicore capable, round-robin, work-sharing return s
-- A l l o c a t e s c h e d u l e r
scheduler and a user-level MVar implementation. nc <- getNumHECs
sched <- ( Sched . listArray (0 , nc -1) ) <$ >
4.1 User-level scheduler replicateM n ( newTVar [])
-- I n i t i a l i s e a c t i v a t i o n s
The first step in designing a scheduler is to describe the setDequeueAct s $ d e q u e u e A c t i v a t i o n sched
setEnqueueAct s $ e n q u e u e A c t i v a t i o n sched
scheduler data structure. We utilise an array of runqueues,
with one queue per HEC. Each runqueue is represented by newHEC :: IO ()
newHEC = do
a transactional variable (a TVar), which can hold a list of -- Initial task
SConts. s <- newSCont $ switch dequeueAct
-- Run in p a r a l l e l
newtype Sched = Sched ( Array Int ( TVar [ SCont ]) ) runOnIdleHEC s
The next step is to provide an implementation for the First we will focus on initialising a new ULS (newScheduler).
scheduler activations. For load balancing purposes, we will spawn threads in a
round-robin fashion over the available HECs. For this pur-
d e que ueA ctiv ati on :: Sched -> SCont -> STM SCont
d e que ueA ctiv ati on ( Sched pa ) _ = do pose, we initialise a TVar counter, and store into the auxil-
cc <- getCurrentHEC -- get current HEC number iary state a pair (c, t) where c is the SCont’s home HEC and
l <- readTVar $ pa ! cc
case l of t is the counter for scheduling. Next, we allocate an empty
[] -> retry scheduler data structure (sched), and register the current
x : tl -> do
writeTVar ( pa ! cc ) tl thread with the scheduler activations. This step binds the
return x current thread to participate in user-level scheduling.
e n que ueA ctiv ati on :: Sched -> SCont -> STM () All other HECs act as workers (newHEC), scheduling the
e n que ueA ctiv ati on ( Sched pa ) sc = do threads that become available on their runqueues. The initial
dyn <- getAux sc
let ( hec :: Int , _ :: TVar Int ) = fromJust $ task created on the HEC simply waits for work to become
fromDynamic dyn available on the runqueue, and switches to it. Recall that al-
l <- readTVar $ pa ! hec
writeTVar ( pa ! hec ) $ l ++[ sc ] locating a new SCont copies the current SCont’s activations
to the newly created SCont. In this case, the main SCont’s
dequeueActivation either returns the SCont at the activations, initialised in newScheduler, are copied to the
front of the runqueue and updates the runqueue appro- newly allocated SCont. As a result, the newly allocated
priately, or puts the HEC to sleep if the queue is empty. SCont shares the same ULS with the main SCont. Finally,
Recall that performing retry within a dequeue activation we run the new SCont on a free HEC. Notice that sched-
puts the HEC to sleep. The HEC will automatically be wo- uler data structure is not directly accessed in newHEC, but
ken up when work becomes available i.e. queue becomes accessed through the activation interface.
non-empty. Although we ignore the SCont being blocked The Haskell program only needs to prepend the follow-
ing snippet to the main IO computation to utilise the ULS
5 http://hackage.haskell.org/package/base-4.6.0.1/docs/
implementation.
Data-Dynamic.html
5 2014/3/26
main = do takeMVar :: MVar a -> IO a
newScheduler takeMVar ( MVar ref ) = do
n <- getNumHECs h <- atomically $ newTVar undefined
replicateM_ (n -1) newHEC switch $ \ s -> do
... -- rest of the main code st <- readTVar ref
case st of
Empty ts -> do
How do we create new user-level threads in this sched- writeTVar ref $ Empty $ enqueue ts (h , s )
uler? For this purpose, we implement a forkIO primitive dequeueAct s
Full x ts -> do
that spawns a new user-level thread as follows: writeTVar h x
case deque ts of
forkIO :: IO () -> IO SCont
(_ , Nothing ) -> do
forkIO task = do
writeTVar ref $ Empty emptyQueue
numHECs <- getNumHECs
( ts ’ , Just (x ’ , s ’) ) -> do
-- epilogue : Switch to next thread
writeTVar ref $ Full x ’ ts ’
newSC <- newSCont ( task >> switch dequeueAct )
enqueueAct s ’
-- Create and i n i t i a l i s e new Aux state
return s
switch $ \ s -> do
atomically $ readTVar h
dyn <- getAux s
let ( _ :: Int , t :: TVar Int ) = fromJust $
fromDynamic dyn If the MVar is empty, the SCont enqueues itself into the
nextHEC <- readTVar t queue of pending takers. If the MVar is full, SCont con-
writeTVar t $ ( nextHEC + 1) ‘mod ‘ numHECs
setAux newSC $ toDyn ( nextHEC , t ) sumes the value and unblocks the next waiting putter SCont,
return s if any. The implementation of putMVar is the dual of this
-- Add new thread to s c h e d u l e r
atomically $ enqueueAct newSC implementation. Notice that the implementation only uses
return newSC the activations to block and resume the SConts interacting
through the MVar. This allows threads from different ULS’s
forkIO primitive spawns a new thread that runs concur-
to communicate over the same MVar, and hence the imple-
rently with its parent thread. What should happen after such
mentation is scheduler agnostic.
a thread has run to completion? We must request the sched-
uler to provide us the next thread to run. This is captured in
5. Semantics
the epilogue e, and is appended to the given IO computation
task. Next, we allocate a new SCont, which implicitly in- In this section, we present the formal semantics of the con-
herits the current SCont’s scheduler activations. In order to currency substrate primitives introduced in Section 3.3. We
spawn threads in a round-robin fashion, we create a new aux- will subsequently utilise the semantics to formally describe
iliary state for the new SCont and prepare it such that when the interaction of the ULS with the RTS in Section 6. Our se-
unblocked, the new SCont is added to the runqueue on HEC mantics closely follows the implementation. The aim of this
nextHEC. Finally, the newly created SCont is added to the is to precisely describe the issues with respect to the interac-
scheduler using its enqueue activation. tions between the ULS and the RTS, and have the language
The key aspect of this forkIO primitive is that it does not to enunciate our solutions.
directly access the scheduler data structure, but does so only 5.1 Syntax
through the activation interface. As a result, aside from the
auxiliary state manipulation, the rest of the code pretty much Figure 5 shows the syntax of program states. The program
can stay the same for any user-level forkIO primitive. Addi- state P is a soup S of HECs, and a shared heap Θ. The
tionally, we can implement a yield primitive similar to the operator k in the HEC soup is associative and commutative.
one described in Section 3.3.2. Due to scheduler activations, Each HEC is either idle (Idle) or a triple hs, M, Dit where
the interaction with the RTS concurrency mechanisms come s is a unique identifier of the currently executing SCont, M
for free, and we are done! is the currently executing term, D represents SCont-local
state. Each HEC has an optional subscript t representing its
4.2 Scheduler agnostic user-level MVars current state, and the absence of the subscript represents a
Our scheduler activations abstracts the interface to the HEC that is running. As mentioned in Section 3.4, when the
ULS’s. This fact can be exploited to build scheduler agnostic program begins execution, the HEC soup has the following
implementation of user-level concurrency libraries such as configuration:
MVars. The following snippet describes the structure of an Initial HEC Soup S = hs, M, Di k Idle1 k . . . k IdleN −1
MVar implementation: where M is the main computation, and all other HECs are
newtype MVar a = MVar ( TVar ( MVPState a ) ) idle. We represent the stack local state D as a tuple with
data MVPState a = Full a [( a , SCont ) ]
| Empty [( IORef a , SCont ) ]
two terms and a name (M, N, r). Here, M , N , and r are
the dequeue activation, enqueue activation, and a TVar rep-
MVar is either empty with a list of pending takers, or full resenting the auxiliary storage of the current SCont on this
with a value and a list of pending putters. An implementation HEC. For perspicuity, we define accessor functions as shown
of takeMVar function is presented below: below.
6 2014/3/26
a
x, y ∈ V ariable r, s, ∈ N ame Top-level transitions S; Θ =⇒ S 0 ; Θ0
7 2014/3/26
to retry the transaction through the context. The act of block-
ing, wake up and undoing the effects of the transaction are
HEC transitions H; Θ =⇒ H 0 ; Θ0
handled in Section 6.2.
∗
s; M ; D; Θ return N ; Θ0 5.4 SCont semantics
(TATOMIC )
hs, E[atomically M ], Di; Θ =⇒
hs, E[return N ], Di; Θ0 The semantics of SCont primitives are presented in Figure 8.
Each SCont has a distinct identifier s (concretely, its heap
∗
s; M ; D; Θ throw N ; Θ0 address). An SCont’s state is represented by the pair (M, D)
(TT HROW )
hs, E[atomically M ], Di; Θ =⇒ where M is the term under evaluation and D is the local
hs, E[throw N ], Di; Θ ∪ (Θ0 \ Θ) state.
Rule N EW SC ONT binds the given IO computation and a
STM transitions s, M, D; Θ; M 0 ; Θ0 new SCont-local state pair to a new SCont s0 , and returns s0 .
Notice that the newly created SCont inherits the activations
M →N of the calling SCont. This implicitly associates the new
(TP URE S TEP )
s; P[M ]; D; Θ P[N ]; Θ SCont with the invoking SCont’s scheduler.
∗ The rules for switch (S WITCH S ELF, S WITCH, and
s; M ; D; Θ return M 0 ; Θ0 S WITCH E XN) begin by atomically evaluating the body of
(TC ATCH )
s; P[catchSTM M N ]; D; Θ P[return M 0 ]; Θ0
switch M applied to the current SCont s. If the resultant
∗
s; M ; D; Θ throw M 0 ; Θ0
SCont is the same as the current one (S WITCH S ELF), then
(TCE XN ) we simply commit the transaction and there is nothing more
s; P[catchSTM M N ]; D; Θ P[N M 0 ]; Θ∪(Θ0 \Θ)
to be done. If the resultant SCont s0 is different from the
∗
s; M ; D; Θ retry; Θ0 current SCont s (S WITCH), we transfer control to the new
(TCR ETRY )
s; P[catchSTM M N ]; D; Θ P[retry]; Θ0 SCont s0 by making it the running SCont and saving the
state of the original SCont s in the heap. If the switch primi-
r fresh
(TN EW ) tive happens to throw an exception, the updates by the trans-
s; P[newTVar M ]; D; Θ P[return r]; Θ[r 7→ M ]
action are discarded (S WITCH E XN).
s; P[readTVar r]; D; Θ P[return Θ(r)]; Θ (TR EAD ) The alert reader will notice that the rules for switch
duplicate much of the paraphernalia of an atomic transaction
s; P[writeTVar r M ]; D; Θ P[return()]; Θ[r 7→ M ] (TW RITE )
(Figure 7), but that is unavoidable because the switch to a
new continuation must form part of the same transaction as
the argument computation.
Figure 7. Operational semantics for software transactional
memory
6. Interaction with the RTS
ferred until Section 6.2. A STM transition is of the form The key aspect of our design is composability of ULS’s with
s; M ; D; Θ M 0 ; Θ0 , where M is the current monadic the existing RTS concurrency mechanisms (Section 3.1). In
term under evaluation, and the heap Θ binds transactional this section, we will describe in detail the interaction of RTS
variables TVar locations r to their current values. The cur- concurrency mechanisms and the ULS’s. The formalisation
rent SCont s and its local state D are read-only, and are brings out the tricky cases associated with the interaction
not used at all in this section, but will be needed when ma- between the ULS and the RTS.
nipulating SCont-local state. The reduction produces a new
term M 0 and a new heap Θ0 . Rule TP URE S TEP is similar to 6.1 Timer interrupts
P URE S TEP rule in Figure 6. STM allows creating (TN EW), In GHC, concurrent threads are preemptively scheduled. The
reading (TR EAD), and writing (TW RITE) to transactional RTS maintains a timer that ticks, by default, every 20ms. On
variables. a tick, the current SCont needs to be de-scheduled and a
The most important rule is TATOMIC which combines new SCont from the scheduler needs to be scheduled. The
multiple STM transitions into a single program transition. semantics of handling timer interrupts is shown in Figure 9.
Thus, other HECs are not allowed to witness the interme- The Tick label on the transition arrow indicates an inter-
diate effects of the transaction. The semantics of excep- action with the RTS; we call such a label an RTS-interaction.
tion handling under STM is interesting (rules TCE XN and In this case the RTS-interaction Tick indicates that the RTS
TT HROW). Since an exception can carry a TVar allocated wants to signal a timer tick6 . The transition here injects
in the aborted transaction, the effects of the current transac- yield into the instruction stream of the SCont running on
tion are undone except for the newly allocated TVars. Oth-
erwise, we would have dangling pointer corresponding to 6 Technically we should ensure that every HEC receives a tick, and of
such TVars. Rule TCR ETRY simply propagates the request course our implementation does just that, but we elide that here.
8 2014/3/26
HEC transitions H; Θ =⇒ H 0 ; Θ0
(N EW SC ONT )
(S WITCH S ELF )
s0 fresh r fresh D0 = (deq(D), enq(D), r) ∗
s; M s; D; Θ return s; Θ0
hs, E[newSCont M ], Di; Θ =⇒
hs, E[switch M ], Di; Θ =⇒ hs, E[return ()], Di; Θ0
hs, E[return s0 ], Di; Θ[s0 7→ (M, D0 )][r 7→ toDyn ()]
(S WITCH ) (S WITCH E XN )
∗ ∗
s; M s; D; Θ return s0 ; Θ0 [s0 7→ (M 0 , D0 )] s; M s; D; Θ throw N ; Θ0
hs, E[switch M ], Di; Θ =⇒ hs0 , M 0 , D0 i; Θ0 [s 7→ (E[return ()], D)] hs, E[switch M ], Di; Θ =⇒ hs, E[throw N ], Di; Θ ∪ (Θ0 \ Θ)
Idle k hs, E[runOnIdleHEC s0 ], Di; Θ[s0 7→ (M 0 , D0 )] =⇒ hs, return (), Di; Θ =⇒ Idle; Θ (D ONE U NIT )
hs0 , M 0 , D0 i k hs, E[return ()], Di; Θ hs, throw N, Di; Θ =⇒ Idle; Θ (D ONE E XN )
a a
HEC transitions H; Θ =⇒ H 0 ; Θ0 HEC transitions H; Θ =⇒ H 0 ; Θ0
(TR ETRYATOMIC )
yield = switch (λs. enq(D) s >> deq(D) s)
(T ICK ) ∗
Tick
hs, M, Di; Θ ==⇒ hs, yield >> M, Di; Θ s; M ; D; Θ retry; Θ0
deq
hs, E[atomically M ], Di; Θ ,→ H 0 ; Θ00
STMBlock s
hs, E[atomically M ], Di; Θ ======⇒ H 0 ; Θ00
Figure 9. Handling timer interrupts
(TR ESUME R ETRY )
this HEC, at a GC safe point, where yield behaves just like enq s
H; Θ ,→ H 0 ; Θ0
the definition in Section 3.3.2.
RetrySTM s
H; Θ ======⇒ H 0 ; Θ0
6.2 STM blocking operations (TR ETRY S WITCH )
As mentioned before (Section 3.4), STM supports blocking ∗
s; M s; D; Θ retry; Θ0
operations through the retry primitive. Figure 10 gives the
STMBlock s
semantics for STM retry operation. hs, E[switch M ], Di; Θ ======⇒ hs, E[switch M ], DiSleeping ; Θ
(TWAKEUP )
6.2.1 Blocking the SCont
RetrySTM s
hs, E[M ], DiSleeping ; Θ ======⇒ hs, E[M ], Di; Θ
Rule TR ETRYATOMIC is similar to TT HROW in Figure 7.
It runs the transaction body M ; if the latter terminates with
retry, it abandons the effects embodied in Θ0 , reverting to Figure 10. STM Retry
deq
Θ. But, unlike TT HROW it then uses an auxiliary rule ,→ ,
defined in Figure 11, to fetch the next SCont to switch to.
The transition in TR ETRYATOMIC is labelled with the RTS dequeue activation. s0 is made the running SCont on this
interaction STMBlock s, indicating that the RTS assumes re- HEC.
sponsibility for s after the reduction. It is necessary that the dequeue upcall be performed on
The rules presented in Figure 11 are the key rules in a new SCont s0 , and not on the SCont s being blocked.
abstracting the interface between the ULS and the RTS, and At the point of invocation of the dequeue upcall, the RTS
describe the invocation of upcalls. In the sequel, we will believes that the blocked SCont s is completely owned by
often refer to these rules in describing the semantics of the the RTS, not running, and available to be resumed. Invoking
RTS interactions. Rule U P D EQUEUE in Figure 11 stashes the dequeue upcall on the blocked SCont s can lead to a race
s (the SCont to be blocked) in the heap Θ, instantiates an on s between multiple HECs if s happens to be unblocked
ephemeral SCont that fetches the dequeue activation b from and enqueued to the scheduler before the switch transaction
s’s local state D, and switches to the SCont returned by the is completed.
9 2014/3/26
a
Dequeue upcall instantiation
deq
H; Θ ,→ H 0 ; Θ0 HEC transitions H; Θ =⇒ H 0 ; Θ0
(U P D EQUEUE ) OC s
hs, E[outcall r], Di; Θ ===⇒ (OCB LOCK )
s0 fresh r fresh D0
= (deq(D), enq(D), r) hs, E[outcall r], DiOutcall ; Θ
M 0 = switch (λx. deq(D) s)
OCRet s M
Θ0 = Θ[s 7→ (M, D)][r 7→ toDyn ()] hs, E[outcall r], DiOutcall ; Θ ======⇒ (OCR ET FAST )
deq hs, E[M ], Di; Θ
hs, M, Di; Θ ,→ hs0 , M 0 , D0 i; Θ0
deq
enq s
hs, M, Di; Θ ,→ H 0 ; Θ0
(OCS TEAL )
Enqueue upcall instantiation H; Θ ,→ H 0 ; Θ0 OCSteal s
hs, M, DiOutcall ; Θ =====⇒ H 0 ; Θ0
(U P E NQUEUE I DLE ) enq s
H; Θ[s 7→ (E[M ], D)] ,→ H 0 ; Θ0
(OCR ET S LOW )
s0 fresh r fresh D0
= (deq(D), enq(D), r) OCRet s M
H; Θ[s 7→ (E[outcall r], D)] ======⇒ H 0 ; Θ0
M 0 = atomically (enq(D) s)
Θ0 = Θ[s 7→ (M, D)][r 7→ toDyn ()]
enq s
Idle; Θ[s 7→ (M, D)] ,→ hs0 , M 0 , D0 i; Θ0 Figure 12. Safe foreign call transitions
6.2.4 Implementation of upcalls
(U P E NQUEUE RUNNING )
Notice that the rules U P D EQUEUE and U P E NQUEUE I DLE in
M 00 = atomically (enq(D) s) >> M 0
enq s
Figure 11 instantiate a fresh SCont. The freshly instantiated
hs0 ,M 0 ,D0 i; Θ[s 7→ (M,D)] ,→ hs0 ,M 00 ,D0 i; Θ[s 7→ (M,D)] SCont performs just a single transaction; switch in U P D E -
QUEUE and atomically in U P E NQUEUE I DLE , after which
it is garbage-collected. Since instantiating a fresh SCont for
Figure 11. Instantiating upcalls every upcall is unwise, the RTS maintains a dynamic pool
of dedicated upcall SConts for performing the upcalls. It
is worth mentioning that we need an “upcall SCont pool”
6.2.2 Resuming the SCont
rather than a single “upcall SCont” since the upcall trans-
Some time later, the RTS will see that some thread has actions can themselves get blocked synchronously on STM
written to one of the TVars read by s’s transaction, so it will retry as well as asynchronously due to optimizations for
signal an RetrySTM s interaction (rule TR ESUME R ETRY). lazy evaluation (Section 6.5).
enq s
Again, we use an auxiliary transition ,→ to enqueue the
SCont to its scheduler (Figure 11). 6.3 Safe foreign function calls
deq Foreign calls in GHC are highly efficient but intricately
Unlike ,→ transition, unblocking an SCont has nothing
to do with the computation currently running on any HEC. interact with the scheduler [17]. Much of it owes to the the
If we find an idle HEC (rule U P E NQUEUE I DLE), we instan- RTS’s task model. Each HEC is animated by one of a pool of
tiate a new ephemeral SCont s0 to enqueue the SCont s. tasks (OS threads); the current task may become blocked in
The actual unblock operation is achieved by fetching SCont a foreign call (e.g. a blocking I/O operation), in which case
s’s enqueue activation, applying it to s and atomically per- another task takes over the HEC. However, at most only one
forming the resultant STM computation. If we do not find task ever has exclusive access to a HEC.
any idle HECs (rule U P E NQUEUE RUNNING), we pick one GHC’s task model ensures that a HEC performing a safe-
of the running HECs, prepare it such that it first unblocks the foreign call only blocks the Haskell thread (and the task)
SCont s before resuming the original computation. making the call but not the other threads running on the
HEC’s scheduler. However, it would be unwise to switch the
thread (and the task) on every foreign call as most invoca-
6.2.3 HEC sleep and wakeup tions are expected to return in a timely fashion. In this sec-
Recall that invoking retry within a switch transaction or tion, we will discuss the interaction of safe-foreign function
dequeue activation puts the HEC to sleep (Section 3.4). calls and the ULS. In particular, we restrict the discussion to
Also, notice that the dequeue activation is always invoked outcalls — calls made from Haskell to C.
by the RTS from a switch transaction (Rule U P D EQUEUE). Our decision to preserve the task model in the RTS allows
This motivates rule TR ETRY S WITCH: if a switch transac- us to delegate much of the work involved in safe foreign call
tion blocks, we put the whole HEC to sleep. Then, dual to to the RTS. We only need to deal with the ULS interaction,
TR ESUME R ETRY, rule TWAKEUP wakes up the HEC when and not the creation and coordination of tasks. The semantics
the RTS sees that the transaction may now be able to make of foreign call handling is presented in Figure 12. Rule
progress. OCB LOCK illustrates that the HEC performing the foreign
10 2014/3/26
call moves into the Outcall state, where it is no longer
runnable. In the fast path (rule OCR ET FAST), the foreign HEC transitions
a
H; Θ =⇒ H 0 ; Θ0
call returns immediately with the result M , and the HEC
resumes execution with the result plugged into the context. deq
In the slow path, the RTS may decide to pay the cost of hs, M, Di; Θ ,→ H 0 ; Θ0
BlockBH s
(B LOCK BH)
task switching and resume the scheduler (rule OCS TEAL). hs, M, Di; Θ = =⇒ H 0 ; Θ0
====
The scheduler is resumed using the dequeue upcall. Once the enq s
foreign call eventually returns, the SCont s blocked on the H; Θ ,→ H 0 ; Θ0
ResumeBH s
(R ESUME BH)
foreign call can be resumed. Since we have already resumed H; Θ =======⇒ H 0 ; Θ0
the scheduler, the correct behaviour is to prepare the SCont
s with the result and add it to its ULS. Rule OCR ET S LOW
achieves this through enqueue upcall. Figure 13. Black holes
6.4 Timer interrupts and transactions threads runnable. This mechanism, and its implementation
on a multicore, is described in detail in earlier work [9].
What if a timer interrupt occurs during a transaction? The
Clearly this is another place where the RTS may initiate
(T ICK ) rule of Section 6.1 is restricted to HEC transitions,
blocking. We can describe the common case with rules simi-
and says nothing about STM transitions. One possibility
lar to those of Figure 10, with rules shown in Figure 13. The
(Plan A) is that transactions should not be interrupted, and
RTS initiates the process with a BlockBH s action, taking
ticks should only be delivered at the end. This is faithful to
ownership of the SCont s. Later, when the evaluation of the
the semantics expressed by the rule, but it does mean that a
thunk is complete, the RTS initiate an action ResumeBH s,
rogue transaction could completely monopolise a HEC.
which returns ownership to s’s scheduler.
An alternative possibility (Plan B) is for the RTS to roll
But these rules only apply to HEC transitions, outside
the transaction back to the beginning, and then deliver the
transactions. What if a black hole is encountered during an
tick using rule (T ICK ). That too is implementable, but this
STM transaction? We addressed this same question in the
time the risk is that a slightly-too-long transaction would
context of timer interrupts, in Section 6.4, and we adopt the
always be rolled back, so it would never make progress.
same solution. The RTS behaves as if the black-hole suspen-
Our implementation behaves like Plan B, but gives better
sion and resumption occurred just before the transaction, but
progress guarantees, while respecting the same semantics.
the implementation actually arranges to resume the transac-
Rather than rolling the transaction back, the RTS suspends
tion from where it left off.
the transaction mid-flight. None of its effects are visible to
Just as in Section 6.4, we need to take particular care
other SConts; they are confined to its SCont-local transac-
with switch transactions. Suppose a switch transaction
tion log. When the SCont is later resumed, the transaction
encounters a black-holed thunk under evaluation by some
continues from where it left off, rather than starting from
other SCont B; and suppose we try to suspend the transaction
scratch. Of course, time has gone by, so when it finally tries
(either mid-flight or with roll-back) using rule (B LOCK BH).
to commit there is a higher chance of failure, but at least deq
uncontended access will go through. Then the very next thing we will do (courtesy of ,→ ) is a
That is fine for vanilla atomically transactions. But switch transaction; and that is very likely to encounter the
what about the special transactions run by switch? If we very same thunk. Moreover, it is just possible that the thunk
are in the middle of a switch transaction, and suspend it to is under evaluation by an SCont in this very scheduler’s run-
deliver a timer interrupt, rule (T ICK ) will initiate . . . a switch queue, so the black hole is preventing us from scheduling the
transaction! And that transaction is likely to run the very very SCont that is evaluating it. Deadlock beckons!
same code that has just been interrupted. It seems much sim- In the case of timer interrupts we solved the problem by
pler to revert to Plan A: the RTS does not deliver timer inter- switching them off in switch transactions, and it turns out
rupts during a switch transaction. If the scheduler has rogue that we can effectively do the same for thunks. Since we
code, then it will monopolise the HEC with no recourse. cannot sensibly suspend the switch transaction, we must
find a way for it to make progress. Fortunately, GHC’s RTS
6.5 Black holes allows us to steal the thunk from the SCont that is evaluating
In a concurrent Haskell program, a thread A may attempt to it, and that suffices. The details are beyond the scope of this
evaluate a thunk x that is already being evaluated by another paper, but the key moving parts are already part of GHC’s
thread B. To avoid duplicate evaluation the RTS (in intimate implementation of asynchronous exceptions [16, 20].
cooperation with the compiler) arranges for B to blackhole
the thunk when it starts to evaluate x. Then, when A attempts 6.6 Interaction with RTS MVars
to evaluate x, it finds a black hole, so the RTS enqueues An added advantage of our scheduler activation interface is
A to await its completion. When B finishes evaluating x it that we are able to reuse the existing MVar implementation
updates the black hole with its value, and makes any queued in the RTS. Whenever an SCont s needs to block on or
11 2014/3/26
deq enq s Baseline Vanilla LWC
unblock from an MVar, the RTS invokes the ,→ or ,→ Benchmark (1 proc) 1 HEC Fastest 1 HEC Fastest
upcall, respectively. This significantly reduces the burden of (# HECs) (# HECs)
migrating to a ULS implementation. k-nucleotide 10.60 10.62 4.82 (8) 10.61 4.83 (8)
mandelbrot 85.30 90.83 3.21 (48) 87.06 2.19 (48)
6.7 Asynchronous exceptions spectral-norm 125.76 125.91 2.92 (48) 125.76 2.84 (48)
chameneos 4.62 5.71 5.71 (1) 18.25 12.35 (2)
GHC’s supports asynchronous exceptions in which one primes-sieve 32.52 36.33 36.33 (1) 223 13.7 (48)
thread can send an asynchronous interrupt to another [16].
This is a very tricky area; for example, if a thread is blocked Figure 14. Benchmark results. All times are in seconds.
on a user-level MVar (Section 4.2), and receives an excep-
tion, it should wake up and do something — even though it is In order to evaluate the performance and quantify the
linked onto an unknown queue of blocked threads. Our im- overheads of LWC substrate, we picked the following Haskell
plementation does in fact handle asynchronous exceptions, concurrency benchmarks from The Computer Language
but we are not yet happy with the details of the design, and Benchmarks Game [24]: k-nucleotide, mandelbrot,
in any case space precludes presenting them here. spectral-norm and chameneos. We also implemented
a concurrent prime number generator using sieve of Er-
6.8 On the correctness of user-level schedulers atosthenes (primes-sieve), where the threads commu-
While the concurrency substrate exposes the ability to build nicate over the MVars. For our experiments, we gener-
ULS’s, the onus is on the scheduler implementation to en- ated the first 10000 primes. The benchmarks offer vary-
sure that it is sensible. The invariants such as not switching ing degrees of parallelisation opportunity. k-nucleotide,
to a running thread, or a thread blocked in the RTS, are not mandelbrot and spectral-norm are computation inten-
statically enforced by the concurrency substrate, and care sive, while chameneos and primes-sieve are commu-
must be taken to preserve these invariants. Our implemen- nication intensive and are specifically intended to test the
tation dynamically enforces such invariants through runtime overheads of thread synchronisation.
assertions. We also expect that the activations do not raise an The LWC version of the benchmarks utilised the sched-
exception that escape the activation. Activations raising ex- uler and the MVar implementation described in Section 4.
ceptions indicates an error in the ULS implementation, and For comparison, the benchmark programs were also im-
the substrate simply reports an error to the standard error plemented using Control.Concurrent on a vanilla GHC
stream. implementation. Experiments were performed on a 48-core
The fact that the scheduler itself is now implemented in AMD Opteron server, and the GHC version was 7.7.20130523.
user-space complicates error recovery and reporting when The results are presented in Figure 14. For each bench-
threads become unreachable. A thread suspended on an mark, the baseline is the non-threaded (not compiled with
ULS may become unreachable if the scheduler data struc- -threaded) vanilla GHC version. The non-threaded version
ture holding it becomes unreachable. A thread indefinitely does not support multi-processor execution of concurrent
blocked on an RTS MVar operation is raised with an ex- programs, but also does not include the mechanisms nec-
ception and added to its ULS. This helps the correspond- essary for (and the overheads included in) multi-processor
ing thread from recovering from indefinitely blocking on an synchronisation. Hence, the non-threaded version of a pro-
MVar operation. gram running on 1 processor is faster than the corresponding
However, the situation is worse if the ULS itself becomes threaded version.
unreachable; there is no scheduler to run this thread! Hence, For the vanilla and LWC versions (both compiled with
salvaging such a thread is not possible. In this case, imme- -threaded), we report the running times on 1 HEC as well
diately after garbage collection, our implementation logs an as the fastest running time observed with additional HECs.
error message to the standard error stream along with the Additionally, we report the HEC count corresponding to the
unreachable SCont (thread) identifier. fastest configuration. All the times are reported in seconds.
In k-nucleotide and spectral-norm benchmarks, the
7. Results performance of LWC version was indistinguishable from
the vanilla version. The threaded versions of the bench-
Our implementation is a fork of GHC,7 and supports all of
mark programs were fastest on 8 HECs and 48 HECs
the features discussed in the paper. We have been very par-
on k-nucleotide and spectral-norm, respectively. In
ticular not to compromise on any of the existing features in
mandelbrot benchmark, LWC version was faster than the
GHC. As shown in Section 4, porting existing concurrent
vanilla version. While the vanilla version was 29× faster
Haskell program to utilise a ULS only involves few addi-
than the baseline, LWC version was 38× faster. In the vanilla
tional lines of code.
GHC, the RTS thread scheduler by default spawns a thread
7 Thedevelopment branch of LWC substrate is available at https:// on the current HEC and only shares the thread with other
github.com/ghc/ghc/tree/ghc-lwc2 HECs if they are idle. The LWC scheduler (described in Sec-
12 2014/3/26
tion 4) spawns threads by default in a round-robin fashion integrates well with GHC’s safe foreign-function interface
on all HECs. This simple scheme happens to work better in through the activation interface (Section 6.3).
mandelbrot since the program is embarrassingly parallel. While Manticore [22], and MultiMLton [27] utilise low-
In chameneos benchmark, the LWC version was 3.9× level compare-and-swap operation as the core synchronisa-
slower than the baseline on 1 HEC and 2.6× slower on 2 tion primitive, Li et al.’s concurrency substrate [14] for GHC
HECs, and slows down with additional HECs as chameneos was the first to utilise transactional memory for multiproces-
does not present much parallelisation opportunity. The sor synchronisation for in the context of ULS’s. Our work
vanilla chameneos program was fastest on 1 HEC, and was borrows the idea of using STM for synchronisation. Unlike
1.24× slower than the baseline. In primes-sieve bench- Li’s substrate, we retain the key components of the concur-
mark, while the LWC version was 6.8× slower on one HEC, rency support in the runtime system. Not only does alleviate
the vanilla version was 1.3X slower, when compared to the the burden of implementing the ULS, but enables us to safely
baseline. handle the issue of blackholes that requires RTS support,
In chameneos and primes-sieve, we observed that the and perform blocking operations under STM. In addition,
LWC implementation spends around half of its execution Li’s substrate work uses explicit wake up calls for unblock-
running the transactions for invoking the activations or MVar ing sleeping HECs. This design has potential for bugs due
operations. Additionally, in these benchmarks, LWC version to forgotten wake up messages. Our HEC blocking mecha-
performs 3X-8× more allocations than the vanilla version. nism directly utilises STM blocking capability provided by
Most of these allocations are due to the data structure used in the runtime system, and by construction eliminates the pos-
the ULS and the MVar queues. In the vanilla primes-sieve sibility of forgotten wake up messages.
implementation, these overheads are negligible. This is an Scheduler activations [1] have successfully been demon-
unavoidable consequence of implementing concurrency li- strated to interface kernel with the user-level process sched-
braries in Haskell. uler [2, 26]. Similar to scheduler activations, Psyche [18] al-
Luckily, these overheads are parallelisable. In primes- lows user-level threads to install event handlers for scheduler
sieve benchmark, while the vanilla version was fastest on 1 interrupts and implement the scheduling logic in user-space.
HEC, LWC version scaled to 48 HECs, and was 2.37× faster Unlike these works, our system utilises scheduler activations
than the baseline program. This gives us the confidence that in the language runtime rather than OS kernel. Moreover,
with careful optimisations and application specific heuristics our activations being STM computations allow them to be
for the ULS and the MVar implementation, much of the composed with other language level transactions in Haskell,
overheads in the LWC version can be eliminated. enabling scheduler-agnostic concurrency library implemen-
tations.
13 2014/3/26
fide one-shot continuation, and not simply a reference to [16] S. Marlow, S. P. Jones, A. Moran, and J. Reppy. Asynchronous
the underlying TSO object. As a result, concurrency sub- exceptions in haskell. In PLDI, pages 274–285, 2001.
strate should be able to better handle erroneous scheduler be- [17] S. Marlow, S. P. Jones, and W. Thaller. Extending the haskell
haviours rather than raising an error and terminating. As for foreign function interface with concurrency. In Haskell, pages
the implementation, we would like to explore the effective- 22–32, 2004.
ness user-level gang scheduling for Data Parallel Haskell [4] [18] B. D. Marsh, M. L. Scott, T. J. Leblanc, and E. P. Markatos.
workloads, and priority scheduling for Haskell based web- First-class user-level threads. In OSDI, pages 110–121, 1991.
servers [11] and virtual machines [7]. [19] Microsoft Corp. Common Language Runtime (CLR),
2014. http://msdn.microsoft.com/en-us/library
/8bs2ecf4(v=vs.110).aspx.
References
[20] A. Reid. Putting the spine back in the spineless tagless g-
[1] T. E. Anderson, B. N. Bershad, E. D. Lazowska, and H. M. machine: An implementation of resumable black-holes. In
Levy. Scheduler activations: effective kernel support for the IFL, pages 186–199, 1999.
user-level management of parallelism. ACM Trans. Comput.
Syst., 10(1):53–79, Feb. 1992. [21] J. Reppy. Concurrent Programming in ML. Cambridge Uni-
versity Press, 2007.
[2] A. Baumann, P. Barham, P.-E. Dagand, T. Harris, R. Isaacs,
[22] J. Reppy, C. V. Russo, and Y. Xiao. Parallel concurrent ml. In
S. Peter, T. Roscoe, A. Schüpbach, and A. Singhania. The
ICFP, pages 257–268, 2009.
multikernel: a new os architecture for scalable multicore sys-
tems. In SOSP, pages 29–44, 2009. [23] O. Shivers. Continuations and Threads: Expressing Machine
Concurrency Directly in Advanced Languages. In Continua-
[3] C. Bruggeman, O. Waddell, and R. K. Dybvig. Representing
tions Workshop, 1997.
control in the presence of one-shot continuations. In PLDI,
pages 99–107, 1996. [24] Shootout. The Computer Language Benchmarks Game, 2014.
http://benchmarksgame.alioth.debian.org/.
[4] M. M. T. Chakravarty, R. Leshchinskiy, S. Peyton Jones,
G. Keller, and S. Marlow. Data parallel haskell: a status re- [25] M. Wand. Continuation-based multiprocessing. In LFP, pages
port. In DAMP, pages 10–18, 2007. 19–28, 1980.
[5] M. Fluet, M. Rainey, and J. Reppy. A scheduling framework [26] N. J. Williams. An Implementation of Scheduler Activations
for general-purpose parallel languages. In ICFP, pages 241– on the NetBSD Operating System. In Usenix ATC, pages 99–
252, 2008. 108, 2002.
[27] L. Ziarek, K. Sivaramakrishnan, and S. Jagannathan. Compos-
[6] D. Frampton, S. M. Blackburn, P. Cheng, R. J. Garner,
able Asynchronous Events. In PLDI, pages 628–639, 2011.
D. Grove, J. E. B. Moss, and S. I. Salishev. Demystifying
magic: high-level low-level programming. In VEE, pages 81–
90, 2009.
[7] Galois. Haskell Lightweight Virtual Machine (HaLVM),
2014. http://corp.galois.com/halvm.
[8] GHC. Glasgow Haskell Compiler, 2014.
http://www.haskell.org/ghc.
[9] T. Harris, S. Marlow, and S. Peyton Jones. Haskell on
a shared-memory multiprocessor. In ACM Workshop on
Haskell, Tallin, Estonia, 2005.
[10] T. Harris, S. Marlow, S. Peyton-Jones, and M. Herlihy. Com-
posable memory transactions. In PPoPP, pages 48–60, 2005.
[11] Haskell. Haskell Web Development, 2014.
http://www.haskell.org/haskellwiki/Web/Servers.
[12] HotSpotVM. Java SE HotSpot at a Glance, 2014.
http://www.oracle.com/technetwork/java/javase
/tech/index-jsp-137187.html.
[13] IBM. Java Platform Standard Edition (Java SE), 2014.
http://www.ibm.com/developerworks/java/jdk/.
[14] P. Li, S. Marlow, S. Peyton Jones, and A. Tolmach.
Lightweight concurrency primitives for ghc. In Haskell, pages
107–118, 2007.
[15] B. Lippmeier, M. Chakravarty, G. Keller, and S. Peyton Jones.
Guiding parallel array fusion with indexed types. In Haskell,
pages 25–36, 2012.
14 2014/3/26
Appendix and a TVar representing auxiliary storage, respectively. For
perspicuity, we define accessor functions as shown below.
Semantics of local state manipulation deq(M, , ) = M enq( , M, ) = M aux( , , r) = r
s; P[enqueueAct s0 ]; D; Θ[s0 7→ (M 0 , D0 )]
P[enq(D0 ) s0 ]; Θ[s0 7→ (M 0 , D0 )]
15 2014/3/26