
4/22/09

L19: Dynamic Task Queues and More Synchronization

Administrative
•  Design review feedback
   -  Sent out yesterday – feel free to ask questions
•  Deadline extended to May 4: Symposium on Application Accelerators in High Performance Computing, http://www.saahpc.org/
•  Final reports on projects
   -  Poster session April 29, with dry run April 27
   -  Also, submit written document and software by May 6
   -  Invite your friends! I’ll invite faculty, NVIDIA, graduate students, application owners, ...
•  Industrial Advisory Board meeting on April 29
   -  There is a poster session immediately following class
   -  I was asked to select a few projects (4-5) to be presented
   -  The School will spring for printing costs for your poster!
   -  Posters need to be submitted for printing immediately following Monday’s class

CS6963 – L19: Dynamic Task Queues

Final Project Presentation
•  Dry run on April 27
   -  Easels, tape and poster board provided
   -  Tape a set of PowerPoint slides to a standard 2’x3’ poster, or bring your own poster.
•  Final Report on Projects due May 6
   -  Submit code
   -  And written document, roughly 10 pages, based on earlier submission.
   -  In addition to original proposal, include
      -  Project Plan and How Decomposed (from DR)
      -  Description of CUDA implementation
      -  Performance Measurement
      -  Related Work (from DR)

Let’s Talk about Demos
•  For some of you, with very visual projects, I asked you to think about demos for the poster session
•  This is not a requirement, just something that would enhance the poster session
•  Realistic?
   -  I know everyone’s laptops are slow ...
   -  ... and don’t have enough memory to solve very large problems
•  Creative suggestions?


Sources for Today’s Lecture
•  “On Dynamic Load Balancing on Graphics Processors,” D. Cederman and P. Tsigas, Graphics Hardware (2008). http://www.cs.chalmers.se/~cederman/papers/GPU_Load_Balancing-GH08.pdf
•  “A simple, fast and scalable non-blocking concurrent FIFO queue for shared memory multiprocessor systems,” P. Tsigas and Y. Zhang, SPAA 2001. (more on lock-free queues)
•  Intel Threading Building Blocks, http://www.threadingbuildingblocks.org/ (more on task stealing)

Last Time: Simple Lock Using Atomic Updates

Can you use atomic updates to create a lock variable?

Consider primitives:

int lockVar;
atomicAdd(&lockVar, 1);
atomicAdd(&lockVar, -1);

Suggested Implementation

// also unsigned int and long long versions
int atomicCAS(int* address, int compare, int val);

Reads the 32-bit or 64-bit word old located at address address in global or shared memory, computes (old == compare ? val : old), and stores the result back to memory at the same address. These three operations are performed in one atomic transaction. The function returns old (Compare And Swap). 64-bit words are only supported for global memory.

__device__ void getLock(int *lockVarPtr) {
    while (atomicCAS(lockVarPtr, 0, 1) == 1);
}

Constructing a Dynamic Task Queue on GPUs
•  Must use some sort of atomic operation for global synchronization to enqueue and dequeue tasks
•  Numerous decisions about how to manage task queues
   -  One on every SM?
   -  A global task queue?
   -  The former can be made far more efficient, but is also more prone to load imbalance
•  Many choices of how to do synchronization
   -  Optimize for properties of the task queue (e.g., very large task queues can use simpler mechanisms)
•  All proposed approaches have a statically allocated task list that must be as large as the maximum number of waiting tasks


Synchronization
•  Blocking
   -  Uses mutual exclusion to only allow one process at a time to access the object.
•  Lock-free
   -  Multiple processes can access the object concurrently. At least one operation in a set of concurrent operations finishes in a finite number of its own steps.
•  Wait-free
   -  Multiple processes can access the object concurrently. Every operation finishes in a finite number of its own steps.

Slide source: Daniel Cederman

Load Balancing Methods
•  Blocking task queue
•  Non-blocking task queue
•  Task stealing
•  Static task list

Slide source: Daniel Cederman

Static Task List (Recommended)

function DEQUEUE(q, id)
    return q.in[id]

function ENQUEUE(q, task)
    localtail ← atomicAdd(&q.tail, 1)
    q.out[localtail] ← task

function NEWTASKCNT(q)
    q.in, q.out, oldtail, q.tail ← q.out, q.in, q.tail, 0
    return oldtail

procedure MAIN(taskinit)
    q.in, q.out ← newarray(maxsize), newarray(maxsize)
    q.tail ← 0
    enqueue(q, taskinit)
    blockcnt ← newtaskcnt(q)
    while blockcnt != 0 do
        run blockcnt blocks in parallel
            t ← dequeue(q, TBid)
            subtasks ← doWork(t)
            for each nt in subtasks do
                enqueue(q, nt)
        blockcnt ← newtaskcnt(q)

Two lists: q_in is read-only and not synchronized; q_out is write-only and updated atomically. When NEWTASKCNT is called at the end of a major task scheduling phase, q_in and q_out are swapped. Synchronization is required to insert tasks, but at least one gets through (wait-free).

Blocking Dynamic Task Queue

function DEQUEUE(q)
    while atomicCAS(&q.lock, 0, 1) == 1 do;
    if q.beg != q.end then
        q.beg++
        result ← q.data[q.beg]
    else
        result ← NIL
    q.lock ← 0
    return result

function ENQUEUE(q, task)
    while atomicCAS(&q.lock, 0, 1) == 1 do;
    q.end++
    q.data[q.end] ← task
    q.lock ← 0

A lock is used for both adding and deleting tasks from the queue; all other threads block waiting for the lock. Potentially very inefficient, particularly for fine-grained tasks.


Lock-free Dynamic Task Queue

function DEQUEUE(q)
    oldbeg ← q.beg
    lbeg ← oldbeg
    while (task ← q.data[lbeg]) == NIL do
        lbeg++
    if atomicCAS(&q.data[lbeg], task, NIL) != task then
        restart
    if lbeg mod x == 0 then
        atomicCAS(&q.beg, oldbeg, lbeg)
    return task

function ENQUEUE(q, task)
    oldend ← q.end
    lend ← oldend
    while q.data[lend] != NIL do
        lend++
    if atomicCAS(&q.data[lend], NIL, task) != NIL then
        restart
    if lend mod x == 0 then
        atomicCAS(&q.end, oldend, lend)

At least one thread will succeed in adding or removing a task from the queue. Optimization: only update the beginning and end pointers with atomicCAS every x elements.

Task Stealing
•  No code provided in the paper
•  Idea:
   -  A set of independent task queues.
   -  When a task queue becomes empty, it goes out to other task queues to find available work.
   -  Lots and lots of engineering needed to get this right.
   -  The best work on this is in Intel Threading Building Blocks.

General Issues
•  One or multiple task queues?
•  Where does the task queue reside?
   -  If possible, in shared memory
   -  Actual tasks can be stored elsewhere, perhaps in global memory
