Outline Intel Threading Building Blocks

Intel Threading Building Blocks


Dr. M. Schwind
Prof. Praktische Informatik

Winter Term 2013/2014



Introduction
Basics
    Generic Programming with C++
    Concepts of Threading Building Blocks
    Initialization
Parallel Constructs
    parallel for
    parallel reduce
    parallel do
    pipeline
    parallel sort
    Additional Algorithm Templates
Synchronization
    Mutex
    Atomic Operations
Container
    concurrent vector
    concurrent hash map
    concurrent queue
Task-Programming


C++ library for shared-memory parallel programming, mainly for multicore CPUs

Implements important parallel programming patterns:
    Parallel loops
    Pipelining
    Task programming

Provides data structures which allow parallel access from several threads:
    Queue (FIFO)
    Associative container
    Vector

Developed by Intel
    Not restricted to Intel processors
    Implementation uses generic programming (C++ templates)
    Commercial and open-source versions

Homepage:

Literature:
    Reference Manual, Installation Guide, Getting Started Guide
    Intel Threading Building Blocks: Outfitting C++ for Multi-core Processor Parallelism. Author: James Reinders. Publisher: O'Reilly. ISBN: 0596514808. Publication date: 2007.

Motivation for Generic Programming

Example: simplified implementation of a stack for storing integer values

    class IntStack {
    public:
        void push(const int& item) { mem[pos++] = item; }
        int pop() { return mem[--pos]; }
        int mem[100];
        int pos = 0;   // index of the next free slot
    };

    // Usage
    IntStack s;
    s.push(5);
    int x = s.pop();

Problem: A type-safe implementation of a stack for storing values of different types requires one implementation per type.
Solution:
    Usage of the preprocessor (awkward, confusing, difficult to maintain)
    Usage of templates

Generic Programming with C++

Functions, classes, and methods can be declared with types which remain variable until compile time.
To define a class with variable types, the declaration of the class is preceded by template<typename T1, typename T2, ...>
    T1, T2, ... are identifiers for the variable types
    typename is a keyword preceding each identifier

    template <typename T>   // use type T instead of int
    class Stack {
    public:
        void push(const T& item) { mem[pos++] = item; }
        T pop() { return mem[--pos]; }
        T mem[100];
        int pos = 0;
    };

Declaration of Objects using Variable Types

Objects of a template class are declared with the class name followed by the concrete types in angle brackets <>.

    // Declaration of an integer stack
    Stack<int> int_stack;
    int_stack.push(5);

    // Declaration of a stack using double-precision numbers
    Stack<double> double_stack;
    double_stack.push(5.0);

Type Requirements

Templates can be used with self-defined types.
Example: usage of the stack class for storing self-defined tuple classes.
Analysis of the implementation of the stack class shows that the element type must define an assignment operator.
Template implementations impose certain syntactic and semantic requirements on their type arguments.

    class IntTupel {
    public:
        IntTupel() : s1(0), s2(0) {}               // needed for the array T mem[100]
        IntTupel(int a, int b) : s1(a), s2(b) {}   // needed for IntTupel(5, 6)
        // Assignment operator
        IntTupel& operator=(const IntTupel& other) {
            s1 = other.s1;
            s2 = other.s2;
            return *this;
        }
        int s1, s2;   // elements of the tuple
    };
    ...
    // Usage
    Stack<IntTupel> s;
    s.push(IntTupel(5, 6));
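The stack template and the tuple example can be put together into one compilable unit. The following is a minimal sketch under a few assumptions that are not part of the slides: the capacity becomes a second template parameter, `push` and `pop` check their bounds with `assert`, and the element type is spelled `IntTuple`.

```cpp
#include <cassert>
#include <cstddef>

// Generic fixed-capacity stack in the spirit of the Stack<T> template.
// Requirements on T: default-constructible (for the array) and
// assignable (for push).
template <typename T, std::size_t Capacity = 100>
class Stack {
public:
    void push(const T& item) {
        assert(pos < Capacity);   // added check, not in the slide version
        mem[pos++] = item;
    }
    T pop() {
        assert(pos > 0);          // added check, not in the slide version
        return mem[--pos];
    }
    bool empty() const { return pos == 0; }

private:
    T mem[Capacity];
    std::size_t pos = 0;          // index of the next free slot
};

// Self-defined element type that fulfills the requirements above.
struct IntTuple {
    int s1 = 0;
    int s2 = 0;
    IntTuple() = default;
    IntTuple(int a, int b) : s1(a), s2(b) {}
};
```

The same template serves `Stack<int>`, `Stack<double>` and `Stack<IntTuple>`; a type without a default constructor or without an assignment operator would be rejected at compile time.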
Concepts and Models

A concept is a collection of requirements for a type:
    Syntactic requirements (e.g. a class defines a method with a specific name)
    Semantic requirements (a method performs a computation in a specific way)
A model is a type which fulfills all requirements of a concept.
In Threading Building Blocks, concepts are described by pseudo signatures.

Example (CopyConstructible)

    Pseudo signature                  Semantics
    T(const T&)                       Copy constructor
    ~T()                              Destructor
    T* operator&()                    Address of T
    const T* operator&() const        Address of const T

Splittable Concept

    Pseudo signature                  Semantics
    X::X(X& x, split)                 Splits x into x and a newly constructed object

The splitting constructor splits an object into two parts.
The argument of type split distinguishes the splitting constructor from the copy constructor.
Used for:
    Partitioning an index range into two subranges which can be computed in parallel
    Duplicating function objects which are executed in parallel
Models: blocked_range and blocked_range2d; the bodies of parallel_reduce and parallel_scan

Range Concept

Ranges represent index sets and are typically used in parallel loops.

    Pseudo signature                  Semantics
    R::R(const R&)                    Copy constructor
    R::~R()                           Destructor
    bool R::empty() const             True if the index range is empty
    bool R::is_divisible() const      True if the index range can be divided
    R::R(R& r, split)                 Subdivides r into two index sets

Models for Ranges: blocked_range and blocked_range2d

blocked_range

    template<typename Value> class blocked_range;

Represents the half-open interval [i, j); i and j have type Value.
Models for Value are built-in types such as int or unsigned int, or pointers to vector elements.

    template <typename Value>
    class blocked_range {
    public:
        typedef size_t size_type;
        typedef Value const_iterator;

        blocked_range(Value begin, Value end, size_type grainsize = 1);
        blocked_range(blocked_range& r, split);

        size_type size() const;
        bool empty() const;

        size_type grainsize() const;
        bool is_divisible() const;

        const_iterator begin() const;
        const_iterator end() const;
    };
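To make the splitting constructor and `is_divisible()` more concrete, here is a small sketch that uses only the standard library. `IndexRange`, the local `split` tag and `subdivide` are hypothetical names for illustration and are not TBB classes. The range keeps its lower half when it is split, and it counts as divisible while its size exceeds the grain size.

```cpp
#include <cassert>
#include <cstddef>
#include <vector>

struct split {};   // tag type, stand-in for the TBB split argument

// Half-open index range [begin, end) with a grain size.
class IndexRange {
public:
    IndexRange(int begin, int end, std::size_t grainsize = 1)
        : begin_(begin), end_(end), grainsize_(grainsize) {
        assert(grainsize > 0);
    }
    // Splitting constructor: r keeps the lower half, *this takes the upper half.
    IndexRange(IndexRange& r, split)
        : begin_(r.begin_ + (r.end_ - r.begin_) / 2),
          end_(r.end_),
          grainsize_(r.grainsize_) {
        r.end_ = begin_;
    }
    bool empty() const { return !(begin_ < end_); }
    std::size_t size() const {
        return empty() ? 0 : static_cast<std::size_t>(end_ - begin_);
    }
    bool is_divisible() const { return grainsize_ < size(); }
    int begin() const { return begin_; }
    int end() const { return end_; }

private:
    int begin_, end_;
    std::size_t grainsize_;
};

// Recursively subdivide r until is_divisible() returns false and
// collect the resulting subranges in index order.
void subdivide(IndexRange r, std::vector<IndexRange>& leaves) {
    if (!r.is_divisible()) {
        leaves.push_back(r);
        return;
    }
    IndexRange upper(r, split());   // r now holds the lower half
    subdivide(r, leaves);
    subdivide(upper, leaves);
}
```

In a parallel loop each of the collected subranges could be handed to a different thread; here they are only collected sequentially.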

Concept of Value

    Pseudo signature                                Semantics
    Value::Value(const Value&)                      Copy constructor
    Value::~Value()                                 Destructor
    operator-(const Value& i, const Value& j)       Number of elements in the range [i, j)
    operator+(const Value& i, size_t k)             k-th value after i

Initialization of TBB

    #include <cstdlib>
    #include "tbb/task_scheduler_init.h"
    using namespace tbb;

    int main() {
        task_scheduler_init init;
        ...
        return EXIT_SUCCESS;
    }

Each program requires a tbb::task_scheduler_init object.
After initialization, threads are started and wait for work to be assigned.
An additional parameter can specify the number of threads.
Example: task_scheduler_init init(8) creates 8 threads.
The threads stay alive as long as the task_scheduler_init object is not destroyed; when it gets destructed, the threads are destroyed.

Template for Parallel Loops

parallel_for

    template<typename Range, typename Body>
    void parallel_for(const Range& range, const Body& body);

Parallel iteration over a range object.
The range object is subdivided into parts.
For each part, operator() of the body object gets called.
An additional version of parallel_for takes a partitioner as a third argument.

Requirements for Body:

    Pseudo signature                              Semantics
    Body::Body(const Body&)                       Copy constructor
    Body::~Body()                                 Destructor
    void Body::operator()(Range& r) const         Application of the operator() to r

parallel_for subdivides the index range recursively until the call to is_divisible() returns false. For each part of the index range the body object is replicated and applied to that part.

    class DoubleAll {
        int* input;
    public:
        DoubleAll(int* _input) : input(_input) {}
        void operator()(const blocked_range<int>& range) const {
            for (int i = range.begin(); i != range.end(); ++i)
                input[i] *= 2;
        }
    };

    void ParallelDoubleAll(int* input, size_t n) {
        DoubleAll da(input);
        parallel_for(blocked_range<int>(0, n, 1000), da);
    }

Reduction Operations

parallel_reduce

    template<typename Range, typename Body>
    void parallel_reduce(const Range& range, Body& body);
Builds a single object by applying a reduction operator to a set of objects.
Computes e.g. the sum, minimum or maximum of vector elements.
An additional version takes a partitioner.
The reduction operator should be associative.

Requirements for Body:

    Pseudo signature                      Semantics
    Body::Body(Body&, split)              Splitting constructor
    Body::~Body()                         Destructor
    void Body::operator()(Range& r)       Reduction of the elements of the subrange r
    void Body::join(Body& rhs)            Combines the values of subranges: merges rhs into the value of *this

The body object:
    Is replicated for each subrange
    Its operator() is applied to each subrange
    Stores the value of the reduction over a subrange
    Intermediate results are combined with join()

    class Sum {
    public:
        float* array;
        float value;
        Sum(float* _array) : array(_array), value(0) {}
        Sum(Sum& s, split) : array(s.array), value(0) {}
        void operator()(const blocked_range<int>& range) {
            float temp = value;
            for (int i = range.begin(); i != range.end(); ++i)
                temp += array[i];
            value = temp;
        }
        void join(Sum& rhs) { value += rhs.value; }
    };

    float ParallelSum(float* array, size_t n) {
        Sum total(array);
        parallel_reduce(blocked_range<int>(0, n, 1000), total);
        return total.value;
    }
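The interplay of splitting constructor, operator() and join() can be traced without TBB. The sketch below is an assumption-laden illustration: `reduce_sketch` is a hypothetical driver that halves the index range recursively and runs both halves one after the other, whereas a real parallel_reduce may execute them on different threads and decides by itself where to split.

```cpp
#include <cassert>
#include <cstddef>

struct split {};   // tag type, stand-in for the TBB split argument

// Body with the same shape as the Sum class: splitting constructor,
// operator() for a subrange, and join() for combining partial results.
class Sum {
public:
    const float* array;
    float value;
    explicit Sum(const float* a) : array(a), value(0) {}
    Sum(Sum& s, split) : array(s.array), value(0) {}
    void operator()(std::size_t begin, std::size_t end) {
        for (std::size_t i = begin; i != end; ++i)
            value += array[i];
    }
    void join(Sum& rhs) { value += rhs.value; }
};

// Hypothetical sequential driver: recursive halving down to grainsize.
template <typename Body>
void reduce_sketch(std::size_t begin, std::size_t end,
                   std::size_t grainsize, Body& body) {
    assert(grainsize > 0);
    if (end - begin <= grainsize) {
        body(begin, end);                        // reduce one subrange
        return;
    }
    std::size_t mid = begin + (end - begin) / 2;
    Body right(body, split());                   // replicate the body
    reduce_sketch(begin, mid, grainsize, body);
    reduce_sketch(mid, end, grainsize, right);   // could run on another thread
    body.join(right);                            // combine the partial results
}

float sum_array(const float* array, std::size_t n) {
    Sum total(array);
    reduce_sketch(0, n, 4, total);
    return total.value;
}
```

Because the partial sums are combined in a different order than in a plain loop, the operator has to be associative for the result to be independent of the subdivision.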


Partitioner

Controls the subdivision of range objects and the assignment of range objects to threads. Used for parallel_for and parallel_reduce.

simple_partitioner
    Recursive subdivision of range objects until Range::is_divisible returns false.

auto_partitioner
    Subdivides the range object, but not necessarily until Range::is_divisible returns false.
    Balances the work across processors by ensuring that the ranges assigned to threads have nearly equal size.

affinity_partitioner
    Subdivision similar to auto_partitioner.
    When iterating several times over the range object, the partitioner assigns subranges to the same threads in every iteration.
    Increases cache efficiency if the data fits in cache.

parallel do

parallel_do

    template<typename InputIterator, typename Body>
    void parallel_do(InputIterator first, InputIterator last, Body& body);

Sequential iteration over the elements of a container; the operator() of the body object is applied to each element.
Particularly useful when the elements of the container are not randomly accessible, e.g. in a list.
An iterator object is required:
    An iterator is an abstract interface for accessing the elements of a container.
    Iterator objects are defined for STL (Standard Template Library) containers and TBB containers.
The body can also be applied to objects which are generated while the computation proceeds.

Requirements for Body:

    Pseudo signature                                            Semantics
    void B::operator()(B::argument_type& item,                  item: element to which the operator is applied
        parallel_do_feeder<B::argument_type>& feed) const       feed: used to store newly created elements
    B::argument_type                                            Type of the elements
    B::argument_type(const B::argument_type&)                   Copy constructor of argument_type
    ~B::argument_type()                                         Destructor of argument_type

    class ListEl {};   // is CopyConstructible

    struct Body {
        typedef ListEl argument_type;
        void operator()(ListEl c, tbb::parallel_do_feeder<ListEl>& feed) const {
            // process_item is assumed to be defined elsewhere
            ListEl new_item = process_item(c);
            feed.add(new_item);
        }
    };

    std::list<ListEl> list;
    ...
    tbb::parallel_do(list.begin(), list.end(), Body());

pipeline

Class definition

    class pipeline {
    public:
        pipeline();
        virtual ~pipeline();
        void add_filter(filter& f);
        void run(size_t max_number_of_live_tokens);
        void clear();
    };

A pipeline object (class pipeline) uses several pipeline stages f1, ..., fn, called filters in TBB.
Filters are created outside of the pipeline and put into the pipeline by calling pipeline::add_filter().
The method pipeline::run() starts the pipeline; max_number_of_live_tokens limits the number of items that are processed in the pipeline at the same time.
pipeline::clear() removes all filters from the pipeline; after that call the filters can be destroyed.

Class Definition of Filter

    class filter {
    public:
        enum mode { parallel, serial };
        filter(mode);
        bool is_serial() const;
        virtual void* operator()(void* item) = 0;
        virtual ~filter();
    };

Each filter class has to override the virtual method void* filter::operator()(void*).
The return value of operator() is used as the argument item of the next pipeline stage.
The first filter f1 generates the data; a return value of NULL tells TBB that no more elements need to be processed.
The last stage fn should manage the output; the return value of that stage is ignored.
A filter can be marked as a parallel filter; several items are then processed in parallel in that stage.

Parallel Sorting

    template<typename RandomAccessIterator, typename Compare>
    void parallel_sort(RandomAccessIterator begin,
                       RandomAccessIterator end,
                       const Compare& comp);

Used for sorting a container object.
Unstable sorting: the order of elements with the same key is not preserved.
Deterministic sorting: the same sequence of elements produces the same sorted sequence in each sorting run.
RandomAccessIterator is defined in the STL; it allows random access to elements.

    const int N = 100000;
    float b[N];
    ...
    parallel_sort(b, b + N, std::greater<float>());
Additional algorithm templates

parallel_scan
    Computes the prefix sum in parallel
    Used e.g. for parallel sorting

parallel_for_each
    Parallel application of a function to the elements addressed by a random-access iterator

parallel_invoke
    Parallel invocation of up to 10 functions


Scoped Locking Pattern

Motivation: usage of exceptions

    void fun1() {
        ...
        throw new Exception();
    }

    void fun2() {
        lock.lock();
        fun1();          // the mutex is not unlocked
        lock.unlock();
    }

    void fun3() {
        try {
            fun2();
        } catch (Exception* e) {
            // exception handling
        }
    }

Problem: In case fun1 throws an exception, the lock variable is not unlocked.

Solution 1: Modification of fun2

    void fun2() {
        lock.lock();
        try {
            fun1();
        }
        catch (Exception* e) {
            lock.unlock();
            // exception handling
            return;      // avoid unlocking a second time below
        }
        catch (...) {
            lock.unlock();
            throw;
        }
        lock.unlock();
    }

Disadvantages:
    Unlocking locks may be forgotten, not only when using exceptions.
    The complexity of the program text increases.
    Increased programming effort for the programmer.

Solution 2: Separation of the lock variable and the locking functionality into two objects
    Mutex: globally visible
    Scoped lock: used for locking the mutex
        For each thread and each mutex one scoped lock instance exists
        Locks the mutex at its construction
        Unlocks the mutex at its destruction
Tip: Declaring a scoped lock object at the beginning of a code block (braces { } in C++) keeps the associated mutex locked within the whole code block.

    ...
    {
        // Construction of myLock locks the mutex myMutex
        mutex::scoped_lock myLock(myMutex);
        // Computations are protected by myMutex
        ...
        // Unlocking of myMutex
        // (the destructor of myLock is called implicitly)
    }

Mutex Concept

All of the following mutex models have to implement the following functions:

    Pseudo signature
    M()
    ~M()
    typename M::scoped_lock
    M::scoped_lock()
    M::scoped_lock(M& mutex)
    M::~scoped_lock()
    M::scoped_lock::acquire(M& mutex)
    bool M::scoped_lock::try_acquire(M& mutex)
    M::scoped_lock::release()
    static const bool M::is_rw_mutex
    static const bool M::is_recursive_mutex
    static const bool M::is_fair_mutex
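The scoped locking pattern can be tried out with the standard library alone. In the sketch below, `TracedMutex` is a hypothetical wrapper around `std::mutex` that merely records whether it is held, so that the effect of the destructor becomes observable; the nested `scoped_lock` class imitates the constructor and destructor part of the mutex concept and is not the TBB implementation.

```cpp
#include <cassert>
#include <mutex>
#include <stdexcept>

// Mutex wrapper that records whether it is currently held.
// The flag is for demonstration in a single thread only.
class TracedMutex {
public:
    void lock()   { m.lock(); held = true; }
    void unlock() { held = false; m.unlock(); }
    bool is_held() const { return held; }

    // Scoped lock: locks in the constructor, unlocks in the destructor.
    class scoped_lock {
    public:
        explicit scoped_lock(TracedMutex& mutex) : mtx(mutex) { mtx.lock(); }
        ~scoped_lock() { mtx.unlock(); }
        scoped_lock(const scoped_lock&) = delete;
        scoped_lock& operator=(const scoped_lock&) = delete;
    private:
        TracedMutex& mtx;
    };

private:
    std::mutex m;
    bool held = false;
};

TracedMutex my_mutex;      // globally visible mutex
int shared_counter = 0;    // data protected by my_mutex

void fun1(bool fail) {
    if (fail) throw std::runtime_error("fun1 failed");
}

void fun2(bool fail) {
    TracedMutex::scoped_lock lock(my_mutex);   // locks my_mutex
    ++shared_counter;
    fun1(fail);   // may throw; the destructor of lock still unlocks
}
```

No try/catch is needed inside `fun2`: stack unwinding destroys `lock` and thereby releases the mutex, whether the block is left normally or through an exception.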

Models Implementing the Mutex Concept

mutex class
    Wrapper for the operating system implementation of locks

spin_mutex class
    Lock implementation using a busy-waiting loop
    Uses a flag variable in memory
    Suitable only for short waiting times, since processor time and memory bandwidth are consumed while waiting
    Unfair implementation: the order of locking requests is ignored

recursive_mutex class
    Wrapper for the recursive operating system implementation (e.g. for pthread_mutex_t)
    A recursive lock can be locked several times by one and the same thread
    If a mutex was locked n times, the thread has to unlock it n times too

queuing_mutex class
    Implementation using a busy-waiting loop
    Fair implementation: locking requests are served in FIFO order
    Implementation scales

Read/Write Locks
    Several threads that only read the protected data structure are allowed to read in parallel.
    A thread which modifies the data structure needs exclusive write access.
    The lock can be held either by several readers or by one writer.

Additional requirements compared to the mutex concept:

    Pseudo signature
    M::scoped_lock(M& mutex, bool write=true)
    M::scoped_lock::acquire(M& mutex, bool write=true)
    bool M::scoped_lock::try_acquire(M& mutex, bool write=true)
    bool M::scoped_lock::upgrade_to_writer()
    bool M::scoped_lock::downgrade_to_reader()

write=true: exclusive write access; write=false: read access
Models: class spin_rw_mutex and class queuing_rw_mutex

Example

Data structure queue (FIFO)
    Implementation using a linked list
    Attaching elements at the end: enq operation
    Taking elements at the start: deq operation
    Taking elements from an empty queue: exception
    Later: implementation by TBB (concurrent_queue)

    template <typename T>
    struct Node {
        Node() : next(NULL) {}
        Node(const T& v) : val(v), next(NULL) {}
        T val;        // value
        Node* next;   // pointer to the next element
    };

    template <typename T>
    class LockQueue {
        // Mutexes for taking and attaching
        mutex enqLock, deqLock;
        // Pointers to the beginning and the end of the linked list
        Node<T> *head, *tail;
    public:
        // The queue has one sentinel element
        LockQueue() {
            head = new Node<T>();
            tail = head;
        }
        ~LockQueue() {
            // free the sentinel and all remaining elements
            while (head != NULL) {
                Node<T>* next = head->next;
                delete head;
                head = next;
            }
        }
        void enq(const T& x) {
            mutex::scoped_lock l(enqLock);
            Node<T>* e = new Node<T>(x);
            tail->next = e;
            tail = e;
        }
        T deq() {
            mutex::scoped_lock l(deqLock);
            if (head->next == NULL)
                throw new EmptyException();
            T val = head->next->val;
            Node<T>* h = head;
            head = head->next;
            delete h;
            return val;
        }
    };

Notes:
    By using two mutexes it is possible to take elements from the queue and attach elements to it in parallel.
    Deadlock free, since no thread holds two locks at the same time.
    head points to the sentinel element; its successor is the first element of the queue.
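A variant of the two-lock queue that compiles without TBB might look as follows. This is a sketch under several assumptions that differ from the slide version: `std::mutex` with `std::lock_guard` replaces `mutex::scoped_lock`, `std::runtime_error` thrown by value replaces `EmptyException`, and the `next` pointer is a `std::atomic`, because `enq` and `deq` can touch the `next` field of the same node when the queue holds only the sentinel.

```cpp
#include <atomic>
#include <cassert>
#include <mutex>
#include <stdexcept>

template <typename T>
class LockQueue {
    struct Node {
        T val;
        std::atomic<Node*> next;
        Node() : val(), next(nullptr) {}
        explicit Node(const T& v) : val(v), next(nullptr) {}
    };

    std::mutex enqLock, deqLock;   // one lock per end of the list
    Node* head;                    // sentinel, protected by deqLock
    Node* tail;                    // last element, protected by enqLock

public:
    LockQueue() : head(new Node()), tail(head) {}
    ~LockQueue() {
        while (head != nullptr) {  // free the sentinel and all elements
            Node* n = head->next.load();
            delete head;
            head = n;
        }
    }
    LockQueue(const LockQueue&) = delete;
    LockQueue& operator=(const LockQueue&) = delete;

    void enq(const T& x) {
        Node* e = new Node(x);
        std::lock_guard<std::mutex> l(enqLock);
        tail->next.store(e);       // publish the new element
        tail = e;
    }

    T deq() {
        std::lock_guard<std::mutex> l(deqLock);
        Node* first = head->next.load();
        if (first == nullptr)
            throw std::runtime_error("queue is empty");
        T val = first->val;
        delete head;               // the old sentinel is no longer needed
        head = first;              // first becomes the new sentinel
        return val;
    }
};
```

As in the slide version, `T` has to be default-constructible because of the sentinel element.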

Atomic Operations

    template <typename T>
    struct atomic {
        typedef T value_type;

        value_type fetch_and_add(value_type addend);        // x = x + addend
        value_type fetch_and_increment();                   // x = x + 1
        value_type fetch_and_decrement();                   // x = x - 1
        value_type compare_and_swap(value_type new_value,
                                    value_type comparand);  // (*)
        value_type fetch_and_store(value_type new_value);   // swap(x, new_value)
        operator value_type() const;
        value_type operator+=(value_type);
        value_type operator-=(value_type);
        value_type operator++();
        value_type operator--();
    };

T is an integer or pointer type.
The operations are executed atomically.
(*) compare_and_swap:
    Compares comparand with the value of *this; if they are equal, sets *this = new_value
    Returns the old value of *this
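The semantics of compare_and_swap can be reproduced with `std::atomic` from the C++ standard library, which is used here instead of the TBB class. The free function `compare_and_swap` below is a hypothetical helper that keeps the argument order (new_value, comparand) and the return value described above; `atomic_max` is an assumed usage example that shows the typical retry loop.

```cpp
#include <atomic>
#include <cassert>

// Returns the old value of x; x is set to new_value only if the old
// value was equal to comparand.
int compare_and_swap(std::atomic<int>& x, int new_value, int comparand) {
    int expected = comparand;
    x.compare_exchange_strong(expected, new_value);
    // On success expected still equals the old value (== comparand);
    // on failure it has been overwritten with the current value of x.
    return expected;
}

// Atomically raise x to at least candidate.
void atomic_max(std::atomic<int>& x, int candidate) {
    int seen = x.load();
    while (seen < candidate) {
        int old = compare_and_swap(x, candidate, seen);
        if (old == seen)
            break;        // our value was written
        seen = old;       // another thread changed x in between; retry
    }
}
```

The loop in `atomic_max` repeats only when another thread has modified `x` between the read and the swap.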
concurrent vector

concurrent_vector

    template<typename T> class concurrent_vector;

Properties:
    Random access to elements (addressed by index)
    The data structure can grow
    After growing, indices and iterators are still valid
    No shrinking is possible

Selected Methods

Access to elements:
    T& operator[](size_type i)
        Access to the i-th element without index checking
    T& at(size_type i)
        Access to the i-th element; throws the exception std::out_of_range for an invalid index

Selected Methods (Continued)

    size_type size()
        Number of elements stored
    bool empty()
        Returns size() == 0
    size_type capacity()
        Maximum number of elements before new memory is allocated
    size_type max_size()
        Maximum number of elements

Iterators and Ranges:
    iterator begin() and iterator end()
        Random-access iterators for visiting the vector elements in increasing order of indices
    reverse_iterator rbegin() and reverse_iterator rend()
        Random-access iterators for visiting the vector elements in reverse order
    range_type range(int grainsize)
        Range object for the vector

concurrent hash map

concurrent_hash_map<Key,T,HashCompare>

    template<typename Key, typename T, typename HashCompare>
    class concurrent_hash_map;

Hash table for the storage of key-value pairs with parallel access.
Key: type of the keys; T: type of the values.
HashCompare: class for mapping keys to integer values.

Concept of HashCompare:

    Pseudo signature                                          Semantics
    HashCompare::HashCompare(const HashCompare&)              Copy constructor
    HashCompare::~HashCompare()                               Destructor
    bool HashCompare::equal(const Key& j, const Key& k)       True if j and k are equal
    size_t HashCompare::hash(const Key& k) const              Maps k to an integer

i and j have type Key; h is an object which implements the concept HashCompare. If h.equal(i,j) is true, then h.hash(i) == h.hash(j) must hold.

Element Access

An accessor object (proxy) allows concurrent access to key-value pairs.
The accessor object uses an implicit lock for each key-value pair:
    Construction of an accessor object: locking of the corresponding key-value pair
    Destruction of the accessor object: unlocking of the implicit lock
There are two different accessor types:
    const_accessor    read access; read lock
    accessor          read/write access; read/write lock

const_accessor

    template <typename Key, typename T,
              typename HashCompare, typename A>
    class concurrent_hash_map<Key,T,HashCompare,A>::const_accessor {
        ...
        typedef const std::pair<const Key, T> value_type;

        bool empty() const;                     // true if the accessor holds no element
        const value_type& operator*() const;    // reference to the entry
        const value_type* operator->() const;   // pointer to the entry
        void release();                         // unlocks the implicit lock
    };
Selected Methods

size_type count(const Key& key) const
    Returns one if key is present, zero otherwise

bool find(accessor& res, const Key& key)
    Searches for key; if present, returns the entry in res and locks the entry with a write lock

bool insert(accessor& res, const Key& key)
    Similar to find; difference: if the entry is not present, a new key-value pair pair<Key,T>(key, T()) is created and inserted

bool erase(const Key& key)
    Searches for key; if present, deletes it

Iteration over elements:
    1. by using iterator begin() and iterator end()
    2. by using a range object returned by range_type range(size_t grainsize)

Example: Compute the frequency of words

    struct MyHashCompare {
        static size_t hash(const string& x) {
            size_t h = 0;
            for (const char* s = x.c_str(); *s; s++)
                h = (h * 17) ^ *s;
            return h;
        }
        static bool equal(const string& x, const string& y) {
            return x == y;
        }
    };

    typedef concurrent_hash_map<string, int, MyHashCompare> StringTable;

    struct Tally {
        StringTable& table;
        Tally(StringTable& _table) : table(_table) {}
        void operator()(const blocked_range<string*> r) const {
            for (string* p = r.begin(); p != r.end(); ++p) {
                StringTable::accessor a;
                table.insert(a, *p);   // locks the entry for *p
                a->second += 1;
            }                          // the destructor of a unlocks the entry
        }
    };

    void CountOccurrences(string* data, int nitems) {
        task_scheduler_init init;
        StringTable table;
        parallel_for(blocked_range<string*>(data, data + nitems, grainsize),
                     Tally(table));
        for (StringTable::iterator i = table.begin(); i != table.end(); ++i)
            cout << i->first << " " << i->second << endl;
    }
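For comparison, the same word count can be written with the standard library only. This is not how concurrent_hash_map works internally: the sketch below assumes a `std::unordered_map` protected by one `std::mutex` for the whole table, which is much coarser than the per-entry locks of the accessor objects. The names `MyHash` and `count_occurrences` and the striding over the input are illustrative choices.

```cpp
#include <cassert>
#include <cstddef>
#include <mutex>
#include <string>
#include <thread>
#include <unordered_map>
#include <vector>

// Hash function adapted from MyHashCompare::hash.
struct MyHash {
    std::size_t operator()(const std::string& x) const {
        std::size_t h = 0;
        for (char c : x)
            h = (h * 17) ^ static_cast<unsigned char>(c);
        return h;
    }
};

typedef std::unordered_map<std::string, int, MyHash> StringTable;

// Counts the words in data using nthreads worker threads.
StringTable count_occurrences(const std::vector<std::string>& data,
                              unsigned nthreads) {
    assert(nthreads > 0);
    StringTable table;
    std::mutex table_mutex;   // a single lock for the whole table
    std::vector<std::thread> workers;
    for (unsigned t = 0; t < nthreads; ++t) {
        workers.emplace_back([&, t] {
            // worker t handles the items t, t + nthreads, t + 2*nthreads, ...
            for (std::size_t i = t; i < data.size(); i += nthreads) {
                std::lock_guard<std::mutex> guard(table_mutex);
                table[data[i]] += 1;   // inserts 0 first if the key is new
            }
        });
    }
    for (std::thread& w : workers)
        w.join();
    return table;
}
```

With a single mutex all updates are serialized; the point of the accessor design is that updates to different keys do not have to wait for each other.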
concurrent queue

concurrent_queue

    template<typename T> class concurrent_queue;

Properties:
    FIFO queue
    Concurrent insertion and removal of elements is possible
    Limited capacity
    Implementation uses locks
    Busy waiting on some (blocking) operations

Important methods:
    void push(const T& source)
        Inserts an element at the end
    void pop(T& destination)
        Removes an element from the queue and returns it in destination; blocks if the queue is empty
    bool pop_if_present(T& destination)
        Removes and returns an element if one is present; returns false otherwise
    size_type size() const
        Number of elements stored; if the queue is empty, returns the number of waiting threads as a negative number
    size_t capacity() const
        Returns the maximum capacity
Task-Programming

A task is composed of data and of code which uses the data for a computation.
    Tasks can be executed in parallel
    Tasks can be divided into subtasks; the parent-child relationship creates a tree of tasks
    Child tasks should be independent, so that they can be computed on different cores
    The programmer defines the subdivision
    The scheduler component within TBB manages the computation order
Examples of algorithms:
    Linear algebra (matrix multiplication, matrix decomposition)
    Sorting (merge sort, quicksort)
    Search

Decomposition of a task into subtasks: split operation
Waiting for the completion of the children: join operation

Task Depth
    Each task has implicit information about its task depth
    The task depth of a child is one greater than the task depth of its parent
    The root task has task depth 0

Reference Counter
    Each task has a reference counter
    The reference counter counts the number of existing children
    If the reference counter reaches zero, the task is deleted and the reference counter of its parent is decremented

Split/Join Parallelism: two possible methods
    Blocking
    Continuation passing

Blocking

    task* T::execute() {
        if (there is no further division possible) {
            /* sequential computation */
        } else {
            set_ref_count(k + 1);
            task& tk = *new(allocate_child()) T(...); spawn(tk);
            ...
            task& t2 = *new(allocate_child()) T(...); spawn(t2);
            task& t1 = *new(allocate_child()) T(...);
            spawn_and_wait_for_all(t1);
        }
        return NULL;
    }

Explanation: T inherits from the class task and reimplements the method execute, which controls the subdivision into tasks.

Example (Blocking)
    struct Tree { int val; Tree *left, *right; };

    class SumTask : public task {
        int* sum;
        Tree* tree;
    public:
        SumTask(Tree* _tree, int* _sum) : sum(_sum), tree(_tree) {}

        task* execute() {
            SumTask *a = NULL, *b = NULL;
            int ref = 1, x = 0, y = 0;   // ref starts at 1 for the wait
            if (tree->right != NULL) {
                a = new(allocate_child()) SumTask(tree->right, &x);
                ref++;
            }
            if (tree->left != NULL) {
                b = new(allocate_child()) SumTask(tree->left, &y);
                ref++;
            }
            if (ref > 1) {
                set_ref_count(ref);
                if (a != NULL) spawn(*a);
                if (b != NULL) spawn(*b);
                wait_for_all();          // blocks until the children are done
            }
            *sum = tree->val + x + y;
            return NULL;
        }
    };
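The blocking style has a close analogue in the C++ standard library: the parent starts a child computation, continues with its own part, and then waits for the child's result. The sketch below uses `std::async` and `std::future` instead of TBB tasks; the function `tree_sum` and the cutoff `kMaxParallelDepth` are assumptions for illustration, and unlike the TBB scheduler this version starts a new thread for every asynchronous call.

```cpp
#include <cassert>
#include <future>

struct Tree { int val; Tree* left; Tree* right; };

// Sums all values of the tree. Down to a fixed depth the right subtree
// is computed asynchronously while the caller handles the left subtree.
int tree_sum(const Tree* t, int depth = 0) {
    if (t == nullptr)
        return 0;
    const int kMaxParallelDepth = 3;   // hypothetical cutoff
    if (depth >= kMaxParallelDepth)    // deep in the tree: sequential
        return t->val + tree_sum(t->left, depth + 1)
                      + tree_sum(t->right, depth + 1);
    std::future<int> right =
        std::async(std::launch::async, tree_sum, t->right, depth + 1);
    int left = tree_sum(t->left, depth + 1);
    return t->val + left + right.get();   // blocks until the child is done
}
```

The call to `right.get()` plays the role of wait_for_all(): the local variables of the caller stay on its stack while it waits.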
Problems with Blocking

Problems:
    Local variables of task::execute remain on the stack of the executing OS thread while task::spawn_and_wait_for_all is called.
    Task stealing in conjunction with blocking may result in stack growth; remember that the stack size is limited.
    The scheduler tries to limit the stack growth by choosing only ready tasks with a task depth higher than that of the last blocking task; this limits parallelism.

Continuation Passing

Instead of blocking, the computation which uses the results of the child tasks is moved into a continuation task.
The continuation task is executed after all children have finished.

    task* T::execute() {
        if (there is no further division possible) {
            /* sequential computation */
        } else {
            set_ref_count(k);
            recycle_as_continuation();
            task& tk = *new(allocate_child()) T(...); spawn(tk);
            ...
            task& t1 = *new(allocate_child()) T(...);
            return &t1;   // t1 is returned instead of spawned and runs next
        }
        return NULL;
    }

In the example there is no further computation after the child tasks have been created.
    From the algorithm's point of view there is no need for a continuation task.
    Internals of TBB require a continuation task.
recycle_as_continuation() marks the parent as a continuation task.
Additional possibility: specifying a continuation task implicitly (shown in the next example).

Example (Continuation-Passing)

    class SumContTask : public task {
    public:
        int* sum;
        int x, y;
        SumContTask(int* _sum) : sum(_sum), x(0), y(0) {}
        task* execute() { *sum += x + y; return NULL; }
    };

    class SumTask : public task {
        int* sum;
        Tree* tree;
    public:
        SumTask(Tree* _tree, int* _sum) : sum(_sum), tree(_tree) { *sum += tree->val; }

        task* execute() {
            if (tree->right == NULL && tree->left == NULL)
                return NULL;   // leaf: the value was already added
            SumTask *a = NULL, *b = NULL;
            int ref = 0;
            SumContTask* c = new(allocate_continuation()) SumContTask(sum);
            if (tree->right != NULL) {
                a = new(c->allocate_child()) SumTask(tree->right, &c->x);
                ref++;
            }
            if (tree->left != NULL) {
                b = new(c->allocate_child()) SumTask(tree->left, &c->y);
                ref++;
            }
            c->set_ref_count(ref);
            if (a != NULL) spawn(*a);
            if (b != NULL) spawn(*b);
            return NULL;
        }
    };

Important Methods of the Class task

    Method                                           Semantics
    void wait_for_all()                              Waits for the children to finish
    void spawn(task& child)                          Marks child for execution
    void spawn(task_list& list)                      Marks a list of children for execution
    spawn_and_wait_for_all(task& child)              Marks child for execution and waits for the children
    spawn_and_wait_for_all(task_list& list)          Marks the children in list for execution and waits for them
    depth_type depth()                               Returns the task depth
    void set_depth(depth_type new_depth)             Sets the task depth
    void add_to_depth(int delta)                     Increments the task depth by delta
    int ref_count() const                            Returns the reference counter
    void set_ref_count(int count)                    Sets the reference counter
    void recycle_as_continuation()                   Recycles the task as a continuation task
    void recycle_as_child_of(task& parent)           Recycles the task as a child of parent
    void recycle_to_reexecute()                      Recycles the task so that it is executed again

    int ParallelSum(Tree* tree) {
        int sum = 0;
        SumTask& a = *new(task::allocate_root()) SumTask(tree, &sum);
        task::spawn_root_and_wait(a);
        return sum;
    }

A root task is allocated by passing task::allocate_root() as argument to new.
The static method task::spawn_root_and_wait(task& root) executes the root task and waits for its completion.
The static method task::spawn_root_and_wait(task_list&) can be used for executing a list of root tasks.

Execution Orders

Depth-first execution:
    Small memory footprint
    Good cache locality
    No parallelism

Breadth-first execution:
    High memory footprint
    Poor cache locality
    High parallelism

Each OS thread manages a ready pool. Organization of the ready pool:
    For each task depth there is a list of tasks that are ready to be executed.
    The lists are managed by an array; the task depth is the index.
    New tasks are stored at the beginning of the list corresponding to their task depth and are removed from the beginning of that list (LIFO).

Operation of the task scheduler

Tasks are executed in the following order:
    1. The task returned by execute()
    2. The task which is the parent of the last executed task
    3. A task from the list with the highest task depth
    4. A task with an affinity for that thread
    5. A task from the ready pool of another thread with the lowest depth (task stealing)

