
中国科学院计算技术研究所

Parallel Processing
Lecture 09: Advanced OpenMP Programming
2024-05-21

王银山 (Associate Professor)
高性能计算机研究中心
中国科学院计算技术研究所

1
中国科学院计算技术研究所

Lecture Outline
 Introduction to OpenMP
 Basic1: Data scoping & Reduction operations
 Basic2: Work Sharing Schemes
 Basic3: Synchronization and its issues
 Basic4: Tasking
 Advanced1: OpenMP correctness
 Advanced2: OpenMP performance pitfalls
 Advanced3: Data placement on ccNUMA architecture

2
中国科学院计算技术研究所

Introduction to OpenMP: Basics


- "Easy", incremental and portable parallel programming of shared-memory computers: OpenMP
- Standardized set of compiler directives & library functions: http://www.openmp.org/
  - FORTRAN, C and C++ interfaces are defined
  - Supported by most/all commercial compilers, GNU starting with 4.2
- An OpenMP program can be written to compile and execute on a single-processor machine just by ignoring the directives
  - API calls must be masked out, though
- Supports data parallelism
- References:
  - R. Chandra, L. Dagum, D. Kohr, D. Maydan, J. McDonald, R. Menon: Parallel Programming in OpenMP. Academic Press, San Diego, USA, 2000, ISBN 1-55860-671-8
  - B. Chapman, G. Jost, R. v. d. Pas: Using OpenMP. MIT Press, 2007, ISBN 978-0262533027
3
中国科学院计算技术研究所

Introduction to OpenMP: Shared-Memory model


Central concept of OpenMP programming: threads
- Threads access globally shared memory
- Data is either shared or private
  - shared data is available to all threads (in principle)
  - private data is visible only to the thread that owns it
- Data transfer is transparent to the programmer
- Synchronization takes place, mostly implicitly
- Tailored to data-parallel execution
- Other threading libraries are available, e.g. pthreads
4
中国科学院计算技术研究所

Introduction to OpenMP: Fork and join execution model


- Program start: only the master thread runs
- Parallel region: a team of threads is generated ("fork")
- Threads synchronize when leaving the parallel region ("join")
- Serial region: only the master thread executes; worker threads usually sleep
- Task (and data) distribution is possible via directives
- Often the best choice: 1 thread per core
5
中国科学院计算技术研究所

Introduction to OpenMP: Software Architecture

- Programmer's view: directives/pragmas in the application code, plus (a few) library routines
- User's view: environment variables determine resource allocation, scheduling strategies and other (implementation-dependent) behavior
- Operating system view: the parallel work is done by threads in the OS, which run on the cores of the hardware
6
中国科学院计算技术研究所

Introduction to OpenMP: Syntax in C/C++


 Include file: #include <omp.h>

- Compiler directive:

  #pragma omp directive [clause ...]
  structured block

- Conditional compilation: the compiler's OpenMP switch sets a preprocessor macro (acts like -D_OPENMP):

#ifdef _OPENMP

... do something

#endif
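
For instance, an OpenMP API call can be guarded by this macro so the same source still builds without the OpenMP switch (a minimal sketch; only standard API routines are used):

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main(void) {
#ifdef _OPENMP
  /* compiled only when the OpenMP switch is active */
  printf("OpenMP enabled, max threads: %d\n", omp_get_max_threads());
#else
  printf("serial build\n");
#endif
  return 0;
}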
7
中国科学院计算技术研究所

Introduction to OpenMP: Parallel execution


 #pragma omp parallel
structured block
 Makes structured block a parallel region: All code executed
between start and end of this region is executed by all threads.
 This includes subroutine calls within the region (unless
explicitly sequentialized)
 Local variables inside the block are automatically private
to each thread
 END PARALLEL required in Fortran

use omp_lib

!$OMP PARALLEL
call work(omp_get_thread_num(), omp_get_num_threads())
!$OMP END PARALLEL
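
A C/C++ counterpart of the Fortran snippet above could look like this (a sketch; work() stands for the user's routine and is not part of the OpenMP API):

#include <omp.h>

void work(int thread_id, int num_threads);   /* user routine, assumed to exist elsewhere */

void run_parallel(void) {
#pragma omp parallel
  {
    /* every thread of the team calls work() with its own id */
    work(omp_get_thread_num(), omp_get_num_threads());
  }
}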

8
中国科学院计算技术研究所

Introduction to OpenMP:
Data scoping – shared vs. private
The OpenMP memory model

Data in a parallel region can be:
- private to each executing thread
  - each thread has its own local copy of the data
- shared between threads
  - there is only one instance of the data, available to all threads
  - this does not mean that the instance is always visible to all threads!

An OMP clause specifies the scope of variables:
- Default: shared
- Specify private variables in a parallel region:
  #pragma omp parallel private(var1, tmp)
9
中国科学院计算技术研究所

Hello world
#include <stdio.h>
#include <omp.h>
int main(int argc, char* argv[]) {
#pragma omp parallel
  {
    printf("Hello world from %d!\n", omp_get_thread_num());
  }
  return 0;
}

- Compile switch: g++ hello.cpp -fopenmp
- Number of threads: determined by the shell variable OMP_NUM_THREADS, e.g. export OMP_NUM_THREADS=4
- Thread pinning, e.g. with LIKWID: likwid-pin -c 0-3 ./a.out
- Output ordering is not reproducible

 More environment variables available:


 Loop scheduling: OMP_SCHEDULE, Stacksize: OMP_STACKSIZE
 Dynamic adjustment of threads: OMP_DYNAMIC
10
中国科学院计算技术研究所

Lecture Outline
 Introduction to OpenMP
 Basic1: Data scoping & Reduction operations
 Basic2: Work Sharing Schemes
 Basic3: Synchronization and its issues
 Basic4: Tasking
 Advanced1: OpenMP correctness
 Advanced2: OpenMP performance pitfalls
 Advanced3: Data placement on ccNUMA architecture

11
中国科学院计算技术研究所

Introduction to OpenMP:
Data scoping – shared vs. private
 Default: All data in a parallel region is shared
This includes global data (global/static variables, C++ class variables)

 Exceptions:
1. Loop variables of parallel (“sliced”) loops are private (cf. workshare constructs)
2. Local (stack) variables within parallel region
3. Local data within enclosed function calls are private* unless declared static

- Stack size limits: it may be necessary to make large arrays static
  - This presupposes it is safe to do so!
  - If not: allocate the data dynamically
- As of OpenMP 3.0, OMP_STACKSIZE may be set at run time (increases the thread-specific stack size):

$ setenv OMP_STACKSIZE 100M

12
中国科学院计算技术研究所

Introduction to OpenMP:
Data scoping – side effects
- An incorrect shared attribute may lead to
  - correctness issues ("race conditions" -> incorrect results)
  - performance issues ("false sharing" -> (very) slow execution)
- Scoping of local function data and global data
  - is not changed by the parallel directive
  - the compiler cannot be assumed to have knowledge of how the data is used
- Recommendation: use
  #pragma omp parallel default(none)
  so that nothing is overlooked; the compiler then complains about every variable that has no explicit scoping attribute (a sketch follows below).
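
A minimal sketch of the effect of default(none); the variable names are illustrative only:

#include <stdio.h>

int main(void) {
  int n = 1000;
  double sum = 0.0;
  /* every variable used inside the region must now be scoped explicitly,
     otherwise the compiler reports an error */
#pragma omp parallel for default(none) shared(n) reduction(+:sum)
  for (int i = 0; i < n; i++) {
    sum += i;
  }
  printf("sum = %f\n", sum);
  return 0;
}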

13
中国科学院计算技术研究所

Private variables - Masking

#include <stdio.h>
#include <omp.h>
int main(int argc, char* argv[]) {
  int s = 0;
#pragma omp parallel private(s)
  {
    for (int i = 0; i < 10; i++) {
      s += i;
    }
  }
  return 0;
}

- At the fork, each thread gets its own (uninitialized) private copy s0..s3; the shared/global s persists but is inaccessible inside the region.
- At the join, the shared/global value is recovered (OpenMP 3.0).
- Masking is relevant for privatized variables defined in a scope outside the parallel region.
- Masking also applies to the association status of pointers and the allocation status of allocatable variables (Fortran).
14
中国科学院计算技术研究所

The firstprivate clause

#include <stdio.h>
#include <omp.h>
int main(int argc, char* argv[]) {
  int s = 0;
#pragma omp parallel firstprivate(s)
  {
    for (int i = 0; i < 10; i++) {
      s += i;
    }
  }
  return 0;
}

- Extension of private: the value of the master copy is transferred to the private copies at the fork; the shared s persists but is inaccessible inside the region.
- Restrictions: not a pointer, not assumed shape, not a subobject, the master copy must not itself be private, etc.
15
中国科学院计算技术研究所

The lastprivate clause

#include <stdio.h>
#include <omp.h>
int main(int argc, char* argv[]) {
  int s = 0;
#pragma omp parallel for lastprivate(s)   /* lastprivate requires a loop (or sections) construct */
  for (int i = 0; i < 10; i++) {
    s += i;
  }
  return 0;
}

- Extension of private: the value from the thread that executes the last iteration of the loop is transferred back to the master copy at the join.
- Restrictions similar to firstprivate
中国科学院计算技术研究所

Example1
#include <stdio.h>
#include <omp.h>

int main(int argc, char* argv[]) {
  int sum = 0;
#pragma omp parallel for shared(sum)
  for (int i = 0; i < omp_get_num_threads(); i++) {
    sum += i;    /* sum is shared */
    printf("tid %d, sum = %d\n", omp_get_thread_num(), sum);
  }
  return 0;
}

code is in PE/src/lec09/share_private.cpp
17
中国科学院计算技术研究所

Example1
#include <stdio.h>
#include <omp.h>
int main(int argc, char* argv[]) {
  int sum = 0;
#pragma omp parallel for private(sum)
  for (int i = 0; i < omp_get_num_threads(); i++) {
    sum += i;    /* sum is private */
    printf("tid %d, sum = %d\n", omp_get_thread_num(), sum);
  }
  printf("outside sum = %d\n", sum);   /* the private copies are gone; the original sum is unchanged */
  return 0;
}

code is in PE/src/lec09/share_private.cpp
18
中国科学院计算技术研究所

Introduction to OpenMP: Reduction operations


real :: s
!$omp parallel
!$omp do reduction(+:s)
do i = ...
  ...
  s = s + ...
end do
!$omp end do
... = ... * s
!$omp end parallel

- s is still shared before the reduction starts.
- At the start of the reduction each thread gets a private copy s0..s3, all set to 0.0; the shared s persists but is inaccessible.
- At the synchronization point:
  - the reduction operation is performed
  - the result is transferred to the master copy
  - restrictions similar to firstprivate
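
A complete C analogue of the pattern above, as a minimal sketch with illustrative data:

#include <stdio.h>

int main(void) {
  enum { N = 1000 };
  double a[N], s = 0.0;
  for (int i = 0; i < N; i++) a[i] = 1.0;   /* illustrative data */

#pragma omp parallel for reduction(+:s)
  for (int i = 0; i < N; i++) {
    s += a[i];   /* each thread accumulates into its private copy, initialized to 0.0 */
  }
  /* at the synchronization point the private copies are combined into s */
  printf("s = %f\n", s);
  return 0;
}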

19
中国科学院计算技术研究所

Introduction to OpenMP: Reduction operations


- Initial values of the reduction variable depend on the operation (rely on the algebraic rules!):

  Operation   Initial value
  +           0
  -           0
  *           1
  .and.       .true.
  .or.        .false.
  .eqv.       .true.
  .neqv.      .false.
  MAX         -HUGE(x)
  MIN         HUGE(x)
  IAND        all bits set
  IEOR        0
  IOR         0

  C/C++ has an analogous set.

- Consistency required between the operation specified in the clause and the update statement
  - subtract: X = expr - X is not allowed
- Multiple reductions:
  - multiple scalars:
    #pragma omp parallel for reduction(+:x, y, z)
  - or an array (Fortran):
    double precision :: a(n)
    !$omp parallel do reduction(*:a)
  - multiple operations:
    #pragma omp parallel for reduction(+:sum_i) reduction(max:max_i)
20
中国科学院计算技术研究所

Lecture Outline
 Introduction to OpenMP
 Basic1: Data scoping & Reduction operations
 Basic2: Work Sharing Schemes
 Basic3: Synchronization and its issues
 Basic4: Tasking
 Advanced1: OpenMP correctness
 Advanced2: OpenMP performance pitfalls
 Advanced3: Data placement on ccNUMA architecture

21
中国科学院计算技术研究所

The schedule clause


- Default scheduling:
  - implementation dependent
  - typical: largest possible chunks of as-equal-as-possible size ("static scheduling")
  - minimal overhead (precalculated iteration-space assignment)

- User-defined scheduling:
  #pragma omp for schedule(...)
  !$OMP do schedule(...)
  schedule( {static | dynamic | guided} [,chunk] )
  chunk: always a non-negative integer; if omitted, it has a schedule-dependent default value

- Static scheduling: schedule(static,10)
  - the iteration space is divided into chunks of 10 iterations, assigned to the threads in a fixed, precalculated pattern
  - problem: if some threads take long to complete their chunks, workload imbalance results

- Dynamic scheduling: schedule(dynamic, 10)
  - after a thread has completed a chunk, it is assigned a new one, until no chunks are left
  - synchronization overhead
  - default chunk value is 1
22
中国科学院计算技术研究所

OpenMP Scheduling of simple for loops

(Figure: iteration-to-thread mapping for OMP_SCHEDULE=static, OMP_SCHEDULE=static,10 and OMP_SCHEDULE=dynamic,10.)
中国科学院计算技术研究所

Example2

code is in PE/src/lec09/schedule.cpp
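
The referenced file is not reproduced here; a minimal sketch of what such a scheduling demo might look like:

#include <stdio.h>
#include <omp.h>

int main(void) {
  /* print which thread executes which iteration to visualize the chunk assignment;
     try OMP_SCHEDULE=static,10 or OMP_SCHEDULE=dynamic,10 together with schedule(runtime) */
#pragma omp parallel for schedule(runtime)
  for (int i = 0; i < 100; i++) {
    printf("iteration %3d executed by thread %d\n", i, omp_get_thread_num());
  }
  return 0;
}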
24
中国科学院计算技术研究所

OpenMP Schedule example

25
中国科学院计算技术研究所

Guided scheduling
 Size of chunks in dynamic schedule
 too small  large overhead
 too large  load imbalance
 Guided scheduling: dynamically vary chunk size.
 Size of each chunk is proportional to the number of unassigned iterations divided by
the number of threads in the team, decreasing to chunk-size (default = 1).

 Chunk size:
 means minimum chunk size (except perhaps final chunk)
 default value is 1

(Figure: guided scheduling of an iteration space with chunk = 7; chunk sizes decrease down to the minimum.)

 Both dynamic and guided scheduling useful for handling poorly balanced and
unpredictable workloads.

26
中国科学院计算技术研究所

Deferred scheduling
- auto: automatic scheduling
  - the programmer gives the implementation the freedom to use any possible mapping

- runtime: decided at run time
  !$OMP do schedule(runtime)
  #pragma omp for schedule(runtime)
  - the schedule is one of those on this or the previous two slides
  - determined by either setting OMP_SCHEDULE and/or calling omp_set_schedule() (which overrides the environment setting)
  - find out which one is active by calling omp_get_schedule()
  - if runtime scheduling is requested and OMP_SCHEDULE is not set, the implementation chooses a schedule

- Examples:
  - environment setting:
    export OMP_SCHEDULE="guided,4"
    ./a.out
  - call to the API routine (Fortran):
    call omp_set_schedule(omp_sched_dynamic,4)
    !$OMP parallel
    !$OMP do schedule(runtime)
    do ...
    end do
    !$OMP end do
  - call to the API routine (C/C++):
    omp_set_schedule(omp_sched_dynamic, 4);
    #pragma omp parallel
    #pragma omp for schedule(runtime)
    for (...) { }

27
中国科学院计算技术研究所

Collapsing loop nests


- Collapse nested loops into a single iteration space:

  !$OMP do collapse(2)
  do k=1, kmax
    do j=1, jmax
      ...
    end do
  end do
  !$OMP end do

  #pragma omp for collapse(2)
  for (k=0; k<kmax; ++k)
    for (j=0; j<jmax; ++j)
      ...

  - the argument specifies the number of loop nests to flatten
  - the collapsed iteration space is what is divided into chunks and distributed among threads
  - sequential execution of the iterations in all loops determines the order of iterations in the collapsed iteration space

- Logical iteration space, example kmax=3, jmax=3:

  iteration  0 1 2 3 4 5 6 7 8
  k          1 1 1 2 2 2 3 3 3
  j          1 2 3 1 2 3 1 2 3

- Restrictions:
  - iteration space computable at entry to the loop (rectangular)
  - CYCLE (Fortran) or continue (C/C++) only in the innermost loop

- Optimization effect:
  - may improve memory locality
  - may reduce data traffic between cores
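
A compilable C sketch of the collapsed nest above, counting the iterations of the flattened 3x3 space:

#include <stdio.h>

int main(void) {
  enum { KMAX = 3, JMAX = 3 };
  int count = 0;

#pragma omp parallel for collapse(2) reduction(+:count)
  for (int k = 0; k < KMAX; ++k)      /* the two nests form one iteration space of 9 */
    for (int j = 0; j < JMAX; ++j)
      count++;

  printf("iterations executed: %d\n", count);   /* prints 9 */
  return 0;
}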

28
中国科学院计算技术研究所

Performance Tuning: the nowait clause


- Remember: an OpenMP for/do performs an implicit synchronization (barrier) at loop completion
- Shooting yourself in the foot: variables modified in the loop must not be accessed afterwards unless explicit synchronization is performed; use a barrier for this
- Example: multiple loops in one parallel region (a sketch follows below)
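
A sketch of what such a region might look like (array and function names are illustrative; the two loops must not touch the same data for this to be correct):

/* illustrative sketch: all six arrays are assumed to be distinct */
void two_loops(int n, double *a, const double *b, const double *c,
               double *d, const double *e, const double *f) {
#pragma omp parallel
  {
#pragma omp for nowait   /* no barrier: threads proceed to the second loop immediately */
    for (int i = 0; i < n; i++)
      a[i] = b[i] + c[i];

#pragma omp for          /* safe only because this loop does not access a[] */
    for (int i = 0; i < n; i++)
      d[i] = e[i] * f[i];
  }                      /* implicit barrier at the end of the parallel region */
}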

29
中国科学院计算技术研究所

Explicit barrier synchronization


- The barrier construct is a stand-alone directive
- barrier synchronizes all threads
- Each barrier must be encountered by all threads in the team or by none at all

30
中国科学院计算技术研究所

Parallel sections
- Non-iterative work-sharing construct: distribute a set of structured blocks

  !$omp parallel
  !$omp sections
  !$omp section
  ! code block 1        <- e.g. thread 0
  !$omp section
  ! code block 2        <- e.g. thread 1
  ...
  !$omp end sections    <- synchronization (unless eliminated with nowait)
  !$omp end parallel

- Each block is executed exactly once by one of the threads in the team
  - if there are more threads than code blocks: excess threads wait at the synchronization point
- Restrictions:
  - a section directive must be within the lexical scope of a sections directive
  - sections binds to the innermost parallel region; only the threads executing the binding parallel region participate in the execution of the section blocks and in the implicit barrier (if not eliminated with nowait)
- Scheduling of sections to threads is implementation-dependent
- Allowed clauses on sections: private, firstprivate, lastprivate, reduction, nowait

31
中国科学院计算技术研究所

The single directive


#pragma omp parallel
{
  double s = ...;
#pragma omp single copyprivate(s)
  {
    s = ...;
  }
  ... = ... + s;     /* copyprivate(s) broadcasts the value of s to all threads */
}

- Only one thread executes the enclosed code block; all other threads wait until the block completes execution
- Allowed clauses: private, firstprivate, copyprivate, nowait
  - copyprivate and nowait appear on end single in Fortran, on single in C/C++:
    !$omp single
    s = ...
    !$omp end single copyprivate(s)
- Use it for updates of shared entities, but ... is single really a worksharing directive?

32
中国科学院计算技术研究所

Combining regions and work sharing


- Example:

  !$OMP parallel do
  ...
  !$OMP end parallel do

  #pragma omp parallel for
  ...

  is equivalent to

  !$omp parallel
  !$omp do
  ...
  !$omp end do
  !$omp end parallel

  #pragma omp parallel
  #pragma omp for
  ...

- Applies to most work-sharing constructs: do/for, workshare, sections
- Notes:
  - clauses for work-sharing constructs can appear on the combined construct; the reverse is not true (e.g. shared can only appear on a parallel region)
  - clauses on a work-sharing construct only apply to that specific construct's block

33
中国科学院计算技术研究所

Lecture Outline
 Introduction to OpenMP
 Basic1: Data scoping & Reduction operations
 Basic2: Work Sharing Schemes
 Basic3: Synchronization and its issues
 Basic4: Tasking
 Advanced1: OpenMP correctness
 Advanced2: OpenMP performance pitfalls
 Advanced3: Data placement on ccNUMA architecture

34
中国科学院计算技术研究所

Why do we need synchronization?


- OpenMP memory model
  - private (thread-local) data: no access by other threads
  - shared data: two views
    - temporary view: a thread may have modified data in its registers, cache, or another intermediate device
    - the content becomes inconsistent with that in the cache/memory of other cores
    - other threads cannot know that their copy of the data is invalid
- Example: two threads execute a = a + 1 on a shared variable a in the same parallel region -> race condition

35
中国科学院计算技术研究所

Possible results
a = 0

Thread 0: Thread 1:
a = a + 1 a = a + 1

- The following values of a could be observed on each thread after completion of the statement:

  Thread 0   Thread 1
  1          1
  1          2
  2          1

- The result may differ from run to run, depending on which thread is the last one
- After completion of the parallel region, the value may be 1 or 2

36
中国科学院计算技术研究所

Consequences and (theoretical) remedies


- For threaded code without synchronization this means:
  - multiple threads write to the same memory location -> the resulting value is unspecified
  - some threads read and another writes -> the result on the reading threads is unspecified

- Flush operation
  - is performed on a set of (shared) variables (the "flush-set") or on the whole thread-visible data state of a program
  - discards the temporary view: modified values are forced out to cache/memory; the next read access must be from cache/memory
  - further memory operations are only allowed after all involved threads complete the flush
  - restricts memory instruction reordering (by the compiler)

- Ensuring a consistent view of memory
  - assumption: we want to write a data item with the first thread and read it with the second
  - required order of execution:
    1. thread 1 writes to the shared variable
    2. thread 1 flushes the variable
    3. thread 2 flushes the same variable
    4. thread 2 reads the variable

37
中国科学院计算技术研究所

OpenMP flush syntax


 OpenMP directive for explicit flushing

!$omp flush[(var1[,var2,…])]

 Stand-alone directive
 applicable to all variables with shared scope
 including: SAVE, COMMON/module globals, shared dummy arguments, shared pointer dereferences
 If no variables specified, the flush-set
 encompasses all shared variables
which are accessible in the scope of the FLUSH directive
 potentially slower
 Implicit flush operations (with no list) occur at:
 All explicit and implicit barriers
 Entry to and exit from critical regions
 Entry to and exit from lock routines

38
中国科学院计算技术研究所

Barrier synchronization
 Explicit via directive:
 the execution flow of each thread blocks upon reaching the barrier until all threads have reached
the barrier
 flush synchronization of all accessible shared variables happens before all threads continue
 after the barrier, all shared variables have consistent value visible to all threads
 barrier may not appear within work-sharing code block
 e.g. !$omp do block, since this would imply deadlock

 Implicit for some directives:


 at the beginning and end of parallel regions
 at the end of do, single, sections, workshare blocks unless a nowait clause is specified
(where allowed)
 all threads in the executing team are synchronized
 this is what makes these directives “easy-and-safe-to-use”

39
中国科学院计算技术研究所

Relaxing synchronization requirements


 Use a nowait clause
 on end do / end sections / end single / end workshare
(Fortran)
 on for / sections / single (C/C++)
 removes the synchronization at end of block
 potential performance improvement
 especially if load imbalance occurs within the construct
 programmer’s responsibility to prevent races

40
中国科学院计算技术研究所

Critical regions
 The critical and atomic directives:
 each thread arriving at the code block executes it (in contrast to single)
 mutual exclusion: only one at a time within code block
 atomic: code block must be a single line update of a scalar entity of intrinsic type
with an intrinsic operation

Fortran (a unary operator is also allowed):
  !$omp critical
    block
  !$omp end critical

  !$omp atomic
    x = x <op> y

C/C++:
  #pragma omp critical
  { block }

  #pragma omp atomic
  x = x <op> y;
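
A small sketch of a typical atomic update in C (the function and variable names are illustrative):

/* count how many entries exceed a threshold; hits is shared by all threads */
long count_hits(int n, const double *x) {
  long hits = 0;
#pragma omp parallel for
  for (int i = 0; i < n; i++) {
    if (x[i] > 0.5) {
#pragma omp atomic      /* single update of a scalar with an intrinsic operation */
      hits++;
    }
  }
  return hits;
}

A reduction(+:hits) clause would achieve the same result with less overhead; atomic is shown here only to illustrate the directive.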

41
中国科学院计算技术研究所

Synchronizing effect of critical regions


- Mutual exclusion is only assured for the statements inside the block
  - i.e., subsequent threads executing the block are synchronized against each other
- If other statements access the shared variable, you may be in trouble:

  C/C++:
  #pragma omp parallel
  {
    ...
  #pragma omp atomic
    x = x + y;
    ...
    a = f(x, ...);
  }

  Fortran:
  !$omp parallel
    ...
  !$omp atomic
    x = x + y
    ...
    a = f(x, ...)
  !$omp end parallel

- Race on the read of x in a = f(x, ...): a barrier is required before this statement to ensure that all threads have executed their updates.

42
中国科学院计算技术研究所

Named critical
- Consider multiple updates of the same shared variable:

  subroutine foo()
  !$omp critical
  x = x + y
  !$omp end critical

  subroutine bar()
  !$omp critical
  x = x + z
  !$omp end critical

  The critical region is global: OK.

- Different shared variables:

  subroutine foo()
  !$omp critical
  x = x + y
  !$omp end critical

  subroutine bar()
  !$omp critical
  w = w + z
  !$omp end critical

  Mutual exclusion is not required here -> unnecessary loss of performance.

- Solution: use named criticals

  subroutine foo()
  !$omp critical (foo_x)
  x = x + y
  !$omp end critical (foo_x)

  subroutine bar()
  !$omp critical (foo_w)
  w = w + z
  !$omp end critical (foo_w)

  - mutual exclusion only if the same name is used for the critical region
  - atomic is bound to the updated variable -> the problem does not occur there

43
中国科学院计算技术研究所

The ordered clause and directive


- Statements must be within the body of a loop
- The directive acts similarly to single: threads do their work in the order of the sequential execution, i.e. in the order of the loop iterations
- Requires an ordered clause on the enclosing do/for construct
- Only effective if the code is executed in parallel
- Only one ordered region per loop

  C/C++:
  #pragma omp for ordered
  for (i=0; i<N; ++i) {
    O1
  #pragma omp ordered
    { O2 }
    O3
  }

  Fortran:
  !$OMP do ordered
  do I=1,N
    O1
  !$OMP ordered
    O2
  !$OMP end ordered
    O3
  end do
  !$OMP end do

44
中国科学院计算技术研究所

Two applications of ordered


- Loop contains a recursion:
  - the dependency requires serialization of only a small part of the loop (otherwise performance suffers)

  Fortran:
  !$OMP do ordered
  do I=2,N
    ... ! large block
  !$OMP ordered
    a(I) = a(I-1) + ...
  !$OMP end ordered
  end do
  !$OMP end do

  C/C++:
  #pragma omp for ordered
  for (i=1; i<N; ++i) {
    ... /* large block */
  #pragma omp ordered
    a[i] = a[i-1] + ...;
  }

- Loop contains I/O:
  - it is desired that the output (file) be consistent with the serial execution

  Fortran:
  !$OMP do ordered
  do I=1,N
    ... ! calculate a(I)
  !$OMP ordered
    write(unit,...) a(I)
  !$OMP end ordered
  end do
  !$OMP end do

  C/C++:
  #pragma omp for ordered
  for (i=0; i<N; ++i) {
    ... /* calculate a[i] */
  #pragma omp ordered
    printf("%e ", a[i]);
  }

45
中国科学院计算技术研究所

Mutual exclusion with locks


 A shared lock variable can be used to implement specifically designed
synchronization mechanisms
 In the following, var is of type
 Fortran: integer(omp_lock_kind)
 C/C++: omp_lock_t

 OpenMP lock variables must be only accessed by the lock routines

 Mutual exclusion bound to objects


 more flexible than critical regions

46
中国科学院计算技术研究所

OpenMP locks
- An OpenMP lock can be in one of the following 3 states:
  - uninitialized
  - unlocked
  - locked
- The task that sets the lock is then said to own the lock.
- Only the task that set the lock can unset it, returning it to the unlocked state.

 2 types of locks are supported:


 simple locks
 Can only be locked if unlocked.
 A thread may not attempt to re-lock a lock it already has acquired.
 nestable locks
 Owning thread can lock multiple times
 Owning thread must unlock the same number of times it locked it

47
中国科学院计算技术研究所

Lock routines (1)

 Fortran: omp_init_lock(var)
C/C++ omp_init_lock(omp_lock_t *var)
 initialize a lock
 initial state is unlocked
 what resources are protected by lock: defined by developer
 var not associated with a lock before this routine is called

 Fortran: omp_destroy_lock(var)
C/C++: omp_destroy_lock(omp_lock_t
*var)
 disassociate var from lock
 precondition:
 var must have been initialized
 var must be in unlocked state

48
中国科学院计算技术研究所

Lock routines (2)


 Assuming: lock variable var has been initialized

 Fortran: omp_set_lock(var)
C/C++: void omp_set_lock(omp_lock_t *var)
 blocks if lock not available
 set ownership and continue execution if lock available

 Fortran: omp_unset_lock(var)
C/C++: void omp_unset_lock(omp_lock_t *var)
 release ownership of lock
 ownership must have been established before

 Fortran: logical function omp_test_lock(var)


 C/C++: int omp_test_lock(omp_lock_t *var)
 does not block, tries to set ownership
 returns true if lock was set, false if not
 allows the task to do something else while the lock is held by another thread

49
中国科学院计算技术研究所

Example for using locks

Blocking use (acts like a critical region):

#include <omp.h>
omp_lock_t lock;
omp_init_lock(&lock);
#pragma omp parallel
{
  ...
  omp_set_lock(&lock);
  /* use resource protected by lock */
  omp_unset_lock(&lock);
  ...
}
omp_destroy_lock(&lock);

Non-blocking use (loop until the calling thread holds the lock):

#include <omp.h>
omp_lock_t lock;
omp_init_lock(&lock);
#pragma omp parallel
{
  ...
  while (!omp_test_lock(&lock)) {
    /* work unrelated to the lock-protected resource */
  }
  /* use lock-protected resource */
  omp_unset_lock(&lock);
  ...
}
omp_destroy_lock(&lock);

50
中国科学院计算技术研究所

Nestable Locks
 replace omp_*_lock by omp_*_nest_lock

 task owning a nestable lock may re-lock it multiple times


 a nestable lock is available if it is either unlocked
or
 it is already owned by the task executing
omp_set_nest_lock()or omp_test_nest_lock()

 re-locking increments nest count


 releasing the lock decrements nest count
 lock is unlocked once nest count is zero
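
A minimal sketch of nestable locks; the protected data and the call structure are illustrative, and omp_init_nest_lock() is assumed to have been called once beforehand:

#include <omp.h>

omp_nest_lock_t nlock;   /* initialized elsewhere with omp_init_nest_lock(&nlock) */

void update(void)        /* may also be called while the lock is already held */
{
  omp_set_nest_lock(&nlock);    /* nest count becomes 2 when called from work() */
  /* ... modify the protected data ... */
  omp_unset_nest_lock(&nlock);
}

void work(void)
{
  omp_set_nest_lock(&nlock);    /* nest count 1 */
  update();                     /* re-locking by the same task is allowed */
  omp_unset_nest_lock(&nlock);  /* lock becomes free once the nest count reaches 0 */
}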

51
中国科学院计算技术研究所

Lecture Outline
 Introduction to OpenMP
 Basic1: Data scoping & Reduction operations
 Basic2: Work Sharing Schemes
 Basic3: Synchronization and its issues
 Basic4: Tasking
 Advanced1: OpenMP correctness
 Advanced2: OpenMP performance pitfalls
 Advanced3: Data placement on ccNUMA architecture

52
中国科学院计算技术研究所

Task Execution Model


- Supports unstructured parallelism, e.g.
  - unbounded loops:

    while (<expr>) {        do while (<expr>)
      ...                     ...
    }                       end do

  - recursive functions:

    void myfunc(<args>)
    {
      ...
      myfunc(<newargs>);
      ...
    }

- Example of unstructured parallelism:

  #pragma omp parallel
  #pragma omp single
  while (elem != NULL) {
  #pragma omp task
    compute(elem);
    elem = elem->next;
  }

- Several scenarios are possible: single creator, multiple creators, nested tasks, ...
- All threads in the team are candidates to execute tasks

53
中国科学院计算技术研究所

The Execution Model

Task queue

54
中国科学院计算技术研究所

The task Construct


- Deferring (or not) a unit of work, executable by any member of the team:

  #pragma omp task [clause[[,] clause]...]     !$omp task [clause[[,] clause]...]
  {structured-block}                           ...structured-block...
                                               !$omp end task

- Clauses:
  - data environment: private, firstprivate, default(shared|none), in_reduction(r-id: list)
  - dependencies: depend(dep-type: list)
  - scheduler restriction: untied
  - scheduler hints: priority(priority-value), affinity(list)
  - cutoff strategies: if(scalar-expression), mergeable, final(scalar-expression)
  - other clauses: allocate([allocator:] list), detach(event-handler)

55
中国科学院计算技术研究所

What is a Task?
 Make OpenMP worksharing more flexible:
 allow the programmer to package code blocks and data items for execution
 this by definition is a task
 and assign these to an encountering thread
 possibly defer execution to a later time (“work queue”)
 Introduced with OpenMP 3.0 and extended over time

 When a thread encounters a task construct, a task is generated from the code of
the associated structured block.
 Data environment of the task is created (according to the data-sharing attributes,
defaults, …)
 “Packaging of data”
 The encountering thread may immediately execute the task, or defer its execution.
In the latter case, any thread in the team may be assigned the task.

56
中国科学院计算技术研究所

Example: Processing a Linked List

typedef struct list_s {
  struct list_s *next;
  contents *data;
} list;

void process_list(list *head)
{
#pragma omp parallel
  {
#pragma omp single
    {
      list *p = head;
      while (p) {
#pragma omp task
        { do_work(p->data); }
        p = p->next;
      }
    } /* all tasks done */
  }
}

Typical task generation loop:

#pragma omp parallel
{
#pragma omp single
  {
    while (p) {
#pragma omp task
      { /* task code */ }
    }
  } /* all tasks done */
}

57
中国科学院计算技术研究所

Example: Processing a Linked List


typedef struct list_s {
  struct list_s *next;
  contents *data;
} list;

void process_list(list *head)
{
#pragma omp parallel
  {
#pragma omp single
    {
      list *p = head;
      while (p) {
#pragma omp task
        { do_work(p->data); }
        p = p->next;
      }
    } /* all tasks done */
  }
}

Features of this example:
- one of the threads has the job of generating all tasks
- synchronization: at the end of the single block for all tasks created inside it
- no particular order between tasks is enforced here
- data scoping default for the task block: firstprivate
  - iterating through p is fine; this is the "packaging of data" mentioned earlier
- the task region includes the call of do_work()

58
中国科学院计算技术研究所

The if Clause
 When if argument is false –
 task becomes an undeferred task
 task body is executed immediately by encountering thread
 all other semantics stay the same (data environment, synchronization) as for a
“deferred”task

#pragma omp task if (sizeof(p->data) > threshold)


{ do_work(p->data); }

 User-directed optimization:
 avoid overhead for deferring small tasks
 cache locality / memory affinity may be lost by doing so

59
中国科学院计算技术研究所

Task Synchronization

60
中国科学院计算技术研究所

Task Synchronization with barrier and taskwait

- OpenMP barrier (implicit or explicit):
  - all tasks created by any thread of the current team are guaranteed to be completed at barrier exit

  C/C++: #pragma omp barrier        Fortran: !$omp barrier

  #pragma omp parallel
  #pragma omp single
  {
    while (elem != NULL) {
  #pragma omp task
      compute(elem);
      elem = elem->next;
    }
  } /* impl. barrier */

- Task barrier: taskwait
  - the encountering task is suspended until its child tasks are complete
  - applies to direct children only, not to further descendants!

  C/C++: #pragma omp taskwait       Fortran: !$omp taskwait

  #pragma omp parallel
  #pragma omp single
  {
  #pragma omp task
    {
      ... /* A */
  #pragma omp task
      { ... } /* B */
  #pragma omp task
      { ... } /* C */
  #pragma omp taskwait
    }
  } /* impl. barrier */
中国科学院计算技术研究所

Task Synchronization with taskgroup


- taskgroup construct: deep task synchronization
  - attached to a structured block
  - waits for completion of all descendants of the current task
  - task synchronization point (TSP) at the end (see later slides)

  #pragma omp taskgroup [allocate] [task_reduction(list)]
  {structured-block}

- Example: task A creates tasks B and C; C creates C.1 and C.2; the taskgroup waits for B, C, C.1 and C.2:

  #pragma omp parallel
  #pragma omp single
  {
  #pragma omp taskgroup
    {
  #pragma omp task
      { ... }                    /* B */
  #pragma omp task
      { ... #C.1; #C.2; ... }    /* C */
    } // end of taskgroup: wait for B, C and C's descendants
  }

62
中国科学院计算技术研究所

Task Synchronization Constructs: Comparison

Task synchronization constructs compared:

- barrier: either an implicit or an explicit barrier.
- taskwait: waits for the completion of the child tasks of the current task.
- taskgroup: waits for the completion of the child tasks of the current task and all of their descendants.

63
中国科学院计算技术研究所

Task Switching
 It is allowed for a thread to
 suspend a task during execution
 start (or resume) execution of another task (assigned to the same team)
 resume original task later
 Pre-condition:
 a task scheduling point (TSP) is reached
 Example from previous slide:
 the taskwait directive implies a task scheduling point
 Another example:
 very many tasks are generated
 implementation can suspend generation of tasks and start processing the existing
queue
 Nearly all task scheduling points are implicit
 taskyield construct only explicit one

64
中国科学院计算技术研究所

Task Scheduling Points


 The point immediately following the generation of an explicit task.
 After the point of completion of a task region.
 At a taskyield directive.
 At a taskwait directive.
 At the end of a taskgroup region.
 At an implicit or explicit barrier directive.

65
中国科学院计算技术研究所

Example: Dealing with Large Number of Tasks


#define LARGE_N 10000000
double item[LARGE_N];
extern void process(double);

int main()
{
#pragma omp parallel
  {
#pragma omp single
    {
      for (int i = 0; i < LARGE_N; i++) {
#pragma omp task       /* task scheduling point; item is shared, i is firstprivate */
        process(item[i]);
      }
    }
  }
  return 0;
}

Features of this example:
- generates a large number of tasks with one thread and executes them with the threads in the team
- the implementation may reach its limit on unassigned tasks
- if it does, the implementation is allowed to cause the thread executing the task-generating loop to suspend its task at the scheduling point and start executing unassigned tasks
- once the number of unassigned tasks is sufficiently low, the thread may resume execution of the task-generating loop

66
中国科学院计算技术研究所

Example: The taskyield Directive


- The taskyield directive introduces an explicit task scheduling point (TSP); it may cause the calling task to be suspended.

  #include <omp.h>

  void something_useful();
  void something_critical();

  void foo(omp_lock_t *lock, int n)
  {
    for (int i = 0; i < n; i++) {
  #pragma omp task
      {
        something_useful();
        while (!omp_test_lock(lock)) {
  #pragma omp taskyield     /* the waiting task may be suspended here, allowing the
                               executing thread to perform other work; this may also
                               avoid deadlock situations */
        }
        something_critical();
        omp_unset_lock(lock);
      }
    }
  }

67
中国科学院计算技术研究所

Task Switching: Tied and Untied Tasks


 Default behavior:
 a task assigned to a thread must be (eventually) completed by that thread
 task is tied to the thread
 Change this via the untied clause #pragma omp task untied
 execution of task block may change to another structured-block
thread of the team at any task scheduling point
 implementation may add task scheduling points beyond those previously
defined (outside programmer‘s control!)
 Deployment of untied tasks
 starvation scenarios: running out of tasks while generating thread is still
working on something
 Dangers:
 more care required (compared with tied tasks) wrt. scoping and
synchronization

68
中国科学院计算技术研究所

Final and mergeable Tasks


- Final tasks:
  - use a final clause with a condition
  - final tasks are always undeferred, i.e. executed immediately by the encountering thread, reducing the overhead of placing tasks in the "task pool"
  - all tasks created inside a final task region are also final
  - different from an if clause
  - use omp_in_final() to test whether the current task is final
- Merged tasks:
  - a mergeable clause may create a merged task if the task is undeferred or final
  - a merged task has the same data environment as its creating task region
  - the clause was introduced to reduce data/memory requirements
- final and/or mergeable can be used for optimization purposes, e.g. to optimize the wind-down phase of a recursive algorithm (see the sketch below)
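
A sketch of how final and mergeable might be used to cut off task creation deep in a recursion (the node type and process() routine are illustrative):

typedef struct node { struct node *left, *right; } node_t;
void process(node_t *p);                 /* illustrative work routine */

void traverse(node_t *p, int depth) {
  if (p == NULL) return;
#pragma omp task final(depth > 20) mergeable   /* deep levels become undeferred and may be merged */
  {
    traverse(p->left,  depth + 1);       /* tasks created inside a final task are also final */
    traverse(p->right, depth + 1);
    process(p);
  }
}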

69
中国科学院计算技术研究所

Task data scoping

70
中国科学院计算技术研究所

Data Environment
 The task directive takes the following data attribute clauses that define the
data environment of the task:

 default (private | firstprivate | shared | none)


 private (list)
 firstprivate (list)
 shared (list)

71
中国科学院计算技术研究所

Data Scoping
 Some rules from Parallel Regions apply:
 Static and global variables are shared
 Automatic storage (stack) variables are private
 The OpenMP Standard says:
 The data-sharing attributes of variables that are not listed in data attribute clauses of
a task construct, and are not predetermined according to the OpenMP rules, are
implicitly determined as follows:
 In a task construct, if no default clause is present:
a) a variable that is determined to be shared in all enclosing constructs, up to
and including the innermost enclosing parallel construct, is shared.
b) a variable whose data-sharing attribute is not determined by rule (a) is
firstprivate.
int i = 0, j = 1;
#pragma omp parallel private(j)
#pragma omp single
{
  int k = 0;
#pragma omp task
  { /* use i, j, k */ }    /* i is shared; j and k are firstprivate */
}

72
中国科学院计算技术研究所

Task reduction

73
中国科学院计算技术研究所

Reminder: taskgroup construct


- taskgroup construct: deep task synchronization
  - attached to a structured block
  - waits for completion of all descendants of the current task
  - task synchronization point (TSP) at the end (see later slides)

  #pragma omp taskgroup [allocate] [task_reduction(list)]
  {structured-block}

- Example: task A creates tasks B and C; C creates C.1 and C.2; the taskgroup waits for B, C, C.1 and C.2:

  #pragma omp parallel
  #pragma omp single
  {
  #pragma omp taskgroup
    {
  #pragma omp task
      { ... }                    /* B */
  #pragma omp task
      { ... #C.1; #C.2; ... }    /* C */
    } // end of taskgroup: wait for B, C and C's descendants
  }

74
中国科学院计算技术研究所

Task Reductions using the taskgroup Construct


- Reduction operation (requires OpenMP >= 5.0):
  - performs some forms of recurrence calculation
  - associative and commutative operators

- The taskgroup scoping reduction clause:
    #pragma omp taskgroup task_reduction(op: list)
    {structured-block}
  (1) registers a new reduction; (3) computes the final result after the synchronization point.

- The task in_reduction clause:
    #pragma omp task in_reduction(op: list)
    {structured-block}
  (2) the task participates in the reduction operation.

- Example:

  int res = 0;
  node_t* node = head;
  ...
  #pragma omp parallel
  {
  #pragma omp single
    {
  #pragma omp taskgroup task_reduction(+: res)               // (1)
      {
        while (node) {
  #pragma omp task in_reduction(+: res) firstprivate(node)   // (2)
          {
            res += node->value;
          }
          node = node->next;
        }
      }                                                      // (3)
    }
  }

75
中国科学院计算技术研究所

Task Loops

76
中国科学院计算技术研究所

Example: saxpy Kernel with OpenMP task

Original loop:

  for (int i=0; i<N; ++i) {
    a[i] += b[i] * s;
  }

- Difficult to determine the right grain size:
  - 1 single iteration: too fine
  - the whole loop: no parallelism

Manually transformed code (blocking technique); the outer loop iterates across the tiles, the inner loop iterates within a tile:

  for (int i=0; i<N; i+=TS) {
    int ub = min(N, i+TS);
    for (int ii=i; ii<ub; ii++) {
      a[ii] += b[ii] * s;
    }
  }

Task version:

  #pragma omp parallel
  #pragma omp single
  for (int i=0; i<N; i+=TS) {
    int ub = min(N, i+TS);
  #pragma omp task shared(a, b, s)
    for (int ii=i; ii<ub; ii++) {
      a[ii] += b[ii] * s;
    }
  }

- Drawback: you have to rename all your loop indices!
- Improving programmability: OpenMP taskloop

77
中国科学院计算技术研究所

The taskloop Construct


 Parallelize a loop using OpenMP tasks
 Cut loop into chunks
 Create a task for each loop chunk

 C/C++
#pragma omp taskloop [simd] [clause[[,] clause],…]
for-loops
 Fortran
!$omp taskloop[simd] [clause[[,] clause],…]
do-loops
[!$omp end taskloop [simd]]

 Loop iterations are distributed over the tasks.


 With simd the resulting loop uses SIMD
instructions.

78
中国科学院计算技术研究所

Clauses for the taskloop Construct


 taskloop constructs inherit clauses both from worksharing constructs and the
task construct
 shared, private
 firstprivate, lastprivate
 default
 collapse
 final, untied, mergeable
 allocate
 in_reduction / reduction

 grainsize(grain-size)
 Chunks have at least grain-size and maximally 2 × grain-size loop iterations

 num_tasks(num-tasks)
 Create num-tasks tasks for iterations of the loop

79
中国科学院计算技术研究所

Example: saxpy Kernel with OpenMP taskloop

Original loop:

  for (int i=0; i<N; ++i) {
    a[i] += b[i] * s;
  }

Manual blocking:

  for (int i=0; i<N; i+=TS) {
    int ub = min(N, i+TS);
    for (int ii=i; ii<ub; ii++) {
      a[ii] += b[ii] * s;
    }
  }

  #pragma omp parallel
  #pragma omp single
  for (int i=0; i<N; i+=TS) {
    int ub = min(N, i+TS);
  #pragma omp task shared(a, b, s)
    for (int ii=i; ii<ub; ii++) {
      a[ii] += b[ii] * s;
    }
  }

taskloop:

  #pragma omp taskloop grainsize(TS)
  for (int i=0; i<N; ++i) {
    a[i] += b[i] * s;
  }

- Easier to apply than manual blocking:
  - the compiler implements the mechanical transformation
  - less error-prone, more productive

80
中国科学院计算技术研究所

Task Dependencies

81
中国科学院计算技术研究所

The depend Clause


  C/C++:
  #pragma omp task depend(dependency-type: list)
  structured block

  Fortran:
  !$omp task depend(dependency-type: list)
  structured block
  !$omp end task

- A task dependence is fulfilled when the predecessor task has completed.
- in dependency-type: the generated task will be a dependent task of all previously generated sibling tasks that reference at least one of the list items in an out or inout clause.
- out and inout dependency-type: the generated task will be a dependent task of all previously generated sibling tasks that reference at least one of the list items in an in, out, or inout clause.
- mutexinoutset dependency-type: supports mutually exclusive inout sets; requires OpenMP >= 5.0.
- The list items in a depend clause may include array sections.

82
中国科学院计算技术研究所

Task Synchronization with Dependencies

int x = 0;
#pragma omp parallel
#pragma omp single
{
#pragma omp task depend(in: x)        // task 1
  std::cout << x << std::endl;
#pragma omp task                      // task 2
  long_running_task();
#pragma omp task depend(inout: x)     // task 3
  x++;
}

Task 3 depends on task 1 (it must wait until the read of x has completed); task 2 has no depend clause and may run at any time.

83
中国科学院计算技术研究所

Lecture Outline
 Introduction to OpenMP
 Basic1: Data scoping & Reduction operations
 Basic2: Work Sharing Schemes
 Basic3: Synchronization and its issues
 Basic4: Tasking
 Advanced1: OpenMP correctness
 Advanced2: OpenMP performance pitfalls
 Advanced3: Data placement on ccNUMA architecture

84
中国科学院计算技术研究所

OpenMP correctness: Two types of problems


- Race condition
  - Definition: two threads access the same shared variable, at least one thread modifies the variable, and the sequence of the accesses is undefined, i.e. unsynchronized.
  - The result of the program depends on the detailed timing of the threads in the team.
  - This is often caused by unintended sharing of data.
  - Consequence: wrong results.

- Deadlock
  - Threads lock up waiting on a locked resource that will never become free.
  - Consequence: the program hangs forever.

- Both are very common, and both may be hard to detect!


85
中国科学院计算技术研究所

Race condition
 Example
  real :: x
  !$OMP PARALLEL DO PRIVATE(x)
  do i=1, N
    call get_lcgrand(x)
    if(x.lt.0.5) then
      a(i) = a(i) * 2
    endif
  enddo
  !$OMP END PARALLEL DO

  subroutine get_lcgrand(r)
  real :: r
  integer, save :: seed=1
  seed = mod(106*seed+1283,6075)
  r = sngl(seed)/6075.
  end subroutine

- seed is a shared variable, because it has the save attribute (equivalent to static in C/C++)
- The sequence of generated numbers is unpredictable!
- Possible solutions:
  - use synchronization constructs (critical/atomic/locks)
  - use privatization
  - use reductions
中国科学院计算技术研究所

Race condition: Remedies


- Synchronization constructs can avoid concurrent access:

  real :: x
  !$OMP PARALLEL DO PRIVATE(x)
  do i=1, N
  !$OMP CRITICAL
    call get_lcgrand(x)      ! only called by one thread at a time!
  !$OMP END CRITICAL
    if(x.lt.0.5) then
      a(i) = a(i) * 2
    endif
  enddo
  !$OMP END PARALLEL DO

- Advantage: simple
- Disadvantages:
  - performance implications (overhead cost, serialization)
  - danger of deadlocks (see later)

87
中国科学院计算技术研究所

Race condition: Remedies


- Privatization is sometimes an alternative to synchronization:

  real :: x
  integer :: seed
  !$OMP PARALLEL PRIVATE(x,seed)
  seed = omp_get_thread_num()*10
  !$OMP DO
  do i=1, N
    call get_lcgrand(x,seed)   ! one copy of seed per thread -> one stream of RNs per thread
    if(x.lt.0.5) then
      a(i) = a(i) * 2
    endif
  enddo
  !$OMP END DO
  !$OMP END PARALLEL

  subroutine get_lcgrand(r,s)
  real :: r
  integer :: s
  s = mod(106*s+1283,6075)
  r = real(s)/6075.
  end subroutine

- Advantage: no synchronization required on every access
- Caveat: privatization may alter the program semantics (as here)
88
中国科学院计算技术研究所

Deadlock
 Example 1
  x = 0; y = 0;

  #pragma omp parallel
  {
  #pragma omp for
    for(i=1; i<=N; i++){
      if (i < 0.3*N)
      { omp_set_lock(&lock_x);
        x = x + i;
        omp_set_lock(&lock_y);
        y = y + i;
        omp_unset_lock(&lock_y);
        omp_unset_lock(&lock_x);
      } else
      { omp_set_lock(&lock_y);
        y = y + 2*i;
        omp_set_lock(&lock_x);
        x = x + 2*i;
        omp_unset_lock(&lock_x);
        omp_unset_lock(&lock_y);
      } /*endif*/
    }
  }/* end parallel*/

Two locks (possibly) acquired by two threads in different orders lead to a deadlock.
89
中国科学院计算技术研究所

Deadlock – Example 1: Remedies


- Choose a consistent locking order:

  x = 0; y = 0;

  #pragma omp parallel
  {
  #pragma omp for
    for(i=1; i<=N; i++){
      if (i < 0.3*N)
      { omp_set_lock(&lock_x);
        x = x + i;
        omp_set_lock(&lock_y);
        y = y + i;
        omp_unset_lock(&lock_y);
        omp_unset_lock(&lock_x);
      } else
      { omp_set_lock(&lock_x);
        x = x + 2*i;
        omp_set_lock(&lock_y);
        y = y + 2*i;
        omp_unset_lock(&lock_y);
        omp_unset_lock(&lock_x);
      } /*endif*/
    }
  }/* end parallel*/

- Or pull the locks out of the branch:

  x = 0; y = 0;

  #pragma omp parallel private(f)
  {
  #pragma omp for
    for(i=1; i<=N; i++){
      if (i < 0.3*N)
        f = 1;
      else
        f = 2;
      omp_set_lock(&lock_x);
      x = x + f*i;
      omp_set_lock(&lock_y);
      y = y + f*i;
      omp_unset_lock(&lock_y);
      omp_unset_lock(&lock_x);
    }
  }/* end parallel*/
90
中国科学院计算技术研究所

Deadlock – Example 1: Remedies


 Eliminate the locks altogether by using a reduction clause:

x = 0; y = 0;

# pragma omp parallel private(f) reduction(+:x,y)


{
# pragma omp for
for(i=1; i<=N; i++){
if (i < 0.3*N)
f = 1;
else
f = 2;
x = x + f*i;
y = y + f*i;
}
}/* end parallel*/

 Guideline: “privatization instead of synchronization”

91
中国科学院计算技术研究所

Deadlock
 Example 2

  !$OMP PARALLEL DO PRIVATE(x)
  do i=1,N
    x = SIN(2*PI*x/N)
  !$OMP CRITICAL
    sum = sum + func(x)
  !$OMP END CRITICAL
  enddo
  !$OMP END PARALLEL DO
  ...
  double precision function func(v)
  double precision :: v
  !$OMP CRITICAL
  func = v + random_func()
  !$OMP END CRITICAL
  end function func

Nested critical regions deadlock even with a single thread, since they are understood to protect the same resource.

92
中国科学院计算技术研究所

Deadlock – Example 2: Remedies


- Named critical regions decouple independent resources:

  !$OMP PARALLEL DO PRIVATE(x)
  do i=1,N
    x = SIN(2*PI*x/N)
  !$OMP CRITICAL (psum)
    sum = sum + func(x)
  !$OMP END CRITICAL (psum)
  enddo
  !$OMP END PARALLEL DO
  ...
  double precision function func(v)
  double precision :: v
  !$OMP CRITICAL (prand)
  func = v + random_func()
  !$OMP END CRITICAL (prand)
  end function func

- The names make the critical regions protect disjoint resources -> the deadlock is gone.

93
中国科学院计算技术研究所

94
中国科学院计算技术研究所

Lecture Outline
 Introduction to OpenMP
 Basic1: Data scoping & Reduction operations
 Basic2: Work Sharing Schemes
 Basic3: Synchronization and its issues
 Basic4: Tasking
 Advanced1: OpenMP correctness
 Advanced2: OpenMP performance pitfalls
 Advanced3: Data placement on ccNUMA architecture

95
中国科学院计算技术研究所

OpenMP overheads
  !$OMP PARALLEL PRIVATE(k)     ! "wake up" team of threads
  do k=1,NITER
  !$OMP DO SCHEDULE(...)        ! workload distribution (~10s of cycles)
    do i=1,N                    ! loop parallelization
      A(i)=B(i)+C(i)*D(i)
    enddo
  !$OMP END DO                  ! implicit barrier / synchronization
  enddo
  !$OMP END PARALLEL            ! "retire" team of threads

96
中国科学院计算技术研究所

OpenMP overheads: Cost of barrier


- Cost [cycles] of #pragma omp barrier (2.2 GHz, 10 cores per socket):

  2 threads           Intel 16.0   GCC 5.3.0
  Shared L3              599          425
  SMT threads            612          423
  Other socket          1486         1067

  -> strong topology dependence; strong dependence on compiler, CPU and system environment!

  Full domain         Intel 16.0   GCC 5.3.0
  Socket (10 cores)     1934         1301
  Node (20 cores)       4999         7783
  Node + SMT            5981         9897

  -> overhead grows with thread count

97
中国科学院计算技术研究所

OpenMP overheads: Barrier implementation (reminder)


- How does a "barrier" scale (best case)?

  c(N) = const * 2 * log2(N)

  where N is the number of threads/processes in the barrier.

  (Figure: tree-structured barrier among threads 0..7, combining and broadcasting in log2(N) steps each.)

98
中国科学院计算技术研究所

OpenMP overheads: Barrier cost on Intel


Xeon Phi (KNC)

- Intel Xeon Phi ("Knights Corner"): 60 cores, 4 SMT threads per core, 1 GHz clock speed
- (Figure: barrier cost vs. thread count on Xeon Phi.)

99
中国科学院计算技术研究所

Serialization: coarse-grained locking


- "Big fat lock":

  double precision, dimension(N,N) :: M
  integer :: c
  ...
  !$OMP PARALLEL DO PRIVATE(c)
  do i=1,K
    c = col_calc(i)
  !$OMP CRITICAL
    M(:,c) = M(:,c) + f(c)
  !$OMP END CRITICAL
  enddo
  !$OMP END PARALLEL DO

- All the computational work is protected by the critical region
- The computation may be serialized!

100
中国科学院计算技术研究所

Serialization: Finer-grained locking


with OpenMP locks

- Spend one lock per matrix column:

  double precision, dimension(N,N) :: M
  integer (kind=omp_lock_kind), dimension(N) :: locks
  integer :: c
  !$OMP PARALLEL
  !$OMP DO
  do i=1,N
    call omp_init_lock(locks(i))
  enddo
  !$OMP END DO
  ...
  !$OMP DO PRIVATE(c)
  do i=1,K
    c = col_calc(i)
    call omp_set_lock(locks(c))
    M(:,c) = M(:,c) + f(c)
    call omp_unset_lock(locks(c))
  enddo
  !$OMP END DO
  !$OMP END PARALLEL

- Each matrix column is protected by its own lock
- Serialization occurs only if two threads access the same column
- Caveat: setting/releasing locks costs time!
中国科学院计算技术研究所

False sharing
- Example: histogram

  Serial version:

  integer, dimension(8) :: S
  integer :: IND
  S = 0
  do i=1,N
    IND = A(i)
    S(IND) = S(IND) + 1
  enddo

  Parallel version with one histogram per thread:

  integer, dimension(:,:), allocatable :: S
  integer :: IND,ID,NT
  !$OMP PARALLEL PRIVATE(ID,IND)
  !$OMP SINGLE
  NT = omp_get_num_threads()
  allocate(S(0:NT,8))
  S = 0
  !$OMP END SINGLE
  ID = omp_get_thread_num() + 1
  !$OMP DO
  do i=1,N
    IND = A(i)
    S(ID,IND) = S(ID,IND) + 1
  enddo
  !$OMP END DO NOWAIT
  !$OMP CRITICAL
  do j=1,8
    S(0,j) = S(0,j) + S(ID,j)
  enddo
  !$OMP END CRITICAL
  !$OMP END PARALLEL

- A "big fat lock" would be hazardous -> use a separate histogram for each thread
- Manual collection of the results at the end
- The program does not scale - why?
中国科学院计算技术研究所

False sharing
- Why does the histogram code not scale?
- Example: 5 threads, cache line size = 8 DP words
  - the per-thread histograms S(id,v), id = 0..4, v = 1..8, occupy 5 cache lines that are shared (interleaved) across the 5 threads
  - massive write-allocate/evict/invalidate traffic between the caches!
103
中国科学院计算技术研究所

Eliminating false sharing


- One option: padding
  - allocate S(0:NT*CL,8) instead of S(0:NT,8), where CL is the cache line length in array elements
  - every thread gets its own cache line for each histogram entry (the rest of the line is padding)
  - a C sketch of this idea follows below
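
A C sketch of the padding idea; the cache line length CL and the bin range of A[] are assumptions for illustration:

#include <stdlib.h>
#include <omp.h>

enum { NBINS = 8, CL = 8 };              /* assume a 64-byte line holds 8 longs */

void histogram(int n, const int *A /* values in 0..NBINS-1 */, long *result) {
  int nt = omp_get_max_threads();
  long (*S)[NBINS * CL] = calloc(nt, sizeof *S);   /* one padded row per thread */

#pragma omp parallel
  {
    int id = omp_get_thread_num();
#pragma omp for
    for (int i = 0; i < n; i++)
      S[id][A[i] * CL]++;                /* each counter sits in its own cache line */
  }

  for (int b = 0; b < NBINS; b++) {      /* collect the per-thread histograms */
    result[b] = 0;
    for (int t = 0; t < nt; t++)
      result[b] += S[t][b * CL];
  }
  free(S);
}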
104
中国科学院计算技术研究所

Lecture Outline
 Introduction to OpenMP
 Basic1: Data scoping & Reduction operations
 Basic2: Work Sharing Schemes
 Basic3: Synchronization and its issues
 Basic4: Tasking
 Advanced1: OpenMP correctness
 Advanced2: OpenMP performance pitfalls
 Advanced3: Data placement on ccNUMA architecture

105
中国科学院计算技术研究所

Memory Locality Problems


 ccNUMA:
 whole memory is transparently accessible by all processors
 but physically distributed
 with varying bandwidth and latency
 and potential contention (shared memory paths)
 How do we make sure that memory access is always as
"local" and "distributed" as possible?

C C C C C C C C

M M M M

106
中国科学院计算技术研究所

ccNUMA node layout – Golden Rule


- All modern multi-socket servers are of ccNUMA type
- First touch policy ("Golden Rule"): a memory page is mapped into the locality domain of the processor that first writes to it
- Consequences:
  - mapping happens at initialization, not at allocation
  - initialization and computation must use the same memory pages per thread/process
  - affinity matters! Where are my threads/processes running?
  - it is sufficient to touch a single item to map the entire page

  double precision, dimension(:), allocatable :: huge
  allocate(huge(N))
  ! memory not mapped yet
  do i=1,N
    huge(i) = 0.d0   ! mapping here!
  enddo
107
中国科学院计算技术研究所

Coding for Data Locality


- Simplest case: explicit initialization

  Serial initialization (all pages mapped into one locality domain):

  integer,parameter :: N=1000000
  real*8 A(N), B(N)

  A=0.d0

  !$OMP parallel do
  do I = 1, N
    B(i) = function ( A(i) )
  end do
  !$OMP end parallel do

  Parallel first-touch initialization:

  integer,parameter :: N=1000000
  real*8 A(N),B(N)

  !$OMP parallel do
  do I = 1, N
    A(i)=0.d0
  end do
  !$OMP end parallel do

  !$OMP parallel do
  do I = 1, N
    B(i) = function ( A(i) )
  end do
  !$OMP end parallel do
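
A C sketch of the same first-touch pattern; do_something() is an illustrative placeholder:

#include <stdlib.h>

double do_something(double x);           /* illustrative placeholder */

void init_and_compute(int n) {
  double *A = malloc(n * sizeof *A);
  double *B = malloc(n * sizeof *B);

#pragma omp parallel for schedule(static)   /* pages of A are first touched by the
                                               threads that later compute on them */
  for (int i = 0; i < n; i++)
    A[i] = 0.0;

#pragma omp parallel for schedule(static)   /* same static schedule -> same page-to-thread mapping */
  for (int i = 0; i < n; i++)
    B[i] = do_something(A[i]);

  free(A);
  free(B);
}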

108
中国科学院计算技术研究所

Coding for Data Locality


- Sometimes initialization is not so obvious: I/O cannot easily be parallelized, so "localize" the arrays before the I/O:

  Without localization:

  integer,parameter :: N=1000000
  real*8 A(N), B(N)

  READ(1000) A

  !$OMP parallel do
  do I = 1, N
    B(i) = function ( A(i) )
  end do
  !$OMP end parallel do

  With first-touch localization before the read:

  integer,parameter :: N=1000000
  real*8 A(N),B(N)

  !$OMP parallel do
  do I = 1, N
    A(i)=0.d0
  end do
  !$OMP end parallel do

  READ(1000) A

  !$OMP parallel do
  do I = 1, N
    B(i) = function ( A(i) )
  end do
  !$OMP end parallel do

109
中国科学院计算技术研究所

Coding for Data Locality


 Required condition: OpenMP loop schedule of initialization
must be the same as in all computational loops
 best choice: static! Specify explicitly on all NUMA-sensitive
loops, just
to be sure…
 imposes some constraints on possible optimizations (e.g. load
balancing)

 How about (automatically initialized) global objects?


 better not use them
 if communication vs. computation is favorable, might consider properly
placed copies of global data
 in C++, STL allocators provide an elegant solution

110
中国科学院计算技术研究所

Diagnosing Bad Locality


 If your code is cache-bound, you might not notice any locality
problems (see previous slide)

 Otherwise, bad locality limits scalability at very low CPU


numbers (whenever a node boundary is crossed)

 Consider using performance counters


 likwid-perfctr can be used to measure nonlocal memory accesses
 Example for Intel Westmere (2 sockets, 4 threads/socket):

  env OMP_NUM_THREADS=8 likwid-perfctr -g MEM -C S0:0-3@S1:0-3 ./a.out

111
中国科学院计算技术研究所

Memory Locality Problems


- Locality of reference is key to scalable performance on ccNUMA
  - less of a problem with distributed-memory (MPI) programming, but see below
- What factors can destroy locality?
  - Shared-memory programming (OpenMP, ...):
    - threads lose their association with the CPU on which the initial mapping took place
    - improper initialization of distributed data
    - dynamic/non-deterministic access patterns
  - MPI programming:
    - processes lose their association with the CPU on which the mapping took place originally
    - the OS kernel tries to maintain strong affinity, but sometimes fails
  - All cases:
    - other agents (e.g., the OS kernel) may fill memory with data that prevents optimal placement of user data
112
中国科学院计算技术研究所

ccNUMA problems beyond OpenMP


- The OS uses part of main memory for the disk buffer (file system) cache
- If the FS cache fills part of a locality domain, applications will probably allocate from foreign domains -> non-local access!
- Remedies:
  - drop FS cache pages after a user job has run (the administrator's job)
  - the user can run a "sweeper" code that allocates and touches all physical memory before starting the real application
  - this can also be done inside your own code

113
中国科学院计算技术研究所

ccNUMA: OS may try to solve the problem


- The operating system may support "page migration"
  - page migration: if a page is frequently requested by a remote NUMA domain, it is migrated to that NUMA domain
- Assume a 2-socket Haswell system (14 cores per socket) with Cluster-on-Die enabled -> 4 NUMA domains with 7 cores each
  - frequent remote access to a page from NUMA domain 0
  - after some time the OS may migrate the page to NUMA domain 0

114
中国科学院计算技术研究所

ccNUMA: OS driven page migration


  void dmvm(int n, int m, double *lhs, double *rhs, double *mat)
  {
    int r, c, offset;
  #pragma omp parallel for private(offset,c) schedule(static)
    for (r = 0; r < n; ++r) {
      offset = m * r;
      for (c = 0; c < m; ++c)
        lhs[r] += mat[c + offset] * rhs[c];
    }
  }

- Serial initialization of the matrix data
- Fill the socket first / no SMT
- Run a scaling experiment with OMP_NUM_THREADS = 1, 2, 4, 6, 8, ..., 28
- Run 10 tests (with iter dmvm calls each)
中国科学院计算技术研究所

ccNUMA: OS driven page migration, iter = 200

(Figure: dmvm performance vs. OMP_NUM_THREADS for iter = 200: saturation at 6 cores, about 20% performance fluctuations between the 10 runs.)

116
中国科学院计算技术研究所

ccNUMA: OS driven page migration

(Figure: dmvm performance vs. OMP_NUM_THREADS for iter = 200, 500, 1000 and 2000; saturation at 6 cores.)

117
中国科学院计算技术研究所

ccNUMA: OS driven page migration

- Page migration "takes time"
- Migration threshold: an OS parameter
- Works fine for regular access patterns
- (Figure: performance over time while all data initially resides in NUMA domain 0.)

118
中国科学院计算技术研究所

ccNUMA performance – things to consider


- First touch policy - programming:
  - parallelization of initialization must match the parallelization of the computation
  - thread pinning
  - granularity of placement: the page size!
  - C++: constructors, std::vector, ...
- Operating system issues:
  - OS/IO buffers
  - page migration

119
