
中国科学院计算技术研究所

Parallel Processing
Lecture 09: Advanced OpenMP Programming
2024-05-21

王银山 (Associate Professor)
高性能计算机研究中心
中国科学院计算技术研究所

1
中国科学院计算技术研究所

Lecture Outline
 Introduction to OpenMP
 Basic1: Data scoping & Reduction operations
 Basic2: Work Sharing Schemes
 Basic3: Synchronization and its issues
 Basic4: Tasking
 Advanced1: OpenMP correctness
 Advanced2: OpenMP performance pitfalls
 Advanced3: Data placement on ccNUMA architecture

2
中国科学院计算技术研究所

Introduction to OpenMP: Basics


- "Easy", incremental and portable parallel programming of shared-memory computers: OpenMP
- Standardized set of compiler directives & library functions: http://www.openmp.org/
  - FORTRAN, C and C++ interfaces are defined
  - Supported by most/all commercial compilers, GNU starting with 4.2
- An OpenMP program can be written to compile and execute on a single-processor machine just by ignoring the directives
  - API calls must be masked out, though
- Supports data parallelism
- References:
  - R. Chandra, L. Dagum, D. Kohr, D. Maydan, J. McDonald, R. Menon: Parallel Programming in OpenMP. Academic Press, San Diego, USA, 2000, ISBN 1-55860-671-8
  - B. Chapman, G. Jost, R. v. d. Pas: Using OpenMP. MIT Press, 2007, ISBN 978-0262533027
3
中国科学院计算技术研究所

Introduction to OpenMP: Shared-Memory model


Central concept of OpenMP programming: threads
- Threads access globally shared memory
- Data is either shared or private
  - shared data is available to all threads (in principle)
  - private data is visible only to the thread that owns it
- Data transfer is transparent to the programmer
- Synchronization takes place, mostly implicitly
- Tailored to data-parallel execution
- Other threading libraries are available, e.g. pthreads
4
中国科学院计算技术研究所

Introduction to OpenMP: Fork and join execution model


- Program start: only the master thread runs
- Parallel region: a team of threads is generated ("fork")
- Threads synchronize when leaving the parallel region ("join")
- Serial region: only the master thread executes; worker threads usually sleep
- Task (and data) distribution is possible via directives
- Often the best choice: 1 thread per core
5
中国科学院计算技术研究所

Introduction to OpenMP: Software Architecture

- Programmer's view: directives/pragmas in the application code, plus (a few) library routines
- User's view: environment variables determine resource allocation, scheduling strategies and other (implementation-dependent) behavior
- Operating system view: the parallel work is done by threads in the OS, which run on the cores of the hardware
6
中国科学院计算技术研究所

Introduction to OpenMP: Syntax in C/C++


 Include file: #include <omp.h>

- Compiler directive:

  #pragma omp directive [clause ...]
  structured block

- Conditional compilation: the compiler's OpenMP switch sets a preprocessor macro (acts like -D_OPENMP):

#ifdef _OPENMP

... do something

#endif
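
For instance, an OpenMP API call can be guarded by this macro so the same source still builds without the OpenMP switch (a minimal sketch; only standard API routines are used):

#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#endif

int main(void) {
#ifdef _OPENMP
  /* compiled only when the OpenMP switch is active */
  printf("OpenMP enabled, max threads: %d\n", omp_get_max_threads());
#else
  printf("serial build\n");
#endif
  return 0;
}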
7
中国科学院计算技术研究所

Introduction to OpenMP: Parallel execution


 #pragma omp parallel
structured block
 Makes structured block a parallel region: All code executed
between start and end of this region is executed by all threads.
 This includes subroutine calls within the region (unless
explicitly sequentialized)
 Local variables inside the block are automatically private
to each thread
 END PARALLEL required in Fortran

use omp_lib

!$OMP PARALLEL
call work(omp_get_thread_num(), omp_get_num_threads())
!$OMP END PARALLEL
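
A C/C++ counterpart of the Fortran snippet above could look like this (a sketch; work() stands for the user's routine and is not part of the OpenMP API):

#include <omp.h>

void work(int thread_id, int num_threads);   /* user routine, assumed to exist elsewhere */

void run_parallel(void) {
#pragma omp parallel
  {
    /* every thread of the team calls work() with its own id */
    work(omp_get_thread_num(), omp_get_num_threads());
  }
}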

8
中国科学院计算技术研究所

Introduction to OpenMP:
Data scoping – shared vs. private
The OpenMP memory model

Data in a parallel region can be:
- private to each executing thread
  - each thread has its own local copy of the data
- shared between threads
  - there is only one instance of the data, available to all threads
  - this does not mean that the instance is always visible to all threads!

An OMP clause specifies the scope of variables:
- Default: shared
- Specify private variables in a parallel region:
  #pragma omp parallel private(var1, tmp)
9
中国科学院计算技术研究所

Hello world
#include <stdio.h>
#include <omp.h>
int main(int argc, char* argv[]) {
#pragma omp parallel
  {
    printf("Hello world from %d!\n", omp_get_thread_num());
  }
  return 0;
}

- Compile switch: g++ hello.cpp -fopenmp
- Number of threads: determined by the shell variable OMP_NUM_THREADS, e.g. export OMP_NUM_THREADS=4
- Thread pinning, e.g. with LIKWID: likwid-pin -c 0-3 ./a.out
- Output ordering is not reproducible

 More environment variables available:


 Loop scheduling: OMP_SCHEDULE, Stacksize: OMP_STACKSIZE
 Dynamic adjustment of threads: OMP_DYNAMIC
10
中国科学院计算技术研究所

Lecture Outline
 Introduction to OpenMP
 Basic1: Data scoping & Reduction operations
 Basic2: Work Sharing Schemes
 Basic3: Synchronization and its issues
 Basic4: Tasking
 Advanced1: OpenMP correctness
 Advanced2: OpenMP performance pitfalls
 Advanced3: Data placement on ccNUMA architecture

11
中国科学院计算技术研究所

Introduction to OpenMP:
Data scoping – shared vs. private
 Default: All data in a parallel region is shared
This includes global data (global/static variables, C++ class variables)

 Exceptions:
1. Loop variables of parallel (“sliced”) loops are private (cf. workshare constructs)
2. Local (stack) variables within parallel region
3. Local data within enclosed function calls are private* unless declared static

- Stack size limits: it may be necessary to make large arrays static
  - This presupposes it is safe to do so!
  - If not: allocate the data dynamically
- As of OpenMP 3.0, OMP_STACKSIZE may be set at run time (increases the thread-specific stack size):

$ setenv OMP_STACKSIZE 100M

12
中国科学院计算技术研究所

Introduction to OpenMP:
Data scoping – side effects
- An incorrect shared attribute may lead to
  - correctness issues ("race conditions" -> incorrect results)
  - performance issues ("false sharing" -> (very) slow execution)
- Scoping of local function data and global data
  - is not changed by the parallel directive
  - the compiler cannot be assumed to have knowledge of how the data is used
- Recommendation: use
  #pragma omp parallel default(none)
  so that nothing is overlooked; the compiler then complains about every variable that has no explicit scoping attribute (a sketch follows below).
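
A minimal sketch of the effect of default(none); the variable names are illustrative only:

#include <stdio.h>

int main(void) {
  int n = 1000;
  double sum = 0.0;
  /* every variable used inside the region must now be scoped explicitly,
     otherwise the compiler reports an error */
#pragma omp parallel for default(none) shared(n) reduction(+:sum)
  for (int i = 0; i < n; i++) {
    sum += i;
  }
  printf("sum = %f\n", sum);
  return 0;
}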

13
中国科学院计算技术研究所

Private variables - Masking

#include <stdio.h>
#include <omp.h>
int main(int argc, char* argv[]) {
  int s = 0;
#pragma omp parallel private(s)
  {
    for (int i = 0; i < 10; i++) {
      s += i;
    }
  }
  return 0;
}

- At the fork, each thread gets its own (uninitialized) private copy s0..s3; the shared/global s persists but is inaccessible inside the region.
- At the join, the shared/global value is recovered (OpenMP 3.0).
- Masking is relevant for privatized variables defined in a scope outside the parallel region.
- Masking also applies to the association status of pointers and the allocation status of allocatable variables (Fortran).
14
中国科学院计算技术研究所

The firstprivate clause

#include <stdio.h>
#include <omp.h>
int main(int argc, char* argv[]) {
  int s = 0;
#pragma omp parallel firstprivate(s)
  {
    for (int i = 0; i < 10; i++) {
      s += i;
    }
  }
  return 0;
}

- Extension of private: the value of the master copy is transferred to the private copies at the fork; the shared s persists but is inaccessible inside the region.
- Restrictions: not a pointer, not assumed shape, not a subobject, the master copy must not itself be private, etc.
15
中国科学院计算技术研究所

The lastprivate clause

#include <stdio.h>
#include <omp.h>
int main(int argc, char* argv[]) {
  int s = 0;
#pragma omp parallel for lastprivate(s)   /* lastprivate requires a loop (or sections) construct */
  for (int i = 0; i < 10; i++) {
    s += i;
  }
  return 0;
}

- Extension of private: the value from the thread that executes the last iteration of the loop is transferred back to the master copy at the join.
- Restrictions similar to firstprivate
中国科学院计算技术研究所

Example1
#include <stdio.h>
#include <omp.h>

int main(int argc, char* argv[]) {
  int sum = 0;
#pragma omp parallel for shared(sum)
  for (int i = 0; i < omp_get_num_threads(); i++) {
    sum += i;    /* sum is shared */
    printf("tid %d, sum = %d\n", omp_get_thread_num(), sum);
  }
  return 0;
}

code is in PE/src/lec09/share_private.cpp
17
中国科学院计算技术研究所

Example1
#include <stdio.h>
#include <omp.h>
int main(int argc, char* argv[]) {
  int sum = 0;
#pragma omp parallel for private(sum)
  for (int i = 0; i < omp_get_num_threads(); i++) {
    sum += i;    /* sum is private */
    printf("tid %d, sum = %d\n", omp_get_thread_num(), sum);
  }
  printf("outside sum = %d\n", sum);   /* the private copies are gone; the original sum is unchanged */
  return 0;
}

code is in PE/src/lec09/share_private.cpp
18
中国科学院计算技术研究所

Introduction to OpenMP: Reduction operations


real :: s
!$omp parallel
!$omp do reduction(+:s)
do i = ...
  ...
  s = s + ...
end do
!$omp end do
... = ... * s
!$omp end parallel

- s is still shared before the reduction starts.
- At the start of the reduction each thread gets a private copy s0..s3, all set to 0.0; the shared s persists but is inaccessible.
- At the synchronization point:
  - the reduction operation is performed
  - the result is transferred to the master copy
  - restrictions similar to firstprivate
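
A complete C analogue of the pattern above, as a minimal sketch with illustrative data:

#include <stdio.h>

int main(void) {
  enum { N = 1000 };
  double a[N], s = 0.0;
  for (int i = 0; i < N; i++) a[i] = 1.0;   /* illustrative data */

#pragma omp parallel for reduction(+:s)
  for (int i = 0; i < N; i++) {
    s += a[i];   /* each thread accumulates into its private copy, initialized to 0.0 */
  }
  /* at the synchronization point the private copies are combined into s */
  printf("s = %f\n", s);
  return 0;
}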

19
中国科学院计算技术研究所

Introduction to OpenMP: Reduction operations


- Initial values of the reduction variable depend on the operation (rely on the algebraic rules!):

  Operation   Initial value
  +           0
  -           0
  *           1
  .and.       .true.
  .or.        .false.
  .eqv.       .true.
  .neqv.      .false.
  MAX         -HUGE(x)
  MIN         HUGE(x)
  IAND        all bits set
  IEOR        0
  IOR         0

  C/C++ has an analogous set.

- Consistency required between the operation specified in the clause and the update statement
  - subtract: X = expr - X is not allowed
- Multiple reductions:
  - multiple scalars:
    #pragma omp parallel for reduction(+:x, y, z)
  - or an array (Fortran):
    double precision :: a(n)
    !$omp parallel do reduction(*:a)
  - multiple operations:
    #pragma omp parallel for reduction(+:sum_i) reduction(max:max_i)
20
中国科学院计算技术研究所

Lecture Outline
 Introduction to OpenMP
 Basic1: Data scoping & Reduction operations
 Basic2: Work Sharing Schemes
 Basic3: Synchronization and its issues
 Basic4: Tasking
 Advanced1: OpenMP correctness
 Advanced2: OpenMP performance pitfalls
 Advanced3: Data placement on ccNUMA architecture

21
中国科学院计算技术研究所

The schedule clause


- Default scheduling:
  - implementation dependent
  - typical: largest possible chunks of as-equal-as-possible size ("static scheduling")
  - minimal overhead (precalculated iteration-space assignment)

- User-defined scheduling:
  #pragma omp for schedule(...)
  !$OMP do schedule(...)
  schedule( {static | dynamic | guided} [,chunk] )
  chunk: always a non-negative integer; if omitted, it has a schedule-dependent default value

- Static scheduling: schedule(static,10)
  - the iteration space is divided into chunks of 10 iterations, assigned to the threads in a fixed, precalculated pattern
  - problem: if some threads take long to complete their chunks, workload imbalance results

- Dynamic scheduling: schedule(dynamic, 10)
  - after a thread has completed a chunk, it is assigned a new one, until no chunks are left
  - synchronization overhead
  - default chunk value is 1
22
中国科学院计算技术研究所

OpenMP Scheduling of simple for loops

(Figure: iteration-to-thread mapping for OMP_SCHEDULE=static, OMP_SCHEDULE=static,10 and OMP_SCHEDULE=dynamic,10.)
中国科学院计算技术研究所

Example2

code is in PE/src/lec09/schedule.cpp
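
The referenced file is not reproduced here; a minimal sketch of what such a scheduling demo might look like:

#include <stdio.h>
#include <omp.h>

int main(void) {
  /* print which thread executes which iteration to visualize the chunk assignment;
     try OMP_SCHEDULE=static,10 or OMP_SCHEDULE=dynamic,10 together with schedule(runtime) */
#pragma omp parallel for schedule(runtime)
  for (int i = 0; i < 100; i++) {
    printf("iteration %3d executed by thread %d\n", i, omp_get_thread_num());
  }
  return 0;
}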
24
中国科学院计算技术研究所

OpenMP Schedule example

25
中国科学院计算技术研究所

Guided scheduling
 Size of chunks in dynamic schedule
 too small  large overhead
 too large  load imbalance
 Guided scheduling: dynamically vary chunk size.
 Size of each chunk is proportional to the number of unassigned iterations divided by
the number of threads in the team, decreasing to chunk-size (default = 1).

 Chunk size:
 means minimum chunk size (except perhaps final chunk)
 default value is 1

(Figure: guided scheduling of an iteration space with chunk = 7; chunk sizes decrease down to the minimum.)

 Both dynamic and guided scheduling useful for handling poorly balanced and
unpredictable workloads.

26
中国科学院计算技术研究所

Deferred scheduling
- auto: automatic scheduling
  - the programmer gives the implementation the freedom to use any possible mapping

- runtime: decided at run time
  !$OMP do schedule(runtime)
  #pragma omp for schedule(runtime)
  - the schedule is one of those on this or the previous two slides
  - determined by either setting OMP_SCHEDULE and/or calling omp_set_schedule() (which overrides the environment setting)
  - find out which one is active by calling omp_get_schedule()
  - if runtime scheduling is requested and OMP_SCHEDULE is not set, the implementation chooses a schedule

- Examples:
  - environment setting:
    export OMP_SCHEDULE="guided,4"
    ./a.out
  - call to the API routine (Fortran):
    call omp_set_schedule(omp_sched_dynamic,4)
    !$OMP parallel
    !$OMP do schedule(runtime)
    do ...
    end do
    !$OMP end do
  - call to the API routine (C/C++):
    omp_set_schedule(omp_sched_dynamic, 4);
    #pragma omp parallel
    #pragma omp for schedule(runtime)
    for (...) { }

27
中国科学院计算技术研究所

Collapsing loop nests


- Collapse nested loops into a single iteration space:

  !$OMP do collapse(2)
  do k=1, kmax
    do j=1, jmax
      ...
    end do
  end do
  !$OMP end do

  #pragma omp for collapse(2)
  for (k=0; k<kmax; ++k)
    for (j=0; j<jmax; ++j)
      ...

  - the argument specifies the number of loop nests to flatten
  - the collapsed iteration space is what is divided into chunks and distributed among threads
  - sequential execution of the iterations in all loops determines the order of iterations in the collapsed iteration space

- Logical iteration space, example kmax=3, jmax=3:

  iteration  0 1 2 3 4 5 6 7 8
  k          1 1 1 2 2 2 3 3 3
  j          1 2 3 1 2 3 1 2 3

- Restrictions:
  - iteration space computable at entry to the loop (rectangular)
  - CYCLE (Fortran) or continue (C/C++) only in the innermost loop

- Optimization effect:
  - may improve memory locality
  - may reduce data traffic between cores
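
A compilable C sketch of the collapsed nest above, counting the iterations of the flattened 3x3 space:

#include <stdio.h>

int main(void) {
  enum { KMAX = 3, JMAX = 3 };
  int count = 0;

#pragma omp parallel for collapse(2) reduction(+:count)
  for (int k = 0; k < KMAX; ++k)      /* the two nests form one iteration space of 9 */
    for (int j = 0; j < JMAX; ++j)
      count++;

  printf("iterations executed: %d\n", count);   /* prints 9 */
  return 0;
}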

28
中国科学院计算技术研究所

Performance Tuning: the nowait clause


- Remember: an OpenMP for/do performs an implicit synchronization (barrier) at loop completion
- Shooting yourself in the foot: variables modified in the loop must not be accessed afterwards unless explicit synchronization is performed; use a barrier for this
- Example: multiple loops in one parallel region (a sketch follows below)
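
A sketch of what such a region might look like (array and function names are illustrative; the two loops must not touch the same data for this to be correct):

/* illustrative sketch: all six arrays are assumed to be distinct */
void two_loops(int n, double *a, const double *b, const double *c,
               double *d, const double *e, const double *f) {
#pragma omp parallel
  {
#pragma omp for nowait   /* no barrier: threads proceed to the second loop immediately */
    for (int i = 0; i < n; i++)
      a[i] = b[i] + c[i];

#pragma omp for          /* safe only because this loop does not access a[] */
    for (int i = 0; i < n; i++)
      d[i] = e[i] * f[i];
  }                      /* implicit barrier at the end of the parallel region */
}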

29
中国科学院计算技术研究所

Explicit barrier synchronization


- The barrier construct is a stand-alone directive
- barrier synchronizes all threads
- Each barrier must be encountered by all threads in the team or by none at all

30
中国科学院计算技术研究所

Parallel sections
- Non-iterative work-sharing construct: distribute a set of structured blocks

  !$omp parallel
  !$omp sections
  !$omp section
  ! code block 1        <- e.g. thread 0
  !$omp section
  ! code block 2        <- e.g. thread 1
  ...
  !$omp end sections    <- synchronization (unless eliminated with nowait)
  !$omp end parallel

- Each block is executed exactly once by one of the threads in the team
  - if there are more threads than code blocks: excess threads wait at the synchronization point
- Restrictions:
  - a section directive must be within the lexical scope of a sections directive
  - sections binds to the innermost parallel region; only the threads executing the binding parallel region participate in the execution of the section blocks and in the implicit barrier (if not eliminated with nowait)
- Scheduling of sections to threads is implementation-dependent
- Allowed clauses on sections: private, firstprivate, lastprivate, reduction, nowait

31
中国科学院计算技术研究所

The single directive


#pragma omp parallel
{
  double s = ...;
#pragma omp single copyprivate(s)
  {
    s = ...;
  }
  ... = ... + s;     /* copyprivate(s) broadcasts the value of s to all threads */
}

- Only one thread executes the enclosed code block; all other threads wait until the block completes execution
- Allowed clauses: private, firstprivate, copyprivate, nowait
  - copyprivate and nowait appear on end single in Fortran, on single in C/C++:
    !$omp single
    s = ...
    !$omp end single copyprivate(s)
- Use it for updates of shared entities, but ... is single really a worksharing directive?

32
中国科学院计算技术研究所

Combining regions and work sharing


- Example:

  !$OMP parallel do
  ...
  !$OMP end parallel do

  #pragma omp parallel for
  ...

  is equivalent to

  !$omp parallel
  !$omp do
  ...
  !$omp end do
  !$omp end parallel

  #pragma omp parallel
  #pragma omp for
  ...

- Applies to most work-sharing constructs: do/for, workshare, sections
- Notes:
  - clauses for work-sharing constructs can appear on the combined construct; the reverse is not true (e.g. shared can only appear on a parallel region)
  - clauses on a work-sharing construct only apply to that specific construct's block

33
中国科学院计算技术研究所

Lecture Outline
 Introduction to OpenMP
 Basic1: Data scoping & Reduction operations
 Basic2: Work Sharing Schemes
 Basic3: Synchronization and its issues
 Basic4: Tasking
 Advanced1: OpenMP correctness
 Advanced2: OpenMP performance pitfalls
 Advanced3: Data placement on ccNUMA architecture

34
中国科学院计算技术研究所

Why do we need synchronization?


- OpenMP memory model
  - private (thread-local) data: no access by other threads
  - shared data: two views
    - temporary view: a thread may have modified data in its registers, cache, or another intermediate device
    - the content becomes inconsistent with that in the cache/memory of other cores
    - other threads cannot know that their copy of the data is invalid
- Example: two threads execute a = a + 1 on a shared variable a in the same parallel region -> race condition

35
中国科学院计算技术研究所

Possible results
a = 0

Thread 0: Thread 1:
a = a + 1 a = a + 1

- The following values of a could be observed on each thread after completion of the statement:

  Thread 0   Thread 1
  1          1
  1          2
  2          1

- The result may differ from run to run, depending on which thread is the last one
- After completion of the parallel region, the value may be 1 or 2

36
中国科学院计算技术研究所

Consequences and (theoretical) remedies


- For threaded code without synchronization this means:
  - multiple threads write to the same memory location -> the resulting value is unspecified
  - some threads read and another writes -> the result on the reading threads is unspecified

- Flush operation
  - is performed on a set of (shared) variables (the "flush-set") or on the whole thread-visible data state of a program
  - discards the temporary view: modified values are forced out to cache/memory; the next read access must be from cache/memory
  - further memory operations are only allowed after all involved threads complete the flush
  - restricts memory instruction reordering (by the compiler)

- Ensuring a consistent view of memory
  - assumption: we want to write a data item with the first thread and read it with the second
  - required order of execution:
    1. thread 1 writes to the shared variable
    2. thread 1 flushes the variable
    3. thread 2 flushes the same variable
    4. thread 2 reads the variable

37
中国科学院计算技术研究所

OpenMP flush syntax


 OpenMP directive for explicit flushing

!$omp flush[(var1[,var2,…])]

 Stand-alone directive
 applicable to all variables with shared scope
 including: SAVE, COMMON/module globals, shared dummy arguments, shared pointer dereferences
 If no variables specified, the flush-set
 encompasses all shared variables
which are accessible in the scope of the FLUSH directive
 potentially slower
 Implicit flush operations (with no list) occur at:
 All explicit and implicit barriers
 Entry to and exit from critical regions
 Entry to and exit from lock routines

38
中国科学院计算技术研究所

Barrier synchronization
 Explicit via directive:
 the execution flow of each thread blocks upon reaching the barrier until all threads have reached
the barrier
 flush synchronization of all accessible shared variables happens before all threads continue
 after the barrier, all shared variables have consistent value visible to all threads
 barrier may not appear within work-sharing code block
 e.g. !$omp do block, since this would imply deadlock

 Implicit for some directives:


 at the beginning and end of parallel regions
 at the end of do, single, sections, workshare blocks unless a nowait clause is specified
(where allowed)
 all threads in the executing team are synchronized
 this is what makes these directives “easy-and-safe-to-use”

39
中国科学院计算技术研究所

Relaxing synchronization requirements


 Use a nowait clause
 on end do / end sections / end single / end workshare
(Fortran)
 on for / sections / single (C/C++)
 removes the synchronization at end of block
 potential performance improvement
 especially if load imbalance occurs within the construct
 programmer’s responsibility to prevent races

40
中国科学院计算技术研究所

Critical regions
 The critical and atomic directives:
 each thread arriving at the code block executes it (in contrast to single)
 mutual exclusion: only one at a time within code block
 atomic: code block must be a single line update of a scalar entity of intrinsic type
with an intrinsic operation

Fortran (a unary operator is also allowed):
  !$omp critical
    block
  !$omp end critical

  !$omp atomic
    x = x <op> y

C/C++:
  #pragma omp critical
  { block }

  #pragma omp atomic
  x = x <op> y;
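
A small sketch of a typical atomic update in C (the function and variable names are illustrative):

/* count how many entries exceed a threshold; hits is shared by all threads */
long count_hits(int n, const double *x) {
  long hits = 0;
#pragma omp parallel for
  for (int i = 0; i < n; i++) {
    if (x[i] > 0.5) {
#pragma omp atomic      /* single update of a scalar with an intrinsic operation */
      hits++;
    }
  }
  return hits;
}

A reduction(+:hits) clause would achieve the same result with less overhead; atomic is shown here only to illustrate the directive.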

41
中国科学院计算技术研究所

Synchronizing effect of critical regions


- Mutual exclusion is only assured for the statements inside the block
  - i.e., subsequent threads executing the block are synchronized against each other
- If other statements access the shared variable, you may be in trouble:

  C/C++:
  #pragma omp parallel
  {
    ...
  #pragma omp atomic
    x = x + y;
    ...
    a = f(x, ...);
  }

  Fortran:
  !$omp parallel
    ...
  !$omp atomic
    x = x + y
    ...
    a = f(x, ...)
  !$omp end parallel

- Race on the read of x in a = f(x, ...): a barrier is required before this statement to ensure that all threads have executed their updates.

42
中国科学院计算技术研究所

Named critical
- Consider multiple updates of the same shared variable:

  subroutine foo()
  !$omp critical
  x = x + y
  !$omp end critical

  subroutine bar()
  !$omp critical
  x = x + z
  !$omp end critical

  The critical region is global: OK.

- Different shared variables:

  subroutine foo()
  !$omp critical
  x = x + y
  !$omp end critical

  subroutine bar()
  !$omp critical
  w = w + z
  !$omp end critical

  Mutual exclusion is not required here -> unnecessary loss of performance.

- Solution: use named criticals

  subroutine foo()
  !$omp critical (foo_x)
  x = x + y
  !$omp end critical (foo_x)

  subroutine bar()
  !$omp critical (foo_w)
  w = w + z
  !$omp end critical (foo_w)

  - mutual exclusion only if the same name is used for the critical region
  - atomic is bound to the updated variable -> the problem does not occur there

43
中国科学院计算技术研究所

The ordered clause and directive


- Statements must be within the body of a loop
- The directive acts similarly to single: threads do their work in the order of the sequential execution, i.e. in the order of the loop iterations
- Requires an ordered clause on the enclosing do/for construct
- Only effective if the code is executed in parallel
- Only one ordered region per loop

  C/C++:
  #pragma omp for ordered
  for (i=0; i<N; ++i) {
    O1
  #pragma omp ordered
    { O2 }
    O3
  }

  Fortran:
  !$OMP do ordered
  do I=1,N
    O1
  !$OMP ordered
    O2
  !$OMP end ordered
    O3
  end do
  !$OMP end do

44
中国科学院计算技术研究所

Two applications of ordered


- Loop contains a recursion:
  - the dependency requires serialization of only a small part of the loop (otherwise performance suffers)

  Fortran:
  !$OMP do ordered
  do I=2,N
    ... ! large block
  !$OMP ordered
    a(I) = a(I-1) + ...
  !$OMP end ordered
  end do
  !$OMP end do

  C/C++:
  #pragma omp for ordered
  for (i=1; i<N; ++i) {
    ... /* large block */
  #pragma omp ordered
    a[i] = a[i-1] + ...;
  }

- Loop contains I/O:
  - it is desired that the output (file) be consistent with the serial execution

  Fortran:
  !$OMP do ordered
  do I=1,N
    ... ! calculate a(I)
  !$OMP ordered
    write(unit,...) a(I)
  !$OMP end ordered
  end do
  !$OMP end do

  C/C++:
  #pragma omp for ordered
  for (i=0; i<N; ++i) {
    ... /* calculate a[i] */
  #pragma omp ordered
    printf("%e ", a[i]);
  }

45
中国科学院计算技术研究所

Mutual exclusion with locks


 A shared lock variable can be used to implement specifically designed
synchronization mechanisms
 In the following, var is of type
 Fortran: integer(omp_lock_kind)
 C/C++: omp_lock_t

 OpenMP lock variables must be only accessed by the lock routines

 Mutual exclusion bound to objects


 more flexible than critical regions

46
中国科学院计算技术研究所

OpenMP locks
- An OpenMP lock can be in one of the following 3 states:
  - uninitialized
  - unlocked
  - locked
- The task that sets the lock is then said to own the lock.
- Only the task that set the lock can unset it, returning it to the unlocked state.

 2 types of locks are supported:


 simple locks
 Can only be locked if unlocked.
 A thread may not attempt to re-lock a lock it already has acquired.
 nestable locks
 Owning thread can lock multiple times
 Owning thread must unlock the same number of times it locked it

47
中国科学院计算技术研究所

Lock routines (1)

 Fortran: omp_init_lock(var)
C/C++ omp_init_lock(omp_lock_t *var)
 initialize a lock
 initial state is unlocked
 what resources are protected by lock: defined by developer
 var not associated with a lock before this routine is called

 Fortran: omp_destroy_lock(var)
C/C++: omp_destroy_lock(omp_lock_t
*var)
 disassociate var from lock
 precondition:
 var must have been initialized
 var must be in unlocked state

48
中国科学院计算技术研究所

Lock routines (2)


 Assuming: lock variable var has been initialized

 Fortran: omp_set_lock(var)
C/C++: void omp_set_lock(omp_lock_t *var)
 blocks if lock not available
 set ownership and continue execution if lock available

 Fortran: omp_unset_lock(var)
C/C++: void omp_unset_lock(omp_lock_t *var)
 release ownership of lock
 ownership must have been established before

 Fortran: logical function omp_test_lock(var)


 C/C++: int omp_test_lock(omp_lock_t *var)
 does not block, tries to set ownership
 returns true if lock was set, false if not
 allows the task to do something else while the lock is held by another thread

49
中国科学院计算技术研究所

Example for using locks

Blocking use (acts like a critical region):

#include <omp.h>
omp_lock_t lock;
omp_init_lock(&lock);
#pragma omp parallel
{
  ...
  omp_set_lock(&lock);
  /* use resource protected by lock */
  omp_unset_lock(&lock);
  ...
}
omp_destroy_lock(&lock);

Non-blocking use (loop until the calling thread holds the lock):

#include <omp.h>
omp_lock_t lock;
omp_init_lock(&lock);
#pragma omp parallel
{
  ...
  while (!omp_test_lock(&lock)) {
    /* work unrelated to the lock-protected resource */
  }
  /* use lock-protected resource */
  omp_unset_lock(&lock);
  ...
}
omp_destroy_lock(&lock);

50
中国科学院计算技术研究所

Nestable Locks
 replace omp_*_lock by omp_*_nest_lock

 task owning a nestable lock may re-lock it multiple times


 a nestable lock is available if it is either unlocked
or
 it is already owned by the task executing
omp_set_nest_lock()or omp_test_nest_lock()

 re-locking increments nest count


 releasing the lock decrements nest count
 lock is unlocked once nest count is zero
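
A minimal sketch of nestable locks; the protected data and the call structure are illustrative, and omp_init_nest_lock() is assumed to have been called once beforehand:

#include <omp.h>

omp_nest_lock_t nlock;   /* initialized elsewhere with omp_init_nest_lock(&nlock) */

void update(void)        /* may also be called while the lock is already held */
{
  omp_set_nest_lock(&nlock);    /* nest count becomes 2 when called from work() */
  /* ... modify the protected data ... */
  omp_unset_nest_lock(&nlock);
}

void work(void)
{
  omp_set_nest_lock(&nlock);    /* nest count 1 */
  update();                     /* re-locking by the same task is allowed */
  omp_unset_nest_lock(&nlock);  /* lock becomes free once the nest count reaches 0 */
}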

51
中国科学院计算技术研究所

Lecture Outline
 Introduction to OpenMP
 Basic1: Data scoping & Reduction operations
 Basic2: Work Sharing Schemes
 Basic3: Synchronization and its issues
 Basic4: Tasking
 Advanced1: OpenMP correctness
 Advanced2: OpenMP performance pitfalls
 Advanced3: Data placement on ccNUMA architecture

52
中国科学院计算技术研究所

Task Execution Model


- Supports unstructured parallelism, e.g.
  - unbounded loops:

    while (<expr>) {        do while (<expr>)
      ...                     ...
    }                       end do

  - recursive functions:

    void myfunc(<args>)
    {
      ...
      myfunc(<newargs>);
      ...
    }

- Example of unstructured parallelism:

  #pragma omp parallel
  #pragma omp single
  while (elem != NULL) {
  #pragma omp task
    compute(elem);
    elem = elem->next;
  }

- Several scenarios are possible: single creator, multiple creators, nested tasks, ...
- All threads in the team are candidates to execute tasks

53
中国科学院计算技术研究所

The Execution Model

Task queue

54
中国科学院计算技术研究所

The task Construct


- Deferring (or not) a unit of work, executable by any member of the team:

  #pragma omp task [clause[[,] clause]...]     !$omp task [clause[[,] clause]...]
  {structured-block}                           ...structured-block...
                                               !$omp end task

- Clauses:
  - data environment: private, firstprivate, default(shared|none), in_reduction(r-id: list)
  - dependencies: depend(dep-type: list)
  - scheduler restriction: untied
  - scheduler hints: priority(priority-value), affinity(list)
  - cutoff strategies: if(scalar-expression), mergeable, final(scalar-expression)
  - other clauses: allocate([allocator:] list), detach(event-handler)

55
中国科学院计算技术研究所

What is a Task?
 Make OpenMP worksharing more flexible:
 allow the programmer to package code blocks and data items for execution
 this by definition is a task
 and assign these to an encountering thread
 possibly defer execution to a later time (“work queue”)
 Introduced with OpenMP 3.0 and extended over time

 When a thread encounters a task construct, a task is generated from the code of
the associated structured block.
 Data environment of the task is created (according to the data-sharing attributes,
defaults, …)
 “Packaging of data”
 The encountering thread may immediately execute the task, or defer its execution.
In the latter case, any thread in the team may be assigned the task.

56
中国科学院计算技术研究所

Example: Processing a Linked List

typedef struct list_s {
  struct list_s *next;
  contents *data;
} list;

void process_list(list *head)
{
#pragma omp parallel
  {
#pragma omp single
    {
      list *p = head;
      while (p) {
#pragma omp task
        { do_work(p->data); }
        p = p->next;
      }
    } /* all tasks done */
  }
}

Typical task generation loop:

#pragma omp parallel
{
#pragma omp single
  {
    while (p) {
#pragma omp task
      { /* task code */ }
    }
  } /* all tasks done */
}

57
中国科学院计算技术研究所

Example: Processing a Linked List


typedef struct list_s {
  struct list_s *next;
  contents *data;
} list;

void process_list(list *head)
{
#pragma omp parallel
  {
#pragma omp single
    {
      list *p = head;
      while (p) {
#pragma omp task
        { do_work(p->data); }
        p = p->next;
      }
    } /* all tasks done */
  }
}

Features of this example:
- one of the threads has the job of generating all tasks
- synchronization: at the end of the single block for all tasks created inside it
- no particular order between tasks is enforced here
- data scoping default for the task block: firstprivate
  - iterating through p is fine; this is the "packaging of data" mentioned earlier
- the task region includes the call of do_work()

58
中国科学院计算技术研究所

The if Clause
 When if argument is false –
 task becomes an undeferred task
 task body is executed immediately by encountering thread
 all other semantics stay the same (data environment, synchronization) as for a
“deferred”task

#pragma omp task if (sizeof(p->data) > threshold)


{ do_work(p->data); }

 User-directed optimization:
 avoid overhead for deferring small tasks
 cache locality / memory affinity may be lost by doing so

59
中国科学院计算技术研究所

Task Synchronization

60
中国科学院计算技术研究所

Task Synchronization with barrier and taskwait

- OpenMP barrier (implicit or explicit):
  - all tasks created by any thread of the current team are guaranteed to be completed at barrier exit

  C/C++: #pragma omp barrier        Fortran: !$omp barrier

  #pragma omp parallel
  #pragma omp single
  {
    while (elem != NULL) {
  #pragma omp task
      compute(elem);
      elem = elem->next;
    }
  } /* impl. barrier */

- Task barrier: taskwait
  - the encountering task is suspended until its child tasks are complete
  - applies to direct children only, not to further descendants!

  C/C++: #pragma omp taskwait       Fortran: !$omp taskwait

  #pragma omp parallel
  #pragma omp single
  {
  #pragma omp task
    {
      ... /* A */
  #pragma omp task
      { ... } /* B */
  #pragma omp task
      { ... } /* C */
  #pragma omp taskwait
    }
  } /* impl. barrier */
中国科学院计算技术研究所

Task Synchronization with taskgroup


- taskgroup construct: deep task synchronization
  - attached to a structured block
  - waits for completion of all descendants of the current task
  - task synchronization point (TSP) at the end (see later slides)

  #pragma omp taskgroup [allocate] [task_reduction(list)]
  {structured-block}

- Example: task A creates tasks B and C; C creates C.1 and C.2; the taskgroup waits for B, C, C.1 and C.2:

  #pragma omp parallel
  #pragma omp single
  {
  #pragma omp taskgroup
    {
  #pragma omp task
      { ... }                    /* B */
  #pragma omp task
      { ... #C.1; #C.2; ... }    /* C */
    } // end of taskgroup: wait for B, C and C's descendants
  }

62
中国科学院计算技术研究所

Task Synchronization Constructs: Comparison

Task synchronization constructs compared:

- barrier: either an implicit or an explicit barrier.
- taskwait: waits for the completion of the child tasks of the current task.
- taskgroup: waits for the completion of the child tasks of the current task and all of their descendants.

63
中国科学院计算技术研究所

Task Switching
 It is allowed for a thread to
 suspend a task during execution
 start (or resume) execution of another task (assigned to the same team)
 resume original task later
 Pre-condition:
 a task scheduling point (TSP) is reached
 Example from previous slide:
 the taskwait directive implies a task scheduling point
 Another example:
 very many tasks are generated
 implementation can suspend generation of tasks and start processing the existing
queue
 Nearly all task scheduling points are implicit
 taskyield construct only explicit one

64
中国科学院计算技术研究所

Task Scheduling Points


 The point immediately following the generation of an explicit task.
 After the point of completion of a task region.
 At a taskyield directive.
 At a taskwait directive.
 At the end of a taskgroup region.
 At an implicit or explicit barrier directive.

65
中国科学院计算技术研究所

Example: Dealing with Large Number of Tasks


#define LARGE_N 10000000
double item[LARGE_N];
extern void process(double);

int main()
{
#pragma omp parallel
  {
#pragma omp single
    {
      for (int i = 0; i < LARGE_N; i++) {
#pragma omp task       /* task scheduling point; item is shared, i is firstprivate */
        process(item[i]);
      }
    }
  }
  return 0;
}

Features of this example:
- generates a large number of tasks with one thread and executes them with the threads in the team
- the implementation may reach its limit on unassigned tasks
- if it does, the implementation is allowed to cause the thread executing the task-generating loop to suspend its task at the scheduling point and start executing unassigned tasks
- once the number of unassigned tasks is sufficiently low, the thread may resume execution of the task-generating loop

66
中国科学院计算技术研究所

Example: The taskyield Directive


- The taskyield directive introduces an explicit task scheduling point (TSP); it may cause the calling task to be suspended.

  #include <omp.h>

  void something_useful();
  void something_critical();

  void foo(omp_lock_t *lock, int n)
  {
    for (int i = 0; i < n; i++) {
  #pragma omp task
      {
        something_useful();
        while (!omp_test_lock(lock)) {
  #pragma omp taskyield     /* the waiting task may be suspended here, allowing the
                               executing thread to perform other work; this may also
                               avoid deadlock situations */
        }
        something_critical();
        omp_unset_lock(lock);
      }
    }
  }

67
中国科学院计算技术研究所

Task Switching: Tied and Untied Tasks


 Default behavior:
 a task assigned to a thread must be (eventually) completed by that thread
 task is tied to the thread
 Change this via the untied clause #pragma omp task untied
 execution of task block may change to another structured-block
thread of the team at any task scheduling point
 implementation may add task scheduling points beyond those previously
defined (outside programmer‘s control!)
 Deployment of untied tasks
 starvation scenarios: running out of tasks while generating thread is still
working on something
 Dangers:
 more care required (compared with tied tasks) wrt. scoping and
synchronization

68
中国科学院计算技术研究所

Final and mergeable Tasks


- Final tasks:
  - use a final clause with a condition
  - final tasks are always undeferred, i.e. executed immediately by the encountering thread, reducing the overhead of placing tasks in the "task pool"
  - all tasks created inside a final task region are also final
  - different from an if clause
  - use omp_in_final() to test whether the current task is final
- Merged tasks:
  - a mergeable clause may create a merged task if the task is undeferred or final
  - a merged task has the same data environment as its creating task region
  - the clause was introduced to reduce data/memory requirements
- final and/or mergeable can be used for optimization purposes, e.g. to optimize the wind-down phase of a recursive algorithm (see the sketch below)
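
A sketch of how final and mergeable might be used to cut off task creation deep in a recursion (the node type and process() routine are illustrative):

typedef struct node { struct node *left, *right; } node_t;
void process(node_t *p);                 /* illustrative work routine */

void traverse(node_t *p, int depth) {
  if (p == NULL) return;
#pragma omp task final(depth > 20) mergeable   /* deep levels become undeferred and may be merged */
  {
    traverse(p->left,  depth + 1);       /* tasks created inside a final task are also final */
    traverse(p->right, depth + 1);
    process(p);
  }
}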

69
中国科学院计算技术研究所

Task data scoping

70
中国科学院计算技术研究所

Data Environment
 The task directive takes the following data attribute clauses that define the
data environment of the task:

 default (private | firstprivate | shared | none)


 private (list)
 firstprivate (list)
 shared (list)

71
中国科学院计算技术研究所

Data Scoping
 Some rules from Parallel Regions apply:
 Static and global variables are shared
 Automatic storage (stack) variables are private
 The OpenMP Standard says:
 The data-sharing attributes of variables that are not listed in data attribute clauses of
a task construct, and are not predetermined according to the OpenMP rules, are
implicitly determined as follows:
 In a task construct, if no default clause is present:
a) a variable that is determined to be shared in all enclosing constructs, up to
and including the innermost enclosing parallel construct, is shared.
b) a variable whose data-sharing attribute is not determined by rule (a) is
firstprivate.
int i = 0, j = 1;
#pragma omp parallel private(j)
#pragma omp single
{
  int k = 0;
#pragma omp task
  { /* use i, j, k */ }    /* i is shared; j and k are firstprivate */
}

72
中国科学院计算技术研究所

Task reduction

73
中国科学院计算技术研究所

Reminder: taskgroup construct


- taskgroup construct: deep task synchronization
  - attached to a structured block
  - waits for completion of all descendants of the current task
  - task synchronization point (TSP) at the end (see later slides)

  #pragma omp taskgroup [allocate] [task_reduction(list)]
  {structured-block}

- Example: task A creates tasks B and C; C creates C.1 and C.2; the taskgroup waits for B, C, C.1 and C.2:

  #pragma omp parallel
  #pragma omp single
  {
  #pragma omp taskgroup
    {
  #pragma omp task
      { ... }                    /* B */
  #pragma omp task
      { ... #C.1; #C.2; ... }    /* C */
    } // end of taskgroup: wait for B, C and C's descendants
  }

74
中国科学院计算技术研究所

Task Reductions using the taskgroup Construct


- Reduction operation (requires OpenMP >= 5.0):
  - performs some forms of recurrence calculation
  - associative and commutative operators

- The taskgroup scoping reduction clause:
    #pragma omp taskgroup task_reduction(op: list)
    {structured-block}
  (1) registers a new reduction; (3) computes the final result after the synchronization point.

- The task in_reduction clause:
    #pragma omp task in_reduction(op: list)
    {structured-block}
  (2) the task participates in the reduction operation.

- Example:

  int res = 0;
  node_t* node = head;
  ...
  #pragma omp parallel
  {
  #pragma omp single
    {
  #pragma omp taskgroup task_reduction(+: res)               // (1)
      {
        while (node) {
  #pragma omp task in_reduction(+: res) firstprivate(node)   // (2)
          {
            res += node->value;
          }
          node = node->next;
        }
      }                                                      // (3)
    }
  }

75
中国科学院计算技术研究所

Task Loops

76
中国科学院计算技术研究所

Example: saxpy Kernel with OpenMP task

Original loop:

  for (int i=0; i<N; ++i) {
    a[i] += b[i] * s;
  }

- Difficult to determine the right grain size:
  - 1 single iteration: too fine
  - the whole loop: no parallelism

Manually transformed code (blocking technique); the outer loop iterates across the tiles, the inner loop iterates within a tile:

  for (int i=0; i<N; i+=TS) {
    int ub = min(N, i+TS);
    for (int ii=i; ii<ub; ii++) {
      a[ii] += b[ii] * s;
    }
  }

Task version:

  #pragma omp parallel
  #pragma omp single
  for (int i=0; i<N; i+=TS) {
    int ub = min(N, i+TS);
  #pragma omp task shared(a, b, s)
    for (int ii=i; ii<ub; ii++) {
      a[ii] += b[ii] * s;
    }
  }

- Drawback: you have to rename all your loop indices!
- Improving programmability: OpenMP taskloop

77
中国科学院计算技术研究所

The taskloop Construct


 Parallelize a loop using OpenMP tasks
 Cut loop into chunks
 Create a task for each loop chunk

 C/C++
#pragma omp taskloop [simd] [clause[[,] clause],…]
for-loops
 Fortran
!$omp taskloop[simd] [clause[[,] clause],…]
do-loops
[!$omp end taskloop [simd]]

 Loop iterations are distributed over the tasks.


 With simd the resulting loop uses SIMD
instructions.

78
中国科学院计算技术研究所

Clauses for the taskloop Construct


 taskloop constructs inherit clauses both from worksharing constructs and the
task construct
 shared, private
 firstprivate, lastprivate
 default
 collapse
 final, untied, mergeable
 allocate
 in_reduction / reduction

 grainsize(grain-size)
 Chunks have at least grain-size and maximally 2 × grain-size loop iterations

 num_tasks(num-tasks)
 Create num-tasks tasks for iterations of the loop

79
中国科学院计算技术研究所

Example: saxpy Kernel with OpenMP taskloop

Original loop:

  for (int i=0; i<N; ++i) {
    a[i] += b[i] * s;
  }

Manual blocking:

  for (int i=0; i<N; i+=TS) {
    int ub = min(N, i+TS);
    for (int ii=i; ii<ub; ii++) {
      a[ii] += b[ii] * s;
    }
  }

  #pragma omp parallel
  #pragma omp single
  for (int i=0; i<N; i+=TS) {
    int ub = min(N, i+TS);
  #pragma omp task shared(a, b, s)
    for (int ii=i; ii<ub; ii++) {
      a[ii] += b[ii] * s;
    }
  }

taskloop:

  #pragma omp taskloop grainsize(TS)
  for (int i=0; i<N; ++i) {
    a[i] += b[i] * s;
  }

- Easier to apply than manual blocking:
  - the compiler implements the mechanical transformation
  - less error-prone, more productive

80
中国科学院计算技术研究所

Task Dependencies

81
中国科学院计算技术研究所

The depend Clause


  C/C++:
  #pragma omp task depend(dependency-type: list)
  structured block

  Fortran:
  !$omp task depend(dependency-type: list)
  structured block
  !$omp end task

- A task dependence is fulfilled when the predecessor task has completed.
- in dependency-type: the generated task will be a dependent task of all previously generated sibling tasks that reference at least one of the list items in an out or inout clause.
- out and inout dependency-type: the generated task will be a dependent task of all previously generated sibling tasks that reference at least one of the list items in an in, out, or inout clause.
- mutexinoutset dependency-type: supports mutually exclusive inout sets; requires OpenMP >= 5.0.
- The list items in a depend clause may include array sections.

82
中国科学院计算技术研究所

Task Synchronization with Dependencies

int x = 0;
#pragma omp parallel
#pragma omp single
{
#pragma omp task depend(in: x)        // task 1
  std::cout << x << std::endl;
#pragma omp task                      // task 2
  long_running_task();
#pragma omp task depend(inout: x)     // task 3
  x++;
}

Task 3 depends on task 1 (it must wait until the read of x has completed); task 2 has no depend clause and may run at any time.

83
中国科学院计算技术研究所

Lecture Outline
 Introduction to OpenMP
 Basic1: Data scoping & Reduction operations
 Basic2: Work Sharing Schemes
 Basic3: Synchronization and its issues
 Basic4: Tasking
 Advanced1: OpenMP correctness
 Advanced2: OpenMP performance pitfalls
 Advanced3: Data placement on ccNUMA architecture

84
中国科学院计算技术研究所

OpenMP correctness: Two types of problems


- Race condition
  - Definition: two threads access the same shared variable, at least one thread modifies the variable, and the sequence of the accesses is undefined, i.e. unsynchronized.
  - The result of the program depends on the detailed timing of the threads in the team.
  - This is often caused by unintended sharing of data.
  - Consequence: wrong results.

- Deadlock
  - Threads lock up waiting on a locked resource that will never become free.
  - Consequence: the program hangs forever.

- Both are very common, and both may be hard to detect!


85
中国科学院计算技术研究所

Race condition
 Example
  real :: x
  !$OMP PARALLEL DO PRIVATE(x)
  do i=1, N
    call get_lcgrand(x)
    if(x.lt.0.5) then
      a(i) = a(i) * 2
    endif
  enddo
  !$OMP END PARALLEL DO

  subroutine get_lcgrand(r)
  real :: r
  integer, save :: seed=1
  seed = mod(106*seed+1283,6075)
  r = sngl(seed)/6075.
  end subroutine

- seed is a shared variable, because it has the save attribute (equivalent to static in C/C++)
- The sequence of generated numbers is unpredictable!
- Possible solutions:
  - use synchronization constructs (critical/atomic/locks)
  - use privatization
  - use reductions
中国科学院计算技术研究所

Race condition: Remedies


- Synchronization constructs can avoid concurrent access:

  real :: x
  !$OMP PARALLEL DO PRIVATE(x)
  do i=1, N
  !$OMP CRITICAL
    call get_lcgrand(x)      ! only called by one thread at a time!
  !$OMP END CRITICAL
    if(x.lt.0.5) then
      a(i) = a(i) * 2
    endif
  enddo
  !$OMP END PARALLEL DO

- Advantage: simple
- Disadvantages:
  - performance implications (overhead cost, serialization)
  - danger of deadlocks (see later)

87
中国科学院计算技术研究所

Race condition: Remedies


- Privatization is sometimes an alternative to synchronization:

  real :: x
  integer :: seed
  !$OMP PARALLEL PRIVATE(x,seed)
  seed = omp_get_thread_num()*10
  !$OMP DO
  do i=1, N
    call get_lcgrand(x,seed)   ! one copy of seed per thread -> one stream of RNs per thread
    if(x.lt.0.5) then
      a(i) = a(i) * 2
    endif
  enddo
  !$OMP END DO
  !$OMP END PARALLEL

  subroutine get_lcgrand(r,s)
  real :: r
  integer :: s
  s = mod(106*s+1283,6075)
  r = real(s)/6075.
  end subroutine

- Advantage: no synchronization required on every access
- Caveat: privatization may alter the program semantics (as here)
88
中国科学院计算技术研究所

Deadlock
 Example 1
  x = 0; y = 0;

  #pragma omp parallel
  {
  #pragma omp for
    for(i=1; i<=N; i++){
      if (i < 0.3*N)
      { omp_set_lock(&lock_x);
        x = x + i;
        omp_set_lock(&lock_y);
        y = y + i;
        omp_unset_lock(&lock_y);
        omp_unset_lock(&lock_x);
      } else
      { omp_set_lock(&lock_y);
        y = y + 2*i;
        omp_set_lock(&lock_x);
        x = x + 2*i;
        omp_unset_lock(&lock_x);
        omp_unset_lock(&lock_y);
      } /*endif*/
    }
  }/* end parallel*/

Two locks (possibly) acquired by two threads in different orders lead to a deadlock.
89
中国科学院计算技术研究所

Deadlock – Example 1: Remedies


- Choose a consistent locking order:

  x = 0; y = 0;

  #pragma omp parallel
  {
  #pragma omp for
    for(i=1; i<=N; i++){
      if (i < 0.3*N)
      { omp_set_lock(&lock_x);
        x = x + i;
        omp_set_lock(&lock_y);
        y = y + i;
        omp_unset_lock(&lock_y);
        omp_unset_lock(&lock_x);
      } else
      { omp_set_lock(&lock_x);
        x = x + 2*i;
        omp_set_lock(&lock_y);
        y = y + 2*i;
        omp_unset_lock(&lock_y);
        omp_unset_lock(&lock_x);
      } /*endif*/
    }
  }/* end parallel*/

- Or pull the locks out of the branch:

  x = 0; y = 0;

  #pragma omp parallel private(f)
  {
  #pragma omp for
    for(i=1; i<=N; i++){
      if (i < 0.3*N)
        f = 1;
      else
        f = 2;
      omp_set_lock(&lock_x);
      x = x + f*i;
      omp_set_lock(&lock_y);
      y = y + f*i;
      omp_unset_lock(&lock_y);
      omp_unset_lock(&lock_x);
    }
  }/* end parallel*/
90
中国科学院计算技术研究所

Deadlock – Example 1: Remedies


 Eliminate the locks altogether by using a reduction clause:

x = 0; y = 0;

# pragma omp parallel private(f) reduction(+:x,y)


{
# pragma omp for
for(i=1; i<=N; i++){
if (i < 0.3*N)
f = 1;
else
f = 2;
x = x + f*i;
y = y + f*i;
}
}/* end parallel*/

 Guideline: “privatization instead of synchronization”

91
中国科学院计算技术研究所

Deadlock
 Example 2

  !$OMP PARALLEL DO PRIVATE(x)
  do i=1,N
    x = SIN(2*PI*x/N)
  !$OMP CRITICAL
    sum = sum + func(x)
  !$OMP END CRITICAL
  enddo
  !$OMP END PARALLEL DO
  ...
  double precision function func(v)
  double precision :: v
  !$OMP CRITICAL
  func = v + random_func()
  !$OMP END CRITICAL
  end function func

Nested critical regions deadlock even with a single thread, since they are understood to protect the same resource.

92
中国科学院计算技术研究所

Deadlock – Example 2: Remedies


- Named critical regions decouple independent resources:

  !$OMP PARALLEL DO PRIVATE(x)
  do i=1,N
    x = SIN(2*PI*x/N)
  !$OMP CRITICAL (psum)
    sum = sum + func(x)
  !$OMP END CRITICAL (psum)
  enddo
  !$OMP END PARALLEL DO
  ...
  double precision function func(v)
  double precision :: v
  !$OMP CRITICAL (prand)
  func = v + random_func()
  !$OMP END CRITICAL (prand)
  end function func

- The names make the critical regions protect disjoint resources -> the deadlock is gone.

93
中国科学院计算技术研究所

94
中国科学院计算技术研究所

Lecture Outline
 Introduction to OpenMP
 Basic1: Data scoping & Reduction operations
 Basic2: Work Sharing Schemes
 Basic3: Synchronization and its issues
 Basic4: Tasking
 Advanced1: OpenMP correctness
 Advanced2: OpenMP performance pitfalls
 Advanced3: Data placement on ccNUMA architecture

95
中国科学院计算技术研究所

OpenMP overheads
  !$OMP PARALLEL PRIVATE(k)     ! "wake up" team of threads
  do k=1,NITER
  !$OMP DO SCHEDULE(...)        ! workload distribution (~10s of cycles)
    do i=1,N                    ! loop parallelization
      A(i)=B(i)+C(i)*D(i)
    enddo
  !$OMP END DO                  ! implicit barrier / synchronization
  enddo
  !$OMP END PARALLEL            ! "retire" team of threads

96
中国科学院计算技术研究所

OpenMP overheads: Cost of barrier


- Cost [cycles] of #pragma omp barrier (2.2 GHz, 10 cores per socket):

  2 threads           Intel 16.0   GCC 5.3.0
  Shared L3              599          425
  SMT threads            612          423
  Other socket          1486         1067

  -> strong topology dependence; strong dependence on compiler, CPU and system environment!

  Full domain         Intel 16.0   GCC 5.3.0
  Socket (10 cores)     1934         1301
  Node (20 cores)       4999         7783
  Node + SMT            5981         9897

  -> overhead grows with thread count

97
中国科学院计算技术研究所

OpenMP overheads: Barrier implementation (reminder)


- How does a "barrier" scale (best case)?

  c(N) = const * 2 * log2(N)

  where N is the number of threads/processes in the barrier.

  (Figure: tree-structured barrier among threads 0..7, combining and broadcasting in log2(N) steps each.)

98
中国科学院计算技术研究所

OpenMP overheads: Barrier cost on Intel


Xeon Phi (KNC)

- Intel Xeon Phi ("Knights Corner"): 60 cores, 4 SMT threads per core, 1 GHz clock speed
- (Figure: barrier cost vs. thread count on Xeon Phi.)

99
中国科学院计算技术研究所

Serialization: coarse-grained locking


- "Big fat lock":

  double precision, dimension(N,N) :: M
  integer :: c
  ...
  !$OMP PARALLEL DO PRIVATE(c)
  do i=1,K
    c = col_calc(i)
  !$OMP CRITICAL
    M(:,c) = M(:,c) + f(c)
  !$OMP END CRITICAL
  enddo
  !$OMP END PARALLEL DO

- All the computational work is protected by the critical region
- The computation may be serialized!

100
中国科学院计算技术研究所

Serialization: Finer-grained locking


with OpenMP locks

- Spend one lock per matrix column:

  double precision, dimension(N,N) :: M
  integer (kind=omp_lock_kind), dimension(N) :: locks
  integer :: c
  !$OMP PARALLEL
  !$OMP DO
  do i=1,N
    call omp_init_lock(locks(i))
  enddo
  !$OMP END DO
  ...
  !$OMP DO PRIVATE(c)
  do i=1,K
    c = col_calc(i)
    call omp_set_lock(locks(c))
    M(:,c) = M(:,c) + f(c)
    call omp_unset_lock(locks(c))
  enddo
  !$OMP END DO
  !$OMP END PARALLEL

- Each matrix column is protected by its own lock
- Serialization occurs only if two threads access the same column
- Caveat: setting/releasing locks costs time!
中国科学院计算技术研究所

False sharing
- Example: histogram

  Serial version:

  integer, dimension(8) :: S
  integer :: IND
  S = 0
  do i=1,N
    IND = A(i)
    S(IND) = S(IND) + 1
  enddo

  Parallel version with one histogram per thread:

  integer, dimension(:,:), allocatable :: S
  integer :: IND,ID,NT
  !$OMP PARALLEL PRIVATE(ID,IND)
  !$OMP SINGLE
  NT = omp_get_num_threads()
  allocate(S(0:NT,8))
  S = 0
  !$OMP END SINGLE
  ID = omp_get_thread_num() + 1
  !$OMP DO
  do i=1,N
    IND = A(i)
    S(ID,IND) = S(ID,IND) + 1
  enddo
  !$OMP END DO NOWAIT
  !$OMP CRITICAL
  do j=1,8
    S(0,j) = S(0,j) + S(ID,j)
  enddo
  !$OMP END CRITICAL
  !$OMP END PARALLEL

- A "big fat lock" would be hazardous -> use a separate histogram for each thread
- Manual collection of the results at the end
- The program does not scale - why?
中国科学院计算技术研究所

False sharing
- Why does the histogram code not scale?
- Example: 5 threads, cache line size = 8 DP words
  - the per-thread histograms S(id,v), id = 0..4, v = 1..8, occupy 5 cache lines that are shared (interleaved) across the 5 threads
  - massive write-allocate/evict/invalidate traffic between the caches!
103
中国科学院计算技术研究所

Eliminating false sharing


- One option: padding
  - allocate S(0:NT*CL,8) instead of S(0:NT,8), where CL is the cache line length in array elements
  - every thread gets its own cache line for each histogram entry (the rest of the line is padding)
  - a C sketch of this idea follows below
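
A C sketch of the padding idea; the cache line length CL and the bin range of A[] are assumptions for illustration:

#include <stdlib.h>
#include <omp.h>

enum { NBINS = 8, CL = 8 };              /* assume a 64-byte line holds 8 longs */

void histogram(int n, const int *A /* values in 0..NBINS-1 */, long *result) {
  int nt = omp_get_max_threads();
  long (*S)[NBINS * CL] = calloc(nt, sizeof *S);   /* one padded row per thread */

#pragma omp parallel
  {
    int id = omp_get_thread_num();
#pragma omp for
    for (int i = 0; i < n; i++)
      S[id][A[i] * CL]++;                /* each counter sits in its own cache line */
  }

  for (int b = 0; b < NBINS; b++) {      /* collect the per-thread histograms */
    result[b] = 0;
    for (int t = 0; t < nt; t++)
      result[b] += S[t][b * CL];
  }
  free(S);
}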
104
中国科学院计算技术研究所

Lecture Outline
 Introduction to OpenMP
 Basic1: Data scoping & Reduction operations
 Basic2: Work Sharing Schemes
 Basic3: Synchronization and its issues
 Basic4: Tasking
 Advanced1: OpenMP correctness
 Advanced2: OpenMP performance pitfalls
 Advanced3: Data placement on ccNUMA architecture

105
中国科学院计算技术研究所

Memory Locality Problems


 ccNUMA:
 whole memory is transparently accessible by all processors
 but physically distributed
 with varying bandwidth and latency
 and potential contention (shared memory paths)
 How do we make sure that memory access is always as
"local" and "distributed" as possible?

C C C C C C C C

M M M M

106
中国科学院计算技术研究所

ccNUMA node layout – Golden Rule


- All modern multi-socket servers are of ccNUMA type
- First touch policy ("Golden Rule"): a memory page is mapped into the locality domain of the processor that first writes to it
- Consequences:
  - mapping happens at initialization, not at allocation
  - initialization and computation must use the same memory pages per thread/process
  - affinity matters! Where are my threads/processes running?
  - it is sufficient to touch a single item to map the entire page

  double precision, dimension(:), allocatable :: huge
  allocate(huge(N))
  ! memory not mapped yet
  do i=1,N
    huge(i) = 0.d0   ! mapping here!
  enddo
107
中国科学院计算技术研究所

Coding for Data Locality


- Simplest case: explicit initialization

  Serial initialization (all pages mapped into one locality domain):

  integer,parameter :: N=1000000
  real*8 A(N), B(N)

  A=0.d0

  !$OMP parallel do
  do I = 1, N
    B(i) = function ( A(i) )
  end do
  !$OMP end parallel do

  Parallel first-touch initialization:

  integer,parameter :: N=1000000
  real*8 A(N),B(N)

  !$OMP parallel do
  do I = 1, N
    A(i)=0.d0
  end do
  !$OMP end parallel do

  !$OMP parallel do
  do I = 1, N
    B(i) = function ( A(i) )
  end do
  !$OMP end parallel do
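
A C sketch of the same first-touch pattern; do_something() is an illustrative placeholder:

#include <stdlib.h>

double do_something(double x);           /* illustrative placeholder */

void init_and_compute(int n) {
  double *A = malloc(n * sizeof *A);
  double *B = malloc(n * sizeof *B);

#pragma omp parallel for schedule(static)   /* pages of A are first touched by the
                                               threads that later compute on them */
  for (int i = 0; i < n; i++)
    A[i] = 0.0;

#pragma omp parallel for schedule(static)   /* same static schedule -> same page-to-thread mapping */
  for (int i = 0; i < n; i++)
    B[i] = do_something(A[i]);

  free(A);
  free(B);
}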

108
中国科学院计算技术研究所

Coding for Data Locality


- Sometimes initialization is not so obvious: I/O cannot easily be parallelized, so "localize" the arrays before the I/O:

  Without localization:

  integer,parameter :: N=1000000
  real*8 A(N), B(N)

  READ(1000) A

  !$OMP parallel do
  do I = 1, N
    B(i) = function ( A(i) )
  end do
  !$OMP end parallel do

  With first-touch localization before the read:

  integer,parameter :: N=1000000
  real*8 A(N),B(N)

  !$OMP parallel do
  do I = 1, N
    A(i)=0.d0
  end do
  !$OMP end parallel do

  READ(1000) A

  !$OMP parallel do
  do I = 1, N
    B(i) = function ( A(i) )
  end do
  !$OMP end parallel do

109
中国科学院计算技术研究所

Coding for Data Locality


 Required condition: OpenMP loop schedule of initialization
must be the same as in all computational loops
 best choice: static! Specify explicitly on all NUMA-sensitive
loops, just
to be sure…
 imposes some constraints on possible optimizations (e.g. load
balancing)

 How about (automatically initialized) global objects?


 better not use them
 if communication vs. computation is favorable, might consider properly
placed copies of global data
 in C++, STL allocators provide an elegant solution

110
中国科学院计算技术研究所

Diagnosing Bad Locality


 If your code is cache-bound, you might not notice any locality
problems (see previous slide)

 Otherwise, bad locality limits scalability at very low CPU


numbers (whenever a node boundary is crossed)

 Consider using performance counters


 likwid-perfctr can be used to measure nonlocal memory accesses
 Example for Intel Westmere (2 sockets, 4 threads/socket):

  env OMP_NUM_THREADS=8 likwid-perfctr -g MEM -C S0:0-3@S1:0-3 ./a.out

111
中国科学院计算技术研究所

Memory Locality Problems


- Locality of reference is key to scalable performance on ccNUMA
  - less of a problem with distributed-memory (MPI) programming, but see below
- What factors can destroy locality?
  - Shared-memory programming (OpenMP, ...):
    - threads lose their association with the CPU on which the initial mapping took place
    - improper initialization of distributed data
    - dynamic/non-deterministic access patterns
  - MPI programming:
    - processes lose their association with the CPU on which the mapping took place originally
    - the OS kernel tries to maintain strong affinity, but sometimes fails
  - All cases:
    - other agents (e.g., the OS kernel) may fill memory with data that prevents optimal placement of user data
112
中国科学院计算技术研究所

ccNUMA problems beyond OpenMP


- The OS uses part of main memory for the disk buffer (file system) cache
- If the FS cache fills part of a locality domain, applications will probably allocate from foreign domains -> non-local access!
- Remedies:
  - drop FS cache pages after a user job has run (the administrator's job)
  - the user can run a "sweeper" code that allocates and touches all physical memory before starting the real application
  - this can also be done inside your own code

113
中国科学院计算技术研究所

ccNUMA: OS may try to solve the problem


- The operating system may support "page migration"
  - page migration: if a page is frequently requested by a remote NUMA domain, it is migrated to that NUMA domain
- Assume a 2-socket Haswell system (14 cores per socket) with Cluster-on-Die enabled -> 4 NUMA domains with 7 cores each
  - frequent remote access to a page from NUMA domain 0
  - after some time the OS may migrate the page to NUMA domain 0

114
中国科学院计算技术研究所

ccNUMA: OS driven page migration


  void dmvm(int n, int m, double *lhs, double *rhs, double *mat)
  {
    int r, c, offset;
  #pragma omp parallel for private(offset,c) schedule(static)
    for (r = 0; r < n; ++r) {
      offset = m * r;
      for (c = 0; c < m; ++c)
        lhs[r] += mat[c + offset] * rhs[c];
    }
  }

- Serial initialization of the matrix data
- Fill the socket first / no SMT
- Run a scaling experiment with OMP_NUM_THREADS = 1, 2, 4, 6, 8, ..., 28
- Run 10 tests (with iter dmvm calls each)
中国科学院计算技术研究所

ccNUMA: OS driven page migration, iter = 200

(Figure: dmvm performance vs. OMP_NUM_THREADS for iter = 200: saturation at 6 cores, about 20% performance fluctuations between the 10 runs.)

116
中国科学院计算技术研究所

ccNUMA: OS driven page migration

(Figure: dmvm performance vs. OMP_NUM_THREADS for iter = 200, 500, 1000 and 2000; saturation at 6 cores.)

117
中国科学院计算技术研究所

ccNUMA: OS driven page migration

- Page migration "takes time"
- Migration threshold: an OS parameter
- Works fine for regular access patterns
- (Figure: performance over time while all data initially resides in NUMA domain 0.)

118
中国科学院计算技术研究所

ccNUMA performance – things to consider


- First touch policy - programming:
  - parallelization of initialization must match the parallelization of the computation
  - thread pinning
  - granularity of placement: the page size!
  - C++: constructors, std::vector, ...
- Operating system issues:
  - OS/IO buffers
  - page migration

119
