L09: OpenMP (Week 13)
Parallel Processing
Lecture 09: Advanced OpenMP Programming
2024-05-21
Wang Yinshan, Associate Researcher
Research Center for High Performance Computers
Institute of Computing Technology, Chinese Academy of Sciences
Outline
Introduction to OpenMP
Basic1: Data scoping & Reduction operations
Basic2: Work Sharing Schemes
Basic3: Synchronization and its issues
Basic4: Tasking
Advanced1: OpenMP correctness
Advanced2: OpenMP performance pitfalls
Advanced3: Data placement on ccNUMA architecture
Serial region:
  only the master thread executes
  worker threads usually sleep

Programmer's view:
  directives/pragmas in application code
  (a few) library routines

User's view:
  environment variables determine resource allocation, scheduling strategies, and other behavior

[Diagram: the application code sits on top of compiler directives, environment variables, and the runtime library (implementation-dependent).]
Compiler directive:
#ifdef _OPENMP
... do something
#endif
use omp_lib
…
!$OMP PARALLEL
call work(omp_get_thread_num(), omp_get_num_threads())
!$OMP END PARALLEL
Introduction to OpenMP:
Data scoping – shared vs. private
The OpenMP memory model: each thread has private data in addition to the shared data.

Hello world:

#include <stdio.h>
#include <omp.h>
int main(int argc, char* argv[]) {
  #pragma omp parallel
  {
    printf("Hello world from %d!\n", omp_get_thread_num());
  }
  return 0;
}

Compile switch: g++ hello.cpp -fopenmp
Number of threads: determined by the shell variable OMP_NUM_THREADS, e.g.
  export OMP_NUM_THREADS=4
Thread pinning: e.g. LIKWID
  likwid-pin -c 0-3 ./a.out
Output ordering is not reproducible.
Basic1: Data scoping & Reduction operations
Introduction to OpenMP:
Data scoping – shared vs. private
Default: All data in a parallel region is shared
This includes global data (global/static variables, C++ class variables)
Exceptions:
1. Loop variables of parallel (“sliced”) loops are private (cf. workshare constructs)
2. Local (stack) variables within parallel region
3. Local data within enclosed function calls are private, unless declared static
Introduction to OpenMP:
Data scoping – side effects
An incorrect shared attribute may lead to correctness issues ("race conditions").
Recommendation: use
  #pragma omp parallel default(none)
to force an explicit scoping decision for every variable.
Note: the master copy's value is recovered after the parallel region.
Masking also applies to:
  privatized variables defined in scope outside the parallel region
  association status of pointers
  allocation status of allocatable variables
lastprivate – extension of private:
  the value from the thread which executes the last iteration of the loop is transferred back to the master copy
  restrictions similar to firstprivate
Example1
#include <stdio.h>
#include <omp.h>
code is in PE/src/lec09/share_private.cpp
Example1

#include <stdio.h>
#include <omp.h>
int main(int argc, char* argv[]) {
  int sum = 0;
  #pragma omp parallel for private(sum)  // sum is private (uninitialized inside!)
  for (int i = 0; i < omp_get_num_threads(); i++) {
    sum += i;
    printf("tid %d, sum = %d\n", omp_get_thread_num(), sum);
  }
  printf("outside sum = %d\n", sum);  // private copies are discarded: prints 0
  return 0;
}

code is in PE/src/lec09/share_private.cpp
Reduction:

!$omp parallel
!$omp do reduction(+:s)
do i = ...
   s = s + ...
end do
!$omp end do
... = ... * s
!$omp end parallel

At the start of the reduction, each thread (T0..T3) gets a private copy s0..s3, all set to 0.0; the master copy s persists but is inaccessible inside the loop.
At the synchronization point:
  the reduction operation is performed over all private copies
  the result is transferred to the master copy
  restrictions similar to firstprivate
Reduction operators and initial values (Fortran):

  +       0
  -       0
  *       1
  .and.   .true.
  .or.    .false.
  .eqv.   .true.
  .neqv.  .false.
  MAX     -HUGE(X)
  MIN     HUGE(X)
  IAND    all bits set
  IEOR    0
  IOR     0

Multiple scalars, or an array:

  double x, y, z;
  #pragma omp parallel for reduction(+:x,y,z)

  double a[n];
  #pragma omp parallel for reduction(*:a)

Multiple operations:

  #pragma omp parallel for reduction(+:sum_i) reduction(max:max_i)
Basic2: Work Sharing Schemes
User-defined scheduling:

  #pragma omp for schedule(...)    (C/C++)
  !$OMP do schedule(...)           (Fortran)

schedule( { static | dynamic | guided } [, chunk] )

chunk: always a non-negative integer. If omitted, it has a schedule-dependent default value.

static: iterations are divided into chunks and assigned in advance; some threads may take long to complete their chunk (workload imbalance).

dynamic: after a thread has completed a chunk, it is assigned a new one, until no chunks are left; this costs synchronization overhead. The default chunk value is 1.
Examples:
  OMP_SCHEDULE=static
  OMP_SCHEDULE=static,10
  OMP_SCHEDULE=dynamic,10
Example2
code is in PE/src/lec09/schedule.cpp
Guided scheduling

Size of chunks in dynamic schedule:
  too small → large overhead
  too large → load imbalance

Guided scheduling: dynamically vary the chunk size.
The size of each chunk is proportional to the number of unassigned iterations divided by the number of threads in the team, decreasing to chunk-size.
Chunk size:
  means minimum chunk size (except perhaps the final chunk)
  default value is 1

[Figure: iteration space partitioned with chunk = 7 – chunks shrink toward the end.]

Both dynamic and guided scheduling are useful for handling poorly balanced and unpredictable workloads.
Deferred scheduling

runtime: the schedule is decided at run time.

  !$OMP do schedule(runtime)           (Fortran)
  #pragma omp for schedule(runtime)    (C/C++)

The schedule is one of the kinds above. Determine it by either setting OMP_SCHEDULE and/or calling omp_set_schedule() (which overrides the environment setting); find which is active by calling omp_get_schedule(). If runtime scheduling is used and OMP_SCHEDULE is not set, the implementation chooses a schedule.

auto: automatic scheduling – the programmer gives the implementation the freedom to use any possible mapping.

Examples:

environment setting:
  export OMP_SCHEDULE="guided,4"
  ./a.out

call to API routine (Fortran):
  call omp_set_schedule(omp_sched_dynamic, 4)
  !$OMP parallel
  !$OMP do schedule(runtime)
  do ...
  end do
  !$OMP end do

(C/C++):
  omp_set_schedule(omp_sched_dynamic, 4);
  #pragma omp parallel
  #pragma omp for schedule(runtime)
  for (...) { }
Parallel sections

Non-iterative work-sharing construct: distribute a set of structured blocks.

!$omp parallel
!$omp sections
!$omp section
   ! code block 1   (thread 0)
!$omp section
   ! code block 2   (thread 1)
   ...
!$omp end sections  ! synchronization
!$omp end parallel

Restrictions:
  section directives must be within the lexical scope of the sections directive
  the sections directive binds to the innermost parallel region
  only the threads executing the binding parallel region participate in the execution of the section blocks and the implicit barrier (if not eliminated with nowait)
single:

One thread executes the block; all other threads wait until the block completes execution.

allowed clauses: private, firstprivate, copyprivate, nowait
copyprivate and nowait clauses appear on "end single" in Fortran, on "single" in C/C++.

Use for updates of shared entities, e.g.:

!$omp single
s = ...
!$omp end single copyprivate(s)

single – really a worksharing directive?
Basic3: Synchronization and its issues
Possible results

a = 0
Thread 0:      Thread 1:
a = a + 1      a = a + 1

  Thread 0 | Thread 1
  ---------+---------
      1    |    1
      1    |    2
      2    |    1

The result may differ from run to run, depending on which thread is the last one; after completion of the parallel region, a may be 1 or 2.
!$omp flush[(var1[,var2,…])]

Stand-alone directive, applicable to all variables with shared scope, including SAVE and COMMON/module globals, shared dummy arguments, and shared pointer dereferences.
If no variables are specified, the flush-set encompasses all shared variables accessible in the scope of the FLUSH directive (potentially slower).

Implicit flush operations (with no list) occur at:
  all explicit and implicit barriers
  entry to and exit from critical regions
  entry to and exit from lock routines
Barrier synchronization

Explicit via directive: the execution flow of each thread blocks upon reaching the barrier until all threads have reached the barrier.
Flush synchronization of all accessible shared variables happens before all threads continue; after the barrier, all shared variables have consistent values visible to all threads.
A barrier may not appear within a work-sharing code block (e.g. a !$omp do block), since this would imply deadlock.
Critical regions

The critical and atomic directives:
  each thread arriving at the code block executes it (in contrast to single)
  mutual exclusion: only one thread at a time within the code block
  atomic: the code block must be a single-line update of a scalar entity of intrinsic type with an intrinsic operation

C/C++:
  #pragma omp critical
  { block }

  #pragma omp atomic
  x = x <op> y;
C/C++:
  #pragma omp parallel
  {
    ...
    #pragma omp atomic
    x = x + y;
    ...
    a = f(x, ...);
  }

Fortran:
  !$omp parallel
    !$omp atomic
    x = x + y
    a = f(x, ...)
  !$omp end parallel

Race on the read of x in a = f(x, ...): a barrier is required before this statement to ensure that all threads have executed their updates.
Named critical

Consider multiple updates of the same shared variable from different routines:

subroutine foo()          subroutine bar()
!$omp critical            !$omp critical
x = x + y                 x = x + z
!$omp end critical        !$omp end critical

The critical region is global: OK.

But with different shared variables:

subroutine foo()          subroutine bar()
!$omp critical            !$omp critical
x = x + y                 w = w + z
!$omp end critical        !$omp end critical

Mutual exclusion is not required here → unnecessary loss of performance.

Solution: use named criticals.

subroutine foo()
!$omp critical (foo_x)
x = x + y
!$omp end critical (foo_x)

subroutine bar()
!$omp critical (foo_w)
w = w + z
!$omp end critical (foo_w)

Mutual exclusion only if the same name is used for critical.
atomic is bound to the updated variable, so the problem does not occur there.
ordered:

C/C++:
  #pragma omp for ordered
  for (i=0; i<N; ++i) {
    O1
    #pragma omp ordered
    { O2 }
    O3
  }

Fortran:
  !$OMP do ordered
  do I=1,N
    O1
    !$OMP ordered
    O2
    !$OMP end ordered
    O3
  end do
  !$OMP end do

O1 and O3 run concurrently across threads; the O2 blocks execute in loop-iteration order, followed by a barrier at the end of the loop.
Fortran examples:

!$OMP do ordered
do I=2,N
  ... ! large block
  !$OMP ordered
  a(I) = a(I-1) + ...
  !$OMP end ordered
end do
!$OMP end do

!$OMP do ordered
do I=1,N
  ... ! calculate a(I)
  !$OMP ordered
  write(unit,...) a(I)
  !$OMP end ordered
end do
!$OMP end do

C/C++: #pragma omp for ordered
OpenMP locks
An OpenMP lock can be in one of the following three states:
  uninitialized
  unlocked
  locked
The task that sets the lock is then said to own the lock.
Only the task that set the lock can unset it, returning it to the unlocked state.
Fortran: omp_init_lock(var)
C/C++:   void omp_init_lock(omp_lock_t *var)
  initialize a lock; the initial state is unlocked
  which resources are protected by the lock is defined by the developer
  var must not be associated with a lock before this routine is called

Fortran: omp_destroy_lock(var)
C/C++:   void omp_destroy_lock(omp_lock_t *var)
  disassociate var from the lock
  preconditions: var must have been initialized, and must be in the unlocked state
Fortran: omp_set_lock(var)
C/C++:   void omp_set_lock(omp_lock_t *var)
  blocks if the lock is not available
  sets ownership and continues execution if the lock is available

Fortran: omp_unset_lock(var)
C/C++:   void omp_unset_lock(omp_lock_t *var)
  releases ownership of the lock
  ownership must have been established before
Nestable Locks
replace omp_*_lock by omp_*_nest_lock
Basic4: Tasking
Task queue
Clauses:

data environment:
  private, firstprivate, default(shared|none), in_reduction(r-id: list)
dependencies:
  depend(dep-type: list)
scheduler restriction:
  untied
scheduler hints:
  priority(priority-value), affinity(list)
cutoff strategies:
  if(scalar-expression), mergeable, final(scalar-expression)
other clauses:
  allocate([allocator:] list), detach(event-handler)
What is a Task?
Make OpenMP worksharing more flexible:
allow the programmer to package code blocks and data items for execution
this by definition is a task
and assign these to an encountering thread
possibly defer execution to a later time (“work queue”)
Introduced with OpenMP 3.0 and extended over time
When a thread encounters a task construct, a task is generated from the code of
the associated structured block.
Data environment of the task is created (according to the data-sharing attributes,
defaults, …)
“Packaging of data”
The encountering thread may immediately execute the task, or defer its execution.
In the latter case, any thread in the team may be assigned the task.
typedef struct list_s {
    struct list_s *next;
    contents     *data;
} list;

Typical task generation loop:

void process_list(list *head)
{
    #pragma omp parallel
    {
        #pragma omp single
        {
            list *p = head;
            while (p) {
                #pragma omp task
                { do_work(p->data); }   /* task code */
                p = p->next;
            }
        }
    }   /* all tasks done */
}
The if Clause

When the if argument is false:
  the task becomes an undeferred task
  the task body is executed immediately by the encountering thread
  all other semantics (data environment, synchronization) stay the same as for a "deferred" task

User-directed optimization:
  avoid the overhead of deferring small tasks
  cache locality / memory affinity may be lost by doing so
Task Synchronization
C/C++:  #pragma omp barrier
Fortran: !$omp barrier

taskwait example (C):

#pragma omp parallel
{
  #pragma omp single
  {
    ...
    #pragma omp task
    { ... }
    #pragma omp taskwait
    elem = elem->next;
  }
} /* implicit barrier */
Task Switching
It is allowed for a thread to
suspend a task during execution
start (or resume) execution of another task (assigned to the same team)
resume original task later
Pre-condition:
a task scheduling point (TSP) is reached
Example from previous slide:
the taskwait directive implies a task scheduling point
Another example:
very many tasks are generated
implementation can suspend generation of tasks and start processing the existing
queue
Nearly all task scheduling points are implicit
taskyield construct only explicit one
void foo(omp_lock_t *lock, int n)
{
    for (int i = 0; i < n; i++) {
        #pragma omp task
        {
            something_useful();
            while (!omp_test_lock(lock)) {
                #pragma omp taskyield
            }
            something_critical();
            omp_unset_lock(lock);
        }
    }
}

The waiting task may be suspended at the taskyield, allowing the executing thread to perform other work. This may also avoid deadlock situations.
Data Environment
The task directive takes data attribute clauses (e.g. shared, private, firstprivate, default) that define the data environment of the task.
Data Scoping
Some rules from Parallel Regions apply:
Static and global variables are shared
Automatic storage (stack) variables are private
The OpenMP Standard says:
The data-sharing attributes of variables that are not listed in data attribute clauses of
a task construct, and are not predetermined according to the OpenMP rules, are
implicitly determined as follows:
In a task construct, if no default clause is present:
a) a variable that is determined to be shared in all enclosing constructs, up to
and including the innermost enclosing parallel construct, is shared.
b) a variable whose data-sharing attribute is not determined by rule (a) is
firstprivate.
int i = 0, j = 1;
#pragma omp parallel private(j)
#pragma omp single
{
    int k = 0;
    #pragma omp task
    { /* use i, j, k */ }   // i: shared (rule a); j, k: firstprivate (rule b)
}
Task reduction
Task Loops
C/C++:
  #pragma omp taskloop [simd] [clause[[,] clause],…]
    for-loops

Fortran:
  !$omp taskloop [simd] [clause[[,] clause],…]
    do-loops
  [!$omp end taskloop [simd]]
grainsize(grain-size)
  Chunks have at least grain-size and at most 2 × grain-size loop iterations.
num_tasks(num-tasks)
  Create num-tasks tasks for the iterations of the loop.
Task Dependencies
The task dependence is fulfilled when the predecessor task has completed.

in dependency-type:
  The generated task will be a dependent task of all previously generated sibling tasks that reference at least one of the list items in an out or inout clause.
out and inout dependency-types:
  The generated task will be a dependent task of all previously generated sibling tasks that reference at least one of the list items in an in, out, or inout clause.
mutexinoutset dependency-type:
  Supports mutually exclusive inout sets; requires OpenMP ≥ 5.0.

The list items in a depend clause may include array sections.
int x = 0;
#pragma omp parallel
#pragma omp single
{
    #pragma omp task depend(in: x)      // task 1
    std::cout << x << std::endl;
    #pragma omp task                    // task 2
    long_running_task();
    #pragma omp task depend(inout: x)   // task 3
    x++;
}
Advanced1: OpenMP correctness
Race condition

Definition:
  Two threads access the same shared variable, and
  at least one thread modifies the variable, and
  the sequence of the accesses is undefined, i.e. unsynchronized.

The result of the program depends on the detailed timing of the threads in the team.
This is often caused by unintended sharing of data.
Consequence: wrong results.

Deadlock
  Threads lock up waiting on a locked resource that will never become free (e.g. T0 holds R0 and waits for R1, while T1 holds R1 and waits for R0).
  Consequence: the program hangs forever.
Race condition
Example

real :: x
!$OMP PARALLEL DO PRIVATE(x)
do i=1, N
  call get_lcgrand(x)
  if(x.lt.0.5) then
    a(i) = a(i) * 2
  endif
enddo
!$OMP END PARALLEL DO

subroutine get_lcgrand(r)
  real :: r
  integer, save :: seed=1
  seed = mod(106*seed+1283,6075)
  r = sngl(seed)/6075.
end subroutine

The SAVEd seed is shared, so concurrent updates race. Fix with a critical region, so the generator is only called by one thread at a time:

real :: x
!$OMP PARALLEL DO PRIVATE(x)
do i=1, N
  !$OMP CRITICAL
  call get_lcgrand(x)
  !$OMP END CRITICAL
  if(x.lt.0.5) then
    a(i) = a(i) * 2
  endif
enddo
!$OMP END PARALLEL DO

Advantage: simple.
Disadvantages:
  performance implications (overhead costs, serialization)
  danger of deadlocks (see later)
Alternative: privatize the seed.

  s = mod(106*s+1283,6075)
  r = sngl(s)/6075.
end subroutine

Advantage: no synchronization required on every access.
Caveat: privatization may alter program semantics (as here).
Deadlock
Example 1

Initially: x = 0; y = 0;
Deadlock
Example 2
Advanced2: OpenMP performance pitfalls
OpenMP overheads

!$OMP PARALLEL ...  : "wake up" the team of threads
!$OMP END PARALLEL  : "retire" the team of threads
Barrier cost (tree-structured barrier, shown for 8 threads):

  c(N) = const × 2 × log2(N)

where N is the number of threads/processes in the barrier.

[Figure: barrier overhead measurements on a 60-core system.]
False sharing

Example: Histogram

Serial version:

integer, dimension(8) :: S
integer :: IND
S = 0
do i=1,N
  IND = A(i)
  S(IND) = S(IND) + 1
enddo

A "big fat lock" around the update would be hazardous, so use a separate histogram for each thread, with manual collection of the results at the end:

integer, dimension(:,:), allocatable :: S
integer :: IND,ID,NT
!$OMP PARALLEL PRIVATE(ID,IND)
!$OMP SINGLE
NT = omp_get_num_threads()
allocate(S(0:NT,8))
S = 0
!$OMP END SINGLE
ID = omp_get_thread_num() + 1
!$OMP DO
do i=1,N
  IND = A(i)
  S(ID,IND) = S(ID,IND) + 1
enddo
!$OMP END DO NOWAIT
!$OMP CRITICAL
do j=1,8
  S(0,j) = S(0,j) + S(ID,j)
enddo
!$OMP END CRITICAL
!$OMP END PARALLEL

The program does not scale – why?
False sharing

Why does the histogram code not scale?
Example: 5 threads, cache line size = 8 DP words.

[Figure: S(id,v), id = 0..4, v = 1..8 – the five per-thread histogram rows are interleaved across five cache lines.]

5 cache lines shared (interleaved) across 5 threads: massive write-allocate/evict/invalidate traffic between the caches!
Remedy: padding – make each per-thread histogram row occupy its own cache line (one histogram entry per thread, padded to the cache line size).
Advanced3: Data placement on ccNUMA architecture
[Figure: ccNUMA node – groups of cores (C), each group attached to its own local memory (M), connected by a coherent interconnect.]
Golden rule of ccNUMA: a memory page gets mapped into the local memory of the processor that first touches it.

Consequences:
  Mapping happens at initialization, not at allocation.
  Initialization and computation must use the same memory pages per thread/process.
  Affinity matters! Where are my threads/processes running?
  It is sufficient to touch a single item to map the entire page.

double precision, dimension(:), allocatable :: huge
allocate(huge(N))
! memory not mapped yet
do i=1,N
  huge(i) = 0.d0 ! mapping here!
enddo
Serial initialization (all pages mapped into one locality domain – bad):

A = 0.d0
!$OMP parallel do
do I = 1, N
  B(i) = function( A(i) )
end do
!$OMP end parallel do

Parallel first touch (pages mapped where they are later used – good):

!$OMP parallel do
do I = 1, N
  A(i) = 0.d0
end do
!$OMP end parallel do
!$OMP parallel do
do I = 1, N
  B(i) = function( A(i) )
end do
!$OMP end parallel do
MPI programming:
Processes lose their association with the CPU the mapping took
place on originally
OS kernel tries to maintain strong affinity, but sometimes fails
[Figure: the file-system buffer cache (BC) fills local memory, so data(3) gets mapped into the remote domain – non-local access!]

Remedies:
  Drop FS cache pages after a user job has run (admin's job).
  The user can run a "sweeper" code that allocates and touches all physical memory before starting the real application.
  This can also be done inside your own code.
[Figure: performance vs. OMP_NUM_THREADS – 20% performance fluctuations; saturation at 6 cores.]
[Figure: performance vs. OMP_NUM_THREADS for iter = 200, 500, 1000, 2000 – saturation at 6 cores.]
Page migration "takes time"; the migration threshold is an OS parameter.