
Programming with Shared Memory
Parallel Programming Techniques and Applications Using
Networked Workstations and Parallel Computers
Chapter 8
Dr. Hamdan Abdellatef
OpenMP
Hands-on tutorial by Tim Mattson:
https://www.youtube.com/playlist?list=PLLX-Q6B8xqZ8n8bwjGdzBJ25X2utwnoEG

https://colab.research.google.com/gist/hamdanabdellatef/033338741f820332dfb61e1116c86fc8/openmp-patterns.ipynb
Recall

Original Problem → Decompose into tasks → Tasks, shared and local data → Code with a parallel Prog. Env. → Corresponding source code → Units of execution + new shared data for extracted dependencies

Program SPMD_Emb_Par ()
{
TYPE *tmp, *func();
global_array Data(TYPE);
global_array Res(TYPE);
int Num = get_num_procs();
int id = get_proc_id();
if (id==0) setup_problem(N, Data);
for (int I = id; I < N; I = I + Num) {
tmp = func(I, Data);
Res.accumulate(tmp);
}
}

(The same SPMD program is replicated on each unit of execution.)

3
OpenMP
▪ An accepted standard developed in the late 1990s by a group of industry specialists.

▪ Consists of a small set of compiler directives, augmented with a small set of library routines and
environment variables, for the base languages Fortran and C/C++.

▪ The compiler directives can specify such things as the parallel block and parallel for.

▪ Several OpenMP compilers available.

4
OpenMP
▪ For C/C++, the OpenMP directives are contained in #pragma statements. The OpenMP #pragma
statements have the format:

#pragma omp directive_name ...

▪ where omp is an OpenMP keyword.

▪ There may be additional parameters (clauses) after the directive name to select different options.

▪ Some directives require code to be specified in a structured block (a statement or statements) that follows
the directive; the directive and the structured block together form a “construct”.

5
OpenMP
▪ OpenMP uses a thread-based “fork-join” model.
▪ Initially, a single master thread executes the program.
▪ Parallel regions (sections of code) can be executed by multiple threads (a team of threads).

[Figure: the master thread forks a team of threads for each parallel region and joins them afterwards.]

▪ The parallel directive creates a team of threads, with a specified block of code executed by the
multiple threads in parallel.
▪ Other directives are used within a parallel construct to specify parallel for loops and different blocks of
code for the threads.
6
Parallel Directive
#pragma omp parallel
structured_block

▪ creates multiple threads, each one executing the specified structured_block, either a single
statement or a compound statement created with { ...} with a single entry point and a single exit
point.

▪ There is an implicit barrier at the end of the construct.

7
Simple Example

▪ omp_get_num_threads() returns the number of threads that are currently being used
▪ omp_get_thread_num() returns the thread number
▪ The array a[] is a global (shared) array
▪ x and num_threads are declared as private to the threads (a sketch of such a program is given below).
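The slide's code listing was an image and is not reproduced in this text; the following is a minimal sketch consistent with the bullets above (the array size N, the requested thread count, and the values written are assumptions, not the original example):

#include <stdio.h>
#include <omp.h>

#define N 8                        /* assumed array size */
int a[N];                          /* global (shared) array */

int main()
{
    int x, num_threads;            /* made private to each thread below */

    omp_set_num_threads(N);        /* ask for one thread per element */

    #pragma omp parallel private(x, num_threads)
    {
        num_threads = omp_get_num_threads();  /* threads currently in use */
        x = omp_get_thread_num();             /* this thread's number */
        a[x] = 10 * num_threads;              /* each thread writes one element */
    }

    for (x = 0; x < N; x++)
        printf("a[%d] = %d\n", x, a[x]);
    return 0;
}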

8
Number of threads in a team
▪ Established by either:

1. num_threads clause after the parallel directive, or

2. omp_set_num_threads() library routine being previously called, or

3. the environment variable OMP_NUM_THREADS being defined,

in the order given, or system dependent if none of the above.

▪ Number of threads available can also be altered automatically to achieve best use of system
resources by a “dynamic adjustment” mechanism.
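A brief sketch of the three mechanisms (the thread counts are arbitrary and the shell command appears only as a comment):

#include <stdio.h>
#include <omp.h>

int main()
{
    /* 1. num_threads clause on the parallel directive */
    #pragma omp parallel num_threads(3)
    printf("clause: thread %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads());

    /* 2. library routine called before the parallel region */
    omp_set_num_threads(4);
    #pragma omp parallel
    printf("routine: thread %d of %d\n",
           omp_get_thread_num(), omp_get_num_threads());

    /* 3. environment variable, set before running the program:
          export OMP_NUM_THREADS=4                               */
    return 0;
}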

9
Work-Sharing
▪ Three constructs in this classification:

sections
for
single

▪ In all cases, there is an implicit barrier at the end of the construct unless a nowait clause is
included.

▪ Note that these constructs do not start a new team of threads. That is done by an enclosing parallel
construct.
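As an illustration of nowait, the sketch below (array sizes and contents arbitrary) lets threads that finish the first loop early move straight on to the second, independent loop without waiting at the end of the first:

#include <stdio.h>
#include <omp.h>

#define N 1000
double a[N], b[N];

int main()
{
    #pragma omp parallel
    {
        /* nowait: no implicit barrier at the end of this loop ... */
        #pragma omp for nowait
        for (int i = 0; i < N; i++)
            a[i] = i * 2.0;

        /* ... so threads move straight on to this independent loop */
        #pragma omp for
        for (int i = 0; i < N; i++)
            b[i] = i + 1.0;
    }
    printf("a[10] = %f, b[10] = %f\n", a[10], b[10]);
    return 0;
}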

10
Sections
▪ The sections construct is a non-iterative work-sharing construct:

#pragma omp sections

▪ Structured blocks are distributed among, and executed by, the threads in a team.

▪ Each structured block is executed once by one of the threads in the team, in the context of its implicit task.
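A minimal sketch of a sections construct inside a parallel region (the printed messages are purely illustrative):

#include <stdio.h>
#include <omp.h>

int main()
{
    #pragma omp parallel          /* team of threads created here */
    {
        #pragma omp sections      /* blocks below shared out among the team */
        {
            #pragma omp section
            printf("section 1 run by thread %d\n", omp_get_thread_num());

            #pragma omp section
            printf("section 2 run by thread %d\n", omp_get_thread_num());
        }                         /* implicit barrier here */
    }
    return 0;
}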

11
For Loop
#pragma omp for
for_loop

▪ causes the iterations of the for loop to be divided into parts and shared among the threads in the team.
▪ The for loop must be of a simple (canonical) form.

▪ The way the for loop is divided can be specified by an additional “schedule” clause.
▪ Example: the clause schedule(static, chunk_size) causes the for loop to be divided into chunks of size
chunk_size, allocated to the threads in a round-robin fashion.

▪ A reduction(op:var) clause can also be used.
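A small sketch combining a schedule clause and a reduction clause (the array size and chunk size are arbitrary):

#include <stdio.h>
#include <omp.h>

#define N 100

int main()
{
    int a[N], sum = 0;

    #pragma omp parallel
    {
        /* iterations handed out in chunks of 8, round robin over the team;
           each thread keeps a private copy of sum, combined with + at the end */
        #pragma omp for schedule(static, 8) reduction(+:sum)
        for (int i = 0; i < N; i++) {
            a[i] = i;
            sum += a[i];
        }
    }
    printf("sum = %d\n", sum);    /* 0 + 1 + ... + 99 = 4950 */
    return 0;
}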

12
Single
▪ The directive

#pragma omp single


structured block

▪ causes the structured block to be executed by one thread only.
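A minimal sketch (the messages are illustrative):

#include <stdio.h>
#include <omp.h>

int main()
{
    #pragma omp parallel
    {
        /* executed by exactly one thread; the others wait at the
           implicit barrier at the end of the single construct */
        #pragma omp single
        printf("single block run by thread %d\n", omp_get_thread_num());

        printf("after single: thread %d\n", omp_get_thread_num());
    }
    return 0;
}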

13
Combined Parallel Work-sharing Constructs
▪ If a parallel directive is followed by a single for directive, the two can be combined, with similar effect, into

#pragma omp parallel for


for_loop

▪ If a parallel directive is followed by a single sections directive, the two can similarly be combined into
#pragma omp parallel sections.
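A short sketch showing both combined forms (the array size and the work done are arbitrary):

#include <stdio.h>
#include <omp.h>

#define N 100

int main()
{
    int a[N];

    /* parallel + for combined into one directive */
    #pragma omp parallel for
    for (int i = 0; i < N; i++)
        a[i] = 2 * i;

    /* parallel + sections combined into one directive */
    #pragma omp parallel sections
    {
        #pragma omp section
        printf("first section, a[1] = %d\n", a[1]);

        #pragma omp section
        printf("second section\n");
    }
    return 0;
}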

14
Master Directive
▪ The master directive:
#pragma omp master
structured_block

▪ causes only the master thread to execute the structured block.

▪ It differs from the directives in the work-sharing group in that there is no implied barrier at the end of the
construct (nor at the beginning).

▪ Other threads encountering this directive will ignore it and the associated structured block, and
will move on.
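A minimal sketch (the messages are illustrative):

#include <stdio.h>
#include <omp.h>

int main()
{
    #pragma omp parallel
    {
        /* only thread 0 executes this block; there is no barrier,
           so the other threads do not wait for it */
        #pragma omp master
        printf("master block run by thread %d\n", omp_get_thread_num());

        printf("thread %d continuing\n", omp_get_thread_num());
    }
    return 0;
}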

15
Hello OpenMP
#include <iostream>
#include <omp.h>
using namespace std;

int main()
{
omp_set_num_threads(4);
#pragma omp parallel
{
int ID = omp_get_thread_num();
cout << "Hello World!" << endl;
cout << ID << endl;
}
return 0;
}

16
Example: Calculating Pi
▪ We know that:

∫₀¹ 4/(1+x²) dx = π

▪ We can approximate the integral using rectangles (the midpoint rule).
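Written out, the rectangle (midpoint-rule) approximation that the code on the next slide computes, with N corresponding to num_steps and Δx to step, is:

\pi \approx \sum_{i=1}^{N} \frac{4}{1 + x_i^{2}}\,\Delta x, \qquad \Delta x = \frac{1}{N}, \quad x_i = (i - 0.5)\,\Delta x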

17
#include <iostream>
#include <cstdio>   // for printf
#include <omp.h>
using namespace std;
#define NUM_THREADS 2

static long num_steps = 100000000;


double step;

int main()
{
int i;
double pi, sum = 0.0; double start_time, run_time;

step = 1.0 / (double)num_steps;

start_time = omp_get_wtime();
omp_set_num_threads(NUM_THREADS);
#pragma omp parallel
{
double x;   // private to each thread (declared inside the parallel region)
#pragma omp for reduction (+:sum)   // iterations shared out; partial sums combined with +
for (i = 1; i <= num_steps; i++) {
x = (i - 0.5) * step;
sum += 4.0 / (1.0 + x * x);
}
}
pi = step * sum;
run_time = omp_get_wtime() - start_time;
printf("\n pi with %ld steps is %lf in %lf seconds\n ", num_steps, pi, run_time);
}
18
Synchronization Constructs
Critical
▪ The critical directive will only allow one thread to execute the associated structured block at a time. When
one or more threads reach the critical directive:

#pragma omp critical name


structured_block

▪ they will wait until no other thread is executing the same critical section (one with the same
name), and then one thread will proceed to execute the structured block.
▪ The name is optional.
▪ All critical sections with no name map to the same unspecified name.
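A minimal sketch of an unnamed critical section protecting a shared counter:

#include <stdio.h>
#include <omp.h>

int main()
{
    int counter = 0;

    #pragma omp parallel
    {
        #pragma omp critical   /* unnamed: one thread at a time in this block */
        counter = counter + 1;
    }
    printf("counter = %d\n", counter);   /* equals the number of threads */
    return 0;
}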

20
Barrier
▪ When a thread reaches the barrier

#pragma omp barrier

▪ it waits until all threads have reached the barrier and then they all proceed together.

▪ There are restrictions on the placement of barrier directive in a program. In particular, all threads
must be able to reach the barrier.
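A minimal sketch: every "before" line is printed before any "after" line appears.

#include <stdio.h>
#include <omp.h>

int main()
{
    #pragma omp parallel
    {
        printf("before barrier: thread %d\n", omp_get_thread_num());

        #pragma omp barrier   /* all threads must arrive before any continues */

        printf("after barrier: thread %d\n", omp_get_thread_num());
    }
    return 0;
}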

21
Atomic
▪ The atomic directive

#pragma omp atomic


expression_statement

▪ implements a critical section efficiently when the critical section simply updates a variable (adds
one, subtracts one, or does some other simple arithmetic operation as defined by
expression_statement).
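A minimal sketch in which atomic protects a single shared update (the loop bound and counter are illustrative):

#include <stdio.h>
#include <omp.h>

#define N 1000

int main()
{
    int hits = 0;

    #pragma omp parallel for
    for (int i = 0; i < N; i++) {
        #pragma omp atomic    /* the single update below is performed atomically */
        hits += 1;
    }
    printf("hits = %d\n", hits);   /* always N */
    return 0;
}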

22
Ordered
▪ Used in conjunction with for and parallel for directives to cause iterations to be executed in the
order in which they would occur if the loop were written as a sequential loop.
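A minimal sketch (note that the loop directive itself also needs an ordered clause):

#include <stdio.h>
#include <omp.h>

int main()
{
    #pragma omp parallel for ordered
    for (int i = 0; i < 8; i++) {
        int sq = i * i;              /* this part may run out of order */

        #pragma omp ordered          /* this part runs in loop order: 0, 1, 2, ... */
        printf("%d squared is %d\n", i, sq);
    }
    return 0;
}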

23
#include <stdio.h>
#include <omp.h>

#define MAX_THREADS 4

static long num_steps = 100000000; double step;


int main()
{
int i, j; double pi, full_sum = 0.0; double start_time, run_time;
double sum[MAX_THREADS];   // not used in this version

step = 1.0 / (double)num_steps;

for (j = 1; j <= MAX_THREADS; j++) {


omp_set_num_threads(j);
full_sum = 0.0; start_time = omp_get_wtime();
#pragma omp parallel private(i)
{
int id = omp_get_thread_num();
int numthreads = omp_get_num_threads();
double x; double partial_sum = 0;

#pragma omp single
printf(" num_threads = %d", numthreads);

for (i = id; i < num_steps; i += numthreads) {
x = (i + 0.5) * step;
partial_sum += 4.0 / (1.0 + x * x);
}
#pragma omp critical
full_sum += partial_sum;
}

pi = step * full_sum; run_time = omp_get_wtime() - start_time;


printf("\n pi is %f in %f seconds %d threds \n ", pi, run_time, j); 24
}}
Thank You
