Beginning OpenMP


OpenMP Basics

Contents

1. Beginning OpenMP
2. Hello OpenMP!
   Example
3. Directives (Pragmas)
   Example
4. Sections
   Example
5. Loops
   Example
6. Critical Code
   Example
   Exercise
   Answer to Exercise
7. Reduction
   Example
   Exercise
   Answer to Exercise (C)
8. Map/Reduce
   Example
9. Performance

1. Beginning OpenMP
OpenMP provides a straightforward interface for writing software that can use multiple cores of a
computer. Using OpenMP you can write code that uses all of the cores in a multicore computer,
and that will run faster as more cores become available.
OpenMP is a well-established, standard method of writing parallel programs. It was first released
in 1997, and is currently on version 3.0. It is provided by default in nearly all compilers, e.g. the
GNU compiler suite (gcc, g++, gfortran), the Intel compilers (icc, icpc, ifort) and the Portland Group
compilers (pgcc, pgCC, pgf77), and works on nearly all operating systems (e.g. Linux, Windows
and OS X).
You can read about the history of OpenMP at its Wikipedia page, or by going to one of
the many OpenMP websites. One good book for learning OpenMP, in my opinion, is Using OpenMP:
Portable Shared Memory Parallel Programming.
OpenMP can be combined with other parallel programming technologies, e.g. MPI. This course is
presented as a companion to my MPI course, with both this and the MPI course following a similar
structure and presenting similar examples. If you want to compare OpenMP and MPI, then please
click on the Compare with MPI links on each page.

2. Hello OpenMP!
The first stage is to write a small OpenMP program, which we will write in C and call
hello_openmp.
When you run it, you should see the following output.

Hello OpenMP!
Hello OpenMP!
Hello OpenMP!
Hello OpenMP!

The line Hello OpenMP! is output four times, because the program split into four threads, and each
thread printed Hello OpenMP!

Example
Most C compilers support the use of OpenMP. Available compilers include gcc (version 4.2 or
above), icc and pgcc.

The first step is to create a simple OpenMP C program, which we will call hello_openmp.
Open a Visual Studio C++ project and, in the project properties, set OpenMP Support to Yes, and
then create a file called hello_openmp.c and copy in the following code;
#include "stdafx.h"
#include <stdio.h>
#include <omp.h>
int main(int argc, char **argv)
{
#pragma omp parallel
{
printf("Hello OpenMP!\n");
}
getchar();
return 0;
}
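If you are not using Visual Studio, you can delete the #include "stdafx.h" line (it is a Visual
Studio precompiled-header file) and compile from the command line instead. With gcc, for example,
OpenMP support is switched on with the -fopenmp flag;

gcc -fopenmp hello_openmp.c -o hello_openmp

The size of the thread team is then typically controlled by setting the OMP_NUM_THREADS
environment variable (e.g. to 4) before running the program.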

The only new line in this example is #pragma omp parallel, which is used to specify that
all of the code between the curly brackets is part of an OpenMP parallel section.

3. Directives (Pragmas)
So what was going on in the last example?
A standard program works by executing one line of code at a time, starting from the main function
and working down line by line. This single thread of execution is known as the "main" thread. All
programs have a single main thread, and in most programs this is the only thread of execution,
which is why the program can only do one thing at a time.
The hello_openmp program also has a single main thread of execution. However, this main thread is
split into a team of threads within the OpenMP parallel section. Each parallel thread in the
team executes all of the code in the parallel section, hence each thread executes the line of code
that prints Hello OpenMP!
We can see this more clearly by getting each thread to identify itself. Please copy the code from
the example below to create the executable hello_threads.
The OpenMP parallel section is specified by using compiler directives. These directives (also
called compiler pragmas) are instructions to the compiler to tell it how to create the team of threads,
and to help tell the compiler how to assign threads to tasks. These OpenMP directives are only
followed if the program is compiled with OpenMP support. If the program is compiled without
OpenMP support, then they are ignored.
There are several OpenMP directives. This course will cover the basic usage of just a selection:

parallel : Used to create a parallel block of code which will be executed by a team of threads.
sections : Used to specify different sections of the code that can be run in parallel by different threads.
for (C/C++) : Used to specify loops where different iterations of the loop are performed by different threads.
critical : Used to specify a block of code that can only be run by one thread at a time.
reduction : Used to combine (reduce) the results of several local calculations into a single result.

Pragmas are added to the code using:

#pragma omp name_of_directive
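Directives can also take extra clauses that modify their behaviour, written after the directive name.
As a sketch (the variable names a, b and c here are just placeholders), a parallel directive with
private and shared clauses looks like this;

#pragma omp parallel private(a, b) shared(c)

We will see the private clause in use in the next example.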

Example
Copy this into the file hello_threads.c
#include "stdafx.h"
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#else
#define omp_get_num_threads() 0
#define omp_get_thread_num() 0
#endif
int main(int argc, char **argv)
{
int nthreads, thread_id;
printf("I am the main thread.\n");
#pragma omp parallel private(nthreads, thread_id)
{
nthreads = omp_get_num_threads();
thread_id = omp_get_thread_num();
printf("Hello. I am thread %d out of a team of %d\n",
thread_id, nthreads);
}
printf("Here I am, back to the main thread.\n");
getchar();
return 0;
}

This example uses two OpenMP functions:

omp_get_num_threads() : Returns the number of threads in the OpenMP thread team.
omp_get_thread_num() : Returns the identifying number of the thread in the team.

Note that using these functions requires you to include the omp.h header file. To ensure portability
(in case OpenMP is not supported) we hide this header file behind an #ifdef _OPENMP guard, and
add stubs for the two OpenMP functions (returning a team size of 1 and a thread number of 0) so
that the program still compiles and runs serially.
This example also uses a slightly modified omp parallel line. In this
case, private(nthreads, thread_id) is added to specify that each thread should have
its own copy of the nthreads and thread_id variables.

4. Sections
The OpenMP sections directive provides a means by which different threads can run different
parts of the program in parallel.
In this example, there are three functions, times_table, countdown and long_loop. These
three functions are called from within an OpenMP sections directive, where each function is
placed into a separate OpenMP section. This tells the compiler that these three functions can
be called in parallel, and a thread from the team can be assigned to each section. If there
are more sections than threads, each section is queued until a thread is available to run it;
if there are more threads than sections, the extra threads will have nothing to do. Note also
that there is no guarantee as to the order in which sections are executed.

Example
Create the file omp_sections.c and copy in the following;
#include "stdafx.h"
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#else
#define omp_get_thread_num() 0
#endif
void times_table(int n)
{
int i, i_times_n, thread_id;
thread_id = omp_get_thread_num();
for (i=1; i<=n; ++i)
{
i_times_n = i * n;
printf("Thread %d says %d times %d equals %d.\n",
thread_id, i, n, i_times_n );
}
}
void countdown()
{
int i, thread_id;
thread_id = omp_get_thread_num();
for (i=10; i>=1; --i)
{
printf("Thread %d says %d...\n", thread_id, i);
}
printf("Thread %d says \"Lift off!\"\n", thread_id);

}
void long_loop()
{
int i, thread_id;
double sum = 0;
thread_id = omp_get_thread_num();
for (i=1; i<=10; ++i)
{
sum += (i*i);
}
printf("Thread %d says the sum of the long loop is %f\n",
thread_id, sum);
}
int main(int argc, char **argv)
{
printf("This is the main thread.\n");
#pragma omp parallel
{
#pragma omp sections
{
#pragma omp section
{
times_table(12);
}
#pragma omp section
{
countdown();
}
#pragma omp section
{
long_loop();
}
}
}
printf("Back to the main thread. Goodbye!\n");

getchar();
return 0;
}

In this example, the omp sections directive specifies a block of sections that may be run in parallel,
with each individual section specified within its own omp section block. While it is possible to
write the code within each omp section block directly, the code is more readable if you write
each section as a function (e.g. countdown, long_loop and times_table) and just call the
function from within each section.
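As an aside, when the parallel region contains nothing but the sections block (as it does here),
OpenMP also lets you merge the two directives into a single combined directive. A sketch of the
body of main written this way would be;

#pragma omp parallel sections
{
    #pragma omp section
    {
        times_table(12);
    }
    #pragma omp section
    {
        countdown();
    }
    #pragma omp section
    {
        long_loop();
    }
}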

5. Loops
OpenMP sections provide a method by which you can assign different functions to be run by
different threads. While this is easy to do, it does not scale well. You can only run as many threads
in parallel as there are different functions to be run. If you have more threads than functions, then
the extra threads will be idle. Also, if different functions take different amounts of time, then some
threads may finish earlier than other threads, and they will be left idle waiting for the remaining
threads to finish.
One way of achieving better performance is to use OpenMP to parallelize loops within your code.
Let's imagine you have a loop that requires 1000 iterations. If you have two threads in the OpenMP
team, then it would make sense for one thread to perform 500 of the 1000 iterations while the other
thread performs the other 500. This will scale as more threads are added: the iterations of the loop
can be shared evenly between them, e.g.

2 threads : 500 iterations each
4 threads : 250 iterations each
100 threads : 10 iterations each
1000 threads : 1 iteration each

Of course, this only scales up to the number of iterations in the loop, e.g. if there are 1500 threads,
then 1000 threads will have 1 iteration each, while 500 threads will sit idle.
Also, and this is quite important, this will only work if each iteration of the loop is independent.
This means that it should be possible to run each iteration in the loop in any order, and that each
iteration does not affect any other iteration. This is necessary as running loop iterations in parallel
means that we cannot guarantee that loop iteration 99 will be performed before loop iteration 100.

Example
Copy this into loops.c
#include "stdafx.h"
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#else
#define omp_get_thread_num() 0
#endif
int main(int argc, char **argv)
{
int i, thread_id, nloops;
#pragma omp parallel private(thread_id, nloops)
{
nloops = 0;

#pragma omp for


for (i=0; i<1000; ++i)
{
++nloops;
}
thread_id = omp_get_thread_num();
printf("Thread %d performed %d iterations of the loop.\n",
thread_id, nloops );
}
getchar();
return 0;
}

The new directive in this code is omp for. This tells the compiler that the for loop directly
below this pragma can be run in parallel by the team of threads. In this case, the only work
performed in each iteration is incrementing the thread-private counter of the number of times that the
loop has been performed (each thread in the team has its own copy of nloops because it is
specified as private as part of the OpenMP parallel pragma). By doing this, each thread in
the team counts up the number of times that it has performed the loop.
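Note that when a parallel region consists of nothing but a single loop, OpenMP allows the parallel
and for directives to be combined into one. A minimal sketch (the array and loop bound here are
just illustrative);

#include <stdio.h>

int main(int argc, char **argv)
{
    int i;
    double values[1000];

    /* Combined directive: creates the team of threads and shares the
       loop iterations between them. The loop variable i is
       automatically made private. */
    #pragma omp parallel for
    for (i = 0; i < 1000; ++i)
    {
        values[i] = i * 0.5;
    }

    printf("values[10] = %f\n", values[10]);

    return 0;
}

You can also add a schedule clause (e.g. schedule(dynamic)) to control how the iterations are
divided between the threads.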

6. Critical Code
Up to this point, the examples have shown how to just run different parts of the program in parallel,
either by using sections, and running different sections using different threads, or using loops, and
running different iterations of the loop using different threads. While this is good, it isn't yet useful.
This is because we have not yet seen how to make the threads work together on a problem. At the
moment, each thread works on its own, using its own private variables. For example, we've seen
how each thread can count the number of iterations it performed itself in a loop, but no mechanism
has been presented yet that would allow the team of threads to count up the total number of
iterations performed by all of them.
What we need is a way to allow the threads to combine their thread private copies of variables into
a global copy. One method of doing this is to use an OpenMP critical section. A critical section is
a part of code that is performed by all of the threads in the team, but that is only performed by one
thread at a time. This allows each thread to update a global variable with the local result calculated
by that thread, without having to worry about another thread trying to update that global variable
at the same time. To make this clear, use the following code to create the two executables,
broken_loopcount and fixed_loopcount;
Depending on the compiler, programming language and number of
threads, broken_loopcount may print out a wide range of different outputs. Sometimes it will
work, and will correctly add the number of iterations onto the global sum, and will correctly print
out the intermediate steps without problem. However, sometimes, completely randomly, it will
break, and either it will print out nonsense (e.g. it will add 1000 iterations to a total of 4000, but
then insist that the total is 10000) or it will get the total number of iterations completely wrong. The
reason for this is that while one thread is updating or printing the global total, another thread may
be changing it.
The fixed_loopcount in contrast will always work, regardless of compiler, programming
language or number of threads. This is because we've protected the update and printing of the
global total within an OpenMP critical section. The critical section ensures that only one thread at
a time is printing and updating the global sum of loops, and so ensures that two threads don't try
to access global_nloops simultaneously.
OpenMP critical sections are extremely important if you want to guarantee that your program will
work reproducibly. Without critical sections, random bugs can sneak through, and the result of
your program may be different if different numbers of threads are used.
OpenMP loops plus OpenMP critical sections provide the basis for one of the most efficient models
of parallel programming, namely map and reduce. The idea is that you map the problem to be
solved into a loop over a large number of iterations. Each iteration solves its own part of the
problem and computes the result into a local thread-private variable. Once the iteration is complete,
all of the thread-private variables are combined together (reduced) via critical sections to form the
final global result.


Example
Copy this code into broken_loopcount.c
#include "stdafx.h"
#include <stdio.h>
#ifdef _OPENMP
#include <omp.h>
#else
#define omp_get_thread_num() 0
#endif
int main(int argc, char **argv)
{
int i, thread_id;
int global_nloops, private_nloops;
global_nloops = 0;
#pragma omp parallel private(private_nloops, thread_id)
{
private_nloops = 0;
thread_id = omp_get_thread_num();
#pragma omp for
for (i=0; i<100000; ++i)
{
++private_nloops;
}
printf("Thread %d adding its iterations (%d) to the sum (%d)...\n",
thread_id, private_nloops, global_nloops);
global_nloops += private_nloops;
printf("...total nloops now equals %d.\n", global_nloops);
}
printf("The total number of loop iterations is %d\n",
global_nloops);
getchar();
return 0;
}

and copy this code into fixed_loopcount.c


#include "stdafx.h"
#include <stdio.h>

11

#ifdef _OPENMP
#include <omp.h>
#else
#define omp_get_thread_num() 0
#endif
int main(int argc, char **argv)
{
int i, thread_id;
int global_nloops, private_nloops;
global_nloops = 0;
#pragma omp parallel private(private_nloops, thread_id)
{
private_nloops = 0;
thread_id = omp_get_thread_num();
#pragma omp for
for (i=0; i<100000; ++i)
{
++private_nloops;
}
#pragma omp critical
{
printf("Thread %d adding its iterations (%d) to the sum
(%d)...\n",
thread_id, private_nloops, global_nloops);
global_nloops += private_nloops;
printf("...total nloops now equals %d.\n", global_nloops);
}
}
printf("The total number of loop iterations is %d\n",
global_nloops);
getchar();
return 0;
}

The only new code here is the omp critical section in fixed_loopcount.c. The critical
section is performed by each thread, but can only be performed by one thread at a time. This
ensures that while one thread is updating the global loop counter (global_nloops) with its
thread-local value (private_nloops), the value of global_nloops is not
changed by any other thread.
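As an aside, when the protected code is just a single simple update of a shared variable, OpenMP
also provides the lighter-weight atomic directive. A minimal sketch of protecting only the update
this way (the printing would still need a critical section, so it is omitted here);

/* atomic protects only the single update statement that follows it */
#pragma omp atomic
global_nloops += private_nloops;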


Exercise
Write an OpenMP parallel program to calculate pi using a Monte Carlo algorithm.
Pi can be calculated using Monte Carlo by imagining a circle with radius 1 sitting at the origin
within a square that just contains this circle (so with corners [-1,-1], [-1,1], [1,-1] and [1,1]). The
area of the circle is pi (from pi r squared), while the area of the square is 4. If we imagine throwing
darts randomly at the square, then the proportion that lie within the circle compared to the
proportion that lie outside will be directly related to the ratio of the area of the circle
to the area of the square. In a parallel loop, you must thus generate a large number of random
points in the square, and count up the number that lie within the circle and those that lie outside.
Reduce these numbers into a global count of the number inside and outside the circle, and then
take the ratio of these numbers to get the value of pi, i.e. pi is approximately
4 * n_inside / (n_inside + n_outside).

Answer to Exercise
#include "stdafx.h"
#include <math.h>
#include <stdlib.h>
#include <stdio.h>

double rand_one()
{
    return rand() / (RAND_MAX + 1.0);
}

int main(int argc, char **argv)
{
    int n_inside, n_outside;
    int pvt_n_inside, pvt_n_outside;
    int i;
    double x, y, r, pi;

    n_inside = 0;
    n_outside = 0;

    #pragma omp parallel private(x, y, r, pvt_n_inside, pvt_n_outside)
    {
        pvt_n_inside = 0;
        pvt_n_outside = 0;

        #pragma omp for
        for (i=0; i<1000000; ++i)
        {
            x = (2*rand_one()) - 1;
            y = (2*rand_one()) - 1;

            r = sqrt( x*x + y*y );

            if (r < 1.0)
            {
                ++pvt_n_inside;
            }
            else
            {
                ++pvt_n_outside;
            }
        }

        #pragma omp critical
        {
            n_inside += pvt_n_inside;
            n_outside += pvt_n_outside;
        }
    }

    pi = (4.0 * n_inside) / (n_inside + n_outside);

    printf("The estimated value of pi is %f\n", pi);

    getchar();

    return 0;
}

7. Reduction
Reduction, which is the process of combining (or reducing) the results of several sub-calculations
into a single combined (reduced) result, is very common, and is the second half of the very
powerful map-reduce form of parallel programming. In the exercise in the last section you used
reduction to form the global sum of the total number of points inside and outside the circle,
calculated as the sum of the number of points inside and outside the circle calculated by each
thread's iterations of the loop. While it may appear easy to write your own reduction code, it is
actually very hard to write efficient reduction code. This is because reduction requires the use of a
critical section, where only one thread is allowed to update the global sum at a time. Reduction
can actually be implemented much more efficiently, e.g. perhaps by dividing threads into pairs,
and getting each pair to sum their results, and then dividing into pairs of pairs, and summing the
pairs of pairs results, etc. (this method is known as binary tree reduction).
So reduction is actually quite complicated to implement if you want it to be efficient and work
well. Fortunately, you don't have to implement it, as OpenMP provides a reduction directive
which has implemented it for you! The reduction directive is added to the end of the OpenMP
parallel directive and has the following form;
reduction( operator : variable list )
where operator can be any binary operator (e.g. +, -, *), and variable list is a single
variable or list of variables that will be used for the reduction, e.g.


reduction( + : sum )
will tell the compiler that sum will hold the global result of a reduction which will be formed by
adding together the results of thread-private calculations, while,
reduction( - : balance, diff )
will tell the compiler that both balance and diff will hold the global results of reductions
which are formed by subtracting the results of thread-private calculations.
To make this clear, the following example provides the code for the fixed loop counting example from
the last section, rewritten to use reduction rather than thread-private variables with an OpenMP critical
section;

Example
#include "stdafx.h"
#include <stdio.h>
int main(int argc, char **argv)
{
int i;
int private_nloops, nloops;
nloops = 0;
#pragma omp parallel private(private_nloops) \
reduction(+ : nloops)
{
private_nloops = 0;
#pragma omp for
for (i=0; i<100000; ++i)
{
++private_nloops;
}
/* Reduction step - reduce 'private_nloops' into 'nloops' */
nloops = nloops + private_nloops;
}
printf("The total number of loop iterations is %d\n",
nloops);
getchar();
return 0;
}
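In fact, once the reduction clause is used, the explicit thread-private counter is no longer strictly
necessary. A minimal sketch of the same count written with the combined parallel for directive,
incrementing the reduction variable directly, would look like this;

#include <stdio.h>

int main(int argc, char **argv)
{
    int i;
    int nloops = 0;

    /* Each thread gets its own private copy of nloops (initialised
       to 0 for the + operator); the copies are summed into the
       global nloops at the end of the loop. */
    #pragma omp parallel for reduction(+ : nloops)
    for (i = 0; i < 100000; ++i)
    {
        ++nloops;
    }

    printf("The total number of loop iterations is %d\n", nloops);

    return 0;
}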

Exercise
Edit your program to estimate pi so that it uses reduction to form the sum of the number of points
inside and outside the circle.

Answer to Exercise (C)


#include "stdafx.h"
#include <math.h>
#include <stdlib.h>
#include <stdio.h>

double rand_one()
{
    return rand() / (RAND_MAX + 1.0);
}

int main(int argc, char **argv)
{
    int n_inside, n_outside;
    int pvt_n_inside, pvt_n_outside;
    int i;
    double x, y, r, pi;

    n_inside = 0;
    n_outside = 0;

    #pragma omp parallel private(x, y, r, pvt_n_inside, pvt_n_outside) \
                         reduction( + : n_inside, n_outside )
    {
        pvt_n_inside = 0;
        pvt_n_outside = 0;

        #pragma omp for
        for (i=0; i<1000000; ++i)
        {
            x = (2*rand_one()) - 1;
            y = (2*rand_one()) - 1;

            r = sqrt( x*x + y*y );

            if (r < 1.0)
            {
                ++pvt_n_inside;
            }
            else
            {
                ++pvt_n_outside;
            }
        }

        n_inside = n_inside + pvt_n_inside;
        n_outside = n_outside + pvt_n_outside;
    }

    pi = (4.0 * n_inside) / (n_inside + n_outside);

    printf("The estimated value of pi is %f\n", pi);

    getchar();

    return 0;
}

8. Map/Reduce
We have now covered enough that we can use OpenMP to parallelize a map/reduce style
calculation. In this case, the problem we will solve will be calculating the total interaction energy
between each ion in an array of ions with a single reference ion. Passed into this function will be
the reference ion, and an array of ions. The algorithm performed for each ion in the array will be;
1. Calculate the distance between the ion in the array and the reference ion.
2. Use this distance (r) to calculate the interaction energy ( 1 / r )
3. Add this interaction energy onto the total sum.
Map/reduce can be used when you have an array of data, a function you wish to apply (to map) to
each item in the array, and a single value you want back that is the reduction of the results of
applying the function to each item in the array. In terms of map/reduce, our algorithm would look
like this;
1. Create a function that calculates and returns the interaction between a passed ion and the
reference ion, e.g. calc_energy(ion)
2. Map each ion in the array against the energy function calc_energy
3. Reduce the result of each mapped function call using a sum (+)
Here are incomplete pieces of code that implement this algorithm (note that this is to provide an
example of how map/reduce can be used - you don't need to complete this code);
Note that the amount of OpenMP in these examples is very low (just 2-3 lines). This is quite
common for OpenMP programs - most of the work of parallelization is organizing your code so
that it can be parallelized. Once it has been organized, you then only need to add a small number
of OpenMP directives.


Example
#include "stdafx.h"
#include <stdio.h>
#include <math.h>
#include <stdlib.h>

/* Define an Ion type to hold the coordinates of an Ion */
typedef struct Ion
{
    double x;
    double y;
    double z;
} Ion;

/* The reference Ion */
struct Ion reference_ion;

/* Return the square of a number */
double square(double x)
{
    return x*x;
}

/* Energy function to be mapped */
double calc_energy( struct Ion ion )
{
    double r;

    r = sqrt( square( reference_ion.x - ion.x ) +
              square( reference_ion.y - ion.y ) +
              square( reference_ion.z - ion.z ) );

    /* The energy is simply 1 / r */
    return 1.0 / r;
}

/* You will need to fill in this function to read in
   an array of ions and return the array */
struct Ion* readArrayOfIons(int *num_ions)
{
    int i;

    Ion *ions = (Ion*)malloc(10 * sizeof(Ion));

    *num_ions = 10;

    for (i=0; i<10; ++i)
    {
        ions[i].x = 0.0;
        ions[i].y = 0.0;
        ions[i].z = i;
    }

    return ions;
}

int main(int argc, char **argv)
{
    int i, num_ions;
    struct Ion *ions_array;
    double total_energy, mapped_energy;

    /* Lets put the reference ion at (1,2,3) */
    reference_ion.x = 1.0;
    reference_ion.y = 2.0;
    reference_ion.z = 3.0;

    /* Now read in an array of ions */
    ions_array = readArrayOfIons( &num_ions );

    total_energy = 0.0;

    #pragma omp parallel private(mapped_energy) \
                         reduction( + : total_energy )
    {
        mapped_energy = 0.0;

        #pragma omp for
        for (i=0; i < num_ions; ++i)
        {
            /* Map this ion against the function */
            mapped_energy += calc_energy( ions_array[i] );
        }

        /* Reduce to get the result */
        total_energy += mapped_energy;
    }

    printf("The total energy is %f\n", total_energy);

    getchar();

    return 0;
}


9. Performance
OpenMP is reasonably straightforward to use, and can be added to existing programs. However,
don't let this simplicity deceive you. While writing OpenMP code is straightforward, writing
efficient, scalable OpenMP code can be hard, and requires you to think deeply about the problem
to be solved. As you saw in the map/reduce example, the amount of OpenMP was quite small, but
the application itself had to be arranged in a way that allowed the problem to be solved using a
map/reduce approach.
While the techniques used to get good performance using OpenMP are problem specific, there are
a set of general guidelines that will put you on the right track;

Avoid working with global variables whenever possible.

Try to do as much work as you can using thread-private variables. Structure your program
so that each thread can calculate a thread-local result, and these thread-local results
can then be combined together at the end into the final answer.

Avoid OpenMP critical regions wherever possible. These will cause threads to block,
as only one thread can be in a critical region at a time. However, remember
that critical regions are necessary whenever you update a global variable.
Try to leave all global updates to the end of the parallel region.

Prefer to use OpenMP reduction operations rather than trying to write your own.

However, the most important advice is benchmark, benchmark, and benchmark! Don't try and
guess the performance of your code, or guess that something makes it faster. Actually time how
fast your code runs (e.g. using the time command or using timers within your code), and compare
timings for different numbers of threads every time you make changes to your code. Make
benchmarking and performance testing as important in your coding workflow as debugging and
unit testing.
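For timing within the code itself, OpenMP provides a wall-clock timer, omp_get_wtime(), which
returns the elapsed time in seconds as a double. A minimal sketch of timing a block of work (the
do_work function here is just a placeholder for your own code);

#include <stdio.h>
#include <omp.h>

/* Placeholder for the code you actually want to time */
void do_work()
{
    int i;
    double sum = 0;

    #pragma omp parallel for reduction(+ : sum)
    for (i = 0; i < 1000000; ++i)
    {
        sum += i * 0.001;
    }

    printf("sum = %f\n", sum);
}

int main(int argc, char **argv)
{
    double start, end;

    start = omp_get_wtime();
    do_work();
    end = omp_get_wtime();

    printf("Elapsed time: %f seconds\n", end - start);

    return 0;
}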

