An Introduction to Parallel Programming
SECOND EDITION
Peter S. Pacheco
University of San Francisco
Matthew Malensek
University of San Francisco
Table of Contents
Cover image
Title page
Copyright
Dedication
Preface
1.10. Summary
1.11. Exercises
Bibliography
2.6. Performance
2.9. Assumptions
2.10. Summary
2.11. Exercises
Bibliography
3.8. Summary
3.9. Exercises
Bibliography
4.5. Busy-waiting
4.6. Mutexes
4.11. Thread-safety
4.12. Summary
4.13. Exercises
Bibliography
5.10. Tasking
5.11. Thread-safety
5.12. Summary
5.13. Exercises
Bibliography
6.13. CUDA trapezoidal rule III: blocks with more than one warp
6.15. Summary
6.16. Exercises
Bibliography
7.5. Summary
7.6. Exercises
Bibliography
Index
Copyright
Morgan Kaufmann is an imprint of Elsevier
50 Hampshire Street, 5th Floor, Cambridge, MA 02139,
United States
Notices
Knowledge and best practice in this field are constantly
changing. As new research and experience broaden our
understanding, changes in research methods,
professional practices, or medical treatment may become
necessary.
Practitioners and researchers must always rely on their
own experience and knowledge in evaluating and using
any information, methods, compounds, or experiments
described herein. In using such information or methods
they should be mindful of their own safety and the safety
of others, including parties for whom they have a
professional responsibility.
ISBN: 978-0-12-804605-0
Typeset by VTeX
Printed in the United States of America
Here the prefix my_ indicates that each core is using its
own, private variables, and each core can execute this block
of code independently of the other cores.
After each core completes execution of this code, its
variable my_sum will store the sum of the values computed
by its calls to Compute_next_value. For example, if there
are eight cores, n = 24, and the 24 calls to
Compute_next_value return the values

1, 4, 3, 9, 2, 8, 5, 1, 1, 6, 2, 7, 2, 5, 0, 4, 1, 8, 6, 5,
1, 2, 3, 9,

then the values stored in my_sum might be

Core     0    1    2    3    4    5    6    7
my_sum   8   19    7   15    7   13   12   14
The idea here is that each core will wait in the barrier
function until all the cores have entered the function—in
particular, until the master core has entered this function.
Currently, the most powerful parallel programs are
written using explicit parallel constructs, that is, they are
written using extensions to languages such as C, C++, and
Java. These programs include explicit instructions for
parallelism: core 0 executes task 0, core 1 executes task 1,
…, all cores synchronize, …, and so on, so such programs
are often extremely complex. Furthermore, the complexity
of modern cores often makes it necessary to use
considerable care in writing the code that will be executed
by a single core.
There are other options for writing parallel programs—
for example, higher level languages—but they tend to
sacrifice performance to make program development
somewhat easier.
1.10 Summary
For many years we've reaped the benefits of having ever-
faster processors. However, because of physical limitations,
the rate of performance improvement in conventional
processors has decreased dramatically. To increase the
power of processors, chipmakers have turned to multicore
integrated circuits, that is, integrated circuits with multiple
conventional processors on a single chip.
Ordinary serial programs, which are programs written
for a conventional single-core processor, usually cannot
exploit the presence of multiple cores, and it's unlikely that
translation programs will be able to shoulder all the work
of converting serial programs into parallel programs—
programs that can make use of multiple cores. As software
developers, we need to learn to write parallel programs.
When we write parallel programs, we usually need to
coordinate the work of the cores. This can involve
communication among the cores, load balancing, and
synchronization of the cores.
In this book we'll be learning to program parallel
systems, so that we can maximize their performance. We'll
be using the C language with four different application
program interfaces or APIs: MPI, Pthreads, OpenMP, and
CUDA. These APIs are used to program parallel systems
that are classified according to how the cores access
memory and whether the individual cores can operate
independently of each other.
In the first classification, we distinguish between shared-
memory and distributed-memory systems. In a shared-
memory system, the cores share access to one large pool of
memory, and they can coordinate their actions by accessing
shared memory locations. In a distributed-memory system,
each core has its own private memory, and the cores can
coordinate their actions by sending messages across a
network.
In the second classification, we distinguish between
systems with cores that can operate independently of each
other and systems in which the cores all execute the same
instruction. In both types of system, the cores can operate
on their own data stream. So the first type of system is
called a multiple-instruction multiple-data or MIMD
system, and the second type of system is called a single-
instruction multiple-data or SIMD system.
MPI is used for programming distributed-memory MIMD
systems. Pthreads is used for programming shared-memory
MIMD systems. OpenMP can be used to program both
shared-memory MIMD and shared-memory SIMD systems,
although we'll be looking at using it to program MIMD
systems. CUDA is used for programming Nvidia graphics
processing units or GPUs. GPUs have aspects of all four
types of system, but we'll be mainly interested in the
shared-memory SIMD and shared-memory MIMD aspects.
Concurrent programs can have multiple tasks in
progress at any instant. Parallel and distributed
programs usually have tasks that execute simultaneously.
There isn't a hard and fast distinction between parallel and
distributed, although in parallel programs, the tasks are
usually more tightly coupled.
Parallel programs are usually very complex. So it's even
more important to use good program development
techniques with parallel programs.
1.11 Exercises
1.1 Devise formulas for the functions that calculate
my_first_i and my_last_i in the global sum example.
Remember that each core should be assigned roughly
the same number of computations in the loop.
Hint: First consider the case when n is evenly
divisible by p.
1.2 We've implicitly assumed that each call to
Compute_next_value requires roughly the same amount of
work as the other calls. How would you change your
answer to the preceding question if call i = k requires
k + 1 times as much work as the call with i = 0? How
would you change your answer if the first call (i = 0)
requires 2 milliseconds, the second call (i = 1)
requires 4, the third (i = 2) requires 6, and so on?
1.3 Try to write pseudocode for the tree-structured
global sum illustrated in Fig. 1.1. Assume the
number of cores is a power of two (1, 2, 4, 8, …).
Hint: Use a variable divisor to determine whether a
core should send its sum or receive and add. The
divisor should start with the value 2 and be doubled
after each iteration. Also use a variable
core_difference to determine which core should be
partnered with the current core. It should start with
the value 1 and also be doubled after each iteration.
For example, in the first iteration 0 % divisor = 0
and 1 % divisor = 1, so 0 receives and adds, while 1
sends. Also in the first iteration
0 + core_difference = 1 and 1 - core_difference = 0,
so 0 and 1 are paired in the first iteration.
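One way to turn this hint into runnable C, with prints
standing in for the actual sends, receives, and additions
(a sketch only, assuming the core count is a power of two):

#include <stdio.h>

/* Sketch for Exercise 1.3: the pairing schedule from the
   hint. Prints stand in for the send/receive-and-add
   steps; p is assumed to be a power of two. */
void tree_sum_schedule(int my_rank, int p) {
    int divisor = 2, core_difference = 1;
    while (divisor <= p) {
        if (my_rank % divisor == 0) {
            printf("core %d receives from core %d and adds\n",
                   my_rank, my_rank + core_difference);
        } else {
            printf("core %d sends to core %d\n",
                   my_rank, my_rank - core_difference);
            return;            /* a sending core is finished */
        }
        divisor *= 2;
        core_difference *= 2;
    }
}

int main(void) {
    int p = 8;                 /* example core count */
    for (int rank = 0; rank < p; rank++)
        tree_sum_schedule(rank, p);
    return 0;
}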
1.4 As an alternative to the approach outlined in the
preceding problem, we can use C's bitwise operators
to implement the tree-structured global sum. To see
how this works, it helps to write down the binary
(base 2) representation of each of the core ranks and
note the pairings during each stage:

Core (binary)   Stage 1 partner   Stage 2 partner   Stage 3 partner
0 (000)         1 (001)           2 (010)           4 (100)
1 (001)         0 (000)           -                 -
2 (010)         3 (011)           0 (000)           -
3 (011)         2 (010)           -                 -
4 (100)         5 (101)           6 (110)           0 (000)
5 (101)         4 (100)           -                 -
6 (110)         7 (111)           4 (100)           -
7 (111)         6 (110)           -                 -
From the table, we see that during the first stage each
core is paired with the core whose rank differs in the
rightmost or first bit. During the second stage, cores
that continue are paired with the core whose rank
differs in the second bit; and during the third stage,
cores are paired with the core whose rank differs in
the third bit. Thus if we have a binary value bitmask
that is 001₂ for the first stage, 010₂ for the second,
and 100₂ for the third, we can get the rank of the core
we're paired with by “inverting” the bit in our rank
that is nonzero in bitmask. This can be done using the
bitwise exclusive or (^) operator.
Implement this algorithm in pseudocode using the
bitwise exclusive or and the left-shift operator.
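To see the two operators in action before writing the
pseudocode, this sketch prints each rank's partner at every
stage. Note that in the actual algorithm only the cores
still active at a given stage would use the computed
partner:

#include <stdio.h>

/* Sketch for Exercise 1.4: computing partners with bitwise
   operators. bitmask takes the values 001, 010, 100 in
   binary; rank ^ bitmask flips the corresponding bit of
   the rank, giving the partner for that stage. */
int main(void) {
    int p = 8;                 /* example core count */
    for (int bitmask = 1; bitmask < p; bitmask <<= 1)
        for (int rank = 0; rank < p; rank++)
            printf("bitmask %d: core %d pairs with core %d\n",
                   bitmask, rank, rank ^ bitmask);
    return 0;
}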
1.5 What happens if your pseudocode in Exercise 1.3 or
Exercise 1.4 is run when the number of cores is not a
power of two (e.g., 3, 5, 6, 7)? Can you modify the
pseudocode so that it will work correctly regardless
of the number of cores?
1.6 Derive formulas for the number of receives and
additions that core 0 carries out using
a. the original pseudocode for a global sum, and
b. the tree-structured global sum.
Make a table showing the numbers of receives and
additions carried out by core 0 when the two sums
are used with 2, 4, 8, …, 1024 cores.
1.7 The first part of the global sum example—when
each core adds its assigned computed values—is
usually considered to be an example of data-
parallelism, while the second part of the first global
sum—when the cores send their partial sums to the
master core, which adds them—could be considered
to be an example of task-parallelism. What about the
second part of the second global sum—when the
cores use a tree structure to add their partial sums?
Is this an example of data- or task-parallelism? Why?
1.8 Suppose the faculty members are throwing a party
for the students in the department.
a. Identify tasks that can be assigned to the
faculty members that will allow them to use task-
parallelism when they prepare for the party.
Work out a schedule that shows when the various
tasks can be performed.
b. We might hope that one of the tasks in the
preceding part is cleaning the house where the
party will be held. How can we use data-
parallelism to partition the work of cleaning the
house among the faculty?
c. Use a combination of task- and data-parallelism
to prepare for the party. (If there's too much
work for the faculty, you can use TAs to pick up
the slack.)
1.9 Write an essay describing a research problem in
your major that would benefit from the use of
parallel computing. Provide a rough outline of how
parallelism would be used. Would you use task- or
data-parallelism?
Bibliography
[5] Clay Breshears, The Art of Concurrency: A Thread
Monkey's Guide to Writing Parallel Applications.
Sebastopol, CA: O'Reilly; 2009.
[28] John Hennessy, David Patterson, Computer
Architecture: A Quantitative Approach. 6th ed.
Burlington, MA: Morgan Kaufmann; 2019.
[31] IBM, IBM InfoSphere Streams v1.2.0 supports
highly complex heterogeneous data analysis, IBM
United States Software Announcement 210-037,
Feb. 23, 2010.
http://www.ibm.com/common/ssi/rep_ca/7/897/ENUS210-037/ENUS210-037.PDF.
[36] John Loeffler, No more transistors: the end of
Moore's Law, Interesting Engineering, Nov. 29, 2018.
See https://interestingengineering.com/no-more-transistors-the-end-of-moores-law.
Chapter 2: Parallel hardware and parallel software
It's perfectly feasible for specialists in disciplines other
than computer science and computer engineering to write
parallel programs. However, to write efficient parallel
programs, we often need some knowledge of the underlying
hardware and system software. It's also very useful to have
some knowledge of different types of parallel software, so
in this chapter we'll take a brief look at a few topics in
hardware and software. We'll also take a brief look at
evaluating program performance and a method for
developing parallel programs. We'll close with a discussion
of what kind of environment we might expect to be working
in, and a few rules and assumptions we'll make in the rest
of the book.
This is a long, broad chapter, so it may be a good idea to
skim through some of the sections on a first reading so that
you have a good idea of what's in the chapter. Then, when a
concept or term in a later chapter isn't quite clear, it may
be helpful to refer back to this chapter. In particular, you
may want to skim over most of the material in
“Modifications to the von Neumann Model,” except “The
Basics of Caching.” Also, in the “Parallel Hardware”
section, you can safely skim the material on
“Interconnection Networks.” You can also skim the material
on “SIMD Systems” unless you're planning to read the
chapter on CUDA programming.