An Introduction to Parallel Programming
SECOND EDITION
Peter S. Pacheco
University of San Francisco
Matthew Malensek
University of San Francisco
Table of Contents
Cover image
Title page
Copyright
Dedication
Preface
1.10. Summary
1.11. Exercises
Bibliography
2.6. Performance
2.9. Assumptions
2.10. Summary
2.11. Exercises
Bibliography
3.8. Summary
3.9. Exercises
Bibliography
4.5. Busy-waiting
4.6. Mutexes
4.11. Thread-safety
4.12. Summary
4.13. Exercises
Bibliography
5.10. Tasking
5.11. Thread-safety
5.12. Summary
5.13. Exercises
Bibliography
6.13. CUDA trapezoidal rule III: blocks with more than one warp
6.15. Summary
6.16. Exercises
Bibliography
7.5. Summary
7.6. Exercises
Bibliography
Index
Copyright
Morgan Kaufmann is an imprint of Elsevier
50 Hampshire Street, 5th Floor, Cambridge, MA 02139,
United States
Notices
Knowledge and best practice in this field are constantly
changing. As new research and experience broaden our
understanding, changes in research methods,
professional practices, or medical treatment may become
necessary.
Practitioners and researchers must always rely on their
own experience and knowledge in evaluating and using
any information, methods, compounds, or experiments
described herein. In using such information or methods
they should be mindful of their own safety and the safety
of others, including parties for whom they have a
professional responsibility.
ISBN: 978-0-12-804605-0
Typeset by VTeX
Printed in the United States of America
Here the prefix my_ indicates that each core is using its
own, private variables, and each core can execute this block
of code independently of the other cores.
After each core completes execution of this code, its
variable my_sum will store the sum of the values computed
by its calls to Compute_next_value. For example, if there
are eight cores, n = 24, and the 24 calls to
Compute_next_value return the values

1, 4, 3, 9, 2, 8, 5, 1, 1, 6, 2, 7, 2, 5, 0, 4, 1, 8, 6, 5,
1, 2, 3, 9,

then the values stored in my_sum might be

Core     0    1    2    3    4    5    6    7
my_sum   8   19    7   15    7   13   12   14
The idea here is that each core will wait in the barrier
function until all the cores have entered the function—in
particular, until the master core has entered this function.
Currently, the most powerful parallel programs are
written using explicit parallel constructs, that is, they are
written using extensions to languages such as C, C++, and
Java. These programs include explicit instructions for
parallelism: core 0 executes task 0, core 1 executes task 1,
…, all cores synchronize, …, and so on, so such programs
are often extremely complex. Furthermore, the complexity
of modern cores often makes it necessary to use
considerable care in writing the code that will be executed
by a single core.
There are other options for writing parallel programs—
for example, higher level languages—but they tend to
sacrifice performance to make program development
somewhat easier.
1.10 Summary
For many years we've reaped the benefits of having ever-
faster processors. However, because of physical limitations,
the rate of performance improvement in conventional
processors has decreased dramatically. To increase the
power of processors, chipmakers have turned to multicore
integrated circuits, that is, integrated circuits with multiple
conventional processors on a single chip.
Ordinary serial programs, which are programs written
for a conventional single-core processor, usually cannot
exploit the presence of multiple cores, and it's unlikely that
translation programs will be able to shoulder all the work
of converting serial programs into parallel programs—
programs that can make use of multiple cores. As software
developers, we need to learn to write parallel programs.
When we write parallel programs, we usually need to
coordinate the work of the cores. This can involve
communication among the cores, load balancing, and
synchronization of the cores.
In this book we'll be learning to program parallel
systems, so that we can maximize their performance. We'll
be using the C language with four different application
program interfaces or APIs: MPI, Pthreads, OpenMP, and
CUDA. These APIs are used to program parallel systems
that are classified according to how the cores access
memory and whether the individual cores can operate
independently of each other.
In the first classification, we distinguish between shared-
memory and distributed-memory systems. In a shared-
memory system, the cores share access to one large pool of
memory, and they can coordinate their actions by accessing
shared memory locations. In a distributed-memory system,
each core has its own private memory, and the cores can
coordinate their actions by sending messages across a
network.
In the second classification, we distinguish between
systems with cores that can operate independently of each
other and systems in which the cores all execute the same
instruction. In both types of system, the cores can operate
on their own data stream. So the first type of system is
called a multiple-instruction multiple-data or MIMD
system, and the second type of system is called a single-
instruction multiple-data or SIMD system.
MPI is used for programming distributed-memory MIMD
systems. Pthreads is used for programming shared-memory
MIMD systems. OpenMP can be used to program both
shared-memory MIMD and shared-memory SIMD systems,
although we'll be looking at using it to program MIMD
systems. CUDA is used for programming Nvidia graphics
processing units or GPUs. GPUs have aspects of all four
types of system, but we'll be mainly interested in the
shared-memory SIMD and shared-memory MIMD aspects.
Concurrent programs can have multiple tasks in
progress at any instant. Parallel and distributed
programs usually have tasks that execute simultaneously.
There isn't a hard and fast distinction between parallel and
distributed, although in parallel programs, the tasks are
usually more tightly coupled.
Parallel programs are usually very complex. So it's even
more important to use good program development
techniques with parallel programs.
1.11 Exercises
1.1 Devise formulas for the functions that calculate
my_first_i and my_last_i in the global sum example.
Remember that each core should be assigned roughly
the same number of computations in the loop.
Hint: First consider the case when n is evenly
divisible by p.
1.2 We've implicitly assumed that each call to
Compute_next_value requires roughly the same amount of
work as the other calls. How would you change your
answer to the preceding question if call i = k requires
k + 1 times as much work as the call with i = 0? How
would you change your answer if the first call (i = 0)
requires 2 milliseconds, the second call (i = 1)
requires 4, the third (i = 2) requires 6, and so on?
1.3 Try to write pseudocode for the tree-structured
global sum illustrated in Fig. 1.1. Assume the
number of cores is a power of two (1, 2, 4, 8, …).
Hint: Use a variable divisor to determine whether a
core should send its sum or receive and add. The
divisor should start with the value 2 and be doubled
after each iteration. Also use a variable
core_difference to determine which core should be
partnered with the current core. It should start with
the value 1 and also be doubled after each iteration.
For example, in the first iteration 0 % divisor = 0
and 1 % divisor = 1, so 0 receives and adds, while 1
sends. Also in the first iteration
0 + core_difference = 1 and 1 - core_difference = 0,
so 0 and 1 are paired in the first iteration.
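One way to turn this hint into runnable C, with prints
standing in for the actual sends, receives, and additions
(a sketch only, assuming the core count is a power of two):

#include <stdio.h>

/* Sketch for Exercise 1.3: the pairing schedule from the
   hint. Prints stand in for the send/receive-and-add
   steps; p is assumed to be a power of two. */
void tree_sum_schedule(int my_rank, int p) {
    int divisor = 2, core_difference = 1;
    while (divisor <= p) {
        if (my_rank % divisor == 0) {
            printf("core %d receives from core %d and adds\n",
                   my_rank, my_rank + core_difference);
        } else {
            printf("core %d sends to core %d\n",
                   my_rank, my_rank - core_difference);
            return;            /* a sending core is finished */
        }
        divisor *= 2;
        core_difference *= 2;
    }
}

int main(void) {
    int p = 8;                 /* example core count */
    for (int rank = 0; rank < p; rank++)
        tree_sum_schedule(rank, p);
    return 0;
}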
1.4 As an alternative to the approach outlined in the
preceding problem, we can use C's bitwise operators
to implement the tree-structured global sum. To see
how this works, it helps to write down the binary
(base 2) representation of each of the core ranks and
note the pairings during each stage:

Core (binary)   Stage 1 partner   Stage 2 partner   Stage 3 partner
0 (000)         1 (001)           2 (010)           4 (100)
1 (001)         0 (000)           -                 -
2 (010)         3 (011)           0 (000)           -
3 (011)         2 (010)           -                 -
4 (100)         5 (101)           6 (110)           0 (000)
5 (101)         4 (100)           -                 -
6 (110)         7 (111)           4 (100)           -
7 (111)         6 (110)           -                 -
From the table, we see that during the first stage each
core is paired with the core whose rank differs in the
rightmost or first bit. During the second stage, cores
that continue are paired with the core whose rank
differs in the second bit; and during the third stage,
cores are paired with the core whose rank differs in
the third bit. Thus if we have a binary value bitmask
that is 001₂ for the first stage, 010₂ for the second,
and 100₂ for the third, we can get the rank of the core
we're paired with by “inverting” the bit in our rank
that is nonzero in bitmask. This can be done using the
bitwise exclusive or (^) operator.
Implement this algorithm in pseudocode using the
bitwise exclusive or and the left-shift operator.
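To see the two operators in action before writing the
pseudocode, this sketch prints each rank's partner at every
stage. Note that in the actual algorithm only the cores
still active at a given stage would use the computed
partner:

#include <stdio.h>

/* Sketch for Exercise 1.4: computing partners with bitwise
   operators. bitmask takes the values 001, 010, 100 in
   binary; rank ^ bitmask flips the corresponding bit of
   the rank, giving the partner for that stage. */
int main(void) {
    int p = 8;                 /* example core count */
    for (int bitmask = 1; bitmask < p; bitmask <<= 1)
        for (int rank = 0; rank < p; rank++)
            printf("bitmask %d: core %d pairs with core %d\n",
                   bitmask, rank, rank ^ bitmask);
    return 0;
}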
1.5 What happens if your pseudocode in Exercise 1.3 or
Exercise 1.4 is run when the number of cores is not a
power of two (e.g., 3, 5, 6, 7)? Can you modify the
pseudocode so that it will work correctly regardless
of the number of cores?
1.6 Derive formulas for the number of receives and
additions that core 0 carries out using
a. the original pseudocode for a global sum, and
b. the tree-structured global sum.
Make a table showing the numbers of receives and
additions carried out by core 0 when the two sums
are used with 2, 4, 8, …, 1024 cores.
1.7 The first part of the global sum example—when
each core adds its assigned computed values—is
usually considered to be an example of data-
parallelism, while the second part of the first global
sum—when the cores send their partial sums to the
master core, which adds them—could be considered
to be an example of task-parallelism. What about the
second part of the second global sum—when the
cores use a tree structure to add their partial sums?
Is this an example of data- or task-parallelism? Why?
1.8 Suppose the faculty members are throwing a party
for the students in the department.
a. Identify tasks that can be assigned to the
faculty members that will allow them to use task-
parallelism when they prepare for the party.
Work out a schedule that shows when the various
tasks can be performed.
b. We might hope that one of the tasks in the
preceding part is cleaning the house where the
party will be held. How can we use data-
parallelism to partition the work of cleaning the
house among the faculty?
c. Use a combination of task- and data-parallelism
to prepare for the party. (If there's too much
work for the faculty, you can use TAs to pick up
the slack.)
1.9 Write an essay describing a research problem in
your major that would benefit from the use of
parallel computing. Provide a rough outline of how
parallelism would be used. Would you use task- or
data-parallelism?
Bibliography
[5] Clay Breshears, The Art of Concurrency: A Thread
Monkey's Guide to Writing Parallel Applications.
Sebastopol, CA: O'Reilly; 2009.
[28] John Hennessy, David Patterson, Computer
Architecture: A Quantitative Approach. 6th ed.
Burlington, MA: Morgan Kaufmann; 2019.
[31] IBM, IBM InfoSphere Streams v1.2.0 supports
highly complex heterogeneous data analysis, IBM
United States Software Announcement 210-037,
Feb. 23, 2010.
http://www.ibm.com/common/ssi/rep_ca/7/897/ENUS210-037/ENUS210-037.PDF.
[36] John Loeffler, No more transistors: the end of
Moore's Law, Interesting Engineering, Nov. 29, 2018.
See https://interestingengineering.com/no-more-transistors-the-end-of-moores-law.
Chapter 2: Parallel hardware and parallel software
It's perfectly feasible for specialists in disciplines other
than computer science and computer engineering to write
parallel programs. However, to write efficient parallel
programs, we often need some knowledge of the underlying
hardware and system software. It's also very useful to have
some knowledge of different types of parallel software, so
in this chapter we'll take a brief look at a few topics in
hardware and software. We'll also take a brief look at
evaluating program performance and a method for
developing parallel programs. We'll close with a discussion
of what kind of environment we might expect to be working
in, and a few rules and assumptions we'll make in the rest
of the book.
This is a long, broad chapter, so it may be a good idea to
skim through some of the sections on a first reading so that
you have a good idea of what's in the chapter. Then, when a
concept or term in a later chapter isn't quite clear, it may
be helpful to refer back to this chapter. In particular, you
may want to skim over most of the material in
“Modifications to the von Neumann Model,” except “The
Basics of Caching.” Also, in the “Parallel Hardware”
section, you can safely skim the material on
“Interconnection Networks.” You can also skim the material
on “SIMD Systems” unless you're planning to read the
chapter on CUDA programming.