Concepts of Parallel Computing
Alf Wachsmann
Stanford Linear Accelerator Center (SLAC)

Why do it in parallel?
• Why is parallel computing a good idea?
• 1 worker needs 3 days to dig a ditch.
How long do 3 workers need?

• Parallel Computing is (in the most general sense) the
simultaneous use of multiple compute resources to
solve a computational problem

• What about
• 1 tree takes 30 years to grow big.
How long do 3 trees need?

Parallel Addition
• Diagram in space and time
• Abstraction from communication (the hard part!)
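[Diagram: addition tree over 8 processors; wall clock time runs from the first row to the last]
Processor:   1     2     3     4     5      6      7      8
Step 1:     1+2   3+4   5+6   7+8   9+10  11+12  13+14  15+16
          =  3     7    11    15    19     23     27     31
Step 2:        10          26           42            58
Step 3:              36                        100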
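
The addition tree above can be sketched in Python. This is a minimal, sequential simulation, not a parallel implementation: `tree_sum` is a hypothetical helper name, and each pass of the loop stands for one wall clock step in which all pair additions could run on separate processors at the same time.

```python
def tree_sum(values):
    """Sum a non-empty sequence pairwise, level by level, as in the addition tree.

    Returns (total, levels), where levels[i] holds the partial sums
    after step i+1. The additions within one level are independent of
    each other, so a parallel machine could do them simultaneously.
    """
    levels = []
    current = list(values)
    while len(current) > 1:
        # Add neighbours pairwise; an odd element at the end is carried over.
        nxt = [current[i] + current[i + 1] for i in range(0, len(current) - 1, 2)]
        if len(current) % 2 == 1:
            nxt.append(current[-1])
        levels.append(nxt)
        current = nxt
    return current[0], levels


total, levels = tree_sum(range(1, 17))
for step, partial in enumerate(levels, start=1):
    print(f"step {step}: {partial}")
print("total:", total)
```

For 16 numbers the tree needs 4 steps instead of the 15 steps a single processor would take, but the number of additions performed (8 + 4 + 2 + 1 = 15) is unchanged.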

Why do it in parallel?
• Algorithmic reasons:
• Save time (wall clock time) – does NOT save work!
• Solve larger problems (more memory)
• Systemic reasons:
• Transmission speed (speed of light)
• Limits to miniaturization
• Economic limits

Maximum Gain
• Gain by doing it in parallel is
$\text{speedup} = \frac{\text{running time for best serial algorithm}}{\text{running time for parallel algorithm}}$

Ideally: use P processors and get P-fold speedup.

Linear speedup in P is the best we can hope for!

There are cases of super-linear speedup.

Sequential Computer
• Architecture of serial computers:


[Figure: fetch and execute cycle]

• Von Neumann Architecture:
• Memory is used to store both program and data
• CPU gets instructions and/or data from memory
• Decodes instructions
• Executes them sequentially

Parallel Computers
• Widely used classification for parallel computers:
Flynn's Taxonomy (1966)

• SISD: Single Instruction, Single Data
• SIMD: Single Instruction, Multiple Data
• MISD: Multiple Instruction, Single Data
• MIMD: Multiple Instruction, Multiple Data

Memory Architectures
• Another important classification scheme is according
to the parallel computer's memory architecture
• Shared memory
• Uniform memory access
• Non-uniform memory access
• Distributed memory
• Hybrid distributed-shared memory solutions

Shared Memory
• Shared Memory
• Multiple processors can operate independently but share
the same memory resources
• Changes in a memory location effected by one processor
are visible to all other processors (global address space)


[Figure: CPUs attached to one shared memory]


Uniform Memory Access
• Most commonly represented today by Symmetric
Multiprocessor (SMP) machines
• Identical processors
• Equal access and access times to memory
• Sometimes called CC-UMA - Cache Coherent UMA.
Cache Coherent means if one processor updates a
location in shared memory, all the other processors
know about the update. Cache coherency is
accomplished at the hardware level.

Non-Uniform Memory Access
• Often made by physically linking two or more SMPs
• One SMP can directly access memory of another SMP
• Not all processors have equal access time to all memories
• Memory access across link is slower
• If cache coherency is maintained, then may also be
called CC-NUMA - Cache Coherent NUMA

Distributed Memory
• Processors have their own local memory. Memory
addresses in one processor do not map to another
processor, so there is no concept of global address
space across all processors
• Distributed memory systems require a
communication network to connect inter-processor
memory
• The network "fabric" used for data transfer varies
widely; it can be as simple as Ethernet
[Figure: Node 1 to Node 4, each with its own local memory, connected by a network]


• Shared Memory
• Advantages
• Global address space
• Data sharing between tasks is both fast and uniform
• Disadvantages
• Lack of scalability between memory and CPUs
• Programmer responsibility for synchronization
• Expensive

• Distributed Memory
• Advantages
• Memory is scalable with number of processors
• Each processor can rapidly access its own memory
• Disadvantages
• NUMA access times
• Programmer responsible for many details
• Difficult to map existing data structures

• Hybrid Distributed-Shared Memory
• Used in most of today's parallel computers
• Cache-coherent SMP nodes
• Distributed memory is networking of multiple SMP nodes

[Figure: Node 1 to Node 4, each an SMP node with its own memory, connected by a network]


Example Machines
Comparison of Shared and Distributed Memory Architectures

Architecture: CC-UMA
• Examples: SMPs, Sun Fire Exxx/Vxxx, DEC/Compaq, SGI Challenge, IBM POWER3
• Communications: Threads, OpenMP, shmem
• Scalability: to 10s of processors
• Drawbacks: Limited memory
• Software availability: declining

Architecture: CC-NUMA
• Examples: SGI Origin/Altix, Sequent, HP Exemplar, DEC/Compaq, IBM POWER4
• Communications: Threads, OpenMP, shmem
• Scalability: to 100s of processors
• Drawbacks: “New architecture”; point-to-point communication
• Software availability: stable

Architecture: Distributed
• Examples: Cray T3E, Maspar, IBM SP, IBM Blue Gene/L, Beowulf Clusters
• Communications: MPI
• Scalability: to 1000s of processors
• Drawbacks: System administration; programming is hard to develop and maintain
• Software availability: still rising

Parallel Programming Models
• Abstraction above hardware and memory
• Several programming models in use:
• Shared Memory (“parallel computing”)
• Threads
• Message Passing (“distributed computing”)
• Data Parallel
• Hybrid approaches
• All models exist for all hardware/memory architectures

Shared Memory Model
• Tasks share a common address space, which they
read and write asynchronously
• Access control to shared memory via locks or semaphores
• No notion of “ownership” of data – no need to
explicitly communicate data between tasks
• Implementations
• shared memory machines: compiler
• distributed memory machines: simulations

Threads Model
• A single process has multiple, concurrent execution paths
• Most commonly used on shared mem. machines and
in operating systems
[Figure: program prg.exe whose work is carried out by threads T1, T2, T3 and T4]
prg.exe
  call sub1
  call sub2
  do i = 1, n
    A(i) = fnct(i^3)
    B(i) = A(i) * p
  end do
  call sub3
  call sub4

Threads Model
• Implementations
• POSIX Thread Library
• C language only
• Offered for most hardware
• Very explicit parallelism
• Requires significant programmer attention to detail
• OpenMP
• Based on compiler directives; can use sequential code
• Fortran, C, C++
• portable/multi-platform
• Can be very easy and simple to use

Message Passing Model
• Tasks exchange data through communications by
sending and receiving messages
• usually requires cooperative operations to be
performed by each process: a send operation must
have a matching receive operation

Machine A Machine B
task 0 task 1

data data
send(data) receive(data)
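
The matching send and receive can be sketched in Python. This is only a stand-in for real message passing: both tasks run as threads inside one process and a `queue.Queue` plays the role of the network, whereas tasks on different machines would use a message passing library such as MPI. The names `channel`, `task0` and `task1` are assumptions for illustration.

```python
import queue
import threading

channel = queue.Queue()   # stands in for the network between the two tasks
received = []


def task0():
    data = [1, 2, 3]
    channel.put(data)            # send(data)


def task1():
    data = channel.get()         # receive(data): blocks until a message arrives
    received.append(sum(data))


t1 = threading.Thread(target=task1)
t0 = threading.Thread(target=task0)
t1.start()                       # the receiver may start first; it simply waits
t0.start()
t0.join()
t1.join()
print(received)
```

The receiver does not read the sender's variables; it only sees what arrives through the channel, which is the cooperative send/receive pairing described above.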

Message Passing Model
• Implementations
• Parallel Virtual Machine (PVM)
• Not much in use any more
• Message Passing Interface (MPI)
• Part 1 released 1994
• Part 2 (MPI-2) released 1997
• Now de-facto standard
• Fortran, C, C++
• Available on virtually all machines
• Open MPI, MPICH, LAM/MPI, many vendor-specific versions
• On shared memory machines, MPI implementations usually
don't use a network for task communications

Data Parallel Model
• A set of tasks work collectively on the same data
• Each task works on a different partition of the
same data structure
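
A minimal Python sketch of this idea, assuming a simple block partition: `block_partition` and `scale` are hypothetical names, and the tasks are run one after another here, although each call works only on its own block and could run as a separate task.

```python
def block_partition(data, num_tasks):
    """Split data into num_tasks contiguous blocks of nearly equal size."""
    n = len(data)
    bounds = [(n * t) // num_tasks for t in range(num_tasks + 1)]
    return [data[bounds[t]:bounds[t + 1]] for t in range(num_tasks)]


def scale(block, factor):
    # Every task applies the same operation to its own block.
    return [factor * x for x in block]


data = list(range(10))
blocks = block_partition(data, 3)
results = [scale(block, 2) for block in blocks]   # one call per task
print(blocks)
print(results)
```

Other partitions (for example handing out elements round-robin) fit the same pattern; only `block_partition` would change.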

Data Parallel Model
• Implementations
• Fortran 90
• ISO/ANSI extension of Fortran 77
• Additions to program structure and commands
• Variable additions – methods and arguments
• High Performance Fortran (HPF)
• Contains everything in F90
• Directives to tell compiler how to distribute data added
• Data parallel constructs added (now part of F95)
• On distr. memory machines: translated into MPI code

Hybrid Programming Models
• Two or more of the previous models are used in the
same program
• Common examples:
• POSIX Threads and Message Passing (MPI)
• OpenMP and MPI
• ClusterOpenMP (Intel)
• Works well on network of SMP machines
• Also used:
• Data Parallel and MPI

Designing Parallel Programs
• No real parallelizing compilers
• Compiler “knows” how to parallelize certain constructs (e.g. loops)
• Compiler uses “directives” from programmer
• Not simply a matter of taking sequential algorithm
and “making it parallel”. Sometimes, completely
different algorithmic approach necessary
• Very time-consuming and labor-intensive task

Parallelization Techniques
• Domain Decomposition
• Data is partitioned
• Each task works on different part of data
• Three different ways to partition data

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann

Parallelization Techniques
• Functional Decomposition
• Problem is partitioned into set of independent tasks

Both types of decomposition can be and often are combined

A little Theory
• Some problems can be parallelized very well:
In complexity theory, the class NC ("Nick's
Class") is the set of decision problems decidable
in poly-logarithmic time on a parallel computer
with a polynomial number of processors. In other
words, a problem is in NC if there are constants c
and k such that it can be solved in time $O(\log^c n)$
using $O(n^k)$ parallel processors.


A little Theory
• Some problems can't be parallelized at all!
• Example: Calculating the Fibonacci Sequence
(1,1,2,3,5,8,13,21,...) by using the formula

F 1=1
F 2=1
F k2=F k F k 1

Calculation entails dependent calculations: The

calculation of the k + 2 value uses those of both
k + 1 and k. These three terms cannot be calculated
independently and therefore, cannot be parallelized.
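
The dependency is easy to see when the recurrence is written as a loop. A small Python sketch (the function name `fibonacci` is chosen here for illustration): every iteration needs the two values produced by the iterations before it, so the iterations cannot be handed to different processors.

```python
def fibonacci(count):
    """Return the first `count` Fibonacci numbers using F(k+2) = F(k) + F(k+1)."""
    sequence = []
    a, b = 1, 1
    for _ in range(count):
        sequence.append(a)
        a, b = b, a + b    # the next term needs both previous terms
    return sequence


print(fibonacci(8))
```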

• Decomposed problems typically need to communicate:
• Partial results need to be combined
• Changes to neighboring data have effects on a task's data
• Some problems don't need communication:
• “Embarrassingly” parallel problems

Cost of Communication
• Communicating data takes time
• Inter-task comm. has overhead
• Often synchronization is necessary
• Communication is much more “expensive” than computation
• Communicating data needs to save a lot of computation
before it pays off
• Infiniband needs < 10 µs to set up communication
• 2.4GHz AMD Opteron CPU needs ~0.4ns to perform one
floating point operation (Flop)
• 25,000 floating point operations per communication setup!
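
The figure of 25,000 follows from dividing the setup time by the time per floating point operation, taking the setup time as about 10 µs (the value for which the numbers work out) and 0.4 ns per operation:

```latex
\[
\frac{t_\text{setup}}{t_\text{flop}}
  = \frac{10\,\mu\text{s}}{0.4\,\text{ns}}
  = \frac{10 \times 10^{-6}\,\text{s}}{0.4 \times 10^{-9}\,\text{s}}
  = 25{,}000
\]
```

Here $t_\text{setup}$ and $t_\text{flop}$ are names introduced only for this calculation.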

Latency - Bandwidth
• Latency: the amount of time for the first bit of
data to arrive at the other end
• Bandwidth: how much data per time unit fits through the connection

Cost of Communication
• Formula for the time needed to transmit data

$\text{cost} = L + \frac{N}{B}$
L = Latency [s]
N = number of bytes [byte]
B = Bandwidth [byte/s]
cost [s]
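
A short Python sketch of the cost formula, latency plus bytes divided by bandwidth. The function name `transfer_time` and the example numbers (60 microseconds latency, 100 MB/s bandwidth) are assumptions chosen for illustration, not measurements.

```python
def transfer_time(num_bytes, latency_s, bandwidth_bytes_per_s):
    """Time in seconds to transmit num_bytes: cost = L + N / B."""
    return latency_s + num_bytes / bandwidth_bytes_per_s


# Illustrative values only (assumed, not measured).
LATENCY = 60e-6      # seconds
BANDWIDTH = 100e6    # bytes per second

small = transfer_time(1_000, LATENCY, BANDWIDTH)        # 1 kB message
large = transfer_time(100_000_000, LATENCY, BANDWIDTH)  # 100 MB message
print(f"1 kB:   {small * 1e6:.1f} us")
print(f"100 MB: {large:.4f} s")
```

For the small message most of the time is latency; for the large one the latency is negligible and the bandwidth term dominates. This is why sending few large messages is usually cheaper than sending many small ones.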

Visibility of Communication
• With MPI, communication is explicit and very visible
• “Latency Hiding”:
• Communicate and at the same time do some other work
• Implementation via parallel threads or non-blocking MPI
communication functions
• Makes programs faster but more complex
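
Latency hiding with a parallel thread might look like the following Python sketch. It is a simulation under assumptions: `communicate` only sleeps for 0.2 s to imitate a slow transfer, and all names are hypothetical. With MPI the same pattern would use non-blocking communication functions instead of a thread.

```python
import threading
import time


def communicate(result):
    # Stand-in for a slow data transfer (assumed to take about 0.2 s).
    time.sleep(0.2)
    result["message"] = "neighbor data"


def compute():
    # Work that does not depend on the data being transferred.
    return sum(i * i for i in range(100_000))


result = {}
start = time.perf_counter()
comm = threading.Thread(target=communicate, args=(result,))
comm.start()              # begin the transfer ...
local = compute()         # ... and compute while it is in flight
comm.join()               # wait for the transfer before using its data
elapsed = time.perf_counter() - start
print(local, result["message"], f"{elapsed:.2f} s")
```

The extra complexity is visible even in this small case: the program must remember to wait (`join`) before it touches the transferred data.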

Scope of Communication
• Knowing which tasks must communicate with each
other is critical during the design stage of a parallel
program
• Point-to-Point: involves two tasks with one task acting as
the sender/producer of data, and the other acting as the
receiver/consumer
• Collective: involves data sharing between more than
two tasks, which are often specified as being members
in a common group, or collective

Communication Hardware
Architecture      Comment                    Bandwidth (sustained one-way,  Latency (short
                                             large messages)                messages)
Myrinet           Proprietary but commodity  ~1.2 GB/s                      ~3 µs
Infiniband        Vendor indep. standard     (4x HCAs)                      ~10 µs
Quadrics (QsNet)  Expensive, proprietary     ~900 MB/s                      ~2 µs
Gigabit Ethernet  commodity                  ~100 MB/s                      ~60 µs
Custom: SGI, IBM, Cray, Sun, Compaq, ...

Communication Hardware
                   InfiniBand            Proprietary                  GigE    10GigE
                   Mellanox  (name       Myrinet  Myrinet  Quadrics           Chelsio
                   MHGA28    missing)    F        10G      QM500              T210-CX
Latency (µs)       2.25      1.3         2.6      2.0      1.6       30-100   9.6
Peak Bandwidth     1502      954         493      1200     910       125      860
(MB/s)
N/2 (Bytes)        512       385         2000     2000     1000      8000     100,000
BW (MB/s)          750       470         250      600      450       60       430
CPU overhead (%)   ~5        ~40         ~10      ~10      ~50       >50      ~50
*Mellanox Technology testing; Ohio State University; PathScale, Myricom, Quadrics, and Chelsio websites
N/2: Message size to achieve half the peak bandwidth

Synchronization
• “Handshaking” between tasks that are sharing data
• Types of synchronization:
• Barrier
• Usually implies that all tasks are involved
• Each task performs its work until it reaches the barrier.
It then stops, or "blocks"
• When the last task reaches the barrier, all tasks are
synchronized
• Used in MPI
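
A barrier can be tried out with Python's standard `threading.Barrier`. This is a sketch using threads in one process rather than MPI tasks; `NUM_TASKS`, `task` and `log` are names made up for the example.

```python
import threading

NUM_TASKS = 4
barrier = threading.Barrier(NUM_TASKS)
log = []
log_lock = threading.Lock()


def task(task_id):
    with log_lock:
        log.append(("before", task_id))   # work done before the barrier
    barrier.wait()                        # block until all tasks have arrived
    with log_lock:
        log.append(("after", task_id))    # runs only once every task arrived


threads = [threading.Thread(target=task, args=(i,)) for i in range(NUM_TASKS)]
for t in threads:
    t.start()
for t in threads:
    t.join()

phases = [phase for phase, _ in log]
print(phases)
```

Whatever order the threads run in, every "before" entry is logged ahead of every "after" entry, because no task passes the barrier until the last one has reached it.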

• More types:
• Lock/Semaphore
• Can involve any number of tasks
• Typically used to serialize (protect) access to global data or a
section of code. Only one task at a time may use (own) the
lock / semaphore / flag
• The first task to acquire the lock "sets" it. This task can
then safely (serially) access the protected data or code.
• Other tasks can attempt to acquire the lock but must wait
until the task that owns the lock releases it.
• Can be blocking or non-blocking
• Used in threads and shared memory
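
A lock protecting global data can be sketched with Python's standard `threading.Lock`. The shared `counter` and the helper `add_many` are assumptions for illustration.

```python
import threading

counter = 0
counter_lock = threading.Lock()


def add_many(times):
    global counter
    for _ in range(times):
        with counter_lock:      # acquire the lock; other threads wait here
            counter += 1        # protected access to the shared data
        # the lock is released when the with-block ends


threads = [threading.Thread(target=add_many, args=(10_000,)) for _ in range(4)]
for t in threads:
    t.start()
for t in threads:
    t.join()
print(counter)
```

The `with` statement is the blocking form. A non-blocking attempt is `counter_lock.acquire(blocking=False)`, which returns `False` immediately if another thread owns the lock.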

• More types:
• Synchronous Communication Operations
• Involves only those tasks executing a communication operation
• When a task performs a communication operation, some form
of coordination is required with the other task(s)
participating in the communication. For example, before a
task can perform a send operation, it must first receive an
acknowledgment from the receiving task that it is OK to
send

Granularity
• Qualitative measure of the
computation / communication ratio
• Typically, periods of computation are separated from
periods of communication by synchronization events
Fine-Grain Parallelism: small amount of computation between communication
Coarse-Grain Parallelism: large amount of computation between communication

• Fine-Grain
• Low computation to communication ratio
• Facilitates load balancing
• High communication overhead; less opportunity
for performance increase

• Coarse-Grain
• High computation to communication ratio
• More opportunity for performance increase
• Harder to load balance efficiently

Data In- and Output
• Parallel computers with thousands of nodes can
handle huge amounts of data
• It is hard to get this data in and out of the nodes
• parallel-I/O systems are still fairly new and not available
for all platforms
• I/O over the network (like NFS) causes severe bottlenecks
• Help can be found with
• Parallel File Systems: Lustre, PVFS2, GPFS (IBM)
• MPI-2 provides support for parallel file systems
• Rule #1: Reduce overall I/O as much as possible!

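• Speedup $S_p = \frac{T_1}{T_p}$, where $T_1$ is the running time of the
serial algorithm and $T_p$ the running time on $p$ processors
• Efficiency $= \frac{S_p}{p}$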

• Value between zero and one

• estimate how well-utilized the processors are in solving the
problem, compared to how much effort is wasted in
communication and synchronization
• linear speedup and algorithms running on a single processor
have an efficiency of 1
• many difficult-to-parallelize algorithms have efficiency
such as 1/log p that approaches zero as the number of
processors increases
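
A small Python sketch of the two measures, assuming the usual definitions: speedup is serial running time divided by parallel running time, and efficiency is speedup divided by the number of processors. The timings in the example are invented.

```python
def speedup(t_serial, t_parallel):
    """Speedup: serial running time divided by parallel running time."""
    return t_serial / t_parallel


def efficiency(t_serial, t_parallel, num_processors):
    """Efficiency: speedup divided by the number of processors."""
    return speedup(t_serial, t_parallel) / num_processors


# Hypothetical timings: 120 s on one processor, 20 s on eight.
s = speedup(120.0, 20.0)
e = efficiency(120.0, 20.0, 8)
print(s, e)
```

In this invented case the program runs 6 times faster on 8 processors, so a quarter of the processor time is lost to communication, synchronization or idle waiting.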

Limits and Costs
• Besides theoretical limits and hardware limits,
there are practical limits to parallel computing
• Amdahl's Law states that potential program
speedup is defined by the fraction of code (P) that
can be parallelized: $\text{speedup} = \frac{1}{1 - P}$
• If none of the code can be parallelized,
P = 0 and the speedup = 1 (no speedup).
If all of the code is parallelized, P = 1 and the
speedup is infinite (in theory).
• If 50% of the code can be parallelized, maximum
speedup = 2, meaning the code will run twice as fast.

Limits and Costs
• Introducing the number of processors performing
the parallel fraction of work, Amdahl's Law can be
reformulated as

$\text{speedup} = \frac{1}{\frac{P}{N} + S}$
N = number of processors,
P = parallel fraction and
S = 1 - P = serial fraction

Speedup
N      P=0.50  P=0.90  P=0.99  P=1.0
10     1.82    5.26    9.17    10
100    1.98    9.17    50.25   100
1000   1.99    9.91    90.99   1000
10000  1.99    9.99    99.02   10000
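
The table can be recomputed with a few lines of Python. This is a sketch of Amdahl's Law in the form speedup = 1 / (P / N + S) with S = 1 - P; the function name `amdahl_speedup` is chosen here for illustration.

```python
def amdahl_speedup(parallel_fraction, num_processors):
    """Speedup = 1 / (P / N + S) with S = 1 - P."""
    serial_fraction = 1.0 - parallel_fraction
    return 1.0 / (parallel_fraction / num_processors + serial_fraction)


for n in (10, 100, 1000, 10000):
    row = [amdahl_speedup(p, n) for p in (0.50, 0.90, 0.99, 1.0)]
    print(f"{n:6d}", " ".join(f"{s:10.2f}" for s in row))
```

The output is rounded to two decimals, so a value such as 1.998 (P = 0.50, N = 1000) prints as 2.00 where the table lists 1.99. Either way the point stands: with half the code serial, no number of processors gets past a speedup of 2.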

Typical Parallel Applications
• Applications that are well suited for parallel
computers are
• Weather and ocean patterns
• Finite Element Method (FEM; crash tests for cars)
• Fluid dynamics, aerodynamics
• Simulation of electro-magnetic problems

• Overview of parallel computing concepts
• Hardware
• Software
• Programming
• Problems of parallel computing
• Communication is expensive (latency)
• I/O is expensive
• Techniques to work around these problems
• Problem decomposition (communicate larger data)
• Parallel File Systems plus supporting hardware
• $$$$ (faster communication fabric)

• Most of this talk is taken from
• Theory book Introduction to Parallel Algorithms and
Architectures: Arrays, Trees, Hypercubes by F.
Thomson Leighton
• Hardware book Computer Architecture: A
Quantitative Approach (3rd edition) by John L.
Hennessy, David A. Patterson, David Goldberg

