
Concepts of Parallel Computing
Alf Wachsmann
Stanford Linear Accelerator Center (SLAC)
alfw@slac.stanford.edu

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 1


Why do it in parallel?
• Why is parallel computing a good idea?
• 1 worker needs 3 days to dig a ditch.
How long do 3 workers need?

• Parallel Computing is (in the most general sense) the simultaneous use of multiple compute resources to solve a computational problem

• What about
• 1 tree takes 30 years to grow big.
How long do 3 trees need?

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 2


Parallel Addition
• Diagram in space and time
• Abstraction from communication (the hard part!)
[Diagram: pairwise (tree) addition of the numbers 1 to 16 on 8 processors. Step 1: 1+2, 3+4, ..., 15+16 yield 3, 7, 11, 15, 19, 23, 27, 31; step 2: 10, 26, 42, 58; step 3: 36, 100; step 4: 136. Horizontal axis: processors 1-8; vertical axis: wall clock time. A code sketch follows below.]
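To make the diagram concrete, here is a minimal C sketch (assuming an OpenMP-capable compiler, e.g. gcc -fopenmp); the reduction(+:sum) clause lets the runtime combine per-thread partial sums in a tree much like the one shown above.

    #include <stdio.h>

    int main(void)
    {
        long sum = 0;

        /* Each thread adds a share of 1..16 into its private copy of sum;
           OpenMP then combines the partial sums (a tree-like reduction). */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 1; i <= 16; i++)
            sum += i;

        printf("sum = %ld\n", sum);   /* prints 136, as in the diagram */
        return 0;
    }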

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 3


Why do it in parallel?
• Algorithmic reasons:
• Save time (wall clock time) – does NOT save work!
• Solve larger problems (more memory)
• Systemic reasons:
• Transmission speed (speed of light)
• Limits to miniaturization
• Economic limits

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 4


Maximum Gain
• Gain by doing it in parallel is
speedup = (running time for best serial algorithm) / (running time for parallel algorithm)

Ideally: use P processors and get P-fold speedup.

Linear speedup in P is the best we can hope for!

There are cases of super-linear speedup.

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 5


Sequential Computer
• Architecture of serial computers:

[Diagram: a CPU connected to Memory, running a fetch-execute cycle.]

Von Neumann Architecture:
• Memory is used to store both program and data
• The CPU gets instructions and/or data from memory
• Decodes the instructions
• Executes them sequentially

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 6


Parallel Computers
• Widely used classification for parallel computers:
Flynn's Taxonomy (1966)

• SISD – Single Instruction, Single Data
• SIMD – Single Instruction, Multiple Data
• MISD – Multiple Instruction, Single Data
• MIMD – Multiple Instruction, Multiple Data

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 7


Memory Architectures
• Another important classification scheme is according to the parallel computer's memory architecture
• Shared memory
• Uniform memory access
• Non-uniform memory access
• Distributed memory
• Hybrid distributed-shared memory solutions

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 8


Shared Memory
• Shared Memory
• Multiple processors can operate independently but share
the same memory resources
• Changes in a memory location effected by one processor
are visible to all other processors (global address space)

[Diagram: four CPUs, all connected to a single shared Memory.]

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 9


Uniform Memory Access
• Most commonly represented today by Symmetric
Multiprocessor (SMP) machines
• Identical processors
• Equal access and access times to memory
• Sometimes called CC-UMA - Cache Coherent UMA.
Cache Coherent means if one processor updates a
location in shared memory, all the other processors
know about the update. Cache coherency is
accomplished at the hardware level.

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 10


Non-Uniform Memory Access
• Often made by physically linking two or more SMPs
• One SMP can directly access memory of another SMP
• Not all processors have equal access time to all
memories
• Memory access across link is slower
• If cache coherency is maintained, then may also be
called CC-NUMA - Cache Coherent NUMA

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 11


Distributed Memory
• Processors have their own local memory. Memory
addresses in one processor do not map to another
processor, so there is no concept of global address
space across all processors
• Distributed memory systems require a
communication network to connect inter-processor
memory
• The network "fabric" used for data transfer varies widely; it can be as simple as Ethernet
[Diagram: four nodes, each with its own CPU and local Memory, connected by a Network.]

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 12


Comparison
• Shared Memory
  • Advantages
    • Global address space
    • Data sharing between tasks is both fast and uniform
  • Disadvantages
    • Lack of scalability between memory and CPUs
    • Programmer responsible for synchronization
    • Expensive
• Distributed Memory
  • Advantages
    • Memory is scalable with the number of processors
    • Each processor can rapidly access its own memory
  • Disadvantages
    • NUMA access times
    • Programmer responsible for many details
    • Difficult to map existing data structures

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 13


Constellations
• Hybrid Distributed-Shared Memory
• Used in most of today's parallel computers
• Cache-coherent SMP nodes
• Distributed memory is networking of multiple SMP nodes

[Diagram: four SMP nodes, each with multiple CPUs sharing a local Memory, connected by a Network.]

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 14


Example Machines
Comparison of Shared and Distributed Memory Architectures
CC-UMA
• Examples: SMPs, Sun Fire Exxx/Vxxx, DEC/Compaq, SGI Challenge, IBM POWER3
• Communications: MPI, Threads, OpenMP, shmem
• Scalability: to 10s of processors
• Drawbacks: limited memory bandwidth
• Software availability: declining

CC-NUMA
• Examples: SGI Origin/Altix, Sequent, HP Exemplar, DEC/Compaq, IBM POWER4
• Communications: MPI, Threads, OpenMP, shmem
• Scalability: to 100s of processors
• Drawbacks: "new architecture", point-to-point communication
• Software availability: stable

Distributed
• Examples: Cray T3E, Maspar, IBM SP, IBM Blue Gene/L, Beowulf clusters
• Communications: MPI
• Scalability: to 1000s of processors
• Drawbacks: system administration, programming is hard to develop and maintain
• Software availability: still rising

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 15


Parallel Programming Models
• Abstraction above hardware and memory
architecture
• Several programming models in use:
• Shared Memory (“parallel computing”)
• Threads
• Message Passing (“distributed computing”)
• Data Parallel
• Hybrid approaches
• All models exist for all hardware/memory
architectures

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 16


Shared Memory Model
• Tasks share a common address space, which they
read and write asynchronously
• Access control to shared memory via locks or
semaphores
• No notion of “ownership” of data – no need to
explicitly communicate data between tasks
• Implementations
• shared memory machines: compiler
• distributed memory machines: simulations

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 17


Threads Model
• A single process has multiple, concurrent execution
paths
• Most commonly used on shared mem. machines and
in operating systems
[Diagram: a single program prg.exe runs along a time axis and spawns threads T1-T4 while executing the following code:]

    call sub1
    call sub2
    do i = 1, n
        A(i) = fnct(i^3)
        B(i) = A(i) * p
    end do
    call sub3
    call sub4
    ...

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 18


Threads Model
• Implementations
• POSIX Thread Library
• C language only
• Offered for most hardware
• Very explicit parallelism
• Requires significant programmer attention to detail
• OpenMP
• Based on compiler directives; can use sequential code
• Fortran, C, C++
• portable/multi-platform
• Can be very easy and simple to use
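As a concrete illustration of the "very explicit" POSIX Threads style listed above, here is a minimal C sketch (the thread count and worker function are illustrative only):

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    /* Work function executed by every thread. */
    static void *worker(void *arg)
    {
        long id = (long)arg;
        printf("thread %ld doing its share of the work\n", id);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];

        for (long i = 0; i < NTHREADS; i++)      /* explicit creation */
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (long i = 0; i < NTHREADS; i++)      /* explicit join */
            pthread_join(t[i], NULL);
        return 0;
    }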

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 19


Message Passing Model
• Tasks exchange data through communications by
sending and receiving messages
• usually requires cooperative operations to be
performed by each process: a send operation must
have a matching receive operation

[Diagram: task 0 on Machine A calls send(data), task 1 on Machine B calls the matching receive(data), and the data travels over the Network. A code sketch follows below.]
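A minimal C/MPI sketch of the matching send/receive pair shown in the figure (assuming at least two ranks, e.g. mpirun -np 2):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, data = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {                 /* task 0: sender */
            data = 42;
            MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {          /* task 1: matching receive */
            MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("task 1 received %d\n", data);
        }

        MPI_Finalize();
        return 0;
    }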

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 20


Message Passing Model
• Implementations
• Parallel Virtual Machine (PVM)
• Not much in use any more
• Message Passing Interface (MPI)
• Part 1 released in 1994
• Part 2 (MPI-2) released in 1996
• http://www-unix.mcs.anl.gov/mpi/
• Now de-facto standard
• Fortran, C, C++
• Available on virtually all machines
• OpenMPI, MPICH, LAM/MPI, many vendor specific versions
• On shared memory machines, MPI implementations usually
don't use a network for task communications

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 21


Data Parallel Model
• A set of tasks work collectively on the same data
structure
• Each task works on a different partition of the
same data structure

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 22


Data Parallel Model
• Implementations
• Fortran 90
• ISO/ANSI extension of Fortran 77
• Additions to program structure and commands
• Variable additions – methods and arguments
• High Performance Fortran (HPF)
• Contains everything in F90
• Directives to tell compiler how to distribute data added
• Data parallel constructs added (now part of F95)
• On distr. memory machines: translated into MPI code

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 23


Hybrid Programming Models
• Two or more of the previous models are used in the
same program
• Common examples:
• POSIX Threads and Message Passing (MPI)
• OpenMP and MPI
• ClusterOpenMP (Intel)
• Works well on network of SMP machines
• Also used:
• Data Parallel and MPI

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 24


Designing Parallel Programs
• No real parallelizing compilers
• Compiler “knows” how to parallelize certain constructs (e.g.
loops)
• Compiler uses “directives” from programmer
• Not simply a matter of taking a sequential algorithm and "making it parallel". Sometimes a completely different algorithmic approach is necessary
• A very time-consuming and labor-intensive task

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 25


Parallelization Techniques
• Domain Decomposition
• Data is partitioned
• Each task works on different part of data
• Three different ways to partition data
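One common way, block decomposition, can be sketched in a few lines of C (array size and task count are illustrative only): each task computes the index range it owns and works only on that slice.

    #include <stdio.h>

    /* Block decomposition of N elements over ntasks tasks: task 'rank'
       owns indices [lo, hi); the remainder is spread over the first
       (N % ntasks) tasks. */
    static void block_range(int N, int ntasks, int rank, int *lo, int *hi)
    {
        int base = N / ntasks, rem = N % ntasks;
        *lo = rank * base + (rank < rem ? rank : rem);
        *hi = *lo + base + (rank < rem ? 1 : 0);
    }

    int main(void)
    {
        int lo, hi;
        for (int rank = 0; rank < 4; rank++) {   /* 4 tasks, 10 elements */
            block_range(10, 4, rank, &lo, &hi);
            printf("task %d owns indices [%d, %d)\n", rank, lo, hi);
        }
        return 0;
    }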

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 26


Parallelization Techniques
• Functional Decomposition
• Problem is partitioned into set of independent tasks

Both types of decomposition can be and often are combined

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 27


A little Theory
• Some problems can be parallelized very well:
In complexity theory, the class NC ("Nick's
Class") is the set of decision problems decidable
in poly-logarithmic time on a parallel computer
with a polynomial number of processors. In other
words, a problem is in NC if there are constants c
and k such that it can be solved in time O((log n)^c) using O(n^k) parallel processors.

Source: http://en2.wikipedia.org/wiki/Class_NC

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 28


A little Theory
• Some problems can't be parallelized at all!
• Example: Calculating the Fibonacci Sequence
(1,1,2,3,5,8,13,21,...) by using the formula

F 1=1
F 2=1
F k2=F k F k 1

• The calculation entails dependent calculations: the value for k + 2 uses the values for both k + 1 and k. These three terms cannot be calculated independently and therefore cannot be parallelized.

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 29


Communication
• Decomposed problems typically need to communicate:
• Partial results need to be combined
• Changes to neighboring data have effects on a task's data
• Some problems don't need communication:
• “Embarrassingly” parallel problems

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 30


Cost of Communication
• Communicating data takes time
• Inter-task comm. has overhead
• Often synchronization is necessary
• Communication is much more “expensive” than
computation
• Communicating data needs to save a lot of computation
before it pays off
• Infiniband needs < 10µs to set up communication
• 2.4GHz AMD Opteron CPU needs ~0.4ns to perform one
floating point operation (Flop)
• 25,000 floating point operations per communication setup!

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 31


Latency - Bandwidth
• Latency: the amount of time for the first bit of
data to arrive at the other end
• Bandwidth: how much data per time unit fits
through

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 32


Cost of Communication
• Formula for the time needed to transmit data

cost = L + N / B

L = Latency [s]
N = number of bytes [byte]
B = Bandwidth [byte/s]
cost [s]
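A small C sketch of the formula with assumed, purely illustrative numbers (10 µs latency, 1 GB/s bandwidth):

    /* cost = L + N/B : time to transmit N bytes */
    double transfer_cost(double latency_s, double bytes, double bandwidth_Bps)
    {
        return latency_s + bytes / bandwidth_Bps;
    }

    /* e.g. transfer_cost(10e-6, 1e6, 1e9) is about 1.01 ms: for a 1 MB
       message the bandwidth term dominates; for a few bytes the latency
       term dominates. */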

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 33


Visibility of Communication
• With MPI, communication is explicit and very visible
• “Latency Hiding”:
• Communicate and, at the same time, do some other computations
• Implementation via parallel threads or non-blocking MPI
communication functions
• Makes programs faster but more complex
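A hedged C/MPI sketch of latency hiding with non-blocking calls; the buffer names, counts, neighbor ranks and compute_something_independent() are placeholders, not part of any particular program:

    /* Start the receive and send, compute while the data is in flight,
       then wait for completion before using recvbuf. */
    MPI_Request reqs[2];
    MPI_Irecv(recvbuf, count, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, count, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    compute_something_independent();     /* work that does not need recvbuf */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    /* recvbuf is now valid; sendbuf may be reused */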

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 34


Scope of Communication
• Knowing which tasks must communicate with each
other is critical during the design stage of a parallel
program
• Point-to-Point: involves two tasks with one task acting as
the sender/producer of data, and the other acting as the
receiver/consumer
• Collective: involves data
sharing between more than
two tasks, which are often
specified as being
members in a common
group, or collective

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 35


Communication Hardware
Architecture: Myrinet (http://www.myricom.com/)
• Comment: proprietary but commodity
• Bandwidth (sustained one-way, large messages): ~1.2 GB/s
• Latency (short messages): ~3 µs

Architecture: Infiniband (http://www.infinibandta.org/)
• Comment: vendor-independent standard
• Bandwidth: ~900 MB/s (4x HCAs)
• Latency: ~10 µs

Architecture: Quadrics (QsNet) (http://www.quadrics.com/)
• Comment: expensive, proprietary
• Bandwidth: ~900 MB/s
• Latency: ~2 µs

Architecture: Gigabit Ethernet
• Comment: commodity
• Bandwidth: ~100 MB/s
• Latency: ~60 µs

Custom: SGI, IBM, Cray, Sun, Compaq, ...

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 36


Communication Hardware
    Interconnect                        Latency   Peak BW   N/2       BW       CPU over-
                                        (µs)      (MB/s)    (Bytes)   (MB/s)   head (%)
    InfiniBand: Mellanox MHGA28         2.25      1502      512       750      ~5
    InfiniBand: QLogic InfiniPath HT    1.3       954       385       470      ~40
    Proprietary: Myrinet F              2.6       493       2000      250      ~10
    Proprietary: Myrinet 10G            2.0       1200      2000      600      ~10
    Proprietary: Quadrics QM500         1.6       910       1000      450      ~50
    GigE                                30-100    125       8000      60       >50
    10GigE: Chelsio T210-CX             9.6       860       100,000   430      ~50

*Mellanox Technology testing; Ohio State University; PathScale, Myricom, Quadrics, and Chelsio websites
N/2: message size needed to achieve half the peak bandwidth

http://www.mellanox.com/applications/performance_benchma
Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 37
Synchronization
• “handshaking” between tasks that are sharing data
• Types of synchronization:
• Barrier
• Usually implies that all tasks are involved
• Each task performs its work until it reaches the barrier.
It then stops, or "blocks"
• When the last task reaches the barrier, all tasks are
synchronized
• Used in MPI
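A minimal MPI sketch of a barrier (do_local_work() stands in for a task's own computation):

    /* Every rank must reach the barrier before any rank continues. */
    do_local_work();                      /* placeholder for this task's work */
    MPI_Barrier(MPI_COMM_WORLD);          /* block until all ranks arrive     */
    /* at this point all tasks are synchronized */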

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 38


Synchronization
• More types:
• Lock/Semaphore
• Can involve any number of tasks
• Typically used to serialize (protect) access to global data or a
section of code. Only one task at a time may use (own) the
lock / semaphore / flag
• The first task to acquire the lock "sets" it. This task can
then safely (serially) access the protected data or code.
• Other tasks can attempt to acquire the lock but must wait
until the task that owns the lock releases it.
• Can be blocking or non-blocking
• Used in threads and shared memory
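A small POSIX Threads sketch of the lock idea (the shared counter is illustrative only); only the task holding the mutex may touch the protected data.

    #include <pthread.h>

    static long counter = 0;                         /* protected shared data */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    void *increment(void *arg)
    {
        pthread_mutex_lock(&lock);     /* first task to arrive "sets" the lock */
        counter++;                     /* serialized access to shared data     */
        pthread_mutex_unlock(&lock);   /* waiting tasks may now acquire it     */
        return NULL;
    }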

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 39


Synchronization
• More types:
• Synchronous Communication Operations
• Involves only those tasks executing a communication
operation
• When a task performs a communication operation, some form
of coordination is required with the other task(s)
participating in the communication. For example, before a
task can perform a send operation, it must first receive an
acknowledgment from the receiving task that it is OK to
send.

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 40


Granularity
• Qualitative measure of
Computation / Communication Ratio
• Typically, periods of computation are separated from periods of communication by synchronization events
• Fine-grain parallelism: small amount of computation between communication events
• Coarse-grain parallelism: large amount of computation between communication events

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 41


Granularity
• Fine-Grain
  • Low computation to communication ratio
  • Facilitates load balancing
  • High communication overhead; less opportunity for performance enhancement
• Coarse-Grain
  • High computation to communication ratio
  • More opportunity for performance increase
  • Harder to load balance efficiently

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 42


Data In- and Output
• Parallel computers with thousands of nodes can
handle huge amounts of data
• It is hard to get this data in and out of the nodes
• parallel-I/O systems are still fairly new and not available
for all platforms
• I/O over the network (like NFS) causes severe bottlenecks
• Help can be found with
• Parallel File Systems: Lustre, PVFS2, GPFS (IBM)
• MPI-2 provides support for parallel file systems
• Rule #1: Reduce overall I/O as much as possible!
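A hedged sketch of MPI-2 parallel I/O in C (the file name and the variables rank, local_n and local_data are placeholders): each rank writes its own block of one shared file at a rank-dependent offset.

    MPI_File fh;
    MPI_Offset offset = (MPI_Offset)rank * local_n * sizeof(double);

    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_write_at(fh, offset, local_data, local_n, MPI_DOUBLE,
                      MPI_STATUS_IGNORE);
    MPI_File_close(&fh);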

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 43


Efficiency
• Speedup Sp = Ts / Tp
• Efficiency = Sp / p

• Value between zero and one


• estimate how well-utilized the processors are in solving the
problem, compared to how much effort is wasted in
communication and synchronization
• linear speedup and algorithms running on a single processor
have an efficiency of 1
• many difficult-to-parallelize algorithms have efficiency
such as 1/log p that approaches zero as the number of
processors increases

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 44


Limits and Costs
• Besides theoretical limits and hardware limits,
there are practical limits to parallel computing
• Amdahl's Law states that potential program
speedup is defined by the fraction of code (P) that
can be parallelized:   speedup = 1 / (1 - P)
• If none of the code can be parallelized,
P = 0 and the speedup = 1 (no speedup).
If all of the code is parallelized, P = 1 and the
speedup is infinite (in theory).
• If 50% of the code can be parallelized, maximum
speedup = 2, meaning the code will run twice as fast.

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 45


Limits and Costs
• Introducing the number of processors performing
the parallel fraction of work, Amdahl's Law can be
reformulated as

speedup = 1 / (P/N + S)

where N = number of processors, P = parallel fraction, and S = 1 - P = serial fraction.

Speedup (figure source: http://upload.wikimedia.org/wikipedia/en/7/7a/Amdahl-law.jpg):

    N          P=0.50    P=0.90    P=0.99    P=1.0
    10           1.82      5.26      9.17       10
    100          1.98      9.17     50.25      100
    1000         1.99      9.91     90.99     1000
    10000        1.99      9.99     99.02    10000
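The table can be reproduced from the formula with a few lines of C (values agree up to rounding):

    #include <stdio.h>

    /* Amdahl's law: speedup = 1 / (P/N + (1 - P)) */
    static double amdahl(double P, double N)
    {
        return 1.0 / (P / N + (1.0 - P));
    }

    int main(void)
    {
        double P[] = {0.50, 0.90, 0.99};
        int    N[] = {10, 100, 1000, 10000};

        for (int i = 0; i < 4; i++) {
            printf("N=%5d:", N[i]);
            for (int j = 0; j < 3; j++)
                printf("  P=%.2f -> %6.2f", P[j], amdahl(P[j], N[i]));
            printf("\n");
        }
        return 0;
    }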

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 46


Typical Parallel Applications
• Applications that are well suited for parallel
computers are
• Weather and ocean patterns
• Finite Element Method (FEM; crash tests for cars)
• Fluid dynamics, aerodynamics
• Simulation of electro-magnetic problems

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 47


Summary
• Overview of parallel computing concepts
• Hardware
• Software
• Programming
• Problems of parallel computing
• Communication is expensive (latency)
• I/O is expensive
• Techniques to work around these problems
• Problem decomposition (communicate larger data)
• Parallel File Systems plus supporting hardware
• $$$$ (faster communication fabric)

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 48


Acknowledgment/References
• Most of this talk is taken from
http://www.llnl.gov/computing/tutorials/parallel_comp/
• Theory book Introduction to Parallel Algorithms and
Architectures: Arrays, Trees, Hypercubes by F.
Thomson Leighton
• Hardware book Computer Architecture: A
Quantitative Approach (3rd edition) by John L.
Hennessy, David A. Patterson, David Goldberg
• http://www.top500.org/

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 49
