
Concepts of Parallel Computing
Alf Wachsmann
Stanford Linear Accelerator Center (SLAC)
alfw@slac.stanford.edu

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 1


Why do it in parallel?
• Why is parallel computing a good idea?
• 1 worker needs 3 days to dig a ditch.
How long do 3 workers need?

• Parallel Computing is (in the most general sense) the simultaneous use of multiple compute resources to solve a computational problem

• What about
• 1 tree takes 30 years to grow big.
How long do 3 trees need?

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 2


Parallel Addition
• Diagram in space and time
• Abstraction from communication (the hard part!)
[Diagram: pairwise (tree) addition of the numbers 1 to 16 on 8 processors. Step 1: 1+2, 3+4, ..., 15+16 yield 3, 7, 11, 15, 19, 23, 27, 31; step 2: 10, 26, 42, 58; step 3: 36, 100; step 4: 136. Horizontal axis: processors 1-8; vertical axis: wall clock time. A code sketch follows below.]
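To make the diagram concrete, here is a minimal C sketch (assuming an OpenMP-capable compiler, e.g. gcc -fopenmp); the reduction(+:sum) clause lets the runtime combine per-thread partial sums in a tree much like the one shown above.

    #include <stdio.h>

    int main(void)
    {
        long sum = 0;

        /* Each thread adds a share of 1..16 into its private copy of sum;
           OpenMP then combines the partial sums (a tree-like reduction). */
        #pragma omp parallel for reduction(+:sum)
        for (int i = 1; i <= 16; i++)
            sum += i;

        printf("sum = %ld\n", sum);   /* prints 136, as in the diagram */
        return 0;
    }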

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 3


Why do it in parallel?
• Algorithmic reasons:
• Save time (wall clock time) – does NOT save work!
• Solve larger problems (more memory)
• Systemic reasons:
• Transmission speed (speed of light)
• Limits to miniaturization
• Economic limits

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 4


Maximum Gain
• Gain by doing it in parallel is
speedup = (running time for best serial algorithm) / (running time for parallel algorithm)

Ideally: use P processors and get P-fold speedup.

Linear speedup in P is the best we can hope for!

There are cases of super-linear speedup.

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 5


Sequential Computer
• Architecture of serial computers:

[Diagram: a CPU connected to Memory, running a fetch-execute cycle.]

Von Neumann Architecture:
• Memory is used to store both program and data
• The CPU gets instructions and/or data from memory
• Decodes the instructions
• Executes them sequentially

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 6


Parallel Computers
• Widely used classification for parallel computers:
Flynn's Taxonomy (1966)

• SISD – Single Instruction, Single Data
• SIMD – Single Instruction, Multiple Data
• MISD – Multiple Instruction, Single Data
• MIMD – Multiple Instruction, Multiple Data

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 7


Memory Architectures
• Another important classification scheme is according to the parallel computer's memory architecture
• Shared memory
• Uniform memory access
• Non-uniform memory access
• Distributed memory
• Hybrid distributed-shared memory solutions

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 8


Shared Memory
• Shared Memory
• Multiple processors can operate independently but share
the same memory resources
• Changes in a memory location effected by one processor
are visible to all other processors (global address space)

[Diagram: four CPUs, all connected to a single shared Memory.]

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 9


Uniform Memory Access
• Most commonly represented today by Symmetric
Multiprocessor (SMP) machines
• Identical processors
• Equal access and access times to memory
• Sometimes called CC-UMA - Cache Coherent UMA.
Cache Coherent means if one processor updates a
location in shared memory, all the other processors
know about the update. Cache coherency is
accomplished at the hardware level.

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 10


Non-Uniform Memory Access
• Often made by physically linking two or more SMPs
• One SMP can directly access memory of another SMP
• Not all processors have equal access time to all
memories
• Memory access across link is slower
• If cache coherency is maintained, then may also be
called CC-NUMA - Cache Coherent NUMA

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 11


Distributed Memory
• Processors have their own local memory. Memory
addresses in one processor do not map to another
processor, so there is no concept of global address
space across all processors
• Distributed memory systems require a
communication network to connect inter-processor
memory
• The network "fabric" used for data transfer varies widely; it can be as simple as Ethernet
[Diagram: four nodes, each with its own CPU and local Memory, connected by a Network.]

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 12


Comparison
• Shared Memory
  • Advantages
    • Global address space
    • Data sharing between tasks is both fast and uniform
  • Disadvantages
    • Lack of scalability between memory and CPUs
    • Programmer responsible for synchronization
    • Expensive
• Distributed Memory
  • Advantages
    • Memory is scalable with the number of processors
    • Each processor can rapidly access its own memory
  • Disadvantages
    • NUMA access times
    • Programmer responsible for many details
    • Difficult to map existing data structures

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 13


Constellations
• Hybrid Distributed-Shared Memory
• Used in most of today's parallel computers
• Cache-coherent SMP nodes
• Distributed memory is networking of multiple SMP nodes

[Diagram: four SMP nodes, each with multiple CPUs sharing a local Memory, connected by a Network.]

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 14


Example Machines
Comparison of Shared and Distributed Memory Architectures
CC-UMA
• Examples: SMPs, Sun Fire Exxx/Vxxx, DEC/Compaq, SGI Challenge, IBM POWER3
• Communications: MPI, Threads, OpenMP, shmem
• Scalability: to 10s of processors
• Drawbacks: limited memory bandwidth
• Software availability: declining

CC-NUMA
• Examples: SGI Origin/Altix, Sequent, HP Exemplar, DEC/Compaq, IBM POWER4
• Communications: MPI, Threads, OpenMP, shmem
• Scalability: to 100s of processors
• Drawbacks: "new architecture", point-to-point communication
• Software availability: stable

Distributed
• Examples: Cray T3E, Maspar, IBM SP, IBM Blue Gene/L, Beowulf clusters
• Communications: MPI
• Scalability: to 1000s of processors
• Drawbacks: system administration, programming is hard to develop and maintain
• Software availability: still rising

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 15


Parallel Programming Models
• Abstraction above hardware and memory
architecture
• Several programming models in use:
• Shared Memory (“parallel computing”)
• Threads
• Message Passing (“distributed computing”)
• Data Parallel
• Hybrid approaches
• All models exist for all hardware/memory
architectures

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 16


Shared Memory Model
• Tasks share a common address space, which they
read and write asynchronously
• Access control to shared memory via locks or
semaphores
• No notion of “ownership” of data – no need to
explicitly communicate data between tasks
• Implementations
• shared memory machines: compiler
• distributed memory machines: simulations

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 17


Threads Model
• A single process has multiple, concurrent execution
paths
• Most commonly used on shared mem. machines and
in operating systems
[Diagram: a single program prg.exe runs along a time axis and spawns threads T1-T4 while executing the following code:]

    call sub1
    call sub2
    do i = 1, n
        A(i) = fnct(i^3)
        B(i) = A(i) * p
    end do
    call sub3
    call sub4
    ...

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 18


Threads Model
• Implementations
• POSIX Thread Library
• C language only
• Offered for most hardware
• Very explicit parallelism
• Requires significant programmer attention to detail
• OpenMP
• Based on compiler directives; can use sequential code
• Fortran, C, C++
• portable/multi-platform
• Can be very easy and simple to use
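As a concrete illustration of the "very explicit" POSIX Threads style listed above, here is a minimal C sketch (the thread count and worker function are illustrative only):

    #include <pthread.h>
    #include <stdio.h>

    #define NTHREADS 4

    /* Work function executed by every thread. */
    static void *worker(void *arg)
    {
        long id = (long)arg;
        printf("thread %ld doing its share of the work\n", id);
        return NULL;
    }

    int main(void)
    {
        pthread_t t[NTHREADS];

        for (long i = 0; i < NTHREADS; i++)      /* explicit creation */
            pthread_create(&t[i], NULL, worker, (void *)i);
        for (long i = 0; i < NTHREADS; i++)      /* explicit join */
            pthread_join(t[i], NULL);
        return 0;
    }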

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 19


Message Passing Model
• Tasks exchange data through communications by
sending and receiving messages
• usually requires cooperative operations to be
performed by each process: a send operation must
have a matching receive operation

[Diagram: task 0 on Machine A calls send(data), task 1 on Machine B calls the matching receive(data), and the data travels over the Network. A code sketch follows below.]
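A minimal C/MPI sketch of the matching send/receive pair shown in the figure (assuming at least two ranks, e.g. mpirun -np 2):

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv)
    {
        int rank, data = 0;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {                 /* task 0: sender */
            data = 42;
            MPI_Send(&data, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {          /* task 1: matching receive */
            MPI_Recv(&data, 1, MPI_INT, 0, 0, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("task 1 received %d\n", data);
        }

        MPI_Finalize();
        return 0;
    }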

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 20


Message Passing Model
• Implementations
• Parallel Virtual Machine (PVM)
• Not much in use any more
• Message Passing Interface (MPI)
• Part 1 released in 1994
• Part 2 (MPI-2) released in 1996
• http://www-unix.mcs.anl.gov/mpi/
• Now de-facto standard
• Fortran, C, C++
• Available on virtually all machines
• OpenMPI, MPICH, LAM/MPI, many vendor specific versions
• On shared memory machines, MPI implementations usually
don't use a network for task communications

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 21


Data Parallel Model
• A set of tasks work collectively on the same data
structure
• Each task works on a different partition of the
same data structure

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 22


Data Parallel Model
• Implementations
• Fortran 90
• ISO/ANSI extension of Fortran 77
• Additions to program structure and commands
• Variable additions – methods and arguments
• High Performance Fortran (HPF)
• Contains everything in F90
• Directives to tell compiler how to distribute data added
• Data parallel constructs added (now part of F95)
• On distr. memory machines: translated into MPI code

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 23


Hybrid Programming Models
• Two or more of the previous models are used in the
same program
• Common examples:
• POSIX Threads and Message Passing (MPI)
• OpenMP and MPI
• ClusterOpenMP (Intel)
• Works well on network of SMP machines
• Also used:
• Data Parallel and MPI

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 24


Designing Parallel Programs
• No real parallelizing compilers
• Compiler “knows” how to parallelize certain constructs (e.g.
loops)
• Compiler uses “directives” from programmer
• Not simply a matter of taking a sequential algorithm and "making it parallel". Sometimes a completely different algorithmic approach is necessary
• A very time-consuming and labor-intensive task

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 25


Parallelization Techniques
• Domain Decomposition
• Data is partitioned
• Each task works on different part of data
• Three different ways to partition data
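One common way, block decomposition, can be sketched in a few lines of C (array size and task count are illustrative only): each task computes the index range it owns and works only on that slice.

    #include <stdio.h>

    /* Block decomposition of N elements over ntasks tasks: task 'rank'
       owns indices [lo, hi); the remainder is spread over the first
       (N % ntasks) tasks. */
    static void block_range(int N, int ntasks, int rank, int *lo, int *hi)
    {
        int base = N / ntasks, rem = N % ntasks;
        *lo = rank * base + (rank < rem ? rank : rem);
        *hi = *lo + base + (rank < rem ? 1 : 0);
    }

    int main(void)
    {
        int lo, hi;
        for (int rank = 0; rank < 4; rank++) {   /* 4 tasks, 10 elements */
            block_range(10, 4, rank, &lo, &hi);
            printf("task %d owns indices [%d, %d)\n", rank, lo, hi);
        }
        return 0;
    }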

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 26


Parallelization Techniques
• Functional Decomposition
• Problem is partitioned into set of independent tasks

Both types of decomposition can be and often are combined

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 27


A little Theory
• Some problems can be parallelized very well:
In complexity theory, the class NC ("Nick's
Class") is the set of decision problems decidable
in poly-logarithmic time on a parallel computer
with a polynomial number of processors. In other
words, a problem is in NC if there are constants c
and k such that it can be solved in time O((log n)^c) using O(n^k) parallel processors.

Source: http://en2.wikipedia.org/wiki/Class_NC

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 28


A little Theory
• Some problems can't be parallelized at all!
• Example: Calculating the Fibonacci Sequence
(1,1,2,3,5,8,13,21,...) by using the formula

F 1=1
F 2=1
F k2=F k F k 1

• The calculation entails dependent calculations: the value for k + 2 uses the values for both k + 1 and k. These three terms cannot be calculated independently and therefore cannot be parallelized.

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 29


Communication
• Decomposed problems typically need to communicate:
• Partial results need to be combined
• Changes to neighboring data have effects on a task's data
• Some problems don't need communication:
• “Embarrassingly” parallel problems

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 30


Cost of Communication
• Communicating data takes time
• Inter-task comm. has overhead
• Often synchronization is necessary
• Communication is much more “expensive” than
computation
• Communicating data needs to save a lot of computation
before it pays off
• Infiniband needs < 10µs to set up communication
• 2.4GHz AMD Opteron CPU needs ~0.4ns to perform one
floating point operation (Flop)
• 25,000 floating point operations per communication setup!

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 31


Latency - Bandwidth
• Latency: the amount of time for the first bit of
data to arrive at the other end
• Bandwidth: how much data per time unit fits
through

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 32


Cost of Communication
• Formula for the time needed to transmit data

cost = L + N / B

L = Latency [s]
N = number of bytes [byte]
B = Bandwidth [byte/s]
cost [s]
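A small C sketch of the formula with assumed, purely illustrative numbers (10 µs latency, 1 GB/s bandwidth):

    /* cost = L + N/B : time to transmit N bytes */
    double transfer_cost(double latency_s, double bytes, double bandwidth_Bps)
    {
        return latency_s + bytes / bandwidth_Bps;
    }

    /* e.g. transfer_cost(10e-6, 1e6, 1e9) is about 1.01 ms: for a 1 MB
       message the bandwidth term dominates; for a few bytes the latency
       term dominates. */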

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 33


Visibility of Communication
• With MPI, communication is explicit and very visible
• “Latency Hiding”:
• Communicate and, at the same time, do some other computations
• Implementation via parallel threads or non-blocking MPI
communication functions
• Makes programs faster but more complex
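A hedged C/MPI sketch of latency hiding with non-blocking calls; the buffer names, counts, neighbor ranks and compute_something_independent() are placeholders, not part of any particular program:

    /* Start the receive and send, compute while the data is in flight,
       then wait for completion before using recvbuf. */
    MPI_Request reqs[2];
    MPI_Irecv(recvbuf, count, MPI_DOUBLE, left,  0, MPI_COMM_WORLD, &reqs[0]);
    MPI_Isend(sendbuf, count, MPI_DOUBLE, right, 0, MPI_COMM_WORLD, &reqs[1]);

    compute_something_independent();     /* work that does not need recvbuf */

    MPI_Waitall(2, reqs, MPI_STATUSES_IGNORE);
    /* recvbuf is now valid; sendbuf may be reused */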

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 34


Scope of Communication
• Knowing which tasks must communicate with each
other is critical during the design stage of a parallel
program
• Point-to-Point: involves two tasks with one task acting as
the sender/producer of data, and the other acting as the
receiver/consumer
• Collective: involves data
sharing between more than
two tasks, which are often
specified as being
members in a common
group, or collective

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 35


Communication Hardware
Architecture: Myrinet (http://www.myricom.com/)
• Comment: proprietary but commodity
• Bandwidth (sustained one-way, large messages): ~1.2 GB/s
• Latency (short messages): ~3 µs

Architecture: Infiniband (http://www.infinibandta.org/)
• Comment: vendor-independent standard
• Bandwidth: ~900 MB/s (4x HCAs)
• Latency: ~10 µs

Architecture: Quadrics (QsNet) (http://www.quadrics.com/)
• Comment: expensive, proprietary
• Bandwidth: ~900 MB/s
• Latency: ~2 µs

Architecture: Gigabit Ethernet
• Comment: commodity
• Bandwidth: ~100 MB/s
• Latency: ~60 µs

Custom: SGI, IBM, Cray, Sun, Compaq, ...

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 36


Communication Hardware
    Interconnect                        Latency   Peak BW   N/2       BW       CPU over-
                                        (µs)      (MB/s)    (Bytes)   (MB/s)   head (%)
    InfiniBand: Mellanox MHGA28         2.25      1502      512       750      ~5
    InfiniBand: QLogic InfiniPath HT    1.3       954       385       470      ~40
    Proprietary: Myrinet F              2.6       493       2000      250      ~10
    Proprietary: Myrinet 10G            2.0       1200      2000      600      ~10
    Proprietary: Quadrics QM500         1.6       910       1000      450      ~50
    GigE                                30-100    125       8000      60       >50
    10GigE: Chelsio T210-CX             9.6       860       100,000   430      ~50

*Mellanox Technology testing; Ohio State University; PathScale, Myricom, Quadrics, and Chelsio websites
N/2: message size needed to achieve half the peak bandwidth

http://www.mellanox.com/applications/performance_benchma
Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 37
Synchronization
• “handshaking” between tasks that are sharing data
• Types of synchronization:
• Barrier
• Usually implies that all tasks are involved
• Each task performs its work until it reaches the barrier.
It then stops, or "blocks"
• When the last task reaches the barrier, all tasks are
synchronized
• Used in MPI
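A minimal MPI sketch of a barrier (do_local_work() stands in for a task's own computation):

    /* Every rank must reach the barrier before any rank continues. */
    do_local_work();                      /* placeholder for this task's work */
    MPI_Barrier(MPI_COMM_WORLD);          /* block until all ranks arrive     */
    /* at this point all tasks are synchronized */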

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 38


Synchronization
• More types:
• Lock/Semaphore
• Can involve any number of tasks
• Typically used to serialize (protect) access to global data or a
section of code. Only one task at a time may use (own) the
lock / semaphore / flag
• The first task to acquire the lock "sets" it. This task can
then safely (serially) access the protected data or code.
• Other tasks can attempt to acquire the lock but must wait
until the task that owns the lock releases it.
• Can be blocking or non-blocking
• Used in threads and shared memory
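A small POSIX Threads sketch of the lock idea (the shared counter is illustrative only); only the task holding the mutex may touch the protected data.

    #include <pthread.h>

    static long counter = 0;                         /* protected shared data */
    static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

    void *increment(void *arg)
    {
        pthread_mutex_lock(&lock);     /* first task to arrive "sets" the lock */
        counter++;                     /* serialized access to shared data     */
        pthread_mutex_unlock(&lock);   /* waiting tasks may now acquire it     */
        return NULL;
    }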

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 39


Synchronization
• More types:
• Synchronous Communication Operations
• Involves only those tasks executing a communication
operation
• When a task performs a communication operation, some form
of coordination is required with the other task(s)
participating in the communication. For example, before a
task can perform a send operation, it must first receive an
acknowledgment from the receiving task that it is OK to
send.

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 40


Granularity
• Qualitative measure of
Computation / Communication Ratio
• Typically, periods of computation are separated from periods of communication by synchronization events
• Fine-grain parallelism: small amount of computation between communication events
• Coarse-grain parallelism: large amount of computation between communication events

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 41


Granularity
• Fine-Grain
  • Low computation to communication ratio
  • Facilitates load balancing
  • High communication overhead; less opportunity for performance enhancement
• Coarse-Grain
  • High computation to communication ratio
  • More opportunity for performance increase
  • Harder to load balance efficiently

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 42


Data In- and Output
• Parallel computers with thousands of nodes can
handle huge amounts of data
• It is hard to get this data in and out of the nodes
• parallel-I/O systems are still fairly new and not available
for all platforms
• I/O over the network (like NFS) causes severe bottlenecks
• Help can be found with
• Parallel File Systems: Lustre, PVFS2, GPFS (IBM)
• MPI-2 provides support for parallel file systems
• Rule #1: Reduce overall I/O as much as possible!
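A hedged sketch of MPI-2 parallel I/O in C (the file name and the variables rank, local_n and local_data are placeholders): each rank writes its own block of one shared file at a rank-dependent offset.

    MPI_File fh;
    MPI_Offset offset = (MPI_Offset)rank * local_n * sizeof(double);

    MPI_File_open(MPI_COMM_WORLD, "out.dat",
                  MPI_MODE_CREATE | MPI_MODE_WRONLY, MPI_INFO_NULL, &fh);
    MPI_File_write_at(fh, offset, local_data, local_n, MPI_DOUBLE,
                      MPI_STATUS_IGNORE);
    MPI_File_close(&fh);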

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 43


Efficiency
• Speedup Sp = Ts / Tp
• Efficiency = Sp / p

• Value between zero and one


• estimate how well-utilized the processors are in solving the
problem, compared to how much effort is wasted in
communication and synchronization
• linear speedup and algorithms running on a single processor
have an efficiency of 1
• many difficult-to-parallelize algorithms have efficiency
such as 1/log p that approaches zero as the number of
processors increases

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 44


Limits and Costs
• Besides theoretical limits and hardware limits,
there are practical limits to parallel computing
• Amdahl's Law states that potential program
speedup is defined by the fraction of code (P) that
can be parallelized:   speedup = 1 / (1 - P)
• If none of the code can be parallelized,
P = 0 and the speedup = 1 (no speedup).
If all of the code is parallelized, P = 1 and the
speedup is infinite (in theory).
• If 50% of the code can be parallelized, maximum
speedup = 2, meaning the code will run twice as fast.

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 45


Limits and Costs
• Introducing the number of processors performing
the parallel fraction of work, Amdahl's Law can be
reformulated as

speedup = 1 / (P/N + S)

where N = number of processors, P = parallel fraction, and S = 1 - P = serial fraction.

Speedup (figure source: http://upload.wikimedia.org/wikipedia/en/7/7a/Amdahl-law.jpg):

    N          P=0.50    P=0.90    P=0.99    P=1.0
    10           1.82      5.26      9.17       10
    100          1.98      9.17     50.25      100
    1000         1.99      9.91     90.99     1000
    10000        1.99      9.99     99.02    10000
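The table can be reproduced from the formula with a few lines of C (values agree up to rounding):

    #include <stdio.h>

    /* Amdahl's law: speedup = 1 / (P/N + (1 - P)) */
    static double amdahl(double P, double N)
    {
        return 1.0 / (P / N + (1.0 - P));
    }

    int main(void)
    {
        double P[] = {0.50, 0.90, 0.99};
        int    N[] = {10, 100, 1000, 10000};

        for (int i = 0; i < 4; i++) {
            printf("N=%5d:", N[i]);
            for (int j = 0; j < 3; j++)
                printf("  P=%.2f -> %6.2f", P[j], amdahl(P[j], N[i]));
            printf("\n");
        }
        return 0;
    }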

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 46


Typical Parallel Applications
• Applications that are well suited for parallel
computers are
• Weather and ocean patterns
• Finite Element Method (FEM; crash tests for cars)
• Fluid dynamics, aerodynamics
• Simulation of electro-magnetic problems

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 47


Summary
• Overview of parallel computing concepts
• Hardware
• Software
• Programming
• Problems of parallel computing
• Communication is expensive (latency)
• I/O is expensive
• Techniques to work around these problems
• Problem decomposition (communicate larger data)
• Parallel File Systems plus supporting hardware
• $$$$ (faster communication fabric)

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 48


Acknowledgment/References
• Most of this talk is taken from
http://www.llnl.gov/computing/tutorials/parallel_comp/
• Theory book Introduction to Parallel Algorithms and
Architectures: Arrays, Trees, Hypercubes by F.
Thomson Leighton
• Hardware book Computer Architecture: A
Quantitative Approach (3rd edition) by John L.
Hennessy, David A. Patterson, David Goldberg
• http://www.top500.org/

Intro. to Parallel Computing – Spring 2007 Concepts of Parallel Computing – A. Wachsmann 49
