Basics of Parallel Programming: Unit-1


Unit-1

Basics of Parallel
Programming
Syllabus-Parallel & Distributed Computing
TEXT BOOKS:
1. Parallel Programming in C with MPI and OpenMP, M. J. Quinn, McGraw-Hill Science/Engineering/Math.
2. Introduction to Parallel Computing, Ananth Grama, Anshul Gupta, George Karypis, Vipin Kumar, Pearson.
3. Distributed Computing, Sunita Mahajan and Seema Shah, Oxford University Press.
4. Distributed Systems: Concepts and Design, G. Coulouris, J. Dollimore, and T. Kindberg, Pearson Education.
5. Mastering Cloud Computing: Foundations and Applications Programming, Rajkumar Buyya, Christian Vecchiola, S. Thamarai Selvi, MK.

Reference Books:

1. Introduction to Parallel Processing, M. Sasikumar, Dinesh Shikhare, P. Raviprakash, PHI Publication.
2. Parallel Computers: Architecture and Programming, V. Rajaraman and C. Siva Ram Murthy.
3. Distributed Systems: Principles and Paradigms, A. S. Tanenbaum.
4. Cloud Computing Bible, Barrie Sosinsky, John Wiley & Sons, ISBN-13: 978-0470903568.
5. Cloud Computing: Principles and Paradigms, Rajkumar Buyya, John Wiley & Sons, First Edition.
Introduction-What is Parallel
Computing?
• The von Neumann model for designing and building computers is based on the
following three characteristics:

1) The computer consists of four main sub-systems:


• Memory
• ALU (Arithmetic/Logic Unit)
• Control Unit
• Input/output System (I/O)

2) Program is stored in memory during execution.

3) Program instructions are executed sequentially.


The Von Neumann Architecture
[Diagram: the Von Neumann architecture. A bus connects the processor (CPU), the memory, and the input-output system. The CPU contains the Control Unit and the ALU, which performs the arithmetic/logic operations requested by the program. The memory stores data and the program. The I/O system communicates with the "outside world", e.g. screen, keyboard, and storage.]
What is Parallel Computing?
• Traditionally, software has been written for serial
computation:
• To be run on a single computer having a single Central Processing
Unit (CPU);
• A problem is broken into a discrete series of instructions.
• Instructions are executed one after another.
• Only one instruction may execute at any moment in time.
Limitations of Serial Computing
• Limits to serial computing - both physical and practical reasons
pose significant constraints to simply building ever faster serial
computers.

• Transmission speeds - the speed of a serial computer is


directly dependent upon how fast data can move through
hardware. Absolute limits are the speed of light (30 cm/
nanosecond) and the transmission limit of copper wire (9 cm/
nanosecond). Increasing speeds necessitate increasing
proximity of processing elements.

• Economic limitations - it is increasingly expensive to make a


single processor faster. Using a larger number of moderately
fast commodity processors to achieve the same (or better)
performance is less expensive.
What is Parallel Computing? (2)
• In the simplest sense, parallel computing is the simultaneous use of
multiple compute resources to solve a computational problem.
• To be run using multiple CPUs
• A problem is broken into discrete parts that can be solved
concurrently
• Each part is further broken down to a series of instructions
• Instructions from each part execute simultaneously on different CPUs
Motivating Parallelism

• The role of parallelism in accelerating computing speeds


has been recognized for several decades.

• Its role in providing multiplicity of datapaths and


increased access to storage elements has been
significant in commercial applications.

• The scalable performance and lower cost of parallel


platforms is reflected in the wide variety of applications.
Motivating Parallelism
• Developing parallel hardware and software has traditionally been
time and effort intensive.

• If one is to view this in the context of rapidly improving uniprocessor


speeds, one is tempted to question the need for parallel computing.

• There are some unmistakable trends in hardware design, which


indicate that uniprocessor architectures may not be able to sustain
the rate of realizable performance increments in the future.

• This is the result of a number of fundamental physical and


computational limitations.

• The emergence of standardized parallel programming environments,


libraries, and hardware have significantly reduced time to (parallel)
solution.
The Memory/Disk Speed
Argument
• While clock rates of high-end processors have increased at roughly
40% per year over the past decade, DRAM access times have only
improved at the rate of roughly 10% per year over this interval.

• This mismatch in speeds causes significant performance bottlenecks.

• Parallel platforms provide increased bandwidth to the memory


system.

• Parallel platforms also provide higher aggregate caches.

• Principles of locality of data reference and bulk access, which guide


parallel algorithm design also apply to memory optimization.

• Some of the fastest growing applications of parallel computing utilize


not their raw computational speed, rather their ability to pump data to
memory and disk faster.
The Data Communication
Argument
• As the network evolves, the vision of the Internet as one
large computing platform has emerged.

• This view is exploited by applications such as


SETI@home and Folding@home.

• In many other applications (typically databases and data


mining) the volume of data is such that they cannot be
moved.

• Any analyses on this data must be performed over the


network using parallel techniques.
Applications in Engineering and
Design
• Design of airfoils (optimizing lift, drag, stability), internal
combustion engines (optimizing charge distribution, burn),
high-speed circuits (layouts for delays and capacitive and
inductive effects), and structures (optimizing structural
integrity, design parameters, cost, etc.).

• Design and simulation of micro- and nano-scale systems.

• Process optimization, operations research.


Scientific Applications
• Functional and structural characterization of genes and
proteins.

• Advances in computational physics and chemistry have


explored new materials, understanding of chemical pathways,
and more efficient processes.

• Applications in astrophysics have explored the evolution of


galaxies, thermonuclear processes, and the analysis of
extremely large datasets from telescopes.

• Weather modeling, mineral prospecting, flood prediction, etc.,


are other important applications.
Commercial Applications
• Some of the largest parallel computers power Wall Street!

• Data mining and analysis for optimizing business and


marketing decisions.

• Large scale servers (mail and web servers) are often


implemented using parallel platforms.

• Applications such as information retrieval and search are


typically powered by large clusters.
Applications in Computer Systems
• Network intrusion detection, cryptography, multiparty
computations are some of the core users of parallel
computing techniques.

• Embedded systems increasingly rely on distributed


control algorithms.

• A modern automobile consists of tens of processors


communicating to perform complex tasks for optimizing
handling and performance.
Classification of Parallel Computer
• Flynn’s taxonomy
• SISD
• SIMD
• MISD
• MIMD

• Classification based on the memory arrangement


• Shared Memory
• Message Passing

• Classification based on communication


• Static Network
• Dynamic Network

• Classification based on the kind of parallelism


• Data
• Instruction
Flynn's Classical Taxonomy
• There are different ways to classify parallel computers.
One of the more widely used classifications, in use since
1966, is called Flynn's Taxonomy.

• Flynn's taxonomy distinguishes multi-processor computer


architectures according to how they can be classified
along the two independent dimensions of Instruction
and Data. Each of these dimensions can have only one
of two possible states: Single or Multiple.
Flynn Matrix
• The matrix below defines the 4 possible classifications according to Flynn:

                           Single Data     Multiple Data
  Single Instruction       SISD            SIMD
  Multiple Instruction     MISD            MIMD
Single Instruction, Single Data
(SISD)
• A serial (non-parallel) computer

• Single instruction: only one instruction


stream is being acted on by the CPU
during any one clock cycle

• Single data: only one data stream is being


used as input during any one clock cycle

• This is the oldest and until recently, the


most prevalent form of computer

• Examples: most PCs, single CPU


workstations and mainframes
Single Instruction, Multiple Data (SIMD)
• A type of parallel computer
• Single instruction: All processing units execute the same instruction at
any given clock cycle
• Multiple data: Each processing unit can operate on a different data
element
• This type of machine typically has an instruction dispatcher, a very high-
bandwidth internal network, and a very large array of very small-
capacity instruction units.
• Best suited for specialized problems characterized by a high degree of
regularity, such as image processing.
• Examples:
• Processor Arrays: Connection Machine CM-2, Maspar MP-1, MP-2
• Vector Pipelines: IBM 9000, Cray C90, Fujitsu VP, NEC SX-2, Hitachi
S820
Single Instruction, Multiple Data
(SIMD)…
Multiple Instruction, Single Data (MISD)
• A single data stream is fed into multiple processing
units.

• Each processing unit operates on the data
independently via independent instruction streams.

• Few actual examples of this class of parallel computer


have ever existed. One is the experimental Carnegie-
Mellon C.mmp computer (1971).
Multiple Instruction, Single Data
(MISD)…
Multiple Instruction, Multiple Data
(MIMD)
• Currently, the most common type of parallel computer. Most
modern computers fall into this category.

• Multiple Instruction: every processor may be executing a different


instruction stream

• Multiple Data: every processor may be working with a different


data stream

• Execution can be synchronous or asynchronous, deterministic or


non-deterministic

• Examples: most current supercomputers, networked parallel


computer "grids" and multi-processor SMP computers - including
some types of PCs.
Multiple Instruction, Multiple Data
(MIMD)…
Flynn taxonomy

– Advantages of Flynn
» Universally accepted
» Compact Notation
» Easy to classify a system (?)
– Disadvantages of Flynn
» Very coarse-grain differentiation among
machine systems
» Comparison of different systems is limited
» Interconnections, I/O, memory not
considered in the scheme
Classification based on memory arrangement
• Shared Memory
• Distributed Memory

[Diagram: (left) a shared-memory multiprocessor, in which processors PE1 ... PEn and I/O devices access a common shared memory through an interconnection network; (right) a message-passing multicomputer, in which each processor P1 ... Pn has its own local memory M1 ... Mn and the nodes communicate through an interconnection network.]
Shared Memory
• Shared memory parallel computers vary widely, but generally have
in common the ability for all processors to access all memory as
global address space.

• Multiple processors can operate independently but share the same


memory resources.
• Changes in a memory location effected by one processor are visible
to all other processors.
• Shared memory machines can be divided into two main classes
based upon memory access times: UMA and NUMA.
The UMA Model
• Tightly-coupled systems (high degree of resource sharing)
• Suitable for general-purpose and time-sharing applications
by multiple users.
The NUMA Model
• The access time varies with the location of the memory word.
• Shared memory is distributed to local memories.
• All local memories form a global address space accessible by
all processors

• Access time hierarchy: cache, then local memory, then remote memory.

[Diagram: Distributed Shared Memory (NUMA)]

Cache Only Memory Architecture
(COMA)
• The COMA model is a special case of the NUMA model. Here,
all the distributed main memories are converted to cache
memories.
Shared Memory: Advantages

• Global address space provides a user-friendly


programming perspective to memory

• Data sharing between tasks is both fast and uniform


due to the proximity of memory to CPUs
Shared Memory: Disadvantages
• Primary disadvantage is the lack of scalability between memory and CPUs. Adding more CPUs can geometrically increase traffic on the shared memory-CPU path and, for cache-coherent systems, geometrically increase traffic associated with cache/memory management.

• Programmer responsibility for synchronization constructs that ensure "correct" access of global memory (see the sketch after this list).

• Expense: it becomes increasingly difficult and


expensive to design and produce shared memory
machines with ever increasing numbers of processors.
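
A minimal sketch of that synchronization burden, assuming OpenMP (one of the environments used in the course textbook); the counter and loop bound are illustrative only. Without the critical section, the concurrent increments would race and the final value would be unpredictable:

    #include <omp.h>
    #include <stdio.h>

    int main(void) {
        long counter = 0;                        /* shared global state */

        #pragma omp parallel for
        for (long i = 0; i < 1000000; i++) {
            #pragma omp critical                 /* programmer-supplied synchronization */
            counter++;
        }

        printf("counter = %ld\n", counter);      /* expected: 1000000 */
        return 0;
    }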
Distributed memory multicomputers
• Multiple computers- nodes
• Message-passing network
• Local memories are private with its own program and data
• No memory contention so that the number of processors is very
large
• The processors are connected by communication lines, and the
precise way in which the lines are connected is called the
topology of the multicomputer.
• A typical program consists of subtasks residing in all the
memories.
Distributed Memory: Pro and Con
• Advantages:

• Memory is scalable with number of processors. Increase the


number of processors and the size of memory increases
proportionately.
• Each processor can rapidly access its own memory without
interference and without the overhead incurred with trying to
maintain cache coherency.
• Cost effectiveness: can use commodity, off-the-shelf processors
and networking.

• Disadvantages:

• The programmer is responsible for many of the details associated


with data communication between processors.
• Non-uniform memory access (NUMA) times
Classification based on type of interconnections

• Static networks

• Dynamic networks
Interconnection Networks for Parallel
Computers
• Interconnection networks carry data between processors
and to memory.
• Interconnects are made of switches and links (wires,
fiber).
• Interconnects are classified as static or dynamic.
• Static networks consist of point-to-point communication
links among processing nodes and are also referred to as
direct networks.
• Dynamic networks are built using switches and
communication links. Dynamic networks are also referred
to as indirect networks.
Static and Dynamic
Interconnection Networks

Classification of interconnection networks: (a) a static


network; and (b) a dynamic network.
Interconnection Networks: Network
Interfaces
• Processors talk to the network via a network interface.

• The network interface may hang off the I/O bus or the
memory bus.

• It is used for packetizing data, computing routing information, buffering of input and output data, and error checking.
Classification based on the kind of parallelism [3]

Parallel architectures (PAs)
• Data-parallel architectures (DPs)
  • Vector architectures
  • Associative and neural architectures
  • SIMD architectures
  • Systolic architectures
• Function-parallel architectures
  • Instruction-level PAs (ILPs)
    • Pipelined processors
    • VLIW processors
    • Superscalar processors
  • Thread-level PAs
  • Process-level PAs (MIMDs)
    • Distributed-memory MIMD (multi-computers)
    • Shared-memory MIMD (multi-processors)
Instruction-level parallelism
• A computer program, is in essence, a stream of instructions
executed by a processor. These instructions can be re-ordered
and combined into groups which are then executed in parallel
without changing the result of the program. This is known as
instruction-level parallelism.

• Modern processors have multi-stage instruction pipelines.


Each stage in the pipeline corresponds to a different action the
processor performs on that instruction in that stage; a
processor with an N-stage pipeline can have up to N different
instructions at different stages of completion.
Instruction-level parallelism

[Diagram: a canonical five-stage pipeline in a RISC machine (IF = Instruction Fetch, ID = Instruction Decode, EX = Execute, MEM = Memory access, WB = Register write back).]
Instruction-level parallelism
• In addition to instruction-level parallelism from pipelining, some
processors can issue more than one instruction at a time.
These are known as superscalar processors.
Instruction-level parallelism
• Consider the following program:

  1. e = a + b (independent)
  2. f = c + d (independent)
  3. m = e * f (dependent on 1 and 2)

• If we assume that each operation can be completed in one unit of time, then these three instructions can be completed in a total of two units of time, giving an ILP of 3/2.
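
The same example written as a minimal C sketch; whether the hardware actually issues the two additions in the same cycle is up to the processor, and the cycle comments only mark the data dependences:

    int ilp_example(int a, int b, int c, int d) {
        int e = a + b;   /* independent of f: can issue in cycle 1      */
        int f = c + d;   /* independent of e: can issue in cycle 1 too  */
        int m = e * f;   /* needs both e and f: waits until cycle 2     */
        return m;        /* 3 instructions in 2 time units -> ILP = 1.5 */
    }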
Data Level parallelism
• Data parallelism is a form of parallelization of computing
across multiple processors in parallel computing
environments. Data parallelism focuses on distributing
the data across different parallel computing nodes.

• In a multiprocessor system executing a single set of


instructions (SIMD), data parallelism is achieved when
each processor performs the same task on different
pieces of distributed data.

• For instance, consider a 2-processor system (CPUs A


and B) in a parallel environment, and we wish to do a
task on some data d.
Data Level parallelism
• Example (pseudocode run by every CPU in SPMD style; a compilable C sketch follows the explanation below):

  if CPU = "a"
      lower_limit := 1
      upper_limit := round(d.length / 2)
  else if CPU = "b"
      lower_limit := round(d.length / 2) + 1
      upper_limit := d.length

  for i from lower_limit to upper_limit by 1
      foo(d[i])
Data Level parallelism
• In an SPMD system, both CPUs will execute the code.

• In a parallel environment, both will have access to d.

• A mechanism is presumed to be in place whereby each CPU will


create its own copy of lower_limit and upper_limit that is independent
of the other.

• The if clause differentiates between the CPUs. CPU "a" will read true
on the if; and CPU "b" will read true on the else if, thus having their
own values of lower_limit and upper_limit.

• Now, both CPUs execute foo(d[i]), but since each CPU has different
values of the limits, they operate on different parts of d simultaneously,
thereby distributing the task among themselves.
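
A minimal compilable sketch of the same data-parallel split, assuming OpenMP; foo() is a hypothetical per-element routine and the array contents are illustrative. The static schedule hands each thread a contiguous chunk of indices, mirroring the lower_limit/upper_limit computation in the pseudocode:

    #include <omp.h>
    #include <stdio.h>

    #define N 8

    static void foo(double x) {                      /* hypothetical per-element work */
        printf("thread %d processes %.1f\n", omp_get_thread_num(), x);
    }

    int main(void) {
        double d[N] = {1, 2, 3, 4, 5, 6, 7, 8};

        /* Each thread operates on a different contiguous part of d. */
        #pragma omp parallel for schedule(static)
        for (int i = 0; i < N; i++)
            foo(d[i]);

        return 0;
    }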
Data Level parallelism - Speedup

Task Level Parallelism
• Task parallelism is the characteristic of a parallel program
that "entirely different calculations can be performed on
either the same or different sets of data".

• In a multiprocessor system, task parallelism is achieved


when each processor executes a different thread (or
process) on the same or different data.

• In the general case, different execution threads


communicate with one another as they work.
Task Level Parallelism
• As a simple example, if we are running code on a 2-processor system (CPUs "a" and "b") in a parallel environment and we wish to do tasks "A" and "B", it is possible to tell CPU "a" to do task "A" and CPU "b" to do task "B" simultaneously, thereby reducing the run time of the execution.

• Pseudocode (run by both CPUs):

  program:
      if CPU = "a" then
          do task "A"
      else if CPU = "b" then
          do task "B"
      end if
      ...
  end program
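
A minimal sketch of the same idea in C, assuming OpenMP sections; task_A() and task_B() are hypothetical placeholders for the two independent tasks:

    #include <omp.h>
    #include <stdio.h>

    static void task_A(void) { printf("task A on thread %d\n", omp_get_thread_num()); }
    static void task_B(void) { printf("task B on thread %d\n", omp_get_thread_num()); }

    int main(void) {
        /* Each section is handed to a different thread, so the two
           unrelated tasks can run at the same time. */
        #pragma omp parallel sections
        {
            #pragma omp section
            task_A();

            #pragma omp section
            task_B();
        }
        return 0;
    }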
Process Coordination
Shared Memory vs. Message Passing

• Shared memory
  • Efficient, familiar
  • Not always available
  • Potentially insecure

  Example (two processes communicating through a shared global variable):

    global int x

    process foo        process bar
    begin              begin
      :                  :
      x := ...           y := x
      :                  :
    end foo            end bar

• Message passing
  • Extensible to communication in distributed systems

  Canonical syntax:

    send(process : process_id, message : string)
    receive(process : process_id, var message : string)
Shared Memory Programming
Model
• Programs/threads communicate/cooperate via loads/
stores to memory locations they share.

• Communication is therefore at memory access speed


(very fast), and is implicit.

• Cooperating pieces must all execute on the same system


(computer).

• OS services and/or libraries are used for creating tasks (processes/threads) and for coordination (semaphores/barriers/locks).
Shared Memory Code
fork N processes
each process has a number, p, and computes
istart[p], iend[p], jstart[p], jend[p]

for (s = 0; s < STEPS; s++) {
    k = s & 1;  m = k ^ 1;                    /* ping-pong between the two copies of the grid */
    forall (i = istart[p]; i <= iend[p]; i++) {
        forall (j = jstart[p]; j <= jend[p]; j++) {
            a[k][i][j] = c1*a[m][i][j]   + c2*a[m][i-1][j] +
                         c3*a[m][i+1][j] + c4*a[m][i][j-1] +
                         c5*a[m][i][j+1];     /* implicit communication through shared a[][][] */
        }
    }
    barrier();                                /* all processes finish step s before step s+1 */
}
Symmetric Multiprocessors
• Several processors share one address space
  • conceptually a shared memory
• Communication is implicit
  • read and write accesses to shared memory locations
• Synchronization
  • via shared memory locations (e.g. spin waiting for non-zero)
  • atomic instructions (test&set, compare&swap, load-linked/store-conditional)
  • barriers

[Diagram, conceptual model: processors P connected through a network to a shared memory M.]
Non-Uniform Memory Access - NUMA

• CPU/memory buses cannot support more than ~4-8 CPUs before bus bandwidth is exceeded (the SMP "sweet spot").
• To provide shared-memory multiprocessors beyond these limits requires some memory to be "closer to" some processors than to others.

[Diagram: nodes P1 ... Pn, each with its own cache ($) and memory (M), joined by an interconnect.]

• The "Interconnect" usually includes
  • a cache directory to reduce snoop traffic
  • a remote cache to reduce access latency (think of it as an L3)
• Cache-Coherent NUMA systems (CC-NUMA): SGI Origin, Stanford DASH, Sequent NUMA-Q, HP Superdome
• Non-Cache-Coherent NUMA (NCC-NUMA): Cray T3E
Message Passing Programming Model
• “Shared” data is communicated using “send”/”receive” services
(across an external network).

• Unlike the shared-memory model, shared data must be formatted into message chunks for distribution (the shared-memory model works no matter how the data is intermixed).

• Coordination is via sending/receiving messages.

• Program components can be run on the same or different systems,


so can use 1,000s of processors.

• “Standard” libraries exist to encapsulate messages:


• Parasoft's Express (commercial)
• PVM (standing for Parallel Virtual Machine, non-commercial)
• MPI (Message Passing Interface, also non-commercial).
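
A minimal sketch of the message-passing style using MPI (one of the "standard" libraries listed above); rank 0 sends an integer to rank 1 with the blocking send/receive calls:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, value;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            /* Blocking send: returns once the buffer can be reused. */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            /* Blocking receive: waits until the message has arrived. */
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }

Run with at least two processes, e.g. mpirun -np 2 ./a.out.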
Message Passing Issues
Synchronization semantics

• When does a send/receive operation terminate?

• Blocking (aka synchronous):
  • Sender waits until its message is received
  • Receiver waits if no message is available

• Non-blocking (aka asynchronous):
  • Send operation "immediately" returns
  • Receive operation returns if no message is available (polling)

• Partially blocking/non-blocking:
  • send()/receive() with timeout

• How many buffers?

[Diagram: sender and receiver processes exchanging messages through the OS kernel.]
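
A minimal sketch of the non-blocking style using MPI's MPI_Isend/MPI_Irecv followed by MPI_Wait, assuming the same two-process setup as the earlier send/receive sketch:

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char *argv[]) {
        int rank, value = 0;
        MPI_Request req;

        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {
            value = 42;
            MPI_Isend(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD, &req);
            /* ...other work could overlap with the transfer here... */
            MPI_Wait(&req, MPI_STATUS_IGNORE);   /* buffer may be reused after this */
        } else if (rank == 1) {
            MPI_Irecv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, &req);
            /* ...other work could overlap with the transfer here... */
            MPI_Wait(&req, MPI_STATUS_IGNORE);   /* data is valid after this */
            printf("rank 1 received %d\n", value);
        }

        MPI_Finalize();
        return 0;
    }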
Interconnection Networks
• Switches map a fixed number of inputs to outputs.
• The total number of ports on a switch is the degree of the
switch.
• The cost of a switch grows as the square of the degree of
the switch, the peripheral hardware linearly as the degree,
and the packaging costs linearly as the number of pins.
Interconnection Networks:
Network Interfaces
• Processors talk to the network via a network interface.
• The network interface may hang off the I/O bus or the
memory bus.
• In a physical sense, this distinguishes a cluster from a
tightly coupled multicomputer.
• The relative speeds of the I/O and memory buses impact
the performance of the network.
Network Topologies
• A variety of network topologies have been proposed and
implemented.
• These topologies trade off performance for cost.
• Commercial machines often implement hybrids of
multiple topologies for reasons of packaging, cost, and
available components.
Network Topologies: Buses
• Some of the simplest and earliest parallel machines used
buses.
• All processors access a common bus for exchanging
data.
• The distance between any two nodes is O(1) in a bus.
The bus also provides a convenient broadcast medium.
• However, the bandwidth of the shared bus is a major
bottleneck.
• Typical bus based machines are limited to dozens of
nodes. Sun Enterprise servers and Intel Pentium based
shared-bus multiprocessors are examples of such
architectures.
Interconnection Networks:
Static and Dynamic Networks
• Two types of interconnection network:

• 1) Static (direct) network

• 2) Dynamic (indirect) network
Network Topologies:
Completely Connected Network (Direct
Network)
• Each processor is connected to every other processor.
• The number of links in the network is p(p - 1)/2, which scales as O(p²).
• While the performance scales very well, the hardware
complexity is not realizable for large values of p.
• In this sense, these networks are static counterparts of
crossbars.
Network Topologies: Completely Connected and
Star Connected Networks

[Diagram: (a) a completely-connected network of eight nodes; (b) a star-connected network of nine nodes.]
Network Topologies:
Star Connected Network

• Every node is connected only to a common node at the


center.
• Distance between any pair of nodes is O(1). However,
the central node becomes a bottleneck.
• In this sense, star connected networks are static
counterparts of buses.
Network Topologies:
Linear Arrays, Meshes, and k-d Meshes

• In a linear array, each node has two neighbors, one to its left and one to its right. If the nodes at either end are connected, we refer to it as a 1-D torus or a ring.
• A generalization to 2 dimensions has nodes with 4
neighbors, to the north, south, east, and west.

• A further generalization to d dimensions has nodes with


2d neighbors.

• A special case of a d-dimensional mesh is a hypercube.


Here, d = log p, where p is the total number of nodes.
Network Topologies:
Two- and Three Dimensional Meshes

Two and three dimensional meshes: (a) 2-D mesh with no


wraparound; (b) 2-D mesh with wraparound link (2-D torus); and
(c) a 3-D mesh with no wraparound.
Network Topologies:
Hypercubes and their Construction

Construction of hypercubes from hypercubes of lower


dimension.
Network Topologies:
Properties of Hypercubes

• The distance between any two nodes is at most log p.


• Each node has log p neighbors.

• The distance between two nodes is given by the number


of bit positions at which the two nodes differ.
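
Since hypercube nodes are labelled with log p-bit numbers, that distance is simply the Hamming distance between the two labels. A small sketch in C (the function name is illustrative):

    #include <stdio.h>

    /* Number of bit positions in which the labels of nodes a and b differ,
       i.e. the hop distance between them in a hypercube. */
    static int hypercube_distance(unsigned a, unsigned b) {
        unsigned diff = a ^ b;       /* set bits mark the differing dimensions */
        int hops = 0;
        while (diff) {
            hops += diff & 1u;
            diff >>= 1;
        }
        return hops;
    }

    int main(void) {
        /* In a 3-D hypercube (p = 8), nodes 0 (000) and 5 (101) differ in
           two bit positions, so they are two hops apart. */
        printf("%d\n", hypercube_distance(0, 5));   /* prints 2 */
        return 0;
    }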
Network Topologies:
Crossbars(Indirect Network)
• The cost of a crossbar of p processors grows as O(p²).
• This is generally difficult to scale for large values of p.
• Examples of machines that employ crossbars include the
Sun Ultra HPC 10000 and the Fujitsu VPP500.
Network Topologies:
Multistage Networks
• Crossbars have excellent performance scalability but
poor cost scalability.

• Buses have excellent cost scalability, but poor


performance scalability.

• Multistage interconnects strike a compromise between


these extremes.
Network Topologies:
Multistage Networks

The schematic of a typical multistage interconnection network.


Network Topologies: Multistage Omega Network

• One of the most commonly used multistage


interconnects is the Omega network.
• This network consists of log p stages, where p is the
number of inputs/outputs.
• At each stage, input i is connected to output j by the perfect-shuffle mapping:

  j = 2i             for 0 ≤ i ≤ p/2 - 1
  j = 2i + 1 - p     for p/2 ≤ i ≤ p - 1
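
A small sketch of this mapping in C (the function name is illustrative); the mapping is equivalent to rotating the log p-bit label of the input one position to the left:

    #include <stdio.h>

    /* Output index reached from input i in one stage of an Omega network
       with p inputs (p must be a power of two). */
    static int omega_shuffle(int i, int p) {
        if (i < p / 2)
            return 2 * i;
        else
            return 2 * i + 1 - p;
    }

    int main(void) {
        int p = 8;
        for (int i = 0; i < p; i++)                       /* 0->0, 1->2, 2->4, 3->6, */
            printf("%d -> %d\n", i, omega_shuffle(i, p)); /* 4->1, 5->3, 6->5, 7->7  */
        return 0;
    }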
Network Topologies:
Multistage Omega Network
Each stage of the Omega network implements a perfect
shuffle as follows:

A perfect shuffle interconnection for eight inputs and


outputs.
Network Topologies:
Multistage Omega Network
• The perfect shuffle patterns are connected using 2×2
switches.
• The switches operate in two modes – crossover or
passthrough.

Two switching configurations of the 2 × 2 switch:


(a) Pass-through; (b) Cross-over.
Network Topologies: Multistage Omega Network
A complete Omega network with the perfect shuffle
interconnects and switches can now be illustrated:

A complete omega network connecting eight inputs and eight outputs.

An omega network has p/2 × log p switching nodes, and the cost of such a network grows as Θ(p log p). For example, with p = 8 inputs there are (8/2) × 3 = 12 of the 2 × 2 switches.
