
Unit VI

Parallel Programming Concepts


High Performance Cluster Computing
Vol 1. Rajkumar Buyya
Introduction to Parallel Computing

A parallel computer is a “collection of processing elements that communicate and co-operate to solve large problems fast”.

“Processing of multiple tasks simultaneously on multiple processors is called parallel processing.”



What is Parallel Computing

 Parallel computing is a type of computation in which many calculations, or the execution of processes, are carried out simultaneously.

 Large problems can often be divided into smaller ones, which can then be solved at the same time.
What is Parallel Computing?
Traditionally, software has been written for serial computation:
To be run on a single computer having a single Central Processing Unit (CPU)
What is Parallel Computing?
In the simplest sense, parallel computing is the simultaneous use of
multiple compute resources to solve a computational problem.
Advantages of Parallel Computing

 It saves time and money, as many resources working together reduce the time and cut potential costs.

 It can be impractical to solve larger problems on serial computing.

 It can take advantage of non-local resources when the local resources are finite.

 Serial computing ‘wastes’ the potential computing power, whereas parallel computing makes better use of the hardware.
Types of Parallelism
Parallelism in Hardware (Uniprocessor)
▪ Parallelism in a Uniprocessor
– Pipelining
– Superscalar, VLIW etc.
▪ SIMD instructions, Vector processors, GPUs
▪ Multiprocessor
– Symmetric shared-memory multiprocessors
– Distributed-memory multiprocessors
– Chip-multiprocessors a.k.a. Multi-cores
▪ Multicomputers a.k.a. clusters
Parallelism in Software
▪ Instruction level parallelism
▪ Task-level parallelism
▪ Data parallelism
▪ Transaction level parallelism
Types of Parallelism

1. Instruction Level Parallelism

2. Thread or Task Level Parallelism

3. Data Level Parallelism

4. Bit Level Parallelism
Instruction-level parallelism (ILP)

 Instruction-level parallelism means the simultaneous execution of multiple instructions from a program.

 While pipelining is a form of ILP, we must exploit it to achieve parallel execution of the instructions in the instruction stream.

 Example:

for (i = 1; i <= 100; i = i + 1)
    y[i] = y[i] + x[i];

This is a parallel loop: every iteration of the loop can overlap with any other iteration, although within each loop iteration there is little opportunity for overlap.
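One way a compiler or programmer can expose this instruction-level parallelism is to unroll the loop so that several independent additions are available for issue at once. The sketch below is illustrative rather than taken from the textbook; it assumes the arrays x and y from the example have at least 101 elements.

#include <stddef.h>

/* A possible unrolled form of the loop above (illustrative only).  The four
 * additions in each unrolled iteration touch different elements, so a
 * pipelined or superscalar core can overlap their execution. */
void add_unrolled(double y[101], const double x[101])
{
    for (size_t i = 1; i <= 100; i += 4) {
        y[i]     += x[i];
        y[i + 1] += x[i + 1];
        y[i + 2] += x[i + 2];
        y[i + 3] += x[i + 3];
    }
}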
Thread-level or task-level parallelism (TLP)

 Task parallelism means concurrent execution of different tasks on multiple computing cores.

 An example of task parallelism might involve two threads, each performing a unique statistical operation on the array of elements. The threads operate in parallel on separate computing cores, but each performs a unique operation.
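A minimal Pthreads sketch of this idea is shown below; the array contents, its size, and the choice of sum and maximum as the two statistics are purely illustrative. One thread computes the sum while the other computes the maximum of the same data.

#include <pthread.h>
#include <stdio.h>

#define N 8                                       /* illustrative array size */

static const double data[N] = {4, 8, 15, 16, 23, 42, 7, 1};
static double sum_result, max_result;             /* each thread writes a different result */

static void *compute_sum(void *arg)               /* task 1: sum of the elements */
{
    (void)arg;
    sum_result = 0.0;
    for (int i = 0; i < N; i++)
        sum_result += data[i];
    return NULL;
}

static void *compute_max(void *arg)               /* task 2: maximum of the elements */
{
    (void)arg;
    max_result = data[0];
    for (int i = 1; i < N; i++)
        if (data[i] > max_result)
            max_result = data[i];
    return NULL;
}

int main(void)                                     /* compile with e.g. gcc -pthread */
{
    pthread_t t1, t2;
    pthread_create(&t1, NULL, compute_sum, NULL);  /* two different tasks ...  */
    pthread_create(&t2, NULL, compute_max, NULL);  /* ... run at the same time */
    pthread_join(t1, NULL);
    pthread_join(t2, NULL);
    printf("sum = %.1f, max = %.1f\n", sum_result, max_result);
    return 0;
}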
Data level parallelism (DLP)

 Data parallelism means concurrent execution of the same task on multiple computing cores, each core operating on a different portion of the data.

Let’s take an example: summing the contents of an array of size N. On a single-core system, one thread would simply sum the elements [0] . . . [N − 1]. On a dual-core system, however, thread A, running on core 0, could sum the elements [0] . . . [N/2 − 1] while thread B, running on core 1, sums the elements [N/2] . . . [N − 1]. The two threads run in parallel on separate computing cores.
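The following Pthreads sketch mirrors that description; the array size, sample values, and function names are illustrative. Thread A sums the first half of the array, thread B the second half, and the main thread combines the partial sums.

#include <pthread.h>
#include <stdio.h>

#define N 1000                        /* illustrative array size */

static double a[N];
static double partial[2];             /* one partial sum per thread */

static void *sum_half(void *arg)      /* same task, different half of the data */
{
    int id = *(int *)arg;             /* 0 -> [0 .. N/2-1], 1 -> [N/2 .. N-1] */
    int lo = id * (N / 2), hi = lo + N / 2;
    double s = 0.0;
    for (int i = lo; i < hi; i++)
        s += a[i];
    partial[id] = s;
    return NULL;
}

int main(void)                        /* compile with e.g. gcc -pthread */
{
    int ids[2] = {0, 1};
    pthread_t tA, tB;

    for (int i = 0; i < N; i++)       /* sample data: every element is 1.0 */
        a[i] = 1.0;

    pthread_create(&tA, NULL, sum_half, &ids[0]);   /* thread A: first half  */
    pthread_create(&tB, NULL, sum_half, &ids[1]);   /* thread B: second half */
    pthread_join(tA, NULL);
    pthread_join(tB, NULL);

    printf("total = %.1f\n", partial[0] + partial[1]);   /* prints total = 1000.0 */
    return 0;
}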
Bit-level parallelism
 Bit-level parallelism is a form of parallel computing based on
increasing processor word size, depending on very-large-scale
integration (VLSI) technology.
Historically, enhancements in computer design were made by increasing bit-level parallelism.

 For example, consider a case where an 8-bit processor must add two 16-bit integers. The processor must first add the 8 lower-order bits from each integer, then add the 8 higher-order bits, requiring two instructions to complete a single operation. A 16-bit processor can complete the operation with a single instruction.
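The two-step addition described above can be sketched in C as follows; the function name and sample operands are illustrative, and a real 8-bit processor would use an add followed by an add-with-carry instruction rather than C code.

#include <stdint.h>
#include <stdio.h>

/* Add two 16-bit values using only 8-bit arithmetic, mimicking the two
 * steps an 8-bit processor must take (low byte, then high byte plus carry). */
static uint16_t add16_with_8bit_alu(uint16_t a, uint16_t b)
{
    uint8_t a_lo = a & 0xFF, a_hi = a >> 8;
    uint8_t b_lo = b & 0xFF, b_hi = b >> 8;

    uint8_t lo    = (uint8_t)(a_lo + b_lo);         /* step 1: low-order bytes   */
    uint8_t carry = lo < a_lo;                      /* carry out of the low add  */
    uint8_t hi    = (uint8_t)(a_hi + b_hi + carry); /* step 2: high-order bytes  */

    return (uint16_t)((uint16_t)(hi << 8) | lo);
}

int main(void)
{
    printf("%u\n", add16_with_8bit_alu(300, 500));  /* prints 800 */
    return 0;
}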
Taxonomy of Parallel Computers
According to instruction and data streams (Flynn):
– Single instruction single data (SISD): this is the standard
uniprocessor
– Single instruction, multiple data streams (SIMD):
▪ Same instruction is executed in all processors with different data
▪ E.g., Vector processors, SIMD instructions, GPUs
– Multiple instruction, single data streams (MISD):
▪ Different instructions on the same data
▪ Fault-tolerant computers, Near memory computing (Micron Automata
processor).
Taxonomy of Parallel Computers
According to instruction and data streams (Flynn):
– Multiple instruction, multiple data streams (MIMD):
the “common” multiprocessor
▪ Each processor uses its own data and executes its own program
▪ Most flexible approach
▪ Easier/cheaper to build by putting together “off-the-shelf ” processors
Taxonomy of Parallel Computers
According to physical organization of processors and memory:
– Physically centralized memory, uniform memory access (UMA)
▪ All memory is allocated at the same distance from all processors
▪ Also called symmetric multiprocessors (SMP)
▪ Memory bandwidth is fixed and must accommodate all processors →
does not scale to large number of processors
▪ Used in CMPs today (single-socket ones)
Taxonomy of Parallel Computers
Physically distributed memory, non-uniform memory access (NUMA)
▪ A portion of memory is allocated with each processor (node)
▪ Accessing local memory is much faster than remote memory
▪ If most accesses are to local memory, then the overall memory bandwidth increases linearly with the number of processors
▪ Used in multi-socket CMPs, e.g. Intel Nehalem
Taxonomy of Parallel Computers
According to memory communication model
– Shared address or shared memory
▪ Processes in different processors can use the same virtual address
space
▪ Any processor can directly access memory in another processor node
▪ Communication is done through shared memory variables
▪ Explicit synchronization with locks and critical sections
▪ Arguably easier to program??
– Distributed address or message passing
▪ Processes in different processors use different virtual address spaces
▪ Each processor can only directly access memory in its own node
▪ Communication is done through explicit messages
▪ Synchronization is implicit in the messages
▪ Arguably harder to program??
▪ Some standard message passing libraries (e.g., MPI)
Motivating Parallelism

 The role of parallelism in accelerating computing speeds has been recognized for several decades.

 Its role in providing multiplicity of datapaths and increased access to storage elements has been significant in commercial applications.

 The scalable performance and lower cost of parallel platforms are reflected in the wide variety of applications.
Motivating Parallelism

 Developing parallel hardware and software has traditionally been time and effort
intensive.
 If one is to view this in the context of rapidly improving uniprocessor speeds, one
is tempted to question the need for parallel computing.

 The emergence of standardized parallel programming environments, libraries, and hardware has significantly reduced the time to (parallel) solution.
Motivating Parallelism

 The Computational Speed Argument: For some applications, this is the only means
of achieving needed performance.
 The Memory/Disk Speed Argument: For some other applications, the needed I/O
throughput can be provided only by a collection of nodes.
 The Data Communication Argument: In yet other applications, the distributed
nature of data implies that it is unreasonable to collect data to process it at a single
location.
In short, the motivations for parallel computing are:

1. Overcome the limits of serial computing
2. Limits to increasing transistor density
3. Limits to data transmission speed
4. Faster turn-around time
5. Ability to solve larger problems
Scope of Parallel Computing

 Parallel computing has a great impact on a wide range of applications:

 Commercial (industry, automation)

 Scientific (research)

 Minimum turn-around time (data mining, disk)

 High performance (clustering, image processing)

 Resource management (online exams, remote operation)

 Load balancing (adding or removing resources)

 Dynamic libraries (e.g. update or upgrade)

 Minimum network congestion and latency


Applications
 Commercial computing
- Weather forecasting
- Remote sensing, image processing
- Process optimization, operations research
 Scientific and engineering applications
- Computational chemistry
- Molecular modelling
- Structural mechanics
 Business applications
- E-Governance
- Medical imaging
 Internet applications
- Internet servers
- Digital libraries
Concurrency Vs Parallelism
 Concurrency is when two tasks can start, run, and complete in
overlapping time periods.

 The term parallelism refers to techniques to make programs faster by performing several computations at the same time.
Parallel Programming Platforms
 The traditional logical view of a sequential computer consists of a memory connected to a processor via a datapath. All three components – processor, memory, and datapath – present bottlenecks to the overall processing rate of a computer system.

 The main objective is to provide sufficient detail for programmers to be able to write efficient code on a variety of platforms.

 Common examples: Pthreads and MPI (Message Passing Interface).
Implicit Parallelism

 A programming language is said to be implicitly parallel if its compiler or interpreter can recognize opportunities for parallelization and implement them without being told to do so.

 Implicit parallelism is a characteristic of a programming language that allows a compiler or interpreter to automatically exploit the parallelism inherent in the computations expressed by some of the language's constructs.
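As a small illustration (not taken from the slides), a loop of the following form is a typical candidate for implicit parallelization, for example by GCC's auto-parallelizer (the -ftree-parallelize-loops option) or by a parallelizing interpreter, because its iterations are independent.

/* Every iteration writes a different c[i] and reads only a[i] and k, so
 * there are no cross-iteration dependences; an implicitly parallelizing
 * compiler is free to run the iterations concurrently without any
 * annotation from the programmer. */
void scale(double *c, const double *a, double k, int n)
{
    for (int i = 0; i < n; i++)
        c[i] = k * a[i];
}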
Implicitly parallel programming

Pipelining
 Pipelining is the process of fetching the next instruction while the current instruction is being executed by the processor.

 Pipelining is the process of accumulating instructions from the processor through a pipeline. It allows storing and executing instructions in an orderly process. It is also known as pipeline processing.
VLIW Processor
 Very long instruction word (VLIW) describes a computer processing architecture in which a language compiler or pre-processor breaks program instructions down into basic operations that can be performed by the processor in parallel (that is, at the same time).

 These operations are put into a very long instruction word which the processor can then take apart without further analysis, handing each operation to an appropriate functional unit.
VLIW Processor
 VLIW is sometimes viewed as the next step beyond the reduced instruction set computing (RISC) architecture, which also works with a limited set of relatively basic instructions and can usually execute more than one instruction at a time (a characteristic referred to as superscalar).
VLIW Architecture

VLIW Processor
 Advantages of VLIW architecture
 Increased performance.
 Potentially scalable, i.e. more execution units can be added and so more instructions can be packed into the VLIW instruction.

 Disadvantages of VLIW architecture
 New programmers are needed.

 The program must keep track of instruction scheduling.
 Increased memory use.
 High power consumption.
Dichotomy of Parallel Computing Platforms

• First explore a dichotomy based on the logical and physical organization of parallel platforms.
• The logical organization refers to a programmer's view of the platform while the
physical organization refers to the actual hardware organization of the
platform.
• The two critical components of parallel computing from a programmer's perspective are ways of expressing parallel tasks and mechanisms for specifying interaction between these tasks.
• The former is sometimes also referred to as the control structure and the latter as the communication model.
Control Structure of Parallel Platforms
Parallel tasks can be specified at various levels of granularity. At one extreme, each program in a set of programs can be viewed as one parallel task. At the other extreme, individual instructions within a program can be viewed as parallel tasks. Between these extremes lies a range of models for specifying the control structure of programs and the corresponding architectural support for them.
Parallelism from single instruction on multiple processors
Consider the following code segment that adds two vectors:

for (i = 0; i < 1000; i++)
    c[i] = a[i] + b[i];

In this example, the various iterations of the loop are independent of each other; i.e., c[0] = a[0] + b[0];, c[1] = a[1] + b[1];, etc., can all be executed independently of each other. Consequently, if there is a mechanism for executing the same instruction, in this case add, on all the processors with appropriate data, we could execute this loop much faster.
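One way to express that mechanism in source code, shown here only as a hedged sketch, is an OpenMP directive that tells the compiler and runtime the iterations are independent, so they may be spread across processors and/or SIMD lanes; the function name and fixed size are assumptions for the example.

#define N 1000

/* One possible way (among several) to mark the iterations as independent.
 * Compile with an OpenMP-capable compiler, e.g. gcc -fopenmp; without
 * OpenMP support the pragma is simply ignored and the loop runs serially. */
void vector_add(double c[N], const double a[N], const double b[N])
{
    #pragma omp parallel for simd
    for (int i = 0; i < N; i++)
        c[i] = a[i] + b[i];
}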
Definitions
 Computation / Communication Ratio:
In parallel computing, granularity is a qualitative measure of the ratio of computation to communication.
– Periods of computation are typically separated from periods of communication by
synchronization events.
 Fine grain parallelism
 Coarse grain parallelism
Fine-grain Parallelism
• Relatively small amounts of computational work
are done between communication events
• Low computation to communication ratio
• Facilitates load balancing
• Implies high communication overhead and less
opportunity for performance enhancement
• If granularity is too fine it is possible that the
overhead required for communications and
synchronization between tasks takes longer than
the computation.
Coarse-grain Parallelism
 Relatively large amounts of
computational work are done between
communication/synchronization events
 High computation to communication
ratio
 Implies more opportunity for
performance increase
 Harder to load balance efficiently
Figure: A typical SIMD architecture (a) and a typical MIMD architecture (b).
Figure: Executing a conditional statement on an SIMD computer with four processors: (a) the conditional statement; (b) the execution of the statement in two steps.
Communication Model of Parallel Platforms
Shared-Address-Space Platforms
Typical shared-address-space architectures:
(a) Uniform-memory-access (UMA) shared-address-space computer:
 In this model, all the processors share the physical memory uniformly.
 All the processors have equal access time to all the memory words.
 Each processor may have a private cache memory. The same rule is followed for peripheral devices.
 When all the processors have equal access to all the peripheral devices, the system is called a symmetric multiprocessor.
 When only one or a few processors can access the peripheral devices, the system is called an asymmetric multiprocessor.
Communication Model of Parallel Platforms
Shared-Address-Space Platforms

Typical shared-address-space architectures:


(a) Uniform-memory-access (UMA) shared-address-space
computer;.

19
Communication Model of Parallel Platforms
Shared-Address-Space Platforms
Figure: Uniform-memory-access (UMA) shared-address-space computer with caches and memories.
Communication Model of Parallel Platforms
Shared-Address-Space Platforms
Non-uniform-memory-access (NUMA) shared-address-space computer with local memory only:

 In the NUMA multiprocessor model, the access time varies with the location of the memory word.
 Here, the shared memory is physically distributed among all the processors; these distributed memories are called local memories.
 The collection of all local memories forms a global address space which can be accessed by all the processors.
Communication Model of Parallel Platforms
Shared-Address-Space Platforms
Figure: Non-uniform-memory-access (NUMA) shared-address-space computer with local memory only.
Communication Model of Parallel Platforms
Shared-Address-Space Platforms
Cache-only memory access (COMA):
The COMA model is a special case of the NUMA model. Here, all the distributed main memories are converted to cache memories.
Physical Organization of Parallel Platforms
The Parallel Random Access Machine (PRAM) is the model assumed for most parallel algorithms. Here, multiple processors are attached to a single block of memory.

A PRAM model contains:

 A set of similar types of processors.

 All the processors share a common memory unit. Processors can communicate among themselves through the shared memory only.

 A memory access unit (MAU) connects the processors with the single shared memory.

PRAM:

Here, n processors can perform independent operations on n data items in a particular unit of time. This may result in simultaneous access to the same memory location by different processors.
PRAM Platforms
Architecture of an Ideal Parallel Computer (PRAM)
Exclusive-read, exclusive-write (EREW) PRAM. No two processors are allowed to read from or write to the same memory location at the same time (e.g. mutual exclusion).

Concurrent-read, exclusive-write (CREW) PRAM. In this class, multiple read accesses to a memory location are allowed, but multiple writes are not (e.g. websites, blogs).

Exclusive-read, concurrent-write (ERCW) PRAM. Multiple write accesses are allowed to a memory location, but multiple read accesses are serialized (e.g. developers with a DBA).

Concurrent-read, concurrent-write (CRCW) PRAM. This class allows multiple read and write accesses to a common memory location. This is the most powerful PRAM model (e.g. cloud services).
There are many methods to implement the PRAM model:

Shared memory model
Message passing model
Distributed memory model
1. Shared Memory Model

• The shared memory model emphasizes control parallelism more than data parallelism.
• In the shared memory model, multiple processes execute on different processors independently, but they share a common memory space.
• If any processor changes any memory location, the change is visible to the rest of the processors.
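A minimal Pthreads sketch of this model follows; the thread count, iteration count, and variable names are illustrative. All threads update a single variable in the common address space, and a lock provides the explicit synchronization discussed earlier.

#include <pthread.h>
#include <stdio.h>

#define NTHREADS 4                               /* illustrative thread count */

static long counter = 0;                         /* lives in the shared address space */
static pthread_mutex_t lock = PTHREAD_MUTEX_INITIALIZER;

static void *work(void *arg)
{
    (void)arg;
    for (int i = 0; i < 100000; i++) {
        pthread_mutex_lock(&lock);               /* explicit synchronization with a lock  */
        counter++;                               /* the change is visible to every thread */
        pthread_mutex_unlock(&lock);
    }
    return NULL;
}

int main(void)                                   /* compile with e.g. gcc -pthread */
{
    pthread_t t[NTHREADS];
    for (int i = 0; i < NTHREADS; i++)
        pthread_create(&t[i], NULL, work, NULL);
    for (int i = 0; i < NTHREADS; i++)
        pthread_join(t[i], NULL);
    printf("counter = %ld\n", counter);          /* prints counter = 400000 */
    return 0;
}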
Message-Passing Platforms

 The logical machine view of a message-passing platform consists of p processing nodes.

 On such platforms, interactions between processes running on different nodes must be accomplished using messages, hence the name message passing.

 This exchange of messages is used to transfer data and work, and to synchronize actions among the processes.

 In its most general form, the message-passing paradigm supports execution of a different program on each of the p nodes.
2. Message Passing Model

• Message passing is the most commonly used parallel programming approach in distributed memory systems.
• Here, the programmer has to determine the parallelism. In this model, all the processors have their own local memory unit and they exchange data through a communication network.
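A minimal MPI sketch of this model is shown below; the value sent and the file name used in the build command are illustrative. Each process (rank) has only its own local memory, so data moves between ranks exclusively through explicit send and receive calls.

#include <mpi.h>
#include <stdio.h>

/* Minimal message-passing sketch: rank 0 sends one integer to rank 1.
 * Typical build/run: mpicc msg.c -o msg && mpirun -np 2 ./msg */
int main(int argc, char **argv)
{
    int rank, value;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 42;                                              /* data lives only in rank 0 ... */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);      /* ... until explicitly sent      */
    } else if (rank == 1) {
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }

    MPI_Finalize();
    return 0;
}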
Distributed Memory
 Processors have their own local memory. Memory addresses in one processor do not map
to another processor, so there is no concept of global address space across all processors.
 Distributed memory systems require a communication network to connect inter-processor
memory.
 Because each processor has its own local memory, it operates independently.

 Changes it makes to its local memory have no effect on the memory of other
processors. Hence, the concept of cache coherency does not apply.
 When a processor needs access to data in another processor, it is usually the task of the
programmer to explicitly define how and when data is communicated.
 Synchronization between tasks is likewise the programmer's responsibility.

 The network "fabric" used for data transfer varies widely, though it can be as simple as Ethernet.
Distributed Memory
Interconnection Networks for Parallel Computers

 Interconnection networks carry data between processors and to memory.

 Interconnects are made of switches and links (wires, fiber).

 Interconnects are classified as static or dynamic.


 Static networks consist of point-to-point communication links among processing nodes and are also referred to as direct networks.

 Dynamic networks are built using switches and communication links. Dynamic networks are also referred to as indirect networks.
Interconnection Networks for Parallel Computers
Interconnection networks can be classified as static or dynamic.
Static networks consist of point-to-point communication links among processing nodes and are also referred to as direct networks.

Figure: Classification of interconnection networks: (a) a static network; and (b) a dynamic network.
Network Topology

 Static networks include the linear array, ring, tree, star, mesh, hypercube, etc.

 Dynamic networks include buses, crossbar switches, mesh networks, multistage networks, etc.
Linear Arrays
Linear arrays: (a) with no wrap around links; (b) with wraparound link.
• Tree-Based Networks: In this topology there is exactly one path between any pair of nodes.
• Trees can be static or dynamic.
• Static tree: each node of the tree is a processing element.
• Dynamic tree: intermediate nodes are switching nodes.

Figure: Complete binary tree networks: (a) a static tree network; and (b) a dynamic tree network.
A mesh is a network topology in which processing elements are arranged in a grid.
The row and column positions are used to denote a particular processor in the mesh network.
Figure: Two- and three-dimensional meshes: (a) 2-D mesh with no wraparound; (b) 2-D mesh with wraparound link (2-D torus); and (c) a 3-D mesh with no wraparound.
Figure: Construction of hypercubes from hypercubes of lower dimension.
N-wide superscalar architecture

Base Scalar Processor:
• It is defined as a machine with one instruction issued per cycle.
What does Superscalar mean?
• Common instructions (arithmetic, load/store, conditional branch) can be
initiated and executed independently in separate pipelines
—Instructions are not necessarily executed in the order in which
they appear in a program
—Processor attempts to find instructions that can be executed
independently, even if they are out-of-order
—Use additional registers and register renaming to eliminate some
dependencies
• Equally applicable to RISC & CISC
• Quickly adopted and now the standard approach for high-performance microprocessors
A 5-stage Pipeline

Figure: a five-stage pipeline (IF, ID, EXE, MEM, WB) connected to memory and the general-purpose registers.

IF = instruction fetch (includes PC increment)
ID = instruction decode + fetching values from general purpose registers
EXE = arithmetic/logic operations or address computation
MEM = memory access or branch completion
WB = write back results to general purpose registers
A 5-stage Pipeline
Stage 1 (Instruction Fetch)
In this stage the CPU reads instructions from the address in the memory whose
value is present in the program counter.
Stage 2 (Instruction Decode)
In this stage, instruction is decoded and the register file is accessed to get the values
from the registers used in the instruction.
Stage 3 (Instruction Execute)
In this stage, ALU operations are performed.
Stage 4 (Memory Access)
In this stage, memory operands are read and written from/to the memory that is
present in the instruction.
Stage 5 (Write Back)
In this stage, computed/fetched value is written back to the register present in the
instructions
Why Superscalar?

• Two main ideas:

—To execute instructions concurrently and independently in separate pipelines
—To improve the throughput of concurrent pipelines by allowing out-of-order execution
Superscalar Processors
▪ Pipelining: several instructions are simultaneously in flight, at different stages of their execution

▪ Superscalar: several instructions are simultaneously fetched and executed at the same stage of their execution

▪ Out-of-order execution: instructions can be executed in an order different from that


specified in the program
▪ Dependences between instructions:
– Data Dependence (Read after Write - RAW)
– Control dependence

▪ Speculative execution: tentative execution despite dependencies


N-wide superscalar architecture:
❖ A superscalar architecture is called an N-wide architecture if it supports fetching and dispatching of n instructions in every cycle.

Figure: N-wide superscalar architecture with multiple functional units.
Multi-core Processors
Introduction: What is Processor?

A processor is the logic circuitry that responds to and processes the basic instructions that drive a computer. The term processor has generally replaced the term central processing unit (CPU). The processor in a personal computer, or embedded in small devices, is often called a microprocessor.
What Is Core?
• Actually, a CORE is the part of something that is central to its
existence or character.
• Similarly, in a computer system the CPU is referred to as the core.
• Basically, there are two types of core processor:
1. Single Core Processor
2. Multi Core Processor
Single Core processor
It is a processor that has only one core, so it can only start one operation at a time. It can
however in some situations start a new operation before the previous one is complete.
Originally all processors were single core. Examples are Intel Pentium 4 670, AMD
Athlon 64 FX-55.



Single-core architectures:
Multi Core Processor

• A multi-core processor is one which combines two or more independent processors into a single package, often a single integrated circuit. Examples are the Intel Core i7, Intel Core 2 Duo, Intel Core i5, i3, etc.


Multi-core architectures:
Applications of Multicore

• 3D Gaming
• Database servers
• Multimedia applications
• Video editing
• Powerful graphics solution
• Encoding
• Computer Aided Design (CAD)



EXAMPLES
 dual-core processor with 2 cores
e.g. AMD Phenom II X2, Intel Core 2 Duo E8500

 quad-core processor with 4 cores
e.g. AMD Phenom II X4, Intel Core i5 2500T

 hexa-core processor with 6 cores
e.g. AMD Phenom II X6, Intel Core i7 Extreme Ed. 980X

 octa-core processor with 8 cores
e.g. AMD FX-8150, Intel Xeon E7-2820
