
Multiprocessors - Parallel Processing Overview


“The real world is inherently concurrent; yet our computational
perception of it has been strained through 300 years of basically
sequential mathematics, 50 years of sequential algorithm
development, and 15 years of sequential FORTRAN programming. Is
it any wonder that those searching for parallelism or concurrency in
a FORTRAN do-loop cannot find much?”

Thurber and Patton


IEEE Transactions on Computers, 1973
Introduction

This course has concentrated on single-processor architectures and techniques to improve their performance:

– Efficient hardware implementations.


– Enhanced processor operation through
pipelined instruction execution and multiplicity of
functional units.
– Memory hierarchy.
– Control unit design.
– I/O operations.

Through these techniques and implementation improvements, the processing power of a computer system has increased by an order of magnitude every 5 years.
We are approaching performance bounds due to physical
limitations of the hardware.

The von Neumann design is ultimately limited by component and signal speeds. (The speed of light is starting to be a limiting factor!)
In the early days of computing, the best way to increase
the speed of a computer was to use faster logic devices.

However, this approach is not always possible today, since we are approaching the physical limits.

As device-switching times grow shorter, propagation delay becomes significant.

Logic signals travel at the speed of light, approximately 30 cm/nsec in a vacuum. If two devices are one meter apart, the propagation delay is approximately 3.3 nsec. (Actually slightly longer, because the medium is not a vacuum but gold, copper, …)
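A back-of-envelope sketch of that calculation (a minimal C program; the 30 cm/nsec figure and the 1 m distance are simply the values quoted above):

    /* Propagation delay sketch: signal speed ~30 cm/nsec, devices 1 m apart. */
    #include <stdio.h>

    int main(void) {
        const double speed_m_per_ns = 0.30;  /* ~speed of light: 30 cm per nsec */
        const double distance_m     = 1.0;   /* two devices one meter apart */
        printf("delay = %.2f nsec\n", distance_m / speed_m_per_ns);  /* ~3.33 nsec */
        return 0;
    }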
In the 1960s, switching speed was 10-100 nsec.

In the 1990s, switching speed is typically measured in picoseconds (10^-12 sec, or 10^-3 nsec).

Also, we are running into packing density problems for the logic gates in chips - how close transistors, etc. can be packed together on the processor chip. When transistors are closer together, signals have less distance to travel and so operations may be performed more quickly.
The important factor is the size of the logic gate (transistor) - smaller gates give greater logic density.

The logic circuits of a processor comprise millions of logic gates, which need to be switched open or closed millions of times a second.

The smaller the gate, the faster it works.

Similar reasoning applies to the interconnecting “wires”.

Today’s gates are about 0.25 x 10^-6 m wide.

Researchers have already reduced the size to 0.06 x 10^-6 m wide - this is about 100 silicon atoms wide!

The technology for doing this is reaching a limit.


If Moore’s law stays true, then by about 2015 logic gates will be about one atom wide!

Increasing the clock rate also introduces problems - the heat generated by the logic gates increases as well.

>> Following Moore’s law, there should be 5 GHz processors by the year 2005.
>> But they will require a bath of liquid nitrogen to keep them cool!
>> Alternatively, your PC may also be used as a space heater or a barbecue grill!
Then how can we build faster computers?
- The question is this: how can we put N processors to work on a single problem and achieve a speed increase of O(N) (i.e. a linear increase in speed)?

Two sub-questions:
- How do we interconnect the processors?

- How do we program them?


In a plot of speedup versus the number of processors, real programs achieve less than the perfect linear speedup.
At any given level of performance, the demand for higher-performance machines has existed, and will continue to exist
– Perform computationally intensive applications faster and more accurately than ever before.

Different approaches are possible:

1. Improve the basic performance of a single processor machine:
» Architecture / organization improvements
» Implementation improvements
E.g.
- SSI => VLSI => ULSI
- Clock speed
- Packaging
2. Multiple processor system architectures:
Three general categories -
» Tightly-coupled system
» Loosely-coupled system
» Distributed computing system

i.e. dealing with architectures involving a number (possibly large) of processors instead of just one.

Parallel processing has been around for several decades. There are a number of competing designs.

Basically, a parallel computer is a collection of processing elements that cooperate to solve large problems fast.

Some broad issues:


Resource Allocation:
- how large a collection of processors?
- how powerful are the elements?
- how much memory?

Data access, communication and synchronization:
- how do the elements cooperate and communicate?
- how are data transmitted between processors?
- what are the abstractions and primitives for
cooperation?
Performance and Scalability
- how does it all translate into performance?
- how does it scale?
Inevitability of Parallel Computing

Application demands: Our insatiable need for processor cycles:
- Scientific computing: Biology, Chemistry, Physics, ...
- General-purpose computing: Video, Graphics, CAD,
Databases, TP, ...
Technology Trends
- Number of transistors on chip growing rapidly
- Clock rates expected to go up only slowly
Architecture Trends
- Instruction-level parallelism valuable but limited
- Coarser-level parallelism, as in multiprocessors,
the most viable approach
Economics
Current trends:

Today’s microprocessors have multiprocessor support.

Servers, workstations, and PCs are becoming multiprocessors: Sun, SGI, COMPAQ, Dell, ...

Tomorrow’s microprocessors are multiprocessors.
Large parallel machines are a mainstay in many
industries:
- Petroleum (reservoir analysis)
- Automotive (crash simulation, drag analysis,
combustion efficiency)
- Aeronautics (airflow analysis, engine efficiency,
structural mechanics)
- Computer-aided design
- Pharmaceuticals (molecular modeling)
- Visualization
» in all of the above
» entertainment (e.g. special effects in films)
» architecture (walk-throughs and rendering)
- Financial modeling (yield and derivative analysis)
- etc.
Summary of Application Trends

• Transition to parallel computing has occurred for scientific and engineering computing
• Also rapid progress in commercial computing
– Database and transactions as well as financial
– Usually smaller-scale, but large-scale systems also
used
• Desktops also use multithreaded programs, which are a
lot like parallel programs
• Demand for improving throughput on sequential
workloads
– Greatest use of small-scale multiprocessors
• Solid application demand exists and will increase
Summary: Why Parallel Architectures?

• Increasingly attractive
– Economics, technology, architecture, application
demand
• Increasingly central and mainstream
• Parallelism exploited at many levels
– Instruction-level parallelism
– Thread-level parallelism within a microprocessor
– Multiprocessor servers
– Large-scale multiprocessors (“MPPs” – massively parallel processors)
• Same story from memory system perspective
– Increase bandwidth, reduce average latency with many local memories

• Wide range of parallel architectures make sense
– Different cost, performance and scalability
Multiple Processor Systems

There are a number of ways of categorizing parallel systems.

1. Tightly-coupled systems
– Multiple processors
– Shared, common (global) memory system
– Processors under the integrated control of a
common operating system
– Data is exchanged between processors by
accessing common shared variable locations in
memory
– Common shared memory ultimately presents an
overall system bottleneck that effectively limits the sizes
of these systems to a fairly small number of processors
(dozens).
Tightly Coupled - SMP
• Processors share memory

• Communicate via that shared memory

• Symmetric Multiprocessor (SMP)


– Share single memory or pool
– Shared bus to access memory
– Memory access time to given area of memory is
approximately the same for each processor
Tightly Coupled - NUMA
• Non-uniform memory access
• Access times to different regions of memory may differ
2. Loosely-coupled systems

– Multiple processors
– Each processor has its own independent
(local, private) memory system
– Processors under the integrated control of a
common operating system
– Data exchanged between processors via
inter-processor messages

Obviously, there is a need for the processors to cooperate, giving rise to problems such as communication, synchronization (coordination), etc.
Loosely Coupled - Clusters
• Collection of independent uniprocessors or SMPs

• Interconnected to form a cluster

• Communication via fixed path or network connections
3. Distributed computing systems

– Collections of relatively autonomous computers, each capable of independent operation.
– Example: local area network of computer
workstations
» Each machine is running its own “copy” of
the operating system.
» Some tasks are done on different machines
(e.g. mail server is on one machine)
» Supports multiple independent users
» Load balancing between machines can
cause a user’s job on one machine to be shifted to
another.
Performance bounds

– For a system with N processors, we would like a net processing speedup (meaning lower overall execution time) of nearly N times when compared to the performance of a similar uniprocessor system.

– A number of pessimistic performance “upper bounds” have been proposed over the years
» Maximum speedup of O(log N)
» Maximum speedup of O(N / log N)

– These “bounds” were based on runtime performance of applications and were not necessarily valid in all cases

– They reinforced the computer industry’s hesitancy to “get into” parallel processing
Parallel Processors

The tightly-coupled and loosely-coupled architectures can have 10s, 100s, or 1000s of processors.

These machines are the true parallel processors (also called concurrent processors).
A taxonomy of parallel computers.
Parallel computers fall into Flynn’s taxonomy classes
of SIMD and MIMD systems:
– SIMD: single instruction stream and multiple
data streams - the same operations (instructions)
are performed on all data items.
It is also possible for a single processor to perform the same instruction on a large set of data items. In this case, parallelism is achieved by pipelining:
- one set of operands starts through the pipeline, and
- before the computation is finished on this set of operands, another set of operands starts flowing through the pipeline.
– MIMD: multiple instruction streams and
multiple data streams. Several complete processors
connected together to form a multiprocessor system.
The processors are connected together via an
interconnection network to provide a means of
cooperation during the computation.

The processors need not be identical.

Can handle a greater variety of tasks than an array processor.
The MOMS taxonomy of parallel machines

The SIMD/MIMD taxonomy leaves something to be desired, since there are many subclasses of MIMD that do not appear in the model, and one class (MISD) that appears in the model but not in real life. (Or does it??!!)

Gustafson (1990) proposed the following taxonomy, which yields the following classifications:

- MOMS (monolithic operations, monolithic storage) - conventional computers.
- MODS (monolithic operations, distributed storage)
- DOMS (distributed operations, monolithic storage)
- DODS (distributed operations, distributed storage)
Memory access

A common design in tightly coupled systems is for all processors to reference a common memory, or set of memories.

If the access time to memory does not depend on which part of memory the data is stored in, it is called a uniform memory access (UMA) design.
A common design in loosely coupled systems is that each processor directly accesses only its own memory:

Communication between nodes is by message passing instead of by shared memory, using Send() and Receive() commands.
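A minimal sketch of this style of cooperation, here assuming the MPI library is available (the generic Send()/Receive() above correspond to MPI_Send()/MPI_Recv()):

    /* Message-passing sketch using MPI (assumed): node 0 sends a value to node 1. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            value = 42;                                    /* data held only by node 0 */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("node 1 received %d\n", value);
        }
        MPI_Finalize();
        return 0;
    }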
A hybrid between UMA and message-passing
architectures is to allow a processor to access a
distant memory, even though it will take more time
to do so. This is called a non-uniform memory access
(NUMA) design.
SIMDs (Array and vector processors)

SIMDs consist of a large number of processors performing the same computation (i.e. executing the same program) on large arrays of data.

Computation is controlled by lock-step synchronization, with processors starting and stopping together.

Multiple processors operate in parallel to perform the same operation, under the guidance of a common control unit, which issues global control signals.
SIMDs may be

• word-organized (word-slice SIMD) - processors operate on words

• bit-organized (bit-slice SIMD) - processors operate on bit(s).
E.g. image processing - pixels = array of bits.
SIMD performs same operation on each pixel at great
speed.
SIMDs are well-suited to numerical problems that
may be expressed in vector or array format.

Unlike vector processors, which typically achieve high performance via pipelining, array processors provide extensive parallelism by replication of processors.
SIMD - Single “control unit” computer and an array
of “computational” computers.

Control unit executes control-flow instructions and scalar operations, and passes vector instructions to the processor array.

Processor instruction types:


– Extensions of scalar instructions
» Adds, stores, multiplies, etc.
become vector operations executed in all processors
concurrently

– Must add the ability to transfer vector and scalar data between processors to the instruction set - attributes of a “parallel language”
SIMD Examples

Vector addition:
for I = 1 to n do
C(I) = A(I) + B(I)

– Complexity O(n) in SISD systems
– Complexity O(1) in SIMD systems
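A sketch of the same loop written so a compiler can map it onto SIMD hardware (assuming an OpenMP-capable C compiler; on a true SIMD array each processing element would handle one element):

    /* Element-wise vector addition, annotated for SIMD execution. */
    void vector_add(const float *a, const float *b, float *c, int n) {
        #pragma omp simd
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];   /* every lane performs the same operation */
    }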
Matrix multiply:
– A, B, and C are n-by-n matrices
– Compute C= A * B

– Complexity O(n^3) in SISD systems
» n^2 dot products, each of which is O(n)

– Complexity O(n^2) in SIMD systems
» Perform n dot products in parallel across the array
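A sketch of the C = A * B computation (row-major n-by-n matrices) with the n dot products of each row computed concurrently; this assumes OpenMP and is only illustrative of how the work maps onto parallel elements:

    /* n*n dot products; the inner 'j' loop's n dot products run in parallel. */
    void matmul(int n, const float *A, const float *B, float *C) {
        for (int i = 0; i < n; i++) {
            #pragma omp parallel for
            for (int j = 0; j < n; j++) {       /* n dot products in parallel */
                float sum = 0.0f;
                for (int k = 0; k < n; k++)
                    sum += A[i*n + k] * B[k*n + j];
                C[i*n + j] = sum;
            }
        }
    }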
Image smoothing:

– Smooth an n-by-n pixel image to reduce “noise”.

– Each pixel is replaced by the average of itself and its 8 nearest neighbors.

– Complexity O(n^2) in SISD systems.
– Complexity O(n) in SIMD systems
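A sequential C sketch of the smoothing step (borders skipped for brevity); a SIMD machine would assign one pixel, or one block of pixels, to each processing element:

    /* 3x3 smoothing: each interior pixel becomes the average of itself
       and its 8 nearest neighbors. */
    void smooth(int n, const unsigned char *in, unsigned char *out) {
        for (int i = 1; i < n - 1; i++)
            for (int j = 1; j < n - 1; j++) {
                int sum = 0;
                for (int di = -1; di <= 1; di++)
                    for (int dj = -1; dj <= 1; dj++)
                        sum += in[(i + di) * n + (j + dj)];
                out[i * n + j] = (unsigned char)(sum / 9);
            }
    }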


The Pentium architecture incorporates a form of SIMD - multimedia extensions (MMX) - a SIMD unit operating in conjunction with the Pentium superscalar processor.

It has 57 new instructions, e.g. a high-speed multiply-accumulate function which computes the sum of a series of products.

Such a computation is a key element of several digital signal processing (DSP) and multimedia applications.

It has eight 64-bit wide registers and four new data types - packed bytes, words, etc.

It can perform a calculation simultaneously on two, four or eight data elements by packing multiple operands into a single 64-bit SIMD register and operating on them all in parallel, in a single clock cycle.

E.g. eight 8-bit graphics pixels can be packed into a SIMD register and operated on simultaneously.
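A sketch of this packed-byte style of operation, assuming a compiler that provides the MMX intrinsics in <mmintrin.h> (treat the example as illustrative, not as the exact instruction set described above):

    /* Eight 8-bit pixels packed into one 64-bit register, added in one operation. */
    #include <mmintrin.h>

    void add_pixels(const __m64 *a, const __m64 *b, __m64 *c) {
        *c = _mm_add_pi8(*a, *b);   /* eight byte-wide additions at once */
        _mm_empty();                /* clear MMX state before any floating-point code */
    }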
MIMD - Multiprocessors

MIMD systems differ from SIMD ones in that the “lock-step” operation requirement is removed.

Each processor has its own control unit and can execute an independent stream of instructions
– Rather than forcing all processors to perform the same task at the same time, processors can be assigned different tasks that, when taken as a whole, complete the assigned application.

MIMD systems comprise a large number of independent processors executing different instruction streams (i.e. different programs) on separate data.

Cooperation is via two distinct approaches:

• shared memory (although there may be some local memory) for holding data and for communication purposes.

• message passing (no shared memory) via interconnections
A multiprocessor with 16 CPUs sharing a common memory
A multicomputer with 16 CPUs, each with its own private memory
A taxonomy of parallel computers.
Interconnection structures

Multiprocessors comprise a large number of CPUs, IOPs and MMUs.

Need to
• pass information (commands, data) between processors
• connect CPUs and IOPs to memory banks
• connect CPUs to IOPs

How are they connected? Via buses or interconnection networks.

There are a number of different physical configurations, e.g. ring, mesh, tree, hypercube.


Various design factors dictate applicability of any
given configuration:
• communication diameter - the number of processors that commands/data must pass through to reach the destination
• expandability
• redundancy - the number of different paths available for commands/data to reach the destination
• routing
• bandwidth - the capacity in terms of bytes per sec
• connection degree - the number of paths leaving a processor
Keys to high MIMD performance are
– Process synchronization

– Process scheduling

Process synchronization is concerned with methods for allowing processes shared access to resources and communication, avoiding deadlock situations.

Note: Deadlock - an insidious state of a computer system in which some or all programs are blocked from progressing due to lack of resources.
Process scheduling can be performed
– By the programmer through the use of
parallel language constructs
» Specify a priori what processes will
be instantiated and where they will be executed.

– During program execution, by spawning processes off for execution on available processors.
» Fork-join construct in some languages.

Keep all processors busy and not suspended awaiting data from another processor.
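A minimal sketch of the fork-join idea using POSIX threads (assumed available); real schedulers spawn many such workers across the available processors:

    /* Fork a worker thread, do other work, then join. */
    #include <pthread.h>
    #include <stdio.h>

    static void *worker(void *arg) {
        printf("worker %ld running\n", (long)arg);   /* the spawned task's work */
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, worker, (void *)1L);  /* fork */
        /* ... the parent can do useful work here instead of sitting idle ... */
        pthread_join(t, NULL);                         /* join */
        return 0;
    }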
Shared memory

Classified as

• with uniform memory access - UMA.
Each processor has the same path length and equal access time to memory.

• with non-uniform memory access - NUMA.
Distributed shared memory. Non-uniform access time for each processor - larger path lengths for some processors.

• with cache-only memory access - COMA.

All have a cache problem concerning data consistency - cache coherence - which complicates the design.

Shared data may reside in several caches.

When any processor updates its cached value, all other caches must either update their values or invalidate them.
• Write-through with update protocol - when a cache is updated, so too is memory. The update is broadcast to all other processors, which update their own caches on receipt.
An alternative is write-through with invalidation - the broadcast causes other caches to invalidate “stale” data.

• Write-back protocol - when a cache is updated, the processor obtains exclusive ownership of the block, and all other copies, including the memory copy, are invalidated.
When another processor wishes to read the block, the updated block is sent by the current owner, and the memory copy is also updated.
• Snoopy caches - a special cache controller constantly monitors (snoops on) the (single) bus for write operations.
If the address of a write operation matches an address that the cache holds, the cache invalidates that entry.
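A conceptual sketch (not a real controller) of the snoopy write-invalidate idea; the types and names are invented for illustration:

    /* On seeing a bus write to an address it holds, a cache marks its copy invalid. */
    typedef enum { INVALID, VALID } line_state_t;

    typedef struct {
        unsigned long tag;      /* address (tag) cached in this line */
        line_state_t  state;
    } cache_line_t;

    /* Called by the snooping controller for every write observed on the shared bus. */
    void snoop_write(cache_line_t *line, unsigned long written_tag) {
        if (line->state == VALID && line->tag == written_tag)
            line->state = INVALID;   /* our copy is now stale */
    }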
Message-passing computers

Described in terms of granularity:

Granularity = computation / communication

which is measured as
• coarse - more computation than communication
• medium
• fine - more communication than computation

Problems include
• message routing
• synchronization for message passing - send and receive primitives
• deadlock
Interconnection networks

Once the number of processors starts increasing, a bus-based design becomes impractical. Why?

Therefore, some kind of interconnection network must be used to connect the processors
- to the memories (if a shared-memory design is used), or
- to the other processors (if a message-passing design is used).
Connection Topologies for Parallel Processors

Parallel processors may be interconnected in a number of different ways. For example:

As a bus:
As a ring:
As a star:
As a mesh/grid:
As a (hyper)cube:
Parallel computing in the Sega home
video game system

System bus model of the Sega Genesis architecture
Multiprocessing metrics

Speedup

Speedup factor, S = (execution time for 1 processor) / (execution time for N processors)

Might intuitively assume linear speedup - execution time decreases in proportion to the number of processors.

Amdahl’s law - concerns parallel computation with a serial component.
(a) A program has a sequential part and a parallelizable part.
(b) Effect of running part of the program in parallel.

f - fraction of the computation which must be executed serially
t - execution time on a single processor

Time with N processors = f * t + [(1 - f) * t] / N

Speedup, S = t / ( f * t + [(1 - f) * t] / N )
           = N / ( 1 + (N - 1) * f )
For a significant increase in speed to be realized, the fraction of serial computation needs to be very small.

With a large number of processors, the maximum speedup is limited to 1/f.
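A small sketch of the formula (the example values are illustrative only):

    /* Amdahl's law: S(N) = N / (1 + (N - 1) * f), f = serial fraction. */
    #include <stdio.h>

    double amdahl_speedup(double f, int n) {
        return n / (1.0 + (n - 1) * f);
    }

    int main(void) {
        /* e.g. f = 0.05, N = 100 gives S of about 16.8; the limit as N grows is 1/f = 20 */
        printf("S = %.1f\n", amdahl_speedup(0.05, 100));
        return 0;
    }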
System examples

MIMD

- NASA has developed a parallel system comprising a network of PCs, called a Beowulf cluster. Such a system with 600 computers was rated at 233 gigaflops!

- An IBM 8192-processor system called ASCI White, built for the DoE, USA. This system is rated at 12.3 teraflops - that is, 12.3 x 10^12 floating-point operations per sec. The computers used are based on the IBM RS/6000 family.
Problems

– Hardware is relatively easy to build.
– Massively parallel systems just take massive amounts of money to build.
– How should/can the large numbers of
processors be interconnected?
– The real trick is building software that will
exploit the capabilities of the system

Reality check

– Outside of a limited number of high-profile applications, parallel processing is still a “young” discipline
– Parallel software is still fairly sparse
– Risky for companies to adopt parallel
strategies - they just wait for the next new SISD system!
