
Multiprocessors - Parallel Processing Overview


“The real world is inherently concurrent; yet our computational
perception of it has been strained through 300 years of basically
sequential mathematics, 50 years of sequential algorithm
development, and 15 years of sequential FORTRAN programming. Is
it any wonder that those searching for parallelism or concurrency in
a FORTRAN do-loop cannot find much?”

Thurber and Patton


IEEE Transactions on Computers, 1973
Introduction

This course has concentrated on single-processor architectures and techniques to improve their performance:

– Efficient hardware implementations.


– Enhanced processor operation through
pipelined instruction execution and multiplicity of
functional units.
– Memory hierarchy.
– Control unit design.
– I/O operations.

Through these techniques and implementation improvements, the processing power of a computer system has increased by an order of magnitude every 5 years.
We are approaching performance bounds due to physical
limitations of the hardware.

The von Neumann design is ultimately limited by component and signal speeds. (The speed of light is starting to be a limiting factor!)
In the early days of computing, the best way to increase
the speed of a computer was to use faster logic devices.

However, this approach is not always possible today, since we are approaching the physical limits.

As device-switching times grow shorter, propagation delay becomes significant.

Logic signals travel at the speed of light, approximately 30 cm/nsec in a vacuum. If two devices are one meter apart, the propagation delay is approximately 3.3 nsec. (Actually slightly longer, because the medium is not a vacuum but gold, copper, …)
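A back-of-envelope sketch of that calculation (a minimal C program; the 30 cm/nsec figure and the 1 m distance are simply the values quoted above):

    /* Propagation delay sketch: signal speed ~30 cm/nsec, devices 1 m apart. */
    #include <stdio.h>

    int main(void) {
        const double speed_m_per_ns = 0.30;  /* ~speed of light: 30 cm per nsec */
        const double distance_m     = 1.0;   /* two devices one meter apart */
        printf("delay = %.2f nsec\n", distance_m / speed_m_per_ns);  /* ~3.33 nsec */
        return 0;
    }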
In the 1960s, switching speed was 10-100 nsec.

In the 1990s, switching speed is typically measured in picoseconds (10^-12 sec, or 10^-3 nsec).

Also, we are running into packing density problems for the logic gates in chips - how close transistors, etc. can be packed together on the processor chip. When transistors are closer together, signals have less distance to travel and so operations may be performed more quickly.
The important factor is the size of the logic gate (transistor) - smaller gates give greater logic density.

The logic circuits of a processor comprise millions of logic gates, which need to be switched open or closed millions of times a second.

The smaller the gate, the faster it works.

Similar reasoning applies to the interconnecting “wires”.

Today’s gates are about 0.25 x 10^-6 m wide.

Researchers have already reduced the size to 0.06 x 10^-6 m wide - this is about 100 silicon atoms wide!

The technology for doing this is reaching a limit.


If Moore’s law stays true, then by about 2015 logic gates will be about one atom wide!

Increasing the clock rate also introduces problems - the heat generated by the logic gates increases as well.

>> Following Moore’s law, there should be 5 GHz processors by the year 2005.
>> But they will require a bath of liquid nitrogen to keep them cool!
>> Alternatively, your PC may also be used as a space heater or a barbecue grill!
Then how can we build faster computers?
- The question is this: how can we put N processors to work on a single problem and achieve a speed increase of O(N) (i.e. a linear increase in speed)?

Two sub-questions:
- How do we interconnect the processors?

- How do we program them?


In a plot of speedup versus the number of processors, real programs achieve less than the perfect linear speedup.
At any given level of performance, the demand for higher-performance machines has existed, and will continue to exist
– Perform computationally intensive applications faster and more accurately than ever before.

Different approaches are possible:

1. Improve the basic performance of a single processor machine:
» Architecture / organization improvements
» Implementation improvements
E.g.
- SSI => VLSI => ULSI
- Clock speed
- Packaging
2. Multiple processor system architectures:
Three general categories -
» Tightly-coupled system
» Loosely-coupled system
» Distributed computing system

i.e. dealing with architectures involving a number (possibly large) of processors instead of just one.

Parallel processing has been around for several decades. There are a number of competing designs.

Basically, a parallel computer is a collection of processing elements that cooperate to solve large problems fast.

Some broad issues:


Resource Allocation:
- how large a collection of processors?
- how powerful are the elements?
- how much memory?

Data access, communication and synchronization:
- how do the elements cooperate and communicate?
- how are data transmitted between processors?
- what are the abstractions and primitives for
cooperation?
Performance and Scalability
- how does it all translate into performance?
- how does it scale?
Inevitability of Parallel Computing

Application demands: Our insatiable need for processor cycles:
- Scientific computing: Biology, Chemistry, Physics, ...
- General-purpose computing: Video, Graphics, CAD,
Databases, TP, ...
Technology Trends
- Number of transistors on chip growing rapidly
- Clock rates expected to go up only slowly
Architecture Trends
- Instruction-level parallelism valuable but limited
- Coarser-level parallelism, as in multiprocessors,
the most viable approach
Economics
Current trends:

Today’s microprocessors have multiprocessor support.

Servers, workstations, and PCs are becoming multiprocessors: Sun, SGI, COMPAQ, Dell, ...

Tomorrow’s microprocessors are multiprocessors.
Large parallel machines are a mainstay in many
industries:
- Petroleum (reservoir analysis)
- Automotive (crash simulation, drag analysis,
combustion efficiency)
- Aeronautics (airflow analysis, engine efficiency,
structural mechanics)
- Computer-aided design
- Pharmaceuticals (molecular modeling)
- Visualization
» in all of the above
» entertainment (e.g. special effects in films)
» architecture (walk-throughs and rendering)
- Financial modeling (yield and derivative analysis)
- etc.
Summary of Application Trends

• Transition to parallel computing has occurred for scientific and engineering computing
• Also rapid progress in commercial computing
– Database and transactions as well as financial
– Usually smaller-scale, but large-scale systems also
used
• Desktops also use multithreaded programs, which are a
lot like parallel programs
• Demand for improving throughput on sequential
workloads
– Greatest use of small-scale multiprocessors
• Solid application demand exists and will increase
Summary: Why Parallel Architectures?

• Increasingly attractive
– Economics, technology, architecture, application
demand
• Increasingly central and mainstream
• Parallelism exploited at many levels
– Instruction-level parallelism
– Thread-level parallelism within a microprocessor
– Multiprocessor servers
– Large-scale multiprocessors (“MPPs” – massively parallel processors)
• Same story from memory system perspective
– Increase bandwidth, reduce average latency with many local memories

• Wide range of parallel architectures make sense
– Different cost, performance and scalability
Multiple Processor Systems

There are a number of ways of categorizing parallel systems.

1. Tightly-coupled systems
– Multiple processors
– Shared, common (global) memory system
– Processors under the integrated control of a
common operating system
– Data is exchanged between processors by
accessing common shared variable locations in
memory
– Common shared memory ultimately presents an
overall system bottleneck that effectively limits the sizes
of these systems to a fairly small number of processors
(dozens).
Tightly Coupled - SMP
• Processors share memory

• Communicate via that shared memory

• Symmetric Multiprocessor (SMP)


– Share single memory or pool
– Shared bus to access memory
– Memory access time to given area of memory is
approximately the same for each processor
Tightly Coupled - NUMA
• Non-uniform memory access
• Access times to different regions of memory may differ
2. Loosely-coupled systems

– Multiple processors
– Each processor has its own independent
(local, private) memory system
– Processors under the integrated control of a
common operating system
– Data exchanged between processors via
inter-processor messages

Obviously, there is a need for the processors to cooperate, giving rise to problems such as communication, synchronization (coordination), etc.
Loosely Coupled - Clusters
• Collection of independent uniprocessors or SMPs

• Interconnected to form a cluster

• Communication via fixed path or network connections
3. Distributed computing systems

– Collections of relatively autonomous computers, each capable of independent operation.
– Example: local area network of computer
workstations
» Each machine is running its own “copy” of
the operating system.
» Some tasks are done on different machines
(e.g. mail server is on one machine)
» Supports multiple independent users
» Load balancing between machines can
cause a user’s job on one machine to be shifted to
another.
Performance bounds

– For a system with N processors, we would like a net processing speedup (meaning lower overall execution time) of nearly N times when compared to the performance of a similar uniprocessor system.

– A number of pessimistic performance “upper bounds” have been proposed over the years
» Maximum speedup of O(log N)
» Maximum speedup of O(N / log N)

– These “bounds” were based on runtime performance of applications and were not necessarily valid in all cases

– They reinforced the computer industry’s hesitancy to “get into” parallel processing
Parallel Processors

The tightly-coupled and loosely-coupled architectures can have 10s, 100s, or 1000s of processors.

These machines are the true parallel processors (also called concurrent processors).
A taxonomy of parallel computers.
Parallel computers fall into Flynn’s taxonomy classes
of SIMD and MIMD systems:
– SIMD: single instruction stream and multiple
data streams - the same operations (instructions)
are performed on all data items.
It is also possible for a single processor to perform the same instruction on a large set of data items. In this case, parallelism is achieved by pipelining:
- one set of operands starts through the pipeline, and
- before the computation is finished on this set of operands, another set of operands starts flowing through the pipeline.
– MIMD: multiple instruction streams and
multiple data streams. Several complete processors
connected together to form a multiprocessor system.
The processors are connected together via an
interconnection network to provide a means of
cooperation during the computation.

The processors need not be identical.

Can handle a greater variety of tasks than an array processor.
The MOMS taxonomy of parallel machines

The SIMD/MIMD taxonomy leaves something to be desired, since there are many subclasses of MIMD that do not appear in the model, and one class (MISD) that appears in the model but not in real life. (Or does it??!!)

Gustafson (1990) proposed the following taxonomy, which yields the following classifications:

- MOMS (monolithic operations, monolithic storage) - conventional computers.
- MODS (monolithic operations, distributed storage)
- DOMS (distributed operations, monolithic storage)
- DODS (distributed operations, distributed storage)
Memory access

A common design in tightly coupled systems is for all processors to reference a common memory, or set of memories.

If the access time to memory does not depend on which part of memory the data is stored in, it is called a uniform memory access (UMA) design.
A common design in loosely coupled systems is that each processor directly accesses only its own memory:

Communication between nodes is by message passing instead of by shared memory, using Send() and Receive() commands.
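A minimal sketch of this style of cooperation, here assuming the MPI library is available (the generic Send()/Receive() above correspond to MPI_Send()/MPI_Recv()):

    /* Message-passing sketch using MPI (assumed): node 0 sends a value to node 1. */
    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        int rank, value;
        MPI_Init(&argc, &argv);
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);
        if (rank == 0) {
            value = 42;                                    /* data held only by node 0 */
            MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
        } else if (rank == 1) {
            MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
            printf("node 1 received %d\n", value);
        }
        MPI_Finalize();
        return 0;
    }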
A hybrid between UMA and message-passing
architectures is to allow a processor to access a
distant memory, even though it will take more time
to do so. This is called a non-uniform memory access
(NUMA) design.
SIMDs (Array and vector processors)

SIMDs consist of a large number of processors performing the same computation (i.e. executing the same program) on large arrays of data.

Computation is controlled by lock-step synchronization, with processors starting and stopping together.

Multiple processors operate in parallel to perform the same operation, under the guidance of a common control unit, which issues global control signals.
SIMDs may be

• word-organized (word-slice SIMD) - processors operate on words

• bit-organized (bit-slice SIMD) - processors operate on bit(s).
E.g. image processing - pixels = array of bits.
SIMD performs same operation on each pixel at great
speed.
SIMDs are well-suited to numerical problems that
may be expressed in vector or array format.

Unlike vector processors, which typically achieve high performance via pipelining, array processors provide extensive parallelism by replication of processors.
SIMD - Single “control unit” computer and an array
of “computational” computers.

Control unit executes control-flow instructions and scalar operations, and passes vector instructions to the processor array.

Processor instruction types:


– Extensions of scalar instructions
» Adds, stores, multiplies, etc.
become vector operations executed in all processors
concurrently

– Must add the ability to transfer vector and scalar data between processors to the instruction set - attributes of a “parallel language”
SIMD Examples

Vector addition:
for I = 1 to n do
C(I) = A(I) + B(I)

– Complexity O(n) in SISD systems
– Complexity O(1) in SIMD systems
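A sketch of the same loop written so a compiler can map it onto SIMD hardware (assuming an OpenMP-capable C compiler; on a true SIMD array each processing element would handle one element):

    /* Element-wise vector addition, annotated for SIMD execution. */
    void vector_add(const float *a, const float *b, float *c, int n) {
        #pragma omp simd
        for (int i = 0; i < n; i++)
            c[i] = a[i] + b[i];   /* every lane performs the same operation */
    }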
Matrix multiply:
– A, B, and C are n-by-n matrices
– Compute C= A * B

– Complexity O(n^3) in SISD systems
» n^2 dot products, each of which is O(n)

– Complexity O(n^2) in SIMD systems
» Perform n dot products in parallel across the array
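A sketch of the C = A * B computation (row-major n-by-n matrices) with the n dot products of each row computed concurrently; this assumes OpenMP and is only illustrative of how the work maps onto parallel elements:

    /* n*n dot products; the inner 'j' loop's n dot products run in parallel. */
    void matmul(int n, const float *A, const float *B, float *C) {
        for (int i = 0; i < n; i++) {
            #pragma omp parallel for
            for (int j = 0; j < n; j++) {       /* n dot products in parallel */
                float sum = 0.0f;
                for (int k = 0; k < n; k++)
                    sum += A[i*n + k] * B[k*n + j];
                C[i*n + j] = sum;
            }
        }
    }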
Image smoothing:

– Smooth an n-by-n pixel image to reduce “noise”.

– Each pixel is replaced by the average of itself and its 8 nearest neighbors.

– Complexity O(n^2) in SISD systems.
– Complexity O(n) in SIMD systems
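A sequential C sketch of the smoothing step (borders skipped for brevity); a SIMD machine would assign one pixel, or one block of pixels, to each processing element:

    /* 3x3 smoothing: each interior pixel becomes the average of itself
       and its 8 nearest neighbors. */
    void smooth(int n, const unsigned char *in, unsigned char *out) {
        for (int i = 1; i < n - 1; i++)
            for (int j = 1; j < n - 1; j++) {
                int sum = 0;
                for (int di = -1; di <= 1; di++)
                    for (int dj = -1; dj <= 1; dj++)
                        sum += in[(i + di) * n + (j + dj)];
                out[i * n + j] = (unsigned char)(sum / 9);
            }
    }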


The Pentium architecture incorporates a form of SIMD - multimedia extensions (MMX) - a SIMD unit operating in conjunction with the Pentium superscalar processor.

It has 57 new instructions, e.g. a high-speed multiply-accumulate function which computes the sum of a series of products.

Such a computation is a key element of several digital signal processing (DSP) and multimedia applications.

It has eight 64-bit wide registers and four new data types - packed bytes, words, etc.

It can perform a calculation simultaneously on two, four or eight data elements by packing multiple operands into a single 64-bit SIMD register and operating on them all in parallel, in a single clock cycle.

E.g. eight 8-bit graphics pixels can be packed into a SIMD register and operated on simultaneously.
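A sketch of this packed-byte style of operation, assuming a compiler that provides the MMX intrinsics in <mmintrin.h> (treat the example as illustrative, not as the exact instruction set described above):

    /* Eight 8-bit pixels packed into one 64-bit register, added in one operation. */
    #include <mmintrin.h>

    void add_pixels(const __m64 *a, const __m64 *b, __m64 *c) {
        *c = _mm_add_pi8(*a, *b);   /* eight byte-wide additions at once */
        _mm_empty();                /* clear MMX state before any floating-point code */
    }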
MIMD - Multiprocessors

MIMD systems differ from SIMD ones in that the “lock-step” operation requirement is removed.

Each processor has its own control unit and can execute an independent stream of instructions
– Rather than forcing all processors to perform the same task at the same time, processors can be assigned different tasks that, when taken as a whole, complete the assigned application.

MIMD systems comprise a large number of independent processors executing different instruction streams (i.e. different programs) on separate data.

Cooperation is via two distinct approaches:

• shared memory (although there may be some local memory) for holding data and for communication purposes.

• message passing (no shared memory) via interconnections
A multiprocessor with 16 CPUs sharing a common memory
A multicomputer with 16 CPUs, each with its own private memory
A taxonomy of parallel computers.
Interconnection structures

Multiprocessors comprise a large number of CPUs, IOPs and MMUs.

Need to
• pass information (commands, data) between processors
• connect CPUs and IOPs to memory banks
• connect CPUs to IOPs

How are they connected? Via buses or interconnection networks.

There are a number of different physical configurations, e.g. ring, mesh, tree, hypercube.


Various design factors dictate applicability of any
given configuration:
• communication diameter - the number of processors that commands/data must pass through to reach the destination
• expandability
• redundancy - the number of different paths available for commands/data to reach the destination
• routing
• bandwidth - the capacity in terms of bytes per sec
• connection degree - the number of paths leaving a processor
Keys to high MIMD performance are
– Process synchronization

– Process scheduling

Process synchronization is concerned with methods for allowing processes shared access to resources and communication, avoiding deadlock situations.

Note: Deadlock - an insidious state of a computer system in which some or all programs are blocked from progressing due to lack of resources.
Process scheduling can be performed
– By the programmer through the use of
parallel language constructs
» Specify a priori what processes will
be instantiated and where they will be executed.

– During program execution, by spawning processes off for execution on available processors.
» Fork-join construct in some languages.

Keep all processors busy and not suspended awaiting data from another processor.
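A minimal sketch of the fork-join idea using POSIX threads (assumed available); real schedulers spawn many such workers across the available processors:

    /* Fork a worker thread, do other work, then join. */
    #include <pthread.h>
    #include <stdio.h>

    static void *worker(void *arg) {
        printf("worker %ld running\n", (long)arg);   /* the spawned task's work */
        return NULL;
    }

    int main(void) {
        pthread_t t;
        pthread_create(&t, NULL, worker, (void *)1L);  /* fork */
        /* ... the parent can do useful work here instead of sitting idle ... */
        pthread_join(t, NULL);                         /* join */
        return 0;
    }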
Shared memory

Classified as

• with uniform memory access - UMA.
Each processor has the same path length and equal access time to memory.

• with non-uniform memory access - NUMA.
Distributed shared memory. Non-uniform access time for each processor - larger path lengths for some processors.

• with cache-only memory access - COMA.

All have a cache problem concerning data consistency - cache coherence - which complicates the design.

Shared data may reside in several caches.

When any processor updates its cached value, all other caches must either update their values or invalidate them.
• Write-through with update protocol - when a cache is updated, so too is memory. The update is broadcast to all other processors, which update their own caches on receipt.
An alternative is write-through with invalidation - the broadcast causes other caches to invalidate “stale” data.

• Write-back protocol - when a cache is updated, the processor obtains exclusive ownership of the block, and all other copies, including the memory copy, are invalidated.
When another processor wishes to read the block, the updated block is sent by the current owner, and the memory copy is also updated.
• Snoopy caches - a special cache controller constantly monitors (snoops on) the (single) bus for write operations.
If the address of a write operation matches an address that the cache holds, the cache invalidates that entry.
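A conceptual sketch (not a real controller) of the snoopy write-invalidate idea; the types and names are invented for illustration:

    /* On seeing a bus write to an address it holds, a cache marks its copy invalid. */
    typedef enum { INVALID, VALID } line_state_t;

    typedef struct {
        unsigned long tag;      /* address (tag) cached in this line */
        line_state_t  state;
    } cache_line_t;

    /* Called by the snooping controller for every write observed on the shared bus. */
    void snoop_write(cache_line_t *line, unsigned long written_tag) {
        if (line->state == VALID && line->tag == written_tag)
            line->state = INVALID;   /* our copy is now stale */
    }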
Message-passing computers

Described in terms of granularity:

Granularity = computation / communication

which is measured as
• coarse - more computation than communication
• medium
• fine - more communication than computation

Problems include
• message routing
• synchronization for message passing - send and receive primitives
• deadlock
Interconnection networks

Once the number of processors starts increasing, a bus-based design becomes impractical. Why?

Therefore, some kind of interconnection network must be used to connect the processors
- to the memories (if a shared-memory design is used), or
- to the other processors (if a message-passing design is used).
Connection Topologies for Parallel Processors

Parallel processors may be interconnected in a number of different ways. For example:

As a bus:
As a ring:
As a star:
As a mesh/grid:
As a (hyper)cube:
Parallel computing in the Sega home
video game system

System bus model of the Sega Genesis architecture
Multiprocessing metrics

Speedup

Speedup factor, S = (execution time for 1 processor) / (execution time for N processors)

Might intuitively assume linear speedup - execution time decreases in proportion to the number of processors.

Amdahl’s law - concerns parallel computation with a serial component.
(a) A program has a sequential part and a parallelizable part.
(b) Effect of running part of the program in parallel.

f - fraction of the computation which must be executed serially
t - execution time on a single processor

Time with N processors = f * t + [(1 - f) * t] / N

Speedup, S = t / ( f * t + [(1 - f) * t] / N )
           = N / ( 1 + (N - 1) * f )
For a significant increase in speed to be realized, the fraction of serial computation needs to be very small.

With a large number of processors, the maximum speedup is limited to 1/f.
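A small sketch of the formula (the example values are illustrative only):

    /* Amdahl's law: S(N) = N / (1 + (N - 1) * f), f = serial fraction. */
    #include <stdio.h>

    double amdahl_speedup(double f, int n) {
        return n / (1.0 + (n - 1) * f);
    }

    int main(void) {
        /* e.g. f = 0.05, N = 100 gives S of about 16.8; the limit as N grows is 1/f = 20 */
        printf("S = %.1f\n", amdahl_speedup(0.05, 100));
        return 0;
    }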
System examples

MIMD

- NASA has developed a parallel system comprising a network of PCs, called a Beowulf cluster. Such a system with 600 computers was rated at 233 gigaflops!

- An IBM 8192-processor system called ASCI White, built for the DoE, USA. This system is rated at 12.3 teraflops - that is, 12.3 x 10^12 floating-point operations per sec. The computers used are based on the IBM RS/6000 family.
Problems

– Hardware is relatively easy to build.
– Massively parallel systems just take massive amounts of money to build.
– How should/can the large numbers of
processors be interconnected?
– The real trick is building software that will
exploit the capabilities of the system

Reality check

– Outside of a limited number of high-profile applications, parallel processing is still a “young” discipline
– Parallel software is still fairly sparse
– Risky for companies to adopt parallel
strategies - they just wait for the next new SISD system!
