04 - Lecture #4


High Performance

Computing
LECTURE #4

1
o Parallel Computing Platforms.
• Von Neumann Architecture.
• Flynn's Taxonomy
Agenda
o Logical Organization
• Control
• Communication

o Physical Organization

2
von Neumann Architecture
❖Named after the Hungarian mathematician John von Neumann, who first
authored the general requirements for an electronic computer in his 1945
paper.

❖Since then, virtually all computers have followed this basic design, which
differs from earlier computers that were programmed through "hard wiring".

3
von Neumann Architecture

4
Parallel Computing Platform
Logical Organization

5
Models:
Flynn's Classical Taxonomy
❖ There are different ways to classify parallel computers. One of the more
widely used classifications, in use since 1966, is called Flynn's Taxonomy.

❖ Flynn's taxonomy distinguishes multi-processor computer architectures according to how they can be classified along the two independent dimensions of Instruction and Data.

❖Each of these dimensions can have only one of two possible states: Single or
Multiple.

6
Flynn's Classical Taxonomy
The matrix below defines the 4 possible classifications according to Flynn:

7
Flynn's Taxonomy

Single Instruction, Single Data (SISD):


❖ A serial (non-parallel) computer

❖ Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle

❖ Single data: only one data stream is being used as input during any one clock cycle

❖ This is the oldest and, even today, the most common type of computer

❖ Examples: older-generation mainframes, minicomputers and workstations; most modern-day PCs.
8
Flynn's Taxonomy

Single Instruction, Multiple Data (SIMD):


❖ Single instruction: All processing units execute the same instruction at any given clock cycle

❖ Multiple data: Each processing unit can operate on a different data element

A single control unit dispatches the same instruction to various processors (that work on different data).
9
Flynn's Taxonomy

10
Flynn's Taxonomy

❖Processor Arrays: ILLIAC IV, DAP, Connection Machine CM-2, MasPar MP-1.
❖Vector Pipelines: IBM 9000, Cray X-MP, Y-MP & C90, Fujitsu VP, NEC SX-2,
Hitachi S820, ETA10

❖Most modern computers, particularly those with graphics processing units (GPUs), employ SIMD instructions and execution units.

❖Example:
for (i = 0; i < 1000; i++)
    c[i] = a[i] + b[i];
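The loop above is exactly the kind of computation a SIMD unit executes: one instruction applied to many data elements. As an illustration only (not from the slides; the x86 SSE intrinsics and float element type are assumptions), the same loop written so that each add instruction processes four elements at once:

#include <xmmintrin.h>   /* x86 SSE intrinsics (assumed target) */

/* One SIMD add instruction processes four floats at a time.
   Assumes n is a multiple of 4. */
void add_simd(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);            /* load a[i..i+3]    */
        __m128 vb = _mm_loadu_ps(&b[i]);            /* load b[i..i+3]    */
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));   /* c[i..i+3] = a + b */
    }
}

Modern compilers typically generate such SIMD instructions automatically ("auto-vectorization") for simple loops like the one above.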

11
Flynn's Taxonomy

12
Flynn's Taxonomy

13
Your Turn !!!
Can you guess what the SIMD drawbacks are?!

14
Flynn's Taxonomy

Multiple Instruction, Single Data (MISD):


❖A single data stream is fed into multiple processing units.

❖Each processing unit operates on the data independently via independent instruction streams.

❖Few actual examples of this class of parallel computer have ever existed. One is the experimental Carnegie-Mellon C.mmp computer (1971).

❖Ex: multiple cryptography algorithms attempting to crack a single coded message.
15
Multiple Instruction, Multiple Data (MIMD):
❖Currently, the most common type of parallel computer. Most
modern computers fall into this category.

❖ Multiple Instruction: every processor may be executing a different instruction stream.

❖ Multiple Data: every processor may be working with a different data stream.

❖ Examples: most current supercomputers, networked parallel computer clusters and "grids", multi-processor SMP computers, multi-core PCs.

❖ Note: many MIMD architectures also include SIMD execution sub-components
16
Flynn's Taxonomy

17
Flynn's Taxonomy
❖Multiple Instruction, Multiple Data (MIMD):

A simple variant of this model is SPMD (Single Program, Multiple Data):

❖Relies on multiple instances of the same program executing on different data

❖Widely used by many parallel platforms and requires minimal architectural support

❖Ex: Sun Ultra servers, multiprocessor PCs, workstation clusters & IBM SP
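As a sketch only (not from the slides; the array names and size follow the earlier a/b/c loop, and MPI is one possible realization), an SPMD program in C: every process runs the same code, and its rank decides which block of data it works on.

#include <mpi.h>

#define N 1000

int main(int argc, char **argv)
{
    float a[N], b[N], c[N];                 /* assumed data, as in the earlier loop */
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which instance of the program am I? */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many instances are running?     */

    /* ... initialize a and b here ... */

    /* Same program everywhere; the rank selects a different block of data. */
    int chunk = N / size;
    int start = rank * chunk;
    int end   = (rank == size - 1) ? N : start + chunk;

    for (int i = start; i < end; i++)
        c[i] = a[i] + b[i];

    MPI_Finalize();
    return 0;
}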

18
Your Turn
Compare SIMD and MIMD

19
20
❖Suppose you want to do a puzzle that has, say, a
thousand pieces. How much time would it take?

21
❖ A friend came to help!!

❖ works on his/her half of the puzzle

❖ you’ll both reach into the pile of pieces at the same time

❖contend for the same resource

❖ you will have to work together (communicate)

❖ Speedup??

22
❖More help!!

❖ Contention??

❖ Communication??

❖ Speedup??

23
❖Now let’s try something a little different.

❖ set up two tables

❖ put half of the puzzle pieces on each table

❖ work completely independently, without any contention

❖ the cost of communicating??

❖ Decomposition ??

24
❖More??
❖ Easy??
❖ Load balance

25
Parallel Computing Platform
Logical Organization

26
Parallel Computing Platform
Logical Organization

Platforms that provide a shared data space are called shared-address-space machines or multiprocessors.

Platforms that support messaging are called message passing platforms or multi-computers.
27
Parallel Computing Platform
Logical Organization

1- Accessing Shared data

❖It is important to note the difference between the terms shared address
space and shared memory.

✓ Shared address space is a programming abstraction.

✓ Shared memory is a physical machine attribute.

28
Parallel Computing Platform
Logical Organization

1- Accessing Shared data (cont.)

❖Part (or all) of the memory is accessible to all processors.

❖Processors interact by modifying data objects stored in this shared-address-space.

❖ Changes in a memory location effected by one processor are visible to all other processors (global address space).

❖ Shared memory machines can be divided into two main classes based upon memory access times:
➢ Uniform Memory Access (UMA) and

➢ Non-Uniform Memory Access (NUMA).
29
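To make the shared-address-space idea concrete, here is a minimal sketch (an illustration only, not from the slides) using POSIX threads in C: both threads see the same arrays, so a write by one thread lands in memory the other can read directly.

#include <pthread.h>
#include <stdio.h>

#define N 1000

/* Shared data: both threads see the same addresses. */
static float a[N], b[N], c[N];

/* Each thread adds one half of the arrays (assumed decomposition). */
static void *add_half(void *arg)
{
    int half = *(int *)arg;                  /* 0 or 1 */
    for (int i = half * (N / 2); i < (half + 1) * (N / 2); i++)
        c[i] = a[i] + b[i];                  /* writes go into the shared address space */
    return NULL;
}

int main(void)
{
    pthread_t t;
    int lower = 0, upper = 1;

    pthread_create(&t, NULL, add_half, &upper);  /* worker thread takes the upper half */
    add_half(&lower);                            /* main thread takes the lower half   */
    pthread_join(t, NULL);                       /* synchronization is our responsibility */

    printf("c[0] = %f\n", c[0]);
    return 0;
}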


Parallel Computing Platform
Logical Organization

Uniform Memory Access (UMA)

❖Most commonly represented today by Symmetric Multiprocessors (SMPs).

❖Identical processors.

❖Each processor has equal access and access times to memory.

❖Sometimes called Cache Coherent UMA (CC-UMA).

❖Cache Coherent means if one processor updates a location in shared memory, all the other processors know about the update.
30
Parallel Computing Platform
Logical Organization

Non-Uniform Memory Access (Distributed) (NUMA)
❖Often made by physically linking two or more Symmetric Multiprocessors (SMPs).

✓ One SMP can directly access memory of another SMP.

✓ Not all processors have equal access time to all memories.

✓ Memory access across the link is slower.

31
(a) Uniform-memory-access shared-address-space computer; (b) uniform-memory-access shared-address-space computer with caches and memories; (c) non-uniform-memory-access shared-address-space computer with local memory only.
32
Your Turn
Compare NUMA and UMA

33
Parallel Computing Platform
Logical Organization

Platforms that provide a shared data space are called shared-address-space machines or multiprocessors.

Platforms that support messaging are called message passing platforms or multicomputers.
34
Parallel Computing Platform
Logical Organization

2- Exchanging messages
❖Distributed memory systems require a communication network to
connect inter-processor memory.

❖Processors have their own local memory, so:

❖Each one operates independently.

❖Changes it makes to its local memory have no effect on the memory of other processors.

❖Hence, the concept of cache coherency does not apply.

35
Parallel Computing Platform
Logical Organization

2- Exchanging messages (cont.)

❖These platforms comprise a set of processors, each with its own (exclusive) memory.

❖ Instances of such a view come naturally from clustered workstations and non-shared-address-space multi-computers.

❖ These platforms are programmed using (variants of) send and receive primitives.

❖ Principal functions: send() and receive(); each processor has a unique ID {GetID, NumProcs}.

❖ Libraries such as MPI and PVM provide such primitives.
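As a hedged illustration only (the slide names just the abstract primitives; the wrapper names and the MPI realization below are assumptions, and MPI_Init is assumed to have been called elsewhere), the GetID / NumProcs / send / receive view can be expressed as thin wrappers over an MPI-style library:

#include <mpi.h>

/* GetID: unique ID of the calling processor. */
int GetID(void)
{
    int id;
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    return id;
}

/* NumProcs: total number of processors. */
int NumProcs(void)
{
    int n;
    MPI_Comm_size(MPI_COMM_WORLD, &n);
    return n;
}

/* The slide's send() primitive: ship nbytes of raw data to processor dest. */
void Send(void *buf, int nbytes, int dest)
{
    MPI_Send(buf, nbytes, MPI_BYTE, dest, 0, MPI_COMM_WORLD);
}

/* The slide's receive() primitive: accept nbytes of raw data from processor src. */
void Receive(void *buf, int nbytes, int src)
{
    MPI_Recv(buf, nbytes, MPI_BYTE, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}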


36
❖When a processor needs access to data in another processor's memory (distributed memory), it is usually the task of the programmer to explicitly define how and when data is communicated.

❖Synchronization between tasks is likewise the programmer's responsibility.

37
❖Each processor P (with its own local cache C) is connected to exclusive local memory, i.e. no other CPU has direct access to it.

❖ Each node comprises at least one network interface (NI) that mediates the connection to a communication network.

❖ On each CPU runs a serial process that can communicate with other processes on other CPUs by means of the network.

❖ Non-blocking vs. blocking communication
38
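A brief sketch of the non-blocking vs. blocking distinction mentioned above (an illustration only; the function and buffer names are assumptions, and MPI is assumed to be initialized by the caller): a blocking MPI_Recv pauses until the message arrives, whereas the non-blocking MPI_Irecv returns immediately and lets the process compute while the message is in flight.

#include <mpi.h>

/* Overlap communication with computation using a non-blocking receive. */
void receive_overlapped(float *buf, int count, int from)
{
    MPI_Request req;

    /* Non-blocking: posts the receive and returns right away. */
    MPI_Irecv(buf, count, MPI_FLOAT, from, 0, MPI_COMM_WORLD, &req);

    /* ... do useful work here that does not touch buf ... */

    /* Block only at the point where the data is actually needed. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}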
(MPI)—A library for distributed-memory parallel programming

❖ Fits naturally with data parallelism.

❖ The same program runs on each processor/machine (SPMD—a very useful subset of MIMD)

❖ Each process is distinguished by its rank.

❖ The program is written in a sequential language (FORTRAN/C[++])

❖ All variables are local! There is no concept of shared memory.

❖ Data exchange between processes happens through send/receive messages via an appropriate library

39
(MPI)—A library for distributed-memory parallel programming

❖The MPI system requires information about:

✓ Which processor is sending the message. (Sender)

✓ Where the data is on the sending processor. (S variable)

✓ What kind of data is being sent. (Data type)

✓ How much data is there. (Size)

✓ Which processor(s) are receiving the message. (Receiver)

✓ Where the data should be left on the receiving processor. (R variable)

✓ How much data the receiving processor is prepared to accept. (Size)
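As a hedged illustration (the buffer names, sizes, and use of exactly two processes are assumptions), each item in the checklist above maps onto an argument of MPI's blocking send and receive calls; run with at least two processes, e.g. mpirun -np 2.

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    double s_variable[100];          /* data on the sending processor   */
    double r_variable[100];          /* where the data lands on arrival */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                 /* sender */
        /* ... fill s_variable here ... */
        MPI_Send(s_variable,         /* where the data is (S variable)       */
                 100,                /* how much data (size)                 */
                 MPI_DOUBLE,         /* what kind of data (data type)        */
                 1,                  /* which processor receives (receiver)  */
                 0,                  /* message tag                          */
                 MPI_COMM_WORLD);
    } else if (rank == 1) {          /* receiver */
        MPI_Recv(r_variable,         /* where to leave the data (R variable)      */
                 100,                /* how much it is prepared to accept (size)  */
                 MPI_DOUBLE,         /* expected data type                        */
                 0,                  /* which processor sent it (sender)          */
                 0,                  /* matching tag                              */
                 MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}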


40
Your Turn
Compare shared-address-space and message-passing platforms

Shared Address Space Platforms
❖ Shared-address-space platforms can easily emulate message passing. The reverse is more difficult to do (in an efficient manner).

Message Passing Platforms
❖ Message passing requires little hardware support, other than a network.

41
Your Turn
Shared Memory

Advantages
✓ Global address space provides a user-friendly programming perspective to memory.
✓ Data sharing between tasks is both fast and uniform due to the proximity of memory to CPUs.

Disadvantages
✓ Lack of scalability between memory and CPUs: adding more CPUs can increase traffic on the shared memory-CPU path.
✓ Programmer responsibility for synchronization.
✓ Expensive.
✓ NUMA access times.

Distributed Memory

Advantages
✓ Memory is scalable with the number of processors: increase the number of processors and the size of memory increases proportionally.
✓ Each processor can rapidly access its own memory without interference and without the overhead incurred in trying to maintain cache coherency.
✓ Cost effectiveness.

Disadvantages
✓ Programmer is responsible for many details associated with data communication between processors.
✓ Difficult to map existing data structures, based on global memory, to this memory organization.
42
