04 - Lecture #4


High Performance

Computing
LECTURE #4

1
o Parallel Computing Platforms.
• Von Neumann Architecture.
• Flynn's Taxonomy
Agenda
o Logical Organization
• Control
• Communication

o Physical Organization

2
von Neumann Architecture
❖Named after the Hungarian mathematician John von Neumann, who first
authored the general requirements for an electronic computer in his 1945
paper.

❖Since then, virtually all computers have followed this basic design, which
differs from earlier computers that were programmed through "hard wiring".

3
von Neumann Architecture

4
Parallel Computing Platform
Logical Organization

5
Models:
Flynn's Classical Taxonomy
❖ There are different ways to classify parallel computers. One of the more
widely used classifications, in use since 1966, is called Flynn's Taxonomy.

❖ Flynn's taxonomy distinguishes multi-processor computer architectures according to how they can be classified along the two independent dimensions of Instruction and Data.

❖Each of these dimensions can have only one of two possible states: Single or
Multiple.

6
Flynn's Classical Taxonomy
The matrix below defines the 4 possible classifications according to Flynn:

7
Flynn's Taxonomy

Single Instruction, Single Data (SISD):


❖ A serial (non-parallel) computer

❖ Single instruction: only one instruction stream is being acted on by the CPU during any one clock cycle

❖ Single data: only one data stream is being used as input during any one clock cycle

❖ This is the oldest and, even today, the most common type of computer

❖ Examples: older-generation mainframes, minicomputers and workstations; most modern-day PCs.
8
Flynn's Taxonomy

Single Instruction, Multiple Data (SIMD):


❖ Single instruction: All processing units execute the same instruction at any given clock cycle

❖ Multiple data: Each processing unit can operate on a different data element

A single control unit dispatches the same instruction to various processors (that work on different data).
9
Flynn's Taxonomy

10
Flynn's Taxonomy

❖Processor Arrays: ILLIAC IV, DAP, Connection Machine CM-2, MasPar MP-1.
❖Vector Pipelines: IBM 9000, Cray X-MP, Y-MP & C90, Fujitsu VP, NEC SX-2,
Hitachi S820, ETA10

❖Most modern computers, particularly those with graphics processing units (GPUs), employ SIMD instructions and execution units.

❖Example:
for (i = 0; i < 1000; i++)
    c[i] = a[i] + b[i];
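The loop above is exactly the kind of computation a SIMD unit executes: one instruction applied to many data elements. As an illustration only (not from the slides; the x86 SSE intrinsics and float element type are assumptions), the same loop written so that each add instruction processes four elements at once:

#include <xmmintrin.h>   /* x86 SSE intrinsics (assumed target) */

/* One SIMD add instruction processes four floats at a time.
   Assumes n is a multiple of 4. */
void add_simd(const float *a, const float *b, float *c, int n)
{
    for (int i = 0; i < n; i += 4) {
        __m128 va = _mm_loadu_ps(&a[i]);            /* load a[i..i+3]    */
        __m128 vb = _mm_loadu_ps(&b[i]);            /* load b[i..i+3]    */
        _mm_storeu_ps(&c[i], _mm_add_ps(va, vb));   /* c[i..i+3] = a + b */
    }
}

Modern compilers typically generate such SIMD instructions automatically ("auto-vectorization") for simple loops like the one above.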

11
Flynn's Taxonomy

12
Flynn's Taxonomy

13
Your Turn !!!
Can you guess what the SIMD drawbacks are?!

14
Flynn's Taxonomy

Multiple Instruction, Single Data (MISD):


❖A single data stream is fed into multiple processing units.

❖Each processing unit operates on the data independently via independent instruction streams.

❖Few actual examples of this class of parallel computer have ever existed. One is the experimental Carnegie-Mellon C.mmp computer (1971).

❖Ex: multiple cryptography algorithms attempting to crack a single coded message.
15
Multiple Instruction, Multiple Data (MIMD):
❖Currently, the most common type of parallel computer. Most
modern computers fall into this category.

❖ Multiple Instruction: every processor may be executing a different instruction stream.

❖ Multiple Data: every processor may be working with a different data stream.

❖ Examples: most current supercomputers, networked parallel computer clusters and "grids", multi-processor SMP computers, multi-core PCs.

❖ Note: many MIMD architectures also include SIMD execution sub-components
16
Flynn's Taxonomy

17
Flynn's Taxonomy
❖Multiple Instruction, Multiple Data (MIMD):

A simple variant of this model is SPMD (Single Program, Multiple Data):

❖Relies on multiple instances of the same program executing on different data

❖Widely used by many parallel platforms and requires minimal architectural support

❖Ex: Sun Ultra servers, multiprocessor PCs, workstation clusters & IBM SP
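As a sketch only (not from the slides; the array names and size follow the earlier a/b/c loop, and MPI is one possible realization), an SPMD program in C: every process runs the same code, and its rank decides which block of data it works on.

#include <mpi.h>

#define N 1000

int main(int argc, char **argv)
{
    float a[N], b[N], c[N];                 /* assumed data, as in the earlier loop */
    int rank, size;

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);   /* which instance of the program am I? */
    MPI_Comm_size(MPI_COMM_WORLD, &size);   /* how many instances are running?     */

    /* ... initialize a and b here ... */

    /* Same program everywhere; the rank selects a different block of data. */
    int chunk = N / size;
    int start = rank * chunk;
    int end   = (rank == size - 1) ? N : start + chunk;

    for (int i = start; i < end; i++)
        c[i] = a[i] + b[i];

    MPI_Finalize();
    return 0;
}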

18
Your Turn
Compare SIMD and MIMD

19
20
❖Suppose you want to do a puzzle that has, say, a
thousand pieces. How much time would it take?

21
❖ A friend came to help!!

❖ works on his/her half of the puzzle

❖ you’ll both reach into the pile of pieces at the same time

❖contend for the same resource

❖ you will have to work together (communicate)

❖ Speedup??

22
❖More help!!

❖ Contention??

❖ Communication??

❖ Speedup??

23
❖Now let’s try something a little different.

❖ set up two tables

❖ put half of the puzzle pieces on each table

❖ work completely independently, without any contention

❖ the cost of communicating??

❖ Decomposition ??

24
❖More??
❖ Easy??
❖ Load balance

25
Parallel Computing Platform
Logical Organization

26
Parallel Computing Platform
Logical Organization

Platforms that provide a shared data space are called shared-address-space machines or multiprocessors.

Platforms that support messaging are called message passing platforms or multi-computers.
27
Parallel Computing Platform
Logical Organization

1- Accessing Shared data

❖It is important to note the difference between the terms shared address
space and shared memory.

✓ Shared address space is a programming abstraction.

✓ Shared memory is a physical machine attribute.

28
Parallel Computing Platform
Logical Organization

1- Accessing Shared data (cont.)

❖Part (or all) of the memory is accessible to all processors.

❖Processors interact by modifying data objects stored in this shared-address-space.

❖ Changes in a memory location effected by one processor are visible to all other processors (global address space).

❖ Shared memory machines can be divided into two main classes based upon memory access times:
➢ Uniform Memory Access (UMA) and

➢ Non-Uniform Memory Access (NUMA).
29
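To make the shared-address-space idea concrete, here is a minimal sketch (an illustration only, not from the slides) using POSIX threads in C: both threads see the same arrays, so a write by one thread lands in memory the other can read directly.

#include <pthread.h>
#include <stdio.h>

#define N 1000

/* Shared data: both threads see the same addresses. */
static float a[N], b[N], c[N];

/* Each thread adds one half of the arrays (assumed decomposition). */
static void *add_half(void *arg)
{
    int half = *(int *)arg;                  /* 0 or 1 */
    for (int i = half * (N / 2); i < (half + 1) * (N / 2); i++)
        c[i] = a[i] + b[i];                  /* writes go into the shared address space */
    return NULL;
}

int main(void)
{
    pthread_t t;
    int lower = 0, upper = 1;

    pthread_create(&t, NULL, add_half, &upper);  /* worker thread takes the upper half */
    add_half(&lower);                            /* main thread takes the lower half   */
    pthread_join(t, NULL);                       /* synchronization is our responsibility */

    printf("c[0] = %f\n", c[0]);
    return 0;
}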


Parallel Computing Platform
Logical Organization

Uniform Memory Access (UMA)

❖Most commonly represented today by Symmetric Multiprocessors (SMPs).

❖Identical processors.

❖Each processor has equal access and access times to memory.

❖Sometimes called Cache Coherent UMA (CC-UMA).

❖Cache Coherent means if one processor updates a location in shared memory, all the other processors know about the update.
30
Parallel Computing Platform
Logical Organization

Non-Uniform Memory Access (Distributed) (NUMA)
❖Often made by physically linking two or more Symmetric Multiprocessors (SMPs).

✓ One SMP can directly access memory of another SMP.

✓ Not all processors have equal access time to all memories.

✓ Memory access across the link is slower.

31
(a) Uniform-memory-access shared-address-space computer; (b) uniform-memory-access shared-address-space computer with caches and memories; (c) non-uniform-memory-access shared-address-space computer with local memory only.
32
Your Turn
Compare NUMA and UMA

33
Parallel Computing Platform
Logical Organization

Platforms that provide a shared data space are called shared-address-space machines or multiprocessors.

Platforms that support messaging are called message passing platforms or multicomputers.
34
Parallel Computing Platform
Logical Organization

2- Exchanging messages
❖Distributed memory systems require a communication network to
connect inter-processor memory.

❖Processors have their own local memory, so:

❖Each one operates independently.

❖Changes it makes to its local memory have no effect on the memory of other processors.

❖Hence, the concept of cache coherency does not apply.

35
Parallel Computing Platform
Logical Organization

2- Exchanging messages (cont.)

❖These platforms comprise a set of processors, each with its own (exclusive) memory.

❖ Instances of such a view come naturally from clustered workstations and non-shared-address-space multi-computers.

❖ These platforms are programmed using (variants of) send and receive primitives.

❖ Principal functions: send() and receive(); each processor has a unique ID {GetID, NumProcs}.

❖ Libraries such as MPI and PVM provide such primitives.
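As a hedged illustration only (the slide names just the abstract primitives; the wrapper names and the MPI realization below are assumptions, and MPI_Init is assumed to have been called elsewhere), the GetID / NumProcs / send / receive view can be expressed as thin wrappers over an MPI-style library:

#include <mpi.h>

/* GetID: unique ID of the calling processor. */
int GetID(void)
{
    int id;
    MPI_Comm_rank(MPI_COMM_WORLD, &id);
    return id;
}

/* NumProcs: total number of processors. */
int NumProcs(void)
{
    int n;
    MPI_Comm_size(MPI_COMM_WORLD, &n);
    return n;
}

/* The slide's send() primitive: ship nbytes of raw data to processor dest. */
void Send(void *buf, int nbytes, int dest)
{
    MPI_Send(buf, nbytes, MPI_BYTE, dest, 0, MPI_COMM_WORLD);
}

/* The slide's receive() primitive: accept nbytes of raw data from processor src. */
void Receive(void *buf, int nbytes, int src)
{
    MPI_Recv(buf, nbytes, MPI_BYTE, src, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
}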


36
❖When a processor needs access to data in another processor's memory (distributed memory), it is usually the task of the programmer to explicitly define how and when data is communicated.

❖Synchronization between tasks is likewise the programmer's responsibility.

37
❖Each processor P (with its own local cache C) is connected to exclusive local memory, i.e. no other CPU has direct access to it.

❖ Each node comprises at least one network interface (NI) that mediates the connection to a communication network.

❖ On each CPU runs a serial process that can communicate with other processes on other CPUs by means of the network.

❖ Non-blocking vs. blocking communication
38
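A brief sketch of the non-blocking vs. blocking distinction mentioned above (an illustration only; the function and buffer names are assumptions, and MPI is assumed to be initialized by the caller): a blocking MPI_Recv pauses until the message arrives, whereas the non-blocking MPI_Irecv returns immediately and lets the process compute while the message is in flight.

#include <mpi.h>

/* Overlap communication with computation using a non-blocking receive. */
void receive_overlapped(float *buf, int count, int from)
{
    MPI_Request req;

    /* Non-blocking: posts the receive and returns right away. */
    MPI_Irecv(buf, count, MPI_FLOAT, from, 0, MPI_COMM_WORLD, &req);

    /* ... do useful work here that does not touch buf ... */

    /* Block only at the point where the data is actually needed. */
    MPI_Wait(&req, MPI_STATUS_IGNORE);
}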
(MPI)—A library for distributed-memory parallel programming

❖ Fits naturally with data parallelism.

❖ The same program runs on each processor/machine (SPMD—a very useful subset of MIMD)

❖ Each process is distinguished by its rank.

❖ The program is written in a sequential language (FORTRAN/C[++])

❖ All variables are local! There is no concept of shared memory.

❖ Data exchange between processes happens through send/receive messages via an appropriate library

39
(MPI)—A library for distributed-memory parallel programming

❖The MPI system requires information about:

✓ Which processor is sending the message. (Sender)

✓ Where the data is on the sending processor. (S variable)

✓ What kind of data is being sent. (Data type)

✓ How much data is there. (Size)

✓ Which processor(s) are receiving the message. (Receiver)

✓ Where the data should be left on the receiving processor. (R variable)

✓ How much data the receiving processor is prepared to accept. (Size)
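As a hedged illustration (the buffer names, sizes, and use of exactly two processes are assumptions), each item in the checklist above maps onto an argument of MPI's blocking send and receive calls; run with at least two processes, e.g. mpirun -np 2.

#include <mpi.h>

int main(int argc, char **argv)
{
    int rank;
    double s_variable[100];          /* data on the sending processor   */
    double r_variable[100];          /* where the data lands on arrival */

    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {                 /* sender */
        /* ... fill s_variable here ... */
        MPI_Send(s_variable,         /* where the data is (S variable)       */
                 100,                /* how much data (size)                 */
                 MPI_DOUBLE,         /* what kind of data (data type)        */
                 1,                  /* which processor receives (receiver)  */
                 0,                  /* message tag                          */
                 MPI_COMM_WORLD);
    } else if (rank == 1) {          /* receiver */
        MPI_Recv(r_variable,         /* where to leave the data (R variable)      */
                 100,                /* how much it is prepared to accept (size)  */
                 MPI_DOUBLE,         /* expected data type                        */
                 0,                  /* which processor sent it (sender)          */
                 0,                  /* matching tag                              */
                 MPI_COMM_WORLD,
                 MPI_STATUS_IGNORE);
    }

    MPI_Finalize();
    return 0;
}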


40
Your Turn
Compare shared-address-space and message-passing platforms

Shared Address Space Platforms
❖ Shared-address-space platforms can easily emulate message passing. The reverse is more difficult to do (in an efficient manner).

Message Passing Platforms
❖ Message passing requires little hardware support, other than a network.

41
Your Turn
Shared Memory

Advantages
✓ Global address space provides a user-friendly programming perspective to memory.
✓ Data sharing between tasks is both fast and uniform due to the proximity of memory to CPUs.

Disadvantages
✓ Lack of scalability between memory and CPUs: adding more CPUs can increase traffic on the shared memory-CPU path.
✓ Programmer responsibility for synchronization.
✓ Expensive.
✓ NUMA access times.

Distributed Memory

Advantages
✓ Memory is scalable with the number of processors: increase the number of processors and the size of memory increases proportionally.
✓ Each processor can rapidly access its own memory without interference and without the overhead incurred in trying to maintain cache coherency.
✓ Cost effectiveness.

Disadvantages
✓ Programmer is responsible for many details associated with data communication between processors.
✓ Difficult to map existing data structures, based on global memory, to this memory organization.
42
