COMPUTER SYSTEM ARCHITECTURE - CS 405

MODULE -1 PART - 1

PARALLEL COMPUTER MODELS


Five Generations of Electronic Computers
 Computing problems
• Numerical computing - For numerical problems in science and technology, the solutions
demand complex mathematical formulations and intensive integer or floating-point
computations.
• Transaction processing - For alphanumerical problems in business and government, the
solutions demand accurate transactions, large database management, and information
retrieval operations.
• Logical reasoning - For artificial intelligence (AI) problems, the solutions demand logic
inferences and symbolic manipulations.
 Algorithms and Data Structures
• Special algorithms and data structures are needed to specify the computations and
communications involved in computing problems.
• Most numerical algorithms are deterministic, using regularly structured data.
 Hardware Resources
• Solving a computing problem requires coordinated effort from hardware resources, an operating system, and application software.
• The hardware core of a computer system consists of processors, memory, and peripheral devices.
• Special hardware interfaces are often built into I/O devices such as terminals, workstations,
optical page scanners, magnetic-ink character recognizers, modems, file servers, voice
data entry devices, printers, and plotters. These peripherals are connected to mainframe computers
directly or through local- or wide-area networks.
 Operating System
• An effective operating system manages the allocation and deallocation of resources during
the execution of user programs.
• Beyond the OS, application software must be developed to benefit the users.
 Mapping
• Mapping of algorithmic and data structures onto the machine architecture is a bidirectional
process matching algorithmic structure with hardware architecture, and vice versa.
• Efficient mapping will benefit the programmer and produce better source codes.
• It includes processor scheduling, memory maps, interprocessor communications, etc.
 System Software Support
• Source code written in a high-level language (HLL) must first be translated into object code by an
optimizing compiler.
• The compiler assigns variables to registers or to memory words and reserves functional
units for operators.
• An assembler is used to translate the compiled object code into machine code which can be
recognized by the machine hardware.
• A loader is used to initiate the program execution through the OS kernel.
FLYNN'S CLASSICAL TAXONOMY
 This taxonomy distinguishes multiprocessor computer architectures according to two
independent dimensions: the instruction stream and the data stream.
 An instruction stream is a sequence of instructions executed by the machine.
 A data stream is a sequence of data, including input and partial or temporary
results, used by the instruction stream.
 Each of these dimensions can have only one of two possible states:
Single or Multiple.
 Flynn’s taxonomy
 Classification based on memory arrangement

 Classification based on communication

 Classification based on the kind of parallelism


• Data-parallel
• Function-parallel

FLYNN’S TAXONOMY
– The most universally accepted method of classifying computer systems
– Published in the Proceedings of the IEEE in 1966
 Any computer can be placed in one of 4 broad categories:

» SISD: Single instruction stream, single data stream


» SIMD: Single instruction stream, multiple data streams
» MIMD: Multiple instruction streams, multiple data streams
» MISD: Multiple instruction streams, single data stream

FLYNN’S TAXONOMY….
• Two types of information flow into a processor:
- instructions and data
• Instruction stream is defined as the sequence of instructions performed
by the processing unit.
• Data stream is defined as the data traffic exchanged between the
memory and the processing unit.
• According to Flynn’s classification, either of the instruction or data
streams can be single or multiple.

SISD - SINGLE INSTRUCTION STREAM, SINGLE DATA STREAM

[Figure: SISD organization - a control unit (CU) issues an instruction stream (IS) to a single processing element (PE), which exchanges a data stream (DS) with main memory (M).]
SIMD - SINGLE INSTRUCTION STREAM, MULTIPLE DATA STREAMS

Applications:
• Image processing
• Matrix manipulations
• Sorting
• E.g., vector computers

 A type of parallel computer
 Single instruction: all processing units execute the same instruction, issued by the
control unit, at any given clock cycle; there are multiple processors executing the
instruction given by one control unit.
 Multiple data: each processing unit can operate on a different data element. As shown in
the figure below, the processors are connected to a shared memory or interconnection
network that supplies multiple data streams to the processing units.
 A single instruction is thus executed by different processing units on different sets of data.
SIMD ARCHITECTURES
 Fine-grained
 Image processing application
 Large number of PEs
 Minimum complexity PEs
 Programming language is a
simple extension of a
sequential language
 Coarse-grained
 Each PE is of higher
complexity and it is usually
built with commercial devices
 Each PE has local memory
MIMD - MULTIPLE INSTRUCTION STREAMS, MULTIPLE DATA STREAMS
Applications:
• Parallel computers
• Shared Memory

 Multiple instruction: every processor may be executing a different instruction stream.
 Multiple data: every processor may be working with a different data stream; as shown in
the figure, the multiple data streams are provided by shared memory.
 Can be categorized as loosely coupled or tightly coupled, depending on the sharing of
data and control.
 Execution can be synchronous or asynchronous, deterministic or non-deterministic.
 There are different processors, each processing a different task.
MISD -MULTIPLE INSTRUCTION STREAMS, SINGLE DATA STREAM
Applications:
• Classification
• Robot vision
• Systolic array for pipelined execution
of specific algorithms

 A single data stream is fed into multiple processing units.
 Each processing unit operates on the data independently via an independent instruction
stream. As shown in the figure, the single data stream is forwarded to different processing
units, each connected to its own control unit and executing the instructions issued to it by
that control unit.
 Thus, in these computers, the same data flow through a linear array of processors
executing different instruction streams.
 This architecture is also known as a systolic array, for pipelined execution of specific
algorithms.
FLYNN’S TAXONOMY
Advantages of Flynn
» Universally accepted
» Compact Notation
» Easy to classify a system.

Disadvantages of Flynn
» Very coarse-grain differentiation among machine systems
» Comparison of different systems is limited
» Interconnections, I/O, memory not considered in the scheme

High Performance Computing Applications
PERFORMANCE FACTORS
 Processor Cycle time (t in nanoseconds) - the CPU is driven by a clock with a
constant cycle time (usually measured in nanoseconds), which controls the rate of
internal operations in the CPU.
 Clock rate ( f = 1/t, f in megahertz) - inverse of the cycle time. A shorter clock
cycle time, or equivalently a larger number of cycles per second, implies more
operations can be performed per unit time.
 Instruction count (Ic)- the number of machine instructions to be executed by the
program. Determines the size of the program. Different machine instructions require
different numbers of clock cycles to execute.
 Average CPI (Cycles Per Instruction)- CPI is important to measure the
execution time of an instruction. Average CPI can be determined for a particular
processor if we know the frequency of occurrence of each instruction type.
The term CPI is used with respect to a particular instruction set and a given program
mix.
PERFORMANCE FACTORS
CPU time (T = Ic x CPI x t) – CPU time required to execute a program containing Ic
instructions. Each instruction must be fetched from memory, decoded, then operands
fetched from memory, the instruction executed, and the results stored.

 Memory Cycle time (k x t) - the time required to access memory, usually k times the
processor cycle time t. The value of k depends on the memory technology and the
processor-memory interconnection scheme. The instruction cycle may involve m
memory references (e.g., m = 4: one for instruction fetch, two for operand fetch, and one
for storing the result).
 CPI (p + m x k) - Processor cycles required for each instruction can be attributed to
cycles needed for instruction decode and execution (processor cycles (p)) and cycles
needed for memory references (memory cycles = m x k).
 Total CPU time –Effective CPU time needed to execute a program rewritten as
T = Ic x (p + m x k) x t

p is the number of processor cycles needed for the instruction decode and execution,
m is the number of memory references needed, k is the ratio between memory cycle and
processor cycle, Ic is the instruction count, and t is the processor cycle time.
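A minimal sketch of this relation in Python; all parameter values below are assumed for illustration, not taken from the text:

```python
# Sketch of T = Ic x (p + m x k) x t with assumed example values.
Ic = 200_000      # instruction count (assumed)
p = 4             # processor cycles per instruction for decode and execution (assumed)
m = 2             # memory references per instruction (assumed)
k = 4             # ratio of memory cycle time to processor cycle time (assumed)
t = 2e-9          # processor cycle time: 2 ns, i.e. a 500 MHz clock (assumed)

cpi = p + m * k                  # effective cycles per instruction = 12
T = Ic * cpi * t                 # total CPU time in seconds
print(f"CPI = {cpi}, CPU time = {T * 1e3:.3f} ms")   # CPI = 12, CPU time = 4.800 ms
```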
System Attributes
The five performance factors (Ic, p, m, k, t) are influenced by four system attributes

System Attributes \ Performance Factors     Ic    p    m    k    t
Instruction set architecture                X     X
Compiler technology                         X     X    X
CPU implementation & control                      X              X
Cache & memory hierarchy                                     X   X

• The instruction set architecture affects program length (Ic) and processor cycles (p)
• Compiler design affects the values of Ic, p & m.
• The CPU implementation & control determine the total processor time= p*t
• The memory technology & hierarchy design affect the memory access time= k*t
SYSTEM ATTRIBUTES
 MIPS Rate - Let C be the total number of clock cycles needed to execute a given
program. The total CPU time can then be estimated as T = C x t = C / f.
Furthermore, CPI = C / Ic, so T = Ic x CPI x t = Ic x CPI / f.
The processor speed is measured in millions of instructions per second (MIPS).
The MIPS rate varies with a number of factors, including the clock rate, the
instruction count (Ic), and the CPI of a given machine.
MIPS rate = Ic / (T x 10^6) = f / (CPI x 10^6) = (f x Ic) / (C x 10^6)
The MIPS rate is directly proportional to the clock rate and inversely proportional to the CPI.
CPU time, T = Ic / (MIPS rate x 10^6)
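A small sketch (with assumed values) showing that these expressions for the MIPS rate agree with one another:

```python
# Assumed values; shows that the three expressions for the MIPS rate agree.
f = 500e6         # clock rate in Hz (assumed)
CPI = 2.0         # average cycles per instruction (assumed)
Ic = 1_000_000    # instruction count (assumed)

C = Ic * CPI                            # total clock cycles
T = C / f                               # CPU time = C x t = C / f
mips_from_f = f / (CPI * 1e6)           # 250.0
mips_from_T = Ic / (T * 1e6)            # 250.0
mips_from_C = (f * Ic) / (C * 1e6)      # 250.0
print(T, mips_from_f, mips_from_T, mips_from_C)
```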
 Throughput Rate
System throughput, Ws (in programs/second) - how many programs a system
can execute per unit time. It is measured across a large number of programs over a
long observation period.
CPU throughput, Wp (in programs/second) - in a multiprogrammed system, how
many programs can be executed per unit time, based on the MIPS rate and the average
program length Ic:
Wp = f / (Ic x CPI) = (MIPS rate x 10^6) / Ic

In a multiprogrammed system, Ws < Wp, due to the additional system overheads
caused by I/O, the compiler, and the OS when multiple programs are interleaved for CPU
execution by multiprogramming or time-sharing operation. If the CPU is kept busy in a
perfect program-interleaving fashion, then Ws = Wp.
 Floating Point Operations Per Second (FLOPS) - the performance of computation-intensive
applications in science and engineering is measured in FLOPS:
megaflops (10^6), gigaflops (10^9), teraflops (10^12), petaflops (10^15), etc.
 Speed or Throughput (W/Tn) - the execution rate on an ‘n-processor’ system,
measured in FLOPs/unit-time or instructions/unit-time.
 Speedup (Sn = T1/Tn) - how much faster an actual machine with n processors
performs the workload compared with one processor. The ratio T1/T∞ is the asymptotic
speedup. (A small numeric sketch follows this list.)
 Efficiency (En = Sn/n) - fraction of the theoretical maximum speedup achieved
by n processors. Efficiency is a measure of the fraction of time for which a PE is
usefully employed. In an ideal parallel system efficiency is equal to one. In
practice, efficiency is between zero and one.
 Degree of Parallelism (DOP) - for a given piece of the workload, the number of
processors that can be kept busy sharing that piece of computation equally.
Neglecting overhead, we assume that if k processors work together on any
workload, the workload gets done k times as fast as a sequential execution.
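A minimal sketch of the speedup and efficiency definitions above, with assumed measured run times:

```python
# Sn = T1 / Tn and En = Sn / n, with assumed measured run times.
T1 = 10.0                          # time on one processor, seconds (assumed)
Tn = {2: 5.4, 4: 2.9, 8: 1.7}      # times on n processors (assumed)

for n, t_n in Tn.items():
    Sn = T1 / t_n
    En = Sn / n
    print(f"n={n}: speedup={Sn:.2f}, efficiency={En:.2f}")
```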
Performance

 For some program running on machine A,


Performance of A, Perf(A) = 1 / ExecTime(A)
 “A is n times faster than B" iff

Perf(A) / Perf(B) = ExecTime(B) / ExecTime(A) = n

 “A is X% faster than B“ iff

Perf(A) / Perf(B) = ExecTime(B) / ExecTime(A) = 1 + X/100


CPU Performance Equation:

CPU time, T = Instruction count x CPI x Clock cycle time = (Ic x CPI) / f
Example:
Now, when the task given in the previous example is executed on a FOUR-processor
system with shared memory. Due to the need for synchronization among the FOUR
program parts, 2000 extra instructions are added to each part.
– Calculate the average CPI?
– Determine the corresponding MIPS rate?
– Calculate the speedup factor of the FOUR-processor system?
– Calculate the efficiency of the FOUR-processor system?
– Show the interconnection network of this system?
Solution:
Average CPI = 2 cycles/instruction.
MIPS rate = (4 x 500 MHz) / 2 = 1000
Speedup = T1 / T4
T1 = Ic / (MIPS x 10^6) = 100000 / (250 x 10^6) = 0.400 ms
T4 = Ic / (MIPS x 10^6) = (100000 + 4 x 2000) / (1000 x 10^6) = 0.108 ms
Speedup = 0.400 / 0.108 = 3.703
Efficiency = Speedup / #Processors = 3.703 / 4 = 0.926 = 92.59%
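The same arithmetic written out as a check, using only the numbers stated in the example (500 MHz clock, CPI = 2, Ic = 100 000, 2000 extra instructions per part):

```python
f = 500e6
CPI = 2.0
Ic = 100_000

mips_1 = f / (CPI * 1e6)                 # 250 MIPS on one processor
mips_4 = 4 * f / (CPI * 1e6)             # 1000 MIPS on four processors
T1 = Ic / (mips_1 * 1e6)                 # 0.400 ms
T4 = (Ic + 4 * 2000) / (mips_4 * 1e6)    # 0.108 ms
speedup = T1 / T4                        # ~3.70
efficiency = speedup / 4                 # ~0.926
print(T1, T4, speedup, efficiency)
```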
• For CPU design:

• The overall CPI is given by:

CPI = Σi (CPIi x Ici) / Ic

where
CPIi: the average number of clock cycles for instruction type (i), and
Ici: the number of times instruction type (i) is executed in the program.
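A sketch of this weighted average with an assumed instruction mix (the instruction types and counts below are made up for illustration):

```python
# Overall CPI = sum(CPI_i x Ic_i) / Ic; the instruction mix below is assumed.
mix = [
    ("ALU",    1.0, 45_000),   # (instruction type, CPI_i, Ic_i)
    ("load",   2.0, 25_000),
    ("store",  2.0, 10_000),
    ("branch", 3.0, 20_000),
]
Ic = sum(ic for _, _, ic in mix)
cpi = sum(cpi_i * ic for _, cpi_i, ic in mix) / Ic
print(f"Ic = {Ic}, overall CPI = {cpi:.2f}")   # Ic = 100000, overall CPI = 1.75
```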
Example
Suppose you have made the following measurements:
– Frequency of FP operations (other than FPSQR) = 25%
– Average CPI of FP operations = 4
– Average CPI of other operations = 1.33
– Frequency of FPSQR = 2%
– CPI of FPSQR = 20
• Assume that TWO design alternatives are to decrease the CPI of FPSQR to 2, or to decrease the
average CPI of all FP operations to 2.5. Compare these two design alternatives.
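One way to work the comparison in Python (a sketch; it assumes, as in the classic form of this example, that the 25% FP frequency includes the 2% FPSQR and that the remaining 75% of instructions have CPI 1.33):

```python
# Original overall CPI (FP ops, incl. FPSQR, treated as 25% with average CPI 4).
cpi_orig = 0.25 * 4 + 0.75 * 1.33        # ~2.00

# Alternative 1: CPI of FPSQR drops from 20 to 2 (FPSQR is 2% of instructions).
cpi_alt1 = cpi_orig - 0.02 * (20 - 2)    # ~1.64

# Alternative 2: average CPI of all FP operations drops from 4 to 2.5.
cpi_alt2 = 0.25 * 2.5 + 0.75 * 1.33      # ~1.62

print(cpi_orig, cpi_alt1, cpi_alt2)
# Alternative 2 is slightly better, because FP operations occur far more often than FPSQR.
```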
Amdahl’s Law
A program (or algorithm) which can be parallelized can be split up into two
parts:
A part which cannot be parallelized and
 A part which can be parallelized
Eg:
Imagine a program that processes files from disk. A small part of that program
may scan the directory and create a list of files internally in memory. After that,
each file is passed to a separate thread for processing. The part that scans the
directory and creates the file list cannot be parallelized, but processing the
files can be done in parallel.
Total time taken to execute the program only serially is called T.
The time T includes the time of both the non-parallelizable and parallelizable
parts.
T = Total time of serial execution
B = Total time of non- parallelizable part
T - B = Total time of parallelizable part (when executed serially, not in parallel)
First of all, a program can be broken up into a non-parallelizable part B and
a parallelizable part T - B, as illustrated by this diagram:
[Figure: the bar at the top is the total execution time T(1); the bars below show the execution time with a parallelization factor of 2 and with a parallelization factor of 3 - only the parallelizable part T - B shrinks as the factor grows.]
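A minimal sketch of the picture above: with a non-parallelizable part B and a parallelizable part T - B shared by n processors, the execution time is T(n) = B + (T - B)/n. The values of T and B below are assumed:

```python
# T(n) = B + (T - B) / n; B and T below are assumed example values.
T = 100.0      # total serial execution time (assumed units)
B = 40.0       # time of the non-parallelizable part (assumed)

for n in (1, 2, 3, 4, 8):
    Tn = B + (T - B) / n
    print(f"n={n}: T(n)={Tn:.1f}, speedup={T / Tn:.2f}")
```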
Example: 1
Suppose that a calculation has a 4% serial portion,
a) What is the limit of speedup on 16 processors?
b) What is the maximum speedup?
Ans:
a) Limit of speedup on 16 processors
=16/(1 + (16 – 1)*.04) = 10
b) The maximum speedup = 1/α
= 1/0.04 = 25
Example: 2
If 90% of a calculation can be parallelized, then what is the maximum
speed-up which can be achieved on 5 processors?
Ans: S(n) = n / (1 + (n - 1) x α), where α = 1 - 0.9 = 0.1 (the sequential fraction)
= 5 / (1 + (5 - 1) x 0.10) = 3.57
(the program can theoretically run 3.57 times faster on five processors than on one processor)
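The same two examples, checked with a small helper (a sketch; the function name is ours):

```python
def amdahl(n, alpha):
    """Speedup on n processors when a fraction alpha of the work is serial."""
    return n / (1 + (n - 1) * alpha)

print(amdahl(16, 0.04))   # Example 1(a): 10.0
print(1 / 0.04)           # Example 1(b): maximum speedup = 25.0
print(amdahl(5, 0.10))    # Example 2: ~3.57
```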
Amdahl’s Law
 The performance gain that can be obtained by improving some
portion of a computer can be calculated using Amdahl’s law.
 Amdahl’s law states that the performance improvement to be
gained from using some faster mode of execution is limited by the
fraction of the time the faster mode can be used.
 The law defines the term ‘speedup’.
Amdahl’s Law
 Amdahl’s law gives us a quick way to find the speedup from
some enhancement, which depends on two factors:
 The fraction of the computation time in the original computer

that can be converted to take advantage of the enhancement.


Example:
If 20 seconds of the execution time of a program that takes 60
seconds in total can use an enhancement, the fraction enhanced is
20/60.
This value, Fraction_enhanced, is always less than or equal to 1.
Amdahl’s Law
 The speedup or improvement gained by the enhanced
execution mode, that is, how much faster the task would
run if the enhanced mode were used for the entire
program – this value is the time of the original mode over
the time of the enhanced mode.
 Example, if the enhanced mode takes 2 sec for a portion

of the program, while it is 5 sec in the original mode, the


speedup enhanced is 5/2.
 This is called Speedup_enhanced, which is always greater than 1.
Corollary of Amdahl's law
 If an enhancement is only usable for a fraction of a task, then we can't speed up the
task by more than the reciprocal of 1 minus that fraction.
Amdahl’s Law
 The execution time using the original computer with the
enhanced mode will be the time spent using the unenhanced
portion of the computer plus the time spent using the
enhancement:

ExecTime_new = ExecTime_old x [ (1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]

Speedup_overall = ExecTime_old / ExecTime_new = 1 / [ (1 - Fraction_enhanced) + Fraction_enhanced / Speedup_enhanced ]
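As a small check of this formula, we can combine the two numbers from the earlier examples on these slides (fraction enhanced 20/60 and Speedup_enhanced 5/2; combining them in one program is an assumption made here for illustration):

```python
def overall_speedup(fraction_enhanced, speedup_enhanced):
    """Amdahl's law for a single enhancement."""
    return 1 / ((1 - fraction_enhanced) + fraction_enhanced / speedup_enhanced)

# Fraction enhanced = 20/60 (from the 60-second program above),
# Speedup_enhanced = 5/2 (from the 5 s -> 2 s example above).
print(overall_speedup(20 / 60, 5 / 2))   # 1.25
```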
Amdahl’s Law
 If three different enhancements use fractions of time f1, f2 and f3, respectively,
and have individual speedups S1, S2 and S3, respectively, the overall speedup is

Speedup_overall = 1 / [ (1 - (f1 + f2 + f3)) + f1/S1 + f2/S2 + f3/S3 ]
Applications of Amdahl’s Law
 Amdahl’s law gives us a quick way to find the speedup from
some enhancement.
 Amdahl’s law is particularly useful for comparing the
overall system performance of two different systems
 It can also be applied to compare two processor design
alternatives, based on enhancement on the same system
Returning to the earlier FP/FPSQR example: improving the performance of all FP operations overall is slightly
better, because of their higher frequency of occurrence.
Three enhancements with the following speedup are proposed for a
new architecture. S1=30, S2=20, S3 = 15. If enhancements 1and 2
are each usable for 25% of the time, what fraction of the time must
enhancement 3 be used to achieve an overall speed up of 10?

10 = 1 / [ 1 - (0.25 + 0.25 + f3) + (0.25/30 + 0.25/20 + f3/15) ]

10 = 1 / [ (0.5 - f3) + (0.5 + 0.75 + 4 f3) / 60 ]
10 = 60 / [ 30 - 60 f3 + 1.25 + 4 f3 ]
-56 f3 = 6 - 31.25
f3 = -25.25 / -56 = 0.45 = 45%
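A quick numeric check of this result, with S3 = 15 as given in the problem statement (a sketch, not part of the original solution):

```python
def overall_speedup(fractions, speedups):
    """Amdahl's law with several enhancements used for disjoint fractions of time."""
    unenhanced = 1 - sum(fractions)
    return 1 / (unenhanced + sum(f / s for f, s in zip(fractions, speedups)))

f3 = 25.25 / 56                                               # ~0.451
print(f3, overall_speedup([0.25, 0.25, f3], [30, 20, 15]))    # ~0.45, ~10.0
```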
Classification based on memory arrangement
[Figure: two organizations - shared-memory multiprocessors, in which processors P1..Pn access shared memory and I/O devices I/O1..I/On through an interconnection network, and message-passing multicomputers, in which processing elements PE1..PEn, each with a local memory M1..Mn, communicate over an interconnection network.]
Symmetric and Asymmetric Multiprocessors
 Symmetric:
- All processors have equal access to all peripheral devices.
- All processors are identical.
 Asymmetric:
- One processor (executive or master) executes the operating system and
handles I/O.
- The other processors (attached) may be of different types and may be
dedicated to special tasks.
Shared-memory multiprocessors
 Uniform Memory Access (UMA)
 Non-Uniform Memory Access (NUMA)

 Cache-only Memory Architecture (COMA)

 Memory is common to all the processors.


 Processors easily communicate by means of shared
variables.
Uniform Memory Access (UMA) Model
 Most commonly represented today by Symmetric Multiprocessor (SMP)
machines
• Identical processors
• Equal access and access times to memory
• Sometimes called CC-UMA - Cache Coherent UMA. Cache coherent means
if one processor updates a location in shared memory, all the other
processors know about the update. Cache coherency is accomplished at the
hardware level.

 Tightly-coupled systems (high degree of resource sharing)


 Suitable for general-purpose and time-sharing applications by multiple users.
Non-Uniform Memory Access (NUMA) Model
 Often made by physically linking two or more SMPs
 Shared memory is distributed to local memories
 One SMP can directly access memory of another SMP
 Not all processors have equal access time to all memories
 The access time varies with the location of the memory word. Memory access across
link is slower
 All local memories form a global address space accessible by all processors
 If cache coherency is maintained, then may also be called CC-NUMA - Cache Coherent
NUMA

Access times increase in the order: cache, local memory, remote memory


COMA - Cache-only Memory Architecture
The COMA model is a special case of NUMA machine in which the distributed main
memories are converted to caches.
All caches form a global address space and there is no memory hierarchy at each
processor node.
Advantages:
• Global address space provides a user-friendly programming perspective to memory
• Data sharing between tasks is both fast and uniform due to the proximity of memory
to CPUs
Disadvantages:
• Lack of scalability between memory and CPUs - adding CPUs geometrically increases
traffic on the shared memory-CPU path
• Providing synchronization constructs that ensure "correct" access to global memory
becomes increasingly difficult and expensive
Distributed memory multicomputers
 Multiple computers (nodes) connected by a message-passing network.
 Local memories are private, each with its own program and data, and are
accessible only by the local processor. Traditional multicomputers have therefore
been called NO-Remote-Memory-Access (NORMA) machines.
 Changes a processor makes to its local memory have no effect on the memory
of other processors, so the concept of cache coherency does not apply.
 Memory addresses in one processor do not map to another processor,
so there is no concept of a global address space across all processors.
 There is no memory contention, so the number of processors can be very large.
 The processors are connected by communication lines, and the
precise way in which the lines are connected is called the topology of
the multicomputer.
Vector Processor
 A vector operand contains an ordered set of n elements, where n is called the length of
the vector. Each element in a vector is a scalar quantity, which may be a floating point
number, an integer, a logical value or a character.
 A vector processor consists of a scalar processor and a vector unit, which could be
thought of as an independent functional unit capable of efficient vector operations.
 Register-to-register architecture: Vector registers are used to hold the vector operands,
intermediate and final vector results . There are fixed numbers of vector registers and
functional pipelines in a vector processor
 Memory to memory architecture: uses a vector stream unit to replace the vector registers.
Vector operands and results are directly retrieved from memory in superwords (512 bits).
 Vector hardware has the special ability to overlap or pipeline operand processing.
 Vector functional units are pipelined and fully segmented - each stage of the pipeline performs a
step of the function on different operands. Once the pipeline is full, a new result is produced
each clock period (cp).
 Applications -
Long Range Weather forecasting, Petroleum explorations,
Medical diagnosis, Space flight simulations
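A rough sketch of the vector idea, using NumPy arrays as a stand-in for vector operands and vector hardware (NumPy is an assumption made here for illustration; it is not mentioned in the text):

```python
import numpy as np

n = 100_000
a = np.random.rand(n)
b = np.random.rand(n)

# Scalar (SISD-style) processing: one element per loop iteration.
c_scalar = np.empty(n)
for i in range(n):
    c_scalar[i] = a[i] + b[i]

# Vector-style processing: one operation applied to whole vector operands,
# analogous to a single vector instruction operating on vector registers.
c_vector = a + b

assert np.allclose(c_scalar, c_vector)
```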
SIMD has two basic architectural organizations
a. Array processor using random access memory
b. Associative processors using content addressable memory.
 An array processor is a synchronous array of parallel processors that coordinate
concurrent operations in lockstep through global clocks, central control units, or vector unit
controllers.
 These processors are composed of N identical processing elements (PEs) under the
supervision of one control unit (CU).
 The control unit is a computer with high-speed registers, local memory and an arithmetic logic
unit.
 There are N data streams, one per processor, so different data can be used in each processor.
 There are two categories of array processors, depending on how the memory units are organized:

a. Dedicated memory organization


b. Global memory organization
Architectural Development Tracks
The architectures of most systems follow certain development tracks. These tracks are
distinguished by similarity in computational model and technological bases.
There are mainly 3 tracks:
1. Multiple-processor track
A multiple-processor system can use shared memory or distributed memory.
(a) Shared-memory track: single address space
(b) Message-passing track: distributed memory

2. Multivector and SIMD tracks


(a) Multivector track
(b) SIMD track

3. Multithreaded and dataflow tracks

(a) Multithreaded track - executes multiple contexts at the same time
(b) Dataflow track
