MODULE 1 - PART 1
FLYNN’S TAXONOMY
– The most universally accepted method of classifying computer systems
– Published in the Proceedings of the IEEE in 1966
Any computer can be placed in one of four broad categories:
• Two types of information flow into a processor:
- instructions and data
• Instruction stream is defined as the sequence of instructions performed
by the processing unit.
• Data stream is defined as the data traffic exchanged between the
memory and the processing unit.
• According to Flynn's classification, either the instruction stream or the
data stream can be single or multiple.
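The four categories follow mechanically from the two stream attributes. As a small illustrative sketch (the helper function is hypothetical, not part of Flynn's paper):

```python
# Hypothetical sketch: deriving Flynn's four categories from the two
# stream attributes (single/multiple instruction, single/multiple data).

def flynn_category(multiple_instructions: bool, multiple_data: bool) -> str:
    """Map instruction/data stream multiplicity to Flynn's class name."""
    i = "M" if multiple_instructions else "S"
    d = "M" if multiple_data else "S"
    return f"{i}I{d}D"

# The four broad categories:
assert flynn_category(False, False) == "SISD"  # classic uniprocessor
assert flynn_category(False, True)  == "SIMD"  # e.g., vector computers
assert flynn_category(True,  False) == "MISD"  # e.g., systolic arrays
assert flynn_category(True,  True)  == "MIMD"  # e.g., multiprocessors
```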
SISD - SINGLE INSTRUCTION STREAM, SINGLE DATA STREAM
[Figure: SISD organization - a Control Unit issues an instruction stream (IS) to a single Processing Element (PE), which exchanges a data stream (DS) with Main Memory (M).]
SIMD - SINGLE INSTRUCTION STREAM, MULTIPLE DATA STREAMS
Applications:
• Image processing
• Matrix manipulations
• Sorting
• E.g., vector computers
A type of parallel computer.
Single instruction: all processing units execute the same instruction, issued by a
single control unit, at any given clock cycle; multiple processors execute the
instructions given by one control unit.
Multiple data: each processing unit can operate on a different data element. As shown
in the figure below, the processors are connected to shared memory or to an
interconnection network that supplies multiple data streams to the processing units.
Thus a single instruction is executed by different processing units on different sets of data.
SIMD ARCHITECTURES
Fine-grained:
• Typical of image processing applications
• Large number of PEs
• Minimum-complexity PEs
• The programming language is a simple extension of a sequential language
Coarse-grained:
• Each PE is of higher complexity and is usually built with commercial devices
• Each PE has local memory
MIMD - MULTIPLE INSTRUCTION STREAMS, MULTIPLE DATA STREAMS
Applications:
• Parallel computers
• Shared Memory
Multiple instruction: every processor may be executing a different instruction
stream.
Multiple data: every processor may be working with a different data stream; as
shown in the figure, the multiple data streams are provided by shared memory.
MIMD machines can be categorized as loosely coupled or tightly coupled, depending
on how data and control are shared.
Execution can be synchronous or asynchronous, deterministic or non-deterministic.
Different processors each process a different task.
MISD - MULTIPLE INSTRUCTION STREAMS, SINGLE DATA STREAM
Applications:
• Classification
• Robot vision
• Systolic array for pipelined execution
of specific algorithms
A single data stream is fed into multiple processing units.
Each processing unit operates on the data independently via an independent
instruction stream. As shown in the figure, the single data stream is forwarded
to different processing units, each connected to its own control unit and
executing the instructions that control unit issues.
Thus in these computers the same data flows through a linear array of processors
executing different instruction streams.
This architecture is also known as a systolic array, used for pipelined execution
of specific algorithms.
FLYNN’S TAXONOMY
Advantages of Flynn
» Universally accepted
» Compact Notation
» Easy to classify a system
Disadvantages of Flynn
» Very coarse-grain differentiation among machine systems
» Comparison of different systems is limited
» Interconnections, I/O, memory not considered in the scheme
High Performance Computing Applications
PERFORMANCE FACTORS
Processor cycle time (t, in nanoseconds) - the CPU is driven by a clock with a
constant cycle time (usually measured in nanoseconds), which controls the rate of
internal operations in the CPU.
Clock rate ( f = 1/t, f in megahertz) - inverse of the cycle time. A shorter clock
cycle time, or equivalently a larger number of cycles per second, implies more
operations can be performed per unit time.
Instruction count (Ic)- the number of machine instructions to be executed by the
program. Determines the size of the program. Different machine instructions require
different numbers of clock cycles to execute.
Average CPI (Cycles Per Instruction)- CPI is important to measure the
execution time of an instruction. Average CPI can be determined for a particular
processor if we know the frequency of occurrence of each instruction type.
The term CPI is used with respect to a particular instruction set and a given program
mix.
PERFORMANCE FACTORS
CPU time (T = Ic x CPI x t) – CPU time required to execute a program containing Ic
instructions. Each instruction must be fetched from memory, decoded, then operands
fetched from memory, the instruction executed, and the results stored.
Memory cycle time (k x t) - the time required to access memory, usually k times the
processor cycle time t. The value of k depends on the memory technology and the
processor-memory interconnection scheme. An instruction cycle may involve k
memory references (e.g., k = 4: one for instruction fetch, two for operand fetch,
and one for storing the result).
CPI (= p + m x k) - the processor cycles required for each instruction can be
attributed to cycles needed for instruction decode and execution (processor
cycles, p) and cycles needed for memory references (memory cycles = m x k).
Total CPU time –Effective CPU time needed to execute a program rewritten as
T = Ic x (p + m x k) x t
p is the number of processor cycles needed for the instruction decode and execution,
m is the number of memory references needed, k is the ratio between memory cycle and
processor cycle, Ic is the instruction count, and t is the processor cycle time.
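The cycle-count model above can be checked numerically. A minimal sketch; the values of Ic, p, m, k, and t below are made-up illustrative figures, not taken from the text:

```python
# Sketch of the CPU-time model T = Ic x (p + m x k) x t.
# All numeric values here are illustrative assumptions.

def cpu_time(Ic: int, p: float, m: float, k: float, t_ns: float) -> float:
    """Total CPU time (seconds) for Ic instructions, where each instruction
    needs p processor cycles plus m memory references of k cycles each,
    with a processor cycle time of t_ns nanoseconds."""
    cycles_per_instruction = p + m * k          # effective CPI
    return Ic * cycles_per_instruction * t_ns * 1e-9

# Example: 1,000,000 instructions, p = 2, m = 1.5, k = 4, t = 2 ns
T = cpu_time(1_000_000, p=2, m=1.5, k=4, t_ns=2)
print(T)  # 0.016 seconds (effective CPI = 2 + 1.5 * 4 = 8)
```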
System Attributes
The five performance factors (Ic, p, m, k, t) are influenced by four system attributes:

System Attribute                  Ic   p   m   k   t
Instruction set architecture      X    X
Compiler technology               X    X   X
CPU implementation & control           X           X
Cache & memory hierarchy                       X   X
• The instruction set architecture affects program length (Ic) and processor cycles (p)
• Compiler design affects the values of Ic, p & m.
• The CPU implementation & control determine the total processor time = p x t
• The memory technology & hierarchy design affect the memory access time = k x t
SYSTEM ATTRIBUTES
MIPS Rate - Let C be the total number of clock cycles needed to execute a given
program. Then the total CPU time can be estimated as T = C x t = C / f.
Furthermore, CPI = C / Ic, so T = Ic x CPI x t = Ic x CPI / f.
Processor speed is measured in millions of instructions per second (MIPS).
The MIPS rate varies with a number of factors, including the clock rate, the
instruction count (Ic), and the CPI of a given machine:
MIPS rate = Ic / (T x 10^6) = f / (CPI x 10^6) = (f x Ic) / (C x 10^6)
The MIPS rate is directly proportional to the clock rate and inversely
proportional to CPI. Equivalently,
CPU time, T = Ic / (MIPS x 10^6)
Throughput Rate
System throughput, Ws (in programs/second) - how many programs a system
can execute per unit time. It is measured across a large number of programs over a
long observation period.
CPU throughput, Wp (in programs/second) - in a multiprogrammed system, how many
programs can be executed per unit time, based on the MIPS rate and the average
program length Ic:
Wp = f / (Ic x CPI) = (MIPS x 10^6) / Ic
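A small numeric sketch of the MIPS and throughput formulas above; the 500 MHz clock, CPI of 2, and program length are illustrative assumptions:

```python
# Sketch of MIPS rate and CPU throughput:
#   MIPS = f / (CPI x 10^6);  Wp = f / (Ic x CPI)

def mips_rate(f_hz: float, cpi: float) -> float:
    """Millions of instructions per second."""
    return f_hz / (cpi * 1e6)

def throughput(f_hz: float, Ic: float, cpi: float) -> float:
    """Programs per second for an average program length of Ic instructions."""
    return f_hz / (Ic * cpi)

f = 500e6        # 500 MHz clock (illustrative)
cpi = 2.0
Ic = 100_000     # average instructions per program (illustrative)

print(mips_rate(f, cpi))       # 250.0 MIPS
print(throughput(f, Ic, cpi))  # 2500.0 programs/second
```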
Example:
Now, when the task given in the previous example is executed on a FOUR-processor
system with shared memory. Due to the need for synchronization among the FOUR
program parts, 2000 extra instructions are added to each part.
– Calculate the average CPI?
– Determine the corresponding MIPS rate?
– Calculate the speedup factor of the FOUR-processor system?
– Calculate the efficiency of the FOUR-processor system?
– Show the interconnection network of this system?
Solution:
Average CPI = 2 cycles/instruction
MIPS rate = (4 x 500 MHz) / 2 = 1000
Speedup = T1 / T4
T1 = Ic / (MIPS x 10^6) = 100000 / (250 x 10^6) = 0.400 msec
T4 = (100000 + 4 x 2000) / (1000 x 10^6) = 0.108 msec
Speedup = 0.400 / 0.108 = 3.703
Efficiency = Speedup / #Processors = 3.703 / 4 = 92.59%
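The arithmetic above can be replayed directly. The sketch assumes, from the omitted single-processor example, Ic = 100,000 instructions, f = 500 MHz, and CPI = 2 (hence a 250 MIPS single-processor rate):

```python
# Replaying the four-processor speedup example.
# Assumed from the (omitted) single-processor example:
#   Ic = 100,000 instructions, f = 500 MHz, CPI = 2 -> 250 MIPS.

f_mhz, cpi, n_proc = 500, 2, 4
Ic = 100_000
sync_overhead = 2_000                  # extra instructions per part

mips_1 = f_mhz / cpi                   # 250 MIPS on one processor
mips_4 = n_proc * f_mhz / cpi          # 1000 MIPS on four processors

T1 = Ic / (mips_1 * 1e6)                          # seconds
T4 = (Ic + n_proc * sync_overhead) / (mips_4 * 1e6)

speedup = T1 / T4
efficiency = speedup / n_proc
print(round(T1 * 1e3, 3), round(T4 * 1e3, 3))  # 0.4 0.108 (msec)
print(round(speedup, 3))                       # 3.704
print(round(efficiency * 100, 2))              # 92.59 (%)
```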
• For CPU design:
CPI = (Σi CPIi x Ici) / Ic
where:
CPIi: the average number of clock cycles for instruction type i.
Ici: the number of times instruction type i is executed in the program.
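A minimal sketch of this weighted-average CPI computation; the instruction mix below is invented for illustration:

```python
# Weighted average CPI: CPI = sum(CPI_i * Ic_i) / Ic.
# The instruction mix below is an invented example.

mix = [
    # (type, count Ic_i, cycles CPI_i)
    ("ALU",    4_000_000, 1),
    ("load",   2_000_000, 3),
    ("store",  1_000_000, 2),
    ("branch", 1_000_000, 2),
]

Ic = sum(count for _, count, _ in mix)                    # total instructions
avg_cpi = sum(count * cpi for _, count, cpi in mix) / Ic  # weighted average
print(avg_cpi)  # 1.75
```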
Example
Suppose you have made the following measurements:
– Frequency of FP operations (other than FPSQR) = 25%
– Average CPI of FP operations = 4
– Average CPI of other operations = 1.33
– Frequency of FPSQR = 2%
– CPI of FPSQR = 20
Assume the two design alternatives are to decrease the CPI of FPSQR to 2, or to
decrease the average CPI of all FP operations to 2.5. Compare these two design
alternatives.
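One way to work the comparison is via the weighted-CPI formula. The sketch below reads the figures in the classic Hennessy-Patterson way, where the 25% FP frequency includes the 2% FPSQR; treat that interpretation as an assumption:

```python
# Comparing the two design alternatives via weighted CPI.
# Assumption: the 25% FP frequency includes the 2% FPSQR,
# so the remaining 75% of instructions have CPI 1.33.

fp_freq, fp_cpi = 0.25, 4.0
other_freq, other_cpi = 0.75, 1.33
sqr_freq, sqr_cpi = 0.02, 20.0

cpi_orig = fp_freq * fp_cpi + other_freq * other_cpi  # ~2.00

# Alternative 1: FPSQR CPI drops from 20 to 2.
cpi_alt1 = cpi_orig - sqr_freq * (sqr_cpi - 2.0)      # ~1.64

# Alternative 2: all FP operations drop to CPI 2.5.
cpi_alt2 = fp_freq * 2.5 + other_freq * other_cpi     # ~1.62

print(round(cpi_orig, 4), round(cpi_alt1, 4), round(cpi_alt2, 4))
# The lower CPI wins: alternative 2 comes out slightly better here.
```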
Amdahl’s Law
A program (or algorithm) which can be parallelized can be split up into two
parts:
A part which cannot be parallelized and
A part which can be parallelized
Eg:
Imagine a program that processes files from disk. A small part of that program
may scan the directory and create a list of files internally in memory. After that,
each file is passed to a separate thread for processing. The part that scans the
directory and creates the file list cannot be parallelized, but processing the
files can be done in parallel.
Total time taken to execute the program only serially is called T.
The time T includes the time of both the non-parallelizable and parallelizable
parts.
T = Total time of serial execution
B = Total time of non-parallelizable part
T - B = Total time of parallelizable part (when executed serially, not in parallel)
First of all, a program can be broken up into a non-parallelizable part taking
time B and a parallelizable part taking time T - B, as illustrated by the
diagram below.
[Figure: a bar of total length T(1), divided into the non-parallelizable part B and the parallelizable part T - B.]
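The speedup implied by this decomposition can be sketched as follows. B and T are as defined above; the example numbers are invented, and perfect parallelization of the T - B part is assumed:

```python
# Amdahl's law sketch: with serial time T, non-parallelizable part B,
# and n processors, the parallel execution time is T(n) = B + (T - B) / n.

def amdahl_time(T: float, B: float, n: int) -> float:
    """Execution time on n processors (perfect parallelization assumed)."""
    return B + (T - B) / n

def speedup(T: float, B: float, n: int) -> float:
    """Speedup over the fully serial execution."""
    return T / amdahl_time(T, B, n)

# Invented example: 10 s total serial time, of which 1 s is inherently serial.
T, B = 10.0, 1.0
for n in (1, 2, 4, 1000):
    print(n, round(speedup(T, B, n), 2))
# Even with 1000 processors the speedup approaches, but never reaches, T/B = 10.
```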
[Figures: two parallel-machine organizations. (a) Shared-memory system: processors P1 ... Pn and I/O units I/O1 ... I/On connected through an interconnection network to a shared memory. (b) Distributed-memory system: processing elements PE1 ... PEn, each with a local memory M1 ... Mn, connected by an interconnection network.]
Most commonly represented today by Symmetric Multiprocessor (SMP)
machines
• Identical processors
• Equal access and access times to memory
• Sometimes called CC-UMA (Cache Coherent UMA). Cache coherent means that
if one processor updates a location in shared memory, all the other
processors know about the update. Cache coherency is accomplished at the
hardware level.