
Advanced Computer Architecture

17CS72/15CS72

1
CHAPTER 1: PARALLEL COMPUTING MODEL

THE STATE OF COMPUTING

MULTIPROCESSORS AND MULTICOMPUTERS

MULTIVECTOR AND SIMD COMPUTERS

PRAM AND VLSI MODELS

2
Computer Development Milestones
How it all started…
o 500 BC: Abacus (China) – the earliest mechanical calculating device.
o Operated to perform decimal arithmetic with carry propagation digit by digit
o 1642: Mechanical Adder/Subtractor (Blaise Pascal)
o 1827: Difference Engine (Charles Babbage)
o 1941: First binary mechanical computer (Konrad Zuse; Germany)
o 1944: Harvard Mark I (IBM)
o The first electromechanical decimal computer, proposed by Howard Aiken

3
COMPUTER DEVELOPMENT MILESTONES
Computer Generations: the division into generations is marked primarily by changes in hardware and software technologies.

Generation | Technology & Architecture | Software & Applications | Representative systems
First (1945-54) | Vacuum tubes and relay memories; CPU driven by PC and accumulator; fixed-point arithmetic | Machine/assembly languages; single user; no subroutine linkage; programmed I/O using CPU | ENIAC, Princeton IAS, IBM 701
Second (1955-64) | Discrete transistors and core memories; floating-point arithmetic; I/O processors; multiplexed memory access | HLLs used with compilers; subroutine libraries; batch processing monitor | IBM 7090, CDC 1604, Univac LARC
Third (1965-74) | Integrated circuits (SSI/MSI); microprogramming; pipelining; cache and lookahead processors | Multiprogramming and time-sharing OS; multiuser applications | IBM 360/370, CDC 6600, TI-ASC, PDP-8
Fourth (1975-90) | LSI/VLSI and semiconductor memory; multiprocessors; vector supercomputers; multicomputers | Multiprocessor OS; languages, compilers and environments for parallel processing | VAX 9000, Cray X-MP, IBM 3090, BBN TC-2000
Fifth (1991-present) | Advanced VLSI processors, memory and switches; high-density packaging; scalable architectures | Superscalar processors; systems on a chip; massively parallel processing; grand challenge applications; heterogeneous processing | See Tables 1.3-1.6 in the text
4
FIRST GENERATION (1945 – 54)
o Technology & Architecture:
• Vacuum Tubes
• Relay Memories
• CPU driven by PC and accumulator
• Fixed Point Arithmetic
o Software and Applications:
• Machine/Assembly Languages
• Single user
• No subroutine linkage
• Programmed I/O using CPU
o Representative Systems: ENIAC, Princeton IAS, IBM 701

5
SECOND GENERATION (1955 – 64)
o Technology & Architecture:
• Discrete Diodes, Transistors
• Core Memories
• Floating Point Arithmetic
• I/O Processors
• Multiplexed memory access
o Software and Applications:
• High level languages used with compilers
• Subroutine libraries
• Batch processing monitor
o Representative Systems: IBM 7090, CDC 1604, Univac LARC

6
THIRD GENERATION (1965 – 74)
o Technology & Architecture:
• IC Chips (SSI/MSI)
• Microprogramming
• Pipelining
• Cache
• Look-ahead processors
o Software and Applications:
• Multiprogramming and Timesharing OS
• Multiuser applications

o Representative Systems: IBM 360/370, CDC 6600, TI-ASC, PDP-8

7
FOURTH GENERATION (1975 – 90)
o Technology & Architecture:
• LSI/VLSI Multilayer PCBs
• Semiconductor memories
• Multiprocessors
• Multi-computers
• Vector supercomputers
o Software and Applications:
• Multiprocessor OS
• Languages, Compilers and environment for parallel processing
o Representative Systems: VAX 9000, Cray X-MP, IBM 3090

8
FIFTH GENERATION (1991 ONWARDS)
o Technology & Architecture:
• Advanced VLSI processors
• Scalable Architectures
• Superscalar processors
o Software and Applications:
• Systems on a chip
• Massively parallel processing
• Grand challenge applications
• Heterogeneous processing
o Representative Systems: S-81, IBM ES/9000, Intel Paragon, nCUBE 6480, MPP,
VPP500
9
ELEMENTS OF MODERN COMPUTING

• Computing Problems
• Algorithms and Data Structures
• Hardware Resources
• Operating System
• System Software Support
• Compiler Support

10
COMPUTING PROBLEMS
 A modern computer is an integrated system consisting of machine hardware, an
instruction set, system software, application programs, and user interfaces.

 The use of a computer is driven by real-life problems demanding cost-effective solutions.
 For numerical problems in science and technology, the solutions demand complex
mathematical formulations and intensive integer or floating-point computations.
 For artificial intelligence (AI) problems, the solutions demand logic inference and
symbolic manipulation.

Algorithms and Data Structures


 Traditional algorithms and data structures are designed for sequential machines.
 New, specialized algorithms and data structures are needed to exploit the capabilities
of parallel architectures.
 These often require interdisciplinary interactions among theoreticians,
experimentalists, and programmers.

11
HARDWARE RESOURCES
 Processors, memory, and peripheral devices form the hardware core of a computer system.
 A modern computer system demonstrates its power through coordinated efforts by hardware
resources, an operating system, and application software.
 Special hardware interfaces are often built into I/O devices such as display terminals and
workstations.
 In addition, software interface programs are needed. These include file
transfer systems, editors, word processors, device drivers, interrupt handlers, network
communication programs, etc.
Operating System
Operating systems manage the allocation and deallocation of resources during user program
execution.
UNIX, Mach, and OSF/1 provide support for:
 multiprocessors and multicomputers
 multithreaded kernel functions
 virtual memory management
 file subsystems
 network communication services
An OS plays a significant role in mapping hardware resources to algorithms and data structures.
12
SYSTEM SOFTWARE SUPPORT
 Compilers, assemblers, and loaders are traditional tools for developing programs in
high-level languages.
 Together with the operating system, these tools determine the binding of resources to
applications, and the effectiveness of this binding determines the efficiency of hardware
utilization and the system's programmability.
 Most programmers still employ a sequential mindset, abetted by a lack of
popular parallel software support.
 Parallel software can be developed using entirely new languages designed
specifically with parallelism as the goal, or by using extensions to existing
sequential languages.
 New languages have obvious advantages but require additional programmer
education and system software.
 The most common approach is to extend an existing sequential language.

13
COMPILER SUPPORT
Preprocessors
use existing sequential compilers and specialized libraries to implement parallel
constructs
Precompilers
perform some program flow analysis, dependence checking, and limited parallel
optimizations
Parallelizing Compilers
require full detection of parallelism in source code and transformation of sequential
code into parallel constructs
Compiler directives are often inserted into source code to aid compiler parallelizing
efforts

14
EVOLUTION OF COMPUTER
ARCHITECTURE
Over the past decades, computer architecture has gone through evolutionary, rather than
revolutionary, change.
Sustaining features are those that have proven to improve performance.
Starting with the von Neumann architecture (strictly sequential), architectures have
evolved to include lookahead, parallelism, and pipelining.

15
FLYNN’S CLASSIFICATION
Michael Flynn (1972) introduced a classification of computer architectures based on the notions
of instruction and data streams, viz.
SISD – Single instruction stream over a single data stream
SIMD – Single instruction stream over multiple data streams
MISD – Multiple instruction streams and a single data stream
MIMD – Multiple instruction streams over multiple data streams

Captions
IS – Instruction Stream
DS – Data Stream
CU – Control Unit
PU – Processing Unit
MU – Memory Unit

16
SIMD

Captions
IS – Instruction Stream
DS – Data Stream
CU – Control Unit
PE – Processing Element
LM – Local Memory

17
MISD

Captions
IS – Instruction Stream
DS – Data Stream
CU – Control Unit
PU – Processing Unit
MU – Memory Unit
PE – Processing Element
LM – Local Memory
18
MIMD

Captions
IS – Instruction Stream
DS – Data Stream
MIMD architecture (with Shared Memory) CU – Control Unit
PU – Processing Unit
MU – Memory Unit
PE – Processing Element
LM – Local Memory
19
DEVELOPMENT LAYERS

20
SYSTEM ATTRIBUTES TO PERFORMANCE
Machine Capability and Program Behavior
 • Peak Performance
 • Turnaround time
 • Cycle Time, Clock Rate and Cycles Per Instruction (CPI)

• Performance Factors
 o Instruction Count, Average CPI, Cycle Time, Memory Cycle Time and No. of memory cycles

• System Attributes
 o Instruction Set Architecture, Compiler Technology, Processor Implementation and control,

Cache and Memory Hierarchy


• MIPS Rate, FLOPS and Throughput Rate
• Programming Environments – Implicit and Explicit Parallelism

21
SYSTEM ATTRIBUTES AFFECTING PERFORMANCE
Clock Rate and Cycles Per Instruction (CPI)
Performance Factors
Let Ic = instruction count, τ = processor cycle time (τ = 1/f), and T = CPU time. Then

T = Ic × CPI × τ

T = Ic × (p + m × k) × τ, where

 p is the number of processor cycles needed per instruction,
 m is the number of memory references needed per instruction,
 k is the ratio of memory cycle time to processor cycle time.
22
SYSTEM ATTRIBUTES
The five performance factors (Ic, p, m, k, τ) are influenced by four system attributes,
viz.
Instruction Set Architecture
Compiler Technology
Processor Implementation and Control
Cache and Memory Hierarchy (see the table on the next slide)

23
MIPS RATE

Million instructions per second – MIPS

Let C be the total number of clock cycles required to execute a program.
Then the CPU time in the equation T = Ic × (p + m × k) × τ can be estimated as
T = C × τ = C / f
Further, CPI = C / Ic and T = Ic × CPI × τ = Ic × CPI / f
Processor speed is often measured in MIPS; the MIPS rate varies with a number of factors,
including the clock rate f and the CPI of a given machine, and is defined as

MIPS rate = Ic / (T × 10^6) = f / (CPI × 10^6) = (f × Ic) / (C × 10^6)

24
THROUGHPUT RATE
Another important measure is the number of programs a system can execute per
unit time, called the system throughput Ws, expressed in programs/second.
In a multiprogramming environment the system throughput Ws is often lower than the CPU
throughput Wp, which is defined as
Wp = f / (Ic × CPI)   (programs/second)
Note that Wp = (MIPS × 10^6) / Ic
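
As a quick numeric illustration (not from the text), the minimal Python sketch below evaluates the formulas above; all parameter values (clock rate, instruction count, p, m, k) are made-up assumptions.

```python
# Minimal sketch of the performance formulas above; all numbers are
# illustrative assumptions, not figures from the text.
def cpu_time(ic, p, m, k, tau):
    """T = Ic x (p + m x k) x tau."""
    return ic * (p + m * k) * tau

def mips_rate(ic, t):
    """MIPS = Ic / (T x 10^6)."""
    return ic / (t * 1e6)

def cpu_throughput(f, ic, cpi):
    """Wp = f / (Ic x CPI), in programs per second."""
    return f / (ic * cpi)

f = 40e6                 # assumed 40 MHz clock
tau = 1 / f              # processor cycle time
ic = 200_000             # assumed instruction count of one program
p, m, k = 2, 1, 4        # assumed cycles/instr, memory refs/instr, cycle ratio
t = cpu_time(ic, p, m, k, tau)
cpi = p + m * k          # effective cycles per instruction
print(f"T = {t*1e3:.1f} ms, MIPS = {mips_rate(ic, t):.2f}, "
      f"Wp = {cpu_throughput(f, ic, cpi):.1f} programs/s")
```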

25
MULTIPROCESSORS AND MULTICOMPUTERS
• Shared Memory Multiprocessors
o The UMA Model
o The NUMA Model
o The COMA Model
o The CC-NUMA Model

• Distributed-Memory Multicomputers
 o The NORMA Machines
 o Message Passing multi-computers

• Taxonomy of MIMD Computers


• Representative Systems
 o Multiprocessors: BBN TC-2000, MPP, S-81, IBM ES/9000 Model 900/VF,
 o Multicomputers: Intel Paragon XP/S, nCUBE/2 6480, SuperNode 1000, CM5, KSR-1

26
THE UNIFORM MEMORY ACCESS (UMA)
MULTIPROCESSOR MODEL

27
UNIFORM MEMORY ACCESS
UMA is a shared memory architecture used in parallel computers.
All the processors in the UMA model share the physical memory uniformly.
In a UMA architecture, access time to a memory location is independent of
which processor makes the request or which memory chip contains the
transferred data.
Uniform memory access computer architectures are often contrasted
with non-uniform memory access (NUMA) architectures.
In the UMA architecture, each processor may use a private cache.
Peripherals are also shared in some fashion.
The UMA model is suitable for general purpose and time sharing
applications by multiple users. It can be used to speed up the execution of a
single large program in time-critical applications
28
THE TWO NON-UNIFORM
MEMORY
ACCESS (NUMA) MODELS FOR
MULTIPROCESSOR SYSTEM

29
NON-UNIFORM MEMORY ACCESS

NUMA is a computer memory design used in multiprocessing, where the memory
access time depends on the memory location relative to the processor.

Under NUMA, a processor can access its own local memory faster than non-local
memory (memory local to another processor or memory shared between
processors).

The benefits of NUMA are limited to particular workloads, notably on servers
where the data is often associated strongly with certain tasks or users.

30
THE CACHE ONLY MEMORY ARCHITECTURE
(COMA) MODEL OF MULTIPROCESSOR
Captions
P – Processor
D – Directory
C – Cache

Example KSR – 1

31
CACHE-ONLY MEMORY
ARCHITECTURE –(COMA)
COMA is a computer memory organization for use in multiprocessors in which the local
memories (typically DRAM) at each node are used as cache.
This is in contrast to using the local memories as actual main memory, as
in NUMA organizations
In NUMA, each address in the global address space is typically assigned a fixed home node.
When processors access some data, a copy is made in their local cache, but space remains
allocated in the home node.
Instead, with COMA, there is no home. An access from a remote node may cause that data
to migrate. Compared to NUMA, this reduces the number of redundant copies and may allow
more efficient use of the memory resources.
On the other hand, it raises the problems of how to find a particular piece of data and what to do when a
local memory fills up. Hardware memory-coherence mechanisms are typically used to
implement the migration.

32
GENERIC MODEL OF A MESSAGE-PASSING
MULTICOMPUTER

33
The CC-NUMA Model

Nearly all CPU architectures use a small amount of very fast non-shared memory
known as cache to exploit locality of reference in memory accesses.

Typically, CC-NUMA uses inter-processor communication between cache controllers


to keep a consistent memory image when more than one cache stores the same
memory location.

34
MULTI VECTOR AND SIMD COMPUTERS
Vector Processors
o Vector Processor Variants
 • Vector Supercomputers • Attached Processors

o Vector Processor Models/Architectures


 • Register-to-register architecture • Memory-to-memory architecture

o Representative Systems:
 • Cray-1
 • Cray Y-MP (2,4, or 8 processors with 16Gflops peak performance)
 • Convex C1, C2, C3 series (C3800 family with 8 processors, 4 GB main memory, 2 Gflops peak
performance)

• DEC VAX 9000 (pipeline chaining support)

35
36
MULTI VECTOR AND SIMD
COMPUTERS
• SIMD Supercomputers
o SIMD Machine Model
• S = < N, C, I, M, R >
• N: No. of PEs in the machine
• C: Set of instructions (scalar/program flow) directly executed by control unit
• I: Set of instructions broadcast by CU to all PEs for parallel execution
• M: Set of masking schemes
• R: Set of data routing functions
o Representative Systems: • MasPar MP-1 (1024 to 16384 PEs)
• CM-2 (65536 PEs)
• DAP600 Family (up to 4096 PEs)
• Illiac-IV (64 PEs)
37
MULTI VECTOR AND SIMD COMPUTERS

38
PRAM AND VLSI MODELS
Parallel Random Access Machines
o Time and Space Complexities
• Time complexity
• Space complexity
• Serial and Parallel complexity
• Deterministic and Non-deterministic algorithm

o PRAM
• Developed by Fortune and Wyllie (1978)
• Objective: Modelling idealized parallel computers with zero synchronization or
memory access overhead

• An n-processor PRAM has a globally addressable Memory

39
PRAM MODEL OF MULTIPROCESSOR

40
PRAM & VLSI MODELS
Parallel Random Access Machines
o PRAM Models
o PRAM Variants
 EREW-PRAM Model
 CREW-PRAM Model
 ERCW-PRAM Model
 CRCW-PRAM Model

o Discrepancy with Physical Models


 Most popular variants: EREW (Exclusive Read Exclusive Write) and CRCW (Concurrent Read Concurrent
Write)
 An SIMD machine with shared memory is the closest architecture modelled by PRAM.
 PRAM allows different instructions to be executed on different processors simultaneously; thus, PRAM really
operates in synchronized MIMD mode with shared memory.

41
PRAM & VLSI MODELS
VLSI Complexity Model
o The AT² Model
• Memory bound on chip area A
• I/O bound on volume A·T
• Bisection communication bound (cross-section area) √A·T
• The square of this bound, A·T², is used as a lower bound
42
CHAPTER 2: PROGRAM AND NETWORK PROPERTIES

DATA AND RESOURCE DEPENDENCIES

Program segments cannot be executed in parallel unless they are independent.

Independence comes in several forms:
• Data dependence: data modified by one segment must not be modified by another parallel segment
• Control dependence: if the control flow of a segment cannot be identified before run time, then the data
dependence between the segments is variable
• Resource dependence: even if several segments are independent in other ways, they cannot be executed
in parallel if there aren't sufficient processing resources (e.g. functional units)
PROGRAM & NETWORK PROPERTIES
Conditions of Parallelism
The ordering relationship between statements is indicated by data dependence
 Data Dependence
1. Flow dependence
2. Antidependence
3. Output dependence
4. I/O dependence
5. Unknown dependence

44
DATA DEPENDENCE
1. Flow dependence
A statement S2 is flow-dependent on statement S1 if an execution path exists from S1 to S2 and if at least
one output (variable assigned) of S1 feeds in as an input (operand to be used) to S2.
Flow dependence is denoted S1 → S2.
2. Antidependence
Statement S2 is antidependent on statement S1 if S2 follows S1 in program order and if the output of S2
overlaps the input to S1. A direct arrow crossed with a bar from S1 to S2 denotes antidependence.
3. Output dependence
Two statements are output-dependent if they produce (write) the same output variable.
A direct arrow with a small circle from S1 to S2 indicates output dependence.
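
To make the three dependence types concrete, here is a small, hypothetical code fragment (not from the text) with each statement annotated with the dependences it creates:

```python
# Hypothetical fragment illustrating the dependence types defined above.
a = 10          # S1
b = a + 1       # S2: flow dependence  S1 -> S2  (S2 reads 'a' written by S1)
a = b * 2       # S3: antidependence with S2 (S3 overwrites 'a' read by S2)
                #     and flow dependence S2 -> S3 (S3 reads 'b' written by S2)
b = 5           # S4: output dependence with S2 (both write 'b')
```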

45
DATA DEPENDENCE
4. I/O dependence
Read and write are I/O statements. I/O dependences occur not because the same variable is involved but
because the same file is referenced by both I/O statements.
5. Unknown dependence
The dependence relation between two statements cannot be determined in the following situations:
 The subscript of a variable is itself subscripted.
 The subscript does not contain the loop index variable.
 A variable appears more than once with subscripts having different coefficients of the loop
variable.
 The subscript is nonlinear in the loop index variable.
When one or more of these conditions exist, a conservative assumption is to claim unknown
dependence among the statements involved.

46
47
48
49
50
DATA DEPENDENCE -EXAMPLE
Consider the following code fragment involving I/O operation:
S1: Read(4), A(I) Read array A from file 4
S2: Process Process data
S3: Write (4), B(I) Write array B into file 4
S4: Close close file 4

S1 and S3 are I/O dependent on each other as they both access the same file.
The above data dependence relations should not be arbitrarily violated during program execution;
otherwise, erroneous results may be produced by a changed program order.

S1 ←→ S3

I/O dependence caused by accessing the same file by the read and write statements

51
CONTROL DEPENDENCE
 This refers to the situation where the order of execution of statements cannot be determined
before run time.
 Eg. Conditional statements will not be resolved until run time.
 Different paths taken after a conditional branch may introduce or eliminate data dependence
among instructions.
 Dependence may also exist between operations performed in successive iterations of a
looping procedure.
Resource Dependence
This is concerned with conflicts in using shared resources, such as integer units, floating-point units,
registers, and memory areas, among parallel events.

When the conflicting resource is an ALU, it is known as ALU dependence.

If the conflict involves workplace storage, it is called storage dependence.

52
BERNSTEIN CONDITION
 Bernstein revealed a set of conditions based on which two processes can execute in parallel. A process is a
software entity corresponding to the abstraction of a program fragment defined at various processing levels.
 We define the input set Ii of a process Pi as the set of input variables needed to execute the process. Similarly,
the output set Oi consists of all output variables generated after execution of the process Pi.
 Consider two processes P1 and P2 with their input sets I1 and I2 and output sets O1 and O2, respectively. These
two processes can execute in parallel, denoted P1 || P2, if they are independent and therefore produce
deterministic results.
 Formally, these conditions are stated as:

I1 ∩ O2 = ∅,   I2 ∩ O1 = ∅,   O1 ∩ O2 = ∅

These three conditions are known as Bernstein's conditions.

53
BERNSTEIN CONDITION
 The input set Ii is also called the read set or domain of Pi; similarly, the output set Oi is called the
write set or range of the process Pi.
 The conditions simply imply that two processes can execute in parallel if they are flow-
independent, anti-independent, and output-independent.
 The parallel execution of two such processes produces the same result regardless of whether
they are executed sequentially or in parallel.
 In general, a set of processes P1, P2, ..., Pk can execute in parallel if Bernstein's conditions are
satisfied on a pairwise basis, i.e. P1 || P2 || P3 || ... || Pk if and only if Pi || Pj for all i ≠ j.
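
As an illustration of the pairwise test just described, here is a minimal Python sketch; the processes and their read/write sets are hypothetical examples, not taken from the text.

```python
# Sketch of Bernstein's conditions: P1 || P2 iff
# I1 ∩ O2 = ∅, I2 ∩ O1 = ∅ and O1 ∩ O2 = ∅.
def can_run_in_parallel(i1, o1, i2, o2):
    return not (i1 & o2) and not (i2 & o1) and not (o1 & o2)

# Example processes (hypothetical read/write sets):
# P1: c = a + b      P2: d = a * 2      P3: a = e - 1
I1, O1 = {"a", "b"}, {"c"}
I2, O2 = {"a"}, {"d"}
I3, O3 = {"e"}, {"a"}

print(can_run_in_parallel(I1, O1, I2, O2))  # True: independent
print(can_run_in_parallel(I1, O1, I3, O3))  # False: O3 overlaps I1
```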

54
55
56
HARDWARE & SOFTWARE PARALLELISM
HARDWARE Parallelism
This refers to the type of parallelism defined by the machine architecture and hardware multiplicity.
 It is a tradeoff between cost and performance.
 It reflects the resource utilization patterns of simultaneously executable operations.
 It can be used to indicate the peak performance of the processor resources.
 Parallelism in a processor is often characterized by the number of instruction issues per machine cycle.
SOFTWARE Parallelism
This type of parallelism is revealed in the program profile or in the program flow graph.
 It is a function of the algorithm, programming style, and program design.
 The program flow graph displays the patterns of simultaneously executable operations.

57
58
59
60
PROGRAM PARTITIONING & SCHEDULING
Grain Sizes and Latency
Grain size, or granularity, is a measure of the amount of computation involved in a software process.
The simplest measure is to count the number of instructions in a grain (program segment).
Grain size determines the basic program segments chosen for parallel processing. Grains are
commonly described as
 fine,
 medium or
 coarse depending on processing levels involved.

61
62
PARALLELISM LEVELS
Instruction level parallelism

Loop level parallelism

Procedure level parallelism

Subprogram level parallelism

Program level parallelism

63
64
65
66
67
68
69
70
71
72
GRAIN PACKING & SCHEDULING
Basic Concept of Partitioning
Example
A program graph shows the structure of a program. Each node in the program graph
corresponds to a computational unit in the program. The grain size is measured by the
number of basic machine cycles (both processor and memory cycles) needed to
execute all the operations within the node.
Each node is denoted by a pair (n, s), where n is the node id/name and s is the grain size
of the node. Thus the grain size reflects the number of computations involved in the program
segment.
The edge label (v, d) between two end nodes specifies the output variable v passed from the
source node to the destination node, and the communication delay d between them, which
includes all path and memory latency delays.
There are 17 nodes in the fine-grain graph and 5 in the coarse-grain program graph.
73
EXAMPLE PROGRAM
Var a, b, c, d, e, f, g, h, i, j, k, l, m, n, o, p, q
Begin
1. a := 1
2. b := 2
3. c := 3
4. d := 4
5. e := 5
6. f := 6
7. g := a x b
8. h := c x d
9. i := d x e
10. j := e x f
11. k := d x f
12. l := j x k
13. m := 4 x l
14. n := 3 x m
15. o := n x i
16. p := o x h
17. q := p x q
End

74
Nodes 1, 2, ..., 6 are memory reference (data fetch) operations.
Each takes one cycle to address and six cycles to fetch from
memory. All remaining nodes (7 to 17) are CPU operations,
each requiring two cycles to complete. After packing, the
coarse-grain nodes have larger grain sizes, ranging from 4 to 8.

75
76
77
78
79
PROCESS OF GRAIN
DETERMINATION &
SCHEDULING
Four major steps are involved in the grain determination and the process of
scheduling optimization:

 Step 1. Construct a fine-grain program graph.


 Step 2. Schedule the fine-grain computation.
 Step 3. Grain packing to produce the coarse grains.
 Step 4. Generate a parallel schedule based on the packed graph.

80
PROGRAM FLOW MECHANISM

81
82
PROGRAM FLOW MECHANISM

83
84
85
COMPARISON OF FLOW MECHANISMS

86
SYSTEM INTERCONNECT ARCHITECTURE

Direct networks for static connections

Indirect networks for dynamic connections

Static and dynamic networks are used for interconnecting computer subsystems and for constructing
multiprocessors or multicomputers.
 Static networks: point-to-point direct connections which do not change during execution.
 Dynamic networks: implemented with switched channels, which are dynamically configured to
match the communication demand in user programs.

87
GOALS AND ANALYSIS
The goals of an interconnection network are to provide
 low-latency
 high data transfer rate
 wide communication bandwidth

Analysis includes
 latency
 bisection bandwidth
 data-routing functions
 scalability of parallel architecture

88
TERMINOLOGY
 A network is usually represented by a graph with a finite number of nodes linked by directed
or undirected edges.
 Number of nodes in the graph: network size.
 Number of edges incident on a node: node degree d.
 The node degree reflects the number of I/O ports associated with a node, and should ideally be
small and constant.
 Diameter D: the maximum shortest path between any two nodes. The path
length is measured by the number of links traversed; the network diameter indicates the
maximum number of hops between any two nodes.
 Channel bisection width b: the minimum number of edges cut to split the network into two parts,
each having the same number of nodes.
 If each channel has w bit wires, then the wire bisection width is B = b·w. The bisection width provides a
good indication of the maximum communication bandwidth along the bisection of the network.
 Wire length = length of the edges between nodes.

89
DATA ROUTING FUNCTIONS
 Shifting and rotating

 Permutation(one to one)

 Broadcast(one to all)

 Multicast(many to many)

 Personalized broadcast(one to many)

 Shuffle

 Exchange

90
PERMUTATIONS
Given n objects, there are n! ways in which they can be reordered.
A permutation can be specified by giving the rule for reordering a group of objects.
Permutations can be implemented using crossbar switches, multistage networks, shifting, and
broadcast operations. The time required to perform permutations of the connections between nodes
often dominates the network performance when n is large.

Perfect Shuffle and Exchange


Let there be n processors P0, P1, ..., Pn−1, where n is a power of two. A one-way link exists from
each processor Pi to processor Pj, where
j = 2i          for 0 ≤ i ≤ n/2 − 1
j = 2i + 1 − n  for n/2 ≤ i ≤ n − 1
i.e. the binary address of i is shifted 1 bit to the left, with the most significant bit wrapped around to the least significant bit position.
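
Since the mapping above is just a one-bit cyclic left shift of the processor index, it can be sketched in a few lines of Python (the network size n = 8 is an illustrative assumption):

```python
# Perfect-shuffle mapping: a cyclic left shift of the log2(n)-bit
# binary representation of the processor index i.
def shuffle(i, n):
    bits = n.bit_length() - 1          # n is assumed to be a power of two
    return ((i << 1) | (i >> (bits - 1))) & (n - 1)

n = 8                                  # illustrative network size
for i in range(n):
    print(f"P{i} ({i:03b}) -> P{shuffle(i, n)}")
```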

91
HYPERCUBE ROUTING FUNCTIONS
If the vertices of an n-dimensional cube are labeled with n-bit numbers such that the addresses of
adjacent vertices differ in exactly one bit, then n routing functions are defined by the bits
in the node (vertex) address.
For example, with a 3-dimensional cube, we can easily identify routing functions that
exchange data between nodes with addresses that differ in the least significant, most
significant, or middle bit.
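
A minimal sketch of this idea: each routing function connects a node to the neighbour whose address differs in bit i (shown here for an assumed 3-cube).

```python
# Hypercube routing functions: flipping bit i of the node address gives
# the neighbour that differs in exactly that bit (3-cube shown).
def route(node, i):
    return node ^ (1 << i)

n_dim = 3
for node in range(2 ** n_dim):
    neighbours = [route(node, i) for i in range(n_dim)]
    print(f"{node:03b} -> {[f'{v:03b}' for v in neighbours]}")
```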

92
FACTORS AFFECTING PERFORMANCE
Functionality – how the network supports data routing, interrupt handling, synchronization,
request/message combining, and coherence
Network latency – worst-case time for a unit message to be transferred
Bandwidth – maximum data rate
Hardware complexity – implementation costs for wire, logic, switches, connectors, etc.
Scalability – how easily does the scheme adapt to an increasing number of processors,
memories, etc.?

93
STATIC NETWORKS

Linear Array
Ring and Chordal Ring
Barrel Shifter
Tree and Star
Fat Tree
Mesh and Torus

94
LINEAR ARRAY
n nodes connected by n − 1 links (not a bus); segments between different pairs of
nodes can be used in parallel.
Internal nodes have degree 2; end nodes have degree 1.
Diameter = n − 1
Bisection width = 1
For small n this is economical, but for large n it is clearly inappropriate.

Ring and Chordal Ring
Like a linear array, but the two end nodes are connected by an n-th
link; the ring can be uni- or bidirectional. The diameter is ⌊n/2⌋ for a
bidirectional ring, or n for a unidirectional ring.
By adding additional links, the node degree is increased, and we
obtain a chordal ring. This reduces the network diameter.
In the limit, we obtain a fully connected network, with a node degree
of n − 1 and a diameter of 1.

95
STATIC NETWORKS – BARREL SHIFTER
Like a ring, but with additional links between all pairs of nodes whose distance is a
power of 2.
With a network of size N = 2^n, each node has degree d = 2n − 1, and the network has diameter
D = n/2.
Barrel shifter connectivity is greater than that of any chordal ring of lower node degree.
A barrel shifter is much less complex than a fully interconnected network.

96
BINARY TREE , STAR &FAT TREE
A k-level completely balanced binary tree has N = 2^k − 1 nodes, with a maximum node
degree of 3 and a network diameter of 2(k − 1).
The balanced binary tree is scalable, since it has a constant maximum node degree.
A star is a two-level tree with a node degree d = N − 1 and a constant diameter of 2.
A fat tree is a tree in which the number of edges between nodes increases closer to the root.
The edges represent communication channels ("wires"), and since communication traffic
increases as the root is approached, it seems logical to increase the number of channels there.

97
MESH AND TORUS SYSTOLIC ARRAY
Pure mesh – N = n^k nodes, with links between each adjacent pair of nodes in a row or column (or higher
dimension). This is not a symmetric network; the interior node degree is d = 2k and the diameter is k(n − 1).
Illiac mesh (used in the Illiac IV computer) – wraparound is allowed, reducing the network diameter to
about half that of the equivalent pure mesh.
A torus has ring connections in each dimension, and is symmetric. An n × n binary torus has a node degree
of 4 and a diameter of 2⌊n/2⌋.
A systolic array is an arrangement of processing elements and communication links designed specifically to
match the computation and communication requirements of a specific algorithm.
This specialized character may yield better performance than more generalized structures, but it also makes
systolic arrays more expensive and more difficult to program.

98
HYPERCUBES, CUBE-CONNECTED CYCLES
A binary n-cube architecture has N = 2^n nodes spanning n dimensions, with two nodes per
dimension.
Hypercube scalability is poor, and packaging is difficult for higher-dimensional hypercubes.
k-cube-connected cycles (CCC) can be created from a k-cube by replacing each vertex of the k-
dimensional hypercube with a ring (cycle) of k nodes.
A k-cube can thus be transformed into a k-CCC with k × 2^k nodes.
The major advantage of a CCC is that each node has a constant degree of 3 (at the cost of longer latency than in the
corresponding k-cube). In that respect, it is more scalable than the hypercube architecture.
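
The degree/diameter/bisection formulas quoted on these slides can be evaluated directly; the sketch below does so for a few static topologies (the network sizes are illustrative assumptions, and the bisection widths for the ring, mesh, and hypercube are standard results not stated explicitly above).

```python
# Evaluate the degree/diameter/bisection formulas for a few static
# topologies; the chosen sizes are illustrative assumptions.
def linear_array(n): return dict(N=n, degree=2, diameter=n - 1, bisection=1)
def ring(n):         return dict(N=n, degree=2, diameter=n // 2, bisection=2)
def binary_tree(k):  return dict(N=2 ** k - 1, degree=3, diameter=2 * (k - 1), bisection=1)
def mesh_2d(r):      return dict(N=r * r, degree=4, diameter=2 * (r - 1), bisection=r)
def hypercube(n):    return dict(N=2 ** n, degree=n, diameter=n, bisection=2 ** (n - 1))

for name, props in [("linear array", linear_array(16)),
                    ("ring", ring(16)),
                    ("binary tree", binary_tree(5)),
                    ("2-D mesh", mesh_2d(4)),
                    ("hypercube", hypercube(4))]:
    print(f"{name:12s}: {props}")
```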

99
NETWORK THROUGHPUT
Network throughput – number of messages a network can handle in a unit time interval.
One way to estimate is to calculate the maximum number of messages that can be present in
a network at any instant (its capacity); throughput usually is some fraction of its capacity.
A hot spot is a pair of nodes that accounts for a disproportionately large portion of the total
network traffic (possibly causing congestion).
Hot-spot throughput is the maximum rate at which messages can be sent between two specific
nodes.

100
DYNAMIC CONNECTION
NETWORKS
Dynamic connection networks can implement all communication patterns
based on program demands.
In increasing order of cost and performance, these include
 bus systems
 multistage interconnection networks
 crossbar switch networks

Price can be attributed to the cost of wires, switches, arbiters, and connectors.
Performance is indicated by network bandwidth, data transfer rate, network
latency, and communication patterns supported.

101
BUS SYSTEMS
A bus system (contention bus, time-sharing bus) has
 a collection of wires and connectors
 multiple modules (processors, memories, peripherals, etc.) which connect to the wires
 data transactions between pairs of modules

Bus supports only one transaction at a time.


Bus arbitration logic must deal with conflicting requests.
Lowest cost and bandwidth of all dynamic schemes.
Many bus standards are available.

102
SWITCH MODULES
An a × b switch module has a inputs and b outputs. A binary switch has a = b = 2.
It is not necessary that a = b, but usually a = b = 2^k for some integer k.
In general, any input can be connected to one or more of the outputs. However, multiple
inputs may not be connected to the same output.
When only one-to-one mappings are allowed, the switch is called a crossbar switch.

103
MULTISTAGE NETWORKS
In general, a multistage network is comprised of a collection of a × b switch modules and
fixed network modules. The a × b switch modules are used to provide variable permutations
or other reorderings of the inputs, which are then further reordered by the fixed network
modules.
A generic multistage network consists of a sequence of alternating dynamic switches (with
relatively small values of a and b) and static networks (with larger numbers of inputs and
outputs). The static networks are used to implement inter-stage connections (ISC).

104
OMEGA NETWORK
A 2 × 2 switch can be configured for:
 Straight-through
 Crossover
 Upper broadcast (upper input to both outputs)
 Lower broadcast (lower input to both outputs)
 (No output is a somewhat vacuous possibility as well)

With four stages of eight 2 × 2 switches, and a static perfect shuffle for each of the four ISCs,
a 16-by-16 Omega network can be constructed (but not all permutations are possible).
In general, an n-input Omega network requires log2(n) stages of 2 × 2 switches, with n/2
switch modules per stage.
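
A minimal sketch of destination-tag (self-routing) control, the standard routing scheme for such shuffle-based networks: before each stage the line number is perfect-shuffled, then the switch output is selected by the next bit of the destination address, most significant bit first. The 8-input size is an illustrative assumption.

```python
import math

# Destination-tag routing through an n-input Omega network: before each
# stage of 2x2 switches the lines pass through a perfect shuffle; each
# switch then routes to its upper (0) or lower (1) output according to
# the next bit of the destination address, MSB first.
def omega_route(src, dst, n):
    stages = int(math.log2(n))
    path, cur = [src], src
    for k in range(stages):
        cur = ((cur << 1) | (cur >> (stages - 1))) & (n - 1)   # perfect shuffle (ISC)
        bit = (dst >> (stages - 1 - k)) & 1                    # destination-tag bit
        cur = (cur & ~1) | bit                                 # select switch output
        path.append(cur)
    return path

print(omega_route(src=2, dst=5, n=8))   # line numbers after each stage; ends at 5
```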

105
CROSSBAR NETWORKS
An m × n crossbar network can be used to provide a constant-latency connection between
devices; it can be thought of as a single-stage switch.
Different types of devices can be connected, yielding different constraints on which switches
can be enabled.
 With m processors and n memories, one processor may be able to generate requests for
multiple memories in sequence; thus several switches might be set in the same row.
 For m × m interprocessor communication, each PE is connected to both an input and an
output of the crossbar; only one switch in each row and column can be turned on
simultaneously. Additional control processors are used to manage the crossbar itself.

106
CHAPTER 3:
PRINCIPLES OF SCALABLE PERFORMANCE
Basics of Performance Evaluation
A sequential algorithm is evaluated in terms of its execution time, which is expressed as
a function of its input size.

For a parallel algorithm, the execution time depends not only on the input size but also
on factors such as the parallel architecture, the number of processors, etc.
Performance Metrics
 Parallel Run Time
 Speedup
 Efficiency

Standard Performance Measures
 Peak Performance
 Sustained Performance
 Instruction Execution Rate (in MIPS)
 Floating-Point Capability (in MFLOPS)

107
Parallel Runtime
The parallel run time T(n) of a program or application is the time required to run the
program on an n-processor parallel computer. When n = 1, T(1) denotes the sequential runtime
of the program on a single processor.
Speedup
Speedup S(n) is defined as the ratio of the time taken to run a program on a single processor to
the time taken to run the program on a parallel computer with n identical processors:
S(n) = T(1) / T(n)

It measures how much faster the program runs on a parallel computer than on a single
processor.

108
PARALLELISM PROFILE IN PROGRAMS
Degree of parallelism (DOP)
A discrete function of time: the number of processors used to execute the program at a given instant,
assuming only non-negative integer values.
The plot of DOP as a function of time is called the parallelism profile.
Average parallelism A
Consider n processing elements and a maximum parallelism m in the profile, with n >> m.
Let Δ be the computing capacity of a single processor (in MIPS or Mflops).
With i PEs busy, DOP = i.

Total work performed between times t1 and t2:

W = Δ ∫[t1, t2] DOP(t) dt

In discrete form, with t_i the total time during which DOP = i:

W = Δ Σ(i=1..m) i · t_i

Average parallelism:

A = (1 / (t2 − t1)) ∫[t1, t2] DOP(t) dt = [Σ(i=1..m) i · t_i] / [Σ(i=1..m) t_i]
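
A minimal sketch that evaluates W and A from a discrete parallelism profile using the formulas above; the profile values and the per-processor capacity Δ are illustrative assumptions.

```python
# Discrete parallelism profile: t[i] = total time (in some unit) during
# which DOP = i.  Values below are illustrative assumptions.
profile = {1: 4, 2: 3, 4: 2, 8: 1}      # DOP -> time spent at that DOP
delta = 10.0                            # assumed capacity of one PE (e.g. MIPS)

total_work = delta * sum(i * t for i, t in profile.items())    # W = Δ Σ i·t_i
avg_parallelism = (sum(i * t for i, t in profile.items())
                   / sum(profile.values()))                    # A = Σ i·t_i / Σ t_i

print(f"W = {total_work}, A = {avg_parallelism:.2f}")
```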
109
FACTORS AFFECTING
PARALLELISM PROFILES
Algorithm structure
Program optimization
Resource utilization
Run-time conditions
Realistically limited by # of available processors, memory, and other nonprocessor
resources

110
MEAN PERFORMANCE
Let {Ri} be the execution rates of programs i = 1, 2, ..., m.

Arithmetic mean:  Ra = (1/m) Σ(i=1..m) Ri

Harmonic mean:  Rh = m / Σ(i=1..m) (1/Ri)

Harmonic mean speedup

Example: a program has 2 phases, each with 100 instructions; phase 1 runs at 60 ips and phase 2 at 20 ips.
What is the average ips?
Total time = 100/60 + 100/20 = 20/3 s, so the average rate = 200 / (20/3) = 30 ips, i.e. the harmonic
mean of 60 and 20, not the arithmetic mean (40 ips).
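
A quick check of this example against the two mean definitions above (a small sketch, not part of the text):

```python
# Two phases of 100 instructions each, at 60 ips and 20 ips.
rates = [60, 20]
instructions_per_phase = 100

arithmetic_mean = sum(rates) / len(rates)                    # 40 ips (misleading)
harmonic_mean = len(rates) / sum(1 / r for r in rates)       # 30 ips

total_time = sum(instructions_per_phase / r for r in rates)  # 100/60 + 100/20
true_rate = len(rates) * instructions_per_phase / total_time # 200 instr / total time

print(arithmetic_mean, harmonic_mean, true_rate)             # 40.0 30.0 30.0
```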

111
AMDAHL'S LAW
This law is often used in parallel computing to predict the theoretical speedup when using multiple
processors.

Assume a fraction α of the workload is executed sequentially at rate R1 = 1 and the remaining
fraction 1 − α is executed at rate Rn = n on n processors; i.e. the weight vector is
w = (α, 0, ..., 0, 1 − α), with w1 = α and wn = 1 − α.

The harmonic-mean speedup then reduces to

Sn = n / (1 + (n − 1)α)

which approaches 1/α as n → ∞, so the speedup is bounded by the sequential fraction α.
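
A minimal sketch of the bound just derived; the α values chosen are illustrative assumptions.

```python
# Amdahl's law: Sn = n / (1 + (n - 1) * alpha), where alpha is the
# sequential fraction of the workload.  Alpha values are assumptions.
def amdahl_speedup(n, alpha):
    return n / (1 + (n - 1) * alpha)

for alpha in (0.01, 0.1, 0.5):
    print(f"alpha = {alpha}:",
          [round(amdahl_speedup(n, alpha), 2) for n in (2, 8, 64, 1024)],
          f"-> limit 1/alpha = {1 / alpha:.0f}")
```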

112
EFFICIENCY, UTILIZATION AND QUALITY
Several parameters are required for evaluating parallel computations.

Let O(n) = total number of (unit) operations performed by an n-processor system, and
T(n) = execution time in time steps.
In general T(n) < O(n) for n > 1, and T(1) = O(1).

Speedup factor:
S(n) = T(1) / T(n)

System efficiency:
E(n) = S(n) / n = T(1) / (n · T(n))

Redundancy and utilization:
R(n) = O(n) / O(1)
U(n) = R(n) · E(n) = O(n) / (n · T(n))

Quality of parallelism:
Q(n) = S(n) · E(n) / R(n)
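
These definitions translate directly into code; in the sketch below the operation counts and execution times are illustrative assumptions, not measured values.

```python
# Performance metrics of an n-processor run, per the definitions above.
def metrics(n, t1, tn, o1, on):
    s = t1 / tn                 # speedup S(n)
    e = s / n                   # efficiency E(n)
    r = on / o1                 # redundancy R(n)
    u = r * e                   # utilization U(n)
    q = s * e / r               # quality of parallelism Q(n)
    return dict(S=s, E=e, R=r, U=u, Q=q)

# Assumed example: 8 processors, T(1)=100, T(8)=16, O(1)=1000, O(8)=1200 ops.
print(metrics(n=8, t1=100, tn=16, o1=1000, on=1200))
```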
113
BENCHMARKS AND PERFORMANCE MEASURES

Dhrystone
A standard benchmark for all types of computers.
A CPU-intensive synthetic benchmark.
Measures integer performance (no floating point).
Unit: Kdhrystones
Whetstone
A synthetic benchmark measuring floating-point performance.
Unit: Kwhetstones
TPS (Transactions per second)
KLIPS (Kilo logic inferences per second)

114
SCALABILITY OF PARALLEL ALGORITHMS

Algorithm characteristics
Characteristics of a parallel algorithm:
• Deterministic vs. non-deterministic
• Computational granularity
• Parallelism profile
• Communication patterns and synchronization requirements
• Uniformity of the operations
• Memory requirements and data structures

ISOEFFICIENCY and the Isoefficiency Function

Efficiency:  E = w(s) / (w(s) + h(s, n)),  where w(s) is the workload for problem size s and h(s, n)
is the total communication overhead on n processors.

To hold E fixed, the workload must grow as  w(s) = [E / (1 − E)] · h(s, n).

• The isoefficiency function is therefore  w(s) = C · h(s, n),  with C = E / (1 − E).
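
A small sketch of how the isoefficiency relation can be used: assume an overhead function h(s, n) (the n·log2 n form below is purely an illustrative assumption), fix a target efficiency E, and compute how fast the workload w(s) must grow with n.

```python
# Isoefficiency sketch: w(s) must grow as C * h(s, n), with C = E / (1 - E),
# to hold the efficiency E constant.  The overhead model below is an
# illustrative assumption, not taken from the text.
import math

def required_workload(E, n, h):
    C = E / (1 - E)
    return C * h(n)

h = lambda n: n * math.log2(n)          # assumed communication overhead growth
for n in (4, 16, 64, 256):
    print(n, round(required_workload(E=0.8, n=n, h=h), 1))
```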

115
SCALABILITY ANALYSIS AND APPROACHES
Scalability means that the performance of a computer system increases linearly with respect to
the number of processors used for a given application.
Scalability metrics and goals:
Machine size (n)
The number of processors employed in the parallel computer system. A large machine size implies
more resources and more computing power.
Clock rate (f)
The clock rate determines the basic machine cycle.
Problem size (s)
The amount of computational workload. The problem size is directly proportional to the
sequential execution time T(s, 1).
CPU time (T)
The time required to execute a given program on a parallel machine with n processors
collectively; this parallel execution time is denoted T(s, n).
I/O demand (d)
The I/O demand in moving the program, data, and results associated with a given application run.

116
Communication overhead (h)
The amount of time spent on interprocessor communication, synchronization, and remote memory access. This
overhead also includes all noncompute operations which do not involve the CPU or I/O devices.
Computer cost (c)
The total cost of the hardware and software resources required to carry out the execution of a
program.
Programming overhead (p)
The programming overhead may slow down software productivity and thus implies a high
cost.

117
118
