CS 211: Computer Architecture


Computer Architecture

Instructor: Prof. Bhagi Narahari
Dept. of Computer Science
Course URL: www.seas.gwu.edu/~narahari/cs211/

• Part I: Processor Architectures
  - starting with simple ILP using pipelining
  - explicit ILP - EPIC
  - key concept: issue multiple instructions/cycle

• Part II: Multi-Processor Architectures
  - move from processor to system level
  - can utilize all the techniques covered thus far
  - i.e., the processors used in a multiprocessor can be EPIC
  - move from fine grain to medium/coarse grain
  - assume all processor issues are resolved when discussing system-level multiprocessor design

Moving from Fine-Grained to Coarser-Grained Computations . . .

Multi-Processor Architectures

• Introduce Parallel Processing
  - grains, mapping of s/w to h/w, issues
• Overview of Multiprocessor Architectures
  - shared-memory, distributed-memory
  - SIMD architectures
• Programming and Synchronization
  - programming constructs, synchronization constructs, cache
• Interconnection Networks
• Parallel algorithm design and analysis

Hardware and Software Parallelism

Hardware parallelism :
-- Defined by machine architecture and hardware multiplicity
-- Number of instruction issues per machine cycle
-- k issues per machine cycle : k-issue processor

Software parallelism :
-- Control and data dependence of programs
-- Compiler extensions
-- OS extensions (parallel scheduling, shared memory allocation, communication links)

Implicit parallelism :
-- Conventional programming language
-- Parallelizing compiler

Explicit parallelism :
-- Parallelizing constructs in programming languages
-- Parallel program development tools
-- Debugging, validation, testing, etc.

Software vs. Hardware Parallelism (Example)

[Figure: software parallelism of the example program is three cycles --
Cycle 1: four loads (L1, L2, L3, L4); Cycle 2: two multiplies (X1, X2);
Cycle 3: the add (+) and the subtract (-).]

Software vs. Hardware Parallelism (Example)

[Figure: hardware parallelism on a 2-issue processor (one memory access and
one arithmetic operation per cycle) -- the same program takes seven cycles:
L1; L2; L3 and X1; L4; X2; +; -.]

[Figure: hardware parallelism on a dual processor built from single-issue
processors -- six cycles: L1/L3; L2/L4; X1/X2; S1/S2; L5/L6; +/-. The extra
stores S1, S2 and loads L5, L6 are instructions added for interprocessor
communication (IPC).]

Types of Software Parallelism

Control Parallelism :
-- Two or more operations performed simultaneously
-- Forms : pipelining or multiple functional units
-- Limitations : pipeline length and multiplicity of functional units

Data Parallelism :
-- The same operation performed on many data elements
-- The highest potential for concurrency
-- Requires compilation support, parallel programming languages, and hardware redesign

Program Partitioning: Grains and Latencies

Grain :
-- Program segment to be executed on a single processor
-- Coarse-grain, medium-grain, and fine-grain

Latency :
-- Time measure of the communication overhead
-- Memory latency
-- Processor latency (synchronization latency)

Parallelism (Granularity) :
-- Instruction level (fine grain -- 20 instructions in a segment)
-- Loop level (fine grain -- 500 instructions)
-- Procedure level (medium grain -- 2000 instructions)
-- Subprogram level (medium grain -- thousands of instructions)
-- Job/program level (coarse grain)

Levels of Program Grains

Level 5 : Jobs and programs (coarse grain)
Level 4 : Subprograms, job steps, or parts of a program (medium grain)
Level 3 : Procedures, subroutines, or tasks (medium grain)
Level 2 : Nonrecursive loops or unfolded iterations (fine grain)
Level 1 : Instructions or statements (fine grain)

Finer grains offer a higher degree of parallelism, but also increasing
communication demand and scheduling overhead.

Partitioning and Scheduling

Grain Packing :
-- How to partition a program into program segments to get the shortest possible execution time ?
-- What is the optimal size of concurrent grains ?

Program Graph :
-- Each node (n, s) corresponds to a computational unit :
   n -- node name; s -- grain size
-- Each edge (v, d) between two nodes denotes the output variable v and the communication delay d

Example:
1. a := 1        10. j := e x f
2. b := 2        11. k := d x f
3. c := 3        12. l := j x k
4. d := 4        13. m := 4 x l
5. e := 5        14. n := 3 x m
6. f := 6        15. o := n x i
7. g := a x b    16. p := o x h
8. h := c x d    17. q := p x q
9. i := d x e

Fine-grain program graph (before packing) and coarse-grain program graph (after packing)

[Figure: the fine-grain graph has one node (n, s) per statement, e.g. nodes
1,1 through 6,1 for the constant assignments; after grain packing the nodes
are combined into coarse grains A,8; B,4; C,4; D,8; and E,6.]

Scheduling of the fine-grain and coarse-grain programs

[Figure: time lines of the fine-grain and coarse-grain (packed) schedules of
the example program.]

Multiprocessor Architectures: Program Flow Mechanisms

Control Flow :
-- Conventional computers
-- Instruction execution controlled by the PC
-- Instruction sequence explicitly stated in the user program

Data Flow :
-- Data-driven execution
-- Instructions executed as soon as their input data are available
-- Higher degree of parallelism at the fine-grain level

Reduction computers :
-- Demand driven
-- Instructions executed when their results are needed
-- Use a reduced instruction set

Multiprocessor Architectures: Scope of Course

• We will focus on parallel control flow architectures

Review: Parallel Processing Intro

• Long-term goal of the field: scale the number of processors to the size of the budget and the desired performance
• Successes today:
  - dense matrix scientific computing (petroleum, automotive, aeronautics, pharmaceuticals)
  - file servers, databases, web search engines
  - entertainment/graphics
• Machines today: workstations!!

Parallel Architecture

• Parallel architecture extends traditional computer architecture with a communication architecture
  - abstractions (HW/SW interface)
  - organizational structure to realize the abstraction efficiently

Parallel Framework for Communication

• Layers:
  - Programming Model:
    - Multiprogramming : lots of jobs, no communication
    - Shared address space : communicate via memory
    - Message passing : send and receive messages
    - Data Parallel : several processors operate on several data sets simultaneously and then exchange information globally and simultaneously (shared or message passing)
  - Communication Abstraction:
    - Shared address space : e.g., load, store, atomic swap
    - Message passing : e.g., send, receive library calls
    - Debate over this topic (ease of programming, large scaling) => many hardware designs 1:1 with a programming model

Shared Address/Memory Multiprocessor Model

• Communicate via Load and Store
  - Oldest and most popular model
• Based on timesharing: processes on multiple processors vs. sharing a single processor
• Process: a virtual address space and 1 thread of control
  - Multiple processes can overlap (share), but ALL threads share a process address space
• Writes to the shared address space by one thread are visible to reads of other threads (a two-thread sketch follows below)
  - Usual model: share code, private stack, some shared heap, some private heap

Example: Small-Scale MP Designs

• Memory: centralized with uniform access time ("UMA") and bus interconnect, I/O
• Examples: Sun Enterprise 6000, SGI Challenge, Intel SystemPro

[Figure: four processors, each with one or more levels of cache, share a bus
to main memory and the I/O system.]
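A minimal sketch of this load/store communication (an added illustration, not from the slides; the names payload and ready are invented): one thread stores the data and sets a flag, the other spins on the flag and then loads the data. C11 release/acquire ordering stands in for the hardware support that makes the write visible.

/* Two threads communicating through ordinary loads and stores
 * to a shared address space -- illustrative sketch only.        */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int payload;                    /* shared data: written, then read */
static atomic_int ready = 0;           /* flag from producer to consumer  */

static void *producer(void *arg) {
    payload = 42;                                       /* store the data */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

static void *consumer(void *arg) {
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;                                               /* spin on flag   */
    printf("consumer read %d\n", payload);              /* load the data  */
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}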

SMP Interconnect

• Processors to memory AND to I/O
• Bus based: all memory locations have equal access time, so SMP = "Symmetric MP"
  - Sharing limits BW as processors and I/O are added
• Crossbar: expensive to expand
• Multistage network: less expensive to expand than a crossbar, with more BW than a bus
• "Dance Hall" designs: all processors on the left, all memories on the right

Large-Scale MP Designs

• Memory: distributed with nonuniform access time ("NUMA") and scalable interconnect (distributed memory)
• Examples: Cray T3E (see Ch. 1, Figs 1-21, page 45 of [CSG96])

[Figure: each node pairs a processor + cache with local memory and I/O; nodes
are connected by a low-latency, high-reliability interconnection network.
Typical access times: 1 cycle to cache, 40 cycles to local memory, 100 cycles
across the network.]

Shared Address Model Summary

• Each processor can name every physical location in the machine
• Each process can name all data it shares with other processes
• Data transfer via load and store
• Data size: byte, word, ... or cache blocks
• Uses virtual memory to map virtual addresses to local or remote physical addresses
• Memory hierarchy model applies: communication now moves data to the local processor cache (as a load moves data from memory to cache)
  - Latency, BW (cache block?), scalability when communicating?

Message Passing Model

• Whole computers (CPU, memory, I/O devices) communicate as explicit I/O operations
  - Essentially NUMA, but integrated at the I/O devices rather than the memory system
• Send specifies a local buffer + the receiving process on the remote computer
• Receive specifies the sending process on the remote computer + a local buffer to place the data
  - Usually send includes a process tag and receive has a rule on the tag: match one, match any
  - Synch: when send completes, when buffer is free, when request is accepted, receive waits for send
• Send + receive => memory-memory copy, where each side supplies a local address, AND does pairwise synchronization!

Message Passing Model

• Send + receive => memory-memory copy and synchronization on the OS, even on 1 processor
• History of message passing:
  - Network topology was important because a node could only send to an immediate neighbor
  - Typically synchronous, blocking send & receive
  - Later DMA with non-blocking sends; DMA for receive into a buffer until the processor does a receive, and then the data is transferred to local memory
  - Later SW libraries to allow arbitrary communication
• Example: IBM SP-2, RS6000 workstations in racks
  - Network Interface Card has an Intel 960
  - 8x8 crossbar switch as the communication building block
  - 40 MByte/sec per link

Communication Models

• Shared Memory
  - Processors communicate through a shared address space
  - Easy on small-scale machines
  - Advantages:
    - Model of choice for uniprocessors, small-scale MPs
    - Ease of programming
    - Lower latency
    - Easier to use hardware-controlled caching
• Message passing
  - Processors have private memories, communicate via messages
  - Advantages:
    - Less hardware, easier to design
    - Focuses attention on costly non-local operations
• Can support either SW model on either HW base
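For concreteness, a minimal message-passing sketch using the MPI library (an added illustration, not part of these slides): rank 0 sends one integer and rank 1 performs a blocking receive, matching the send/receive model described above.

/* Blocking send/receive between two processes -- MPI sketch. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 211;
        /* send: local buffer + destination process + tag */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* receive: local buffer + source process + tag rule (exact match here) */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}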

Popular Flynn Architecture Categories

• SISD (Single Instruction, Single Data)
  - Uniprocessors
• MISD (Multiple Instruction, Single Data)
  - ???
• SIMD (Single Instruction, Multiple Data)
  - Examples: Illiac-IV, CM-2
  - Simple programming model
  - Low overhead
  - Flexibility
  - All custom integrated circuits
• MIMD (Multiple Instruction, Multiple Data)
  - Examples: Sun Enterprise 5000, Cray T3D, SGI Origin
  - Flexible
  - Use off-the-shelf micros

SISD : A Conventional Computer

[Figure: a single processor consumes one instruction stream and one data input
stream and produces one data output stream.]

=> Speed is limited by the rate at which the computer can transfer information internally.
Ex: PC, Macintosh, workstations

The MISD Architecture

[Figure: processors A, B, and C each receive their own instruction stream but
operate on a single shared data input stream, producing one output stream.]

=> More of an intellectual exercise than a practical configuration. Few were built, and none are commercially available.

SIMD Architecture

[Figure: a single instruction stream drives processors A, B, and C; each
processor has its own data input stream and data output stream,
e.g., Ci <= Ai * Bi.]

Ex: CRAY vector processing machines, Thinking Machines CM

MIMD Architecture

[Figure: processors A, B, and C each have their own instruction stream and
their own data input and output streams.]

Unlike SISD and MISD, a MIMD computer works asynchronously.
  - Shared memory (tightly coupled) MIMD
  - Distributed memory (loosely coupled) MIMD

Shared Memory MIMD machine

[Figure: processors A, B, and C connect through memory buses to a global
memory system.]

Comm: a source PE writes data to global memory & the destination PE retrieves it
=> Easy to build; conventional OSes for SISD can easily be ported
=> Limitation : reliability & expandability. A memory component or any processor failure affects the whole system.
=> Increasing the number of processors leads to memory contention.
Ex. : Silicon Graphics supercomputers ...

Distributed Memory MIMD

[Figure: processors A, B, and C each have their own memory system and
communicate over IPC channels.]

• Communication : IPC over a high-speed network.
• The network can be configured as a tree, mesh, cube, etc.
• Unlike shared-memory MIMD:
=> easily/readily expandable
=> highly reliable (any CPU failure does not affect the whole system)

Data Parallel Model

• Operations can be performed in parallel on each element of a large regular data structure, such as an array
• 1 Control Processor broadcasts to many PEs
  - When computers were large, could amortize the control portion over many replicated PEs
• Condition flag per PE so that individual PEs can be skipped
• Data distributed across the PE memories
• Early 1980s VLSI => SIMD rebirth: 32 1-bit PEs + memory on a chip was the PE
• Data parallel programming languages lay out data to the processors

Data Parallel Model

• Vector processors have similar ISAs, but no data placement restriction
• SIMD led to Data Parallel Programming languages
• Advancing VLSI led to single-chip FPUs and whole fast microprocessors (making SIMD less attractive)
• The SIMD programming model led to the Single Program Multiple Data (SPMD) model
  - All processors execute an identical program
• Data parallel programming languages are still useful; do communication all at once: "Bulk Synchronous" phases in which all communicate after a global barrier

Convergence in Parallel Architecture

• Complete computers connected to a scalable network via a communication assist
• Different programming models place different requirements on the communication assist
  - Shared address space: tight integration with memory to capture memory events that interact with others + to accept requests from other nodes
  - Message passing: send messages quickly and respond to incoming messages: tag match, allocate buffer, transfer data, wait for receive posting
  - Data Parallel: fast global synchronization
• High Performance Fortran is shared-memory, data parallel; the Message Passing Interface is a message passing library; both work on many machines, with different implementations

Fundamental Issues

• 3 issues to characterize parallel machines:
1) Naming/Program Partitioning
2) Synchronization
3) Latency and Bandwidth

Fundamental Issue #1: Naming

• Naming: how to solve a large problem fast
  - what data is shared
  - how it is addressed
  - what operations can access the data
  - how processes refer to each other
• Choice of naming affects the code produced by a compiler: via a load, where you just remember the address, or by keeping track of the processor number and the local virtual address for message passing
• Choice of naming affects replication of data: via load in a cache memory hierarchy, or via SW replication and consistency

Fundamental Issue #1: Naming

• Global physical address space: any processor can generate the address and access it in a single operation
  - memory can be anywhere: virtual address translation handles it
• Global virtual address space: if the address space of each process can be configured to contain all shared data of the parallel program
• Segmented shared address space: locations are named <process number, address> uniformly for all processes of the parallel program

Fundamental Issue #2: Synchronization

• To cooperate, processes must coordinate
• Message passing is implicit coordination with the transmission or arrival of data
• Shared address space => additional operations to explicitly coordinate: e.g., write a flag, awaken a thread, interrupt a processor

Fundamental Issue #3: Latency and Bandwidth

• Bandwidth
  - Need high bandwidth in communication
  - Cannot scale, but stay close to the limits in network, memory, and processor
  - Overhead to communicate is a problem in many machines
• Latency
  - Affects performance, since the processor may have to wait
  - Affects ease of programming, since it requires more thought to overlap communication and computation
• Latency Hiding
  - How can a mechanism help hide latency?
  - Examples: overlap message send with computation, prefetch data, switch to other tasks

Small-Scale Shared Memory

• Caches serve to:
  - Increase bandwidth versus bus/memory
  - Reduce latency of access
  - Valuable for both private data and shared data
• What about cache consistency?

[Figure: the same small-scale MP organization as before -- processors with one
or more levels of cache sharing a bus to main memory and the I/O system.]

The Problem of Cache Coherency

[Figure: three snapshots of a CPU cache (A', B') and memory (A, B).
(a) Cache and memory coherent: A' = A = 100 and B' = B = 200.
(b) After the CPU writes 550 into A' in the cache, memory still holds A = 100:
cache and memory are incoherent, A is stale, and an I/O output of A gets 100.
(c) After I/O inputs 440 into B in memory, the cached B' = 200 is stale:
cache and memory are incoherent, B' ≠ B.]

What Does Coherency Mean?

• Informally:
  - "Any read must return the most recent write"
  - Too strict and too difficult to implement
• Better:
  - "Any write must eventually be seen by a read"
  - All writes are seen in proper order ("serialization")
• Two rules to ensure this:
  - "If P writes x and P1 reads it, P's write will be seen by P1 if the read and write are sufficiently far apart"
  - Writes to a single location are serialized: seen in one order
    - Latest write will be seen
    - Otherwise one could see writes in an illogical order (could see an older value after a newer value)

Cache Coherency Solutions

• More detail ... after we cover cache and memory design
  - Snooping Solution (Snoopy Bus)
  - Directory-Based Schemes

Synchronization

• Why synchronize? Need to know when it is safe for different processes to use shared data
• Issues for synchronization:
  - Uninterruptible instruction to fetch and update memory (atomic operation)
  - User-level synchronization operations built using this primitive
  - For large-scale MPs, synchronization can be a bottleneck; techniques to reduce contention and the latency of synchronization

Hardware-Level Synchronization

• Key is to provide an uninterruptible instruction or instruction sequence capable of atomically retrieving a value
  - S/W mechanisms are then constructed from these H/W primitives
• Special load: load linked
• Special store: store conditional
  - If the contents of memory changed before the store conditional, then the store conditional fails
  - Store conditional returns a value specifying success or failure

Uninterruptible Instruction to Fetch and Update Memory

• Atomic exchange: interchange a value in a register for a value in memory
  0 => synchronization variable is free
  1 => synchronization variable is locked and unavailable
  - Set register to 1 & swap
  - New value in register determines success in getting the lock:
    0 if you succeeded in setting the lock (you were first)
    1 if another processor had already claimed access
  - Key is that the exchange operation is indivisible
• Test-and-set: tests a value and sets it if the value passes the test
• Fetch-and-increment: returns the value of a memory location and atomically increments it
  - 0 => synchronization variable is free
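A sketch of a spin lock built on the atomic exchange just described (an added illustration, not from the slides; C11 atomics stand in for the hardware swap instruction):

/* Spin lock on atomic exchange: 0 = free, 1 = locked, as on the slide. */
#include <stdatomic.h>

typedef atomic_int spinlock_t;          /* 0 => free, 1 => locked        */

static void lock(spinlock_t *l) {
    /* "set register to 1 & swap": atomic_exchange returns the old value.
     * Old value 0 means we got the lock; 1 means another holder exists. */
    while (atomic_exchange(l, 1) == 1)
        ;                               /* spin until the exchange returns 0 */
}

static void unlock(spinlock_t *l) {
    atomic_store(l, 0);                 /* release: mark the variable free   */
}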

Coordination/Synchronization Constructs

• For shared memory and message passing, two types of synchronization activity:
  - Sequence control ... to enable correct operation
  - Access control ... to allow access to common resources
• Synchronization activities constitute an overhead!
• For SIMD these are done at the machine (H/W) level

Synchronization Constructs

• Barrier synchronization
  - for sequence control
  - processors wait at the barrier till all (or a subset) have completed
  - hardware implementations available
  - can also implement in s/w
• Critical section access control mechanisms
  - Test&Set lock protocols
  - Semaphores

Barrier Synchronization

• Many programs require that all processes come to a "barrier" point before proceeding further
  - this constitutes a synchronization point
• Concept of Barrier
  - When a processor hits a barrier it cannot proceed further until ALL processors have hit the barrier point
  - note that this forces a global synchronization point
• Can implement in S/W or Hardware
  - in s/w can implement using a shared variable; each processor checks the value of the shared variable (a sketch follows the example below)

Barrier Synch. . . Example

For i := 1 to N do in parallel
  A[i] := k * A[i];
  B[i] := A[i] + B[i];
endfor

BARRIER POINT

for i := 1 to N do in parallel
  C[i] := B[i] + B[i-1] + B[i-2];
endfor
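A software barrier along the lines sketched above, using a shared counter (an added illustration, not the course's code; the generation counter lets the same barrier be reused):

/* Centralized counter-based software barrier -- illustrative sketch. */
#include <stdatomic.h>

#define NPROC 8                         /* assumed number of processors   */

static atomic_int count = 0;            /* how many have arrived          */
static atomic_int generation = 0;       /* bumped each time barrier opens */

static void barrier(void) {
    int my_gen = atomic_load(&generation);
    if (atomic_fetch_add(&count, 1) == NPROC - 1) {
        /* last arrival: reset the counter and release everyone */
        atomic_store(&count, 0);
        atomic_fetch_add(&generation, 1);
    } else {
        /* earlier arrivals: spin on the shared variable */
        while (atomic_load(&generation) == my_gen)
            ;
    }
}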

Barrier Synchronization: Implementation

• Bus based
  - each processor sets a single bit when it arrives at the barrier
  - the collection of bits is sent to AND (or OR) gates
  - the outputs of the gates are sent to all processors
  - number of synchs/cycle grows with N (processors) if a change in a bit at one processor can be propagated in a single cycle
  - takes O(log N) in reality
  - how is the performance delay due to the barrier measured?
• Multiple barrier lines
  - a barrier bit sent to each processor
  - each can set a bit for each barrier line
  - X1,...,Xn in the processor; Y1,...,Yn is the barrier setting

Synchronization: Message Passing

• Synchronous vs. Asynchronous
• Synchronous: the sending and receiving processes synchronize in time and space
  - system must check if (i) the receiver is ready, (ii) a path is available, and (iii) one or more messages are to be sent to the same or multiple destinations
  - also known as blocking send-receive
  - the send and receive processes cannot continue past the instruction till the message transfer is complete
• Asynchronous: sender & receiver do not have to synchronize

Lock Protocols

Test&Set (lock)
  temp <- lock
  lock := 1
  return (temp)

Reset (lock)
  lock := 0

A process waits for the lock to be 0. (A C sketch follows below.)

Semaphores

• P(S) for shared variable/section S
  - test if S > 0; if so, decrement S and enter the critical section, else wait
• V(S)
  - increment S and exit
• note that P and V are blocking synchronization constructs
• can allow a number of concurrent accesses to S
• can remove indefinite waits by ???
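The Test&Set protocol above maps directly onto C11's atomic_flag; a minimal sketch (an added illustration, not from the slides):

/* Test&Set lock protocol using C11's atomic_flag.                    */
#include <stdatomic.h>

static atomic_flag lock_var = ATOMIC_FLAG_INIT;    /* clear <=> lock = 0 */

static void acquire(void) {
    /* test_and_set returns the previous value: spin while it was 1 (held) */
    while (atomic_flag_test_and_set(&lock_var))
        ;                                          /* wait for lock = 0   */
}

static void release(void) {
    atomic_flag_clear(&lock_var);                  /* Reset(lock)         */
}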

Semaphores : Example

Z = A*B + [ (C*D) * (I+G) ]
var S_w, S_y are semaphores
initial: S_w = S_y = 0

P1: begin
  U = A*B
  P(S_y)
  Z = U + Y
end

P2: begin
  W = C*D
  V(S_w)
end

P3: begin
  X = I + G
  P(S_w)
  Y = W * X
  V(S_y)
end

(A POSIX-semaphore version of this example appears below.)

Next -- Distributed Memory MPs

• Multiple processors connected through an interconnection network
• Network properties play a vital role in system performance
• Next…
  - Interconnection network definitions
  - Examples of routing on static topology networks – you are required to read the notes for some detailed discussion on this
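The Z = A*B + [(C*D)*(I+G)] example above maps onto POSIX semaphores, with sem_wait playing the role of P and sem_post the role of V (an added sketch; the input values are made up):

/* Three threads coordinate with two semaphores, as in the slide example. */
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

static double A = 2, B = 3, C = 4, D = 5, I = 6, G = 7;   /* sample inputs */
static double U, W, X, Y, Z;
static sem_t S_w, S_y;

static void *P1(void *arg) { U = A * B; sem_wait(&S_y); Z = U + Y; return NULL; }
static void *P2(void *arg) { W = C * D; sem_post(&S_w); return NULL; }
static void *P3(void *arg) { X = I + G; sem_wait(&S_w); Y = W * X; sem_post(&S_y); return NULL; }

int main(void) {
    pthread_t t1, t2, t3;
    sem_init(&S_w, 0, 0);                    /* initial: S_w = S_y = 0 */
    sem_init(&S_y, 0, 0);
    pthread_create(&t1, NULL, P1, NULL);
    pthread_create(&t2, NULL, P2, NULL);
    pthread_create(&t3, NULL, P3, NULL);
    pthread_join(t1, NULL); pthread_join(t2, NULL); pthread_join(t3, NULL);
    printf("Z = %g\n", Z);                   /* 2*3 + (4*5)*(6+7) = 266 */
    sem_destroy(&S_w); sem_destroy(&S_y);
    return 0;
}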

Interconnection Networks

Two types :
-- direct networks with static interconnections :
   point-to-point direct connections between system elements
-- indirect networks with dynamic interconnections :
   dynamically programmable switched channels

Relevant aspects :
-- scalability
-- communication efficiency (latency)
-- flexibility of reconfiguration
-- complexity
-- cost

Basic Concepts and Definitions

Node degree : the number of edges (links) connected to a node

Diameter : the maximum, over all pairs of nodes, of the shortest path between the two nodes

Bisection :
-- channel bisection width ( b ) is the number of edges along a network bisection
-- wire bisection width ( B = bw ) is the number of wires along a network bisection, where w is the channel width in wires

Data routing functions : simple (primitive) and complex
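To make the definitions concrete, here are standard values for two common static topologies (an added illustration, not taken from these slides):
-- Ring of N nodes : node degree 2; diameter floor(N/2); channel bisection width b = 2
-- Binary hypercube of N = 2^k nodes : node degree k = log2 N; diameter log2 N; channel bisection width b = N/2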

Network Performance

Functionality : data routing, interrupt handling, synchronization, request/response combining, . . .

Network latency : the worst-case time delay for a message to be transferred through the network

Bandwidth : maximum data transfer rate (Mbits/sec)

Hardware complexity : implementation costs for components

Scalability : modular expansion with increasing machine resources

Static Connection Networks

[Figure: examples of static topologies -- a linear array and a ring.]

Bus Connection

[Figure: processors P1..Pn with caches C1..Cn, memory modules M1..Mn, the I/O
subsystem, and secondary storage all attached to a single interconnection bus.]

Dynamic Connection Networks

Characteristics : connections established dynamically based on program demands

Types : bus system, multistage interconnection network, crossbar switch network

Priorities : arbitration logic

Contention : conflict resolution

Crossbar network

[Figure: processors P1..Pn connect to memory modules M1..Mn through a grid of
crosspoint switches, one per processor-memory pair.]

Interconnection Networks

• The topology of the interconnection network (ICN) determines routing delays
  - need efficient routing algorithms
• Switching techniques also determine latency
  - packet, circuit, wormhole
• Details on some static topology networks are covered in the notes
  - You are required to read these notes

SIMD Architectures

• Single Instruction stream, Multiple Data stream
  - Each processor executes the same instruction on different data
• Efficient in applications requiring large data processing
  - Low-level image processing
  - Multimedia processing
  - Scientific computing
• Synchronization is implicit
  - All processors are in lock step with the control unit

SIMD Architectures…contd

• Control Unit (CU)
  - Broadcasts instructions to processors
  - Has memory for the program
  - Executes control flow instructions (branches)
• Processing Elements (PE)
  - Data distributed among PE memories
  - Each PE can be enabled or disabled using a Mask
  - MASK instruction broadcast by CU

SIMD. . . PE Organization

• Simple processors
  - Do not need to fetch instructions
  - Can be simple microcontrollers
• CPU
• Local memory to store data
• General purpose registers
• Address register – address of the PE
• Data transfer registers for the network (DTR_in, DTR_out)
• Status flag – enabled/disabled
• Index register – used in memory access
  - offset by x_i in the memory of PE i

SIMD Masking Schemes

• All PEs execute the same instruction (broadcast by the CU)
  - A masking scheme allows a subset of PEs to ignore/suspend the instruction
  - Only enabled processors execute the instruction
  - A masking/status register denotes whether the PE is enabled
  - If Reg = 1 then the PE is active, else inactive
  - CU can broadcast a MASK vector
    - One bit for each PE, or use log N bits to enable sets
  - Data/conditional masks
    - Allow each PE to set its Mask register depending on data
    - E.g.: If A < 0 then S = 1 (sets the Mask to 1 if the value of A in its local memory is less than 0)
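A rough software analogue of data-conditional masking (an added sketch, not from the slides): each array index plays the role of one PE's local memory, and the mask decides which "PEs" apply the broadcast operation.

/* Data-conditional masking, simulated: index i plays the role of PE i. */
#include <stdio.h>

#define NPE 8

int main(void) {
    int A[NPE]  = {3, -1, 4, -5, 9, -2, 6, 0};
    int mask[NPE];

    /* CU broadcasts "if A < 0 then S = 1": every PE sets its own mask bit */
    for (int i = 0; i < NPE; i++)
        mask[i] = (A[i] < 0);

    /* CU broadcasts "A := -A"; only enabled PEs (mask = 1) execute it */
    for (int i = 0; i < NPE; i++)
        if (mask[i])
            A[i] = -A[i];

    for (int i = 0; i < NPE; i++)
        printf("%d ", A[i]);            /* the negative entries are flipped */
    printf("\n");
    return 0;
}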

Processing on SIMD

• CU broadcasts instructions
• PEs execute – they can be simple decode and execute units
• CU can also broadcast a data value
• Time taken to process a task is the time to complete the task on all processors
• All processors execute the same instruction in one cycle
  - Note also that the processors are hardware synchronized (tied to the same clock?)

Matrix Multiplication on SIMD using CU

• Assume we have N processors in SIMD configuration
• Algorithm to multiply N by N matrices using the CU to broadcast an element:

For i := 1 to N do
  For j := 1 to N do
    C[i,j] := 0
    for k := 1 to N do
      C[i,j] := C[i,j] + A[i,k]*B[k,j]

• Note each row of A is required N times, for C[i,1], C[i,2], ..., C[i,N]

Matrix Multiplication on SIMD using CU

• Assume each processor P_k stores column k of matrix B
• CU can broadcast the current value of A
• Each processor k computes C[i,k] for all values of i
  - Processor k computes column k of the result matrix

Sample Code

For i := 1 to N do
  In Parallel for ALL processors P_k (i.e., enable all processors)
    Broadcast i          /* send value of i to all processors */
    C[i] := 0            /* initialize C[i] to 0 in each processor k */
    For j := 1 to N do
      Broadcast j
      Broadcast A[i,j]
      MULT A[i,j], B[j] -> temp
      ADD C[i], temp -> C[i]
    Endfor (j loop)
Endfor
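A sequential simulation of this broadcast algorithm (an added sketch, not the course's code): each value of k in the inner loop plays the role of processor P_k, which holds column k of B and accumulates column k of the result.

/* Simulation of the CU-broadcast matrix multiply sketched above. */
#include <stdio.h>
#define N 3

int main(void) {
    double A[N][N] = {{1,2,3},{4,5,6},{7,8,9}};
    double B[N][N] = {{1,0,0},{0,1,0},{0,0,1}};    /* identity, so C == A */
    double C[N][N];

    for (int i = 0; i < N; i++) {
        for (int k = 0; k < N; k++) C[i][k] = 0;   /* all PEs: C[i] := 0   */
        for (int j = 0; j < N; j++) {
            double a = A[i][j];                    /* CU broadcasts A[i,j] */
            for (int k = 0; k < N; k++)            /* all PEs in lock step */
                C[i][k] += a * B[j][k];            /* PE k uses B[j] from its column */
        }
    }
    printf("C[2][2] = %g\n", C[2][2]);             /* expect 9 with identity B */
    return 0;
}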

Time Analysis

• There are N^2 iterations at the control unit
  - Time taken is O(N^2)
• Instructions are broadcast to all PEs
• Essentially the k loop has been parallelized
  - Using N processors
• Requires each processor to store N elements of B and N elements of the result matrix
• Ideal speedup and efficiency
  - Got a speedup of N using N processors, i.e., 100% efficiency

Storage Rules

• In the previous example, the algorithm required that a row of B be operated on at each cycle
  - Since B was stored column-wise, this was not a problem
• What if a column of B has to be processed at each cycle?
  - Since an entire column is stored in one processor, this requires N cycles
  - No speedup and a waste of N processors
• Need to come up with better ways to store matrices
  - Allow row or column fetching in parallel
  - Skew storage rules allow this (see the sketch below)
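One classic skew scheme (an added illustration; the slides do not spell out the rule): store A[i][j] in memory module (i + j) mod N. Then the N elements of any row and of any column fall in N distinct modules, so either can be fetched in one parallel step.

/* Skewed storage sketch: module(i, j) = (i + j) mod N.                  */
#include <stdio.h>
#define N 4

int main(void) {
    /* The modules holding row 1 and column 2: each list contains every
     * module exactly once, so both fetches are conflict-free.           */
    printf("row 1 modules:    ");
    for (int j = 0; j < N; j++) printf("%d ", (1 + j) % N);
    printf("\ncolumn 2 modules: ");
    for (int i = 0; i < N; i++) printf("%d ", (i + 2) % N);
    printf("\n");
    return 0;
}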

Progression to MIMD

• Multiple Instruction, Multiple Data
• Shared memory or distributed memory
• Each processor executes its own program
  - processor must store instructions and data
  - larger memory required
  - more complex (than SIMD) processors
  - can also have heterogeneous processors

MIMD Issues

• H/W
  - Processors more complex, more memory
  - flexible communication
• S/W
  - each processor creates and terminates processes -> language constructs needed
  - O/S at each node
  - Coordination/Synchronization constructs
    - shared memory
    - message passing
  - load balancing and program partitioning
  - algorithm design: exploit functional parallelism

Moving from multiprocessor to distributed systems

Language Constructs

• Similarity to concurrent programming
• language constructs to express parallelism must
  - define subtasks to be executed in parallel
  - start and stop execution
  - coordinate and specify interaction
• examples (a FORK-JOIN sketch in C follows below):
  - FORK-JOIN (subsumes all other models)
  - Cobegin-Coend (Parbegin-Parend)
  - Forall/Doall
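A minimal fork-join sketch (an added illustration, not from the slides), using POSIX threads where pthread_create plays the role of FORK and pthread_join of JOIN:

/* FORK-JOIN with POSIX threads.                                        */
#include <pthread.h>
#include <stdio.h>

static void *subtask(void *arg) {
    int id = *(int *)arg;
    printf("subtask %d running\n", id);       /* work executed in parallel */
    return NULL;
}

int main(void) {
    pthread_t t[4];
    int id[4];
    for (int i = 0; i < 4; i++) {             /* FORK: start the subtasks  */
        id[i] = i;
        pthread_create(&t[i], NULL, subtask, &id[i]);
    }
    for (int i = 0; i < 4; i++)               /* JOIN: wait for completion */
        pthread_join(t[i], NULL);
    printf("all subtasks joined\n");
    return 0;
}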

Local Calls (Subroutines)

[Figure: the main program issues "call ABC (a, b, c)" to a library routine ABC
on the same (local) computer, which returns to the caller.]

Remote Calls (Sockets)

[Figure: the main program on the local computer sends (a, b, c) over the
network to an IP number and port on a remote computer; the remote side
receives the arguments, runs ABC (a, b, c), and returns the result over the
network.]

Object Request Broker (ORB)

[Figure: the main program on the local computer calls ABC (a, b, c); ORB
platforms on the local and remote computers forward the call across the
network to the remote computer, where ABC executes.]

Java Remote Method Invocation (RMI)

[Figure: the same structure with a JVM on each side: the local call
ABC (a, b, c) is shipped across the network to a remote JVM, which executes
the method.]

Scalability

• Performance must scale with
  - system size
  - problem/workload size
• Amdahl's Law (worked example below)
  - perfect speedup cannot be achieved, since there is an inherently sequential part to every program
• Scalability measures
  - Efficiency (speedup per processor)

Parallel Algorithms

• Solving problems on a multiprocessor architecture requires the design of parallel algorithms
• How do we measure the efficiency of a parallel algorithm?
  - 10 seconds on Machine 1 and 20 seconds on Machine 2 – which algorithm is better?
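A worked Amdahl's Law example (an added illustration; the fraction used is made up): if a fraction s of a program is inherently sequential, the speedup on P processors is at most

  Speedup(P) = 1 / ( s + (1 - s)/P )

With s = 0.1 and P = 16: Speedup = 1 / (0.1 + 0.9/16) = 1 / 0.15625 = 6.4, far below the "perfect" speedup of 16, and it can never exceed 1/s = 10 even with unlimited processors.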

Parallel Algorithm Complexity

• Parallel time complexity
  - Express in terms of input size and system size (number of processors)
  - T(N,P): input size N, P processors
  - Relationship between N and P
    - Independent size analysis – no link between N and P
    - Dependent size – P is a function of N; e.g., P = N/2
• Speedup: how much faster than sequential
  - S(P) = T(N,1) / T(N,P)
• Efficiency: speedup per processor
  - S(P) / P

Parallel Computation Models

• Shared Memory
  - Protocol for shared memory? ... what happens when two processors/processes try to access the same data
  - EREW: Exclusive Read, Exclusive Write
  - CREW: Concurrent Read, Exclusive Write
  - CRCW: Concurrent Read, Concurrent Write
• Distributed Memory
  - Explicit communication through message passing
  - Send/Receive instructions

Formal Models of Parallel Computation

• Alternating Turing machine
• P-RAM model
  - Extension of the sequential Random Access Machine (RAM) model
• RAM model
  - One program
  - One memory
  - One accumulator
  - One read/write tape

P-RAM model

• P programs, one per processor
• One memory
  - In distributed memory it becomes P memories
• P accumulators
• One read/write tape
• Depending on the shared memory protocol we have
  - EREW PRAM
  - CREW PRAM
  - CRCW PRAM

PRAM Model

• Assumes synchronous execution
• Idealized machine
  - Helps in developing theoretically sound solutions
  - Actual performance will depend on machine characteristics and language implementation

PRAM Algorithms -- Summing

• Add N numbers in parallel using P processors
  - How to parallelize?
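One common answer, sketched below (an added illustration, not the course's code): pairwise tree summation, which takes ceil(log2 N) synchronous steps using N/2 processors. The sequential loops here simulate the parallel steps.

/* PRAM-style pairwise summation, simulated: in each step, "processor" i
 * adds in the element one stride away; after ceil(log2 N) steps A[0]
 * holds the total.                                                     */
#include <stdio.h>
#define N 8

int main(void) {
    int A[N] = {1, 2, 3, 4, 5, 6, 7, 8};

    for (int stride = 1; stride < N; stride *= 2)      /* log N synchronous steps */
        for (int i = 0; i + stride < N; i += 2 * stride)
            A[i] += A[i + stride];                     /* done in parallel on a PRAM */

    printf("sum = %d\n", A[0]);                        /* expect 36 */
    return 0;
}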

Parallel Summing

• Using N/2 processors to sum N numbers in O(log N) time
• Independent size analysis:
  - Do a sequential sum of N/P values on each processor and then add the partial sums in parallel
  - Time = O(N/P + log P)

Parallel Sorting on CREW PRAM

• Sort N numbers using P processors
  - Assume P is unlimited for now
• Given an unsorted list (a1, a2, ..., an), create the sorted list W, where W[i] < W[i+1]
• Where does a1 go?

Parallel Sorting on CREW PRAM

• Using P = N^2 processors (a simulation sketch appears below)
• Each processor P(i,j) compares ai and aj
  - If ai > aj then R[i,j] = 1 else R[i,j] = 0
  - Time = O(1)
• Each "row" of processors P(i,j), j = 1 to N, does a parallel sum to compute the rank
  - Compute R[i] = sum of R[i,j]
  - Time = O(log N)
• Write ai into W[R[i]]
• Total time complexity = O(log N)

Parallel Algorithms

• The design of a parallel algorithm has to take the system architecture into consideration
• Must minimize interprocessor communication in a distributed memory system
  - Communication time is much larger than computation time
  - Communication time can dominate computation if the problem is not "partitioned" well
• Efficiency
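A sequential simulation of the rank sort above (an added sketch, assuming distinct keys and 0-based indexing of W; on the CREW PRAM the N^2 comparisons run in O(1) and the N rank sums in O(log N)):

/* Rank ("enumeration") sort -- sequential simulation of the CREW PRAM
 * algorithm. On the PRAM, each (i,j) comparison and each rank sum is
 * handled by its own processor.                                        */
#include <stdio.h>
#define N 6

int main(void) {
    int a[N] = {42, 7, 19, 88, 3, 55};
    int W[N];

    for (int i = 0; i < N; i++) {
        int rank = 0;                      /* R[i] = number of smaller elements */
        for (int j = 0; j < N; j++)
            if (a[j] < a[i]) rank++;       /* the O(1) compare step, per (i,j)  */
        W[rank] = a[i];                    /* write a[i] into W[R[i]]           */
    }
    for (int i = 0; i < N; i++) printf("%d ", W[i]);   /* 3 7 19 42 55 88 */
    printf("\n");
    return 0;
}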

Next topics…

• Memory design
  - Single processor – high-performance processors
  - Focus on cache
  - Multiprocessor cache design
• "Special Architectures"
  - Embedded Systems
  - Reconfigurable architectures – FPGA technology
  - Cluster and Networked Computing
