CS 211: Computer Architecture


Computer Architecture

Instructor: Prof. Bhagi Narahari
Dept. of Computer Science
Course URL: www.seas.gwu.edu/~narahari/cs211/

• Part I: Processor Architectures
  - starting with simple ILP using pipelining
  - explicit ILP - EPIC
  - key concept: issue multiple instructions/cycle

• Part II: Multi-Processor Architectures
  - move from processor to system level
  - can utilize all the techniques covered thus far
  - i.e., the processors used in a multiprocessor can be EPIC
  - move from fine grain to medium/coarse grain
  - assume all processor issues are resolved when discussing system-level multiprocessor design

Moving from Fine-Grained to Coarser-Grained Computations . . .

Multi-Processor Architectures

• Introduce Parallel Processing
  - grains, mapping of s/w to h/w, issues
• Overview of Multiprocessor Architectures
  - shared-memory, distributed-memory
  - SIMD architectures
• Programming and Synchronization
  - programming constructs, synchronization constructs, cache
• Interconnection Networks
• Parallel algorithm design and analysis

Hardware and Software Parallelism

Hardware parallelism :
-- Defined by machine architecture and hardware multiplicity
-- Number of instruction issues per machine cycle
-- k issues per machine cycle : k-issue processor

Software parallelism :
-- Control and data dependence of programs
-- Compiler extensions
-- OS extensions (parallel scheduling, shared memory allocation, communication links)

Implicit parallelism :
-- Conventional programming language
-- Parallelizing compiler

Explicit parallelism :
-- Parallelizing constructs in programming languages
-- Parallel program development tools
-- Debugging, validation, testing, etc.

Software vs. Hardware Parallelism (Example)

[Figure: software parallelism of the example program is three cycles --
Cycle 1: four loads (L1, L2, L3, L4); Cycle 2: two multiplies (X1, X2);
Cycle 3: the add (+) and the subtract (-).]

Software vs. Hardware Parallelism (Example)

[Figure: hardware parallelism on a 2-issue processor (one memory access and
one arithmetic operation per cycle) -- the same program takes seven cycles:
L1; L2; L3 and X1; L4; X2; +; -.]

[Figure: hardware parallelism on a dual processor built from single-issue
processors -- six cycles: L1/L3; L2/L4; X1/X2; S1/S2; L5/L6; +/-. The extra
stores S1, S2 and loads L5, L6 are instructions added for interprocessor
communication (IPC).]

Types of Software Parallelism

Control Parallelism :
-- Two or more operations performed simultaneously
-- Forms : pipelining or multiple functional units
-- Limitations : pipeline length and multiplicity of functional units

Data Parallelism :
-- The same operation performed on many data elements
-- The highest potential for concurrency
-- Requires compilation support, parallel programming languages, and hardware redesign

Program Partitioning: Grains and Latencies

Grain :
-- Program segment to be executed on a single processor
-- Coarse-grain, medium-grain, and fine-grain

Latency :
-- Time measure of the communication overhead
-- Memory latency
-- Processor latency (synchronization latency)

Parallelism (Granularity) :
-- Instruction level (fine grain -- 20 instructions in a segment)
-- Loop level (fine grain -- 500 instructions)
-- Procedure level (medium grain -- 2000 instructions)
-- Subprogram level (medium grain -- thousands of instructions)
-- Job/program level (coarse grain)

Levels of Program Grains

Level 5 : Jobs and programs (coarse grain)
Level 4 : Subprograms, job steps, or parts of a program (medium grain)
Level 3 : Procedures, subroutines, or tasks (medium grain)
Level 2 : Nonrecursive loops or unfolded iterations (fine grain)
Level 1 : Instructions or statements (fine grain)

Finer grains offer a higher degree of parallelism, but also increasing
communication demand and scheduling overhead.

Partitioning and Scheduling

Grain Packing :
-- How to partition a program into program segments to get the shortest possible execution time ?
-- What is the optimal size of concurrent grains ?

Program Graph :
-- Each node (n, s) corresponds to a computational unit :
   n -- node name; s -- grain size
-- Each edge (v, d) between two nodes denotes the output variable v and the communication delay d

Example:
1. a := 1        10. j := e x f
2. b := 2        11. k := d x f
3. c := 3        12. l := j x k
4. d := 4        13. m := 4 x l
5. e := 5        14. n := 3 x m
6. f := 6        15. o := n x i
7. g := a x b    16. p := o x h
8. h := c x d    17. q := p x q
9. i := d x e

Fine-grain program graph (before packing) and coarse-grain program graph (after packing)

[Figure: the fine-grain graph has one node (n, s) per statement, e.g. nodes
1,1 through 6,1 for the constant assignments; after grain packing the nodes
are combined into coarse grains A,8; B,4; C,4; D,8; and E,6.]

Scheduling of the fine-grain and coarse-grain programs

[Figure: time lines of the fine-grain and coarse-grain (packed) schedules of
the example program.]

Multiprocessor Architectures: Program Flow Mechanisms

Control Flow :
-- Conventional computers
-- Instruction execution controlled by the PC
-- Instruction sequence explicitly stated in the user program

Data Flow :
-- Data-driven execution
-- Instructions executed as soon as their input data are available
-- Higher degree of parallelism at the fine-grain level

Reduction computers :
-- Demand driven
-- Instructions executed when their results are needed
-- Use a reduced instruction set

Multiprocessor Architectures: Scope of Course

• We will focus on parallel control flow architectures

Review: Parallel Processing Intro

• Long-term goal of the field: scale the number of processors to the size of the budget and the desired performance
• Successes today:
  - dense matrix scientific computing (petroleum, automotive, aeronautics, pharmaceuticals)
  - file servers, databases, web search engines
  - entertainment/graphics
• Machines today: workstations!!

Parallel Architecture

• Parallel architecture extends traditional computer architecture with a communication architecture
  - abstractions (HW/SW interface)
  - organizational structure to realize the abstraction efficiently

Parallel Framework for Communication

• Layers:
  - Programming Model:
    - Multiprogramming : lots of jobs, no communication
    - Shared address space : communicate via memory
    - Message passing : send and receive messages
    - Data Parallel : several processors operate on several data sets simultaneously and then exchange information globally and simultaneously (shared or message passing)
  - Communication Abstraction:
    - Shared address space : e.g., load, store, atomic swap
    - Message passing : e.g., send, receive library calls
    - Debate over this topic (ease of programming, large scaling) => many hardware designs 1:1 with a programming model

Shared Address/Memory Multiprocessor Model

• Communicate via Load and Store
  - Oldest and most popular model
• Based on timesharing: processes on multiple processors vs. sharing a single processor
• Process: a virtual address space and 1 thread of control
  - Multiple processes can overlap (share), but ALL threads share a process address space
• Writes to the shared address space by one thread are visible to reads of other threads (a two-thread sketch follows below)
  - Usual model: share code, private stack, some shared heap, some private heap

Example: Small-Scale MP Designs

• Memory: centralized with uniform access time ("UMA") and bus interconnect, I/O
• Examples: Sun Enterprise 6000, SGI Challenge, Intel SystemPro

[Figure: four processors, each with one or more levels of cache, share a bus
to main memory and the I/O system.]
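A minimal sketch of this load/store communication (an added illustration, not from the slides; the names payload and ready are invented): one thread stores the data and sets a flag, the other spins on the flag and then loads the data. C11 release/acquire ordering stands in for the hardware support that makes the write visible.

/* Two threads communicating through ordinary loads and stores
 * to a shared address space -- illustrative sketch only.        */
#include <pthread.h>
#include <stdatomic.h>
#include <stdio.h>

static int payload;                    /* shared data: written, then read */
static atomic_int ready = 0;           /* flag from producer to consumer  */

static void *producer(void *arg) {
    payload = 42;                                       /* store the data */
    atomic_store_explicit(&ready, 1, memory_order_release);
    return NULL;
}

static void *consumer(void *arg) {
    while (atomic_load_explicit(&ready, memory_order_acquire) == 0)
        ;                                               /* spin on flag   */
    printf("consumer read %d\n", payload);              /* load the data  */
    return NULL;
}

int main(void) {
    pthread_t p, c;
    pthread_create(&p, NULL, producer, NULL);
    pthread_create(&c, NULL, consumer, NULL);
    pthread_join(p, NULL);
    pthread_join(c, NULL);
    return 0;
}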

SMP Interconnect

• Processors to memory AND to I/O
• Bus based: all memory locations have equal access time, so SMP = "Symmetric MP"
  - Sharing limits BW as processors and I/O are added
• Crossbar: expensive to expand
• Multistage network: less expensive to expand than a crossbar, with more BW than a bus
• "Dance Hall" designs: all processors on the left, all memories on the right

Large-Scale MP Designs

• Memory: distributed with nonuniform access time ("NUMA") and scalable interconnect (distributed memory)
• Examples: Cray T3E (see Ch. 1, Figs 1-21, page 45 of [CSG96])

[Figure: each node pairs a processor + cache with local memory and I/O; nodes
are connected by a low-latency, high-reliability interconnection network.
Typical access times: 1 cycle to cache, 40 cycles to local memory, 100 cycles
across the network.]

Shared Address Model Summary

• Each processor can name every physical location in the machine
• Each process can name all data it shares with other processes
• Data transfer via load and store
• Data size: byte, word, ... or cache blocks
• Uses virtual memory to map virtual addresses to local or remote physical addresses
• Memory hierarchy model applies: communication now moves data to the local processor cache (as a load moves data from memory to cache)
  - Latency, BW (cache block?), scalability when communicating?

Message Passing Model

• Whole computers (CPU, memory, I/O devices) communicate as explicit I/O operations
  - Essentially NUMA, but integrated at the I/O devices rather than the memory system
• Send specifies a local buffer + the receiving process on the remote computer
• Receive specifies the sending process on the remote computer + a local buffer to place the data
  - Usually send includes a process tag and receive has a rule on the tag: match one, match any
  - Synch: when send completes, when buffer is free, when request is accepted, receive waits for send
• Send + receive => memory-memory copy, where each side supplies a local address, AND does pairwise synchronization!

Message Passing Model

• Send + receive => memory-memory copy and synchronization on the OS, even on 1 processor
• History of message passing:
  - Network topology was important because a node could only send to an immediate neighbor
  - Typically synchronous, blocking send & receive
  - Later DMA with non-blocking sends; DMA for receive into a buffer until the processor does a receive, and then the data is transferred to local memory
  - Later SW libraries to allow arbitrary communication
• Example: IBM SP-2, RS6000 workstations in racks
  - Network Interface Card has an Intel 960
  - 8x8 crossbar switch as the communication building block
  - 40 MByte/sec per link

Communication Models

• Shared Memory
  - Processors communicate through a shared address space
  - Easy on small-scale machines
  - Advantages:
    - Model of choice for uniprocessors, small-scale MPs
    - Ease of programming
    - Lower latency
    - Easier to use hardware-controlled caching
• Message passing
  - Processors have private memories, communicate via messages
  - Advantages:
    - Less hardware, easier to design
    - Focuses attention on costly non-local operations
• Can support either SW model on either HW base
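For concreteness, a minimal message-passing sketch using the MPI library (an added illustration, not part of these slides): rank 0 sends one integer and rank 1 performs a blocking receive, matching the send/receive model described above.

/* Blocking send/receive between two processes -- MPI sketch. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, value;
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        value = 211;
        /* send: local buffer + destination process + tag */
        MPI_Send(&value, 1, MPI_INT, 1, 0, MPI_COMM_WORLD);
    } else if (rank == 1) {
        /* receive: local buffer + source process + tag rule (exact match here) */
        MPI_Recv(&value, 1, MPI_INT, 0, 0, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 received %d\n", value);
    }
    MPI_Finalize();
    return 0;
}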

Popular Flynn Architecture Categories

• SISD (Single Instruction, Single Data)
  - Uniprocessors
• MISD (Multiple Instruction, Single Data)
  - ???
• SIMD (Single Instruction, Multiple Data)
  - Examples: Illiac-IV, CM-2
  - Simple programming model
  - Low overhead
  - Flexibility
  - All custom integrated circuits
• MIMD (Multiple Instruction, Multiple Data)
  - Examples: Sun Enterprise 5000, Cray T3D, SGI Origin
  - Flexible
  - Use off-the-shelf micros

SISD : A Conventional Computer

[Figure: a single processor consumes one instruction stream and one data input
stream and produces one data output stream.]

=> Speed is limited by the rate at which the computer can transfer information internally.
Ex: PC, Macintosh, workstations

The MISD Architecture

[Figure: processors A, B, and C each receive their own instruction stream but
operate on a single shared data input stream, producing one output stream.]

=> More of an intellectual exercise than a practical configuration. Few were built, and none are commercially available.

SIMD Architecture

[Figure: a single instruction stream drives processors A, B, and C; each
processor has its own data input stream and data output stream,
e.g., Ci <= Ai * Bi.]

Ex: CRAY vector processing machines, Thinking Machines CM

MIMD Architecture

[Figure: processors A, B, and C each have their own instruction stream and
their own data input and output streams.]

Unlike SISD and MISD, a MIMD computer works asynchronously.
  - Shared memory (tightly coupled) MIMD
  - Distributed memory (loosely coupled) MIMD

Shared Memory MIMD machine

[Figure: processors A, B, and C connect through memory buses to a global
memory system.]

Comm: a source PE writes data to global memory & the destination PE retrieves it
=> Easy to build; conventional OSes for SISD can easily be ported
=> Limitation : reliability & expandability. A memory component or any processor failure affects the whole system.
=> Increasing the number of processors leads to memory contention.
Ex. : Silicon Graphics supercomputers ...

Distributed Memory MIMD

[Figure: processors A, B, and C each have their own memory system and
communicate over IPC channels.]

• Communication : IPC over a high-speed network.
• The network can be configured as a tree, mesh, cube, etc.
• Unlike shared-memory MIMD:
=> easily/readily expandable
=> highly reliable (any CPU failure does not affect the whole system)

Data Parallel Model

• Operations can be performed in parallel on each element of a large regular data structure, such as an array
• 1 Control Processor broadcasts to many PEs
  - When computers were large, could amortize the control portion over many replicated PEs
• Condition flag per PE so that individual PEs can be skipped
• Data distributed across the PE memories
• Early 1980s VLSI => SIMD rebirth: 32 1-bit PEs + memory on a chip was the PE
• Data parallel programming languages lay out data to the processors

Data Parallel Model

• Vector processors have similar ISAs, but no data placement restriction
• SIMD led to Data Parallel Programming languages
• Advancing VLSI led to single-chip FPUs and whole fast microprocessors (making SIMD less attractive)
• The SIMD programming model led to the Single Program Multiple Data (SPMD) model
  - All processors execute an identical program
• Data parallel programming languages are still useful; do communication all at once: "Bulk Synchronous" phases in which all communicate after a global barrier

Convergence in Parallel Architecture

• Complete computers connected to a scalable network via a communication assist
• Different programming models place different requirements on the communication assist
  - Shared address space: tight integration with memory to capture memory events that interact with others + to accept requests from other nodes
  - Message passing: send messages quickly and respond to incoming messages: tag match, allocate buffer, transfer data, wait for receive posting
  - Data Parallel: fast global synchronization
• High Performance Fortran is shared-memory, data parallel; the Message Passing Interface is a message passing library; both work on many machines, with different implementations

Fundamental Issues

• 3 issues to characterize parallel machines:
1) Naming/Program Partitioning
2) Synchronization
3) Latency and Bandwidth

Fundamental Issue #1: Naming

• Naming: how to solve a large problem fast
  - what data is shared
  - how it is addressed
  - what operations can access the data
  - how processes refer to each other
• Choice of naming affects the code produced by a compiler: via a load, where you just remember the address, or by keeping track of the processor number and the local virtual address for message passing
• Choice of naming affects replication of data: via load in a cache memory hierarchy, or via SW replication and consistency

Fundamental Issue #1: Naming

• Global physical address space: any processor can generate the address and access it in a single operation
  - memory can be anywhere: virtual address translation handles it
• Global virtual address space: if the address space of each process can be configured to contain all shared data of the parallel program
• Segmented shared address space: locations are named <process number, address> uniformly for all processes of the parallel program

Fundamental Issue #2: Synchronization

• To cooperate, processes must coordinate
• Message passing is implicit coordination with the transmission or arrival of data
• Shared address space => additional operations to explicitly coordinate: e.g., write a flag, awaken a thread, interrupt a processor

Fundamental Issue #3: Latency and Bandwidth

• Bandwidth
  - Need high bandwidth in communication
  - Cannot scale, but stay close to the limits in network, memory, and processor
  - Overhead to communicate is a problem in many machines
• Latency
  - Affects performance, since the processor may have to wait
  - Affects ease of programming, since it requires more thought to overlap communication and computation
• Latency Hiding
  - How can a mechanism help hide latency?
  - Examples: overlap message send with computation, prefetch data, switch to other tasks

Small-Scale Shared Memory

• Caches serve to:
  - Increase bandwidth versus bus/memory
  - Reduce latency of access
  - Valuable for both private data and shared data
• What about cache consistency?

[Figure: the same small-scale MP organization as before -- processors with one
or more levels of cache sharing a bus to main memory and the I/O system.]

The Problem of Cache Coherency

[Figure: three snapshots of a CPU cache (A', B') and memory (A, B).
(a) Cache and memory coherent: A' = A = 100 and B' = B = 200.
(b) After the CPU writes 550 into A' in the cache, memory still holds A = 100:
cache and memory are incoherent, A is stale, and an I/O output of A gets 100.
(c) After I/O inputs 440 into B in memory, the cached B' = 200 is stale:
cache and memory are incoherent, B' ≠ B.]

What Does Coherency Mean?

• Informally:
  - "Any read must return the most recent write"
  - Too strict and too difficult to implement
• Better:
  - "Any write must eventually be seen by a read"
  - All writes are seen in proper order ("serialization")
• Two rules to ensure this:
  - "If P writes x and P1 reads it, P's write will be seen by P1 if the read and write are sufficiently far apart"
  - Writes to a single location are serialized: seen in one order
    - Latest write will be seen
    - Otherwise one could see writes in an illogical order (could see an older value after a newer value)

Cache Coherency Solutions

• More detail ... after we cover cache and memory design
  - Snooping Solution (Snoopy Bus)
  - Directory-Based Schemes

Synchronization

• Why synchronize? Need to know when it is safe for different processes to use shared data
• Issues for synchronization:
  - Uninterruptible instruction to fetch and update memory (atomic operation)
  - User-level synchronization operations built using this primitive
  - For large-scale MPs, synchronization can be a bottleneck; techniques to reduce contention and the latency of synchronization

Hardware-Level Synchronization

• Key is to provide an uninterruptible instruction or instruction sequence capable of atomically retrieving a value
  - S/W mechanisms are then constructed from these H/W primitives
• Special load: load linked
• Special store: store conditional
  - If the contents of memory changed before the store conditional, then the store conditional fails
  - Store conditional returns a value specifying success or failure

Uninterruptible Instruction to Fetch and Update Memory

• Atomic exchange: interchange a value in a register for a value in memory
  0 => synchronization variable is free
  1 => synchronization variable is locked and unavailable
  - Set register to 1 & swap
  - New value in register determines success in getting the lock:
    0 if you succeeded in setting the lock (you were first)
    1 if another processor had already claimed access
  - Key is that the exchange operation is indivisible
• Test-and-set: tests a value and sets it if the value passes the test
• Fetch-and-increment: returns the value of a memory location and atomically increments it
  - 0 => synchronization variable is free
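A sketch of a spin lock built on the atomic exchange just described (an added illustration, not from the slides; C11 atomics stand in for the hardware swap instruction):

/* Spin lock on atomic exchange: 0 = free, 1 = locked, as on the slide. */
#include <stdatomic.h>

typedef atomic_int spinlock_t;          /* 0 => free, 1 => locked        */

static void lock(spinlock_t *l) {
    /* "set register to 1 & swap": atomic_exchange returns the old value.
     * Old value 0 means we got the lock; 1 means another holder exists. */
    while (atomic_exchange(l, 1) == 1)
        ;                               /* spin until the exchange returns 0 */
}

static void unlock(spinlock_t *l) {
    atomic_store(l, 0);                 /* release: mark the variable free   */
}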

Coordination/Synchronization Constructs

• For shared memory and message passing, two types of synchronization activity:
  - Sequence control ... to enable correct operation
  - Access control ... to allow access to common resources
• Synchronization activities constitute an overhead!
• For SIMD these are done at the machine (H/W) level

Synchronization Constructs

• Barrier synchronization
  - for sequence control
  - processors wait at the barrier till all (or a subset) have completed
  - hardware implementations available
  - can also implement in s/w
• Critical section access control mechanisms
  - Test&Set lock protocols
  - Semaphores

Barrier Synchronization

• Many programs require that all processes come to a "barrier" point before proceeding further
  - this constitutes a synchronization point
• Concept of Barrier
  - When a processor hits a barrier it cannot proceed further until ALL processors have hit the barrier point
  - note that this forces a global synchronization point
• Can implement in S/W or Hardware
  - in s/w can implement using a shared variable; each processor checks the value of the shared variable (a sketch follows the example below)

Barrier Synch. . . Example

For i := 1 to N do in parallel
  A[i] := k * A[i];
  B[i] := A[i] + B[i];
endfor

BARRIER POINT

for i := 1 to N do in parallel
  C[i] := B[i] + B[i-1] + B[i-2];
endfor
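A software barrier along the lines sketched above, using a shared counter (an added illustration, not the course's code; the generation counter lets the same barrier be reused):

/* Centralized counter-based software barrier -- illustrative sketch. */
#include <stdatomic.h>

#define NPROC 8                         /* assumed number of processors   */

static atomic_int count = 0;            /* how many have arrived          */
static atomic_int generation = 0;       /* bumped each time barrier opens */

static void barrier(void) {
    int my_gen = atomic_load(&generation);
    if (atomic_fetch_add(&count, 1) == NPROC - 1) {
        /* last arrival: reset the counter and release everyone */
        atomic_store(&count, 0);
        atomic_fetch_add(&generation, 1);
    } else {
        /* earlier arrivals: spin on the shared variable */
        while (atomic_load(&generation) == my_gen)
            ;
    }
}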

Barrier Synchronization: Implementation

• Bus based
  - each processor sets a single bit when it arrives at the barrier
  - the collection of bits is sent to AND (or OR) gates
  - the outputs of the gates are sent to all processors
  - number of synchs/cycle grows with N (processors) if a change in a bit at one processor can be propagated in a single cycle
  - takes O(log N) in reality
  - how is the performance delay due to the barrier measured?
• Multiple barrier lines
  - a barrier bit sent to each processor
  - each can set a bit for each barrier line
  - X1,...,Xn in the processor; Y1,...,Yn is the barrier setting

Synchronization: Message Passing

• Synchronous vs. Asynchronous
• Synchronous: the sending and receiving processes synchronize in time and space
  - system must check if (i) the receiver is ready, (ii) a path is available, and (iii) one or more messages are to be sent to the same or multiple destinations
  - also known as blocking send-receive
  - the send and receive processes cannot continue past the instruction till the message transfer is complete
• Asynchronous: sender & receiver do not have to synchronize

Lock Protocols

Test&Set (lock)
  temp <- lock
  lock := 1
  return (temp)

Reset (lock)
  lock := 0

A process waits for the lock to be 0. (A C sketch follows below.)

Semaphores

• P(S) for shared variable/section S
  - test if S > 0; if so, decrement S and enter the critical section, else wait
• V(S)
  - increment S and exit
• note that P and V are blocking synchronization constructs
• can allow a number of concurrent accesses to S
• can remove indefinite waits by ???
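The Test&Set protocol above maps directly onto C11's atomic_flag; a minimal sketch (an added illustration, not from the slides):

/* Test&Set lock protocol using C11's atomic_flag.                    */
#include <stdatomic.h>

static atomic_flag lock_var = ATOMIC_FLAG_INIT;    /* clear <=> lock = 0 */

static void acquire(void) {
    /* test_and_set returns the previous value: spin while it was 1 (held) */
    while (atomic_flag_test_and_set(&lock_var))
        ;                                          /* wait for lock = 0   */
}

static void release(void) {
    atomic_flag_clear(&lock_var);                  /* Reset(lock)         */
}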

Semaphores : Example

Z = A*B + [ (C*D) * (I+G) ]
var S_w, S_y are semaphores
initial: S_w = S_y = 0

P1: begin
  U = A*B
  P(S_y)
  Z = U + Y
end

P2: begin
  W = C*D
  V(S_w)
end

P3: begin
  X = I + G
  P(S_w)
  Y = W * X
  V(S_y)
end

(A POSIX-semaphore version of this example appears below.)

Next -- Distributed Memory MPs

• Multiple processors connected through an interconnection network
• Network properties play a vital role in system performance
• Next…
  - Interconnection network definitions
  - Examples of routing on static topology networks – you are required to read the notes for some detailed discussion on this
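The Z = A*B + [(C*D)*(I+G)] example above maps onto POSIX semaphores, with sem_wait playing the role of P and sem_post the role of V (an added sketch; the input values are made up):

/* Three threads coordinate with two semaphores, as in the slide example. */
#include <pthread.h>
#include <semaphore.h>
#include <stdio.h>

static double A = 2, B = 3, C = 4, D = 5, I = 6, G = 7;   /* sample inputs */
static double U, W, X, Y, Z;
static sem_t S_w, S_y;

static void *P1(void *arg) { U = A * B; sem_wait(&S_y); Z = U + Y; return NULL; }
static void *P2(void *arg) { W = C * D; sem_post(&S_w); return NULL; }
static void *P3(void *arg) { X = I + G; sem_wait(&S_w); Y = W * X; sem_post(&S_y); return NULL; }

int main(void) {
    pthread_t t1, t2, t3;
    sem_init(&S_w, 0, 0);                    /* initial: S_w = S_y = 0 */
    sem_init(&S_y, 0, 0);
    pthread_create(&t1, NULL, P1, NULL);
    pthread_create(&t2, NULL, P2, NULL);
    pthread_create(&t3, NULL, P3, NULL);
    pthread_join(t1, NULL); pthread_join(t2, NULL); pthread_join(t3, NULL);
    printf("Z = %g\n", Z);                   /* 2*3 + (4*5)*(6+7) = 266 */
    sem_destroy(&S_w); sem_destroy(&S_y);
    return 0;
}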

Interconnection Networks

Two types :
-- direct networks with static interconnections :
   point-to-point direct connections between system elements
-- indirect networks with dynamic interconnections :
   dynamically programmable switched channels

Relevant aspects :
-- scalability
-- communication efficiency (latency)
-- flexibility of reconfiguration
-- complexity
-- cost

Basic Concepts and Definitions

Node degree : the number of edges (links) connected to a node

Diameter : the maximum, over all pairs of nodes, of the shortest path between the two nodes

Bisection :
-- channel bisection width ( b ) is the number of edges along a network bisection
-- wire bisection width ( B = bw ) is the number of wires along a network bisection, where w is the channel width in wires

Data routing functions : simple (primitive) and complex
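To make the definitions concrete, here are standard values for two common static topologies (an added illustration, not taken from these slides):
-- Ring of N nodes : node degree 2; diameter floor(N/2); channel bisection width b = 2
-- Binary hypercube of N = 2^k nodes : node degree k = log2 N; diameter log2 N; channel bisection width b = N/2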

Network Performance

Functionality : data routing, interrupt handling, synchronization, request/response combining, . . .

Network latency : the worst-case time delay for a message to be transferred through the network

Bandwidth : maximum data transfer rate (Mbits/sec)

Hardware complexity : implementation costs for components

Scalability : modular expansion with increasing machine resources

Static Connection Networks

[Figure: examples of static topologies -- a linear array and a ring.]

Bus Connection

[Figure: processors P1..Pn with caches C1..Cn, memory modules M1..Mn, the I/O
subsystem, and secondary storage all attached to a single interconnection bus.]

Dynamic Connection Networks

Characteristics : connections established dynamically based on program demands

Types : bus system, multistage interconnection network, crossbar switch network

Priorities : arbitration logic

Contention : conflict resolution

Crossbar network

[Figure: processors P1..Pn connect to memory modules M1..Mn through a grid of
crosspoint switches, one per processor-memory pair.]

Interconnection Networks

• The topology of the interconnection network (ICN) determines routing delays
  - need efficient routing algorithms
• Switching techniques also determine latency
  - packet, circuit, wormhole
• Details on some static topology networks are covered in the notes
  - You are required to read these notes

SIMD Architectures

• Single Instruction stream, Multiple Data stream
  - Each processor executes the same instruction on different data
• Efficient in applications requiring large data processing
  - Low-level image processing
  - Multimedia processing
  - Scientific computing
• Synchronization is implicit
  - All processors are in lock step with the control unit

SIMD Architectures…contd

• Control Unit (CU)
  - Broadcasts instructions to processors
  - Has memory for the program
  - Executes control flow instructions (branches)
• Processing Elements (PE)
  - Data distributed among PE memories
  - Each PE can be enabled or disabled using a Mask
  - MASK instruction broadcast by CU

SIMD. . . PE Organization

• Simple processors
  - Do not need to fetch instructions
  - Can be simple microcontrollers
• CPU
• Local memory to store data
• General purpose registers
• Address register – address of the PE
• Data transfer registers for the network (DTR_in, DTR_out)
• Status flag – enabled/disabled
• Index register – used in memory access
  - offset by x_i in the memory of PE i

SIMD Masking Schemes

• All PEs execute the same instruction (broadcast by the CU)
  - A masking scheme allows a subset of PEs to ignore/suspend the instruction
  - Only enabled processors execute the instruction
  - A masking/status register denotes whether the PE is enabled
  - If Reg = 1 then the PE is active, else inactive
  - CU can broadcast a MASK vector
    - One bit for each PE, or use log N bits to enable sets
  - Data/conditional masks
    - Allow each PE to set its Mask register depending on data
    - E.g.: If A < 0 then S = 1 (sets the Mask to 1 if the value of A in its local memory is less than 0)
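A rough software analogue of data-conditional masking (an added sketch, not from the slides): each array index plays the role of one PE's local memory, and the mask decides which "PEs" apply the broadcast operation.

/* Data-conditional masking, simulated: index i plays the role of PE i. */
#include <stdio.h>

#define NPE 8

int main(void) {
    int A[NPE]  = {3, -1, 4, -5, 9, -2, 6, 0};
    int mask[NPE];

    /* CU broadcasts "if A < 0 then S = 1": every PE sets its own mask bit */
    for (int i = 0; i < NPE; i++)
        mask[i] = (A[i] < 0);

    /* CU broadcasts "A := -A"; only enabled PEs (mask = 1) execute it */
    for (int i = 0; i < NPE; i++)
        if (mask[i])
            A[i] = -A[i];

    for (int i = 0; i < NPE; i++)
        printf("%d ", A[i]);            /* the negative entries are flipped */
    printf("\n");
    return 0;
}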

Processing on SIMD

• CU broadcasts instructions
• PEs execute – they can be simple decode and execute units
• CU can also broadcast a data value
• Time taken to process a task is the time to complete the task on all processors
• All processors execute the same instruction in one cycle
  - Note also that the processors are hardware synchronized (tied to the same clock?)

Matrix Multiplication on SIMD using CU

• Assume we have N processors in SIMD configuration
• Algorithm to multiply N by N matrices using the CU to broadcast an element:

For i := 1 to N do
  For j := 1 to N do
    C[i,j] := 0
    for k := 1 to N do
      C[i,j] := C[i,j] + A[i,k]*B[k,j]

• Note each row of A is required N times, for C[i,1], C[i,2], ..., C[i,N]

Matrix Multiplication on SIMD using CU

• Assume each processor P_k stores column k of matrix B
• CU can broadcast the current value of A
• Each processor k computes C[i,k] for all values of i
  - Processor k computes column k of the result matrix

Sample Code

For i := 1 to N do
  In Parallel for ALL processors P_k (i.e., enable all processors)
    Broadcast i          /* send value of i to all processors */
    C[i] := 0            /* initialize C[i] to 0 in each processor k */
    For j := 1 to N do
      Broadcast j
      Broadcast A[i,j]
      MULT A[i,j], B[j] -> temp
      ADD C[i], temp -> C[i]
    Endfor (j loop)
Endfor
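A sequential simulation of this broadcast algorithm (an added sketch, not the course's code): each value of k in the inner loop plays the role of processor P_k, which holds column k of B and accumulates column k of the result.

/* Simulation of the CU-broadcast matrix multiply sketched above. */
#include <stdio.h>
#define N 3

int main(void) {
    double A[N][N] = {{1,2,3},{4,5,6},{7,8,9}};
    double B[N][N] = {{1,0,0},{0,1,0},{0,0,1}};    /* identity, so C == A */
    double C[N][N];

    for (int i = 0; i < N; i++) {
        for (int k = 0; k < N; k++) C[i][k] = 0;   /* all PEs: C[i] := 0   */
        for (int j = 0; j < N; j++) {
            double a = A[i][j];                    /* CU broadcasts A[i,j] */
            for (int k = 0; k < N; k++)            /* all PEs in lock step */
                C[i][k] += a * B[j][k];            /* PE k uses B[j] from its column */
        }
    }
    printf("C[2][2] = %g\n", C[2][2]);             /* expect 9 with identity B */
    return 0;
}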

Time Analysis

• There are N^2 iterations at the control unit
  - Time taken is O(N^2)
• Instructions are broadcast to all PEs
• Essentially the k loop has been parallelized
  - Using N processors
• Requires each processor to store N elements of B and N elements of the result matrix
• Ideal speedup and efficiency
  - Got a speedup of N using N processors, i.e., 100% efficiency

Storage Rules

• In the previous example, the algorithm required that a row of B be operated on at each cycle
  - Since B was stored column-wise, this was not a problem
• What if a column of B has to be processed at each cycle?
  - Since an entire column is stored in one processor, this requires N cycles
  - No speedup and a waste of N processors
• Need to come up with better ways to store matrices
  - Allow row or column fetching in parallel
  - Skew storage rules allow this (see the sketch below)
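One classic skew scheme (an added illustration; the slides do not spell out the rule): store A[i][j] in memory module (i + j) mod N. Then the N elements of any row and of any column fall in N distinct modules, so either can be fetched in one parallel step.

/* Skewed storage sketch: module(i, j) = (i + j) mod N.                  */
#include <stdio.h>
#define N 4

int main(void) {
    /* The modules holding row 1 and column 2: each list contains every
     * module exactly once, so both fetches are conflict-free.           */
    printf("row 1 modules:    ");
    for (int j = 0; j < N; j++) printf("%d ", (1 + j) % N);
    printf("\ncolumn 2 modules: ");
    for (int i = 0; i < N; i++) printf("%d ", (i + 2) % N);
    printf("\n");
    return 0;
}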

Progression to MIMD

• Multiple Instruction, Multiple Data
• Shared memory or distributed memory
• Each processor executes its own program
  - processor must store instructions and data
  - larger memory required
  - more complex (than SIMD) processors
  - can also have heterogeneous processors

MIMD Issues

• H/W
  - Processors more complex, more memory
  - flexible communication
• S/W
  - each processor creates and terminates processes -> language constructs needed
  - O/S at each node
  - Coordination/Synchronization constructs
    - shared memory
    - message passing
  - load balancing and program partitioning
  - algorithm design: exploit functional parallelism

Moving from multiprocessor to distributed systems

Language Constructs

• Similarity to concurrent programming
• language constructs to express parallelism must
  - define subtasks to be executed in parallel
  - start and stop execution
  - coordinate and specify interaction
• examples (a FORK-JOIN sketch in C follows below):
  - FORK-JOIN (subsumes all other models)
  - Cobegin-Coend (Parbegin-Parend)
  - Forall/Doall
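A minimal fork-join sketch (an added illustration, not from the slides), using POSIX threads where pthread_create plays the role of FORK and pthread_join of JOIN:

/* FORK-JOIN with POSIX threads.                                        */
#include <pthread.h>
#include <stdio.h>

static void *subtask(void *arg) {
    int id = *(int *)arg;
    printf("subtask %d running\n", id);       /* work executed in parallel */
    return NULL;
}

int main(void) {
    pthread_t t[4];
    int id[4];
    for (int i = 0; i < 4; i++) {             /* FORK: start the subtasks  */
        id[i] = i;
        pthread_create(&t[i], NULL, subtask, &id[i]);
    }
    for (int i = 0; i < 4; i++)               /* JOIN: wait for completion */
        pthread_join(t[i], NULL);
    printf("all subtasks joined\n");
    return 0;
}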

Local Calls (Subroutines)

[Figure: the main program issues "call ABC (a, b, c)" to a library routine ABC
on the same (local) computer, which returns to the caller.]

Remote Calls (Sockets)

[Figure: the main program on the local computer sends (a, b, c) over the
network to an IP number and port on a remote computer; the remote side
receives the arguments, runs ABC (a, b, c), and returns the result over the
network.]

Object Request Broker (ORB)

[Figure: the main program on the local computer calls ABC (a, b, c); ORB
platforms on the local and remote computers forward the call across the
network to the remote computer, where ABC executes.]

Java Remote Method Invocation (RMI)

[Figure: the same structure with a JVM on each side: the local call
ABC (a, b, c) is shipped across the network to a remote JVM, which executes
the method.]

Scalability

• Performance must scale with
  - system size
  - problem/workload size
• Amdahl's Law (worked example below)
  - perfect speedup cannot be achieved, since there is an inherently sequential part to every program
• Scalability measures
  - Efficiency (speedup per processor)

Parallel Algorithms

• Solving problems on a multiprocessor architecture requires the design of parallel algorithms
• How do we measure the efficiency of a parallel algorithm?
  - 10 seconds on Machine 1 and 20 seconds on Machine 2 – which algorithm is better?
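A worked Amdahl's Law example (an added illustration; the fraction used is made up): if a fraction s of a program is inherently sequential, the speedup on P processors is at most

  Speedup(P) = 1 / ( s + (1 - s)/P )

With s = 0.1 and P = 16: Speedup = 1 / (0.1 + 0.9/16) = 1 / 0.15625 = 6.4, far below the "perfect" speedup of 16, and it can never exceed 1/s = 10 even with unlimited processors.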

Parallel Algorithm Complexity

• Parallel time complexity
  - Express in terms of input size and system size (number of processors)
  - T(N,P): input size N, P processors
  - Relationship between N and P
    - Independent size analysis – no link between N and P
    - Dependent size – P is a function of N; e.g., P = N/2
• Speedup: how much faster than sequential
  - S(P) = T(N,1) / T(N,P)
• Efficiency: speedup per processor
  - S(P) / P

Parallel Computation Models

• Shared Memory
  - Protocol for shared memory? ... what happens when two processors/processes try to access the same data
  - EREW: Exclusive Read, Exclusive Write
  - CREW: Concurrent Read, Exclusive Write
  - CRCW: Concurrent Read, Concurrent Write
• Distributed Memory
  - Explicit communication through message passing
  - Send/Receive instructions

Formal Models of Parallel Computation

• Alternating Turing machine
• P-RAM model
  - Extension of the sequential Random Access Machine (RAM) model
• RAM model
  - One program
  - One memory
  - One accumulator
  - One read/write tape

P-RAM model

• P programs, one per processor
• One memory
  - In distributed memory it becomes P memories
• P accumulators
• One read/write tape
• Depending on the shared memory protocol we have
  - EREW PRAM
  - CREW PRAM
  - CRCW PRAM

PRAM Model

• Assumes synchronous execution
• Idealized machine
  - Helps in developing theoretically sound solutions
  - Actual performance will depend on machine characteristics and language implementation

PRAM Algorithms -- Summing

• Add N numbers in parallel using P processors
  - How to parallelize?
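One common answer, sketched below (an added illustration, not the course's code): pairwise tree summation, which takes ceil(log2 N) synchronous steps using N/2 processors. The sequential loops here simulate the parallel steps.

/* PRAM-style pairwise summation, simulated: in each step, "processor" i
 * adds in the element one stride away; after ceil(log2 N) steps A[0]
 * holds the total.                                                     */
#include <stdio.h>
#define N 8

int main(void) {
    int A[N] = {1, 2, 3, 4, 5, 6, 7, 8};

    for (int stride = 1; stride < N; stride *= 2)      /* log N synchronous steps */
        for (int i = 0; i + stride < N; i += 2 * stride)
            A[i] += A[i + stride];                     /* done in parallel on a PRAM */

    printf("sum = %d\n", A[0]);                        /* expect 36 */
    return 0;
}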

Parallel Summing

• Using N/2 processors to sum N numbers in O(log N) time
• Independent size analysis:
  - Do a sequential sum of N/P values on each processor and then add the partial sums in parallel
  - Time = O(N/P + log P)

Parallel Sorting on CREW PRAM

• Sort N numbers using P processors
  - Assume P is unlimited for now
• Given an unsorted list (a1, a2, ..., an), create the sorted list W, where W[i] < W[i+1]
• Where does a1 go?

Parallel Sorting on CREW PRAM

• Using P = N^2 processors (a simulation sketch appears below)
• Each processor P(i,j) compares ai and aj
  - If ai > aj then R[i,j] = 1 else R[i,j] = 0
  - Time = O(1)
• Each "row" of processors P(i,j), j = 1 to N, does a parallel sum to compute the rank
  - Compute R[i] = sum of R[i,j]
  - Time = O(log N)
• Write ai into W[R[i]]
• Total time complexity = O(log N)

Parallel Algorithms

• The design of a parallel algorithm has to take the system architecture into consideration
• Must minimize interprocessor communication in a distributed memory system
  - Communication time is much larger than computation time
  - Communication time can dominate computation if the problem is not "partitioned" well
• Efficiency
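A sequential simulation of the rank sort above (an added sketch, assuming distinct keys and 0-based indexing of W; on the CREW PRAM the N^2 comparisons run in O(1) and the N rank sums in O(log N)):

/* Rank ("enumeration") sort -- sequential simulation of the CREW PRAM
 * algorithm. On the PRAM, each (i,j) comparison and each rank sum is
 * handled by its own processor.                                        */
#include <stdio.h>
#define N 6

int main(void) {
    int a[N] = {42, 7, 19, 88, 3, 55};
    int W[N];

    for (int i = 0; i < N; i++) {
        int rank = 0;                      /* R[i] = number of smaller elements */
        for (int j = 0; j < N; j++)
            if (a[j] < a[i]) rank++;       /* the O(1) compare step, per (i,j)  */
        W[rank] = a[i];                    /* write a[i] into W[R[i]]           */
    }
    for (int i = 0; i < N; i++) printf("%d ", W[i]);   /* 3 7 19 42 55 88 */
    printf("\n");
    return 0;
}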

Next topics…

• Memory design
  - Single processor – high-performance processors
  - Focus on cache
  - Multiprocessor cache design
• "Special Architectures"
  - Embedded Systems
  - Reconfigurable architectures – FPGA technology
  - Cluster and Networked Computing
