CS 211: Computer Architecture
Bhagi Narahari, Lab. for Embedded Systems (LEMS), CS, GWU
Hardware and Software Parallelism

Hardware parallelism:
-- Defined by machine architecture and hardware multiplicity
-- Number of instruction issues per machine cycle
-- k issues per machine cycle: a k-issue processor

Software parallelism:
-- Control and data dependence of programs
-- Compiler extensions
-- OS extensions (parallel scheduling, shared-memory allocation, communication links)

Implicit parallelism:
-- Conventional programming language
-- Parallelizing compiler

Explicit parallelism:
-- Parallelizing constructs in programming languages
-- Parallel program development tools
-- Debugging, validation, testing, etc.

Software vs. Hardware Parallelism (Example)
[Figure: software parallelism of the example program is three cycles -- Cycle 1: four loads L1-L4; Cycle 2: multiplies X1, X2; Cycle 3: add (+) and subtract (-).]
Software vs. Hardware Parallelism (Example)
[Figure: cycle-by-cycle schedules of the same example under hardware constraints; the surviving fragments show one schedule of 7 cycles (loads L3, L4, multiply X1, then + and -) and one of 6 cycles with stores S1, S2 and both multiplies X1, X2.]
Control Parallelism:
Data Parallelism:
-- The same operation performed on many data elements
-- The highest potential for concurrency
-- Requires compilation support, parallel programming languages, and hardware redesign

Program Partitioning

Grain:
-- Program segment to be executed on a single processor
-- Coarse-grain, medium-grain, and fine-grain

Parallelism (granularity):
-- Instruction level (fine grain -- 20 instructions in a segment)
-- Loop level (fine grain -- 500 instructions)
-- Procedure level (medium grain -- 2000 instructions)
-- Subprogram level (medium grain -- thousands of instructions)
-- Job/program level (coarse grain)
Levels of Program Grains
[Figure: five grain levels --
Level 5: jobs and programs (coarse grain)
Level 4: subprograms, job steps, or parts of a program (medium grain)
Level 3: procedures, subroutines, or tasks (medium grain)
Level 2: nonrecursive loops or unfolded iterations (fine grain)
Level 1: instructions or statements (fine grain)
Finer grains give a higher degree of parallelism but increase communication demand and scheduling overhead.]

Partitioning and Scheduling

Grain packing:
-- How to partition a program into program segments to get the shortest possible execution time?
-- What is the optimal size of concurrent grains?

Program graph:
-- Each node (n, s) corresponds to a computational unit: n -- node name, s -- grain size
-- Each edge between two nodes is labeled (v, d), denoting the output variable v and the communication delay d

Example (a C rendering follows below):
1. a := 1        10. j := e x f
2. b := 2        11. k := d x f
3. c := 3        12. l := j x k
4. d := 4        13. m := 4 x l
5. e := 5        14. n := 3 x m
6. f := 6        15. o := n x i
7. g := a x b    16. p := o x h
8. h := c x d    17. q := p x g
9. i := d x e
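To make the fine-grain example concrete, here is a plain C rendering of the 17 statements. The grouping indicated in the comments is one illustrative way to pack the statements into larger grains; it is an assumption for illustration, not necessarily the packing used in the lecture's coarse-grain graph.

```c
#include <stdio.h>

/* The 17-statement example written as straight-line C.
 * Each statement is a fine-grain node (n, s); packing adjacent
 * statements into one grain removes communication edges between them. */
int main(void) {
    /* Grain 1 (illustrative packing): constant assignments, statements 1-6 */
    int a = 1, b = 2, c = 3, d = 4, e = 5, f = 6;

    /* Grain 2: first-level products, statements 7-11 */
    int g = a * b;
    int h = c * d;
    int i = d * e;
    int j = e * f;
    int k = d * f;

    /* Grain 3: dependent chain, statements 12-17 */
    int l = j * k;
    int m = 4 * l;
    int n = 3 * m;
    int o = n * i;
    int p = o * h;
    int q = p * g;

    printf("q = %d\n", q);
    return 0;
}
```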
Fine-grain program graph (before packing)
Coarse-grain program graph (after packing)
[Figures: each node is labeled (n, s) with node name n and grain size s; after packing, the 17 fine-grain nodes are grouped into a few coarse-grain nodes such as (A, 8) and (E, 6).]
Scheduling of the fine-grain and coarse-grain programs
[Figure: schedules of the fine-grain and coarse-grain versions of the example; only the time marks 28, 38, and 42 survive extraction.]

Multiprocessor Architectures: Program Flow Mechanisms

Control flow:
-- Conventional computers
-- Instruction execution controlled by the PC
-- Instruction sequence explicitly stated in the user program

Data flow:
-- Data-driven execution
-- Instructions executed as soon as their input data are available
-- Higher degree of parallelism at the fine-grain level

Reduction computers:
-- Use reduced instruction sets
-- Demand-driven
-- Instructions executed when their results are needed
Multiprocessor Architectures: Scope of Course
• We will focus on parallel control-flow architectures

Review: Parallel Processing Intro
• Long-term goal of the field: scale the number of processors to the size of the budget and the desired performance
• Successes today:
- dense-matrix scientific computing (petroleum, automotive, aeronautics, pharmaceuticals)
- file servers, databases, web search engines
- entertainment/graphics
• Communication abstraction:
- Shared address space: e.g., load, store, atomic swap
- Message passing: e.g., send, receive library calls
- Debate over this topic (ease of programming vs. scaling to large systems) => many hardware designs map 1:1 to a programming model
Shared Address/Memory Multiprocessor Model
• Communicate via load and store
- Oldest and most popular model
• Based on timesharing: processes on multiple processors vs. sharing a single processor
• Process: a virtual address space
• Processors connect to memory AND to I/O
• Bus based: all memory locations have equal access time, so SMP = "symmetric MP"
- Limited bus BW is shared as processors and I/O are added
• Crossbar: expensive to expand

Example: Small-Scale MP Designs
• Memory: centralized with uniform memory access time ("UMA") and bus interconnect, I/O
• Examples: Sun Enterprise 6000, SGI Challenge, Intel SystemPro
[Figure: four processors on an SMP interconnect sharing memory and I/O; memory access shown as 1 cycle]

Large-Scale MP Designs
• Memory: distributed with nonuniform access time ("NUMA") and a scalable interconnect (distributed memory)
• Examples: Cray T3E (see Ch. 1, Figs 1-21, page 45 of [CSG96])
[Figure: nodes of processor + cache, each with local memory and I/O, connected by a scalable interconnect]
Shared Address Model Summary
• Each processor can name every physical location in the machine
• Each process can name all data it shares with other processes
• Data transfer via load and store
• Data size: byte, word, ..., or cache blocks
• Uses virtual memory to map virtual addresses to local or remote physical addresses
• Memory hierarchy model applies: communication moves data to the local processor's cache (as a load moves data from memory to cache)
- Latency, BW (cache block?), scalability when communicating?

Message Passing Model
• Whole computers (CPU, memory, I/O devices) communicate via explicit I/O operations
- Essentially NUMA, but integrated at the I/O devices rather than the memory system
• Send specifies a local buffer + the receiving process on a remote computer
• Receive specifies the sending process on a remote computer + a local buffer to place the data
- Usually send includes a process tag, and receive has a rule on the tag: match one, match any
- Synchronization: when the send completes, when the buffer is free, when the request is accepted; receive waits for send
• Send + receive => memory-to-memory copy, where each side supplies its local address, AND does pairwise synchronization! (a sketch follows below)
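To make the message-passing model concrete, here is a minimal sketch using the standard MPI C calls (MPI_Send/MPI_Recv). The slides describe the model generically; choosing MPI, the tag value 42, and a four-integer buffer are assumptions for illustration.

```c
/* Message-passing sketch: process 0 sends a local buffer to process 1.
 * Send names the local buffer, the destination process, and a tag;
 * receive names the source, a tag rule, and a local buffer. */
#include <mpi.h>
#include <stdio.h>

int main(int argc, char **argv) {
    int rank, data[4] = {1, 2, 3, 4};
    MPI_Init(&argc, &argv);
    MPI_Comm_rank(MPI_COMM_WORLD, &rank);

    if (rank == 0) {
        /* Send: local buffer + receiving process (1) + tag (42). */
        MPI_Send(data, 4, MPI_INT, 1, 42, MPI_COMM_WORLD);
    } else if (rank == 1) {
        int recv_buf[4];
        /* Receive: sending process (0; MPI_ANY_SOURCE would mean "match any"),
         * tag rule (42; MPI_ANY_TAG would mean "match any"), local buffer. */
        MPI_Recv(recv_buf, 4, MPI_INT, 0, 42, MPI_COMM_WORLD, MPI_STATUS_IGNORE);
        printf("rank 1 got %d %d %d %d\n",
               recv_buf[0], recv_buf[1], recv_buf[2], recv_buf[3]);
    }
    MPI_Finalize();
    return 0;
}
```

The send/receive pair is exactly the memory-to-memory copy plus pairwise synchronization described above; in the shared address model the same exchange would be an ordinary store by one processor and a load by the other, with a lock or flag providing the synchronization.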
Popular Flynn Architecture Categories
• SISD (single instruction stream, single data stream)
- Uniprocessors
• SIMD example: Ci <= Ai * Bi (e.g., Cray vector processing, Thinking Machines CM*); a short C sketch follows below
• MISD: more of an intellectual exercise than a practical configuration; few built, and commercially not available

SISD: A Conventional Computer
[Figure: a single processor with one instruction stream, one data input stream, and one data output stream]

[Figure residue: per-category diagrams showing processors A, B, and C with their instruction streams (A, B, C) and data input/output streams]
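A sketch of the Ci <= Ai * Bi operation above. On a real SIMD machine each PE i would hold Ai, Bi, and Ci and execute one broadcast multiply in lock step; here that behavior is only approximated with a vectorizable C loop (the `#pragma omp simd` directive is an assumption, and any vectorizing compiler would do).

```c
#include <stdio.h>

#define N 8

/* SIMD-style elementwise multiply: the same instruction (multiply)
 * applied at every element position i, C[i] <= A[i] * B[i]. */
void simd_multiply(const float *A, const float *B, float *C, int n) {
    #pragma omp simd
    for (int i = 0; i < n; i++)
        C[i] = A[i] * B[i];   /* conceptually, PE i executes this in lock step */
}

int main(void) {
    float A[N], B[N], C[N];
    for (int i = 0; i < N; i++) { A[i] = i; B[i] = 2.0f * i; }
    simd_multiply(A, B, C, N);
    for (int i = 0; i < N; i++) printf("%.1f ", C[i]);
    printf("\n");
    return 0;
}
```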
MIMD Architecture
[Figure: processors A, B, and C, each with its own instruction stream (A, B, C) and its own data input and output streams]
Unlike SISD and MISD, an MIMD computer works asynchronously.
- Shared-memory (tightly coupled) MIMD
- Distributed-memory (loosely coupled) MIMD

Shared Memory MIMD Machine
[Figure: processors A, B, and C, each connected through a memory bus to a global memory system]
Comm: the source PE writes data to global memory (GM) and the destination PE retrieves it.
=> Easy to build; conventional OSes for SISD machines can easily be ported.
=> Limitation: reliability and expandability -- a memory component or any processor failure affects the whole system.
=> Increasing the number of processors leads to memory contention.
Ex.: Silicon Graphics supercomputers...
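A hedged pthreads sketch of the communication pattern above: one thread (the "source PE") writes data into global memory and another (the "destination PE") retrieves it. The mutex and condition variable used for the hand-off are assumptions; the slide does not prescribe a particular synchronization mechanism.

```c
#include <pthread.h>
#include <stdio.h>

/* "Global memory": a data word plus a ready flag, protected by a mutex. */
static int gm_data;
static int gm_ready = 0;
static pthread_mutex_t gm_lock = PTHREAD_MUTEX_INITIALIZER;
static pthread_cond_t  gm_cond = PTHREAD_COND_INITIALIZER;

static void *source_pe(void *arg) {
    (void)arg;
    pthread_mutex_lock(&gm_lock);
    gm_data = 42;                 /* source PE writes data to GM */
    gm_ready = 1;
    pthread_cond_signal(&gm_cond);
    pthread_mutex_unlock(&gm_lock);
    return NULL;
}

static void *dest_pe(void *arg) {
    (void)arg;
    pthread_mutex_lock(&gm_lock);
    while (!gm_ready)             /* destination PE waits, then retrieves it */
        pthread_cond_wait(&gm_cond, &gm_lock);
    printf("destination read %d from global memory\n", gm_data);
    pthread_mutex_unlock(&gm_lock);
    return NULL;
}

int main(void) {
    pthread_t src, dst;
    pthread_create(&dst, NULL, dest_pe, NULL);
    pthread_create(&src, NULL, source_pe, NULL);
    pthread_join(src, NULL);
    pthread_join(dst, NULL);
    return 0;
}
```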
Data Parallel Model
• Vector processors have similar ISAs, but no data placement restriction
• SIMD led to data parallel programming languages
• Advancing VLSI led to single-chip FPUs and whole fast microprocessors (making SIMD less attractive)
• The SIMD programming model led to the Single Program Multiple Data (SPMD) model
- All processors execute an identical program
• Data parallel programming languages still ...

Convergence in Parallel Architecture
• Complete computers connected to a scalable network via a communication assist
• Different programming models place different requirements on the communication assist
- Shared address space: tight integration with memory to capture memory events that interact with others, and to accept requests from other nodes
- Message passing: send messages quickly and respond to incoming messages (tag match, allocate buffer, transfer data, wait for receive posting)
- Data parallel: fast global synchronization
Fundamental Issue #1: Naming
Fundamental Issue #2: Synchronization

Latency and Bandwidth
• Bandwidth
- Need high bandwidth in communication
- Cannot scale, but stay close
• Latency
- Affects performance, since the processor may have to wait
- Affects ease of programming, since it requires more thought to overlap communication and computation
• Latency hiding
- How can a mechanism help hide latency?
- Examples: overlap message send with computation, prefetch data, switch to other tasks

Caching Shared Data
• Caches serve to:
- Increase bandwidth versus the bus/memory
- Reduce latency of access
- Valuable for both private data and shared data
• What about cache consistency?
[Figure: processors, each with multiple levels of cache, sharing a bus to main memory and the I/O system]
The Problem of Cache Coherency
[Figure: multiple CPUs with private caches sharing a common memory]

What Does Coherency Mean?
Hardware-Level Synchronization
• Key is to provide an uninterruptible instruction, or instruction sequence, capable of atomically retrieving and updating a value
- S/W mechanisms are then constructed from these H/W primitives
• Special load: load linked
• Special store: store conditional
- If the contents of memory changed before the store conditional, the store conditional fails
- The store conditional returns a value specifying success or failure

Uninterruptible Instruction to Fetch and Update Memory (a C sketch follows below)
• Atomic exchange: interchange a value in a register for a value in memory
- 0 => synchronization variable is free
- 1 => synchronization variable is locked and unavailable
- Set the register to 1 and swap
- The new value in the register determines success in getting the lock:
  0 if you succeeded in setting the lock (you were first)
  1 if another processor had already claimed access
- Key is that the exchange operation is indivisible
• Test-and-set: tests a value and sets it if the value passes the test
• Fetch-and-increment: returns the value of a memory location and atomically increments it
- 0 => synchronization variable is free
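A minimal C11 sketch of the primitives above: a spin lock built from atomic exchange (a register value of 1 swapped with the synchronization variable; getting 0 back means we were first), plus atomic fetch-and-increment. On a machine with load-linked/store-conditional, `atomic_exchange` would typically compile into an LL/SC loop. This is an illustration, not the lecture's own code.

```c
#include <stdatomic.h>
#include <stdio.h>

static atomic_int lock_var = 0;   /* 0 => free, 1 => locked and unavailable */
static atomic_int counter  = 0;

/* Spin lock via atomic exchange: keep swapping 1 into the lock
 * until the old value comes back as 0 (i.e., we got it first). */
static void acquire(atomic_int *lock) {
    while (atomic_exchange(lock, 1) != 0)
        ;  /* spin: another processor already claimed access */
}

static void release(atomic_int *lock) {
    atomic_store(lock, 0);        /* synchronization variable is free again */
}

int main(void) {
    acquire(&lock_var);
    /* critical section would go here */
    release(&lock_var);

    /* Fetch-and-increment: returns the old value and atomically adds 1. */
    int old = atomic_fetch_add(&counter, 1);
    printf("old counter = %d, new counter = %d\n", old, atomic_load(&counter));
    return 0;
}
```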
Barrier Synchronization
Barrier Synch. ... Example
Lock Protocols
Semaphores
Interconnection Networks

Two types:
-- Direct networks with static interconnections: point-to-point direct connections between system elements
-- Indirect networks with dynamic interconnections: switched, dynamically programmable channels

Relevant aspects:
-- Scalability
-- Communication efficiency (latency)
-- Flexibility of reconfiguration
-- Complexity
-- Cost

Basic Concepts and Definitions

Node degree: the number of edges (links) connected to a node
Diameter: the maximum, over all pairs of nodes, of the shortest path between two nodes
Bisection:
-- Channel bisection width (b) is the number of edges along a network bisection
-- Wire bisection width (B = b·w) is the number of wires along a network bisection, where w is the channel width in wires
Data routing functions: simple (primitive) and complex
For example, a ring of N nodes has node degree 2, diameter ⌊N/2⌋, and channel bisection width 2.
Network Performance
Functionality: data routing, interrupt handling, synchronization, request/response combining, ...
Network latency: the worst-case time delay for a message to be transferred through the network

Static Connection Networks
[Figure: examples of static topologies -- linear array, ring]
Dynamic Connection Networks
Characteristics: connections established dynamically based on program demands
Types: bus system, multistage interconnection network, crossbar switch network

Bus Connection
[Figure: processors P1 ... Pn with caches C1 ... Cn and the I/O subsystem attached to a shared interconnection bus]
Priorities: arbitration logic
Crossbar Network
[Figure: processors P1 ... Pn connected to memories M1 ... Mn through a crossbar switch]

Interconnection Networks
• Switching techniques also determine latency
- Packet, circuit, wormhole
• Details on some static topology networks are covered in the notes
- You are required to read these notes
SIMD Architectures
SIMD Architectures ... contd.
Processing on SIMD
Matrix Multiplication on SIMD using CU
Time Analysis (a sketch follows below)
• There are N² iterations at the control unit
- Time taken is N²
• Instructions are broadcast to all PEs
• Essentially the k loop has been parallelized
- Using k processors
• Requires each processor to store N elements of B and N elements of the result matrix
• Ideal speedup and efficiency
- Got a speedup of N using N processors, for 100% efficiency

Storage Rules
• In the previous example, the algorithm required that a row of B be operated on at each cycle
- Since B was stored column-wise, this was not a problem
• What if a column of B has to be processed at each cycle?
- Since an entire column is stored in one processor, this requires N cycles
- No speedup, and a waste of N processors
• Need to come up with better ways to store matrices
- Allow row or column fetching in parallel
- Skew storage rules allow this
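A hedged serial C sketch of the SIMD matrix multiply analyzed above: the two outer loops (N² iterations) model the control unit broadcasting A[i][j], and the innermost k loop models the N PEs executing the same multiply-accumulate in lock step, with PE k holding column k of B and column k of C. The loop structure is an assumption consistent with the time analysis, not code taken from the slides.

```c
#include <stdio.h>
#define N 4

/* Simulated SIMD matrix multiply C = A * B.
 * Outer i, j loops: N*N iterations issued by the control unit.
 * Inner k loop: done "in parallel" by the N PEs; PE k stores
 * column k of B and column k of C (N elements of each). */
void simd_matmul(const int A[N][N], const int B[N][N], int C[N][N]) {
    for (int i = 0; i < N; i++)
        for (int k = 0; k < N; k++)
            C[i][k] = 0;

    for (int i = 0; i < N; i++)          /* }  N*N iterations at the         */
        for (int j = 0; j < N; j++)      /* }  control unit: broadcast A[i][j] */
            for (int k = 0; k < N; k++)  /* the k loop: one PE per k, lock step */
                C[i][k] += A[i][j] * B[j][k];
}

int main(void) {
    int A[N][N], B[N][N], C[N][N];
    for (int i = 0; i < N; i++)
        for (int j = 0; j < N; j++) { A[i][j] = i + j; B[i][j] = (i == j); }
    simd_matmul(A, B, C);                /* B is the identity, so C should equal A */
    printf("C[1][2] = %d (expect %d)\n", C[1][2], A[1][2]);
    return 0;
}
```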
Moving from Multiprocessors to Distributed Systems

Language Constructs
• Similarity to concurrent programming
• Language constructs to express parallelism must:
- Define the subtasks to be executed in parallel
- Start and stop their execution
- Coordinate and specify interaction
• Examples (a fork-join sketch follows below):
- FORK-JOIN (subsumes all other models)
- Cobegin-Coend (Parbegin-Parend)
- Forall/Doall
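A hedged illustration of the FORK-JOIN construct using POSIX threads: pthread_create plays the role of FORK (defining and starting a subtask) and pthread_join the role of JOIN (waiting for it to finish). The other constructs named above (Cobegin-Coend, Forall/Doall) can be expressed in terms of this pattern.

```c
#include <pthread.h>
#include <stdio.h>

/* Subtask to be executed in parallel with the parent. */
static void *subtask(void *arg) {
    long id = (long)arg;
    printf("subtask %ld running\n", id);
    return NULL;
}

int main(void) {
    pthread_t t[2];

    /* FORK: start two subtasks that execute concurrently with main. */
    for (long i = 0; i < 2; i++)
        pthread_create(&t[i], NULL, subtask, (void *)i);

    printf("parent continues while the subtasks run\n");

    /* JOIN: wait for both subtasks before proceeding. */
    for (int i = 0; i < 2; i++)
        pthread_join(t[i], NULL);

    printf("all subtasks joined\n");
    return 0;
}
```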
Local Calls (Subroutines)
[Figure: the main program issues "call ABC(a, b, c)"; the library routine ABC(a, b, c) runs on the same (local) computer and returns]

Remote Calls (Sockets)
[Figure: the main program on the local computer does send(a, b, c) over the network to an IP number and port; the remote computer does receive(a, b, c), executes ABC(a, b, c), and returns the result over the network]
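A hedged client-side sketch of the remote call in the figure: send(a, b, c) to a server identified by IP number and port, then receive the result of ABC(a, b, c). The server itself, the wire format (three ints out, one int back), and the address 127.0.0.1:5000 are all assumptions for illustration.

```c
#include <arpa/inet.h>
#include <netinet/in.h>
#include <stdio.h>
#include <string.h>
#include <sys/socket.h>
#include <unistd.h>

int main(void) {
    int a = 1, b = 2, c = 3, args[3] = {a, b, c}, result;

    int fd = socket(AF_INET, SOCK_STREAM, 0);           /* TCP socket */
    struct sockaddr_in server;
    memset(&server, 0, sizeof(server));
    server.sin_family = AF_INET;
    server.sin_port = htons(5000);                      /* port (assumed)      */
    inet_pton(AF_INET, "127.0.0.1", &server.sin_addr);  /* IP number (assumed) */

    if (connect(fd, (struct sockaddr *)&server, sizeof(server)) < 0) {
        perror("connect");
        return 1;
    }
    write(fd, args, sizeof(args));      /* send(a, b, c) to the remote computer */
    read(fd, &result, sizeof(result));  /* receive the result of ABC(a, b, c)   */
    printf("remote ABC(%d,%d,%d) = %d\n", a, b, c, result);
    close(fd);
    return 0;
}
```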
[Figures: two further slides elaborate the remote call between a main program on the local computer and the remote computer; only the labels survive extraction.]
Parallel Algorithm Complexity
- Speedup S(P) = T(1)/T(P); efficiency = S(P)/P

Parallel Computation Models
PRAM Model
PRAM Algorithms: Summing
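The summing slide itself is graphical and did not survive extraction. Below is a minimal sketch of the standard PRAM-style summation of n values in O(log n) steps, simulated with an OpenMP loop standing in for the PRAM processors; the pairwise tree schedule is the textbook algorithm and is assumed to be what the slide illustrated.

```c
#include <stdio.h>

#define N 8   /* number of values; assumed to be a power of two */

/* PRAM-style summation: in the step with the given stride, processor i
 * (for each eligible i) adds a[i + stride] into a[i]. After log2(N)
 * synchronous steps, a[0] holds the total. The parallel-for pragma
 * stands in for the PRAM processors; its implicit barrier marks the
 * end of each PRAM step. */
int pram_sum(int a[N]) {
    for (int stride = 1; stride < N; stride *= 2) {
        #pragma omp parallel for
        for (int i = 0; i < N; i += 2 * stride)
            a[i] += a[i + stride];
    }
    return a[0];
}

int main(void) {
    int a[N] = {1, 2, 3, 4, 5, 6, 7, 8};
    printf("sum = %d (expect 36)\n", pram_sum(a));
    return 0;
}
```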
Parallel Sorting on CREW PRAM
Parallel Algorithms
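Again, the slide body is graphical. One classic CREW PRAM sorting algorithm that fits this title is enumeration (rank) sort: with n² processors, processor (i, j) compares a[i] with a[j] concurrently (concurrent reads are allowed), the comparisons for each i are summed to give the rank of a[i], and a[i] is then written to position rank[i] (an exclusive write). Whether this is the exact algorithm in the slides is an assumption; the sketch below simulates it sequentially.

```c
#include <stdio.h>

#define N 6

/* CREW PRAM enumeration sort (simulated): every (i, j) comparison is
 * logically done by its own processor in one parallel step; ranks are
 * then used for exclusive writes into the output array. Ties are broken
 * by index so that all ranks are distinct. */
void enumeration_sort(const int a[N], int out[N]) {
    int rank[N] = {0};
    for (int i = 0; i < N; i++)          /* conceptually, all (i, j) pairs */
        for (int j = 0; j < N; j++)      /* are compared concurrently (CREW) */
            if (a[j] < a[i] || (a[j] == a[i] && j < i))
                rank[i]++;
    for (int i = 0; i < N; i++)
        out[rank[i]] = a[i];             /* exclusive write by processor i */
}

int main(void) {
    int a[N] = {5, 1, 4, 1, 3, 2}, out[N];
    enumeration_sort(a, out);
    for (int i = 0; i < N; i++) printf("%d ", out[i]);   /* 1 1 2 3 4 5 */
    printf("\n");
    return 0;
}
```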
Next topics...
• Memory design
- Single processor: high-performance processors
- Focus on cache