Necessity of High Performance Parallel Computing
Performance has improved by 100x or more per decade: clock rate accounts for roughly 10x, the rest comes from growth in transistor count
How to use more transistors?
Parallelism in processing
multiple operations per cycle reduces CPI
Locality in data access
avoids latency and reduces CPI
also improves processor utilization
Both need resources, so tradeoff
Fundamental issue is resource distribution, as in uniprocessors
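To make the tradeoff concrete, here is the standard back-of-the-envelope performance model (the notation is assumed, not from the slides):

    $T_{exec} = IC \times CPI \times \tau_{clock}, \qquad CPI = CPI_{issue} + \text{memory stall cycles per instruction}$

Parallelism in processing attacks the first term (issuing multiple operations per cycle lowers $CPI_{issue}$); locality attacks the second (fewer long-latency memory accesses); both compete for the same transistor budget.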
Technology: A Closer Look
[Figure: the basic technology building blocks: processor (Proc), cache ($), and interconnect]
Clock Frequency Growth Rate
[Figure: clock rate (MHz, log scale from 0.1 to 1,000) versus year, 1970-2005, for the i4004, i8008, i8080, i8086, i80286, i80386, Pentium 100, and R10000; clock rate grows about 30% per year]
Transistor Count Growth Rate
[Figure: transistors per chip (log scale from 1,000 to 100,000,000) versus year, 1970-2005, for the i4004, i8008, i8080, i8086, i80286, i80386, R2000, R3000, Pentium, and R10000]
About 100 million transistors per chip by the early 2000s
Transistor count grows much faster than clock rate: roughly 40% per year versus 30%, an order of magnitude more contribution over two decades
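As a quick check of the compounding (growth rates read off the two figures; the arithmetic is ours):

    $1.30^{10} \approx 13.8 \text{ (clock rate per decade)}, \qquad 1.40^{10} \approx 28.9 \text{ (transistor count per decade)}$

and over two decades, $1.40^{20} \approx 840$ versus $1.30^{20} \approx 190$.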
Storage Performance
Divergence between memory capacity and speed is even more pronounced
Capacity increased by 1000x from 1980-2005, speed only 5x
4 Gigabit DRAM now, but gap with processor speed much greater
Larger memories are slower, while processors get faster
Need to transfer more data in parallel
Need deeper cache hierarchies
How to organize caches?
Parallelism increases the effective size of each level of the hierarchy, without increasing access time
Parallelism and locality within memory systems too (see the sketch below)
New designs fetch many bits within the memory chip; then follow with fast pipelined transfer across a narrower interface
Buffer caches most recently accessed data
Disks too: Parallel disks plus caching
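A minimal sketch of locality in action (our example, not from the slides): the two functions below do identical arithmetic, but the row-order loop walks memory sequentially and exploits the cache, while the column-order loop strides across whole rows and defeats it.

    #include <stdio.h>

    #define N 2048
    static double a[N][N];

    /* row-major traversal: consecutive accesses fall in the same cache lines */
    static double sum_rows(void) {
        double s = 0.0;
        for (int i = 0; i < N; i++)
            for (int j = 0; j < N; j++)
                s += a[i][j];
        return s;
    }

    /* column-major traversal: same work, but each access strides N doubles */
    static double sum_cols(void) {
        double s = 0.0;
        for (int j = 0; j < N; j++)
            for (int i = 0; i < N; i++)
                s += a[i][j];
        return s;
    }

    int main(void) {
        printf("%.0f %.0f\n", sum_rows(), sum_cols());
        return 0;
    }

Timing the two on any cached machine shows the column version running several times slower, purely from locality.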
Architectural Trends
Architecture translates technology's gifts into performance and capability
Resolves the tradeoff between parallelism and locality
Current microprocessor: roughly 1/3 compute, 1/3 cache, 1/3 off-chip connect
Tradeoffs may change with scale and technology advances
Understanding microprocessor architectural trends
Helps build intuition about design issues of parallel machines
Shows fundamental role of parallelism even in sequential computers
Four generations of architectural history: tube, transistor, IC, VLSI
Here focus only on VLSI generation
Greatest delineation in VLSI has been in the type of parallelism exploited
Architectural Trends in VLSI
Greatest trend in VLSI generation is increase in parallelism
Up to 1985: bit-level parallelism: 4-bit -> 8-bit -> 16-bit
slows after 32-bit
adoption of 64-bit now under way, 128-bit still far off (not a performance issue)
great inflection point when a 32-bit micro and cache fit on one chip
Mid-80s to mid-90s: instruction-level parallelism
pipelining and simple instruction sets, plus compiler advances (RISC)
on-chip caches and functional units => superscalar execution
greater sophistication: out-of-order execution, speculation, prediction
to deal with control transfer and latency problems
2000s: Thread-level parallelism (a minimal sketch follows)
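A minimal sketch of thread-level parallelism (our example; POSIX threads, compile with -pthread): two threads sum disjoint halves of an array, the kind of coarse parallelism this generation exposes in hardware.

    #include <pthread.h>
    #include <stdio.h>

    #define N 1000000
    static double a[N];
    static double partial[2];   /* one slot per thread: no sharing, no races */

    static void *worker(void *arg) {
        long id = (long)arg;    /* thread index: 0 or 1 */
        double s = 0.0;
        for (long i = id * (N / 2); i < (id + 1) * (N / 2); i++)
            s += a[i];
        partial[id] = s;
        return NULL;
    }

    int main(void) {
        for (long i = 0; i < N; i++)
            a[i] = 1.0;
        pthread_t t[2];
        for (long id = 0; id < 2; id++)
            pthread_create(&t[id], NULL, worker, (void *)id);
        for (long id = 0; id < 2; id++)
            pthread_join(t[id], NULL);
        printf("sum = %.0f\n", partial[0] + partial[1]);
        return 0;
    }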
ILP Ideal Potential
[Figure: left, fraction of total cycles (%) versus number of instructions issued (0 to 6+); right, speedup (0 to 3) versus instructions issued per cycle (0 to 15); assumes infinite resources and fetch bandwidth with perfect branch prediction and renaming, but real caches and non-zero miss latencies]
Raw Uniprocessor Performance: LINPACK
[Figure: LINPACK performance (MFLOPS, log scale from 1 to 10,000) versus year, 1975-2000, for CRAY vector machines (CRAY 1s, Xmp/14se, Xmp/416, Ymp, C90, T94) and microprocessors (Sun 4/260, MIPS M/120, MIPS M/2000, IBM RS6000/540, HP 9000/750, DEC Alpha AXP, DEC Alpha, HP9000/735, MIPS R4400, IBM Power2/990, DEC 8200), each on 100x100 and 1,000x1,000 matrices]
Summary: Why Parallel Architecture?
Increasingly attractive
Economics, technology, architecture, application demand
Increasingly central and mainstream
Parallelism exploited at many levels
Instruction-level parallelism
Multiprocessor servers
Large-scale multiprocessors (MPPs)
Focus of this class: multiprocessor level of parallelism
Same story from memory system perspective
Increase bandwidth, reduce average latency with many local memories
Wide range of parallel architectures make sense
Different cost, performance and scalability
History of parallel architectures
Historically, parallel architectures tied to programming models
Divergent architectures, with no predictable pattern of growth.
Uncertainty of direction paralyzed parallel software development!
[Figure: application software and system software sitting atop divergent architectures: systolic arrays, SIMD, message passing, dataflow, and shared memory]
Today's architectural viewpoint
Architectural enhancements to support communication and cooperation
OLD: Instruction Set Architecture
NEW: Communication Architecture
Defines
Critical abstractions, boundaries, and primitives (interfaces)
Organizational structures that implement interfaces (h/w or s/w)
Compilers, libraries and OS are important bridges today
Modern Layered Framework
[Figure: layered framework: parallel applications (CAD, database, scientific modeling) atop programming models (multiprogramming, shared address, message passing, data parallel), atop the communication abstraction (the user/system boundary, realized by compilation or library and operating systems support), atop communication hardware and the physical communication medium (the hardware/software boundary)]
Programming Model
What the programmer uses in coding applications
Specifies communication and synchronization
Examples:
Multiprogramming: no communication or synch. at program level
Shared address space: like bulletin board
Message passing: like letters or phone calls, explicit point to point
Data parallel: more regimented, global actions on data
Implemented with shared address space or message passing
Communication Abstraction
User level communication primitives provided
Realizes the programming model
Mapping exists between language primitives of the programming model and these primitives
Supported directly by HW, or via OS, or via user SW
Much debate about what to support in SW and about the gap between layers
Today:
HW/SW interface tends to be flat, i.e. complexity roughly uniform
Compilers and software play important roles as bridges today
Technology trends exert strong influence
Result is convergence in organizational structure
Relatively simple, general purpose communication primitives
Communication Architecture
Comm. Arch. = User/System Interface + Implementation
User/System Interface:
Comm. primitives exposed to user level by HW and system-level SW
Implementation:
Organizational structures that implement the primitives: HW or OS
How optimized are they? How integrated into processing node?
Structure of network
Goals:
Performance
Broad applicability
Programmability
Scalability
Low Cost
Evolution of Architectural Models
Historically machines tailored to programming models
Prog. model, comm. abstraction, and machine organization lumped together as the architecture
Evolution helps understand convergence
Identify core concepts
Shared Address Space
Message Passing
Data Parallel
Others:
Dataflow
Systolic Arrays
Examine programming model, motivation, intended applications, and contributions to convergence
Shared Address Space Architectures
Any processor can directly reference any memory location
Communication occurs implicitly as result of loads and stores
Convenient:
Location transparency
Similar programming model to time-sharing on uniprocessors
Except processes run on different processors
Good throughput on multiprogrammed workloads
Naturally provided on wide range of platforms
History dates at least to precursors of mainframes in early 60s
Wide range of scale: few to hundreds of processors
Popularly known as shared memory machines or model
Ambiguous: memory may be physically distributed among processors
Shared Address Space Model
Process: virtual address space plus one or more threads of control
Portions of address spaces of processes are shared
Writes to a shared address are visible to other threads (in other processes too)
Natural extension of the uniprocessor model: conventional memory operations for communication; special atomic operations for synchronization (see the sketch after the figure)
OS uses shared memory to coordinate processes
[Figure: virtual address spaces of processes P0..Pn communicating via shared addresses; loads and stores to the shared portion of each address space map to common physical addresses in the machine's physical address space, while the private portions remain distinct]
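A minimal sketch of the model (our example; C with pthreads and C11 atomics): data moves through ordinary stores and loads on the shared portion of the address space, and one atomic flag provides the synchronization.

    #include <pthread.h>
    #include <stdatomic.h>
    #include <stdio.h>

    static int shared_data;        /* communicated by plain store and load */
    static atomic_bool ready;      /* special atomic operation for synch */

    static void *producer(void *arg) {
        shared_data = 42;          /* conventional memory operation */
        atomic_store_explicit(&ready, true, memory_order_release);
        return NULL;
    }

    static void *consumer(void *arg) {
        while (!atomic_load_explicit(&ready, memory_order_acquire))
            ;                      /* spin until the producer signals */
        printf("consumer saw %d\n", shared_data);
        return NULL;
    }

    int main(void) {
        pthread_t p, c;
        pthread_create(&c, NULL, consumer, NULL);
        pthread_create(&p, NULL, producer, NULL);
        pthread_join(p, NULL);
        pthread_join(c, NULL);
        return 0;
    }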
Communication Hardware in SAS arch.
Also natural extension of uniprocessor
Already have processor, one or more memory modules, and I/O controllers connected by a hardware interconnect of some sort
Memory capacity is increased by adding modules, I/O by adding controllers
Add processors for processing!
For higher-throughput multiprogramming, or parallel programs
[Figure: processors, memory modules, and I/O controllers (with I/O devices) attached to a common hardware interconnect]
History of SAS architectures
Mainframe approach
Motivated by multiprogramming
Extends crossbar used for mem bw and I/O
Originally, processor cost limited systems to small scale
later, the cost of the crossbar did
Bandwidth scales with p
High incremental cost; use multistage instead
Minicomputer approach
Almost all microprocessor systems have a bus
Motivated by multiprogramming and transaction processing (TP)
Used heavily for parallel computing
Called symmetric multiprocessor (SMP)
Latency larger than for uniprocessor
Bus is bandwidth bottleneck
caching is key: coherence problem
Low incremental cost
[Figure: mainframe organization: processors and I/O ports reach memory modules through a crossbar; minicomputer organization: processors with caches ($), memory modules, and I/O controllers share a single bus]
Example of SAS: Intel Pentium Pro Quad
[Figure: four P-Pro modules, each with a CPU, bus interface, MIU, and 256-KB L2 cache, on the shared P-Pro bus (64-bit data, 36-bit address, 66 MHz); an interrupt controller, two PCI bridges to PCI buses with PCI I/O cards, and a memory controller to 1-, 2-, or 4-way interleaved DRAM]
All coherence and multiprocessing glue is in the processor module
Highly integrated, targeted at high volume
Low latency and bandwidth
Winter 2007 ENGR9861 R. Venkatesan
High-Performance Computer Architecture
35
Example of SAS: SUN Enterprise
[Figure: the Gigaplane bus (256-bit data, 41-bit address, 83 MHz) connecting CPU/memory cards (two processors, each with L1 and L2 caches, plus a memory controller) and I/O cards (bus interface/switch with SBUS slots, 2 Fibre Channel, 100bT, and SCSI)]
16 cards of either type: processors + memory, or I/O
All memory is accessed over the bus, so the machine is symmetric
Higher-bandwidth, higher-latency bus
Scaling Up SAS Architecture
Problem is interconnect: cost (crossbar) or bandwidth (bus)
Dance-hall: bandwidth still scalable, but lower cost than crossbar
latencies to memory uniform, but uniformly large
Distributed memory or non-uniform memory access (NUMA)
Construct a shared address space out of simple message transactions across a general-purpose network (e.g. read-request, read-response); a sketch follows the figure
Caching shared (particularly nonlocal) data?
[Figure: "dance hall" organization: processors with caches on one side of the network, memory modules on the other; distributed-memory organization: a memory module attached directly to each processor node]
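A hedged sketch of the read-request/read-response idea (our example, written with MPI for concreteness; the tags and memory layout are assumptions): rank 1 performs a "load" from rank 0's memory as a pair of message transactions.

    #include <mpi.h>
    #include <stdio.h>

    enum { TAG_READ_REQ = 1, TAG_READ_RESP = 2 };

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {                   /* home node: serve one read */
            long mem[16] = { 0 };          /* this node's local memory */
            mem[3] = 42;
            long addr;
            MPI_Status st;
            MPI_Recv(&addr, 1, MPI_LONG, MPI_ANY_SOURCE, TAG_READ_REQ,
                     MPI_COMM_WORLD, &st);
            MPI_Send(&mem[addr], 1, MPI_LONG, st.MPI_SOURCE, TAG_READ_RESP,
                     MPI_COMM_WORLD);
        } else if (rank == 1) {            /* requester: a nonlocal "load" */
            long addr = 3, value;
            MPI_Send(&addr, 1, MPI_LONG, 0, TAG_READ_REQ, MPI_COMM_WORLD);
            MPI_Recv(&value, 1, MPI_LONG, 0, TAG_READ_RESP, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);
            printf("remote load returned %ld\n", value);
        }
        MPI_Finalize();
        return 0;
    }

Run with mpirun -np 2; a machine like the T3E does the equivalent in its memory controller, without software on the home node.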
Example of scaled-up SAS arch.: Cray T3E
Scales up to 1024 processors, with 480 MB/s links
Memory controller generates communication requests for nonlocal references
No hardware mechanism for coherence (SGI Origin and others provide this)
[Figure: Cray T3E node: processor with cache, local memory, and a combined memory controller/network interface, attached by a switch to X, Y, and Z network links and external I/O]
Message Passing Architectures
Complete computer as building block, including I/O
Communication via explicit I/O operations
Programming model: directly access only the private address space (local memory); communicate via explicit messages (send/receive)
High-level block diagram similar to distributed-memory SAS
But communication is integrated at the I/O level; it needn't reach into the memory system
Like networks of workstations (clusters), but tighter integration
Easier to build than scalable SAS
Programming model more removed from basic hardware operations
Library or OS intervention
Message-Passing Abstraction
Send specifies buffer to be transmitted and receiving process
Recv specifies sending process and application storage to receive into
Memory to memory copy, but need to name processes
Optional tag on send and matching rule on receive
User process names local data and entities in process/tag space too
In the simplest form, the send/recv match achieves a pairwise synchronization event
Other variants too
Many overheads: copying, buffer management, protection
[Figure: process P executes Send X, Q, t while process Q executes Receive Y, P, t; the matched pair copies local address X in P's address space to local address Y in Q's address space]
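The abstraction in the figure maps almost directly onto MPI (a sketch under that assumption; the slides name no particular library): P sends local buffer X to Q with tag t, and Q receives into local buffer Y, matching on source and tag.

    #include <mpi.h>
    #include <stdio.h>

    int main(int argc, char **argv) {
        MPI_Init(&argc, &argv);
        int rank;
        const int t = 7;                   /* the optional tag */
        MPI_Comm_rank(MPI_COMM_WORLD, &rank);

        if (rank == 0) {                   /* process P */
            char X[32] = "hello from P";   /* local address X */
            MPI_Send(X, sizeof X, MPI_CHAR, 1, t,
                     MPI_COMM_WORLD);      /* Send X, Q, t */
        } else if (rank == 1) {            /* process Q */
            char Y[32];                    /* local address Y */
            MPI_Recv(Y, sizeof Y, MPI_CHAR, 0, t, MPI_COMM_WORLD,
                     MPI_STATUS_IGNORE);   /* Receive Y, P, t */
            printf("Q received: %s\n", Y);
        }
        MPI_Finalize();
        return 0;
    }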
Evolution of Message-Passing Machines
Early machines: FIFO on each link
Hardware close to programming model; synchronous ops
Replaced by DMA, enabling non-blocking ops
Buffered by system at destination until recv
Diminishing role of topology
Store-and-forward routing: topology important
Introduction of pipelined routing made it less so
Cost is in node-network interface
Simplifies programming
[Figure: an early message-passing topology: a 3-cube (hypercube) with nodes labeled 000 through 111]
Example of MP arch.: IBM SP-2
Made out of essentially complete RS6000 workstations
Network interface integrated into the I/O bus (bandwidth limited by the I/O bus)
[Figure: IBM SP-2 node: Power 2 CPU with L2 cache on the memory bus, memory controller to 4-way interleaved DRAM, and a NIC (i860, NI, DMA, DRAM) on the MicroChannel I/O bus; nodes joined by a general interconnection network formed from 8-port switches]
Example of MP arch.: Intel Paragon
[Figure: Intel Paragon node: two i860 processors with L1 caches on a 64-bit, 50 MHz memory bus, memory controller to 4-way interleaved DRAM, and an NI/DMA driver onto a 2D grid network (8-bit, 175 MHz, bidirectional links) with a processing node attached to every switch; used in Sandia's Intel Paragon XP/S-based supercomputer]
Toward Architectural Convergence
Evolution and the role of software have blurred the boundary
Send/recv supported on SAS machines via buffers
Can construct a global address space on MP using hashing (sketched below)
Page-based (or finer-grained) shared virtual memory
Hardware organization converging too
Tighter NI integration even for MP (low-latency, high-bandwidth)
At lower level, even hardware SAS passes hardware messages
Even clusters of workstations/SMPs are parallel systems
Emergence of fast system area networks (SAN)
Programming models distinct, but organizations converging
Nodes connected by general network and communication assists
Implementations also converging, at least in high-end machines
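A small sketch of the hashing idea above (the mapping details are our assumptions): every global address deterministically names a home node and a local offset, so software can build a shared address space out of messages to the right home.

    #include <stdio.h>

    #define NODES     64
    #define PAGE_BITS 12   /* assume 4-KB pages */

    /* hypothetical mapping: hash the page number to pick the home node */
    static int home_node(unsigned long gaddr) {
        return (int)((gaddr >> PAGE_BITS) % NODES);
    }

    static unsigned long local_offset(unsigned long gaddr) {
        return gaddr & ((1UL << PAGE_BITS) - 1);
    }

    int main(void) {
        unsigned long g = 0x12345678UL;
        printf("global %#lx -> node %d, offset %#lx\n",
               g, home_node(g), local_offset(g));
        return 0;
    }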