Pipelined Vector Processing and Scientific Computation: Eine Zeitreise in Die Welt Der Computer. 1
John G. Zabolitzky
time ====>
instruction   1        2        3        4        5        6        7
1             fetch    decode   execute
2                      fetch    decode   execute
3                               fetch    decode   execute
4                                        fetch    decode   execute
5                                                 fetch    decode   execute
6                                                          fetch    decode
7                                                                   fetch
Pipelined execution on parallel functional units
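The overlap in the diagram can be sketched in a few lines of Python. This is only an illustration under simple assumptions: three stages of one cycle each and no stalls. The function names are hypothetical and not part of the slides.

```python
# Minimal sketch of an ideal 3-stage pipeline (fetch, decode, execute).
# Assumptions: every stage takes exactly one cycle and nothing stalls.

STAGES = ("fetch", "decode", "execute")


def pipelined_cycles(n_instructions, n_stages=len(STAGES)):
    """Cycles until the last instruction has left the pipeline."""
    if n_instructions == 0:
        return 0
    # The first instruction needs n_stages cycles; each further one adds a cycle.
    return n_stages + (n_instructions - 1)


def sequential_cycles(n_instructions, n_stages=len(STAGES)):
    """Cycles if each instruction finishes before the next one starts."""
    return n_instructions * n_stages


def schedule(n_instructions, stages=STAGES):
    """Map instruction number -> {cycle: stage}, both numbered from 1."""
    return {
        i: {i + offset: name for offset, name in enumerate(stages)}
        for i in range(1, n_instructions + 1)
    }


if __name__ == "__main__":
    print("pipelined :", pipelined_cycles(5), "cycles for 5 instructions")
    print("sequential:", sequential_cycles(5), "cycles for 5 instructions")
```

Under these assumptions the first five instructions of the diagram complete within the seven cycles shown, while instructions 6 and 7 are still in flight.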
Scalar Code Example
DO i=1,100 a(i)=b(i)*c(i)
load b, inc address
load c, inc address
multiply
store a, inc address
decrement count, loop?
5 instructions = 5 cycles (optimum) for one multiply
pipelined multiply: could start one multiply each and every cycle => only 20% efficient use
expensive multiplier sits idle most of the time
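The 20% figure follows directly from the instruction count. A minimal sketch, assuming one instruction issues per cycle (the optimum case named above); the list and function names are illustrative only:

```python
# Sketch of the scalar loop body for a(i) = b(i) * c(i).
# Assumption: one instruction issues per cycle, the optimum case.

LOOP_BODY = [
    "load b",
    "load c",
    "multiply",
    "store a",
    "decrement count and branch",
]


def multiplier_utilization(loop_body=LOOP_BODY):
    """Fraction of cycles in which a new multiply is started."""
    multiplies = sum(1 for op in loop_body if op == "multiply")
    return multiplies / len(loop_body)


def scalar_cycles(n_elements, loop_body=LOOP_BODY):
    """Total cycles for the whole loop at one instruction per cycle."""
    return n_elements * len(loop_body)


if __name__ == "__main__":
    print("cycles for 100 elements:", scalar_cycles(100))
    print("multiplier utilization :", multiplier_utilization())
```

A pipelined multiplier could accept a new operand pair every cycle, so in this model four out of every five of its issue slots go unused.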
* SIMD (Single Instruction Multiple Data) parallel arithmetic (e.g., ILLIAC IV)
too expensive, inefficient: larger number of lightly used multipliers
* MIMD (Multiple Instruction Multiple Data) true parallel streams, e.g. Cray T3E, IBM Blue Gene, IBM Cell: may be superimposed on top of ANY CPU architecture
Scientific codes spend a high percentage of their time looping over simple data structures
memory-to-memory architecture
therefore long startup times (~n00 cycles)
very slow scalar unit (~2 MFLOPS)
overall disappointing performance
contracted 1967, announced 1972, delivered 1974
total of 4 machines, 2 of them at Lawrence Livermore Lab
Thornton (CDC) and Fernbach (LLL) lose their jobs
Photograph courtesy of Charles Babbage Institute, University of Minnesota, Minneapolis
7 units sold
[Figure: ops/clock (axis 0 to 4) versus vector length (axis 0 to 500); curves s10_r1, s100_r1, s10_r4, s100_r4 and scalar_0.2]
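A simple startup model gives curves of this shape. The sketch below is an interpretation, not taken from the slides: it assumes the legend names mean "s" = startup cycles and "r" = results per clock once the pipeline is full, and that scalar_0.2 is a constant 0.2 ops/clock. The function names are hypothetical.

```python
# Sketch of a vector startup model.
# Assumption: a vector operation of length n costs `startup` cycles of
# overhead plus n / rate cycles of streaming, so
#     ops/clock = n / (startup + n / rate)

SCALAR_OPS_PER_CLOCK = 0.2  # assumed meaning of the "scalar_0.2" curve


def ops_per_clock(n, startup, rate):
    """Average operations per clock for one vector operation of length n."""
    if n == 0:
        return 0.0
    return n / (startup + n / rate)


def break_even_length(startup, rate, scalar=SCALAR_OPS_PER_CLOCK):
    """Vector length at which the vector unit matches the scalar rate."""
    # Solve n / (startup + n / rate) = scalar for n.
    return scalar * startup / (1.0 - scalar / rate)


if __name__ == "__main__":
    for startup, rate in [(10, 1), (100, 1), (10, 4), (100, 4)]:
        row = [round(ops_per_clock(n, startup, rate), 2) for n in (10, 100, 500)]
        print(f"s{startup}_r{rate}: n=10,100,500 ->", row)
```

In this model every curve approaches its rate r for long vectors, but a startup of 100 cycles keeps the achieved rate well below that limit even at length 500.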
Vector Performance II
Vector/Scalar Subsections
[Figure: performance (axis 0 to 1.2) versus a horizontal axis running 0 to 1.2 (label not recoverable); curves r5, r10, r20, r50, r100]
Must have SHORT vector startup => can work with short vectors
Must have FASTEST POSSIBLE scalar unit => can afford scalar sections
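The second requirement can be illustrated with an Amdahl-style estimate. This is a sketch under stated assumptions: f is the fraction of the work that runs in vector mode, r is the vector-to-scalar speed ratio (one possible reading of labels such as r5 or r100), and performance is normalized to pure vector speed. None of these symbols are defined in the slides.

```python
# Sketch of an Amdahl-style model for mixed vector/scalar code.
# Assumptions: fraction f of the work runs at vector speed 1.0, the
# remaining (1 - f) runs at scalar speed 1 / r.


def relative_performance(f, r):
    """Performance relative to a fully vectorized run (1.0 = all vector)."""
    if not 0.0 <= f <= 1.0:
        raise ValueError("vectorized fraction must be between 0 and 1")
    # Time per unit of work: vector part costs f, scalar part costs (1 - f) * r.
    return 1.0 / (f + (1.0 - f) * r)


if __name__ == "__main__":
    for r in (5, 10, 20, 50, 100):
        row = [round(relative_performance(f, r), 3) for f in (0.5, 0.9, 0.99, 1.0)]
        print(f"r{r}: f=0.5,0.9,0.99,1.0 ->", row)
```

With a large ratio r, even a small scalar remainder dominates the run time in this model, which is why a fast scalar unit matters as much as a fast vector unit.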
irregular data structures ==> need gather, scatter, merge operations (and a few more)
x(i) = a(index(i)) * b(i)
y(index(i)) = c(i) + d(i)
where (a(i) > b(i)) c(i) = d(i)
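The three statements above correspond to gather, scatter and merge (masked assignment). A minimal sketch of their element-wise meaning in plain Python, using 0-based indices instead of the 1-based Fortran ones; the function names are illustrative only:

```python
# Sketch of gather, scatter and merge on plain Python lists.
# Note: indices are 0-based here, unlike the Fortran statements.


def gather_multiply(a, b, index):
    """x(i) = a(index(i)) * b(i): read a through an index vector."""
    return [a[index[i]] * b[i] for i in range(len(b))]


def scatter_sum(y, c, d, index):
    """y(index(i)) = c(i) + d(i): write y through an index vector."""
    result = list(y)  # copy so the caller's list is left unchanged
    for i in range(len(index)):
        result[index[i]] = c[i] + d[i]
    return result


def merge_where(a, b, c, d):
    """where (a(i) > b(i)) c(i) = d(i): assign only where the mask is true."""
    return [d[i] if a[i] > b[i] else c[i] for i in range(len(c))]


if __name__ == "__main__":
    print(gather_multiply([10, 20, 30], [1, 2, 3], [2, 0, 1]))
    print(scatter_sum([0, 0, 0], [1, 2, 3], [10, 20, 30], [2, 0, 1]))
    print(merge_where([1, 5, 3], [2, 2, 3], [0, 0, 0], [7, 8, 9]))
```

A vector machine performs each of these as a single instruction over a whole register of elements; the loops here only spell out what happens per element.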
Cray-1
1976
Single Processor
80 MFLOPS
1 Mword = 8 Mbyte
Photograph courtesy of Charles Babbage Institute, University of Minnesota, Minneapolis
[Block diagram; recoverable labels: Shared Main Memory, 128 MW, 1 Gbyte, 4 ports/proc, 4.2 Gbyte/sec, 33 MHz, Vi, Si, Ai, Tjk and Bjk (64 word buffer memory), 8 scalar registers (64 bit), scalar functional units, 8 instruction buffers (32 words each), instruction issue, IOS, Y1 channel 40 Mbyte/sec]
Large working set:
- 8 vector registers, 64 words
- 8 scalar registers
- 8 address registers
- large instruction buffer
Performance Features:
- vector processing: one operation affects 64 vector elements, streamed through functional unit
- small vector startup time
- chaining between vector registers
- large, fast
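Because one vector operation covers at most 64 elements, a longer loop has to be processed in pieces of up to 64. The sketch below shows this strip-mining idea for a(i) = b(i) * c(i). It is an illustration of the general technique, not code from the slides, and the names are hypothetical.

```python
# Sketch of strip-mining a long loop into pieces that fit a vector register.
# Assumption: a register holds 64 elements, as stated for the Cray-1.

VECTOR_REGISTER_LENGTH = 64


def strips(n, length=VECTOR_REGISTER_LENGTH):
    """Split the range 0..n into (start, stop) pieces of at most `length`."""
    return [(start, min(start + length, n)) for start in range(0, n, length)]


def vector_multiply(b, c):
    """Compute a[i] = b[i] * c[i] one strip at a time."""
    a = [0.0] * len(b)
    for start, stop in strips(len(b)):
        # Each strip stands for one vector load/multiply/store sequence.
        a[start:stop] = [x * y for x, y in zip(b[start:stop], c[start:stop])]
    return a


if __name__ == "__main__":
    print(strips(100))
```

For the 100-element loop used earlier this yields one full strip of 64 elements and a remainder of 36, so the vector startup cost is paid twice rather than 100 times.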
Cray Research, Inc. (cont'd)