Pipelined Vector Processing and Scientific Computation: Eine Zeitreise in die Welt der Computer

Pipelined Vector Processing and Scientific Computation
John G. Zabolitzky
Eine Zeitreise in die Welt der Computer (A Journey Through Time into the World of Computers)

Applications of High-Performance Computing
weather prediction, climate simulation
fluid dynamics simulation (aerodynamics for aerospace, automobile, combustion, ...)
basic science
  cosmology
  quantum mechanical many-body problems
    chemistry
    solid-state
    quantum fluids
  high-energy physics
cryptography
weapons research
energy research
  nuclear reactor simulation
  fusion research
many, many more

Terminal State of Scalar Computing: CDC 7600, 1968
Maximum RISC performance of 1 operation/cycle achieved
No further improvement possible without change of paradigm
36 MHz => 36 MIPS => 5 MFLOPS real
The CDC 7600 (designed by Seymour Cray) was the most powerful of all computers from 1968 to 1976, when the Cray-1 achieved more than 10 times its performance

Pipelined Scalar Execution
time (cycles) ====>
instruction   1       2       3       4       5       6       7
1             fetch   decode  execute
2 ... 5       (same pattern, each shifted one cycle to the right)
6                                                     fetch   decode
7                                                             fetch
Pipelined execution on parallel functional units
Scalar Code Example
DO i=1,100 a(i)=b(i)*c(i)
load b, inc address
load c, inc address
multiply
store a, inc address
decrement count, loop?
5 instructions = 5 cycles (optimum) for one multiply
pipelined multiplier could start one multiply each and every cycle => only 20% efficient use
expensive multiplier sits idle most of the time
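
The 20% figure follows directly from the instruction count. A minimal sketch of that arithmetic, assuming one instruction issues per cycle as the slide states; the names `LOOP_BODY` and `multiplier_utilization` are hypothetical and used only for illustration:

```python
# Loop body from the slide; assumption: one instruction issues per cycle.
LOOP_BODY = [
    "load b, inc address",
    "load c, inc address",
    "multiply",
    "store a, inc address",
    "decrement count, loop?",
]


def multiplier_utilization(instructions):
    """Fraction of issue cycles in which the multiplier receives new work."""
    multiplies = sum(1 for ins in instructions if ins == "multiply")
    return multiplies / len(instructions)


utilization = multiplier_utilization(LOOP_BODY)
print(f"{len(LOOP_BODY)} cycles per multiply, utilization {utilization:.0%}")
```

Any extra instruction in the loop body lowers the utilization further, which is the motivation for issuing the whole loop as one vector instruction.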

Architectural Alternatives
* Pipelined Scalar (RISC) as outlined before
* Pipelined Vector (covered later in this presentation)
* SIMD (Single Instruction Multiple Data) parallel arithmetic (e.g., ILLIAC IV)
  too expensive, inefficient: large number of lightly used multipliers
* Superscalar = multiple issue in one cycle
  all modern single-chip CPUs (Intel to TI); keeps all functional units busy
* VLIW (Very Long Instruction Word) = variant of Superscalar
* MIMD (Multiple Instruction Multiple Data) true parallel streams, e.g. Cray T3E, IBM Blue Gene, IBM Cell: may be superimposed on top of ANY CPU architecture

Vector Computation
Scientific codes spend a high percentage of their time looping over simple data structures
DO i=1,100 a(i) = b*c(i) + d(i)
simple logical structure ==> hardware can be set up to start one multiply per cycle
one instruction for the entire loop
MFLOP rate = cycle rate or a multiple thereof
specialized for scientific/engineering tasks

Vector Pipeline c(i)=a(i)*b(i)
fetch a(i++), fetch b(i++) -> multip. 1 -> multip. 2 -> multip. 3 -> multip. 4 -> store c(i++)
time  fetch  multip. 1  multip. 2  multip. 3  multip. 4  store
 |    i=1
 |    2      1
 |    3      2          1
 |    4      3          2          1
 V    5      4          3          2          1
      6      5          4          3          2          1
      7      6          5          4          3          2
      8      7          6          5          4          3
Inventor: Henry Ford
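
The occupancy pattern of the pipeline diagram can be reproduced with a short simulation. This is a sketch under the assumption of six stages (one fetch stage, four multiplier stages, one store stage) that each take one cycle; `pipeline_schedule` and `STAGES` are hypothetical names:

```python
def pipeline_schedule(n_elements, stages):
    """Return one row per cycle; row[s] is the element in stage s, or None."""
    total_cycles = n_elements + len(stages) - 1
    rows = []
    for cycle in range(total_cycles):
        row = []
        for s in range(len(stages)):
            element = cycle - s + 1  # element i enters stage 0 in cycle i - 1
            row.append(element if 1 <= element <= n_elements else None)
        rows.append(row)
    return rows


STAGES = ["fetch", "multip. 1", "multip. 2", "multip. 3", "multip. 4", "store"]

# Print the first eight cycles, matching the rows of the diagram.
for row in pipeline_schedule(8, STAGES)[:8]:
    print(" ".join(f"{e:>2}" if e is not None else "  " for e in row).rstrip())
```

Once the pipeline is full, one result leaves the store stage every cycle; the first result appears only after as many cycles as there are stages, which is the startup cost.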

Need to vectorize: some of it is automatic, but high quality requires hand optimization
Naive scalar code for matrix multiply (inner loop over j)
  s = 0.0
  do j = 1, n
    s = s + a(i,j)*b(j,k)
  end do
  c(i,k) = s
Recursive on s => adder pipeline blocked
Vector code for matrix multiply (inner loop over i)
  do i = 1, n
    c(i,k) = c(i,k) + a(i,j)*b(j,k)
  end do
Independent vector elements, but 1.5x memory bandwidth
Frequently a good idea: exchange inner and outer loops
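
That the two loop orders compute the same product can be checked with a small sketch. The Python below is an illustration only, not the original Fortran; indices are 0-based and the function names `matmul_dot_form` and `matmul_column_form` are hypothetical:

```python
def matmul_dot_form(a, b, n):
    """Inner loop over j: every add depends on the previous value of s."""
    c = [[0.0] * n for _ in range(n)]
    for i in range(n):
        for k in range(n):
            s = 0.0
            for j in range(n):
                s = s + a[i][j] * b[j][k]
            c[i][k] = s
    return c


def matmul_column_form(a, b, n):
    """Inner loop over i: each c[i][k] update is independent of the others."""
    c = [[0.0] * n for _ in range(n)]
    for k in range(n):
        for j in range(n):
            bjk = b[j][k]  # scalar, constant over the inner loop
            for i in range(n):
                c[i][k] = c[i][k] + a[i][j] * bjk
    return c


a = [[1.0, 2.0], [3.0, 4.0]]
b = [[5.0, 6.0], [7.0, 8.0]]
assert matmul_dot_form(a, b, 2) == matmul_column_form(a, b, 2)
print(matmul_column_form(a, b, 2))
```

In the second form the inner loop reads c and a and writes c for each multiply-add, compared with two reads in the first form, which is where the 1.5x bandwidth figure comes from.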

First Vector Computers
Control Data Corporation (CDC) STAR-100 [STring ARray 100 MFLOPS]
memory-to-memory architecture
therefore long startup times (several hundred cycles)
very slow scalar unit (~2 MFLOPS)
overall disappointing performance
contracted 1967, announced 1972, delivered 1974
total of 4 machines, 2 of them at Lawrence Livermore Laboratory
Thornton (CDC) and Fernbach (LLL) lose their jobs

CDC STAR-100
Photograph courtesy of Charles Babbage Institute, University of Minnesota, Minneapolis

Texas Instruments ASC
Advanced Scientific Computer, early 1970s
architecturally similar to CDC STAR-100
7 units sold
TI dropped out of mainframe computer manufacturing after this machine

Vector Performance I
MFLOP rate (MFLOPS) as a function of vector length n
scalar: ~constant (only some loop overhead, then n * loop time)
vector: (n = length of vector)
# cycles = startup + n / nflop_per_cycle
rate/clock = #ops / #cycles ~ n / (startup + n)   (for nflop_per_cycle = 1)
half rate at vector length n ~ startup
full rate needs n >> startup => “Long Vector Machine”
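
The cycle-count formula can be evaluated directly. A minimal sketch, assuming the startup is measured in cycles and the unit delivers a fixed number of results per cycle once running; the function name and the parameter `results_per_cycle` (standing in for nflop_per_cycle) are hypothetical:

```python
def vector_ops_per_clock(n, startup, results_per_cycle=1):
    """Average results per clock for one vector operation of length n."""
    cycles = startup + n / results_per_cycle
    return n / cycles


# Half of the peak rate is reached when n / results_per_cycle equals startup.
for n in (10, 100, 500):
    print(n, round(vector_ops_per_clock(n, startup=100), 3))
```

With a startup of 100 cycles, a vector of length 10 reaches under a tenth of the peak rate, which is why a long startup forces long vectors.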

Performance vs. Startup, Length
[Figure: ops/clock (0 to 4) versus vector length (0 to 500); curves s10_r1, s100_r1, s10_r4, s100_r4 and scalar_0.2]

Vector Performance II
Vector/Scalar Subsections
ALL codes have some scalar (non-vectorizable) sections
total time = (scalar fraction)/(scalar rate) + (vector fraction)/(vector rate)
example: 10% / 1 MFLOPS + 90% / 100 MFLOPS => effective rate = 100 / (0.1 * 100 + 0.9 * 1) = 9.2 MFLOPS !!!
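
The example can be recomputed from the total-time formula. A minimal sketch, assuming the fractions refer to shares of the floating-point operations and the rates are in MFLOPS; `effective_rate` is a hypothetical name:

```python
def effective_rate(scalar_fraction, scalar_rate, vector_rate):
    """Overall rate when a fraction of the operations runs at the scalar rate."""
    vector_fraction = 1.0 - scalar_fraction
    time_per_op = scalar_fraction / scalar_rate + vector_fraction / vector_rate
    return 1.0 / time_per_op


rate = effective_rate(0.10, scalar_rate=1.0, vector_rate=100.0)
print(f"{rate:.1f} MFLOPS")
```

Even with 90% of the work vectorized, the slow scalar part dominates the run time, so the overall rate stays below a tenth of the vector peak.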

Vector Version of Amdahl’s Law
[Figure: performance (0 to 1.2) versus scalar fraction (0 to 1.2); curves r5, r10, r20, r50, r100]

Vector Computer Design Guide
Must have SHORT vector startup => can work with short vectors
Must have FASTEST POSSIBLE scalar unit => can afford scalar sections
irregular data structures ==> need gather, scatter, merge operations (and a few more)
gather: x(i) = a(index(i)) * b(i)
scatter: y(index(i)) = c(i) + d(i)
merge: where (a(i) > b(i)) c(i) = d(i)
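
The three access patterns can be written out element by element. This is a sketch in plain Python with 0-based indices, not a model of any particular machine's instructions; the function names are hypothetical:

```python
def gather_multiply(a, b, index):
    """x(i) = a(index(i)) * b(i): read a through an index vector."""
    return [a[index[i]] * b[i] for i in range(len(index))]


def scatter_add(y, c, d, index):
    """y(index(i)) = c(i) + d(i): write through an index vector.

    Returns a new list; if an index repeats, the later write wins.
    """
    out = list(y)
    for i in range(len(index)):
        out[index[i]] = c[i] + d[i]
    return out


def masked_merge(a, b, c, d):
    """where (a(i) > b(i)) c(i) = d(i): replace c only where the mask is true."""
    return [d[i] if a[i] > b[i] else c[i] for i in range(len(c))]


index = [3, 0, 2]
print(gather_multiply([10, 20, 30, 40], [1, 2, 3], index))
print(scatter_add([0, 0, 0, 0], [1, 2, 3], [10, 20, 30], index))
print(masked_merge([1, 5, 3], [2, 4, 3], [0, 0, 0], [7, 8, 9]))
```

A vector machine that lacks such operations has to fall back to scalar code for indexed or conditional loops, which brings back the scalar-fraction penalty.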

Cray Research, Inc.
Founded by Seymour Cray (father of CDC 6600/7600) in 1972 (when the STAR-100 was already known)
first Cray-1 delivered in 1976 to Los Alamos Scientific Laboratory (LASL)
8 vector registers of 64 elements each
Vector load/store instructions
fastest scalar computer of its time
160 MFLOPS peak rate (2 ops/cycle @ 80 MHz), only a few cycles startup

Seymour Cray
Cray-1
1976
Single Processor
80 MFLOPS
1 Mword = 8 Mbyte
Photograph courtesy of Charles Babbage Institute, University of Minnesota, Minneapolis

Block Diagram Cray YMP-EL, only one of four identical CPUs shown, simplified
[Block diagram; recoverable labels:]
Shared main memory: 128 MW = 1 Gbyte, 4 ports/proc, 4 x 4 x 33 MHz = 4.2 Gbyte/sec
Vi: 8 vector registers, 64 elements, 64 bit -> 4 vector execution units, 33 MHz
Si: 8 scalar registers, 64 bit -> scalar functional units
Tjk: 64 word buffer memory, 64 bit
Ai: 8 address registers, 32 bit -> address functional units
Bjk: 64 word buffer memory
8 instruction buffers, 32 words each -> instruction issue
IOS: Y1 channel, 40 Mbyte/sec
Unplaced labels: 48, shared, 21 ops
Large working set:
- 8 vector registers, 64 words
- 8 scalar registers
- 8 address registers
- large instruction buffer
Performance Features:
- vector processing: one operation affects 64 vector elements, streamed through functional unit
- small vector startup time
- chaining between vector registers
- large, fast

Cray Research, Inc. cont'd
1982 Cray-XMP (Steve Chen improvements, up to 4 processors, shared memory)
1985 Cray-2, 256 Mword memory, 4 processors, immersion cooled
1988 Cray-YMP (last Chen machine)
1991 Cray C90 (up to 16 vector CPUs, shared memory)
1993 Cray T3D (massively parallel Alpha)
     one and only Cray-3 delivered to NCAR (Cray Computer Corporation)
1994 Cray J90 (up to 32 vector CPUs, shared memory), air cooled
1995 Cray T3E (most successful MPP machine), Cray T90 (parallel vector, immersion cooled)
     Cray-4 abandoned (Cray Computer Corporation, Chapter 11)
1996 acquired by Silicon Graphics
1998 Cray SV1 (parallel vector, air cooled)
2000 acquired by Tera Computer Company => Cray, Inc.
2002 Cray X1, parallel vector, immersion spray cooled
2004 Cray X1E, enhanced version of X1
     Cray XT3, AMD based 3D Torus massively parallel machine

CDC Cyber 200 Family
- 1980, enhanced version of STAR-100
- reduced startup time, ~ 50 cycles
- fast scalar unit
- rich instruction repertoire
- still memory-to-memory, 400 MFLOPS peak
- Cyber 203, Cyber 205, ETA-10 [10 GFLOPS]
- vector FORTRAN language extensions provided
- terminated in 1989 since unprofitable
- around 40 Cyber 200, 34 ETA-10 sold

Minnesota Supercomputer Center, Minneapolis, 1986
Cray-2, CDC Cyber 205

NEC Japan
- 1983 SX-1 single processor vector 650 MFLOPS
- 1985 SX-2 single processor vector 1300 MFLOPS
- 1990 SX-3 four processors at ~ 5 GFLOPS each, 4 Gbyte = 0.5 Gword memory
- 1995 SX-4 32 processors at ~ 2 GFLOPS each (CMOS; all previous ECL)
- 1998 SX-5 up to 512 processors, 8 GFLOPS each
- 2004 SX-7 up to 2048 processors, 8.8 GFLOPS each

IBM - Sony - Toshiba CELL processor
- 8 vector CPUs (SPEs) + one general-purpose PowerPC core (PPE) on a single chip
- 256 kbyte = 32 kword local storage per vector CPU (very small !!)
- 12 word/cycle internal interconnect = 384 Gbyte/sec
- 24 Gbyte/sec = 3 Gword/sec main memory
- 76 Gbyte/sec = 9.5 Gword/sec communication
- @ 4 GHz clock 256 GFLOPS (32 bit) peak
- 26 GFLOPS (64 bit) peak
- max 4.5 Gbyte addressable, 512 Mbyte implemented
- system interconnect ?

- 90 nm SOI, 8 layers Cu interconnect
- 234 M transistors
- 221 mm² die size
- significant potential in future revisions
- but: 80W @ 1.1V 4.0 GHz is too much
- 180W @ 1.4V 5.6 GHz is much too much
- work needed in power reduction
- larger internal memory
- 64 bit arithmetic improved

From: S. Williams et al., Lawrence Berkeley Laboratory
- single Cell chip performance
- compared with Cray X1E single vector processor and several commodity microprocessors (AMD, Intel)
- even the current version shows impressive speedup, at the cost of significant programming complexity (explicit storage moves as opposed to caching)
- slightly enhanced Cell (Cell+) simulation provides very significant additional speedup (more efficient DP)
- current version insufficient for major impact
- future versions may change that, great potential
