CA 13 VectorProcessors
Computer Architecture
- Vector Processors
Edgar Gabriel
Spring 2011
Vector Processors
• Chapter F of the 4th edition (Chapter G of the 3rd edition)
– Available on the CD attached to the book
– Anybody having problems finding it should contact me
• Vector processors were big in the ‘70s and ‘80s
• Still used today
– Vector machines: Earth Simulator, NEC SX9, Cray X1
– Graphics cards
– MMX, SSE, SSE2, SSE3 are to some extent ‘vector units’
by
a = b + c; ADDV.D V10, V8, V6
Overhead
• Start-up overhead of a pipeline: how many cycles does
it take to fill the pipeline before the first result is
available?
Unit          Start-up (cycles)
Load/store    12
Multiply      7
Add           6
Instruction   Starting time   First result   Last result
LV            0               12             11+n
SV            30+3n           30+3n+12       41+4n
COSC 6385 – Computer Architecture
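The start-up figures above can be turned into a small timing helper. The following is a minimal sketch, assuming that a vector unit with start-up time s that begins at cycle t delivers its first result at cycle t + s and one further result per cycle after that, which reproduces the LV and SV timings listed above. The struct and function names are illustrative and not taken from the course material.

```c
#include <assert.h>

/* Timing of one vector instruction (hypothetical helper type). */
typedef struct {
    long start;        /* cycle in which the instruction starts       */
    long first_result; /* cycle in which the first element completes  */
    long last_result;  /* cycle in which the n-th element completes   */
} vec_timing;

/* Assumes one result per cycle once the pipeline has been filled. */
vec_timing vec_timing_of(long start, long startup, long n)
{
    assert(n > 0 && startup >= 0);
    vec_timing t;
    t.start = start;
    t.first_result = start + startup;
    t.last_result = start + startup + n - 1;
    return t;
}
```

For n = 64, `vec_timing_of(0, 12, 64)` gives a first result at cycle 12 and a last result at cycle 75 (= 11 + n), and a store starting at cycle 30 + 3n = 222 finishes at cycle 297 (= 41 + 4n).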
Pipelining – Metrics (I)
Tc Clocktime, time to finish one segment/sub-operation
m number of stages of the pipeline
n length of the vector
S Startup time in clocks, time after which the first result
is available, S = m
$N_{1/2}$ length of the loop to achieve half of the maximum speed
Assuming a simple loop such as:
for (i=0; i<n; i++) {
a[i] = b[i] + c[i];
}
and $N_{1/2} = m - 1$:
the length of the loop required to achieve half of the theoretical peak
performance of a pipeline is approximately equal to the number of segments
(stages) of the pipeline
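As a quick check of this result, the sketch below evaluates the fraction of peak performance reached by a loop of length n on an m-stage pipeline. It assumes the timing model implied by S = m: the first result is available after m clocks and one more result follows per clock, so n results take (m + n - 1) clocks. The function name is illustrative.

```c
#include <assert.h>

/* Fraction of peak performance for a loop of length n on an m-stage
 * pipeline. Assumed model: n results take (m + n - 1) clocks, and the
 * peak rate is one result per clock. */
double pipeline_efficiency(long m, long n)
{
    assert(m >= 1 && n >= 1);
    return (double)n / (double)(m + n - 1);
}
```

With m = 7 stages, a loop of length n = m - 1 = 6 yields exactly 0.5, while n = 64 already reaches about 0.91 of the peak.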
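$$\frac{\mathrm{op}}{T_c \left( \frac{m-1}{N_\alpha} + 1 \right)} = \alpha \, \frac{\mathrm{op}}{T_c}$$
and leads to
$$N_\alpha = \frac{m-1}{\frac{1}{\alpha} - 1}$$
E.g. for $\alpha = \frac{3}{4}$ you get $N_{3/4} = 3(m-1) \approx 3m$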
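The closed form can also be evaluated numerically. The sketch below uses the algebraically equivalent form N_α = α(m - 1)/(1 - α), which is valid for 0 < α < 1. The function name is a hypothetical choice for illustration.

```c
#include <assert.h>

/* Loop length needed to reach a fraction alpha (0 < alpha < 1) of the
 * peak performance of an m-stage pipeline:
 *   N_alpha = (m - 1) / (1/alpha - 1) = alpha * (m - 1) / (1 - alpha).
 * The result may be fractional; round up for a usable loop length. */
double loop_length_for_fraction(long m, double alpha)
{
    assert(m >= 1 && alpha > 0.0 && alpha < 1.0);
    return alpha * (double)(m - 1) / (1.0 - alpha);
}
```

For m = 7, α = 1/2 gives 6 = m - 1 and α = 3/4 gives 18 = 3(m - 1).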
Chaining
• Example:
MULV.D V1, V2, V3
ADDV.D V4, V1, V5
• Second instruction has a data dependence on the first
instruction: two convoys required
• Once the element V1(i) has been calculated, the
second instruction could calculate V4(i)
– no need to wait until all elements of V1 are available
– could work similarly to forwarding in pipelining
– Technique is called chaining
Chaining (III)
• Example: chained and unchained version of the ADDV.D
and MULV.D shown previously for 64 elements
– Start-up latency for the FP MUL vector unit: 7 cycles
– Start-up latency for the FP ADD vector unit: 6 cycles
– NOTE: the results differ from those in the book (Fig. G.10)
• Unchained version: 7 + 63 + 6 + 63 = 139 cycles
– MULV (7 | 63) followed by ADDV (6 | 63)
• Chained version: 7 + 6 + 63 = 76 cycles
– MULV (7 | 63) overlapped with ADDV (6 | 63)
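The two totals follow directly from the start-up latencies. The sketch below reproduces them, assuming, as the arithmetic above does, that a unit with start-up s needs s cycles for its first result and n - 1 further cycles for the remaining elements. The function names are illustrative.

```c
#include <assert.h>

/* Cycles for two dependent vector instructions of length n without
 * chaining: the second starts only after the first has finished. */
long unchained_cycles(long startup_a, long startup_b, long n)
{
    assert(n >= 1);
    return (startup_a + (n - 1)) + (startup_b + (n - 1));
}

/* Cycles with chaining: the second unit starts as soon as the first
 * element of the first result is available, so both element streams
 * overlap and only one run of n - 1 cycles remains. */
long chained_cycles(long startup_a, long startup_b, long n)
{
    assert(n >= 1);
    return startup_a + startup_b + (n - 1);
}
```

With the MULV.D start-up of 7, the ADDV.D start-up of 6 and n = 64, `unchained_cycles(7, 6, 64)` returns 139 and `chained_cycles(7, 6, 64)` returns 76.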
[Figure: element pairs A[1]…A[9] and B[1]…B[9] streaming through a single pipeline to produce C(i), next to the same element pairs distributed across four parallel pipelines]