Download as pdf or txt
Download as pdf or txt
You are on page 1of 15

CKV

Advanced VLSI Architecture


MEL G624

Lecture 26: Data Level Parallelism


CKV
Vector Length Registers

Vector length not known at compile time?

for (i=0; i<n; i=i+1)


C[i] = A[i] + B[i];

n value may not even be known at run time

Use Vector Length Register (VLR)

For n < Maximum Vector Length (MVL), set VLR to n

For n > MVL ? Strip Mining


CKV
Vector Length Registers

“Strip‐mining”

Break loops into pieces that fit into vector registers

Modify

for (i=0; i<n; i=i+1)


C[i] = A[i] + B[i];
CKV
Vector Length Registers
for (i=0; i<n; i=i+1)
C[i] = A[i] + B[i];
low = 0;
VL = (n % MVL); /*find odd-size piece using modulo op % */

for (j = 0; j <= (n/MVL); j=j+1) { /*outer loop*/

for (i = low; i < (low+VL); i=i+1) /*runs for length VL*/


C[i] = A[i] + B[i] ; /*main operation*/

low = low + VL; /*start of next vector*/


VL = MVL; /*reset the length to maximum vector length*/
}
CKV
Vector Length Registers

Write assembly code MTC1 VLR, R1

low = 0;
VL = (n % MVL); /*find odd-size piece using modulo op % */

for (j = 0; j <= (n/MVL); j=j+1) { /*outer loop*/

for (i = low; i < (low+VL); i=i+1) /*runs for length VL*/


C[i] = A[i] + B[i] ; /*main operation*/

low = low + VL; /*start of next vector*/


VL = MVL; /*reset the length to maximum vector length*/
}
CKV
Vector Length Registers
ANDI R1, N, 63 # N mod 64
MTC1 VLR, R1 # Do remainder
loop: LV V1, RA
LV V2, RB
ADDV.D V3, V1, V2
SV V3, RC
DSLL R2, R1, 3 # Multiply by 8
DADDU RA, RA, R2 # Bump pointer
DADDU RB, RB, R2
DADDU RC, RC, R2
DSUBU N, N, R1 # Subtract elements
LI R1, 64
MTC1 VLR, R1 # Reset full length
BGTZ N, loop # Any more to do?
CKV
Maximum Vector Length (MVL)

Determine the maximum number of elements in a vector
for a given architecture

All blocks but the first are of length MVL
Utilize the full power of the vector processor

Later generations may grow the MVL
No need to change the ISA
CKV
Vector‐Mask Register
for (i = 0; i < 64; i=i+1)
if (X[i] != 0)
X[i] = X[i] – Y[i];
CKV
Vector Mask Registers

A Boolean vector to control the execution of a vector
instruction

VMR is part of the architectural state

For vector processor, it relies on compilers to
manipulate VMR explicitly

For GPU, it gets the same effect using hardware

Both GPU and vector processor spend time on masking
CKV
Vector Mask Registers Example
for (i = 0; i < 64; i=i+1) SNEVS.D V1, F0
if (X[i] != 0)
X[i] = X[i] – Y[i];

LV V1, Rx ;load vector X into V1


LV V2, Ry ;load vector Y
L.D F0, #0 ;load FP zero into F0
SNEVS.D V1, F0 ;sets VM(i) to 1 if V1(i)!=F0
SUBVV.D V1, V1, V2 ;subtract under vector mask
SV Rx,V1 ;store the result in X
CKV
Vector‐Mask Control

Simple Implementation Density-time Implementation


Execute all N operations, turn off scan mask vector and only execute
result write-back according to mask elements with non-zero masks
CKV
Memory Banks

The start‐up time for a load/store vector unit is the time 
to get the first word from memory into a register. 

Penalties for start‐ups on load/
store units are higher than those for arithmetic unit

Memory system must be designed to support high bandwi
dth for vector loads and stores
CKV
Memory Banks
Vector processor use memory banks which allow multiple
independent access rather than memory interleaving
mainly to:
Support multiple loads or stores per clock. Be able to control 
the addresses to the banks independently

Support (multiple) non‐sequential loads or stores

Support multiple processors sharing the same memory syste
m, processor will be generating its own independent stream
of addresses
CKV
Memory Banks
Example: Cray T90

32 processors, each generating 4 loads and 2 stores per
cycle

Processor cycle time is 2.167ns

SRAM cycle time is 15ns

How many memory banks needed?
CKV

Thank You for Attending

You might also like