
VECTOR PROCESSING

Architecture Classification

 SISD
   Single Instruction Single Data
 SIMD
   Single Instruction Multiple Data
 MIMD
   Multiple Instruction Multiple Data
 MISD
   Multiple Instruction Single Data

Alternative Forms of Machine
Parallelism

 Instruction Level Parallelism (ILP)
 Thread Level Parallelism (TLP)
 Vector Data Parallelism (DP)

Drawbacks of ILP and TLP

 Coherency
 Synchronization
 Large Overhead
   instruction fetch and decode: at some point, it is hard to
    fetch and decode more instructions per clock cycle
   cache hit rate: some long-running (scientific) programs
    have very large data sets accessed with poor locality

Alternative: Vector Processors

Vector Processing: Introduction
 A vector is an ordered set of elements.
 A vector operand contains an ordered set of n elements, where n
is called the length of the vector.
 Each element in a vector is a scalar quantity, which may be a
floating-point number, an integer, a logical value, or a character
(byte).
 Typical vectors are 64 or 128 elements long.
 Small vectors are about 4 elements long.

 In vector processing, two successive pairs of elements are
processed each clock period.
 Dual vector pipes and dual sets of vector functional units

 As each pair of operations is completed, the results are delivered
to the appropriate elements of the result register.
 The operation continues until the number of elements processed
is equal to the count specified by the vector length register.

For example: C(1:50) = A(1:50) + B(1:50)

Consider the simple task of adding two groups of 10 numbers
together. A scalar processor executes this loop 10 times:
 read the next instruction and decode it
 fetch this number
 fetch that number
 add them
 put the result here
 end loop

But to a vector processor, this task looks considerably different:

 read the instruction and decode it
 fetch these 10 numbers
 fetch those 10 numbers
 add them
 put the results here
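
The contrast between the two step lists can be sketched in Python. This is only an illustrative model: the function names and the `decodes` counter are hypothetical, and a list comprehension stands in for what would be a single hardware vector instruction.

```python
def scalar_add(a, b):
    """Add element by element, counting one instruction decode per iteration."""
    result = []
    decodes = 0
    for x, y in zip(a, b):
        decodes += 1          # read the next instruction and decode it
        result.append(x + y)  # fetch both numbers, add them, store the result
    return result, decodes


def vector_add(a, b):
    """Model one vector instruction: a single decode covers every element."""
    decodes = 1                             # read the instruction and decode it once
    result = [x + y for x, y in zip(a, b)]  # fetch, add, and store all elements
    return result, decodes


a = list(range(10))
b = list(range(10, 20))
scalar_result, scalar_decodes = scalar_add(a, b)
vector_result, vector_decodes = vector_add(a, b)
print(scalar_decodes, vector_decodes)  # 10 1
```

Both functions produce the same sums; only the number of modelled instruction decodes differs.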

Vector instructions are classified into four basic types:

f1: V → V        f2: V → S
f3: V × V → V    f4: V × S → V

where V indicates a vector operand and S indicates a scalar operand.

The operations f1 and f2 are unary operations: f1 maps a vector to a vector
(vector square root, vector sine, vector complement, and so on), and f2 reduces
a vector to a scalar (vector summation, for example).

Operations f3 and f4 are binary operations such as vector add, vector multiply,
vector-scalar add, and so on.
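
As a rough illustration, each of the four types can be written as a small Python function over lists. The function names are hypothetical and chosen only for this sketch; the operations are the ones named in the text (square root, summation, add, vector-scalar add).

```python
import math


def f1_sqrt(v):
    """Type f1: one vector in, one vector out (vector square root)."""
    return [math.sqrt(x) for x in v]


def f2_sum(v):
    """Type f2: one vector in, one scalar out (vector summation)."""
    return sum(v)


def f3_add(v1, v2):
    """Type f3: two vectors in, one vector out (vector add)."""
    return [x + y for x, y in zip(v1, v2)]


def f4_add_scalar(v, s):
    """Type f4: a vector and a scalar in, one vector out (vector-scalar add)."""
    return [x + s for x in v]


print(f1_sqrt([4.0, 9.0]))         # [2.0, 3.0]
print(f2_sum([1, 2, 3]))           # 6
print(f3_add([1, 2], [3, 4]))      # [4, 6]
print(f4_add_scalar([1, 2], 10))   # [11, 12]
```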

Vector Instruction Fields

A vector instruction includes the initial addresses of the two source operands and of
the destination operand, the length of the vectors, and the operation to be performed.
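
A minimal sketch of these fields, assuming a memory-to-memory instruction and a flat Python list standing in for main memory. The class name, field names, and the two supported opcodes are hypothetical and do not correspond to any real machine's encoding.

```python
from dataclasses import dataclass


@dataclass
class VectorInstruction:
    opcode: str      # operation to be performed, e.g. "ADD" or "MUL"
    src1_addr: int   # initial address of the first source operand
    src2_addr: int   # initial address of the second source operand
    dest_addr: int   # initial address of the destination operand
    length: int      # number of elements in each vector


def execute(instr, memory):
    """Apply the operation element by element over `length` elements."""
    ops = {"ADD": lambda x, y: x + y, "MUL": lambda x, y: x * y}
    op = ops[instr.opcode]
    for i in range(instr.length):
        memory[instr.dest_addr + i] = op(
            memory[instr.src1_addr + i], memory[instr.src2_addr + i]
        )


# Two 4-element source vectors at addresses 0 and 4; destination at address 8.
memory = [1, 2, 3, 4, 10, 20, 30, 40, 0, 0, 0, 0]
execute(VectorInstruction("ADD", 0, 4, 8, 4), memory)
print(memory[8:12])  # [11, 22, 33, 44]
```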

Fig. Simplified view of a vector processor with one functional unit for arithmetic
operations

What is a Vector Processor?
A vector processor is a processor that can operate on an
entire vector in one instruction.
 The operands of its instructions are complete vectors
instead of single elements.
 It provides high-level operations that work on vectors.
 Vector processors reduce the fetch and decode
bandwidth required, because fewer instructions are fetched.
 They also exploit data parallelism in large scientific and
multimedia applications.

Based on how the operands are fetched, vector processors
can be divided into two categories:

 Memory-memory architecture: operands are streamed directly
to the functional units from memory, and results are written
back to memory as the vector operation proceeds.
Example: CDC STAR-100

 Vector-register architecture: operands are read into vector
registers, from which they are fed to the functional units,
and results of operations are written to vector registers.
Examples: Cray, Convex, Fujitsu, Hitachi, NEC
Components of Vector Processor


 Vector Registers
   Fixed-length bank holding a single vector
   Has at least 2 read ports and 1 write port
   Typically 8-32 vector registers, each holding 64-128 64-bit elements
 Vector Functional Units
   Fully pipelined; can start a new operation every clock
   Typically 4-8 FUs: FP add, FP multiply, FP reciprocal, integer add, logical, shift
 Scalar Registers
   Single element for FP scalar or address
 Load-Store Units
Vector-Register Architecture

Memory operations
 Load/store operations move groups of data
between registers and memory
 Three types of addressing
   Unit stride access
      Fastest
   Non-unit (constant) stride access
   Indexed (gather-scatter)
      Vector equivalent of register indirect
      Increases number of programs that vectorize
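
The three addressing types can be sketched as follows, with a flat Python list standing in for memory. The function names are hypothetical; real vector machines implement these as load/store instruction variants rather than as functions.

```python
def load_unit_stride(memory, base, n):
    """Unit stride: n consecutive elements starting at base."""
    return [memory[base + i] for i in range(n)]


def load_strided(memory, base, stride, n):
    """Constant non-unit stride: every stride-th element starting at base."""
    return [memory[base + i * stride] for i in range(n)]


def gather(memory, base, index_vector):
    """Indexed load: element offsets come from an index vector."""
    return [memory[base + k] for k in index_vector]


def scatter(memory, base, index_vector, values):
    """Indexed store: write each value to the offset given by the index vector."""
    for k, v in zip(index_vector, values):
        memory[base + k] = v


memory = list(range(100, 120))
print(load_unit_stride(memory, 2, 4))   # [102, 103, 104, 105]
print(load_strided(memory, 0, 5, 4))    # [100, 105, 110, 115]
print(gather(memory, 0, [7, 1, 12]))    # [107, 101, 112]
scatter(memory, 0, [7, 1, 12], [0, 0, 0])
print(memory[1], memory[7], memory[12])  # 0 0 0
```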

Vector Stride

 The positions in memory of the elements we want may not be
sequential.
 Consider the following code:
      DO 10 I = 1, 100
      DO 10 J = 1, 100
      A(I,J) = 0.0
      DO 10 K = 1, 100
      A(I,J) = A(I,J) + B(I,K)*C(K,J)
10    CONTINUE
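
To see where the non-unit stride comes from, recall that Fortran stores arrays in column-major order, so elements that differ only in the first index are adjacent in memory. The sketch below computes the element stride of each operand in the innermost (K) loop; the helper `column_major_offset` and the constant `ROWS` are hypothetical names introduced only for this illustration, and the stride is measured in elements, not bytes.

```python
def column_major_offset(i, j, rows):
    """Element offset of the 1-based entry (i, j) in a column-major array."""
    return (i - 1) + (j - 1) * rows


ROWS = 100  # each array in the loop nest is 100 x 100

# B(I,K): K is the second index, so consecutive K values are a full column apart.
b_stride = column_major_offset(1, 2, ROWS) - column_major_offset(1, 1, ROWS)

# C(K,J): K is the first index, so consecutive K values are adjacent.
c_stride = column_major_offset(2, 1, ROWS) - column_major_offset(1, 1, ROWS)

print(b_stride, c_stride)  # 100 1
```

Under these assumptions, a vector load of a row of B needs a stride of 100 elements, while a column of C can be loaded with unit stride.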

Vector Processor Properties

 Computation of each result must be independent
of previous results
 A single vector instruction specifies a great deal of
work
   Equivalent to executing an entire loop
 Vector instructions must access memory in a
known access pattern
 Many control hazards can be avoided since the
entire loop is replaced by a vector instruction
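
The independence requirement can be illustrated with a small Python sketch. The function names are hypothetical. The first function computes independent results and is safe to treat as a whole-vector operation; the second is a recurrence in which each result needs the previous one, and applying it naively as a whole-vector operation on the old values gives a different answer.

```python
def independent_add(b, c):
    """Each result depends only on the inputs, so the order does not matter."""
    return [x + y for x, y in zip(b, c)]


def recurrence(a, b):
    """a[i] = a[i-1] + b[i]: each result depends on the previous result."""
    out = list(a)
    for i in range(1, len(out)):
        out[i] = out[i - 1] + b[i]
    return out


def naive_vector_recurrence(a, b):
    """Incorrectly treats the recurrence as independent: reads only old a values."""
    return [a[0]] + [a[i - 1] + b[i] for i in range(1, len(a))]


a = [1, 0, 0, 0]
b = [0, 1, 1, 1]
print(independent_add(a, b))          # [1, 1, 1, 1]
print(recurrence(a, b))               # [1, 2, 3, 4]
print(naive_vector_recurrence(a, b))  # [1, 2, 1, 1]
```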

Advantages of Vector
Processors

 Increase in code density
 Decrease in total number of instructions
 Data is organized in patterns that are
easier for the hardware to process
 Simple loops are replaced with vector
instructions, hence a decrease in overhead
 Scalable

Disadvantages of Vector
Processors
 Expansion of the Instruction Set
Architecture (ISA) is needed
 Additional vector functional units and
registers
 Modification of the memory system

Example Vector Machines
Machine     Year  Clock    Regs   Elements  FUs  LSUs
Cray 1      1976  80 MHz   8      64        6    1
Cray XMP    1983  120 MHz  8      64        8    2 L, 1 S
Cray YMP    1988  166 MHz  8      64        8    2 L, 1 S
Cray C-90   1991  240 MHz  8      128       8    4
Cray T-90   1996  455 MHz  8      128       8    4
Conv. C-1   1984  10 MHz   8      128       4    1
Conv. C-4   1994  133 MHz  16     128       3    1
Fuj. VP200  1982  133 MHz  8-256  32-1024   3    2
NEC SX/2    1984  160 MHz  8+8K   256+var   16   8
NEC SX/3    1995  400 MHz  8+8K   256+var   16   8

Vectorization Example 1
DO 100 I = 1, N
A(I) = B(I) + C(I)
100 CONTINUE
Scalar process:
1. B(1) will be fetched from memory
2. C(1) will be fetched from memory
3. A scalar add instruction will operate on B(1) and C(1)
4. A(1) will be stored back to memory
5. Steps (1) to (4) are repeated N times.

Vector process:
1. A vector of values in B(I) will be fetched from memory.
2. A vector of values in C(I) will be fetched from memory.
3. A vector add instruction will operate on pairs of B(I) and C(I) values.
4. After a short start-up time, a stream of A(I) values will be stored back to
memory, one value every clock cycle.
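
Step 4 suggests a simple timing model: a fixed start-up latency followed by one result per clock cycle. The sketch below is only that model. The function name is hypothetical and the start-up value of 12 cycles is an arbitrary assumption, not a figure for any real machine.

```python
def vector_cycles(n, startup):
    """Cycles to produce n results: start-up latency, then one result per clock."""
    return startup + n


STARTUP = 12  # hypothetical start-up latency in clock cycles

for n in (1, 8, 64):
    cycles = vector_cycles(n, STARTUP)
    # Cycles per result approaches 1 as the vector gets longer.
    print(n, cycles, cycles / n)
```

Under this model, the start-up cost is amortized over the vector, which is why longer vectors make better use of the pipeline.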

Example (2): Y = aX + Y

Scalar code:
      LD    F0, A
      ADDI  R4, Rx, #512   ; Last addr
Loop: LD    F2, 0(Rx)
      MULTD F2, F0, F2     ; A * X[I]
      LD    F4, 0(Ry)
      ADDD  F4, F2, F4     ; + Y[I]
      SD    0(Ry), F4
      ADDI  Rx, Rx, #8     ; Inc index
      ADDI  Ry, Ry, #8
      SUB   R20, R4, Rx
      BNEZ  R20, Loop

The loop runs 64 times: 2 + 9*64 = 578 operations.

Vector code:
      LD     F0, A
      LV     V1, Rx        ; Load vecX
      MULTSV V2, F0, V1    ; Vec Mult
      LV     V3, Ry        ; Load vecY
      ADDV   V4, V2, V3    ; Vec Add
      SV     Ry, V4        ; Store result

The vector length is 64 elements, so no loop is needed: 1 + 5*64 = 321 operations.

Scalar/Vector = 578/321 ≈ 1.8x

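
The operation counts above can be checked with a few lines of Python, alongside a plain reference version of the Y = aX + Y computation itself. The names `VECTOR_LENGTH` and `daxpy` are hypothetical labels for this sketch; the per-iteration counts (9 scalar instructions in the loop body, 5 vector instructions over 64 elements) are taken from the listings.

```python
VECTOR_LENGTH = 64  # elements per vector, matching the 512-byte range / 8 bytes each

# Scalar: 2 set-up instructions plus 9 instructions per loop iteration.
scalar_ops = 2 + 9 * VECTOR_LENGTH
# Vector: 1 scalar load plus 5 vector instructions, each covering 64 elements.
vector_ops = 1 + 5 * VECTOR_LENGTH

print(scalar_ops, vector_ops, round(scalar_ops / vector_ops, 1))  # 578 321 1.8


def daxpy(a, x, y):
    """Reference computation: returns a*x[i] + y[i] for every element."""
    return [a * xi + yi for xi, yi in zip(x, y)]


print(daxpy(2.0, [1.0, 2.0, 3.0], [10.0, 20.0, 30.0]))  # [12.0, 24.0, 36.0]
```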
Applications

 In radar and signal processing for detection of space /
underwater targets.
 In remote sensing for earth resources exploration.
 In computational wind tunnel experiments.
 In 3D stop-action computer assisted tomography.
 Weather forecasting.
 Medical diagnosis.
