Download as pdf or txt
Download as pdf or txt
You are on page 1of 9

ENEE646 Project Report:

SIMD Multiprocessor System Simulator


Ren Mao
School of Electrical and Computer Engineering
University of Maryland
Email: neroam@umd.edu
Abstract
This article describes how to develop a software test-bed to simulate the excution of instructions on a SIMD
multiprocessor system. It gives a design including an instruction format for SIMD multiprocessor and a master
processor that can fetch,decode,and execute the instructions. It also includes a clock to describe and mesuare the
instruction cycles. Then, it gives two example programs on this SIMD system whichcould transpose a 4x4 integer
matrix or multiply two 4x4 floating-point matrices. According to these example programs, it gives a perfomance
model to measure the speed-up,efficiency and utilization of this system. At last, it discusses the main weakness and
strengths of SIMD system and analyzes the possible improvements.
I. P ROJECT OVERVIEW
In this project, I develop a software test-bed to simulate the execution of instructions on a SIMD multiprocessor
system. The code is written in C++ and comiled with g++ in Linux Fedora 17 environment. The usage and its
options are below:
Usage: ./simd

Main Menu: Type


0 to enter a program
1 to display registers
2 to display memory
3 to run examples
4 to exit

The main program has four options as below:


• enter a program: This option allows user to enter a machine code program on SIMD from keyboard or file.
After that,it could execute the program or step into the program and trace the registers and memory units.
Meanwhile, the clock is counted from the execution of the program.
• display registers: This option allows user to display all the registers in the master processor and all the
processing elements.
• display memory: This option allows user to display all the memory units of all the processing elements.
• run examples: This option allows user to generate and execute example programs : to transpose a 4x4 integer
matrix or multiply two 4x4 floating-point matrices on SIMD. Users are allowed to enter the entries of matrices.
The example program code will be generated as ”*.simd” files, and executed on the SIMD system. It will list
the outputs after execution as well as the clock cycles.
According to the specification of this project, this article is organized as below: i) Section II gives the description
of the system architecture, ii) Section III proposes an instruction format for this SIMD system that can fetch,
decode, and execute. iii) Section IV specifies the time assumption for the execution times of instructions to reflect
the amount of time they would consume in bit logic. iv) Section V describes the example program to transpose
a 4x4 matrix with integer entries on this SIMD system and gives the sample outputs. v) section VI describes
the example program to multiply two 4x4 matrices with floating-point entries and lists the sample outpus. vi)
Section VII develops a performance model that measures the speed-up, efficiency and utilization and determines
the performance of the system for the example programs. vii) Section VIII discusses the main weaknesses and
strengths of the SIMD system as well as the possible improvements.
II. A RCHITECTURE D ESCRIPTION
To simulate a SIMD multiprocessor system, I specify it as SIMD(P, M, N ) where the P is a set of 4 processing
elements,M is a set of corressponding memory units, and N is a network with 4 inputs and as many outputs.
Processing elements communicate with one another by establishing paths through N , and each has its own memory
uit(up to 16 units for each). In addition, the system also has a master processor.
The SIMD system contains elements as below:
• Master Processor: The master processor has the instruction registers, address registers, flag registers and 4
accumulator registers. All the registers are 16 bits except PC and MAR which are 12 bits corresponding to
memory location 0 − 4095. It also has a memory with 4096 units, each of the unit is 16 bits, which could be
used as instruction memory and data memory for the master processor.
• Processing Elements: The SIMD here contains 4 processing elements, each of which has a cash of four
16-bit accumulator registers, address registers, flag registers, a 16-bit memory with 16 units, an integer and a
floating-point arithmetic unit that can operate on two operands. Different processing elements have their own
registers and memory units.

TABLE I: Registers Description


Function Registers
Accumulator Registers Regs(0)-Reg(3)
Network Input Register Regs(0)
Network Output Register Regs(1)
Address Index Register Regs(3)
Address Register MAR
Memory Data Register MDR
Address Access Mode Register AMode
Flag Register O,Z,I,F,S,jmp,add,cmp
Instruction Register IR
Instruction Index Register PC

To describe the processor in program, based on the VESP simulator [1], I firstly define a structure of proces-
sors(processing elements) which includes registers and memory units, then use that structure to define a SIMD
system structure as below. The registers are described in Table I.
typedef struct
{
unsigned short MAR,MDR;
short Regs[4],MEMORY[16],O,C,Z,S,add,cmp;
} processor;

typedef struct
{
unsigned short MAR,MDR,PC,IR,clock;
short Regs[4],MEMORY[4096],reset,O,C,Z,S,add,cmp,I,F,AMode,jmp;
processor pe[4];
} architecture; architecture simd;

III. I NSTRUCTION D ESIGN


The instruction repertoire of the SIMD system consists of three types of instructions:
• Computation type(C-type): This type of instructions is to do some calculation for each processing elements,
which is fetched and decoded by the master processor but executed by the selected processing elements.
• Interconnection type(I-type): This type of instructions is to change the interconnection network of processing
elements and write the Network input register from the corresponding output register according to the network
for selected processing elements. This is also fetched and decoded by the master processor but executed by
the processing elements.
• Master type(S-type): This type of instructions is to execute calculation or control function on master processor
such as SET AMODE, which is to set address mode for memory access of processing elements. This is the
only type of instructions executed by the master processor.
Since the memory unit in this SIMD is 16-bits, the instruction length should be aligned to 16-bits. Therefore, I
specify all the instruction format as follow:
A. C-type Instructions

TABLE II: Syntax of C-type Instruction


Op-code: 6-bits Regs.Ids: 2-bits Memory Address: 4-bits Processor mask: 4-bits

The first two bits of op-code represent the type of instructions(C-type = 00), Regs.Ids selects the dest operand
register from 0 to 3, the memory address code could represent the direct memory location or index offset according
to the address mode, and the processor mask specifies the subset of processing elements that take part in the
computations.
Each processing element can execute the C-type instructions either on an integer or a floating-point arithmetic
unit that can operate on two operands. The C-type instructions are specified as Table III.

TABLE III: C-type Instructions Description


Instructions Op-code Function Instructions Op-code Function
ADD 00 0000 Regs = Regs + M emory CMP 00 1000 Regs =∼ Regs
SUB 00 0001 Regs = Regs − M emory SHL 00 1001 Regs = Regs ≪ 1
MPY 00 0010 Regs = Regs × M emory SHR 00 1010 Regs = Regs ≫ 1
DIV 00 0011 Regs = Regs ÷ M emory ROL 00 1011 Regs = Regs ≪ 1(Rotation)
COM 00 0100 Regs − M emory ROR 00 1100 Regs = Regs ≫ 1(Rotation)
AND 00 0101 Regs = Regs&M emory LDA 00 1101 Regs = M emory
IOR 00 0110 Regs = Regs|M emory STO 00 1110 M emory = Regs
XOR 00 0111 Regs = Regs ⊕ M emory

If the processing mask bit is set in the instruction, the corresponding processor will execute the instruction.
Note that if the calculation is based on the floating-point operands, it will tranucate the tailing of results since the
memory is just 16 bit while the floating-point data type for C is 32-bits. This could result some deviation for the
calculation of floating-point operands.
B. I-type Instructions

TABLE IV: Syntax of I-type Instruction


Op-code: 6-bits Reserved code: 6-bits Processor mask: 4-bits

When the first two bits of op-code is 01, it represents the I-type instructions, the last four bits of op-code represents
a unique interconnections function. This instruction will read network output register of dest processing elements.
All the network are specified in the program with an offline algorithm. The I-type instructions are specified as
Table V.
TABLE V: I-type Instructions Description
Instructions Op-code ( Permutation )
0 1 2 3
PER 01 0000
1 2 3 0
( )
0 1 2 3
PEL 01 0001
3 2 1 0
( )
0 1 2 3
SHF 01 0010
0 2 1 3
( )
0 1 2 3
NXC 01 0011
1 0 3 2
( )
0 1 2 3
FXC 01 0100
2 3 0 1

TABLE VI: Syntax of S-type Instruction


Op-code: 6-bits Value: 10-bits Reserved: 4-bits Memory Address: 12-bits

C. S-type Instruction
When the first two bits are 10, it represents S-type Instruction for calculation, which means to execute the
calculation on the master processor. When the first two bits are 11, it represents to execute control instruction such
as jump instruction. And for these control instruction, the operand address could be up to 12-bits so that these
instructions will load the next entry in the instruction memory to execute.
The S-type instructions for calculation is the same as C-type instructions while the control instructions are
specified as Table VII.

TABLE VII: S-type Control Instructions Description


Instructions Op-code Function Instructions Op-code Function
JMP 11 0000 P C = Address SET I 11 0111 I = 1, F = 0
JPOS 11 0001 if S = 1, P C = Address SET F 11 1000 I = 0, F = 1
JNEG 11 0010 if S = 0, P C = Address SET AMODE 11 1001 AM ode = 0 ∼ 4
JZER 11 0011 if Z = 1, P C = Address LAR 11 1010 Regs(3) = P C
RTS 11 0011 P C = Reg(3) HLT 11 1011 reset = 1
SET RX 11 0100 P E.Regs = V alue JSR 11 1100 P C = V alue
SET RY 00 0101 P E.Regs = values

Note that the memory access for processing element has 5 types of address mode:
1) Immediate: This mode is to decode the operand as constant value.
2) Direct: This mode is to decode the operand as the absolute memory location.
3) Indirect: This mode is to decode the operand as an address base respect to the address stored in the address
register.
4) Ry-: This mode is same as Indirect mode but to decrement the address register after access.
5) Ry+: This mode is same as Indirect mode but to increment the address register after access.
IV. T IMING A SSUMPTION
To measure the execution time of programs, I include a clock into the structure of the SIMD system. This clock
will be incremented according to the instrucion fetched and executed to reflect the amount of time they would
consume in bit logic. In this system, I assume that every instruction is fetched, decoded and executed in sequential
time while each processing element could execute the same instruction on multiple data at the same time.
• Fetch Cycle: In fetch cycle, it loads next instruction’s address into MAR register at the first clock cycle, then
fetch the next instruction into IR register at the second clock cycle. Therefore, this fetch cycle consumes 2
clock cycles for each instruction.
• Decode Cycle: In decode cycle, it translates the instruction in the IR register into different instructions, which
will not increment the clock cycles in the simulator.
• Execute Cycle: In execute cycle, it executes different instructions, which will consume different clock cycles.
According to the manual of 80x86 instruction set [2] and TMS320C6742 Floating-Point DSP [3], I set some
assumptions for integer calculation and floating-point calculation cycles such as the Multiplication for integer
is 12 and for floating-point is 20. More detailed is shown in Table VIII and IX.

TABLE VIII: C-type and S-type(calculation) Instruction Execution Cycles


Input Data Type ADD SUB MPY DIV COM AND IOR XOR
Integer 1-2 1-2 12-13 12-13 1-2 1-2 1-2 1-2
Floating-Point 4-5 4-5 20-21 20-21 4-5 1-2 1-2 1-2
Input Data Type CMP SHL SHR ROL ROR LDA STO
Integer 1-2 1 1 1 1 2-3 2-3
Floating-Point 1-2 1 1 1 1 2-3 2-3

In VIII, note that the clock cycles are variant since the memory access mode may different. If the address mode is
not immediate operands, such as indirect access, the memory access will consume another clock cycle besides the
calculation.

TABLE IX: I-type and S-type(control) Instruction Execution Cycles


Instructions PER PEL SHF NXC FXC JMP JPOS JNEG JZER
Clock Cycles 1 1 1 1 1 2 2 2 2
Instructions JSR RTS SET RX SET RY SET I SET F SET AMODE LAR HLT
Clock Cycles 2 1 2 2 1 1 1 1 1

V. I NTEGER M ATRIX T RANSPOSITION


Based on the instructions defined above and the architecture designed, I write a program to load and transpose
a 4x4 matrix with integer entries on this SIMD system. By choosing the matrix transposition example option in
the program menu, user is allowd to enter a 4x4 integer matrix row by row. After entering all the entries, the test
program will generate coresponding ”load transpose.simd” program in machine code. And then user could execute
SIMD program to transpose the matrix or check the program file.
The SIMD program has two parts:
1)Load Matrix
This is to load entries into the memory units of processing elements. It mainly uses the S-type Instructions SET
RX and C-type instructions STO.

Algorithm 1 Load Matrix


Initialization:
Set IntegerRegister = 1; Set AddressM ode = 1;
Iteration:
1: for all j from 0 to 3 do
2: for all i from 0 to 3 do
3: P E(i).RX ← M atrix(i, j)
4: end for
5: P E.M emory(j) ← P E.RX
6: end for

2)Transpose Matrix
This is to load entries from memory units of processing elements and exchange them by interconnection networks
so that get a transposed matrix stored in the same memory location. It mainly uses the S-type Instructions SET
RX and C-type instrutions LDA and STO, I-type instructions PEL, NXC, and FXC.

Algorithm 2 Transpose Matrix


Initialization:
Set AddressM ode = 2;
Iteration:
1: SET RY exchange address
2: LDA Network output register
3: PEL Read Network input register
4: STO RX
5: REPEAT 1 − 4 with different address and network.

Outputs: I test this program with sample matrix and get the following outputs.
Input: 1,2,3,4,
5,6,7,8,
9,10,11,12,
13,14,15,16,

Output:
The number of instructions executed = 47
The number of clock cycles used = 186

The output matrix in the memory is as below:


1,5,9,13,
2,6,10,14,
3,7,11,15,
4,8,12,16,

Therefore, we can see that the output matrix is correct and the clock cycles used are 186.
VI. F LOATING - POINT M ATRIX M ULTIPLICATION
Based on the instructions defined above and the architecture designed, I write a program to load and multiply
two 4x4 matrices with floating-point entries on this SIMD system. By choosing the matrix transposition example
option in the program menu, user is allowd to enter two 4x4 integer matries row by row. After entering all the
entries, the test program will generate coresponding ”load multiplication.simd” program in machine code. And then
user could execute SIMD program and get the outputs.
The SIMD program has two parts:
1)Load Matrix
This is similar to load entries in matrix transpostion. The main difference is: a)the float register should be set
instead of integer register; b) two matrices will be loaded and stored in memory units of processing elements.
2)Multiply Matrix
This is to load entries from memory units of processing elements and exchange them by interconnection networks
and then execute multiplication so that get the output matrix stored in new memory location after several loops. It
mainly uses the S-type Instructions SET RY, C-type instructions LDA, STO, MPY and ADD, I-type instructions
PEL, NXC, and FXC.
To access the memory from different processing elements every clock cycle, I calculate the matrices multiplication
as below:
P0,j = A0,0 × B0,j + A0,1 × B1,j + A0,2 × B2,j + A0,3 × B3,j
P1,j = A1,1 × B1,j + A1,0 × B0,j + A1,3 × B3,j + A1,2 × B2,j
(1)
P2,j = A2,2 × B2,j + A2,3 × B3,j + A2,0 × B0,j + A2,1 × B1,j
P3,j = A3,3 × B3,j + A3,2 × B2,j + A3,1 × B1,j + A3,0 × B0,j
So that for every processing element, it access different processing elements at the same time, which is convinient
for SIMD. The psecode is as below:

Algorithm 3 Multiply Matrix


Initialization:
Set AddressM ode = 2;
Iteration:
1: SET RY exchange address
2: LDA Network output register from Matrix B
3: PEL Read Network input register
4: MPY RX with Memory from Matrix A
5: STO RX in new Memory as Matrix P
6: REPEAT 1 − 5 with different address and networks to get four multiplication items of every entry in the first
column.
7: ADD RX with Memory
8: STO RX in new Memory
9: REPEAT 1 − 8 to get different columns results.

Outputs: I test this program with sample matrix and get the following outputs.
Input:
0.5 0.1 3.2 2.3
4.1 5.2 6.5 7.1
8.2 9.1 10.1 11
-12.1 -13.2 -14.3 15.2

1.2 0.2 3.3 4.4


4.4 3.2 2.2 1.1
-2.5 -3.6 -4.7 -0.8
8.2 7.2 -2.6 5.2

Output:
The number of instructions executed = 175
The number of clock cycles used = 1049

The output matrix in the memory is as below:


11.8125,5.5,-19,11.625,
69,44.75,-24,55,
113.5,73,-28.75,94,
87,115.5,-41.25,22.25,

The objective answer is as below:


11.9,5.46,-19.15,11.71,
69.77,45.18,-24.04,55.48,
114.83,73.6,-28.99,95.21,
87.79,116.26,-41.28,22.72,
Therefore, we can see that the clock cycles used are 1049 and the answer is basically correct with small deviation
due to the tranucation of a floating-point data into 16-bits memory unit.
VII. P ERFORMANCE M ODEL
For this SIMD system, I consider the master processor will not execute the instructions at the same time with the
processing elements. Therefore, the master processor could be regarded as one processor with other three processors.
That means, in this system, there are 4 processors.
1)Speed-up
Based on the processing elements, the C-type instructions could be executed parallelly, which means the execution
time for C-type instructions is the time that could be speeded-up as T0 , and the time for other instuctions inlcuding
fetch cycles is the time that could not be speed-up as T1 .
Tserial Tserial
Speed − up = T0
= (2)
T1 + 4 Tsimd
2)Efficiency
The efficiency is to normalize the Speed-up with the number of processors[4]. Therefore, in this system, I develop
the efficiency as below.
Speed − up Tserial
Ef f iciency = = (3)
4 4 × Tsimd
3)Utilization
The utilization represents whether the processors are fully utilized. In this SIMD system, I combine one processing
element and the master processor as one processor so that when it is in fetch cycle or S-type instruction execution
cycle, only one processor is utilized. Only when the C-type instruction is executed on all the processing elements,
all the processors are fully used. And since the I-type instruction is used to communicate with processors, which
is not needed in serial processor. The workload associated with this type of instruction is redudency work.
W orksimd
U tilization = (4)
4 × Tsimd
Incorporate the model to the code I developed in example programs, I determine the performance of this SIMD
system for those programs and summarize as Table X:

TABLE X: Performance Model Summary


Programs Numberof Instructions Clock Cycles Serial Clocks Workload Speed-up Efficiency Utilization
Integer Matrix Transposition 47 186 450 453 2.42 0.6048 0.6089
Floating-point Matrices Multiplication 175 1049 3098 3110 2.95 0.7383 0.7411

VIII. P ROBLEMS D ISCUSSION


According to the project above, I summarize the main weakness and strengths of the SIMD system as below:
• Weakness: The SIMD has to execute the same instruction on multiple data. Therefore, in order to execute
calculation parallel, it has to move/set data for each processing elements, which will bring low utilization and
efficiency.
• Strengths: The main strength is surely the speed-up since it can execute the calculation on vectors. If the data
is well organized, the speed-up is remarkable.
• Easy Problems: The problems related on multimedia can be easily solved on this architecture, especially some
problems about graphs since they are space homegenous which could be executed with the same instruction
at the same time.
• Hard Problems: The problems related to scalar calculation will severely degrade the speed-up of a SIMD
system. And since the processing elements of SIMD have to exchange data through interconnection network,
for those problems related to the same memory units, the interconnection network is hard to constructed and
hence the problem is difficult for the SIMD system.
Based on this SIMD processor, I am considering that there should be several additional features are good for it
and should be developed in the future.
• Bits Extend: 16-bits instruction unit and memory unit is not enough for a large amount of processing elements.
To extend it into 32-bits or 64-bits is a good feature as well as a good way to extend the processing elements.
• Pipeline: The Instruction is fetched and decoded in serial time, which is a bottleneck to the efficiency and
utilization. To add the instruction pipleline into this project could improve the performance a lot.
• Assembly Language Support: The machine code is hard to remember and use to program, if it could support
the assembly language to write the *.simd program, it should be more powerful.
• Code Generation: The interface to user is directly through high level language, such as C++ here. If user
could enter the program through the interface and get the corresponding *.simd machine code program, it
will be more convinient for system design and simulation. This is what I have partly added into the project,
the matrix transposition and multiplication program could generate the machine code program and output as
*.simd files, which could be stored and tested better.
IX. C ONCLUSION
In this project, I develop a SIMD mutliprocessor system based on 16-bits memory units. I implement the system
as 4 processing elements and design an instruction format for it that can fetch, decode and execute the instructions.
I also include a clock to measure the execution times of instructions to reflect the amount of the time they would
consume in bit logic. Based on this SIMD system, I write example programs to load and transpose a 4x4 integer
matrix or multiply two 4x4 floating-point matrices. The machine code program could be generated as *.simd files
according to the data entered through the interface. I also test the programs and develop the performance model for
this system and achieve the speed-up, efficiency and utilization for example programs. At last, I also discuss the
weakness and strengths of this SIMD system and gives several possible improvement could be done in the future.
R EFERENCES
[1] A.Yavuz Oruc, VESP COMPUTER-3.0X Manual, 2008.
[2] Leo Literak, Intel 80x86 Instruction Set Manual, http://www.penguin.cz/ literakl/intel/intel.html
[3] Texas Instruments, TMS320C6742 Fixed/Floating-Point DSP Manual, 2011
[4] John L.Hennessy and David A.Patterson, Computer Architecture:A Quantitiave Approach, 2011

You might also like