CAD for VLSI 2 Project - Superscalar Processor Implementation
Superscalar Processor
The ALU
The objective in this phase is to implement the ALU for integer operations. A fast ALU
requires the following:
Fully pipelined Carry Lookahead Adder (CLA) - 32 bit
Fully pipelined Wallace Tree Multiplier (WTM) - 32 bit
Fully pipelined Load - Store Unit (LSU)
Refer to the lab class notes and your earlier homework for this. Each processor in your design
must have A1 CLAs, A2 WTMs and A3 LSUs, where A1, A2 and A3 are parameters.
An addition, multiplication or load-store operation may take
several cycles to complete, but the pipelining above ensures that a new set
of operands can be pushed into the arithmetic units every cycle.
Note that if there is a structural hazard because no functional unit is available, the pipeline
may stall, and none of the instructions that follow the stalled instruction may be scheduled.
In other words, instructions are issued in program order. It is interesting to note that if
the issue is not in program order, the Tomasulo technique described in class will
not handle the data hazards correctly. An id for every instruction should also be passed
through the units. This is needed because a reservation station R that is waiting for a result
from an execution unit E must identify the instruction it is waiting for among the several instructions that
may currently be in flight in the pipeline of E.
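The behavior described above (a new set of operands accepted every cycle, with each result tagged by its instruction id) can be sketched as a small behavioral model. This is only an illustrative Python sketch, not RTL: the class name PipelinedUnit, the tick method and the latency of 3 cycles in the usage note are assumptions and do not come from this specification.

```python
from collections import deque


class PipelinedUnit:
    """Behavioral model of a fully pipelined unit with a fixed latency (hypothetical)."""

    def __init__(self, latency, op):
        self.op = op  # function that computes the result from two operands
        # One slot per pipeline stage; None marks an empty stage (a bubble).
        self.stages = deque([None] * latency, maxlen=latency)

    def tick(self, new_entry=None):
        """Advance one clock cycle.

        new_entry is None or a tuple (instr_id, a, b) pushed this cycle.
        Returns None, or (instr_id, result) for the entry leaving the last stage.
        """
        done = self.stages[-1]
        self.stages.appendleft(new_entry)  # maxlen drops the entry that just left
        if done is None:
            return None
        instr_id, a, b = done
        return instr_id, self.op(a, b) & 0xFFFFFFFF  # keep the low 32 bits
```

For example, with an assumed latency of 3, three multiplications pushed on consecutive ticks come out on three consecutive later ticks, each paired with the id it was pushed with, so a waiting reservation station can tell the results apart.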
3.1
The processor that you have to design is a RISC (Reduced Instruction Set Computer), also
called a Load-Store architecture, with the following instruction set.
Instruction set:
The instruction set of the processor includes
3 Arithmetic instructions
ADD R1, R2, R3 ; //R1 = R2 + R3
SUB R1, R2, R3 ; //R1 = R2 - R3
MUL R1, R2, R3 ; //R1 = R2 * R3
All operations are two's complement operations. At most one of the source operands
of an arithmetic instruction can be a signed immediate operand of 16 bits stored in
two's complement format. For example,
ADD R1, R0, #5; //makes R1 = 5
2 Data transfer instructions
LD R1, [Reg]; //R1 = content of the memory location whose address is specified by Reg
SD [Reg], R1; //[Reg] = R1
2 Control transfer instructions
JMP L1; //Unconditional jump to location L1
BEQZ (Reg), L1; //Jump to L1 if Reg content is zero
L1 is given as an offset from current Program Counter (PC). This is called PC-relative
addressing.
Halt instruction
HLT
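The semantics of the instruction set above can be captured in a small reference interpreter, which may be handy later as a golden model when checking the RTL. This is a sketch under stated assumptions, not part of the specification: the tuple encoding of instructions, the function names, counting the PC in instructions with offsets relative to the current instruction, 32 registers, memory defaulting to zero, and keeping only the low 32 bits of a product are all assumptions made for illustration.

```python
MASK32 = 0xFFFFFFFF


def sext16(value):
    """Sign-extend a 16-bit two's complement immediate."""
    value &= 0xFFFF
    return value - 0x10000 if value & 0x8000 else value


def run(program, regs=None, mem=None, max_steps=10_000):
    """Execute a list of tuples such as ("ADD", 1, 0, ("imm", 5))."""
    regs = list(regs) if regs is not None else [0] * 32
    mem = dict(mem or {})
    pc = 0  # instruction index, not a byte address (assumption)

    def val(src):
        # A source is a register number or an ("imm", value) pair.
        return sext16(src[1]) if isinstance(src, tuple) else regs[src]

    for _ in range(max_steps):
        op, *args = program[pc]
        if op == "HLT":
            break
        if op in ("ADD", "SUB", "MUL"):
            rd, s1, s2 = args
            a, b = val(s1), val(s2)
            res = a + b if op == "ADD" else a - b if op == "SUB" else a * b
            regs[rd] = res & MASK32  # wrap to 32 bits (two's complement)
            pc += 1
        elif op == "LD":
            rd, raddr = args
            regs[rd] = mem.get(regs[raddr], 0)
            pc += 1
        elif op == "SD":
            raddr, rs = args
            mem[regs[raddr]] = regs[rs]
            pc += 1
        elif op == "JMP":
            pc += args[0]  # PC-relative offset
        elif op == "BEQZ":
            reg, off = args
            pc += off if regs[reg] == 0 else 1
    return regs, mem
```

A short loop that adds 5 to R3 three times and stores the result exercises all four instruction classes; the interpreter leaves 15 in R3 and at memory address 0.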
There are 5 stages of instruction execution, as shown in Figure 1. The instructions are assumed to be of fixed length, 4 bytes each. In a store instruction, the
WB stage is non-existent. In an arithmetic instruction, the MEM stage is non-existent. The
processor is also pipelined at the instruction level.
1. Instruction fetch cycle (IF):
IR ← Mem[PC];
NPC ← PC + 4;
Operation: Send out the Program Counter (PC) and fetch the instruction from memory into the Instruction Register (IR); increment the PC by 4 to address the next
sequential instruction. The IR holds the instruction that will be needed on
subsequent clock cycles; likewise, the register NPC holds the next sequential
PC. The above describes fetching one instruction at a time. In the superscalar
architecture you should fetch P1 instructions at a time, since the
aim is to execute more than one instruction in every cycle.
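Fetching P1 instructions at a time can be illustrated with a short sketch. The function name fetch_group, the dictionary used as instruction memory and the choice to advance NPC by 4 * P1 are assumptions for illustration; the specification only states that P1 instructions are fetched and that each instruction is 4 bytes long.

```python
def fetch_group(imem, pc, p1):
    """Return (instructions, npc) for up to p1 sequential 4-byte instructions.

    imem maps byte addresses to instructions (hypothetical representation).
    """
    group = [imem[pc + 4 * i] for i in range(p1) if pc + 4 * i in imem]
    npc = pc + 4 * p1  # assumption: NPC points past the whole fetched group
    return group, npc
```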
Figure 1: Stages of instruction execution - Instruction Fetch (IF), Instruction Decoding (ID), Execution or Address evaluation (EX), Memory access/branch completion (MEM).
the address used is the one computed during the prior cycle and stored in the
register ALUOutput.
Note: Each processor has two caches - the Instruction cache and the Data cache.
The memory has two ports - a read port for accessing instructions and a read/write
port for accessing data. Conflicts in addressing on these ports, namely the same
address being loaded on both ports, should be resolved. When two or more Load/Store
units try to access the cache, there is a structural hazard on
the data cache, which stalls the pipeline inside the Load/Store units.
In your implementation, assume that a cache-based structural hazard costs one
extra cycle for simultaneous access by two LSUs. In the worst case you may
waste A3 - 1 cycles due to cache-based structural hazards. In the case of a
cache miss, after the miss is detected, assume it takes two clock cycles to
access memory and read/write data.
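The timing rules in this note can be turned into a small calculation. The sketch below assumes that simultaneously requesting LSUs are served one per cycle in list order and that a miss adds its two cycles only to the LSU that missed; neither the service order nor the interaction between misses and waiting LSUs is fixed by the text, and the function name extra_cycles is hypothetical.

```python
def extra_cycles(hits, miss_penalty=2):
    """Extra cycles seen by each LSU that accesses the data cache in the same cycle.

    hits[i] is True if the i-th LSU (in service order) hits in the cache.
    """
    delays = []
    for position, hit in enumerate(hits):
        wait = position  # one extra cycle for every LSU served earlier
        if not hit:
            wait += miss_penalty  # two cycles to access memory after a miss
        delays.append(wait)
    return delays
```

With A3 = 3 LSUs all hitting at once, the last one waits 2 = A3 - 1 extra cycles, matching the worst case stated above.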
5. Write-back cycle (WB):
Register-Register ALU instruction:
Regs[rd] ← ALUOutput;
Load instruction:
Regs[rt] ← LMD;
Operation: Write the result into the register file, whether it comes from the
memory system (which is in LMD) or from the ALU (which is in ALUOutput);
the register destination field is also in one of two positions (rd or rt) depending
on the effective opcode.
The write back in the superscalar processor is on the Common Data Bus (CDB), which
communicates results back to the reservation stations. The CDB is shared by several
execution units to write back results. The CDB should be designed so that
C1 units can commit their results at a time. The CDB has 32 × C1 data lines
and works as follows. Note that C1 ≤ A1 + A2 + A3. The bus arbiter
has a simple circular-token protocol. It has a register that stores an integer
k identifying one of the K = A1 + A2 + A3 execution units. In the current cycle the bus arbiter grants
the CDB to the next C1 units, starting from the kth execution unit and proceeding in a circular fashion, that have a request
to write into the CDB.
Note: The Write-back cycle resets the Register status indicator and the memory
status indicator (if applicable).
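The circular-token arbitration of the CDB can be sketched as follows. This is an illustrative model only: units are indexed from 0 here for convenience, the function name arbitrate is hypothetical, and the rule for advancing k (to the unit after the last one granted) is an assumption, since the text does not say how the register is updated.

```python
def arbitrate(requests, k, c1):
    """Grant the CDB to at most c1 requesting units, scanning circularly from unit k.

    requests: list of booleans, one per execution unit (index 0..K-1).
    Returns (granted_units, next_k).
    """
    total = len(requests)
    granted = []
    for step in range(total):
        unit = (k + step) % total
        if requests[unit]:
            granted.append(unit)
            if len(granted) == c1:
                break
    # Assumption: the token moves to the unit after the last one granted.
    next_k = (granted[-1] + 1) % total if granted else k
    return granted, next_k
```

Because the scan start moves on after every grant, a unit that keeps requesting cannot be starved by lower-numbered units.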
3.2
The ideal CPI (Cycles per Instruction) of a pipelined processor is 1, so we cannot do
better than that without introducing redundancy. This redundancy takes the form of parallel
execution units in the EX stage, as shown in Figure 2. This arrangement allows overlapped
and out-of-order execution of instructions in the EX stage in addition to conventional
pipelining, and has the potential to achieve a CPI < 1.
Figure 2: IF and ID stages followed by parallel execution units EX1, EX2, EX3, ..., EXN.
3.3 Pipelining hazards
Hazards are situations that prevent the next instruction in the instruction stream from
executing in its designated clock cycle. Hazards may stall the pipeline. There are
three types of hazards:
Structural - If the functional units are not duplicated enough to accommodate the overlap in execution, some combinations of instructions cannot run in parallel and a structural
hazard results. For example, there is only one write port but the pipeline requires 2 writes
in the same clock cycle.
Data - Data hazards are explained shortly.
Control - Control hazards arise from the pipelining of branches and other instructions that change
the Program Counter (PC). For example, in a conditional jump instruction, until the condition is evaluated the new PC may be either the incremented PC value or the target address
specified in that instruction. To avoid this we either stall the pipeline for 2 cycles or
use branch predictors.
In this project assume no branch predictors are used. Instead, we choose to stall the processor. When a conflict is encountered, all instructions before the stalled instruction
continue and all instructions after the stalled instruction are stalled.
3.3.1
1. RAW - Read After Write Consider the instruction sequence given below.
ADD R1, R2, R3
SUB R4, R1, R5
The result of ADD instruction that is written into R1 is required for the SUB instruction to proceed.
2. WAW - Write After Write
LD R1, [addr]
SUB R4, R1, R6
ADD R1, R2, R3
The result of the ADD must not be written to R1 before the result of the LD is written to R1, as the
LD result is needed by SUB. In addition, if the LD suffers a cache miss, the ADD may
reach the WB stage before the LD, and R1 would then hold the older value (that of the LD) at the
end of the sequence.
3. WAR - Write After Read
SD [addr], R4
ADD R4, R3, R2
The older value of R4 must be stored in [addr] by the SD instruction before
R4 is updated with the new value by the ADD instruction.
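The three data hazards above can be detected mechanically from the destination and source registers of two instructions in program order. The sketch below is illustrative; the (dest, sources) representation and the function name hazards are assumptions, and memory operands are ignored.

```python
def hazards(first, second):
    """Classify register data hazards between two instructions in program order.

    Each instruction is (dest, sources); dest is None when no register
    is written (for example, a store).
    """
    d1, s1 = first
    d2, s2 = second
    found = set()
    if d1 is not None and d1 in s2:
        found.add("RAW")  # second reads what first writes
    if d1 is not None and d1 == d2:
        found.add("WAW")  # both write the same register
    if d2 is not None and d2 in s1:
        found.add("WAR")  # second writes what first reads
    return found
```

Applied to the three examples: ADD followed by SUB gives RAW, LD R1 followed by ADD R1 gives WAW, and SD of R4 followed by ADD R4 gives WAR.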
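Figure 3: Issue unit, memory status cache, register file, register status indicator, reservation stations RS1, RS2, ..., RSN and execution units EX1, EX2, ..., EXN.
Reservation station format: Operation, Qk, Vj, Vk, Address, Busy.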
3.4
The hardware used to overcome data hazards is shown in Figure 3. There are
K = A1 + A2 + A3 execution units running in parallel, which place their results on a common bus
(the Common Data Bus, CDB). Each execution unit has an identification number, which is an
integer in the range [1..K]. The register file is an array of registers that supplies the inputs to
the execution units. It has K triples, each consisting of a 5-bit input to specify the register, a read/write input
signal and a 32-bit output port. The memory status cache is an associative memory with
each entry as shown in Figure 4, and is implemented to avoid the memory-aliasing problem.
The register status indicator is implemented for handling the RAW and WAR hazards.
Each execution unit is driven by an intermediate block called a reservation station. The bits
of a reservation station are set by the issue unit. The register status bits indicate the
following:
(0, 0): the register is not currently being written by any instruction
(i, j): execution unit i is currently evaluating the instruction with id j, whose
result is to be written to the register.
Whenever an execution unit finishes evaluation, it puts its result and its id on the CDB.
The reservation stations of the other execution units wait for the result from a particular
execution unit by constantly snooping the CDB. The format of the bits in a reservation station
is given below.
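The interaction between the register status indicator, the reservation stations and the CDB can be sketched in a few lines. This is a behavioral illustration under assumptions, not the required design: the class and function names are hypothetical, the Qj field is assumed by symmetry with Qk, and the issue function handles only two register sources.

```python
from dataclasses import dataclass
from typing import Optional, Tuple

Tag = Tuple[int, int]  # (execution unit, instruction id)
NO_TAG: Tag = (0, 0)   # register not being written by any instruction


@dataclass
class ReservationStation:
    op: str = ""
    qj: Tag = NO_TAG           # producer of the first operand, if pending (assumed field)
    qk: Tag = NO_TAG           # producer of the second operand, if pending
    vj: Optional[int] = None   # value of the first operand once known
    vk: Optional[int] = None   # value of the second operand once known
    address: Optional[int] = None
    busy: bool = False

    def ready(self):
        return self.busy and self.qj == NO_TAG and self.qk == NO_TAG

    def snoop(self, tag, value):
        """Capture a CDB broadcast if it matches a pending operand."""
        if self.qj == tag:
            self.vj, self.qj = value, NO_TAG
        if self.qk == tag:
            self.vk, self.qk = value, NO_TAG


def operand(reg, regs, reg_status):
    """Return (tag, value): a pending producer tag, or the register value."""
    tag = reg_status[reg]
    return (tag, None) if tag != NO_TAG else (NO_TAG, regs[reg])


def issue(rs, op, srcs, dest, tag, regs, reg_status):
    """Fill station rs for `op dest, srcs[0], srcs[1]`; tag identifies the producer."""
    rs.qj, rs.vj = operand(srcs[0], regs, reg_status)
    rs.qk, rs.vk = operand(srcs[1], regs, reg_status)
    rs.op, rs.busy = op, True
    reg_status[dest] = tag  # set after reading sources, so dest may equal a source
```

For the RAW example ADD R1, R2, R3 followed by SUB R4, R1, R5, the SUB station records the tag of the ADD in its first operand and becomes ready only when that tag appears on the CDB.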
3.5 The CACHE
The cache is used to bridge the gap between the speeds of the fast processor and the slow
main memory. The cache memory is smaller and faster than the main memory. It
sits between the processor and the main memory and holds data from the portion of main
memory that is being referenced locally. The use of a cache is motivated by the principle of locality
of reference. There are two basic types of cache, namely the fully associative and the direct
mapped cache. We use a cache that is a combination of both, the set associative cache.
Figure: A cache line of the set associative cache holding several entries, each with Tag, Data and V (valid) fields.
Parameter list:
A1, A2, A3, P1, C1, C2, C3, N
A1 - Number of CLAs in the processor.
A2 - Number of WTMs in the processor.
A3 - Number of LSUs in the processor.
P1 - Number of instructions fetched at a time.
C1 - Number of execution units whose results are to be committed simultaneously.
C2 - Number of cache lines in the set associative cache.
C3 - Number of cache entries held by a cache line in the set associative cache, in other
words, the number of ways in the set-associative cache.
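The roles of C2 and C3 can be seen in a minimal behavioral sketch of a set associative cache. Several details here are assumptions not fixed by this document: the mapping of an address to a line (address modulo C2), the tag (address divided by C2), FIFO replacement within a line, and the class and method names.

```python
class SetAssociativeCache:
    """Behavioral sketch: c2 cache lines, each holding up to c3 entries (ways)."""

    def __init__(self, c2, c3):
        self.c2, self.c3 = c2, c3
        # An entry is a (tag, data) pair; the V bit is implied by the
        # entry being present in its line.
        self.lines = [[] for _ in range(c2)]

    def _locate(self, addr):
        return addr % self.c2, addr // self.c2  # (line index, tag), an assumed mapping

    def read(self, addr):
        """Return (hit, data); data is None on a miss."""
        index, tag = self._locate(addr)
        for entry_tag, data in self.lines[index]:
            if entry_tag == tag:
                return True, data
        return False, None

    def fill(self, addr, data):
        """Install or update an entry, evicting the oldest when the line is full."""
        index, tag = self._locate(addr)
        line = self.lines[index]
        for i, (entry_tag, _) in enumerate(line):
            if entry_tag == tag:
                line[i] = (tag, data)
                return
        if len(line) == self.c3:
            line.pop(0)  # FIFO replacement (assumption)
        line.append((tag, data))
```

With C2 = 2 and C3 = 2, addresses 0, 2 and 4 all map to line 0, so filling the third one evicts the first while the other line stays empty.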
Once the RTL is developed, the next document will give you the verification plan, which
will enable you to do the functional verification of your RTL.
Implementation
Your Verilog code must follow the synthesis guidelines discussed in class. You
will be required to take the design through the various steps of the design flow later. The primary
requirement for those stages is that the code is synthesizable. Further instructions will be
given as you proceed. Remember that this is a group project and partitioning of your design
is an absolute requirement. Use your time judiciously.
Unlike the project specifications for other groups, a grading scheme for the report is not provided.
I will talk to the groups and decide on the grading policy.