CAD for VLSI 2

Project - Superscalar Processor Implementation

Superscalar Processor

Objective: The main objective is to implement a superscalar pipelined processor using
Verilog HDL. This project may be divided into three parts:
The Arithmetic Logic Unit
The Pipelined Processor Architecture
Cache design


The Arithmetic Logic Unit

The objective in this phase is to implement the ALU for integer operations. A fast ALU
requires the following:
Fully pipelined Carry Lookahead Adder (CLA) - 32 bit
Fully pipelined Wallace Tree Multiplier (WTM) - 32 bit
Fully pipelined Load - Store Unit (LSU)
Refer to the lab class notes and your earlier homework for these. Each processor in your design
must have A1 CLAs, A2 WTMs and A3 LSUs, where A1, A2 and A3 are parameters.
An addition, multiplication or load-store operation may take several cycles to complete,
but because the units are fully pipelined, a new set of operands can be pushed into each
arithmetic unit on every cycle.
Note that if there is a structural hazard because no functional unit is available, the pipeline
stalls, and none of the instructions that follow the stalled instruction may be scheduled.
In other words, instructions are issued in program order. It is interesting to note that if
the issue were not in program order, the Tomasulo technique described in class would
not handle the data hazards correctly. An id for every instruction should also be passed
through the units. This is needed because a reservation station R that is waiting for a result
from an execution unit E must specify which instruction it is waiting for, out of the several
instructions that may currently be in flight in the pipeline of E.
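
The behavior described above (a new operand set accepted every cycle, with the instruction id travelling alongside the data) can be sketched as a small behavioral model. This is only an illustrative Python reference model, not Verilog and not the required implementation; the class name `PipelinedUnit`, the `latency` parameter and the 32-bit truncation of the result are assumptions made for the sketch.

```python
from collections import deque

class PipelinedUnit:
    """Hypothetical behavioral model of a fully pipelined functional unit."""

    def __init__(self, latency, op):
        self.op = op
        # One slot per pipeline stage; None marks an empty stage (a bubble).
        self.stages = deque([None] * latency, maxlen=latency)

    def clock(self, new_input=None):
        """Advance one cycle.

        new_input is (instr_id, a, b) or None. Returns (instr_id, result)
        for the instruction leaving the last stage, or None for a bubble.
        """
        leaving = self.stages[-1]
        # With maxlen set, appendleft drops the entry in the last stage.
        self.stages.appendleft(new_input)
        if leaving is None:
            return None
        instr_id, a, b = leaving
        return instr_id, self.op(a, b) & 0xFFFFFFFF  # assumed 32-bit result

# Three back-to-back additions on a unit with an assumed latency of 3 cycles.
adder = PipelinedUnit(3, lambda a, b: a + b)
outputs = [adder.clock((1, 2, 3)), adder.clock((2, 10, 20)), adder.clock((3, 7, 1)),
           adder.clock(), adder.clock(), adder.clock()]
```

Each result leaves the unit tagged with the id of the instruction that produced it, which is what lets a waiting reservation station pick out the one result it needs.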

The Pipelined Processor


The Basic Pipeline

The processor that you have to design is a RISC (Reduced Instruction Set Computer) processor, also
called a load-store architecture, with the following instruction set.

General purpose registers:

Assume that there are thirty-two 32-bit registers, named R0, ..., R31. R0 always stores
the value 0 to facilitate many calculations involving zero (jump on zero, for example).

Instruction set:
The instruction set of the processor includes
3 Arithmetic instructions
ADD R1, R2, R3 ; //R1 = R2 + R3
SUB R1, R2, R3 ; //R1 = R2 - R3
MUL R1, R2, R3 ; //R1 = R2 * R3
All operations are two's complement operations. One of the source operands
of an arithmetic instruction may be a signed immediate operand of 16 bits stored in
two's complement format. For example,
ADD R1, R0, #5;
makes R1 = 5.
2 Data transfer instructions
LD R1, [Reg]; //R1 = content of the memory location whose address is specified by Reg
SD [Reg], R1; //[Reg] = R1
2 Control transfer instructions
JMP L1; //Unconditional jump to location L1
BEQZ (Reg), L1; //Jump to L1 if Reg content is zero
L1 is given as an offset from the current Program Counter (PC). This is called PC-relative addressing.
Halt instruction
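
The arithmetic semantics above can be checked against a small reference model. The sketch below is a Python illustration under stated assumptions: the helper names (`sign_extend16`, `to_signed32`, `alu`) are hypothetical, and results are assumed to be truncated to 32 bits, which the text does not state explicitly for MUL.

```python
MASK32 = 0xFFFFFFFF

def sign_extend16(imm16):
    """Interpret the low 16 bits as a two's complement value."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16

def to_signed32(x):
    """Interpret a 32-bit pattern as a two's complement value."""
    x &= MASK32
    return x - (1 << 32) if x & 0x80000000 else x

def alu(op, a, b):
    """ADD, SUB or MUL with the result wrapped to 32 bits (assumption)."""
    if op == "ADD":
        result = a + b
    elif op == "SUB":
        result = a - b
    elif op == "MUL":
        result = a * b
    else:
        raise ValueError("unknown operation: " + op)
    return result & MASK32

# ADD R1, R0, #5 with R0 = 0
r1 = alu("ADD", 0, sign_extend16(5))
```

For example, an immediate field of 0xFFFE sign-extends to -2, so multiplying 3 by it gives the 32-bit pattern for -6.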
There are basically 5 stages of instruction execution, as shown in Figure 1. The instructions are assumed to be of a fixed length of 4 bytes each. A store instruction has no
WB stage, and an arithmetic instruction has no MEM stage. The
processor is also pipelined at the instruction level.
1. Instruction fetch cycle (IF):
IR ← Mem[PC];
NPC ← PC + 4;
Operation: Send out the Program Counter (PC) and fetch the instruction from memory into the Instruction Register (IR); increment the PC by 4 to address the next
sequential instruction. The IR is used to hold the instruction that will be needed on
subsequent clock cycles; likewise the register NPC is used to hold the next sequential
PC. The above describes fetching one instruction at a time. In the superscalar
architecture you should fetch P1 instructions at a time, since the goal is to
execute more than one instruction in every cycle.

[Figure 1 (diagram not reproduced). Stages in order: Instruction Fetch (IF); Instruction Decoding (ID); Execution or Address evaluation (EX); Memory access/branch completion (MEM); Write back results (WB).]

Figure 1: The five stages of instruction execution
2. Instruction Decode/Register fetch cycle (ID):
A ← Regs[rs];
B ← Regs[rt];
Imm ← sign-extended immediate field of IR;
Operation: Decode the instruction and access the register file to read the registers (rs
and rt are the register specifiers). The outputs of the general purpose registers are
read into two temporary registers (A and B) for use in later clock cycles. The lower
16 bits of the IR are also sign extended and stored into the temporary register Imm,
for use in the next cycle.
Decoding is done in parallel with reading registers, which is possible because
the register specifier fields are at a fixed location in the instruction format. Because the
immediate portion of an instruction is assumed to be located in an identical place in every instruction,
the sign-extended immediate is also calculated during this cycle in case it is needed
in the next cycle.
The above describes how to decode one instruction. You should decode P1
instructions in parallel. In addition, in superscalar execution, the register status
indicators have to be consulted before registers are fetched. Also beware of Load and Store
instructions, which read registers to calculate memory addresses. These register
reads can lead to RAW hazards. This stage is responsible for dynamically scheduling
up to P1 instructions at a time onto the respective A1, A2 and A3 units. If units are
not available, stall the pipeline, as this is a structural hazard.
The memory aliasing problem is to be handled using an associative memory as the memory
status indicator. Note that the size of this associative memory will be A3 × (number
of pipeline stages in the Load-Store unit). This is the maximum
number of memory addresses that could be accessed at a time.
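
The in-order issue rule (stall at the first instruction whose unit type is unavailable, and schedule nothing after it) can be sketched as follows. This is a Python illustration only; the function name `issue_in_order` and the opcode-to-unit mapping `UNIT_FOR` (ADD/SUB on a CLA, MUL on a WTM, LD/SD on an LSU) are assumptions made for the sketch, and register status checks are left out.

```python
# Assumed mapping from opcode to the kind of unit that executes it.
UNIT_FOR = {"ADD": "CLA", "SUB": "CLA", "MUL": "WTM", "LD": "LSU", "SD": "LSU"}

def issue_in_order(opcodes, free_units):
    """Issue a fetched group in program order.

    opcodes: the (up to P1) opcodes of the group, in program order.
    free_units: dict mapping unit kind to the number of free units this cycle.
    Returns the opcodes issued this cycle; issue stops at the first
    instruction that hits a structural hazard.
    """
    free = dict(free_units)  # work on a copy, the caller's dict is untouched
    issued = []
    for opcode in opcodes:
        kind = UNIT_FOR[opcode]
        if free.get(kind, 0) == 0:
            break  # structural hazard: stall this and all later instructions
        free[kind] -= 1
        issued.append(opcode)
    return issued

# Only one CLA is free, so the SUB stalls and the LD behind it is held back too.
group = issue_in_order(["ADD", "MUL", "SUB", "LD"], {"CLA": 1, "WTM": 1, "LSU": 1})
```

Note that the LD is not issued even though an LSU is free, because issue must stay in program order.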

3. Execution/Effective Address cycle (EX):

The ALU operates on the operands prepared in the prior cycle, performing one of the
following four functions depending on the instruction type.
Memory reference: (LD and SD)
ALUOutput ← R0 + Reg;
Operation: The ALU adds R0 to the contents of Reg, fetched in the earlier cycle,
to form the effective address and places the result into the register ALUOutput.
Consult the memory status indicator for resolving the memory aliasing problem.
Register-Register ALU instruction:(ADD, SUB and MUL)
ALUOutput ← A op B;
Operation: The ALU performs the operation specified by the function code on
the value in register A and on the value in register B. The result is placed in the
temporary register ALUOutput.
Register-Immediate ALU Instruction:(ADD, SUB and MUL)
ALUOutput ← A op Imm;
Operation: The ALU performs the operation specified by the opcode on the value
in register A and on the operand Imm. The result is placed in the temporary register ALUOutput.
Branch:
ALUOutput ← NPC + (Imm << 2);
Cond ← (A == 0);
Operation : The ALU adds the NPC to the sign-extended immediate value in Imm,
which is shifted left by 2 bits to create a word offset, to compute the address of
the branch target. Register A, which has been read in the prior cycle, is checked
to determine whether the branch is taken. Since we are considering only one
form of branch (BEQZ), the comparison is against 0. Note that BEQZ is actually a
pseudo instruction that translates to a BEQ with R0 as an operand. For simplicity,
this is the only form of branch we consider.
To reduce the penalty due to control hazards, jumps can be treated specially.
Both unconditional and conditional jumps may be decoded in the IF cycle
itself. Unconditional jumps can then be executed in the IF cycle and conditional
jumps in the ID cycle. This is straightforward to implement. Note that out of the P1
instructions fetched along with a JMP instruction, none of the instructions that appear
after the jump instruction should be scheduled. In the case of a conditional jump,
the pipeline should be stalled for one cycle due to the control hazard.
The load-store architecture enables the effective memory address calculation and
execution cycle to be combined into a single clock cycle, since no instruction
needs to simultaneously calculate a data address, calculate an instruction target
address, and perform an operation on the data.
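
The branch computation in this stage (target = NPC + (Imm << 2), Cond = (A == 0)) can be illustrated with a short reference calculation. This is a Python sketch under assumptions: the function names `branch_ex` and `next_pc` are hypothetical, and the target is assumed to wrap to 32 bits.

```python
MASK32 = 0xFFFFFFFF

def sign_extend16(imm16):
    """Interpret the low 16 bits as a two's complement value."""
    imm16 &= 0xFFFF
    return imm16 - 0x10000 if imm16 & 0x8000 else imm16

def branch_ex(npc, imm16, a):
    """Return (branch target, Cond) for a BEQZ, given NPC, the raw
    16-bit immediate field and the value read into register A."""
    imm = sign_extend16(imm16)
    target = (npc + (imm << 2)) & MASK32  # word offset, assumed 32-bit wrap
    cond = (a == 0)
    return target, cond

def next_pc(npc, imm16, a):
    """PC selected once the branch condition is known."""
    target, cond = branch_ex(npc, imm16, a)
    return target if cond else npc

taken = branch_ex(0x104, 3, 0)           # A is zero, branch forward 3 words
not_taken = branch_ex(0x104, 0xFFFF, 7)  # A is non-zero, offset is -1 word
```

A negative offset works the same way: the immediate 0xFFFF sign-extends to -1, which shifts to -4 bytes.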
4. Memory access cycle (MEM):
Memory reference :
LMD ← Mem[ALUOutput]; or
Mem[ALUOutput] ← B;
Operation: Access memory, if needed. If the instruction is a load, data returns
from memory and is placed in the LMD (load memory data) register; if it is a
store, then the data from the B register is written into memory. In either case
the address used is the one computed during the prior cycle and stored in the
register ALUOutput.
Note: Each processor has two caches - the Instruction cache and the Data cache.
The memory has two ports - a read port for accessing instruction and a read/write
port for accessing data. Conflicts in addressing on these ports, namely same
address loaded on the ports should be resolved. When two or more Load/Store
units try to access the cache, there would be a structural hazard for accessing
the data cache, resulting in stalling of the pipeline inside the Load/Store units.
In your implementation, assume that a Cache-based structural hazard costs one
extra cycle for simultaneous access by two LSUs. In the worst case you may
waste A3 - 1 cycles due to Cache-based structural hazards. In the case of a
Cache miss, assume that it takes two clock cycles after the miss is detected to
access memory and read/write the data.
5. Write-back cycle (WB):
Register-Register ALU instruction:
Regs[rd] ← ALUOutput;
Load instruction:
Regs[rt] ← LMD;
Operation: Write the result into the register file, whether it comes from the
memory system (which is in LMD) or from the ALU (which is in ALUOutput);
the register destination field is also in one of two positions (rd or rt) depending
on the effective opcode.
The write back in the superscalar design is on the Common Data Bus (CDB), whose
contents are communicated back to the reservation stations. The CDB is shared by several
execution units to write back results. The CDB should be designed so that
C1 units can commit their results at a time. The CDB has 32 × C1 data lines.
Note that C1 ≤ A1 + A2 + A3. The bus arbiter
uses a simple circular-token protocol. It has a register which stores an integer
k identifying one of the K = A1 + A2 + A3 execution units. In the current cycle the bus
arbiter permits the next C1 units that have a request for the CDB, starting from the kth
execution unit and proceeding in a circular fashion, to write onto the CDB.
Note: The Write-back cycle resets the Register status indicator and the memory
status indicator (if applicable).
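
The circular-token arbitration of the CDB can be sketched as below. This is a Python illustration, not the required RTL. The function name `arbitrate` is hypothetical, and the rule for advancing the token (to the unit after the last one granted) is an assumption, since the text does not say how the stored integer is updated.

```python
def arbitrate(requests, k, c1):
    """Grant the CDB to at most c1 requesting units, scanning from unit k.

    requests: list of booleans for units 1..K (index 0 is unit 1).
    k: current token, a unit number in 1..K.
    Returns (granted unit numbers in scan order, next token value).
    """
    total = len(requests)
    granted = []
    for offset in range(total):
        unit = (k - 1 + offset) % total + 1  # circular scan starting at unit k
        if requests[unit - 1]:
            granted.append(unit)
            if len(granted) == c1:
                break
    # Assumed update rule: the token moves just past the last granted unit.
    next_k = granted[-1] % total + 1 if granted else k
    return granted, next_k

# K = 5 units; units 2, 3 and 5 request the bus, the token is at 3 and C1 = 2.
grants, token = arbitrate([False, True, True, False, True], k=3, c1=2)
```

Unit 2 is not served in this cycle, but the token has moved past unit 5, so unit 2 is first in line in the next cycle.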


Implementation of the Parallelism

The ideal CPI (Cycles per Instruction) of a pipelined processor is 1. So we cannot achieve
better than that without introducing redundancy. This redundancy is in the form of parallel
execution units in the EX stage as shown in Figure 2. This arrangement allows overlapped
and out-of-order execution of instructions in the EX stage in addition to conventional
pipelining, and so has the potential to achieve a CPI < 1.







Figure 2: Duplication of functional units for parallelism


Pipelining hazards

Hazards are situations which prevent the next instruction in the instruction stream from
getting executed in its designated clock cycle. Hazards may stall the pipeline. There are
three types of hazards:
Structural hazards arise when the hardware cannot run some combination of instructions in parallel, even though some functional units are duplicated to accommodate overlap in execution.
For example, we have only one write port and pipelining requires 2 writes
to be done in the same clock cycle.
Data hazards are explained shortly.
Control hazards arise from the pipelining of branches and other instructions that change
the Program Counter (PC). For example, until the condition of a conditional jump instruction is evaluated, the new PC may be either the incremented PC value or the address
specified in that instruction. To deal with this we either stall the pipeline for 2 cycles or
use branch predictors.
In this project assume no branch predictors are used. Instead, we choose to stall the processor. When a conflict is encountered, all instructions before the stalled instructions need
to continue and all the instructions after the stalled instruction need to be stalled.

Data hazard classification

1. RAW - Read After Write. Consider the instruction sequence given below.
ADD R1, R2, R3
SUB R4, R1, R5
The result of the ADD instruction, which is written into R1, is required for the SUB instruction to proceed.
2. WAW - Write After Write
LD R1, [addr]
SUB R4, R1, R6
ADD R1, R2, R3
The result of the ADD must not be written to R1 before the LD has written R1, because
the loaded value is needed by the SUB. In addition, if the LD suffers a cache miss, the ADD
may reach the WB stage before the LD. The LD then writes last, so R1 holds the older
value at the end of the sequence.
3. WAR - Write After Read
SD [addr], R4
ADD R4, R3, R2
The older value of R4 must be stored to [addr] by the SD instruction before
R4 is updated with the new value by the ADD instruction.
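
The three hazard classes can be expressed as simple checks on the destination and source registers of two instructions in program order. The following is a Python sketch for illustration; the function name `classify_hazards` and the (destination, sources) encoding of an instruction are assumptions made here.

```python
def classify_hazards(first, second):
    """Return the set of data hazards that `second` has on `first`.

    Each instruction is a pair (dest, sources): dest is a register name or
    None (for a store), and sources is a set of register names that are read.
    `first` comes earlier in program order.
    """
    dest1, sources1 = first
    dest2, sources2 = second
    hazards = set()
    if dest1 is not None and dest1 in sources2:
        hazards.add("RAW")  # second reads what first writes
    if dest1 is not None and dest1 == dest2:
        hazards.add("WAW")  # both write the same register
    if dest2 is not None and dest2 in sources1:
        hazards.add("WAR")  # second overwrites what first still has to read
    return hazards

# The three examples from the list above.
raw = classify_hazards(("R1", {"R2", "R3"}), ("R4", {"R1", "R5"}))  # ADD then SUB
waw = classify_hazards(("R1", set()), ("R1", {"R2", "R3"}))         # LD then ADD
war = classify_hazards((None, {"R4"}), ("R4", {"R3", "R2"}))        # SD then ADD
```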

[Figure 3 (diagram not reproduced). Recoverable labels: Issue unit; Common Data Bus (CDB); reservation station fields Qk, Vj, Vk, Address, Busy.]

Figure 3: Hardware for handling the pipelining hazards


Hardware for handling pipelining hazards

The hardware used to overcome data hazards is shown in Figure 3. There are
K = A1 + A2 + A3 execution units running in parallel, driving their results onto a common bus
(the Common Data Bus, CDB). Each execution unit has an identification number which is an
integer in the range [1..K]. The register file is an array of registers which gives the inputs to
the execution units. It has K triples, each consisting of a 5-bit input to specify the register, a read/write input
signal and a 32-bit output port. The memory status cache is an associative memory with
each entry as shown in Figure 4 and is implemented to avoid the memory-aliasing problem.
The register status indicator is implemented for handling the RAW and WAR hazards.
Each execution unit is driven by an intermediate block called a reservation station. The bits
of the reservation station are changed by the issue unit. The register status bits of a register are
(0, 0): if the register is not currently being written by any instruction
(i, j): if execution unit i is currently evaluating the instruction with id j, whose
result is to be written to the register.
Whenever an execution unit finishes evaluation, it puts its result and its id on the CDB.
The reservation stations that are waiting for a result from a particular
execution unit constantly snoop the CDB. The format of the bits in a reservation station
is given below.

Qj = 0 indicates that Vj holds the value of operand 1.
Qk = 0 indicates that Vk holds the value of operand 2.
Qj = (m, j), where m ≠ 0, indicates that the 1st operand needs to be taken
from the output of the instruction with id 'j' currently executing in the mth unit.
Qk = (m, j), where m ≠ 0, indicates that the 2nd operand needs to be taken from
the output of the instruction with id 'j' currently executing in the mth unit.
Busy = 0: the execution unit is free.
Busy = 1: it is waiting for input.
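
How a reservation station uses these fields while snooping the CDB can be sketched with a small model. This is a Python illustration with hypothetical names (`ReservationStation`, `issue`, `snoop`, `ready`); it covers only the Qj/Qk/Vj/Vk/Busy fields and assumes that a CDB broadcast carries the unit number, the instruction id and the result value.

```python
class ReservationStation:
    """Hypothetical model of the Qj, Qk, Vj, Vk and Busy fields."""

    def __init__(self):
        self.qj = 0      # 0, or (unit, instr_id) producing operand 1
        self.qk = 0      # 0, or (unit, instr_id) producing operand 2
        self.vj = None   # value of operand 1, valid when qj == 0
        self.vk = None   # value of operand 2, valid when qk == 0
        self.busy = 0

    def issue(self, vj=None, qj=0, vk=None, qk=0):
        """Filled in by the issue unit: a value or a pending tag per operand."""
        self.vj, self.qj, self.vk, self.qk = vj, qj, vk, qk
        self.busy = 1

    def snoop(self, unit, instr_id, value):
        """Capture a CDB broadcast if it matches a pending operand tag."""
        tag = (unit, instr_id)
        if self.qj == tag:
            self.vj, self.qj = value, 0
        if self.qk == tag:
            self.vk, self.qk = value, 0

    def ready(self):
        """Both operands are available, so execution can start."""
        return self.busy == 1 and self.qj == 0 and self.qk == 0

rs = ReservationStation()
rs.issue(vj=7, qk=(2, 5))      # operand 2 comes from instruction 5 in unit 2
rs.snoop(2, 4, 99)             # same unit, different instruction: ignored
ready_before = rs.ready()
rs.snoop(2, 5, 35)             # the awaited result arrives
ready_after = rs.ready()
```

The ignored broadcast shows why the instruction id is needed in addition to the unit number: unit 2 may have several instructions in its pipeline at once.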
Effective address | Unit number accessing it | Instruction id

Figure 4: An entry in the associative memory

Using these units, the various hazards are handled. The memory status register needs
some explanation. It is used to handle the memory aliasing problem, which
occurs in the situation given below:
SD [R3+300], R4
LD R2, [R0+100]
A read-after-write conflict will occur if R3+300 = R0+100. To handle this problem, the
associative memory status register is used. Each entry in the associative memory is shown in
Figure 4.
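
The aliasing check can be sketched as a lookup keyed by effective address. This is a Python illustration under assumptions: the class name `MemoryStatus` and its methods are hypothetical, a dictionary stands in for the associative memory, and the capacity is taken to be A3 times the number of LSU pipeline stages.

```python
class MemoryStatus:
    """Hypothetical model of the associative memory status indicator."""

    def __init__(self, capacity):
        self.capacity = capacity  # assumed: A3 x number of LSU pipeline stages
        self.entries = {}         # effective address -> (unit, instr_id)

    def record_store(self, address, unit, instr_id):
        """Note that a store to `address` is in flight in the given unit."""
        if address not in self.entries and len(self.entries) >= self.capacity:
            raise RuntimeError("memory status indicator is full")
        self.entries[address] = (unit, instr_id)

    def check_load(self, address):
        """Return the (unit, instr_id) a load must wait for, or None."""
        return self.entries.get(address)

    def complete(self, unit, instr_id):
        """Clear the entries of an instruction that has written back."""
        done = [a for a, owner in self.entries.items() if owner == (unit, instr_id)]
        for address in done:
            del self.entries[address]

status = MemoryStatus(capacity=4)
status.record_store(304, unit=6, instr_id=9)  # SD with effective address 304
no_alias = status.check_load(100)             # LD from a different address
alias = status.check_load(304)                # LD from the address being stored
status.complete(6, 9)
after_writeback = status.check_load(304)
```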
Whenever a load is issued, it checks whether the associative memory holds its address.
If it does, the load reads the value from the CDB itself, when the unit pointed to
by the entry in the associative memory completes the instruction specified by the
entry. This is called Tomasulo's scoreboard technique. The architecture shown above
is basically meant for handling the data hazards. For the other two hazards, a separate
kind of architecture is not necessary. Firstly, the structural hazard cannot be avoided. To
handle the control hazard we can do one of the following.
We stall the pipeline until completion of this instruction
We can use branch predictors
In this project, you will stall instructions till the branch condition is evaluated.



Cache design

The cache is used to bridge the gap between the speeds of the fast processor and the slow
main memory. The cache memory is smaller and faster than the main memory. It
sits between the processor and the main memory and holds data from a portion of main
memory that is being referenced locally. The use of a cache is motivated by the principle of locality
of reference. There are basically 2 types of cache, viz. the fully associative and the direct
mapped cache. We use a cache which is a combination of both, the set associative cache.

The structure of a cache entry is given below.

Tag | Data | V
Tag | Data | V
Tag | Data | V
Tag | Data | V

Figure 5: An entry in the set associative cache memory

V: Validity of data.
D: Dirty bit. If 1, it indicates that the data has been written by the processor and
is inconsistent with the data in the memory. If 0, it has not been modified by the processor.
Caches use two policies for writing to memory:
1. Write through: If a value is to be written, then it is updated in the cache and also
written to the main memory immediately.
2. Write back: The value is written only to the cache and written into memory only if a
location with D=1 and V=1 is to be replaced.
You will design a Cache unit with a write back policy for this project.
The set associative cache has C2 cache lines, each of which can hold up to C3 cache entries. In other
words, we design a C3-way set associative cache with C2 lines. So, up to C3 addresses that map to the same line can
be held without having to replace a cache entry. The tag is the most significant portion of the address,
which is not used in selecting the cache line and is therefore stored to identify the entry uniquely. The
least significant log2(C2) bits of the main memory address are decoded to select a particular cache
line. Hence assume C2 to be a power of 2.
The system bus has separate data lines and address lines. A cache access falls into one of four cases:
1. Read hit - The cache line holds the value being searched for.
2. Read miss - The cache line does not hold the data, so it needs to be fetched from
the memory.
3. Write hit - The cache entry to be written into is already in the cache, so the update can
be done in the cache only.
4. Write miss - The entry to be written is not in the cache. If the cache entry chosen for replacement is valid and dirty, its data has to be written to main
memory first; the new data is then written to this cache entry.
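
The four cases, together with the write-back policy and the tag/index split, can be tried out on a small behavioral model. This is a Python sketch under stated assumptions, not the required design: the class name `SetAssociativeCache` is hypothetical, each entry holds one word at a word address, a dictionary stands in for main memory, and least-recently-used replacement is assumed because the text does not name a replacement policy.

```python
class SetAssociativeCache:
    """Hypothetical C3-way, C2-line write-back cache model."""

    def __init__(self, c2, c3, memory):
        assert c2 > 0 and c2 & (c2 - 1) == 0, "C2 must be a power of 2"
        self.c2, self.c3 = c2, c3
        self.memory = memory                  # dict: word address -> value
        self.lines = [[] for _ in range(c2)]  # per line: entries, oldest first
        self.writebacks = 0

    def _lookup(self, address):
        """Return (entry, hit); on a miss, allocate and evict if needed."""
        tag, index = address // self.c2, address % self.c2
        line = self.lines[index]
        for entry in line:
            if entry["v"] and entry["tag"] == tag:
                line.remove(entry)
                line.append(entry)  # mark as most recently used
                return entry, True
        if len(line) == self.c3:
            victim = line.pop(0)    # assumed LRU replacement
            if victim["v"] and victim["d"]:
                # Write back only a valid, dirty entry.
                self.memory[victim["tag"] * self.c2 + index] = victim["data"]
                self.writebacks += 1
        entry = {"tag": tag, "data": self.memory.get(address, 0), "v": 1, "d": 0}
        line.append(entry)
        return entry, False

    def read(self, address):
        entry, hit = self._lookup(address)
        return entry["data"], hit

    def write(self, address, value):
        entry, hit = self._lookup(address)
        entry["data"], entry["d"] = value, 1  # memory is not updated here
        return hit

main_memory = {0: 11, 4: 22, 8: 33}       # addresses 0, 4 and 8 share line 0
cache = SetAssociativeCache(c2=4, c3=2, memory=main_memory)
first = cache.read(0)                      # read miss
second = cache.read(0)                     # read hit
write_hit = cache.write(4, 99)             # write miss, entry becomes dirty
memory_before = main_memory[4]             # main memory still holds the old value
third = cache.read(8)                      # evicts the clean entry for address 0
fourth = cache.read(0)                     # evicts the dirty entry for address 4
```

With C2 = 4 and C3 = 2, the third distinct address mapping to line 0 forces a replacement, and main memory is only updated when the dirty entry is evicted.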

Parameter list:
A1, A2, A3, P1, C1, C2, C3, N
A1 - Number of CLAs in the processor.
A2 - Number of WTMs in the processor.
A3 - Number of LSUs in the processor.
P1 - Number of instructions fetched at a time.
C1 - Number of execution units whose results are to be committed simultaneously.
C2 - Number of cache lines in the set associative cache.
C3 - Number of cache entries held by a cache line in the set associative cache; in other
words, the number of ways in the set-associative cache.
Once the RTL is developed, the next document would give you the verification plan, which
can enable you to do the Functional Verification of your RTL.


Your Verilog code must follow the synthesis guidelines that are discussed in class. You
will be required to take the design through the various steps of the design flow later. The primary
requirement for those stages is that the code is synthesizable. Further instructions will be
given as you proceed. Remember that this is a group project and partitioning of your design
is an absolute requirement. Use your time judiciously.
Unlike project specifications for other groups, grading scheme for the report is not provided.
I will talk to the groups and decide on the grading policy.


