RP RISC Timing

Non-Pipelined Single Cycle RISC Timing RP_RISC_timing.docx
Non-Pipelined Single Cycle RISC Timing

Robert Plachno
December 1999
http://www.elqc.com/RobertPlachno/RP_RISC_timing.pdf
Table of Contents
1 Summary
2 CISC Design Overview
  2.1 Design Constraints
  2.2 Complex Instructions
3 Pipelined RISC
  3.1 Example Pipeline
  3.2 Hazards
4 RP RISC
  4.1 Cycle Time
  4.2 Harvard Architecture
  4.3 Instruction Memory
  4.4 Register File
  4.5 Memory Precharge
  4.6 Address Calculations
    4.6.1 Operand Address Calculation
    4.6.2 Instruction Address Calculation
  4.7 Instruction Set
  4.8 Cross-Assembler
  4.9 Usage
5 Bibliography
1 Summary
http://www.elqc.com/RobertPlachno/processor.htm Page 1 of 12
The RP RISC had unique timing that obtained a true single cycle operation. This RP RISC
replaced an existing CISC design achieving a dramatic performance increase for an embedded
controller application that was manufactured in high volume (~2.5M/month for ~7 years).
The CISC being replaced had 3 issues that the new RP RISC solved.
1. It was too slow for new real-time features added to the product families.
2. The CISC was the first block to fail as the power supply was lowered, causing yield issues for
PC notebook product variants.
3. The Chairman of the Board wanted to eliminate the royalties paid to the CISC licensor.
This document describes the timing of the RP RISC single cycle instruction execution. This was a
pure architecture improvement remaining on the same process technology and using the same
input clock frequency as the original CISC design. For comparison, CISC and pipelined RISC
architectures are overviewed first, before the RP RISC is described.
Refer to the related papers, including the paper on the single cycle RISC (Plachno, A True Single
Cycle RISC Processor without Pipelining, 2009) and the ALU description document (Plachno, ALU
Transistor Level Description, 1999).
2 CISC Design Overview

2.1 Design Constraints
The CISCs designed in the 1970s were NMOS technology integrated circuits. Power
consumption and die size were a major design concern. The early designs saved transistor
count by having only an 8-bit bus implementing micro-programmed multi-cycle operations to
process each instruction.
The design had to be regular enough to be dynamically clocked. The logic required precharging
and then clocking to process. Later CMOS technology, also known as “idiot logic”, removed this
design constraint. CMOS logic gates do not have the static power dissipation of NMOS gates.
2.2 Complex Instructions

The Intel 8080 was not the CISC used in the original design, but it represents the classical CISC
architecture for a microprocessor. See the table below for the instruction set of the Intel 8080.
This table is from the “Intel 8080 Microcomputer Systems User Manual September 1975” page
4-15.
Note the “Clock Cycles” right-most column in the above table. A fast instruction took 4 cycles
while others took as many as 18 clock cycles.
Each CISC instruction was micro-programmed to execute in multiple clock cycles. The number
of clock cycles per instruction depended on the specific instruction. The opcode is fetched as
the first byte from memory and then decoded to sequence the remaining micro-programmed
operations. Most instructions required reading multiple bytes from memory. The control for each
clock stage is generated by a PLA-like sequencer implemented with a known precharge and access
design technique that avoids static power dissipation. The NMOS designs were
regular and flexible in their micro-programming.
3 Pipelined RISC
RISCs have a tremendous performance advantage over the CISC architecture. The multiple cycle
operation is standardized into a pipelined structure that finishes an instruction every clock
cycle. Each instruction still has the latency delay of the multi-cycle operation of the pipeline, but
one instruction completes every cycle. Thus, the average multi-cycle CISC instruction evaluation
will decrease to approximately one cycle in a RISC. If a CISC has an average of 7 cycles per
instruction, then this will be reduced to approximately 1 cycle in a RISC resulting in a 7 to 1
performance improvement (7x faster).
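The arithmetic behind the 7x figure can be sketched as a ratio of cycles per instruction at an unchanged clock frequency. The function name and parameters below are illustrative assumptions, not part of the original design, and the sketch deliberately ignores any change in instruction count.

```python
def cpi_speedup(cisc_cpi: float, risc_cpi: float = 1.0) -> float:
    """Speedup from cycles-per-instruction alone.

    Assumes the same clock frequency and the same number of
    instructions on both machines (a simplification).
    """
    if cisc_cpi <= 0 or risc_cpi <= 0:
        raise ValueError("cycles per instruction must be positive")
    return cisc_cpi / risc_cpi


if __name__ == "__main__":
    # The text's example: an average of 7 cycles per CISC instruction.
    print(cpi_speedup(7.0))
```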
The more complex CISC instructions are implemented using multiple instructions in the RISC.
This is where the “reduced instruction set” comes from in the RISC acronym. The CISC should
require less memory to represent the instruction programming than the RISC. The RP RISC had
a 15% (1.15x) code size disadvantage relative to the original CISC coding.
3.1 Example Pipeline

An example 5-stage RISC pipeline is shown below.

Stage   Operation
1       Instruction Fetch
2       Instruction Decode
3       Execution
4       Memory Access
5       Write-back
Each stage is designed so that it is independent and can be operated in parallel to the other
stages. At any time, there are 5 unfinished instructions being processed in the pipeline.
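As a rough illustration of the overlap, the following sketch lists which instruction occupies each stage in a given cycle of an ideal pipeline with no stalls. The stage abbreviations and function names are assumptions made for this example.

```python
STAGES = ["IF", "ID", "EX", "MEM", "WB"]  # the 5 stages from the table


def in_flight(cycle: int, n_instructions: int) -> dict:
    """Map stage name -> instruction index for a 0-based cycle number.

    Ideal pipeline: instruction i enters IF in cycle i and advances
    one stage per cycle, with no hazards or stalls.
    """
    occupancy = {}
    for depth, stage in enumerate(STAGES):
        instr = cycle - depth
        if 0 <= instr < n_instructions:
            occupancy[stage] = instr
    return occupancy


def total_cycles(n_instructions: int) -> int:
    """Cycles to finish n instructions: fill latency plus one per instruction."""
    return n_instructions + len(STAGES) - 1


if __name__ == "__main__":
    print(in_flight(4, 10))   # the pipeline is full from cycle 4 onward
    print(total_cycles(10))
```

Once the pipeline is full, all five stages hold a different unfinished instruction, which is the situation the paragraph above describes.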
3.2 Hazards
The last instructions inserted into the pipeline may be dependent upon the unexecuted
instructions in front of them. This causes inefficiencies in the pipeline performance. The CISC
did not have this problem since the CISC waited for each instruction to finish before starting a
new instruction.
For example, the program may be summing several values together such as A = A + B + C. The
later instructions require the previous sum to be totaled before it can be used as an input for a
new addition. This may be solved by having the assembler insert a NOP (no-operation) into the
pipeline to give the earlier instruction time to finish. This adds complexity to the pipelined
design that the CISC did not require.
Another example happens for conditional branching. The hardware can make an intelligent
guess for what direction the code will branch but occasionally the guess will be wrong. The
same code will branch in different directions based on the expression evaluation, so it is always
a guess for the hardware. For loops, it can be assumed that there is a higher probability of
branching back in the loop. However, on occasion the loop will break. This means the hardware
should assume fetching the instruction to jump backwards instead of PC+1 (program counter
plus one or just the next instruction in order). If the expression is calculated to be different
from the instruction fetch assumption, then a “flush” is performed to remove the wrong
instructions presently in the pipeline. The correct instructions then need to be fetched and
loaded into the pipeline.
These and other hazards decrease the efficiency of a pipelined RISC to execute one instruction
per cycle. Optimizing the performance of the hazards adds complexity to the design.
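The cost of these hazards is commonly estimated with a simple average model: one cycle per instruction plus the average stall and flush cycles. The sketch below uses that textbook approximation; every input value is a hypothetical placeholder, not a measurement from any design discussed here.

```python
def effective_cpi(stall_cycles_per_instr: float,
                  branch_fraction: float,
                  mispredict_rate: float,
                  flush_penalty_cycles: float) -> float:
    """Average cycles per instruction for a pipelined RISC with hazards.

    1.0 is the ideal rate; data-hazard stalls (NOPs) and branch flushes
    are added as averages. All inputs are assumed values.
    """
    branch_cost = branch_fraction * mispredict_rate * flush_penalty_cycles
    return 1.0 + stall_cycles_per_instr + branch_cost


if __name__ == "__main__":
    # Hypothetical: 0.1 stall cycles per instruction, 20% branches,
    # 10% of them mispredicted, 2 cycles lost per flush.
    print(effective_cpi(0.1, 0.2, 0.1, 2))
```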
4 RP RISC
The best way to understand the RP RISC timing is to focus on the operation of the memories
interfacing to the processor. The RP RISC is not a fully synchronous design; it allows some
processing across clock edges to gain an advantage.
4.1 Cycle Time

The clock frequency is limited to the slowest operation in the design. For the RP RISC this is the
instruction fetch. The access time of the instruction memory defines the cycle time. Since the
RP RISC uses the same instruction memory design as the previous CISC, the input clock
frequency remains unchanged.
This is an inefficiency of any multi-cycle design. Each stage in the CISC or pipelined RISC
designs above is allotted the same time period. For example, in the 5-stage pipelined design
shown above it is likely that the “instruction decode” processes much
faster than an “instruction fetch”, wasting the remainder of the allocated time. A RISC is
expected to have a simple “instruction decode” that processes quickly. The delay of each stage
is set by the slowest operation.
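A small sketch of this point: the common stage period equals the slowest stage delay, and every faster stage idles for the difference. The delay values are invented for illustration only; the source gives no stage timings.

```python
def cycle_time(stage_delays: dict) -> float:
    """The clock period is set by the slowest stage."""
    return max(stage_delays.values())


def idle_time(stage_delays: dict) -> dict:
    """Time each stage spends waiting for the clock period to expire."""
    period = cycle_time(stage_delays)
    return {stage: period - delay for stage, delay in stage_delays.items()}


if __name__ == "__main__":
    # Hypothetical delays in nanoseconds (not from the original design).
    delays = {"fetch": 10, "decode": 3, "execute": 6, "memory": 9, "writeback": 4}
    print(cycle_time(delays))
    print(idle_time(delays))
```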
4.2 Harvard Architecture

The RP RISC is implemented using a “Harvard Architecture”. There are two separate memories:
one for instructions and another for data. The RP RISC is constructed using independent
hardware for the instruction unit and the execution unit. Since the instruction unit interfaces
with a different memory and buses than the execution unit, these two separate units operate
independently and in parallel.
The instruction unit is tasked with fetching the next instruction to be executed. The execution
unit is tasked with processing the current instruction. These two tasks are independent and are
processed in parallel.
Reviewing the 5-stage pipelined RISC described above, it is apparent that its implementation is
not a “Harvard Architecture”. The first two cycles are wasted for instruction fetch and
instruction decode. The RP RISC operates in a different manner.
4.3 Instruction Memory

The Instruction Memory is a ROM built out of the same 8K x 8 block repeated 3 times. The CISC
was organized vertically for a 24K x 8 memory reading to an 8-bit bus. The RISC is organized
horizontally for an 8K x 24 memory reading a 24-bit instruction word to a 24-bit bus.
The CISC may have to perform multiple reads within the micro-coding to obtain the full
instruction. For example, a CISC “add R, R” for register-to-register addition requires 3 bytes. The
equivalent RISC “add R, R” for register file addition also requires 3 bytes but these are read as a
24-bit instruction word in one clock cycle. The Zilog CISC “add R, R” performs in 10 cycles while
the RISC completes in 1 cycle.
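The two organizations of the same three ROM blocks can be modeled as follows. This is a minimal sketch: the function names are hypothetical, and the choice of which block supplies the most significant byte of the 24-bit word is an assumption the source does not state.

```python
BLOCK_DEPTH = 8 * 1024  # each ROM block is 8K x 8


def read_vertical(blocks, addr: int) -> int:
    """CISC view: 24K x 8. The blocks are stacked, one byte per read."""
    return blocks[addr // BLOCK_DEPTH][addr % BLOCK_DEPTH]


def read_horizontal(blocks, addr: int) -> int:
    """RISC view: 8K x 24. All three blocks are read at the same address.

    Assumption: block 0 holds the most significant byte.
    """
    high, mid, low = (block[addr] for block in blocks)
    return (high << 16) | (mid << 8) | low


if __name__ == "__main__":
    blocks = [bytearray(BLOCK_DEPTH) for _ in range(3)]
    blocks[0][5], blocks[1][5], blocks[2][5] = 0x12, 0x34, 0x56
    print(hex(read_horizontal(blocks, 5)))            # one 24-bit read
    print(hex(read_vertical(blocks, BLOCK_DEPTH + 5)))  # one of three byte reads
```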
4.4 Register File

The RP RISC is a two-operand processor. For an “ADD” operation:
add R1, R2 means R1 is assigned to R1 + R2
This requires reading two operands from the register file as inputs to the ALU and then writing
back the ALU result to the destination register. In total this has three memory read/write
operations that must occur in the same time frame but in parallel to the instruction fetch.
The size of the register file is smaller than the instruction memory. “Smaller” means less
capacitance that translates into being “faster”. Thus, it is possible to “double pump” the
register file. The register file can be accessed twice in the same time frame as the instruction
memory. This is simply due to the fact that the register file is smaller and faster. The RP RISC
register file is addressed by 7 bits for 2^7 or 128 bytes. When the MSB bit is high an embedded
register (user-defined address-mapped and not from the register file) is accessed instead.
The register file reads two operands in the first access and writes the result in the second
access. Since two reads are performed in the first access this requires a dual-port RAM. The first
port is read/write. The second port is only for reading. The memory cell design used in the
register file is shown below.
4.5 Memory Precharge

Since the RP RISC hides some functionality in the memory precharge times it will be important
to understand some of the internal operation of memories. Embedded memories used in
integrated circuits require precharge times. Nobody designs memories with static loads on the
bit lines or data lines (BL’s in the above schematic) because the static power dissipation would
be too large. The bit lines are precharged in the first half period and then released (dynamic)
during access.
The address set-up times are actually measured to the internal half-period boundary where the
precharge ends. During the precharge time the input addresses can still transition as long as
enough time is provided to set up the word line drivers. This leaves otherwise unused time for
both the execution unit and instruction unit to calculate the address values.
To read the register file, both bit lines start dynamically precharged high, but one bit line is
pulled low by the read accessed cell. During the access period, the word lines are actively
driven, the differential bit line data is sensed, and the value is output from the memory.
To write the register file, the data is forced on to the bit lines. Writing is simpler than reading. It
is the forced low bit line that will flip the cross-coupled memory cell. The data set-up time for
writing the register file has good margin. The cell to be written is accessed and the bit lines are
forced to the external data values. The data set-up time is measured to the last edge just prior
to the next precharge.
The read memory access will be valid prior to the last clock edge. There is no need to use a flip-
flop to capture the memory output. Instead, a latch is used to allow the memory output to
ripple through and then latch the valid output data prior to the next clock edge.
The execution unit has more time than one register file precharge period to process. The two
read operand values are allowed to ripple through to the inputs of the execution unit. There are
no flip-flops in the execution unit. The operand latches are closed to save the operand values
from changing when the write precharge starts. The data output result of the ALU is written
back on the final register file access.
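The sequence of events in one instruction cycle can be summarized in a behavioral sketch: two operands are read in the first register file access, held in latches, processed by the ALU, and the result is written in the second access. This models ordering only, not circuit timing; the class and function names are hypothetical, and the 8-bit register width is an assumption based on the 128-byte register file.

```python
class RegisterFile:
    """Behavioral model of a dual-port register file (128 x 8 assumed)."""

    def __init__(self, size: int = 128):
        self.cells = [0] * size

    def read2(self, addr_a: int, addr_b: int):
        # First access: both ports read in parallel.
        return self.cells[addr_a], self.cells[addr_b]

    def write(self, addr: int, value: int) -> None:
        # Second access: only the read/write port is used.
        self.cells[addr] = value & 0xFF


def execute_add(rf: RegisterFile, dst: int, src: int) -> int:
    """Model of 'add dst, src' within a single instruction cycle."""
    # Access 1: read both operands; the values are then held in latches
    # so they do not change when the write precharge begins.
    latch_a, latch_b = rf.read2(dst, src)
    # The ALU result ripples through with no flip-flops in the path.
    result = (latch_a + latch_b) & 0xFF
    # Access 2: write the result back to the destination register.
    rf.write(dst, result)
    return result


if __name__ == "__main__":
    rf = RegisterFile()
    rf.write(1, 200)
    rf.write(2, 100)
    print(execute_add(rf, 1, 2))  # 8-bit wraparound of 300
```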
The ALU design in the EBOX, execution unit, is described in the paper: alu_description.pdf. The
ALU uses mux-logic for a high-speed design since it is required to process within a little over a
quarter cycle. This design style provides higher performance, lower power, and lower cost.
4.6 Address Calculations

4.6.1 Operand Address Calculation
The RP RISC has several indirect addressing capabilities. The operands can be indirectly
addressed using a page register. Instead of being added to the page register, the register
address from the instruction word is ‘OR’ed with it. The higher order page register bits are
used to page the register file address.
By using an ‘OR’ instead of a full addition (inst_reg_address | page_reg), the operand address
calculation avoids a carry ripple, minimizing the calculation time. If the register address on the
instruction word is zero, then the page register is simply the indirect address. This concept allows
more timing margin.
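A short sketch of why the OR can stand in for an addition: when the page register and the instruction's register address occupy disjoint bit positions, no carries are generated, so OR and ADD give the same result. The example bit patterns are arbitrary illustrations, not the actual field layout.

```python
def operand_address(inst_reg_address: int, page_reg: int) -> int:
    """Paged operand address formed with a bitwise OR (no carry chain)."""
    return inst_reg_address | page_reg


def or_equals_add(inst_reg_address: int, page_reg: int) -> bool:
    """OR matches ADD exactly when the two values share no set bits."""
    return (inst_reg_address & page_reg) == 0


if __name__ == "__main__":
    # Hypothetical values: page in the upper bits, offset in the lower bits.
    page, offset = 0x40, 0x05
    print(hex(operand_address(offset, page)))
    print(or_equals_add(offset, page))
    # A zero register address yields the page register itself.
    print(hex(operand_address(0x00, page)))
```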
4.6.2 Instruction Address Calculation

Similar to the operand address calculation, the next instruction address is calculated without
waiting for a carry ripple.
Normally a processor uses relative addressing for the next instruction address for conditional
branching. Other processors have an offset on the instruction word that is added to the
program counter (PC) register. Instead, the RP RISC only uses absolute addressing. The
instruction word contains the full 16 address bits allowed for the next address on branching
instructions.
This means that the next instruction address is a selection of either the absolute address on the
instruction word or PC+1. This mux is very quick to process. PC+1 is from a synchronous counter
and is already available at the start of the precharge. This concept allows more timing margin.
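The selection described above amounts to a two-input multiplexer. In the sketch below, the function and parameter names are assumptions; the 16-bit address width comes from the text.

```python
ADDR_MASK = 0xFFFF  # 16 address bits, per the text


def next_instruction_address(pc_plus_1: int, branch_target: int,
                             take_branch: bool) -> int:
    """Select the absolute branch target or the precomputed PC+1.

    No adder sits in this path: PC+1 comes from a synchronous counter
    and the target comes straight from the instruction word.
    """
    selected = branch_target if take_branch else pc_plus_1
    return selected & ADDR_MASK


if __name__ == "__main__":
    print(hex(next_instruction_address(0x0101, 0x2000, take_branch=True)))
    print(hex(next_instruction_address(0x0101, 0x2000, take_branch=False)))
```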
4.7 Instruction Set

Prior to comparing functional performance, the instruction set of the RP RISC needs to be
described. Shown below is the list of instructions with the instruction decode truth table.
The machine language coding of the 24-bit instruction word is shown on the far left with the
assembly language description and mnemonics (operation) in the next column.
The main control lines input to the datapaths (EBOX, IBOX, etc. for the execution unit,
instruction unit, etc.) are shown on the right. The RP RISC is a semi-custom implementation
with custom datapath blocks controlled by standard cell routed logic. The datapaths are all
grouped together with the register file into one physical block while the standard cell routing is
mixed in with the rest of the chip. For example, LFU3, LFU2, LFU1, and LFU0 within the EBOX
section are the Logical Function Unit control lines (programmable truth table) in the ALU in the execution
unit. These control lines are the result of the instruction decoding from the far-left byte
(opcode) of the instruction word.
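One common way four control lines can act as a programmable truth table is to treat them as the four output values of a two-input logic function applied to each bit position. The sketch below illustrates that general idea only; the mapping of LFU3..LFU0 to input combinations and the 8-bit width are assumptions, and the actual encoding is given in the ALU description paper.

```python
def logical_function_unit(a: int, b: int, lfu: int, width: int = 8) -> int:
    """Apply a 4-entry truth table to each bit pair of a and b.

    Assumed mapping: bit (2*a_bit + b_bit) of the 4-bit 'lfu' value
    is the output for that input combination, so LFU3 is the output
    for a_bit = 1, b_bit = 1 and LFU0 for a_bit = 0, b_bit = 0.
    """
    result = 0
    for i in range(width):
        a_bit = (a >> i) & 1
        b_bit = (b >> i) & 1
        index = (a_bit << 1) | b_bit
        result |= ((lfu >> index) & 1) << i
    return result


# Truth tables under the assumed mapping.
LFU_AND = 0b1000
LFU_OR = 0b1110
LFU_XOR = 0b0110

if __name__ == "__main__":
    print(bin(logical_function_unit(0b1100, 0b1010, LFU_AND)))
    print(bin(logical_function_unit(0b1100, 0b1010, LFU_OR)))
    print(bin(logical_function_unit(0b1100, 0b1010, LFU_XOR)))
```

With this style of control, one datapath block covers all 16 two-input logic functions by changing only the four control values.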
Every instruction is one clock cycle except for the load/store operations that need two
accesses of the instruction (or external) memory. The load/store requires two cycles since the
instruction fetch accesses the memory for the first of those two cycles.
4.8 Cross-Assembler
The advantage of the CISC micro-programming is moved into the RP RISC cross-assembler. The
original CISC code is automatically translated into the RP RISC native instruction set. The
complex instructions of the CISC may translate into one or more RISC instructions.
A good example is the Zilog “djnz” instruction that stands for “Decrement and Jump if Not
Zero”. The RP RISC assembler translates this to two native instructions for add -1 and then the
conditional branch (add #%FF, jr nz).
djnz Instruction

           # Cycles                # Bytes
CISC       12 if jump taken        2
           10 if jump not taken
RP RISC    2                       6
This demonstrates that the CISC does have an advantage over the RISC since the CISC source
code will fit in less memory space. However, the RISC has a performance improvement over the
CISC by requiring dramatically fewer cycles to execute.
The RP RISC was found to use only 15% more instruction memory than the original CISC. This
statistic is helped by the fact that the ROM also stored data look up tables (LUTs).
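The djnz trade-off can be worked through with a small sketch that performs the translation and counts cycles and bytes for the loop-control overhead alone. The cycle and byte figures come from the table above; the tuple format, operand order, and function names are assumptions for illustration, not the real cross-assembler's output format.

```python
CISC_DJNZ_TAKEN = 12      # cycles, from the table
CISC_DJNZ_NOT_TAKEN = 10  # cycles, from the table
CISC_DJNZ_BYTES = 2
RISC_BYTES_PER_INSTR = 3  # one 24-bit instruction word


def translate_djnz(reg: str, target: str):
    """Translate 'djnz reg, target' into two native instructions.

    The operand layout of each tuple is an assumed representation.
    """
    return [("add", reg, "#%FF"),   # add -1 to the counter register
            ("jr", "nz", target)]   # branch back while not zero


def cisc_loop_cycles(iterations: int) -> int:
    """djnz overhead: the jump is taken on every pass except the last."""
    return CISC_DJNZ_TAKEN * (iterations - 1) + CISC_DJNZ_NOT_TAKEN


def risc_loop_cycles(iterations: int) -> int:
    """One cycle per native instruction, taken or not."""
    return iterations * len(translate_djnz("r0", "loop"))


def risc_djnz_bytes() -> int:
    return len(translate_djnz("r0", "loop")) * RISC_BYTES_PER_INSTR


if __name__ == "__main__":
    n = 10
    print(cisc_loop_cycles(n), risc_loop_cycles(n))
    print(CISC_DJNZ_BYTES, risc_djnz_bytes())
```

For a 10-pass loop this gives 118 cycles of djnz overhead on the CISC against 20 on the RP RISC, at a cost of 6 bytes instead of 2.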
Moving the CISC design to the RISC version required assembling the existing code into the
native RP RISC instructions. On first viewing the new simulation waveforms, the engineer
simulating the design did not think it was working. Then he realized that the
expected functionality was just condensed to the left of his computer screen due to the
phenomenal processing improvement. There were a few places in the original code where the
software was self-timed requiring a software rethink. In general, the RP RISC redesign was
relatively painless.
4.9 Usage
This single cycle architecture is not an educational exercise. The RP RISC was used in 2
different product lines, with 3 different designs per year. Each of these two product lines had
its own unique software that was not derived from the other's. The RP RISC instruction set
met the criteria required for real product applications.
Other non-timing features of the RP RISC include indirect addressing, interrupts, power down,
multi-bit barrel shifting, user defined condition codes, and direct access of external embedded
registers. The RP RISC could indirectly access the register file, external memory, and subroutine
invokes. Indirect addressing was used by the code for look-up table solutions instead of physical
models.
The RP RISC did have an improved voltage supply margin over the original CISC. Unlike with the
CISC design, on the full mixed-signal chip the analog circuits now fail before the processor as
the supply voltage is lowered, increasing the margin by over ½ volt. This improved the
manufacturing yields.
The cycle time of the RP RISC had to remain the same as the original CISC. The double-pumped
register file takes advantage of the size difference between the data and instruction
memories.
The RP RISC achieved true single cycle instruction execution. The RP RISC architecture avoids
the complexity and inefficiency from hazards of a Pipelined RISC architecture.
5 Bibliography
Plachno, R. (1999, December). Processor Design. Retrieved from Robert S Plachno:
http://www.elqc.com/RobertPlachno/alu_description.pdf
Plachno, R. (2009, March). A True Single Cycle RISC Processor without Pipelining. ESS Design
White Paper – RISC Embedded Controller. Retrieved from
http://www.elqc.com/RobertPlachno/RP_RISC.pdf
