Reduced Instruction Set Computers
Communications of the ACM, Volume 28, Number 1
Reduced instruction set computers aim for both simplicity in hardware and
synergy between architectures and compilers. Optimizing compilers are used
to compile programming languages down to instructions that are as
unencumbered as microinstructions in a large virtual address space, and to
make the instruction cycle time as fast as possible.
DAVID A. PATTERSON
© 1985 ACM 0001-0782/85/0100-0008 75¢

As circuit technologies reduce the relative cost of processing and memory, instruction sets that are too complex become a distinct liability to performance. The designers of reduced instruction set computers (RISCs) strive for both simplicity in hardware and synergy between architecture and compilers, in order to streamline processing as much as possible. Early experience indicates that RISCs can in fact run much faster than more conventionally designed machines.

BACKGROUND
The IBM System/360, first introduced in 1964, was the real beginning of modern computer architecture. Although computers in the System/360 "family" provided a different level of performance for a different price, all ran identical software. The System/360 originated the distinction between computer architecture (the abstract structure of a computer that a machine-language programmer needs to know to write programs) and the hardware implementation of that structure. Before the System/360, architectural trade-offs were determined by the effect on price and performance of a single implementation; henceforth, architectural trade-offs became more esoteric. The consequences of single implementations would no longer be sufficient to settle an argument about instruction set design.

Microprogramming was the primary technological innovation behind this marketing concept. Microprogramming relied on a small control memory and was an elegant way of building the processor control unit for a large instruction set. Each word of control memory is called a microinstruction, and the contents are essentially an interpreter, programmed in microinstructions. The main memories of these computers were magnetic core memories, and the small control memories were usually 10 times faster than core.

Minicomputer manufacturers tend to follow the lead of mainframe manufacturers, especially when the mainframe manufacturer is IBM, and so microprogramming caught on quickly. The rapid growth of semiconductor memories also speeded this trend. In the early 1970s, for example, 8192 bits of read-only memory (ROM) took up no more space than 8 bits of register. Eventually, minicomputers using core main memory and semiconductor control memory became standard in the minicomputer industry.

With the continuing growth of semiconductor memory, a much richer and more complicated instruction set could be implemented. The architecture research community argued for richer instruction sets. Let us review some of the arguments they advanced at that time:

1. Richer instruction sets would simplify compilers. As the story was told, compilers were very hard to build, and compilers for machines with registers were the hardest of all. Compilers for architectures with execution models based either on stacks or memory-to-memory operations were much simpler and more reliable.

2. Richer instruction sets would alleviate the software crisis. At a time when software costs were rising as fast as hardware costs were dropping, it seemed
appropriate to move as much function to the hardware as possible. The idea was to create machine instructions that resembled programming language statements, so as to close the "semantic gap" between programming languages and machine languages.

3. Richer instruction sets would improve architecture quality. After IBM differentiated architecture from implementation, the research community looked for ways to measure the quality of an architecture, as opposed to the speed at which implementations could run programs. The only architectural metrics then widely recognized were program size, the number of bits of instructions, and the bits of data fetched from memory during program execution (see Figure 1).

Memory efficiency was such a dominating concern in these metrics because main memory (magnetic core memory) was so slow and expensive. These metrics are partially responsible for the prevailing belief in the 1970s that execution speed was proportional to program size. It even became fashionable to examine long lists of executed instructions to see if a pair or triple of old instructions could be replaced by a single, more powerful instruction. The belief that larger programs were invariably slower programs inspired the invention of many exotic instruction formats that reduced program size.

The rapid rise of the integrated circuit, along with the arguments of the architecture research community in the 1970s, led to certain design principles that came to guide computer architecture:

- The memory technology used for microprograms was growing rapidly, so large microprograms would add little or nothing to the cost of the machine.
- Since microinstructions were much faster than normal machine instructions, moving software functions to microcode made for faster computers and more reliable functions.
- Since execution speed was proportional to program size, architectural techniques that led to smaller programs also led to faster computers.
- Registers were old-fashioned and made it hard to build compilers; stacks or memory-to-memory architectures were superior execution models. As one architecture researcher put it in 1978, "One's eyebrows should rise whenever a future architecture is developed with a register-oriented instruction set."¹

Computers that exemplify these design principles are the IBM 370/168, the DEC VAX-11/780, the Xerox

¹ Myers, G.J. The case against stack-oriented instruction sets. Comput. Archit. News 6, 3 (Aug. 1977), 7-10.
[Figure 1 diagram not reproduced. It shows the assembly code for A ← B + C under each model, with 8-bit opcodes, 4-bit register specifiers, and 16-bit addresses: register-to-register (Load rB ← B; Load rC ← C; Add rA ← rB, rC; Store rA → A), memory-to-register (Load B; Add C; Store A), and memory-to-memory (Add B, C, A).]

In this example, the three data words are 32 bits each and the address field is 16 bits. The metrics selected by research architects for deciding which architecture is best were the total size of executed instructions (I), the total size of executed data (D), and the total memory traffic M, the sum of I and D. These metrics suggest that a memory-to-memory architecture is the "best" architecture, and a register-to-register architecture the "worst." This study led one research architect in 1978 to suggest that future architectures should not include registers.

FIGURE 1. The Statement A ← B + C Translated into Assembly Language for Three Execution Models: Register-to-Register, Memory-to-Register, and Memory-to-Memory
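The arithmetic behind these metrics is simple enough to mechanize. The following sketch (an editorial illustration in Python, not part of the original article) tallies I, D, and M for A ← B + C using the field widths of Figure 1: 8-bit opcodes, 4-bit register specifiers, 16-bit addresses, and 32-bit data words.

```python
# Architectural metrics of Figure 1: I (instruction bits fetched),
# D (data bits fetched), and memory traffic M = I + D, computed for
# the statement A <- B + C under the three execution models.
OP, REG, ADDR, DATA = 8, 4, 16, 32

def metrics(instructions, data_accesses):
    """instructions: list of per-instruction field-width lists.
    Returns (I, D, M) in bits."""
    i = sum(sum(fields) for fields in instructions)
    d = data_accesses * DATA
    return i, d, i + d

# Register-to-register: Load rB,B; Load rC,C; Add rA,rB,rC; Store rA,A
reg_reg = metrics([[OP, REG, ADDR], [OP, REG, ADDR],
                   [OP, REG, REG, REG], [OP, REG, ADDR]], data_accesses=3)

# Memory-to-register: Load B; Add C; Store A
mem_reg = metrics([[OP, ADDR]] * 3, data_accesses=3)

# Memory-to-memory: Add B,C,A (one instruction, three operand accesses)
mem_mem = metrics([[OP, ADDR, ADDR, ADDR]], data_accesses=3)

print(reg_reg)  # (104, 96, 200)
print(mem_reg)  # (72, 96, 168)
print(mem_mem)  # (56, 96, 152)
```

By these counts the memory-to-memory model wins (M = 152 bits) and register-to-register loses (M = 200 bits), which is exactly the conclusion the rest of the article goes on to call into question.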
Dorado, and the Intel iAPX-432. Table I shows some of the characteristics of these machines.

Although computer architects were reaching a consensus on design principles, the implementation world was changing around them:

- Semiconductor memory was replacing core, which meant that main memories would no longer be 10 times slower than control memories.
- Since it was virtually impossible to remove all mistakes from 400,000 bits of microcode, control store ROMs were becoming control store RAMs.
- Caches had been invented; studies showed that the locality of programs meant that small, fast buffers could make a substantial improvement in the implementation speed of an architecture. As Table I shows, caches were included in nearly every machine, though control memories were much larger than cache memories.
- Compilers were subsetting architectures; simple compilers found it difficult to generate the complex new functions that were included to help close the "semantic gap." Optimizing compilers subsetted architectures because they removed so many of the unknowns at compile time that they rarely needed the powerful instructions at run time.

WRITABLE CONTROL STORE
One symptom of the general dissatisfaction with architectural design principles at this time was the flurry of work in writable control memory, or writable control store (WCS). Researchers observed that microcoded machines could not run faster than 1 microcycle per instruction, typically averaging 3 or 4 microcycles per instruction; yet the simple operations in many programs could be found directly in microinstructions. As long as machines were too complicated to be implemented by ROMs, why not take advantage of RAMs by loading different microprograms for different applications?

One of the first problems was to provide a programming environment that could simplify the task of writing microprograms, since microprogramming was the most tedious form of machine-language programming. Many researchers, including myself, built compilers and debuggers for microprogramming. This was a formidable assignment, for virtually no inefficiencies could be tolerated in microcode. These demands led to the invention of new programming languages for microprogramming and new compiler techniques.

Unfortunately for me and several other researchers, there were three more impediments that kept WCSs from being very popular. (Although a few machines offer WCS as an option today, it is unlikely that more than one in a thousand programmers take this option.) These impediments were

- Virtual memory complications. Once computers made the transition from physical memory to virtual memory, microprogrammers incurred the added difficulty of making sure that any routine could start over if any memory operand caused a virtual memory fault.
- Limited address space. The most difficult programming situation occurs when a program must be forced to fit in too small a memory. With control memories of 4096 words or less, some unfortunate WCS developers spent more time squeezing space out of the old microcode than they did writing the new microcode.
- Swapping in a multiprocess environment. When each program has its own microcode, a multiprocess operating system has to reload the WCS on each process switch. Reloading time can range from 1,000 to 25,000 memory accesses, depending on the machine. This added overhead decreased the performance benefits gained by going to a WCS in the first place.

These last two difficulties led some researchers to conclude that future computers would have to have virtual control memory, which meant that page faults could occur during microcode execution. The distinction between programming and microprogramming was becoming less and less clear.
THE ORIGINS OF RISCS
About this point, several people, including those who had been working on microprogramming tools, began to rethink the architectural design principles of the 1970s. In trying to close the "semantic gap," these principles had actually introduced a "performance gap." The attempt to bridge this gap with WCSs was unsuccessful, although the motivation for WCS (that instructions should be no faster than microinstructions and that programmers should write simple operations that map directly onto microinstructions) was still valid. Furthermore, since caches had allowed "main" memory accesses at the same speed as control memory accesses, microprogramming no longer enjoyed a ten-to-one speed advantage.

A new computer design philosophy evolved: Optimizing compilers could be used to compile "normal" programming languages down to instructions that were as unencumbered as microinstructions in a large virtual address space, and to make the instruction cycle time as fast as the technology would allow. These machines would have fewer instructions (a reduced set), and the remaining instructions would be simple and would generally execute in one cycle (reduced instructions); hence the name reduced instruction set computers (RISCs). RISCs inaugurated a new set of architectural design principles:

1. Functions should be kept simple unless there is a very good reason to do otherwise. A new operation that increases cycle time by 10 percent must reduce the number of cycles by at least 10 percent to be worth considering. An even greater reduction might be necessary, in fact, if the extra development effort and hardware resources of the new function, as they impact the rest of the design, are taken into account.

2. Microinstructions should not be faster than simple instructions. Since cache is built from the same technology as writable control store, a simple instruction should be executed at the same speed as a microinstruction.

3. Microcode is not magic. Moving software into microcode does not make it better; it just makes it harder to change. To paraphrase the Turing machine argument, anything that can be done in a microcoded machine can be done in assembly language in a simple machine. The same hardware primitives assumed by the microinstructions must be available in assembly language. The run-time library of a RISC has all the characteristics of a function in microcode, except that it is easier to change.

4. Simple decoding and pipelined execution are more important than program size. Imagine a model in which the total work per instruction is broken into pieces, and different pieces for each instruction execute in parallel. At the peak rate a new instruction is started every cycle (Figure 2). This assembly-line approach performs at the rate determined by the length of individual pieces rather than by the total length of all pieces. This kind of model gave rise to instruction formats that are simple to decode and to pipeline.

5. Compiler technology should be used to simplify instructions rather than to generate complex instructions. RISC compilers try to remove as much work as possible at compile time so that simple instructions can be used. For example, RISC compilers try to keep operands in registers so that simple register-to-register instructions can be used. Traditional compilers, on the other hand, try to discover the ideal addressing mode and the shortest instruction format to add the operands in memory. In general, the designers of RISC compilers prefer a register-to-register model of execution so that compilers can keep operands that will be reused in registers, rather than repeating a memory access or a calculation. They therefore use LOADs and STOREs to access memory so that operands are not implicitly discarded after being fetched, as in the memory-to-memory architecture (see Figure 3).

COMMON RISC TRAITS
We can see these principles in action when we look at some actual RISC machines: the 801 from IBM Research, the RISC I and the RISC II from the University of California at Berkeley, and the MIPS from Stanford University.

All three RISC machines have actually been built and are working (see Figure 4). The original IBM 801 was built using off-the-shelf MSI ECL, whereas the Berkeley RISCs and the Stanford MIPS were built with custom NMOS VLSI. Table II shows the primary characteristics of these three machines.

Although each project had different constraints and goals, the machines they eventually created have a great deal in common:

1. Operations are register-to-register, with only LOAD and STORE accessing memory. Allowing compilers to reuse operands requires registers. When only LOAD and STORE instructions access memory, the instruction set, the processor, and the handling of page faults in a virtual memory environment are greatly simplified. Cycle time is shortened as well.

2. The operations and addressing modes are reduced. Operations between registers complete in one cycle, permitting a simpler, hardwired control for each RISC, instead of microcode. Multiple-cycle instructions such as floating-point arithmetic are either executed in software or in a special-purpose coprocessor. (Without a coprocessor, RISCs have mediocre floating-point performance.) Only two simple addressing modes, indexed and PC-relative, are provided. More complicated addressing modes can be synthesized from the simple ones.

3. Instruction formats are simple and do not cross word boundaries. This restriction allows RISCs to re-
[Figure 2a diagram not reproduced. It contrasts sequential execution of instructions i, i+1, i+2 with pipelined execution of the same instructions.]

Pipelined execution gives a peak performance of one instruction every step, so in this example the peak performance of the pipelined machine is about four times faster than that of the sequential version. The figure shows that the longest piece determines the performance rate of the pipelined machine, so ideally each piece should take the same amount of time. The five pieces are the traditional steps of instruction execution: instruction fetch (IF), instruction decode (ID), operand fetch (OF), operand execution (OE), and operand store (OS).

FIGURE 2b. Branches and Data Dependencies between Instructions Force Delays, or "Bubbles," into Pipelines
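The timing argument behind Figure 2 can be stated in a few lines of arithmetic. This is an editorial sketch with made-up stage times and instruction counts; it shows both the near-fivefold peak speedup of a balanced five-stage pipeline and how one slow stage drags the whole pipeline down to its rate.

```python
# Sketch of the pipeline timing model of Figure 2: sequential execution
# costs the sum of the stage times per instruction, while a pipeline,
# once full, completes one instruction per step, the step being the
# LONGEST stage (IF, ID, OF, OE, OS in the figure).
def run_times(stage_times, n_instructions):
    total = sum(stage_times)
    step = max(stage_times)
    sequential = total * n_instructions
    pipelined = total + (n_instructions - 1) * step  # fill, then one/step
    return sequential, pipelined

s1, p1 = run_times([1, 1, 1, 1, 1], 1000)  # balanced five-stage pipeline
s2, p2 = run_times([1, 1, 3, 1, 1], 1000)  # one stage three times slower
print(s1 / p1)  # ~4.98: close to the five-stage ideal
print(s2 / p2)  # ~2.33: the longest piece sets the rate
```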
move instruction decoding time from the critical execution path. As Figure 5 shows, the Berkeley RISC register operands are always in the same place in the 32-bit word, so register access can take place simultaneously with opcode decoding. This removes the instruction decoding stage from the pipelined execution, making it more effective by shortening the pipeline (Figure 2b). Single-sized instructions also simplify virtual memory, since they cannot be broken into parts that might wind up on different pages.

4. RISC branches avoid pipeline penalties. A branch instruction in a pipelined computer will normally delay the pipeline until the instruction at the branch address is fetched (Figure 2b). Several pipelined machines have elaborate techniques for fetching
[Figure 3 diagram not reproduced. It shows the register-to-register code (Load rB ← B; Load rC ← C; …; Store rD → D) and the memory-to-memory code (Add B, C, A; Add A, C, B; Sub B, D, D), for which I = 168b, D = 288b, and M = 456b.]

This new version assumes optimizing compiler technology for the sequence A ← B + C; B ← A + C; D ← D − B. Note that the memory-to-memory architecture has no temporary storage, which means that it must reload operands. The register-to-register architecture now appears to be the "best."
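The operand counting that underlies Figure 3 can be sketched as follows. This is a toy model added editorially: it charges one 32-bit memory access per LOAD or STORE, keeps every value in a register after first use, and assumes each distinct result is eventually stored back. The memory-to-memory model's nine operand accesses match the figure's D = 288 bits.

```python
# Once a sequence reuses operands (A <- B + C; B <- A + C; D <- D - B),
# a register model loads each variable once and stores each result once,
# while a memory-to-memory model re-fetches every operand and writes
# every result back to memory.
seq = [("A", "B", "C"), ("B", "A", "C"), ("D", "D", "B")]  # dst, src1, src2

def data_accesses_register_model(seq):
    in_regs, accesses = set(), 0
    for dst, s1, s2 in seq:
        for src in (s1, s2):
            if src not in in_regs:      # LOAD only on first use
                accesses += 1
                in_regs.add(src)
        in_regs.add(dst)                # result stays in a register
    # each distinct result is eventually STOREd back
    return accesses + len({dst for dst, _, _ in seq})

def data_accesses_memory_model(seq):
    # two operand reads plus one result write per instruction
    return 3 * len(seq)

print(data_accesses_register_model(seq))  # 6 accesses (192 bits at 32b each)
print(data_accesses_memory_model(seq))    # 9 accesses (288 bits at 32b each)
```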
the appropriate instruction after the branch, but these techniques are too complicated for RISCs. The generic RISC solution, commonly used in microinstruction sets, is to redefine jumps so that they do not take effect until after the following instruction; this is called the delayed branch. The delayed branch allows RISCs to always fetch the next instruction during the execution of the current instruction. The machine-language code is suitably arranged so that the desired results are obtained. Because RISCs are designed to be programmed in high-level languages, the programmer is not required to consider this issue; the "burden" is carried by the programmers of the compiler, the optimizer, and the debugger. The delayed branch also removes the branch bubbles normally associated with pipelined execution (see Figure 2b). Table III illustrates the delayed branch.

RISC optimizing compilers are able to successfully rearrange instructions to use the cycle after the delayed branch more than 90 percent of the time. Hennessy has found that more than 20 percent of all instructions are executed in the delay after the branch.

Hennessy also pointed out how the delayed branch illustrates once again the folly of architectural metrics. Virtually all machines with variable-length instructions use a buffer to supply instructions to the CPU. These units blindly fill the prefetch buffer no matter what instruction is fetched and thus load the buffer with instructions after a branch, despite the fact that these instructions will eventually be discarded. Since studies show that one in four VAX instructions changes the program counter, such variable-length instruction machines really fetch about 20 percent more instruction words from memory than the architecture metrics would suggest. RISCs, on the other hand, nearly always execute something useful, because the instruction fetched after the branch is executed.

RISC VARIATIONS
Each RISC machine provides its own particular variations on the common theme. This makes for some interesting differences.

Compiler Technology versus Register Windows
Both IBM and Stanford pushed the state of the art in compiler technology to maximize the use of registers. Figure 6 shows the graph-coloring algorithm that is the cornerstone of this technology.
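The delayed branch amounts to a small instruction-scheduling pass. The sketch below (an editorial Python illustration; the instruction representation and the single dependence test are simplifying assumptions, and a real compiler checks several more hazards) moves the last instruction that the branch does not depend on into the delay slot, falling back to a NOP.

```python
# Delayed-branch scheduling sketch: the instruction after a branch is
# always executed, so the compiler tries to fill that slot with an
# independent instruction taken from before the branch.
def fill_delay_slot(block, branch_op, branch_deps):
    """block: list of (insn, dest_register) before the branch.
    branch_deps: registers the branch condition reads."""
    for i in range(len(block) - 1, -1, -1):
        insn, dest = block[i]
        if dest not in branch_deps:      # safe to move into the slot
            return block[:i] + block[i + 1:] + [(branch_op, None), block[i]]
    return block + [(branch_op, None), ("NOP", None)]  # nothing movable

# The branch tests r1, so the independent ADD can fill the slot:
code = [("LOAD r1", "r1"), ("ADD r3", "r3")]
print(fill_delay_slot(code, "BEQ r1", {"r1"}))
# -> [('LOAD r1', 'r1'), ('BEQ r1', None), ('ADD r3', 'r3')]
```

The fallback NOP is why the 90 percent slot-filling figure quoted above matters: every unfilled slot is a wasted cycle.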
The IBM 801, above, was completed in 1979 and had a cycle time of 86 ns. It was built from off-the-shelf ECL. The RISC II, on the right, designed by Manolis Katevenis and Robert Sherburne, was implemented in four-micron single-level-metal custom NMOS VLSI. This 41,000-transistor chip worked the first time at 500 ns per 32-bit instruction. The small control area accounts for only 10 percent of the chip and is found in the upper right-hand corner. The RISC II was rescaled and fabricated in three-micron NMOS; the resulting chip is 25 percent smaller than a 68000 and ran on first silicon at 330 ns per instruction. The Stanford MIPS chip, shown on page 16, was fabricated using the same conservative NMOS technology and runs at 500 ns per instruction. It has about the same chip area as the RISC II. On-chip memory management support plus the two-instruction-per-word format increase the control area of the MIPS.
The Berkeley team did not include compiler experts, so a hardware solution was implemented to keep operands in registers. The first step was to have enough registers to keep all the local scalar variables and all the parameters of the current procedure in registers. Attention was directed to these variables because of two very opportune properties: Most procedures have only a few variables (approximately a half-dozen), and these are heavily used (responsible for one-half to two-thirds of all dynamically executed references to operands). Normally, having a great many registers slows procedure calls. The solution was to have many sets, or windows, of registers, so that registers would not have to be saved on every procedure call and restored on every return. A procedure call automatically switches the processor to use a fresh set of registers. Such buffering can only work if programs naturally behave in a way that matches the buffer. Caches work because programs do not normally wander randomly about the address space. Similarly, through experimentation, a locality of procedure nesting was found; programs rarely execute a long uninterrupted sequence of calls followed by a long uninterrupted sequence of returns. Figure 7 illustrates this nesting behavior. To further improve performance, the Berkeley RISC machines have a unique way of speeding up procedure calls. Rather than copy the parameters from one window to another on each call, windows are overlapped so that some registers are simultaneously part of two windows. By putting parameters into the overlapping registers, operands are passed automatically.

What is the "best" way to keep operands in registers? The disadvantages of register windows are that they use more chip area and slow the basic clock cycle. The latter is due to the capacitive loading of the longer bus; and although context switches rarely occur (about 100 to 1000 procedure calls for every switch), they require that two to three times as many registers be saved, on average. The only drawbacks of the optimizing compiler are that it is about half the speed of a simple compiler and that it ignores the register-saving penalty of procedure calls. Both the 801 and the MIPS mitigate this call cost by expanding some procedures in-line, although this means these machines cannot provide separate compilation of those procedures. If compiler technology can reduce the number of LOADs and STOREs to the extent that register windows can, the optimizing compiler will stand in clear superiority. About 30 percent of the 801 instructions are LOAD or STORE when large programs are run; for the MIPS, which has 16 registers compared to 32 for the 801, about 35 percent of the instructions are LOADs or STOREs. For the Berkeley RISC machines, this percentage drops to about 15 percent, including the LOADs and STOREs used to save and restore registers when the register-window buffer overflows.

Delayed Loads and Multiple Memory and Register Ports
Since it takes one cycle to calculate the address and one cycle to access memory, the straightforward way to implement LOADs and STOREs is to take two cycles. This is what we did in the Berkeley RISC architecture. To reduce the costs of memory accesses, both the 801 and the MIPS provide "one-cycle" LOADs by following the style of the delayed branch. The first step is to have two ports to memory, one for instructions and one for data, plus a second write port to the registers. Since the address must still be calculated in the first cycle and the operand must still be fetched during the second cycle, the data are not available until the third cycle.
Therefore, the instruction executed following the one-cycle LOAD must not rely on the operand coming from memory. The 801 and the MIPS solve this problem with a delayed load, which is analogous to the delayed branch described above. The two compilers are able to put an independent instruction in the extra slot about 90 percent of the time. Since the Berkeley RISC executes many fewer LOADs, we decided to bypass the extra expense of an extra memory port and an extra register write port. Once again, depending on goals and implementation technology, either approach can be justified.

Pipelines
All RISCs use pipelined execution, but the length of the pipeline and the approach to removing pipeline bubbles vary. Since the peak pipelined execution rate is determined by the longest piece of the pipeline, the trick is to find a balance between the four parts of a RISC

TABLE II. Primary Characteristics of Three Operational RISC Machines

                          IBM 801     RISC I      MIPS
Year                      1980        1982        1983
Number of instructions    120         39          55
Control memory size       0           0           0
Instruction sizes (bits)  32          32          32
Technology                ECL MSI     NMOS VLSI   NMOS VLSI
Execution model           reg-reg     reg-reg     reg-reg

None of these machines use microprogramming; all three use 32-bit instructions and follow the register-to-register execution model. Note that the number of instructions in each of these machines is significantly lower than for those in Table I.
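The delayed-load rule described above can be sketched as a small scheduling check. This is an editorial Python illustration: where the real 801 and MIPS compilers fill the slot with useful work about 90 percent of the time, this sketch simply falls back to a NOP when the next instruction needs the loaded value too soon.

```python
# Delayed-load sketch: the instruction right after a one-cycle LOAD must
# not use the register being loaded, so a scheduler inserts a NOP when it
# cannot place an independent instruction in that slot.
def schedule_delayed_loads(code):
    """code: list of (op, dest, srcs) tuples."""
    out = []
    for insn in code:
        if out and out[-1][0] == "LOAD" and out[-1][1] in insn[2]:
            out.append(("NOP", None, ()))   # loaded value not ready yet
        out.append(insn)
    return out

prog = [("LOAD", "r1", ("A",)), ("ADD", "r2", ("r1", "r3"))]
print(schedule_delayed_loads(prog))
# the ADD uses r1 immediately, so a NOP is inserted between them
```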
[Figure 5 diagram not reproduced. It compares the instruction formats of the RISC I, the VAX, and the 432; the surviving caption text follows.]

… make instruction decoding more expensive and thus may not be good predictors of performance. Three machines (the RISC I, the VAX, and the 432) are compared for the instruction sequence A ← B + C; A ← A + 1; D ← D − B. VAX instructions are byte-variable from 16 to 456 bits, with an average size of 30 bits. Operand locations are not part of the main opcode but are spread throughout the instruction. The 432 has bit-variable instructions that range from 6 to 321 bits. The 432 also has multipart opcodes: The first part gives the number of operands and their location, and the second part gives the operation. The 432 has no registers, so all operands must be kept in memory. The specifier of an operand can appear anywhere in a 32-bit instruction word in the VAX or the 432. The RISC I instructions are always 32 bits long, they always have three operands, and these operands are always specified in the same place in the instruction. This allows overlap of instruction decoding with fetching of the operand. This technique has the added benefit of removing a stage from the execution pipeline.
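Fixed single-word formats make decoding nothing more than constant shifts and masks, which is why it can proceed in parallel with register fetch. The field layout below is hypothetical (8-bit opcode and three 5-bit register specifiers, chosen for illustration; it is not the RISC I's actual encoding):

```python
# Decoding a fixed 32-bit instruction word: every field lives at a
# constant bit position, so extraction is a handful of shifts and masks
# and needs no sequential parsing of variable-length specifiers.
def decode(word):
    return {
        "opcode": (word >> 24) & 0xFF,   # bits 31..24
        "rd":     (word >> 19) & 0x1F,   # bits 23..19
        "rs1":    (word >> 14) & 0x1F,   # bits 18..14
        "rs2":    (word >> 9)  & 0x1F,   # bits 13..9
    }

word = (0x12 << 24) | (3 << 19) | (4 << 14) | (5 << 9)
print(decode(word))  # -> {'opcode': 18, 'rd': 3, 'rs1': 4, 'rs2': 5}
```

A byte- or bit-variable format (as on the VAX or the 432) instead forces the decoder to examine each operand specifier before it even knows where the next one begins.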
TABLE III. A Comparison of the Traditional Branch Instruction with the Delayed Branch Found in RISC Machines
a fair amount of their clock cycle detecting and blocking interlocks, but the simple pipeline and register model of RISCs also simplifies interlocking. Hennessy believes the MIPS cycle would be lengthened by 10 percent if hardware interlocking were added. The designers of the RISC II measured the internal forwarding logic and found that careful hardware design prevented forwarding from lengthening the RISC II clock cycle.

Multiple Instructions per Word
The designers of the MIPS tried an interesting variation on standard RISC philosophy by packing two instructions into every 32-bit word whenever possible. This improvement could potentially double performance if
[Figure 6 diagrams not reproduced.]

FIGURE 6a. IBM's Solution to the Register Allocation Problem
FIGURE 6b. A Partial Solution to Register Allocation
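Graph-coloring register allocation, the technique Figure 6 refers to, can be illustrated with a greedy sketch. This is an editorial Python example; production allocators, including IBM's, use far more sophisticated orderings and spill heuristics.

```python
# Graph-coloring register allocation sketch: variables are nodes, an edge
# joins two variables that are live at the same time, and a k-coloring
# assigns k registers so interfering variables never share one.
def color(interference, k):
    assignment = {}
    for var in sorted(interference):            # deterministic greedy order
        taken = {assignment[n] for n in interference[var] if n in assignment}
        free = [r for r in range(k) if r not in taken]
        if not free:
            return None                         # would need a spill to memory
        assignment[var] = free[0]
    return assignment

# a, b, c all interfere pairwise; d interferes only with a.
graph = {"a": {"b", "c"}, "b": {"a", "c"}, "c": {"a", "b"}, "d": {"a"}}
print(color(graph, 3))   # three registers suffice
print(color(graph, 2))   # None: some variable must spill
```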
[Figure 7 graph not reproduced. Its vertical axis is procedure nesting depth, moving one step per call or return over time.]

This graph shows the call-return behavior of programs that inspired the register-window scheme used by the Berkeley RISC machines. Each call causes the line to move down and to the right, and each return causes the line to move up and to the right. The rectangles show how long the machine stays within the buffer; the longest case is 33 calls and returns (t = 33) inside the buffer. In this figure we assume that there are five windows (w = 5). We have found that about eight windows hits the knee of the curve and that locality of nesting is found in programs written in C, Pascal, and Smalltalk.
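The buffering behavior Figure 7 describes can be simulated with a few lines. This is an editorial sketch under simplified assumptions (one window spilled or restored at a time, a single process, an illustrative call/return trace): the point is only that with locality of nesting, a handful of windows absorbs almost all calls and returns.

```python
# Register-window buffer sketch: w windows hold the top of the call
# chain; a call past the last free window spills one set to memory
# (overflow), and a return past the resident windows restores one
# (underflow). Locality of nesting keeps both events rare.
def window_traffic(trace, w):
    resident, overflows, underflows = 1, 0, 0   # start in one window
    for event in trace:
        if event == "call":
            if resident == w:
                overflows += 1                  # spill oldest window
            else:
                resident += 1
        else:  # "ret"
            if resident == 1:
                underflows += 1                 # restore caller's window
            else:
                resident -= 1
    return overflows, underflows

# A nesting-local trace: calls and returns stay near one depth, with one
# deeper excursion of six calls.
trace = ["call", "call", "ret", "call", "ret", "ret", "call", "call",
         "call", "call", "call", "call", "ret", "ret", "ret", "ret"]
print(window_traffic(trace, 5))  # (2, 0): only the deep run overflows
print(window_traffic(trace, 8))  # (0, 0): eight windows absorb everything
```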
twice as many operations were executed for every 32-bit instruction fetch. Since memory-access instructions and jump instructions generally need the full 32-bit word, and since data dependencies prevented some combinations, most programs were sped up by 10 to 15 percent. Arithmetic routines written in assembly language had much higher savings. Hennessy believes the two-instruction-per-word format adds 10 percent to the MIPS cycle time because of the more complicated decoding. He does not plan to use the two-instruction-per-word technique in his future designs.

HIDDEN RISCS
If building a new computer around a reduced instruction set results in a machine with a better price/performance ratio, what would happen if the same techniques were used on the RISC subset of a traditional machine? This interesting question has been explored
[Pipeline timing diagram not reproduced.]

The memory is kept busy 100 percent of the time, the register file is reading or writing 100 percent of the time, and the execution unit (ALU) is busy 50 percent of the time. The short pipeline and pipeline data forwarding allow the RISC II to avoid pipeline bubbles when data dependencies like those shown in Figure 2b are present.
by RISC advocates and, perhaps unintentionally, by the designers of more traditional machines.

DEC reported a subsetting experiment on two implementations of the VAX architecture in VLSI. The VLSI VAX has nine custom VLSI chips and implements the complete VAX-11 instruction set. DEC found that 20.0 percent of the instructions are responsible for 60.0 percent of the microcode and yet are only 0.2 percent of all instructions executed. By trapping to software to execute these instructions, the MicroVAX 32 was able to fit the subset architecture into only one chip, with an optional floating-point coprocessor in another chip. As shown in Table IV, the VLSI VAX, a VLSI implementation of the full VAX architecture, uses five to ten times the resources of the MicroVAX 32 to implement the full instruction set, yet is only 20 percent faster.

Michael L. Powell of DEC obtained improved performance by subsetting the VAX architecture from the software side. His experimental DEC Modula-2 compiler generates code that is comparable in speed to the best compilers for the VAX, using only a subset of the addressing modes and instructions. This gain is in part due to the fact that optimization reduces the use of complicated modes. Often only a subset of the functions performed by a single complicated instruction is needed. The VAX has an elaborate CALL instruction, generated by most VAX compilers, that saves registers on procedure entry. By replacing the CALL instruction with a sequence of simple instructions that do only what is necessary, Powell was able to improve performance by 20 percent.

The IBM S/360 and S/370 were also targets of RISC studies. The 360 model 44 can be considered an ancestor of the MicroVAX 32, since it implements only a subset of the 360 architecture in hardware. The rest of the instructions are implemented by software. The 360/44 had a significantly better cost/performance ratio than its nearest neighbors. IBM researchers performed a software experiment by retargeting their highly optimizing PL/8 compiler away from the 801 to the System/370. The optimizer treated the 370 as a register-to-register machine to increase the effectiveness of register allocation. This subset of the 370 ran programs 50 percent faster than the previous best optimizing compiler that used the full 370 instruction set. Software and hardware experiments on subsets of the VAX and IBM 360/370, then, seem to support the RISC arguments.

ARCHITECTURAL HERITAGE
All RISC machines borrowed good ideas from old machines, and we hereby pay our respects to a long line of architectural ancestors. In 1946, before the first digital computer was operational, von Neumann wrote

The really decisive considerations from the present point of view, in selecting a code [instruction set], are more of a practical nature: the simplicity of the equipment demanded by the code, and the clarity of its application to the actually important problems together with the speed of its handling of those problems.²

For the last 25 years Seymour Cray has been quietly designing register-based computers that rely on LOADS and STORES while using pipelined execution. James Thornton, one of his colleagues on the CDC-6600, wrote in 1963

The simplicity of all instructions allows quick and simple evaluation of status to begin execution. ... Adding complication to a special operation, therefore, degrades all the others.

and also

In my mind, the greatest potential for improvement is with the internal methods ... at the risk of loss of fringe operations. The work to be done is really engineering work, pure and simple. As a matter of fact, that's what the results should be: pure and simple.³

John Cocke developed the idea of pushing compiler technology with fast, simple instructions for text and integer applications. The IBM 801 project, led by George Radin, began experimenting with these ideas in 1975 and produced an operational ECL machine in 1979. The results of this project could not be published, however, until 1982.

In 1980 we started the RISC project at Berkeley. We were inspired by an aversion to the complexity of the VAX and Intel 432, the lack of experimental evidence in architecture research, rumors of the 801, and the desire to build a VLSI machine that minimized design
effort while maximizing the cost/performance factor. We combined our research with course work to build the RISC I and RISC II machines.

In 1981 John Hennessy started the MIPS project, which tried to extend the state of the art in compiler optimization techniques, explored pipelined techniques, and used VLSI to build a fast microcomputer. Both university projects built working chips using much less manpower and time than traditional microprocessors.

TABLE IV. Two VLSI Implementations of the VAX

                                    VLSI VAX    MicroVAX 32
  VLSI chips (including
    floating point)                     9            2
  Microcode                          450K
  Transistors                       1250K         101K

These two implementations, although not yet in products, illustrate the difficulty of building the complete VAX architecture in VLSI. The VLSI VAX is 20 percent faster than the MicroVAX 32 but uses five to ten times the resources for the processor. Both are implemented from the same three-micron double-level metal NMOS technology. (The VLSI VAX also has external data and address caches not counted in this table.)

² Burks, A.W., Goldstine, H.H., and von Neumann, J. Preliminary discussion of the logical design of an electronic computing instrument. Rep. to U.S. Army Ordnance Dept., 1946.
³ Thornton, J.E. Considerations in Computer Design - Leading Up to the Control Data 6600. Control Data Chippewa Laboratory, 1963.
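The figure caption above mentions that a short pipeline with data forwarding lets the RISC II avoid bubbles on data dependencies. As a toy illustration only (not the RISC II datapath), the model below counts cycles for a pipeline in which a result reaches the register file one cycle after it is computed: without a forwarding path, a dependent instruction must stall for one bubble; with forwarding, the fresh result is routed straight to the next instruction's ALU input.

```python
# Toy model (not the RISC II datapath): count execution cycles for a
# short pipeline where a result is written back one cycle after it is
# computed. Forwarding routes the fresh result directly to the next
# instruction's ALU input, removing the one-cycle bubble.

def total_cycles(program, forwarding):
    """program: list of (dest_reg, source_regs) tuples, in issue order."""
    cycles = 0
    prev_dest = None
    for dest, sources in program:
        if prev_dest is not None and prev_dest in sources and not forwarding:
            cycles += 1          # bubble: wait for write-back
        cycles += 1              # the instruction itself
        prev_dest = dest
    return cycles

# r2 depends on r1, and r3 on r2: two back-to-back dependencies,
# like the sequence shown in Figure 2b.
prog = [("r1", ()), ("r2", ("r1",)), ("r3", ("r2",))]
print(total_cycles(prog, forwarding=False))  # 5
print(total_cycles(prog, forwarding=True))   # 3
```

The two-cycle gap is exactly one bubble per dependent pair; forwarding recovers it without lengthening the pipeline.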
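The MicroVAX 32 strategy described under HIDDEN RISCS, implementing only the frequent instructions in hardware and trapping to software for the rare, microcode-heavy ones, can be sketched as follows. This is an illustrative model, not DEC's design: the dispatch tables and the handler are hypothetical, and POLY (a real VAX polynomial-evaluation instruction family) is used only as an example of a rare, expensive operation.

```python
# Illustrative sketch (not DEC's implementation): frequent instructions
# execute in "hardware" (a fast dispatch table), while rare instructions
# trap to software emulation routines, as the MicroVAX 32 did.

def emulate_poly(state, operands):
    """Software trap handler in the spirit of the VAX POLY family."""
    x, coeffs = operands
    acc = 0.0
    for c in coeffs:             # Horner's rule
        acc = acc * x + c
    return {**state, "r0": acc}

# The hardwired subset: simple, frequent operations only.
HARDWARE = {
    "ADD": lambda s, ops: {**s, "r0": ops[0] + ops[1]},
    "MOV": lambda s, ops: {**s, "r0": ops[0]},
}

# Everything else traps to a software routine.
TRAP_HANDLERS = {"POLY": emulate_poly}

def execute(state, opcode, operands):
    if opcode in HARDWARE:       # fast path: hardwired subset
        return HARDWARE[opcode](state, operands)
    if opcode in TRAP_HANDLERS:  # slow path: trap to software
        return TRAP_HANDLERS[opcode](state, operands)
    raise ValueError(f"reserved opcode: {opcode}")

print(execute({}, "ADD", (2, 3))["r0"])                     # 5
print(execute({}, "POLY", (2.0, [1.0, 0.0, -4.0]))["r0"])   # 0.0
```

Because the trapped instructions account for only 0.2 percent of executions, the slow path is paid for almost never, while the hardware shrinks dramatically.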
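Powell's CALL replacement, described above, exploits the fact that a general instruction pays for work most call sites do not need. The counts below are a hypothetical cost model, not VAX cycle timings: they merely contrast a CALLS-style instruction, which pushes every register named in the callee's entry mask plus fixed bookkeeping, with a tailored sequence of simple instructions that saves only the registers a particular call actually disturbs.

```python
# Hypothetical cost model (memory writes per call), contrasting a
# CALLS-style general procedure call with a tailored sequence of
# simple instructions. Counts are illustrative, not VAX-accurate.

def calls_style_cost(entry_mask):
    # General CALL: push all registers in the entry mask, plus fixed
    # bookkeeping (return PC, old frame/argument pointers, saved mask).
    return len(entry_mask) + 4

def tailored_cost(clobbered):
    # Simple-instruction sequence: push the return address plus only
    # the registers this call actually clobbers.
    return len(clobbered) + 1

entry_mask = {"r2", "r3", "r4", "r5", "r6"}  # conservative mask
clobbered = {"r2", "r3"}                     # what is really needed

print(calls_style_cost(entry_mask))  # 9
print(tailored_cost(clobbered))      # 3
```

A three-to-one difference in per-call memory traffic is consistent with the kind of 20 percent whole-program gain Powell reported, since calls are frequent but not the whole program.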