
Optimizing an ANSI C Interpreter with Superoperators

Todd A. Proebsting
University of Arizona

Address: Todd A. Proebsting, Department of Computer Science, University of Arizona, Tucson, AZ 85721. Internet: todd@cs.arizona.edu

Abstract

This paper introduces superoperators, an optimization technique for bytecoded interpreters. Superoperators are virtual machine operations automatically synthesized from smaller operations to avoid costly per-operation overheads. Superoperators decrease executable size and can double or triple the speed of interpreted programs. The paper describes a simple and effective heuristic for inferring powerful superoperators from the usage patterns of simple operators.

The paper describes the design and implementation of a hybrid translator/interpreter that employs superoperators. From a specification of the superoperators (either automatically inferred or manually chosen), the system builds an efficient implementation of the virtual machine in assembly language. The system is easily retargetable and currently runs on the MIPS R3000 and the SPARC.

1 Introduction

Compilers typically translate source code into machine language. Interpreter systems translate source into code for an underlying virtual machine (VM) and then interpret that code. The extra layer of indirection in an interpreter presents time/space tradeoffs. Interpreted code is usually slower than compiled code, but it can be smaller if the virtual machine operations are properly encoded.

Interpreters are more flexible than compilers. A compiler writer cannot change the target machine's instruction set, but an interpreter writer can customize the virtual machine. For instance, a virtual machine can be augmented with specialized operations that will allow the interpreter to produce smaller or faster code. Similarly, changing the interpreter implementation to monitor program execution (e.g., for debugging or profiling information) is usually easy.

This paper will describe the design and implementation of hti, a hybrid translator/interpreter system for ANSI C that has been targeted to both the MIPS R3000 [KH92] and the SPARC [Sun91]. hti will introduce superoperators, a novel optimization technique for customizing interpreted code for space and time. Superoperators automatically fold many atomic operations into a more efficient compound operation, in a fashion similar to supercombinators in functional language implementations [FH88]. Without superoperators, hti executables are only 8-16 times slower than unoptimized natively compiled code. Superoperators can lower this to a factor of 3-9. Furthermore, hti can generate program-specific superoperators automatically.
The hybrid translator, hti, compiles C functions into a tiny amount of assembly code for the function prologue and interpreted bytecode instructions for function bodies. The bytecodes represent the operations of the interpreter's virtual machine. By mixing assembly code and bytecodes, hti maintains all native code calling conventions; hti object files can be freely mixed with compiled object files.

The interpreter is implemented in assembly language for efficiency. Both the translator, hti, and the interpreter are quickly retargeted with a small machine specification.

2 Translator Output

hti uses lcc's front end to translate ANSI C programs into its intermediate representation (IR) [FH91b, FH91a]. lcc's IR consists of expression trees over a simple 109-operator language. For example, the tree for 2+3 would be ADDI(CNSTI,CNSTI), where ADDI represents integer addition (ADD+I), and the CNSTI's represent integer constants. The actual values of the CNSTI's are IR node attributes.

hti's virtual machine instructions are bytecodes (with any necessary immediate values). The interpreter uses an evaluation stack to evaluate all expressions. In the simplest hti virtual machines, there is a one-to-one correspondence between VM bytecodes and lcc IR operators (superoperators will change this). Translation is a left-to-right postfix emission of the bytecodes. Any necessary node attributes are emitted immediately after the corresponding bytecode. For this VM, the translation of 2+3 would be similar to the following:

    .byte 36    # CNSTI
    .word 2     # immediate value
    .byte 36    # CNSTI
    .word 3     # immediate value
    .byte 8     # ADDI

The interpreter implements operations via a jump-table indexed by bytecodes. The interpreter reads the first CNSTI's bytecode (36), and jumps to CNSTI's implementation. CNSTI code reads the attribute value (2) and pushes it on the stack. The interpreter similarly handles the "3." After reading the ADDI bytecode, the interpreter pops two integers off the evaluation stack, and pushes their sum.
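To make the dispatch concrete, here is a small, self-contained C sketch of this classical loop running the 2+3 stream above. It is illustrative rather than a transcription of hti (whose interpreter is hand-written assembly): the names handlers, pc, and sp are invented, the HALT opcode is added so the sketch can stop, and the immediates are packed rather than word-aligned (see Section 7 for the alignment issue).

    #include <stdio.h>
    #include <string.h>

    enum { ADDI = 8, CNSTI = 36, HALT = 0 };

    static unsigned char *pc;              /* interpreter program counter */
    static int stack[16], *sp = stack;     /* evaluation stack */
    static int running = 1;

    static void do_CNSTI(void)             /* bytecode 36: push immediate */
    {
        int v;
        memcpy(&v, pc, sizeof v);          /* attribute follows the bytecode */
        pc += sizeof v;
        *sp++ = v;
    }

    static void do_ADDI(void)              /* bytecode 8: pop two, push sum */
    {
        sp--;
        sp[-1] += sp[0];
    }

    static void do_HALT(void) { running = 0; }

    static void (*handlers[256])(void);    /* jump table indexed by bytecode */

    int main(void)
    {
        /* .byte 36; .word 2; .byte 36; .word 3; .byte 8; then HALT */
        unsigned char code[12];
        int two = 2, three = 3;
        code[0] = CNSTI; memcpy(code + 1, &two, 4);
        code[5] = CNSTI; memcpy(code + 6, &three, 4);
        code[10] = ADDI;
        code[11] = HALT;

        handlers[CNSTI] = do_CNSTI;
        handlers[ADDI]  = do_ADDI;
        handlers[HALT]  = do_HALT;

        pc = code;
        while (running)
            handlers[*pc++]();             /* read a bytecode, jump to its code */

        printf("%d\n", sp[-1]);            /* prints 5 */
        return 0;
    }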
The evaluation stack for each translated procedure exists in its own activation record. Local stacks allow programs to behave correctly in the presence of interprocedural jumps (e.g., longjmp).

hti produces an assembly file. Most of the file consists of the bytecode translation of C function bodies, and data declarations. hti does, however, produce a tiny amount of assembly language for function prologues. Prologue code tells the interpreter how big the activation record should be, where within it to locate the evaluation stack, and where to find the bytecode instructions; it is ultimately responsible for transferring control to the interpreter. A prologue on the R3000 looks like the following:

    main:
        li $24, 192            # put activation record size in $24
        li $8, 96              # put location of evaluation stack in $8
        la $25, $$11           # put location of bytecode in $25
        j _prologue_scalar     # jump to interpreter

(_prologue_scalar unloads scalar arguments onto the stack; the R3000 calling conventions require a few different such prologue routines. Once the arguments are on the stack, the interpreter is started.) Prologue code allows natively compiled procedures to call interpreted procedures without modification.

3 Superoperator Optimization

Compiler front ends, including lcc, produce many IR trees that are very similar in structure. For instance, ADDP(INDIRP(x),CNSTI) is the most common 3-node IR pattern produced by lcc when it compiles itself. (x is a placeholder for a subtree.) This pattern computes a pointer value that is a constant offset from the value pointed to by x (i.e., the l-value of x->b in C).

With only simple VM operators, translating ADDP(INDIRP(x),CNSTI) requires emitting three bytecodes and the CNSTI's attribute. Interpreting those instructions requires
1. Reading the INDIRP bytecode, popping x's value off the stack, fetching and pushing the referenced value,

2. Reading the CNSTI bytecode and attribute, and pushing the attribute,

3. Reading the ADDP bytecode, popping the two just-pushed values, computing and pushing their sum.

If the pattern ADDP(INDIRP(x),CNSTI) were a single operation that takes a single operand, x, the interpreter would avoid 2 bytecode reads, 2 pushes, and 2 pops. This new operator would have one attribute: the value of the embedded CNSTI. These synthetic operators are called superoperators.

Superoperators make interpreters faster by eliminating pushes, pops, and bytecode reads. Furthermore, superoperators decrease code size by eliminating bytecodes. The cost of a superoperator is an additional bytecode, and a correspondingly larger interpreter. Experiments in Section 8 show that carefully chosen superoperators result in smaller and significantly faster interpreted code.
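Continuing the dispatch sketch above, a fused handler for this superoperator might look as follows. This is an illustration, not hti's generated code; the Cell union and all names are inventions of the sketch. One bytecode read replaces three, and the result is computed in place at the stack top, so the two intermediate pushes and two pops disappear.

    #include <stdio.h>
    #include <string.h>

    typedef union { long i; void *p; } Cell;  /* stack cell: int or pointer */

    static Cell stack[16], *sp;
    static unsigned char *pc;

    /* Fused handler for ADDP(INDIRP(x),CNSTI): one dispatch computes
       the l-value of x->b in place on the stack top. */
    static void do_ADDP_INDIRP_CNSTI(void)
    {
        int offset;
        memcpy(&offset, pc, sizeof offset);      /* embedded CNSTI attribute */
        pc += sizeof offset;
        sp[-1].p = *(char **)sp[-1].p + offset;
    }

    int main(void)
    {
        struct S { int a; int b; } s = { 1, 42 };
        struct S *x = &s;                        /* the pointer variable x */
        unsigned char attr[4];
        int off = (int)((char *)&s.b - (char *)&s);
        memcpy(attr, &off, sizeof off);

        sp = stack;
        (sp++)->p = &x;                          /* subtree x left its value here */
        pc = attr;
        do_ADDP_INDIRP_CNSTI();                  /* stack top is now &x->b */

        printf("%d\n", *(int *)sp[-1].p);        /* prints 42 */
        return 0;
    }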
3.1 Inferring Superoperators

Superoperators can be designed to optimize the interpreter over a wide range of C programs, or for a specific program. The lcc IR includes only 109 distinct operators, thus leaving 147 bytecodes for superoperators. Furthermore, if the interpreter is being built for a specific application, it may be possible to remove many operations from the VM if they are never generated in the translation of the source program (e.g., floating point operations), thereby allowing the creation of even more superoperators.

The addition of superoperators increases the size of the interpreter, but this can be offset by the corresponding reduction of emitted bytecodes. Specific superoperators may optimize for space or time. Unfortunately, choosing the optimal set of superoperators for space reduction is NP-complete; External Macro Data Compression (SR22 [GJ79]) reduces to this problem. Similarly, optimizing for execution time is equally complex.

3.1.1 Inference Heuristic

hti includes a heuristic method for inferring a good set of superoperators. The heuristic reads a file of IR trees, and then decides which adjacent IR nodes should be merged to form new superoperators. Each tree is weighted to guide the heuristic. When optimizing for space, the weight is the number of times each tree is emitted by the front end of lcc. When optimizing for time, the weight is each tree's (expected) execution frequency.

A simple greedy heuristic creates superoperators. The heuristic examines all the input IR trees to isolate all pairs of adjacent (parent/child) nodes. Each pair's weight is the sum of the weights of the trees in which it appears. (If the same pair appears N times in the same tree, that tree's weight is counted N times.) The pair with the greatest cumulative weight becomes the superoperator formed by merging that pair. This new superoperator then replaces all occurrences of that pair in the input trees. For example, assume that the input trees with weights are

    I(A(Z,Y))  10
    A(Y,Y)      1

The original operators' frequencies of use are

    Y  12
    Z  10
    I  10
    A  11

The frequencies of the parent/child pairs are

    I(A(*))  10
    A(Z,*)   10
    A(*,Y)   11
    A(Y,*)    1

Therefore, A(*,Y) would become a new superoperator, B. This new unary operator will replace the occurrences of A(*,Y) in the subject trees. The resulting trees are

    I(B(Z))  10
    B(Y)      1
The new frequencies of parent/child pairs are

    I(B(*))  10
    B(Z)     10
    B(Y)      1

Repeating the process, a new superoperator would be created for either I(B(*)) or B(Z). Ties are broken arbitrarily, so assume that B(Z) becomes the new leaf operator, C. Note that C is simply the composition of A(Z,Y). The rewritten trees are

    I(C)  10
    B(Y)   1

The frequencies for the bytecodes are now

    Y   1
    Z   0
    I  10
    A   0
    B   1
    C  10

It is interesting to note that the B superoperator is used only once now despite being present in 11 trees earlier. Underutilized superoperators inhibit the creation of subsequent superoperators by using up bytecodes and hiding constituent pieces from being incorporated into other superoperators. Unfortunately, attempting to take advantage of this observation by breaking apart previously created, but underutilized, superoperators was complicated and ineffective.

Creating the superoperators B and C eliminated the last uses of the operators A and Z, respectively. The heuristic can take advantage of this by reusing those operators' bytecodes for new superoperators. The process of synthesizing superoperators repeats until all 256 bytecodes are exhausted. The heuristic may, of course, merge superoperators together.

The heuristic implementation requires only 204 lines of Icon [GG90]. The heuristic can be configured to eliminate obsolete operators (i.e., reuse their bytecodes), or not, as superoperators are created. Not eliminating obsolete operators allows the resulting translator to process all programs, even those it was not specifically optimized for.
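The pair counting at the heart of one selection round is small enough to sketch. The following standalone C program is illustrative only (the real implementation is the 204-line Icon program mentioned above): it encodes a pair as (parent operator, child slot, child operator), so the paper's A(*,Y) is (A, slot 1, Y), and it reproduces the choice of A(*,Y) with weight 11 on the example trees. Merging and bytecode reuse are omitted.

    #include <stdio.h>

    enum { Y, Z, I, A, NOPS };                 /* operators from the example */
    static const char *name[NOPS] = { "Y", "Z", "I", "A" };

    typedef struct Tree { int op; struct Tree *kid[2]; } Tree;

    /* weight[p][s][c] = cumulative weight of pair "p with c in slot s" */
    static long weight[NOPS][2][NOPS];

    static void count(const Tree *t, long w)   /* accumulate one weighted tree */
    {
        for (int s = 0; s < 2; s++)
            if (t->kid[s]) {
                weight[t->op][s][t->kid[s]->op] += w;
                count(t->kid[s], w);
            }
    }

    int main(void)
    {
        Tree y = { Y }, z = { Z };
        Tree a1 = { A, { &z, &y } };           /* A(Z,Y) */
        Tree t1 = { I, { &a1, NULL } };        /* I(A(Z,Y)), weight 10 */
        Tree a2 = { A, { &y, &y } };           /* A(Y,Y),    weight 1  */

        count(&t1, 10);
        count(&a2, 1);

        int bp = 0, bs = 0, bc = 0;            /* heaviest pair found so far */
        for (int p = 0; p < NOPS; p++)
            for (int s = 0; s < 2; s++)
                for (int c = 0; c < NOPS; c++)
                    if (weight[p][s][c] > weight[bp][bs][bc]) {
                        bp = p; bs = s; bc = c;
                    }

        /* prints "merge pair A(slot 1, Y), weight 11" */
        printf("merge pair %s(slot %d, %s), weight %ld\n",
               name[bp], bs, name[bc], weight[bp][bs][bc]);
        return 0;
    }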
4 Translator Design

4.1 Bytecode Emitter

hti translates lcc's IR into bytecodes and attributes. Bytecodes can represent simple IR operators, or complex superoperator patterns. The optimal translation of an IR tree into bytecodes is automated via tree pattern matching using burg [FHP92]. burg takes a cost-augmented set of tree patterns, and creates an efficient pattern matcher that finds the least-cost cover of a subject tree. Patterns describe the actions associated with bytecodes. Some sample patterns in the burg specification, interp.gr, follow:

    stk : ADDP(INDIRP(stk),CNSTI) = 5 (1) ;
    stk : ADDP(stk,stk) = 9 (1) ;
    stk : CNSTI = 36 (1) ;
    stk : INDIRP(stk) = 77 (1) ;

The nonterminal stk represents a value that resides on the stack. The integers after the ='s are the burg rule numbers, and are also the actual bytecodes for each operation. Rule 9, for example, is a VM instruction that pops two values from the stack, adds them, and pushes the sum onto the stack. The (1)'s indicate that each pattern has been assigned a cost of 1. The pattern matcher would choose to use rule 5 (at cost 1) over rules 9, 36, and 77 (at cost 3) whenever possible.

The burg specification for a given VM is generated automatically from a list of superoperator patterns. To change the superoperators of a VM (and its associated translator and interpreter) one simply adds or deletes patterns from this list and then re-builds hti. hti can be built with inferred or hand-chosen superoperators.
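As a concrete illustration (the offset value 8 is an assumption for the example; everything else follows the rules above), consider the l-value tree ADDP(INDIRP(x),CNSTI). After the bytecodes for the subtree x, the simple operators cover the tree at cost 3, emitting three bytecodes plus the attribute:

    .byte 77    # INDIRP
    .byte 36    # CNSTI
    .word 8     # immediate value
    .byte 9     # ADDP

while the rule-5 superoperator covers it at cost 1, with a single bytecode and the same attribute:

    .byte 5     # ADDP(INDIRP(stk),CNSTI)
    .word 8     # immediate value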
4.2 Attribute Emitter

hti must emit node attributes after appropriate bytecodes. In the previous example, it is necessary to emit the integer attribute of the CNSTI node immediately after emitting the bytecodes for rules 5 or 36. This is simple for single operators, but superoperators may need to emit many attributes. The pattern ADDI(MULI(x,CNSTI),CNSTI) requires two emitted attributes, one for each CNSTI.

To build hti, a specification associates attributes with IR operators. A preprocessor builds an attribute emitter for each superoperator. The attribute specification for CNSTI is

    reg: CNSTI = (1)
        "emitsymbol(%P->syms[0]->x.name, 4, 4);"

The pattern on the first line indicates that the interpreter will compute the value of the CNSTI into a register at cost 1. The second line indicates that the translator emits a 4-byte value that is 4-byte aligned. The preprocessor expands %P to point to the CNSTI node relative to the root of the superoperator in which it exists. %P->syms[0]->x.name is the emitted value. For the simple operator, stk: CNSTI, the attribute emitter executes the following call after emitting the bytecode

    emitsymbol(p->syms[0]->x.name, 4, 4);

where p points to the CNSTI.

For stk: ADDP(INDIRP(STK), CNSTI), the attribute emitter executes

    emitsymbol(p->kids[1]->syms[0]->x.name, 4, 4);

where p->kids[1] points to the CNSTI relative to the root of the pattern, ADDP.

A preprocessor creates a second burg specification, mach.gr, from the emitter specification. The emitter specification patterns form the rules in mach.gr. The mach.gr-generated pattern matcher processes trees that represent the VM's superoperators. For every emitter pattern that matches in a superoperator tree, the associated emitter action must be included in the translator for that superoperator. This is done automatically from the emitter specification and the list of superoperator trees. (Single-node VM operators are always treated as degenerate superoperators.) Automating the process of translating chosen superoperators to a new interpreter is key to practically exploiting superoperator optimizations.

5 Interpreter Generation

The interpreter is implemented in assembly language. Assembly language enables important optimizations like keeping the evaluation stack pointer and interpreter program counter in hardware registers. Much of the interpreter is automatically generated from a target machine specification and the list of superoperators. The target machine specification maps IR nodes (or patterns of IR nodes) to assembly language. For instance, the mapping for ADDI on the R3000 is

    reg: ADDI(reg,reg) = (1)
        "addu %0r, %1r, %2r\n"

This pattern indicates that integer addition (ADDI) can be computed into a register if the operands are in registers. %0r, %1r, and %2r represent the registers for the left-hand side nonterminal, the left-most right-hand side nonterminal, and the next right-hand side nonterminal, respectively.

The machine specification augments the emitter specification described above; they share the same patterns. Therefore, they can share the same burg-generated pattern matcher. The pattern matcher processes superoperator trees to determine how best to translate each into machine code. Below is a small specification to illustrate the complete translation for an ADDI operator.

    reg: ADDI(reg,reg) = (1)
        "addu %0r, %1r, %2r\n"

    reg: STK = (1)
        "lw %0r, %P4-4($19)\n"

    stmt: reg = (0)
        "sw %0r, %U4($19)\n"

STK is a terminal symbol representing a value on the stack. The second rule is a pop from the evaluation stack into a register. %P4 is a 4-byte pop, and $19 is the evaluation stack pointer register. The third rule is a push onto the evaluation stack from a register. %U4 is the 4-byte push.

To generate the machine code for a simple ADDI operation, the interpreter-generator reduces the tree ADDI(STK,STK) to the nonterminal stmt using the pattern matcher. The resulting code requires two instances of the second rule, and one each of the first and third rules:
    lw $8, 0-4($19)     # pop left operand (reg: STK)
    lw $9, -4-4($19)    # pop right operand (reg: STK)
    addu $8, $8, $9     # add them (reg: ADDI(reg,reg))
    sw $8, -8($19)      # push the result (stmt: reg)
    addu $19, -4        # adjust stack

The interpreter-generator automatically allocates registers $8 and $9, and generates code to adjust the evaluation stack pointer.

The interpreter-generator selects instructions and allocates temporary registers for each superoperator. In essence, creating an interpreter is traditional code generation, except that it is done for a static set of IR trees before any source code is actually translated.

The emitter and machine specifications use the same patterns, so only one file is actually maintained. The juxtaposition of the emitter code and machine code makes their relationship explicit. Below is the complete R3000 specification for CNSTI.

    reg: CNSTI = (1)
        "addu $17, 7;
         srl $17, 2;
         sll $17, 2;
         lw %0r, -4($17)\n"
        "emitsymbol(%P->syms[0]->x.name, 4, 4);"

Register $17 is the interpreter's program counter (pc). The first three instructions advance the pc past the 4-byte immediate data and round the address to a multiple of 4. (Because of assembler and linker constraints on the R3000, all 4-byte data must be word aligned.) The lw instruction loads the immediate value into a register.

Machine/emitter specifications are not limited to single-operator patterns. Complex IR tree patterns may better express the relationship between target machine instructions and lcc's IR. For example, the R3000 lb instruction loads and sign-extends a 1-byte value into a 4-byte register. This corresponds to the IR pattern CVCI(INDIRC(x)). The specification for this complex pattern follows.

    reg: CVCI(INDIRC(reg)) = (1)
        "lb %0r, 0(%1r)\n"
        ""

The interpreter-generator may use this rule for any superoperators that include CVCI(INDIRC(x)).

5.1 Additional IR Operator

To reduce the size of bytecode attributes, one additional IR operator was added to lcc's original set: ADDRLb. lcc's ADDRLP node represents the offset of a local variable relative to the frame pointer. hti emits a 4-byte offset attribute for ADDRLP. ADDRLb is simply an abbreviated version of ADDRLP that requires only a 1-byte offset. Machine-independent back end code does this translation.
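A minimal sketch of that machine-independent choice follows. It is illustrative only: the bytecode numbers and emit helpers are hypothetical, and the point is just the fits-in-a-signed-byte test.

    #include <stdio.h>

    /* Hypothetical bytecode numbers and emit helpers, for illustration. */
    enum { ADDRLP_OP = 1, ADDRLb_OP = 2 };

    static void emit_byte(int b) { printf(".byte %d\n", b); }
    static void emit_word(int w) { printf(".word %d\n", w); }

    /* Emit the address-of-local operator, choosing the abbreviated
       1-byte form whenever the frame offset fits in a signed byte. */
    static void emit_local_address(int offset)
    {
        if (-128 <= offset && offset <= 127) {
            emit_byte(ADDRLb_OP);
            emit_byte(offset);          /* 1-byte offset attribute */
        } else {
            emit_byte(ADDRLP_OP);
            emit_word(offset);          /* 4-byte offset attribute */
        }
    }

    int main(void)
    {
        emit_local_address(8);          /* uses the short form */
        emit_local_address(4096);       /* needs the long form */
        return 0;
    }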
6 Implementation Details

Building an hti interpreter is a straightforward process. The following pieces are needed to build hti's translator and interpreter:

- A target machine/emitter specification.

- lcc back end code to handle data layout, calling conventions, etc.

- A library of interpreter routines for observing calling conventions.

- Machine-dependent interpreter-generator routines.

Figure 1 summarizes the sizes of the machine-dependent and independent parts of the system (lcc's front end is excluded).

The R3000-specific back end code and the interpreter library are much bigger than the SPARC's because of the many irregular argument passing conventions observed by C code on the R3000.
    Function                Language   Sizes (in lines)
                                       Machine Indep.   R3000   SPARC
    Target specification    grammar         -            351     354
    lcc back end            C             434            244     170
    interpreter library     asm             -            130      28
    interpreter generator   C             204             72      70

Figure 1: Implementation Details

7 System Obstacles

Unfortunately, hti's executables are slower and bigger than they ought to be because of limitations of system software on both R3000 and SPARC systems. The limitations are not intrinsic to the architectures or to hti; they are just the results of inadequate software.

Neither machine's assembler supports unaligned initialized data words or halfwords. This can cause wasted space between a bytecode and its (aligned) immediate data. Consequently, the interpreter must execute additional instructions to round its pc up to a 4-byte multiple before reading immediate 4-byte data. Initial tests indicate that approximately 17% of the bytes emitted by hti are wasted because of alignment problems.[1]

The R3000 assembler restricts the ability to emit position-relative initialized data. For instance, the following is illegal on the R3000:

    L99:
        .word 55
        .word .-L99

Position-relative data would allow hti to implement pc-relative jumps and branches. Pc-relative jumps can use 2-byte immediate values rather than 4-byte absolute addresses, thus saving space.

[1] I understand that the latest release of the R3000 assembler and linker supports unaligned initialized data, and that the R3000 has instructions for reading unaligned data. Unfortunately, I do not have access to these new tools.

8 Experimental Results

hti compiles C source into object code. Object code for each function consists of a native-code prologue, with interpreted bytecodes for the function body. Object files are linked together with appropriate C libraries (e.g., libc.a) and the interpreter. The executable may be compared to natively compiled code for both size and speed. The code size includes the function prologues, bytecodes, and one copy of the interpreter. Interpreter comparisons will depend on available superoperators.

Comparisons were made for three programs:

- burg: A 5,000-line tree pattern matcher generator, processing a 136-rule specification.

- hti: The 13,000-line translator and subject of this paper, translating a 1117-line C file.

- loop: An empty for loop that executes 10,000,000 times.

On the 33MHz R3000, hti is compared to a production quality lcc compiler. Because lcc's SPARC code generator is not available, hti is compared to acc, Sun's ANSI C compiler, on the 33MHz Sun 4/490. Because lcc does little global optimization, acc is also run without optimizations. hti is run both with and without the superoperator optimization enabled. Superoperators are inferred based on a static count of how many times each tree is emitted from the front end for that benchmark; hti+so represents these tests. The columns labelled hti represent the interpreter built with exactly one VM operator for each IR operator.
Figures 2 and 3 summarize the sizes of the code segments for each benchmark. "code" is the total of bytecodes, function prologues, and wasted space. "waste" is the portion wasted due to alignment restrictions. "interp" is the size of the interpreter. (The sizes do not include linked system library routines since all executables would use the same routines.)

The interpreted executables are slightly larger than the corresponding native code. The interpreted executables are large for a few reasons besides the wasteful alignment restrictions already mentioned. First, no changes were made to lcc's IR except the addition of ADDRLb, and lcc creates wasteful IR nodes. For instance, lcc produces a CVPU node to convert a pointer to an unsigned integer, yet this is a nop on both the R3000 and SPARC. Removing this node from IR trees would reduce the number of emitted bytecodes. Additionally, lcc produces IR nodes that require the same code sequences on most machines, like pointer and integer addition. Distinguishing these nodes hampers superoperator inference, and superoperators save space. Unfortunately, much of the space taken up by executables is for immediate values, not operator bytecodes. To reduce this space would require either encoding the sizes of the immediate data in new operators (like ADDRLb) or tagging the data with size information, which would complicate fetching the data.

Fortunately, hti produces extremely fast interpreters. Figures 4 and 5 summarize the execution times for each benchmark.

lcc does much better than acc relative to interpretation because it does modest global register allocation, which acc and hti do not do. lcc's code is 28.2 times faster than the interpreted code on loop because of register allocation. Excluding the biased loop results, interpreted code without superoperators is less than 16 times slower than native code, and sometimes significantly less. Furthermore, superoperators consistently increase the speed of the interpreted code by 2-3 times.

These results can be improved with more engineering and better software. Support for unaligned data would make all immediate data reads faster. Inferring superoperators based on profile information rather than static counts would make them have a greater effect on execution efficiency.

If space were not a consideration, the interpreter could be implemented in a directly threaded fashion to decrease operator decode time [Kli81]. The implementation of each VM operator is unrelated to the encoding of the operators, so changing from the current indirect table lookup to threading would not be difficult.
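As an illustration of that alternative, the 2+3 example can be direct-threaded in C using the labels-as-values (computed goto) extension of GCC; this sketch is not part of hti and its names are invented. Each code cell holds the address of its handler, so decoding is a single indirect jump with no table lookup.

    #include <stdio.h>

    int main(void)
    {
        long stack[8], *sp = stack;

        /* Threaded code: each cell is the address of a handler;
           immediates follow inline.  Computes 2+3, then halts. */
        static void *code[] = { &&cnsti, (void *)2L, &&cnsti, (void *)3L,
                                &&addi, &&halt };
        void **tp = code;

        goto **tp++;                /* start: one indirect jump per operation */

    cnsti:
        *sp++ = (long)*tp++;        /* read inline immediate, push */
        goto **tp++;
    addi:
        sp--;
        sp[-1] += sp[0];
        goto **tp++;
    halt:
        printf("%ld\n", sp[-1]);    /* prints 5 */
        return 0;
    }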
9 Limitations and Extensions

Almost certainly, each additional superoperator contributes decreasing marginal returns. I made no attempt to determine what the time and space tradeoffs would be if the number of superoperators were limited to some threshold like 10 or 20. I would conjecture that the returns for a given application diminish very quickly and that 20 superoperators realize the bulk of the potential optimization. The valuable superoperators for numerically intensive programs probably differ from those for pointer intensive programs. To create a single VM capable of executing many classes of programs efficiently, the 147 additional bytecodes could be partitioned into superoperators targeted to different representative classes of applications.

This system's effectiveness is limited by lcc's trees. The common C expression, x ? y, cannot be expressed as a single tree by lcc. Therefore, hti cannot infer superoperators to optimize its evaluation based on the IR trees generated by the front end. Of course, any scheme based on looking for common tree patterns will be limited by the operators in the given intermediate language.
    R3000 Code Size Summary (in bytes)

                   lcc       hti                        hti+so
    Benchmark      code      code      interp   waste   code      interp   waste
    burg           56576     92448     4564     15895   72616     12388    13862
    hti            230160    315040    4564     51868   289516    11296    64299
    loop           48        52        4564     4       44        600      6

Figure 2: R3000 Benchmark Code Sizes

    SPARC Code Size Summary (in bytes)

                   acc       hti                        hti+so
    Benchmark      code      code      interp   waste   code      interp   waste
    burg           75248     84720     4080     13560   63992     11840    10862
    hti            271736    292568    4080     41423   254808    10512    37507
    loop           80        56        4080     2       48        312      4

Figure 3: SPARC Benchmark Code Sizes

    R3000 Execution Summary

                   Times (in seconds)          Ratios
    Benchmark      lcc      hti      hti+so    hti/lcc   hti+so/lcc   hti/hti+so
    burg           1.65     14.04    7.07      8.5       4.3          2.0
    hti            2.69     42.83    23.81     15.9      8.8          1.8
    loop           1.53     43.12    13.88     28.2      9.1          3.1

Figure 4: R3000 Benchmark Code Speeds

    SPARC Execution Summary

                   Times (in seconds)          Ratios
    Benchmark      acc      hti      hti+so    hti/acc   hti+so/acc   hti/hti+so
    burg           1.78     18.52    8.37      10.4      4.7          2.2
    hti            4.39     58.24    28.73     13.3      6.5          2.0
    loop           6.62     61.26    20.05     9.3       3.0          3.1

Figure 5: SPARC Benchmark Code Speeds


hti generates bytecodes as .data assembler directives and function prologues as assembly language instructions. Nothing about the techniques described above is limited to such an implementation. The bytecodes could have been emitted into a simple array that would be immediately interpreted, much like in a traditional interpreter. This would require an additional bytecode to represent the prologue of a function, to mimic the currently executed assembly instructions. To make this work, the system would have to resolve references within the bytecode, which would require some additional machine-independent effort. (The system linker/loader resolves references in the currently generated assembler.) Not emitting function prologues of machine instructions would make seamless calls between interpreted and compiled functions very tricky, however.

10 Related Work

Many researchers have studied interpreters for high-level languages. Some were concerned with interpretation efficiency, and others with the diagnostic capabilities of interpretation.

Supercombinators optimize combinator-based functional-language interpreters in a way similar to how superoperators optimize hti. Supercombinators are combinators that encompass the functionality of many smaller combinators [FH88]. By combining functionality into a single combinator, the number of combinators to describe an expression is reduced and the number of function applications necessary to evaluate an expression is decreased. This is analogous to reducing the number of bytecodes emitted and fetched through superoperator optimization.

Pittman developed a hybrid interpreter and native code system to balance the space/time tradeoff between the two techniques [Pit87]. His system provided hooks for escaping interpreted code to execute time-critical code in assembly language. Programmers coded directly in both interpreted operations and assembly.

Davidson and Gresch developed a C interpreter, Cint, that, like hti, maintained C calling conventions in order to link with native code routines [DG87]. Cint was written entirely in C for easy retargetability. Cint's VM is similar to hti's; it includes a small stack-based operator set. On a set of small benchmarks the interpreted code was 12.4-42.6 times slower than native code on a VAX-11/780, and 20.9-42.5 times slower on a Sun-3/75. Executable sizes were not compared.

Kaufer, et al., developed a diagnostic C interpreter environment, Saber-C, that performs approximately 70 run-time error checks [KLP88]. Saber-C's interpreted code is roughly 200 times slower than native code, which the authors attribute to the run-time checks. The interpreter implements a stack-based machine, and maintains calling conventions between native and interpreted code. Unlike hti, interpreted functions have two entry points: one for being called from other interpreted functions, and another for native calls, with a machine-code prologue.

Similarly, Feuer developed a diagnostic C interpreter, si, for debugging and diagnostic output [Feu85]. si's primary design goals were quick translation and flexible diagnostics; time and space efficiency were not reported.

Klint compares three ways to encode a program for interpretation [Kli81]. The methods are "Classical," "Direct Threaded" [Bel73], and "Indirect Threaded." Classical, employed by hti and Cint, encodes operators as values such that the address of the corresponding interpreter code must be looked up in a table. Direct Threaded encodes operations with the addresses of the corresponding interpreter code. Indirect Threaded encodes operations with pointers to locations that hold the actual code addresses. Klint concludes that the Classical method gives the greatest compaction because it is possible to use bytes to encode values (or even to use Huffman encoding) to save space. However, the Classical method requires more time for the table lookup.

11 Discussion

hti translates ANSI C into tight, efficient code that combines a small amount of native code with interpreted code. This hybrid approach allows the object files to maintain all C calling conventions so that they may be freely mixed with natively compiled object files. The interpreted object code is approximately the same size as equivalent native code, and runs only 3-16 times slower.

Much of the interpreter's speed comes from being implemented in assembly language. Retargeting the interpreter is simplified using compiler-writing tools like burg and special-purpose machine specifications. For the MIPS R3000 and the SPARC, each machine required fewer than 800 lines of machine-specific code to be retargeted.
Superoperators, which are VM operations that represent the aggregate functioning of many connected simple operators, make the interpreted code both smaller and faster. Tests indicate superoperators can double or triple the speed of interpreted code. Once specified by the interpreter developer, new superoperators are automatically incorporated into both the translator and the interpreter. Furthermore, heuristics can automatically isolate beneficial superoperators from static or dynamic feedback information for a specific program or for an entire suite of programs.

12 Acknowledgements

Chris Fraser provided useful input on this work.

References

[Bel73]  James R. Bell. Threaded code. Communications of the ACM, 16(6):370-372, June 1973.

[DG87]  J. W. Davidson and J. V. Gresch. Cint: A RISC interpreter for the C programming language. In Proceedings of the SIGPLAN '87 Symposium on Interpreters and Interpretive Techniques, pages 189-198, June 1987.

[Feu85]  Alan R. Feuer. si -- an interpreter for the C language. In Proceedings of the 1985 Usenix Summer Conference, Portland, OR, June 1985.

[FH88]  Anthony J. Field and Peter G. Harrison. Functional Programming. Addison Wesley, 1988.

[FH91a]  Christopher W. Fraser and David R. Hanson. A code generation interface for ANSI C. Software--Practice and Experience, 21(9):963-988, September 1991.

[FH91b]  Christopher W. Fraser and David R. Hanson. A retargetable compiler for ANSI C. SIGPLAN Notices, 26(10), October 1991.

[FHP92]  Christopher W. Fraser, Robert R. Henry, and Todd A. Proebsting. BURG -- fast optimal instruction selection and tree parsing. SIGPLAN Notices, 27(4):68-76, April 1992.

[GG90]  Ralph E. Griswold and Madge T. Griswold. The Icon Programming Language. Prentice Hall, 1990.

[GJ79]  M. R. Garey and D. S. Johnson. Computers and Intractability: A Guide to the Theory of NP-Completeness. W. H. Freeman and Company, 1979.

[KH92]  Gerry Kane and Joe Heinrich. MIPS RISC Architecture. Prentice Hall, 1992.

[Kli81]  Paul Klint. Interpretation techniques. Software--Practice and Experience, 11(10):963-973, October 1981.

[KLP88]  Stephen Kaufer, Russell Lopez, and Sesha Pratap. Saber-C: An interpreter-based programming environment for the C language. In Proceedings of the 1988 Usenix Summer Conference, San Francisco, CA, June 1988.

[Pit87]  T. Pittman. Two-level hybrid interpreter/native code execution for combined space-time program efficiency. In Proceedings of the SIGPLAN '87 Symposium on Interpreters and Interpretive Techniques, pages 150-152, June 1987.

[Sun91]  Sun Microsystems, Inc. The SPARC Architecture Manual (Version 8), 1991.
