The Python Compiler for CMU Common Lisp
Robert A. MacLachlan
Carnegie Mellon University
CMU Common Lisp is portable, but does not sacrifice efficiency or functionality for ease of retargeting. It currently runs on three processors: MIPS R3000, SPARC, and IBM RT PC. Python has 50,000 lines of

three representations for multiple values. The + function has 6 transforms, 8 code generators and a type inference method. This functionality is not without a cost; Python is bigger and slower than commercial
Eager — Run-time type checks are done at the location of the first type assertion, rather than being delayed until the value is actually used.

Explicit type declarations are neither trusted nor ignored; declarations simply provide additional assertions which the type checker verifies (with a run-time test if necessary) and then exploits. This eager type checking allows the compiler to use specialized variable representations in safe code.

Because declared types are precisely checked, declarations become more than an efficiency hack. They provide a mechanism for introducing moderately complex consistency assertions into code. Type declarations are more powerful than assert and check-type because they make it easier for the compiler to propagate and verify assertions.

In: defun type-error
  (incf x 3)
--> +
==>
  (the single-float 3)
Warning: This is not a single-float:
  3

In addition to detecting static type errors, flow analysis detects simple dynamic type errors like:

(defun foo (x arg)
  (ecase x
    (:register-number
     (and (integerp arg) (>= x 0)))
    ...))

Gives this compiler warning:

Precise, eager type checking lends itself to a new coding style, where:
● Declarations are specified to be as precise as possible: (integer 3 7) instead of fixnum, (or node null) instead of no declaration.

● Declarations are written during initial coding rather than being postponed until tuning, since declarations now enhance debuggability rather than hurting it.

viding an automated efficiency model. In a compiler, accountability means that efficiency notes:

● are needed whenever an inefficient implementation is chosen for non-obvious reasons,

● should convey the seriousness of the situation,

● must explain precisely what part of the program is being referred to, and

● must explain why the inefficient choice was made (and how to enable efficient compilation.)

Efficiency Note Example

Consider this example of "not good enough" declarations:

(defun eff-note (s i x y)
  (declare (string s) (fixnum i x y))
  (aref s (+ i x y)))

which gives these efficiency notes:

In: defun eff-note
  (+ i x y)
==>
  (+ (+ i x) y)

the use of a dynamic-binding interpreter than to any great complexity of the debuggers themselves.

There has been a trend toward compiling Lisp programs during development and debugging. The first-order reason for this is that in modern implementations, compiled code is so much faster than the interpreter that interpreting complex programs has become impractical. Unfortunately, compiled debugging makes interpreter-based debugging tools like *evalhook* steppers much less available. It is usually not even possible to determine a more precise error location than "somewhere in this function."

One solution is to provide source-level debugging of compiled code. Although the idea has been around for a while [18], source level debugging is not yet available in the popular stock hardware implementations. To provide useful source-level debugging, CMU Common Lisp had to overcome these implementation difficulties:

● Excessive space and compile-time penalty for dumping debug information.
● Excessive run-time performance penalty because debuggability inhibits important optimizations like register allocation and untagged representation.

● Poor correctness guarantees. Incorrect information is worse than none, and even very rare incorrectness destroys user confidence in the tool.

The Python symbolic debugger information is compact, allows debugging of optimized code and provides precise semantics. Source-based debugging is supported by a bidirectional mapping between locations in source code and object code. Due to careful encoding, the space increase for source-level debugging is less than 35%. Precise variable liveness information is recorded, so potentially incorrect values are not displayed, giving what Zellweger[22] calls truthful behavior.

4.1 Debugger Example

This program demonstrates some simple source-level debugging features:

(defstruct my-struct
  (slot nil :type (or integer null)))

  #s(my-struct slot 2)
  #p"foo")

; evaluation of local variables in expressions, and
6] (my-struct-slot (cadr structs))
2

; display of the error location,
6] source 1
(oddp (#:***here*** (my-struct-slot struct)))

; and you can jump to editing the exact error location.
6] edit

4.2 The Debugger Toolkit

The implementation details of stack parsing, debug info and code representation are encapsulated by an interface exported from the debug-internals package. This makes it easy to layer multiple debugger user interfaces onto CMU CL. The interface is somewhat similar to Zurawski[23]. Unlike Smalltalk, Common Lisp does not define a standard representation of activation records, etc., so the toolkit must be defined as a language extension.
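The session excerpts above are fragmentary in this scan. As a rough reconstruction of the kind of program that produces such a break loop (the driver function and its name are invented here for illustration; only my-struct, its slot accessor, and the default constructor come from the paper's defstruct):

```lisp
;; Hypothetical driver -- not from the paper.  my-struct and
;; my-struct-slot are defined by the defstruct above; find-odd
;; and its argument list are illustrative.
(defun find-odd (structs)
  ;; Signals a run-time type error at the first struct whose
  ;; slot is still nil: (oddp nil) is a type error.
  (dolist (struct structs)
    (when (oddp (my-struct-slot struct))
      (return struct))))

(find-odd (list (make-my-struct)            ; slot defaults to nil
                (make-my-struct :slot 2)))
```

In the resulting break loop, "6]" is the debugger prompt; source displays the original source form with #:***here*** marking the error location, and edit jumps the editor to that exact form.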
Python increases the efficiency of abstraction in several ways:

● Extensive partial evaluation at the Lisp semantics level,

● Efficient compilation of the full range of Common Lisp values, arrays, numbers, structures, and

● Block compilation and inline expansion (synergistic with partial evaluation and untagged representations.)

5.1 Partial Evaluation

Partial evaluation is a generalization of constant folding[9]. Instead of only replacing completely constant expressions with the compile-time result of the expression, partial evaluation transforms arbitrary code using information on parts of the program that are constant. Here are a few sample transformations:

Source transformation systems have demonstrated that partial evaluation can make abstraction efficient by eliminating unnecessary generality[15, 4], but only a few non-experimental compilers (such as Orbit[9]) have made comprehensive use of this technique. Partial evaluation is especially effective in combination with inline expansion[19, 13].

5.2 Partial Evaluation Example

Consider this analog of find which searches in a

(declaim (inline find-in))
(defun find-in (next element list &key (key #'identity)
                                       (test #'eql test-p)
                                       (test-not nil not-p))
  (when (and test-p not-p)
    (error "Both :Test and :Test-Not supplied."))
  (if not-p
      (do ((current list (funcall next current)))
          ((null current) nil)
        (unless (funcall test-not (funcall key current) element)
          (return current)))
      (do ((current list (funcall next current)))
          ((null current) nil)
        (when (funcall test (funcall key current) element)
          (return current)))))

When inline expanded and partially evaluated, this transformation results:

(find-in #'foo-next 3 foo-list

by making inline functions efficient enough so that macros can be reserved for cases where functional abstraction is insufficient. Such optimization also makes it simpler to write macros because unnecessary conditionals and variable bindings can be introduced into the expansion without hurting efficiency.

5.4 Block Compilation

Block compilation assumes that global functions will not be redefined, allowing the compiler to see across
only effect that block compilation has on debuggability is that the entire block must be recompiled when any function in it is redefined. Often block compilation is first used during tuning, so full-block recompilation is rarely necessary during development.

Block compilation reduces the basic call overhead because the control transfer is a direct jump and argument syntax checking is safely eliminated. Python also does keyword and optional argument parsing at compile-time.

Another advantage of block compilation is that locally optimized calling conventions can be used. In particular, numeric arguments and return values are passed in non-pointer registers.

In addition to reductions in the call overhead itself, block compilation is important because it extends type inference and partial evaluation across function boundaries. Block compilation allows some of the benefits of whole-program compilation[3] without excessive compile-time overhead or complete loss of incremental redefinition.

Python provides an extension which declares that only certain functions in a block compilation are entry points that can be called from outside the compilation unit. This is a syntactic convenience that effectively converts the non-entry defuns into an enclosing labels form. Whichever syntax is used, the entry point information allows greatly improved optimization because the compiler can statically locate all calls to block-local functions.

6 Numeric Efficiency

Efficient compilation of Lisp numeric operations has long been a concern[16, 6]. Even when type inference is sufficient to show that a value is of an appropriate numeric type, efficient implementation is still difficult because the use of tagged pointers to represent all objects is incompatible with efficient numeric representations.

Python offers a combination of several features which make efficient numeric code much easier to attain:

● Block compilation increases the range over which efficient numeric representations can be used.

● Efficiency notes provide reliable feedback about numeric operations that were not compiled efficiently.

In addition to open-coding operations on single and double floats and fixnums, Python also implements full-word integer arithmetic on signed and unsigned 32 bit operands. Word-integer arithmetic is used to implement bignum operations, resulting in unusually good bignum performance.

7 Implementation

Python's structure is broadly characterized by the compilation phases and the data structures (or intermediate representations) that they manipulate. Two major intermediate representations are used:

● The Implicit Continuation Representation (ICR) represents the Lisp-level semantics of the source code during the initial phases. ICR is roughly equivalent to a subset of Common Lisp, but is represented as a flow-graph rather than a syntax tree. The main ICR data structures are nodes, continuations and blocks. Partial evaluation and semantic analysis are done on this representation. Phases which only manipulate ICR comprise the "front end".

● The Virtual Machine Representation (VMR) represents the implementation of the source code on a virtual machine. The virtual machine may vary depending on the target hardware, but VMR is sufficiently stylized that most of the phases which manipulate it are portable. All values are stored in Temporary Names (TNs). All operations are Virtual Operations (VOPs), and their operands are either TNs or compile-time constants.

When compared to the intermediate representations used by most Lisp compilers, ICR is lower level than an abstract syntax tree front-end representation, and VMR is higher level than a macro-assembler back-end representation.
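To make the two levels concrete, here is a hedged sketch (the notation is invented for exposition; Python's actual printed ICR and VMR differ, though the VOP name +/single-float does appear in the section 15.1 example):

```lisp
;; Source:
(defun f (x y)
  (declare (single-float x y))
  (+ x y))

;; VMR level (sketch only): TNs are annotated with nothing but a
;; primitive type; VOPs generate the actual instructions.
;;   t1, t2, t3 : primitive type single-float
;;   VOP +/single-float  t1 t2 => t3
```

Only register allocation later commits these TNs to particular hardware registers.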
ICR conversion Convert the source into ICR, doing macroexpansion and simple source-to-source transformation. All names are resolved at this time, so later source transformations can't introduce spurious name conflicts. See section 9.

Local call analysis Find calls to local functions and convert them to local calls to the correct entry point, doing keyword parsing, etc. Recognize once-called functions as lets. Create entry stubs for functions that are subject to general function call.

ICR optimize A grab-bag of all the non-flow ICR optimization. Fold constant functions, propagate types and eliminate code that computes unused values. Source transform calls to some known global functions by replacing the function with a computed inline expansion. Merge blocks and eliminate if-if constructs. Substitute let variables. See section 10.

Constant propagation This phase uses global flow analysis to propagate information about lexical variable values (and their types), eliminating unnecessary type checks and tests.

TN allocation Iterate over all defined functions, determining calling conventions and assigning TNs to local variables. Use type and policy information to determine which VMR translation to use for known functions, and then create TNs for expression evaluation temporaries. Print efficiency notes when a poor translation is chosen.

Control analysis Linearize the flow graph in a way that minimizes the number of branches. The block-level structure of the flow graph is frozen at this point (see section 12.3.)

Stack analysis Discard stack multiple values that are unused due to a local exit making the values receiver unreachable. The phase is only run when TN allocation actually uses the stack values representation.

Representation selection Look at all references to each TN to determine which representation has the lowest cost. Emit appropriate VOPs for coercions to and from that representation. See section 13.

Register allocation Use flow analysis to find the live set at each call site and to build the conflict graph. Find a legal register allocation, attempting to minimize unnecessary moves. Registers are saved across calls according to precise lifetime information, avoiding unnecessary memory references. See section 14.

ICR is a flow graph representation that directly supports flow analysis. The conversion from source into ICR throws away large amounts of syntactic information, eliminating the need to represent most environment manipulation special forms. Control special forms such as block and go are directly represented by the flow graph, rather than appearing
as pseudo-expressions that obstruct the value flow. This elimination of syntactic and control information removes the need for most "beta transformation" optimizations[17].

The set of special forms recognized by ICR conversion is exactly that specified in the Common Lisp standard; all macros are really implemented using macros. The full Common Lisp lambda is implemented with a simple fixed-arg lambda, greatly simplifying later compiler phases.

a transfer of control[1]. This is the Implicit Continuation

entry Marks the start of a dynamic extent that can have non-local exits to it. Dynamic state can be saved at this point for restoration on re-entry.

exit Marks a potentially non-local exit. This node is interposed between the non-local uses of a continuation and the value receiver so that code to do a non-local exit can be inserted if necessary.

Some slots are shared between all node types (via defstruct inheritance.) This information held in common between all nodes often makes it possible to avoid special-casing nodes on the basis of type. The common information primarily concerns the order of evaluation and the destinations and properties of results. This control and value flow is indicated in the node by references to continuations.

9.3 Blocks

a block represents only the destination for the result.

● Use/definition information is available, making substitutions easy.

● All lexical and syntactic information is frozen during ICR conversion (alphatization[17]), so transformations can be made with no concern for variable name conflicts, scopes of declarations, etc.
● During ICR conversion this syntactic scoping information is discarded and the environment is made implicit, causing further canonicalization:

(+ 3 (unwind-protect 42 (gork)))

As the last example shows, this holds true even when the value position is not tail-recursive.

11 Type Inference

In order to support type inferences involving complex types, type specifiers are parsed into an internal representation. This representation supports subtypep, type simplification, and the union, intersection and difference of arbitrary numeric subranges, member and or types. Function types are used to represent function argument signatures. Baker[2] describes an efficient and relatively simple decision procedure for subtypep, but it doesn't support accountable type inference, since inferred types can't be inverted into intelligible type specifiers for reporting back to the user.

tates the nodes with a target-dependent Virtual Machine (VM) Representation. VMR is a generalized tuple representation derived from PQCC[11]. There are only two major objects in VMR: VOPs (Virtual Operations) and TNs (Temporary Names). All instructions are generated by VOPs and all run-time values are stored in TNs.

applied to the value. A primitive type may have multiple possible representations (see Representation Selection).

VMR conversion creates as many TNs as necessary, annotating them only with their primitive type. This keeps VMR conversion from needing any information about the number or kind of registers, etc. It is the responsibility of Register Allocation to efficiently map the allocated TNs onto finite hardware resources.
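Because the internal representation computes exact unions, intersections and differences of numeric subranges, queries that stump a naive implementation can be answered definitively. A small illustration in terms of the standard subtypep interface (the specific calls are illustrative; the results shown assume an implementation with precise range arithmetic, as described here):

```lisp
;; (integer 0 4) and (integer 5 9) together cover (integer 0 9)
;; exactly, so the subtype relation holds in both directions:
(subtypep '(or (integer 0 4) (integer 5 9)) '(integer 0 9))
(subtypep '(integer 0 9) '(or (integer 0 4) (integer 5 9)))
```

A type system that merely compares specifiers structurally cannot see that the second relation holds; range arithmetic on the parsed representation makes it a simple union computation.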
12.2 Virtual Operations

The PQCC VM technology differs from a conventional virtual machine in that it is not fixed. The set of VOPs varies depending on the target hardware. The default calling mechanisms and a few primitives are implemented using standard VOPs that must be implemented by each VM. In PQCC, it was a rule of thumb that a VOP should translate into about one instruction. VMR uses a number of VOPs that are much more complex (e.g. function call) in order to hide implementation details from VMR conversion.

Calls to primitive functions such as + and car are translated to VOP equivalents using declarative information in the particular VM definition; VMR conversion makes no assumptions about which operations are primitive or what operand types are worth special-casing.

tional is favored by allowing it to drop through; the right decision gives optimizations such as loop rotation. A poor ordering would result in unnecessary unconditional branches.

indicating whether the sense of the test is negated. An unconditional branch VOP is emitted afterward if the other path isn't a drop-through.

13 Representation selection

Some types of object (such as single-float) have multiple possible representations [6]. Multiple representations are useful mainly when there is a particularly efficient untagged representation. In this case, the compiler must choose between the normal tagged pointer representation and an alternate untagged representation. Representation selection has two subphases:

● TN representations are selected by examining all the TN references and choosing the representation with the lowest cost.

● Representation conversion code is inserted where the representation of a value is changed.

This phase is in effect a pre-pass to register allocation. The main reason for its separate existence is that representation conversions may be fairly complex (e.g. themselves requiring register allocation), and thus must be discovered before register allocation.

14 Register allocation

Consists of two passes, one which does an initial register allocation, and a post-pass that inserts spilling code to satisfy violated register allocation constraints.

The interface to the register allocator is derived from the extremely general storage description model developed for the PQCC project [12]. The two major concepts are:

the underlying SB, with an associated element size. Examples are descriptor-reg (based on the register SB), and double-float-stack (based on the number-stack SB, element size 2, offsets

conceivable set of hardware storage resources, the SC/SB model is synergistic with representation selection. Representations are SCs. The set of legal representations for a primitive type is just a list of SCs. This is above and beyond the advantages noted in [6] of having a general-purpose register allocator as in the TNBind/Pack model [10].

15 Virtual Machine Definition

The VM description for an architecture has three major parts:

● The SB and SC definition for the storage resources, with their associated primitive types.

● The instruction set definition used by the assembler, assembly optimizer and disassembler.

● The definitions of the VOPs.
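The first of these parts can be pictured with a hedged sketch (the macro names follow the storage-base/storage-class vocabulary of sections 14 and 15, but the exact definition syntax is illustrative, not a verbatim VM file):

```lisp
;; Sketch of SB and SC definitions -- syntax is illustrative.
;; Storage bases name the raw hardware resources:
(define-storage-base registers    :kind :finite :size 32)
(define-storage-base number-stack :kind :unbounded :size 8)

;; Storage classes name the ways a TN can be packed into an SB.
;; descriptor-reg and double-float-stack are the examples given
;; in section 14 (a double-float occupies two stack elements):
(define-storage-class descriptor-reg 0 registers)
(define-storage-class double-float-stack 1 number-stack
                      :element-size 2)
```

Each primitive type then simply lists the SCs that are legal representations for it, which is what ties the storage model to representation selection.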
15.1 VOP definition example

VOPs are defined using the define-vop macro. Since classes of operations (memory references, ALU operations, etc.) often have very similar implementations, define-vop provides two mechanisms for sharing the implementation of similar VOPs: inheritance and VOP variants. It is common to define an "abstract VOP" which is then inherited by multiple real VOPs. In this example, single-float-op is an abstract VOP which is used to define all dyadic single-float operations on the MIPS R3000 (only the + definition is shown.)

(define-vop (single-float-op)
  (:args (x :scs (single-reg))
         (y :scs (single-reg)))
  (:results (r :scs (single-reg)))
  (:variant-vars operation)
  (:policy :fast-safe)
  (:note "inline float arithmetic")
  (:vop-var vop)
  (:save-p :compute-only)
  (:generator 2
    (note-this-location vop :internal-error)
    (inst float-op operation :single r x y)))

(define-vop (+/single-float single-float-op)
  (:variant '+))

when the programmer can assume a high level of optimization.

The main cost of this additional functionality is in compile time: compilations using Python may be as much as 3x longer than they are for the quickest commercial Common Lisp compilers. However, on modern workstations with at least 24 megabytes of memory, incremental recompilation times are still quite acceptable. A substantial part of the slowdown results from poor memory system performance caused by Python's large code size (3.4 megabytes) and large data structures (0.3-3 megabytes.)

17 Summary

● Strict type checking and source-level debugging make development easier.

● It is both possible and useful to give compile-time warnings about type errors in Common Lisp programs.

● Optimization can reduce the overhead of type checking on conventional hardware.

● The difficulty of producing efficient numeric and array code in Common Lisp is largely due to the compiler's opacity to the user. The lack of a simple efficiency model can be compensated for by making the compiler accountable to the user for its optimization decisions.
This research was sponsored by The Defense Advanced Research Projects Agency, Information Science and Technology Office, under the title "Research on Parallel Computing", ARPA Order No. 7330, issued by DARPA/CMO under Contract MDA972-90-C-0035. The views and conclusions contained in this document are those of the authors and should not be interpreted as representing the official policies, either expressed or implied, of the U.S. Government.

[4] BEER, R. The compile-time type inference and

Conference (1977), NASA.

[5] BROOKS, R. A., AND GABRIEL, R. P. A critique of Common Lisp. In ACM Conference on Lisp and Functional Programming (1984), pp. 1-8.

[6] BROOKS, R. A., GABRIEL, R. P., AND STEELE JR., G. L. An optimizing compiler for lexically scoped Lisp. In ACM Symposium on Compiler Construction (1982), ACM, pp. 261-275.

[7] GABRIEL, R. P. Lisp: Good news, bad news,

[11] LEVERETT, B. W., CATTELL, R. G. G., HOBBS, S. O., NEWCOMER, J. M., REINER, A. H., SCHATZ, B. R., AND WULF, W. A. An overview of the production quality compiler-compiler project. Computer 13, 8 (Aug. 1980), 38-49.

[12] REINER, A. Cost minimization in register assignment for retargetable compilers. Tech. Rep. CMU-CS-84-137, Carnegie Mellon University School of Computer Science, June 1984.

[18] TEITELMAN, W., ET AL. Interlisp Reference Manual. Xerox Palo Alto Research Center, 1978.

[19] WEGMAN, M. N., AND ZADECK, K. Constant propagation with conditional branches. Tech. Rep. CS-89-36, Brown University, Department of Computer Science, 1989.

[20] WINOGRAD, T. Breaking the complexity barrier again. In ACM SIGPLAN-SIGIR Interface Meeting (1973), ACM, pp. 13-22.