Professional Documents
Culture Documents
Nonfictionsrivastav A 94 Atom
Nonfictionsrivastav A 94 Atom
Nonfictionsrivastav A 94 Atom
Amitabh Srivastava and Alan Eustace Digital Equipment Western Research Laboratory 250 University Ave., Palo Alto, CA 94301 {amitabh, eustace}(!decwrl .pa. dec. com
Abstract
ATOM building (Analysis Tools with OM) is a single framework infrastructure for a wide range of customized program analysis tools. the common present in all codeand time-consuming details in Building a basic block tools; this is the difficult and analysis routines.
Introduction
important for underon new writor Computer architects need such perform
Program analysis tools are extremely standing program behavior. architectures. Software
tools to evaluate how well programs will programs and identify scheduling
ers often use such tools to find out how well their instruction or branch prediction is performing to provide input for profile-driven optimizations. The tirst
technology,
Over the past decade three classes of tools for different machines and applications Epoxie[14] have been developed. counting class consists of basic block basic block is executed. tools like Pixie[9],
analysis routines run in the same address space. Informapassed from the application simple procedure communication analysis routines through calls instead of
and QPT[8] that count the number of times each The second class consists of adtraces.
or files on disk.
care that anatysis routines do not interfere with the programs execution, and precise information simulation ATOM OSF/1. memory pipeline or interpretation. has been implemented It is efficient recording, simulation, on the Alpha AXP under profiling, dynamic and inand has been used to build a diverse counting, instruction evaluating and data cache simulation, branch prediction, about the program is preATOM uses no sented to the analysis routines at all times.
Pixie and QPT also generate address traces and communicate trace data to analysis routines through inter-process communication. communicated but it collects ifying ulators. Tracing and analysis on the WRL Titan [3] via shared memory MPTRACE but required operating [6] is also similar to Pixie by instrumenting
and saves a compressed trace in a file that The third class of tools consists of simsupports multiprocessor simulation simulation assembly language code. g88[2] PROTEUS [4] 88000 selec-
struction scheduling. Permission to cop without fee all or part of this material is granted provided t{ at the copies are not made or distributed for direct commercial advantage, the ACM copyright notice and the title of the publication and its date appear, and notice is given that copying is by permission of the Association of Computing Machinery. To copy otherwise, or to republish, requires a fee and/or specific permission. SIGPLAN 94-6/94 Orlando, Florida USA Q 1994 ACM 0-89791 -662-xJ9410006..$3.5O
Tango Lite[7]
simulates Motorola
techniques.
address the problem of large address traces by allowing simulation. These existing tools have several limitations.
First, most tools are designed to perform a single specific type of instrumentation, tracing. Modifying typically block counting or address these tools to produce more detailed
196
is difficult.
A tool generating
Second, most address tracing tools compute detailed address information. However, As ATOM of compiler works on object modules, it is independent and language systems. tion renders the tool inefficient tire instruction extremely for the user. For example, a branches run
user interested in branch behavior has to sift through the entrace, even though only conditional The instruction need to be examined. into gigabytes. Third, tools based on instruction-level overheads to the processing time. been used to make the simulation system, but simulation many times slower. Fourth, tools such as Tango Lite, which instrument assembly language code, change the application addresses. Instrumenting all libraries have to be available lection mechanisms. communication, programs heap as library routines is inconvenient simulation add large and address traces are and typically
In this paper, we describe the design and implementation of ATOM. We show through a real example how to build tools. Finally, we evaluate the systems performance.
Design of ATOM
is based on the observation all can be accomplished that al-
Several techniques have faster, such as in the Shade The design of ATOM appear vastly different, menting basic block counting store instruction, each conditional though tasks like basic block counting and cache simulation by instruof each a program at a few selected points. tools instrument instrument For example, each load and ATOM al-
the beginning
Finally, most address tracing tools provide trace data colData in form of address traces is communicated to the data analysis routines through inter-process or files on disk, Both are expensive, and the large size of address traces further aggravates this problem. Using a shared buffer reduces this expense but still requires a lot of process switching efficiently and sometimes can be implemented by providing the prinonly with changes to the operating system. in building performance tools. The important systems are listed
lows a procedure call to be inserted before or after any program, procedure, basic block, or instruction. viewed as a linear collection collection instructions. Furthermore, the common the infrastructure ATOM separates the tool-specific needed in all tools. manipulation for object-code part from It provides and a highThe user infrastructure A program is of of procedures, procedures as a
ATOM overcomes these limitations cipal ingredient below. ATOM is a tool-building ranging can be easily built. features that distinguish
level view of the program in object-module defines the tool-specific indicating the points in the application
form.
it from previous
mented, the procedure calls to be made, and the arguments system. A diverse set of tools to cache modeling to be passed. The user also provides code for these procedures in the analysis routines. both the application of print The user specATOM1 Figure 1. In the first step, common the users instrumentation of data is through procedure calls. proThis tool will instrument machinery is combined program with routines to build a custom tool. an application at points routines. application program so
f
share any procedures or data with the application in all codesame library procedure, like print in the final executable, internally works
f,
ATOM provides the common infrastructure instrumenting user simply specifies the tool details. ATOM allows selective instrumentation. ifies the points in the application ments to be passed. The communication Information
program and the analysis routines use the there are two copies one in the application in
program and the other in the analysis routines. in two steps, as shown program to be irMru-
specified by the users instrumentation cation program to build an instrumented executable. The instrumented from application that information
gram to the specified analysis routine with a procedure call instead of through mechanism. Even though the anatysis routines run in the same address space as the application, precise information about communication, files on disk, or a shared buffer with central dispatch
In the second step, this custom tool is applied to the appliexecutable is organized
program is communicated
1ExtemaUy, the user specifies: atomprog in.rt.c ard.c -o Prog.atom to produce dre instrumented program prog.atom.
197
user application
applia2a3ion
analysis output
user instruouenting n
user a~:~is
application output
directly to procedures in the analysis routines through procedure calls. The data is passed as arguments to the handling routine in the requested form, and does not have to go through a central dispatch mechanism. To reduce the communication plication address space. ATOM partitions places the application More importantly, with the information plication program such that they do not interfere to a procedure call, the apthe symbol name space and with each others execution. program and the analysis routines run in the same and analysis routines in the executable
program object module that is to be instrumented, the instrumentation the analysis routines. ify where the application The instrumentation
program is to be instrumented
what procedure calls are to be made. The user provides code for these procedures in the analysis routines. sections show how to write the instrumentation routines for our example tool. The next two and analysis
the analysis routine is always presented (data and text addresses) about the apas if it was executing uninstrumented. Our branch counting tool needs to examine all the conditional branches in the program. We traverse the program a procein the basic block is a conditional The instrumentation the dure at a time, and examine each basic block in the procedure. If the last instruction branch, we instrument ATOM
Instrument
Defhing
Instrumentation
Routines
Section 4 describes how the system guarantees the precise information. ATOM, piler modules. built using OM[l Since OM 1], is independent to work of any comwith different and language system because it operates on objectis designed
the instruction.
architectures,
All
modules
Tools:
In this section we show how to build a simple tool that counts how many times each conditional are written to a file. zoM ~a~ inltiauy implemented on the DECStations running under ULTRIX and was ported to Alpha AXP running under OSF/1. UL~IX, DECStatlon and Alpha AXP are trademarks of Digital Equipment Corporation.
program. guments.
This
to correctly o primitive
is
The AddCal
is used to define
PrintBranch,
taken and how many times it is not taken. The final results
OpenFi.le,
s~e Instmment procedure takes argc and argv as arguments which can be optionally passedfrom the atom command line.
198
verses the program dure, Get First Instrument(int { Proc *p; Block *b; Inst *insq int nbranch = O; AddCaBProto(OpenFile(int)); AddCaBProto(CondBranch(int, AddCalWroto(PrintBranch(int, AddCallProto(CloseFileo); for(p=GetFirstProco; p!=NULL;p=GetNextProc(p)){ for(b=GetFirstBlock(p) ;b!=NULL;b=GetNextBlock(b)){ inst = GetLastInst(b); if(IsInstType(inst, InstTypeCondBr)){ AddCallInst(inst, InstBefore, CondBrattch, nbranch,BrCondValue); AddCallProgram(ProgramAfter, nbranch++; PrintBranch, nbranch, InstPC(inst)); VALUE)); long)); iargc, char **iargv)
GetNext Block
a procedure
at a time.
Block
primitives procedure.
the inner loop traverses all the basic blocks of a we are interested
I nst
We find the last instruction primitive branch using the Is I nst instructions are ignored.
other
With
primitive,
ore
The Inst
argument specifies that the call is to be is executed. The two arguments are the linear number of the vatue specifies
made before the instruction to be passed to CondBranch branch and its condition The AddCallP
(programBef ore)
the application
t e r) the application
) } }
AddCallProgram(ProgramBefore, AddCallProgram(ProgramAfter, } OpenFile, CloseFile); nbranch);
ize analysis routine data and print results at the end, respecA catl to OpenFi before the application program starts executing creates the branch statistics array and opens the output file. We insert calls for each branch to print its count at the end. PC (program counter) and its accumulated branch after the application Finally, the CloseFile
Note that these calls are made only once for each conditional program has finished executing. procedure is executed which closes
the output file. If more than one procedure is to be called at a point, the calls are made in the order in which they were Figure 2: Instrumentation Routines: Branch Counting Tool added by the instrumentation routines.
AIXIM
supports additional
types such
Defining
Analysis
Routines
The anatysis routines contain code and data for all procedures needed to analyze information tion program. in the instrumentation that is passed from the applicaThese include procedures that were specified routines but may contain other proce-
tual argument is not an integer but a register number, and contents of the specified argument or BrCondValue.
BrCondValue
dures that these procedures may call. The analysis routines do not share the code for any procedure with the application program, including Code
PrintBranch,
by load and
library routines.
OpenFi le, CondBranch,
branches and passes zero if the run-time evatttates to a false and a non-zero evaluates to true.
VALUE. CondBranch
for
procedures
were 3.
ATOM atlows the user to traverse the whole program by modeling a program as consisting of a sequence of proceGet First P roc
of branches to atlocate the branch statistics array. opens a file to print results. The CondBranch
re-
turns the first procedure in the program, and Get Next The outermost
for
P roc
loop tra-
199
4
#include <stdio.h> File Wile struct BranchInfo{ long taken; long notTaken; } *bstats; void OpenFile(int n){ BranchInfo));
Implementation
is built using OM[l
of ATOM
1], a link-time code modificaof object files builds a
ATOM
program,
intermediate
applies instrumentarepresenrou-
tation, and finally outputs an executable. the users instrumentation tines with OM using the standard linker to produce a custom tool. This tool is given as input the application the analysis routines. symbolic representations analysis routines. provide of the application representation requested. program and to build interIt uses OMS infrastructure
bstats = (strttctBrrmchInfo *) malloc (n * sizeof(struct file = fopen(btaken.out, fprintf(file, } void CondBranch(int if (taken) bstats[n] taken++; else bstats[n] .notTaken++; n, long taken){ w);
face with the intermediate the information intermediate in [11]. conveniently representation
annotated for procedure call insertions. pass builds the instrumented representation. This pass is
}
void PrintBranch(int n, long pc){ fprintf(file, OX%lX \t %d \t %d\n, PC, bstats[n].taken, } void CloseFileo{ fclose(file); } bstats[n].notTaken);
to organize the data and text sections in a specific is presented to the analysis routines at
order because ATOM has to ensure that precise information about the application all times. In this section, we first describe the extensions to the intermediate representation and the insertion of procedure calls. we describe how Next, we discuss how we minimize ATOM organizes the final executable. the number of registers
Inserting
Procedure
Calls
representation of OM to have a
slot for actions that may be performed before or after the entity is executed. The entity maybe a procedure, basic block,
ittSttWCtiOtt
crements the branch taken or branch not taken counters for the specified branch by examining gument.
Print Branch prints
1 primitives
annotate to the to
the condition
value arthe
by adding
a structure
to be inserted,
arguments
number of times the branch is taken and number of times it is not taken. CloseFi.le closes the output file.
be passed,
Cur-
type of the procedure must already have been added with the
action slot contains as multiple they list of all such actions at a point. to be perThe order will be
Collecting
Program
Statistics
ATOM is given as input the format, The
To find the branch statistics, fully linked application the instrumentation instrumented
formed in which
program
in object-module executable.
are added
so that calls
When this
representation
5An edge ~onnect~two basic blocks and representsthe transfer of control between them.
200
because all insertion is done on OMS intermediate tation and no address tixups are needed. ATOM, does not steal any registers from the application ters that may be modified
here
caller-save registers to save. Saving registers in the application being inserted, is not a good idea if there are more than a few registers to be saved, as it may cause code explosion. create a wrapper routine for each analysis procedure. and makes the call to the analysis routine. routine. Unfortunately,
It allocates space on the stack before the call, saves regisduring the call, restores the saved registers after the call and deallocates the stack space. This enables a number of mechanisms such as signals, setjmp and vfork to work correctly without needing any special attention. The calling conventions are followed in setting up calls to analysis routines. The first six arguments are passed in
wrapper routine saves and restores the necessary registers, The application in calls to program now calls the wrapper routine instead of the analysis this creates an indirection analysis routines. However, this has the advantage that it
registers and the rest are passed on the stack. The number of instructions needed to setup an argument depends on the type of the argument. For example, a 16-bit integer constant can a 32-bit constant in two instructions, and so on. Passing subroutine is within branch range, be built in 1 instruction,
makes no changes to the analysis code so it works well with a debugger like dbx. This is the default mechanism. ATOM provides an additional routines. facility in which the saves and restores of caller-save registers are added to the analysis No wrapper routines are created in this case. The extra space is allocated in the anrttysis routines stack frame. This requires bumping the stack frame and fixing stack references in the analysis routines as needed. This is more work but is more efficient level debugging. optimization as analysis routines are called directly. is available as a higher Since this modifies the analysis routines, it hampers sourceThis mechanism option. the analysis routines. The data flow
a 64-bit program counter in 3 instructions contents of a register takes 1 instruction. To make the call, a pc-relative instruction
otherwise, the value of the procedure is loaded in a register and a js r instruction is used for the procedure call. Therewhen a call is made This register the proceturn address register is always modified
so we always save the return address register. dures address for the js r instruction.
The number of registers that need to be saved and restored is reduced by examining summary information with inall regisof the analysis routines determines all Only these registers need to be the
Reducing
Procedure
Call Overhead
the registers that may be modified when the control reaches a The application terprocedural not follow program may have been compiled conventions 8. Therefore, particular analysis procedure. number of different routines. Moreover, if an analysis routine contains procedure calls and delay the saves of other to other analysis routines, we save only the registers directly used in this analysis routine none of theprocedttrecalls path the program normally takes. registers to procedures that maybe called. We only do this if occur in a loop. Thus we distribute This helps analysis routines involves printing that the cost of saving registers; the overhead now depends on the return if their argument is valid but otherwise raise art error optimization and may contain routines that do in the call to the analysis routines as they have to allow The calling convensaved and restored. We use register renaming to minimize
the catling
need to be saved, The analysis routines, on the other hand, have to follow the calling conventions arbitrary procedures to be linked in.
tions define some registers as callee-save registers that are preserved across procedure calls, and others as caller-save registers that are not preserved across procedure calls. All the caller-save registers need to be saved before the call to the analysis routine and restored on return from the analysis routines. This is necessary to maintain program. the execution state of the application automatically The callee-save registers would
message and touching a lot more registers. For such routines, the common case of a vatid argument has low overhead as few registers are saved. This optimization current implementation. The number of registers that need to be saved may be further reduced by computing live registers in the applilive variable Only cation program. anatysis[l OM can do interprocehral is available in the
Gpixie ~te~s three registers away from the application program for its own use. Pixie maintains three memory locations that have the values of these three registers, and replaces the use of dtese registers by uses of the memory locations. 7~pha[10] ha5 a signed 21-bit pc-relative subroutine branch ~stm~tion. The application may contain hand-crafted assembly language code that often does not follow standard conventions. All)M can handle such programs. $AnalYsis routines are analogous to standard library routines hat have to follow calling conventions so they can be linked wirb progrants.
the live registers need to be saved and restored to preserve the state of the program inlining such as further reduce the overhead of procedure calls at the
201
A
Stack I
textstart ~
a
t
instrumented program text Program Text Addresses Changed \ analysis gp
new datastart
n
Heap Figure 4: Memory layout have not The text addresses. the addresses supplied. program text However,
cost of increasing the code size. These refinements been added to the current system.
same as before. This is shown in Figure 4. addresses of the application program have changed because of the addition of instrumented code. How-
Keeping
Pristine
Behavior
ever, we statically
know the map from the new to original program, the original PC is simply
If an analysis routine asks for the PC of an inThis works well for most of the tools. if the address of a procedure in the application If the
One major goal of ATOM is to avoid perturbing in the application program. Therefore, are put in the space between the application and data segments. Analysis cedures or data with the application data and code for all procedures that they may need. The data sections unchanged. of the application and uninitialized program;
routines do not share any prothey contain routines are not including library program
is taken, its address may exist in a register. is not the original text address.
analysis routine asks for the contents of such a register, the value supplied not implemented We have to return in our current system the ability
data of analysis
Since analysis routines and the application share any procedures, there are two sbrkl the application
routines is put in the space between the application must be located before all uninitialized data by initializing it with zero.
that atlocate space on the same heap. ATOM provides options for tools that must allocate dynamic memory.
ized data of the analysis routines is converted to initialized The start of the stack and heap10 are unchanged,
lo~ the MPha Am
and grows towards low memory, and heap starts at end of uninitialized data and grows towards high memory. 11sbrk routines ~ocate more data spacefor tie program.
202
The first method links the variables of the two sbrks, so both allocate space on the same heap without on each other. stepping This Each starts where the other left off.
programs
execution
The procedure call overhead is dependent on the code in the analysis routines, and the number and type of arguments that are passed. ATOM uses the data flow summary information along with register renaming to save. The contribution instrumented program to find the necessary registers time is also dependent on of procedure call overhead in the
method is useful for tools that are not concerned with the heap addresses being same as in the uninstrumented of the program. also sufficient Such tools include version basic block counting, that require
branch analysis, inline analysis and so on. This method is for tools such as cache modeling This is the default behavior. precise heap addresses but do not allocate dynamic memory in analysis routines. The second method is for tools that allocate dynamic memory and also require heap addresses to be same as in the tminstrumented version of the application application between the application program. To keep the heap addresses as before, the heap is partitioned and the analysis routines. The appli-
execution
the number of times the procedure calls take place. The inline tool instruments only procedure call sites; the total overhead is much less than the cache tool, which do when the control information communication instruments each memory reference. The amount of work the analysis routines reaches them is totally to compute. dependent on Although the the user is trying
overhead is small, we expect it to decrease live register analysis and inlining. Alpha AXP 3000
cation heap starts at the same address but the analysis heap is now made to start at a higher address. The user supplies the offset by which the start of analysis heap is changed. ATOM modifies the sbrk in analysis routines to start at the new address; the two sbrks application are not linked this time. The disad-
All measurements were done on Digital Model 400 with 128 Mb memory.
vantage of this method is that there is no runtime check if the heap grows and enters into the analysis heap.
Status
runs on Alpha AXP with The system Work is in
ATOM is built using OM and currently Fortrart, C++ and two different currently
Performance
how long ATOM takes to instrument a program, and programs execution time compares to 20 SPEC92 progr~s with programs execution time.
C compilers. modules.
To find how well ATOM performs, two measurements are of interest how the instrumented the uninstrumented
progress for adding support for shared libraries. ATOM projects. has been used both in hardware Besides the SPEC92 benchmarks, real applications and software it has successand at a Few
11 tools, The tools are briefly described in Figure 5. The time taken to instrument a program is the sum of the ATOMs processing time and the time taken by the users instrumentation routines. different The time taken by a tool varies as each tool does amounts of processing. For example, the malfoc
Our focus until now has mainly been on functionality. optimization overhead. Currently, reduction
have been added to reduce the procedure call in register saves has been information of We live register analysis system. data flow summary
tool simply asks for the malloc procedure and instruments i~ the processing time is very small. CPU pipeline scheduling The pipe tool does static for each basic block at instrttmen-
tation time and takes more time to instrument an application. The time taken to instrument 20 SPEC92 programs with each tool is also shown in Figure 5. The execution time of the instrumented program is the sum of the execution time of the uninstrumented analysis routines. application
program, the procedure call setup, and the time spent in the This total time represents the time needed and do not include those numbers statistics. The time spent in analysis time required by
12A~M is available to external users. If you would like a copY. Please contact the authors.
by the user to get the final answers. Many systems process the collected data offline as part of data collecting other systems. We compared each instrumented programs execution time
20 3
Tool Description
prediction using 2-bit history table counts
Average Time
5.52 WCS 6.03 WX 6.32 St%X 5.66 S(3(3 7.33 Sees
model direct mapped 8k byte cache computes dynamic instruction call graph based profiling finds potential inpui/output inlining summary tool tool
call sites
121.60 WCS 97.93 Sees 257.48 SCXX 122.53 WCS 120.53 St?CS 135.61 S(?CS
6.08 sees
4.90 sees 12.87 WCS 6.13 WCS 6.03 WCS 6.78 WCS
histogam of dynamic memory pipeline stall tool Instruction profiling tool system call summary tool unalign access tool
Analysis Tool
branch cache dyninst gprof inline io malloc pipe prof Syseall unalign
Instrumentation points
each conditional eaeh basic block each procedure/each basic block each call site before/after write procedure before/after malloc procedure basic block each basic block each procedure/each before/after each basic block branch each memory reference
Number of Arguments 3 1 3 2 1 4 1 2 2 2 3
Figure 6: Execution
time of instrumented
SPEC92 programs
Conclusion
modification details from tool tools to harda high-level view of the program,
Acknowledgements
Great many people have helped us bring ATOM to its current form. Jim Keller, Mike Burrows, Roger Cruz, John Edmondson, Mike McCallig, of bugs and instability Dirk Meyer, Richard Swan and Mike process. Uhler were ottrtirst users and they braved through amine field during the early development Jeremy Dion, Ramsey Haddad, Russel Kao, Greg Lueck and Louis Monier built popular tools with ATOM. Many people, too many to name, gave comments, reported bugs, and provided encouragement. Linda Wilson atl. Roger Cruz, Jeremy Dion, Ramsey PLDI reviewers gave useful Haddad, Russell Kao, Jeff Mogul, Louis Monier, David Wall, and anonymous comments on the earlier drafts of this paper. Our thanks to
cess it. Tools can be built with few pages of code and they compute only what the user asks for.
nication between application traces is no need to record and analysis means
cessed, and final results instrumented programs. of tools ATOM program. It has already to solve will
hardware
problems. platform
continue
to be an effective design.
in software
and architectural
204
References
[1]
l(l),
of Address Calculation
on a 64-
International
Conference on Programming
1, #4, pp 355-364, December 1992. Also available as WRL Research Report 93/4, August 1993. [14] David W. Wall. Systems for late code modification. In Robert Giegerich Code Generation and Susan L. Graham, eds, - Concepts, Tals, Techniques,
Simulator
David
R.
Keppel,
Eric
J.
for EfMulti-
Stanford University,
[11] Amitabh
205