
Report on Decompilation

Peter T. Breuer

1992

1 Introduction
This report details efforts at the decompilation of compiled C code and
research into the decompilation process. The research has been funded by
BT under the Visiting Research Fellowship scheme.

1.1 Aims
There are two foci:

• A feasibility study was agreed based upon the GNU ANSI C compiler.
This compiler’s source code is public domain and highly portable,
therefore well structured. Moreover, extensive documentation has
been made available by the compiler’s design team with the intention
of facilitating ports of the code.

  – It was hoped that it would be possible to recover the compiler
    specification from the documentation and the source code.
  – This specification would then be converted into a decompiler
    using theory developed in [1], which shows how to treat compiler
    specifications as the basis for a decompiler.

• The compiler specification would be rendered into decompiler code by
constructing a decompiler-compiler, which would embody the theory
of [1] in a practical, working utility.

2 Results
It proved impossible to determine a compiler specification for GCC, though
much knowledge of the compiler was gained in the attempt. It is the opin-
ion of the present author that it requires both a specification expert and
an expert in machine code design to formulate a specification from the evi-
dence provided by the source code and documentation alone – which is not
surprising, since the compiler design team undoubtedly included the latter,
but not the former.
But a decompiler-compiler has been built, and preliminary testing under
UNIX and MSDOS indicates that it works and builds efficient decompilers.
Time has been too short to permit a full-scale testing program, but the
theory appears to be correctly implemented and is vindicated by the prototype.
Quite how practical a tool it is from the user’s point of view has not been
determined, but it is capable of converting object-code patterns to source-code
outputs, and is no different in its pattern of use from a compiler-compiler.

3 GCC compiler study


As well as specification experts and machine-code experts, software engi-
neering experts are undoubtedly required to understand the compiler’s code
structure, which is based on a yacc top-level specification, but calls side-
effecting functions to build the parse-tree. An explicitly attributed gram-
mar would have been much more helpful. Full reverse engineering tools
are required to understand the compiler (and an initial attemt at using the
SMART tools was made) but appear to founder on the macro language used
by the back-end.

C source → syntax tree → assembler (final phase driven by mach.def)

Figure 1: The GCC compiler – schematic overview.

The compiler is basically three-phase (Fig. 1).


All practical experiments confirmed the same ultimate stumbling block.
The assembler code produced by the unix ADB (debug) utility from GCC-
produced object code does not coincide with the assembler produced by

GCC with the -S switch. No explanation for this discrepancy could be
found. The assembler from ADB mostly contained instructions which
matched no pattern in the source code or documentation, whereas the
assembler from GCC -S matched up well.
Speculation: either the port of GCC to the architecture tried (a Sun 4) has
not followed the path set out in the documentation, or ADB uses a very
different assembler notation.
Experiments did establish that the documentation is correct in asserting
that the compiler is largely context independent. Each function is certainly
compiled in stand-alone fashion, and (in non-optimizing mode) internal code
structures are compiled roughly independently too. It seems possible to
recognize the patterns of assembler code produced, even though they cannot
be predicted from the GCC documentation, and tie them to specific source
code structures.
Experiments also established, however, that such context-dependent as-
pects of the compiler as there are make it almost impossible to trace vari-
ables through the assembler code, because the compiler will implement them
as registers or constants or memory locations according to its own evolving
high-level model of the machine state. If a source-code variable is initialized
by a constant and never reassigned, for example, it will be assembled as a
constant. This is clear both from the documentation and the experiments.
it is necessary to recreate the machine state seen by the compiler’s analysis
in order to understand the denotation of each assembler-level register and
memory location. In the absence of such knowledge, all that can be given is
a range of possibilities. And it is doubtful if the compiler’s analysis can be
recreated from the evidence available in the object code.
Moreover, there are difficulties at the back-end too. For example, a con-
stant will be implemented differently according to whether it is zero or not.
Different instructions are used to move, copy and test zero than are used
for other values. That is clear from studying the low-level, machine-
dependent parts of the compiler. But this level is compiled entirely from a
set of machine characteristics contained in a single file using a macro lan-
guage. Although this design is extremely portable, it is difficult to penetrate.
One wants to see what the machine-characteristics force into the compiler
behaviour, which requires all macros to be expanded out. But the expansion
loses the information contained in the macro name, which might otherwise
have been cross-referenced to the documentation for guidance.

References
[1] Breuer, P.T. and Bowen, J.P. Decompilation is the efficient enumeration
of types. In Proc. Workshop on Static Analysis (Journées de Travail
WSA ’92), pages 255–273, 1992. Université de Rennes / BIGRE 81–82,
IRISA, France.
