Report On Decompilation
Peter T. Breuer
1992
1 Introduction
This report details efforts at the decompilation of compiled C code and
research into the decompilation process. The research has been funded by
BT under the Visiting Research Fellowship scheme.
1.1 Aims
There are two foci:
• A feasibility study was agreed based upon the GNU ANSI C compiler.
This compiler’s source code is freely available and highly portable,
and therefore presumably well structured. Moreover, extensive documentation has
been made available by the compiler’s design team with the intention
of facilitating ports of the code.
2 Results
It proved impossible to determine a compiler specification for GCC, though
much knowledge of the compiler was gained in the attempt. It is the opinion
of the present author that it requires both a specification expert and
an expert in machine code design to formulate a specification from the evidence
provided by the source code and documentation alone – which is not
surprising, since the compiler design team undoubtedly included the latter,
but not the former.
But a decompiler-compiler has been built, and preliminary testing under
UNIX and MSDOS indicates that it works and builds efficient decompilers.
Time has been too short to permit a full-scale testing program, but the
theory appears to be correctly implemented and vindicated by the prototype.
Quite how practical a tool it is from the user's point of view has not been
determined, but it is capable of converting object code patterns to source
code outputs, and is no different in the pattern of use from a
compiler-compiler.
GCC with the -S switch. There is no explanation for this discrepancy to
be found. Assembler from ADB mostly contained assembler instructions
which matched no pattern in the source code or documentation, although
assembler from GCC -S matched up well.
Speculation: the port of GCC to the architecture tried (Sun 4) has not
followed the path set out in the documentation, or ADB uses a very different
assembler notation.
Experiments did establish that the documentation is correct in asserting
that the compiler is largely context independent. Each function is certainly
compiled in stand-alone fashion, and (in non-optimizing mode) internal code
structures are compiled roughly independently too. It seems possible to
recognize the patterns of assembler code produced, even though they cannot
be predicted from the GCC documentation, and tie them to specific source
code structures.
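
The idea of tying recognized assembler patterns to source code structures can be sketched as a small lookup table. This is only an illustration under stated assumptions: the table contents, the `struct pattern` type and the `classify` function are hypothetical, and the mnemonic-to-construct pairs are examples rather than patterns recorded in the experiments.

```c
#include <stddef.h>
#include <string.h>

/* Hypothetical table: each entry ties an assembler mnemonic to the
 * source-level structure it is assumed to indicate. */
struct pattern {
    const char *mnemonic;
    const char *construct;
};

static const struct pattern patterns[] = {
    { "save", "function entry" },
    { "call", "function call" },
    { "cmp",  "comparison in a condition" },
    { "ret",  "function return" },
};

/* Return the source structure suggested by one line of assembler,
 * or "unrecognized" if the mnemonic matches no entry in the table. */
const char *classify(const char *line)
{
    line += strspn(line, " \t");              /* skip leading whitespace */
    size_t len = strcspn(line, " \t\n");      /* length of the mnemonic  */

    for (size_t i = 0; i < sizeof patterns / sizeof patterns[0]; i++) {
        const char *m = patterns[i].mnemonic;
        if (strlen(m) == len && strncmp(line, m, len) == 0)
            return patterns[i].construct;
    }
    return "unrecognized";
}
```

A table of this kind is built by observation of compiler output, which is the point made above: the patterns can be recognized after the fact even where they could not be predicted from the documentation.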
Experiments also established, however, that such context-dependent aspects
of the compiler as there are make it almost impossible to trace variables
through the assembler code, because the compiler will implement them
as registers or constants or memory locations according to its own evolving
high-level model of machine state. If a source-code variable is initialized by
a constant and never reassigned, for example, it will be assembled as a constant.
This is clear both from the documentation and the experiments. So
it is necessary to recreate the machine state seen by the compiler’s analysis
in order to understand the denotation of each assembler-level register and
memory location. In the absence of such knowledge, all that can be given is
a range of possibilities. And it is doubtful if the compiler’s analysis can be
recreated from the evidence available in the object code.
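
The constant case can be made concrete with a minimal sketch. The two function names are hypothetical and the code is not taken from the experiments. If the compiler treats `k` as a constant, as described above, it may emit the same code for both functions, and the object code then carries no evidence that `k` ever existed as a variable.

```c
/* k is initialized by a constant and never reassigned, so the compiler
 * is free to treat it as the constant 10 rather than as storage. */
int scale_with_variable(int x)
{
    int k = 10;
    return x * k;
}

/* The same computation written with the literal in place of k. */
int scale_with_literal(int x)
{
    return x * 10;
}
```

A decompiler faced with such object code can at best offer both source forms as possibilities, since nothing in the output distinguishes them.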
Moreover, there are difficulties at the back-end too. For example, a constant
will be implemented differently according to whether it is zero or not.
Different instructions are used to move, copy and test zero than are used
for other values. That is clear from studying the low-level,
machine-dependent parts of the compiler. But this level is compiled entirely from a
set of machine-characteristics contained in a single file using a macro language.
Although this design is extremely portable, it is difficult to penetrate.
One wants to see what the machine-characteristics force into the compiler
behaviour, which requires all macros to be expanded out. But the expansion
loses the information contained in the macro name, which might otherwise
have been cross-referenced to the documentation for guidance.
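
Both back-end difficulties can be seen in a small sketch. The macro names `EMIT_CLEAR` and `EMIT_MOVE_FMT`, the function `emit_load_constant` and the choice of register are assumptions made for illustration; they are not GCC's own macros, and the instruction text is only meant to resemble SPARC assembler.

```c
#include <stdio.h>

/* Hypothetical machine-description macros. After preprocessing only the
 * instruction text remains; the macro names themselves disappear. */
#define EMIT_CLEAR(reg)     "clr " reg
#define EMIT_MOVE_FMT(reg)  "mov %d, " reg

/* Write the instruction that loads a constant into register %o0.
 * Zero is handled by a different instruction from every other value. */
void emit_load_constant(char *buf, size_t size, int value)
{
    if (value == 0)
        snprintf(buf, size, "%s", EMIT_CLEAR("%o0"));
    else
        /* "%%" yields a literal '%' because this string is a format. */
        snprintf(buf, size, EMIT_MOVE_FMT("%%o0"), value);
}
```

Reading the expanded source, one sees only the strings `"clr %o0"` and `"mov %d, %%o0"`. Which machine characteristic produced each of them is no longer visible, and that name is exactly the handle one would have used to look the behaviour up in the documentation.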
References
[1] Breuer, P.T. and Bowen, J.P. Decompilation is the efficient enumeration
of types. In Proc. Workshop on Static Analysis, Journées de Travail
WSA, Vol. 92, pages 255–273, 1992. Université de Rennes/BIGRE 81-82,
IRISA, France.