Professional Documents
Culture Documents
Breaking The x86 ISA
Breaking The x86 ISA
Breaking The x86 ISA
July 27, 2017 maintaining backwards compatibility, the processor has kept
Abstract— A processor is not a trusted black box for running even those instructions and modes that are no longer used
code; on the contrary, modern x86 chips are packed full of secret today. The result of these continuous modifications and
instructions and hardware bugs. In this paper, we demonstrate evolutions is a processor that is a complex labyrinth of new
how page fault analysis and some creative processor fuzzing can
be used to exhaustively search the x86 instruction set and
and ancient technologies. Within this shifting maze, there are
uncover the secrets buried in a chipset. The approach has instructions and features that have been largely forgotten and
revealed critical x86 hardware glitches, previously unknown lost over time.
machine instructions, ubiquitous software bugs, and flaws in
enterprise hypervisors.
I. OVERVIEW
II. HISTORY
x86 is one of the longest continuously evolving instruction
set architectures (ISAs) in history. With a design that began in Figure 3. Blanks in the x86 opcode maps indicate a possible hidden
instruction.
early 1976 as the 8086, the ISA has undergone continuous
revisions and updates, while still maintaining backwards Whereas the techniques for finding bugs, secrets, backdoors
compatibility and support for the original specification. In the in software are well studied and established, similar
40 years since, the architecture has evolved with a multitude techniques for hardware are non-existent. This is troubling, in
of new operating modes (figure 1), each adding an entirely that it is the processor that enforces the security of the system,
new layer to the already complex design. Along with new and is ultimately the system’s most trusted component. It
modes came instruction set extensions, adding entirely new seems necessary to stop treating a processor as a trusted black
classes of instructions from a range of vendors (figure 2). In box for running software, and instead develop systematic tools
2
and approaches for auditing processors, in much the same way incremented. For example, in the case of the 15 byte zero
that we can audit software. This is the motivation behind our buffer, the instruction will be observed to be two bytes long;
research – an approach to discovering the secrets and flaws thus, the second byte is incremented, so that the buffer is now
built into the processors we blindly trust. {0x00, 0x01, 0x00, 0x00, 0x00, …}. The process is then
repeated with the new instruction. If this incrementation
III. PRIOR WORK results in a change in the observed instruction length or
Prior work on x86 fuzzing focuses on the correct exception generated, the resulting instruction is incremented
functioning of emulators and hypervisors. These techniques from its new end. When the end of an instruction has been
take the approach of random instruction generation, or incremented 256 times (exhausting all possibilities for the last
generation based on pre-existing knowledge of the x86 byte of that instruction), the increment process moves to the
instruction format. Random instruction generation produces previous byte in the instruction (figure 4).
poor instruction coverage, and cannot find arbitrarily complex
000000000000000000000000000000
instructions, such as those with long combinations of prefixes 000100000000000000000000000000
and opcodes. Generation based on knowledge of the x86 000200000000000000000000000000
instruction set can produce better instruction coverage, but 000300000000000000000000000000
000400000000000000000000000000
fails to find undocumented and mis-documented instructions, 000401000000000000000000000000
and is still unable to find arbitrarily complex instructions. In 000402000000000000000000000000
addition to these limitations, no known approach focuses on 000403000000000000000000000000
000404000000000000000000000000
the processor hardware itself. Our proposed technique is the 000405000000000000000000000000
first x86 fuzzing work targeting the actual processor, and uses 000405000000010000000000000000
an effective search approach requiring no prior knowledge of 000405000000020000000000000000
000405000000030000000000000000
the x86 instruction format. 000405000000040000000000000000
Our goal is to find a way to programmatically exhaustively This technique allows effectively exploring the meaningful
search the x86 instruction set, in order to find hidden or search space of the x86 ISA. The less significant portions of
undocumented instructions, as well as instruction-level flaws an instruction (such as immediate values and displacements)
like the Pentium f00f bug. To do this, we would generate a are quickly skipped in the search, since they do not change the
potential x86 instruction, execute it, and observe its results. instruction length or exceptions. This allows the fuzzer
The challenge with this is in the complexity of the x86 process to focus on only meaningful parts of the instruction,
instruction set: x86 instructions can be between 1 and 15 bytes such as prefixes, opcodes, and operand selection bytes. This
long (figure 3). approach reduces the instruction search space from a worst-
case 1.3x1036 instructions, down to a very manageable
inc eax
40 100,000,000 instructions (as observed in lab scans, described
lock add qword cs:[eax+4*eax+07e06df23h], 0efcdab89h in section IV). We implement the instruction generation and
2e 67 f0 48 818480 23df067e 89abcdef execution logic in a process called the injector.
Figure 3. A 1 byte vs. a 15 byte x86 instruction.
Resolving Instruction Lengths
With instructions up to 15 bytes long, the worst-case search However, the instruction tunneling approach only works if
space for the x86 ISA is 1.3x1036 instructions – a simple there is a reliable way to determine the length of an arbitrary
iterative search is infeasible, and randomly selecting possible (potentially undocumented) x86 instruction. Since the
instructions will only cover a tiny fraction of the potential instruction may be undocumented, disassembling the
search space. The search space can be reduced by only instruction is not an option. An alternate naïve approach to
generating instructions that follow the formats described in determining instruction length is to set the x86 trap flag,
x86 reference manuals, but this approach will fail to find execute the instruction, and observe the difference between the
undocumented instructions, and will miss hardware errors that original and new instruction pointers. However, this approach
are the result of invalid instructions. To effectively reduce the fails on instructions that throw faults – since a faulting
instruction search space, we propose a search algorithm based instruction does not execute, there is no change in the
on observing changes in instruction lengths. instruction pointer when the instruction is stepped with the
Searching the Instruction Set trap flag. We wish to find all potentially undocumented or
The instruction search process, which we call tunneling, flawed instructions – including those normally restricted to
runs as follows. A 15 byte buffer is generated as a potential kernel, hypervisor, or system management code – so exploring
starting instruction; for example, for searching the complete even faulting instructions is critical to the approach. For
instruction space, we use a buffer of 15 0 bytes as the starting example, an instruction such as “inc eax” can execute in ring 3
candidate. The instruction is executed, and its length (in bytes) and below; an instruction such as “mov eax, cr0” can execute
is observed. The byte at the end of the instruction is then in ring 0 and below; and an instruction such as “rsm” can
execute only in ring -2 (System Management Mode). For
3
effective results, the injector should be able to identify exception thrown, the injector can differentiate between
instructions in more privileged rings, even if it cannot actually instructions that don’t exist, versus those that exist but are
execute those instructions. restricted to more privileged rings. Thus, even from ring 3,
To effectively determine the length of even faulting the injector can effectively explore the instruction space of
instructions, we introduce a 'page fault analysis' technique, ring 0, the hypervisor, and system management mode.
wherein instructions are incrementally moved across page Persistence
boundaries to induce page faults. A candidate instruction is
generated (a 15 byte value, generated by the incrementation The tunneling algorithm combined with fault analysis to
process described earlier), and placed in memory so that the resolve instruction lengths brings us close to an effective x86
first byte of the instruction is on the last byte of an executable instruction fuzzing approach, but other problems arise.
page, and the rest of the instruction lies in a non-executable Foremost, the injector process is fuzzing the very processor it
page. The instruction is then executed. If a page fault occurs is running on. As such, it is important to avoid accidentally
during the instruction fetch, the processor triggers the #PF corrupting the system or process state. As a basic protection
exception, and the address of the page boundary is reported in against this, we restrict the injector to ring 3 – this avoids the
the CR2 register. This indicates to the injector process that possibility of catastrophic system failures, except in the case
part of the instruction lies in the non-executable page; any of serious hardware bugs. Although the injector is limited to
other result indicates that the entire instruction was fetched ring 3, it is still able to resolve the existence of instructions in
from memory. If the injector determines that the instruction more privileged rings through the page fault analysis.
does not yet reside entirely in executable memory, the While the operating system should not crash from the
instruction is moved back a byte, so that the first two bytes are injector’s ring 3 fuzzing, it is still possible for the injector to
on an executable page, and the rest are on the non-executable corrupt itself with one of the generated instructions.
page. The process is repeated until no #PF exception occurs, The first situation to guard against is faulting instructions.
or until a #PF exception is received with an address other than For this, we hook every exception that a generated instruction
the page boundary. At this point, the number of bytes lying in might trigger (in Linux, sigsegv, sigill, sigfpe, sigbus, sigtrap).
the executable page indicate the length of the instruction The injector’s exception handler receives these signals,
(figure 5). restores the system registers to a ‘known good’ state, and
resumes the fuzzing process where it left off.
We should also guard against state corruption; specifically,
0f 6a 60 6a 79 6d c6 02 … the process state is corrupted if a generated instruction writes
0f 6a 60 6a 79 6d c6 02 … into the injector’s address space. This is overcome by
initializing all registers to 0 and mapping the NULL pointer
0f 6a 60 6a 79 6d c6 02 …
into the injector process’s memory. This ensures that
0f 6a 60 6a 79 6d c6 02 … computed memory addresses such as [eax + 4 * ecx] resolve to
Figure 5. A candidate instruction is moved across a page boundary to 0, rather than an address within the process’s normal memory
determine its length. The first page is executable, while the second page is space. Mapping the page at address 0 into memory allows
non-executable. When the instruction does not throw a #PF exception with more detailed fault analysis for some types of instructions.
CR2 set to the page boundary address, the entire instruction must lie within
For example, without address 0 mapped, “mov eax, [ecx + 8 *
the executable page. In the example, the check is complete when the
executable page contains 0f 6a 60 6a (punpckhdq mm4,[rax+0x6a]). edx]” will generate a #GP exception, as will “mov cr0, eax”.
Since both instructions generate the same exceptions, the
Once the non-#PF/CR2 combination is observed, the
injector cannot determine that one is privileged and one is not.
instruction fetch is known to be complete. However, it is still
By mapping 0 into the process’s address space, the
not clear whether the fetched instruction exists or not. For this,
unprivileged instruction can successfully execute, allowing the
the injector observes any exceptions thrown by the instruction.
injector to differentiate it from the privileged instruction.
Non-existing instructions will generate a #UD (undefined
However, this mapping is not strictly necessary for the search.
opcode) exception, existing instructions will either
With the registers initialized to zero prior to instruction
successfully execute or throw a different exception.
execution, memory accesses with a displacement may still
Interestingly, the approach allows resolving the length even
cause a process state corruption; for example, “inc
of non-existing instructions. For example, 9a13065b8000d7 is
[0x0804a10c]” may hit the .data segment of a 32 bit process,
an illegal instruction, but its length is known to be 7 bytes,
regardless of the register initialization values. However, as the
because this is when the processor stops decoding the
tunneling approach for instruction searching only manipulates
instruction. This provides some small insight into the
a single byte of the instruction at a time, it will explore “inc
pipelining architecture and the format of potential future
[0x0000000c]”, “inc [0x0000a100]”, “inc [0x00040000]”, and
instructions.
“inc [0x08000000]”, but will never search “inc
This approach allows the injector to detect even privileged
[0x0804a10c]”. In practice, this prevents the tunneling
instructions: whereas a non-existing instruction will throw a
process from ever corrupting its own state.
#UD exception, a privileged instruction will throw a #GP
We also provide an alternative fuzzing strategy via random
exception if the executing process does not have the necessary
instruction generation. In this approach, it is possible for the
permissions for the instruction. By observing the type of
4
injector process to become corrupted, but we have observed These techniques form our "sandsifter" x86 fuzzing tool,
that in practice, this is still extremely rare – a 32 bit process which we release as open source. The tool calculates and
with 1 KB of writable critical program data has only a one in executes each candidate instruction, and compares its
four million chance of being corrupted by an arbitrary memory observed length and fault behavior to the expected values
access, and even then only for instructions that allow a 4 byte provided by a disassembler and architecture documentation.
displacement in the memory calculation. Any deviations from the expected behavior are logged for
Despite these protections, the process state may still be analysis.
corrupted by some specific instructions and be unable to
recover its original state. To solve this, we are forced to V. RESULTS
blacklist a small subset of the instruction space. Specifically, We ran the processor fuzzer against following x86
we disallow execution of segment register loads (lds, les, lfs, processors: Intel Core i7-4650U, Intel Quark SoC X1000,
lgs, lss, mov seg) and system call instructions (int 0x80, int Intel Pentium, AMD Geode NX1500, AMD C-50, VIA Nano
0xe, sysenter, syscall), which could corrupt the process state to U3500, VIA C7-M, Transmeta TM5700, and another,
the point that the exception handlers cannot recover it. currently unspecified x86 processor. The tool discovered
The last challenge in maintaining coherent execution state is undocumented instructions in all major processors, shared
resuming execution after an instruction is tested, and dealing bugs in nearly every major assembler and disassembler, flaws
with generated branch instructions. Both issues are solved by in enterprise hypervisors, and critical x86 hardware errata.
setting the x86 trap flag immediately prior to instruction
Hidden Instructions
execution,. The trap flag allows one instruction to execute, and
then throws a single step exception. By catching the single In this section, we use the notation xx to denote an arbitrary
step exception, the injector can catch execution after the byte, {aa-bb} to indicate any byte between aa and bb, and,
instruction runs, and detect that an instruction successfully following Intel notation, /n to denote n as the reg field of the
executed. This allows regaining control after both errant jump instruction’s modr/m byte.
instructions and non-branching instructions. On an Intel Core i7-4650U processor running in 64 bit
Finding Anomalies mode, the following undocumented instructions were found.
0f0dxx /non-1: this is currently documented as prefetchw for
With the tunneling algorithm and page fault analysis, we are /1; other reg fields are not documented, but still execute.
now able to effectively explore the x86 instruction set, 0f18xx: until the -061 (December 2016) version of the
reducing 1036 conceivable 15 byte combinations down to reference manuals, about half of these instructions were
approximately 100,000,000 candidate instructions (as undocumented, but would still run (the tested processor was
observed during the tests described in RESULTS). However, a released in 2012); they're now documented as reserved nops
means of identifying the unusual or interesting instructions is (presumably in place of a future instruction). 0f{1a-1f}xx:
still necessary. For this, we wrap the injector with a sifter similar to 0f18xx, this doesn't appear until the -061 references,
process. The sifter is responsible for recording anomalous but executed at least back to Ivy Bridge. 0fae{e9-ef, f1-f7, f9-
results from one or more injectors. To do this, the sifter uses ff}: these seem to have existed for a long time, but were
an existing disassembler to predict the length of an injected undocumented until the -051 references (June 2014) (only the
instruction. It then compares the observed length of the r/m field = 0 were documented prior to this). dbe0, dbe1:
instruction with the expected length of the instruction. A these execute but do not appear in the opcode maps. df{c0-
difference in length generally indicates a software bug (the c7}: these execute but do not appear in the opcode maps. f1:
disassembler and processor disagree on the instruction). On this executes but does not appear in the opcode maps; there is
the other hand, if the sifter sees that an instruction exists, but a note in SDM vol. 3 that it and d6 will not produce a #UD
the disassembler does not recognize the instruction, it (interestingly, d6 does produce a #UD, at least in Ivy Bridge).
generally indicates an undocumented instruction on the {c0-c1, d0-d1, d2-d3}{30-37, 70-77, b0-b7, f0-f7}: these
processor (since, presumably, the disassembler is written execute, but are not in the opcode maps; we believe they are
based off of processor documentation). The tool has also SAL aliases. f6 /1, f7 /1: these execute, but aren't in the
produced the hypervisor and hardware bugs described in opcode maps; we suspect they are aliases for the /0 version.
RESULTS; with the exception of a critical hardware bug, these On an AMD Geode NX1500, the following undocumented
have been due to incorrect exception generation by the instructions were found. 0f0f{40-7f}{80-ff}{xx}: these are in
hardware or hypervisor. This tends to cause incorrect the AMD 3DNow! Instruction range, but are not documented
instruction length resolution in the injector, which the sifter for a range of xx that still execute. dbe0, dbe1: these execute
then flags as a software bug. At this point, manual analysis is but do not appear in the AMD opcode maps. df{c0-c7}: these
necessary to correctly classify the bugs as software, hardware, execute but do not appear in the AMD opcode maps.
or hypervisor. VIA does not release programming manuals detailing the
For our research, we used the Capstone disassembler due to processor specifications the way that Intel and AMD do.
its ease of Python integration. However, code exists for Classifying an instruction as undocumented or not on VIA is
swapping Capstone with objdump or ndisasm. done by comparing the instruction against Intel and AMD
The Sandsifter Framework documentation, and any available VIA documentation on
5
VI. CONCLUSION
Although we treat our processors as trusted black boxes,
they are riddled with the same flaws and secrets we find in
software. Through guided instruction fuzzing based on a
depth-first search and page fault analysis, the sandsifter toolset
is able to exhaustively enumerate and test all reasonably
distinct instructions in the x86 ISA. The process has revealed
hidden instructions, software bugs, hypervisor flaws, and
critical processor failures. With the release of the sandsifter
tool [1], the reader is encouraged to audit their own processors
for defects and hidden instructions. This work provides an
important first step towards introspecting x86 chips, and
validating the processors we all blindly trust.
REFERENCES
[1] https://github.com/xoreaxeaxeax/sandsifter