Translation of The Book Windows APT Warfare - Sudo Null IT News


Hi all. I decided to share some news; maybe someone will find it interesting.

I have finished translating the book Windows APT Warfare. It is, in its own way, a
very interesting read for anyone who works with assembly, malware, and
information security. I'll leave the first part here; the rest of the book can be
found on Damage (now xss.is). If anyone can lay it out as a PDF, I would be
very grateful. Thanks, everyone.

From Source to Binaries: C Program Path

In this chapter, we will learn the basics of how a compiler packages C code into
an EXE binary and how the system runs the resulting process. These
fundamentals will help you understand how Windows compiles C into a program
and links it against system components. You will also come to understand the
program structure and workflow that malware analysis should follow.

In this chapter we will cover the following main topics:

• The simplest Windows program in C
• C compiler – generating assembly code
• Assembler – converting assembly code into machine code
• Compiling code
• Windows linker – packaging binary data into the Portable Executable (PE) format
• Running compiled PE executables as dynamic processes

The simplest Windows program in C

Any software is designed with specific functionality in mind. This functionality
may include tasks such as reading external input, processing it according to the
engineer's expectations, or performing a specific function or task. All of these
actions require interaction with the underlying operating system (OS), and to
interact with the OS a program must call system functions. It is almost
impossible to develop a meaningful program that makes no system calls.

In addition, on Windows, when compiling a C program, the programmer needs to
specify the subsystem (you can read more about this at
docs.microsoft.com/en-us/cpp/build/reference/subsystem-specify-subsystem);
windows and console are probably the two most common.

Let's look at a simple example of a C program for Windows.

Here is the most simplified C program for Windows. Its purpose is to call the
USER32!MessageBox() function from the main() entry point to open a popup
window with an informational title and welcome text.

C Compiler - Assembly Code Generation

What is interesting to understand from the previous section is how the compiler
makes sense of this C code. First, the main job of the compiler is to convert the
C code into assembly code that follows the C/C++ calling convention, as shown
in Fig. 1.1:

For convenience and practicality, the following examples will be presented with
x86 instructions. However, the methods and principles described in this book
are common to all Windows systems, and the compiler examples are based on
the GNU Compiler Collection (GCC) for Windows (MinGW).

Because various system functions (and even third-party modules) expect
parameters to be laid out in memory in a particular way at the assembly level,
several basic Application Binary Interface (ABI) calling conventions exist to
make this manageable. Interested readers may refer to Microsoft's docs on
argument passing and naming conventions (https://docs.microsoft.com/en-us/cpp/cpp/argument-passing-and-naming-conventions).

These calling conventions mainly address several issues:

• The order and location in which parameters are placed (for example, on
the stack, in a register such as ECX, or a mix of both for speed)
• The amount of memory occupied by the parameters, if they need to be
saved
• Whether the memory used is freed by the caller or by the callee

When the compiler generates assembly code, it follows the system's calling
conventions, arranges the parameters in memory accordingly, and then invokes
the function's memory address with the call instruction. Thus, when the thread
jumps into the system function, that function can correctly obtain its parameters
from the expected memory addresses.

Let's take Fig. 1.1: we know that the USER32!MessageBoxA function uses the
WINAPI (stdcall) calling convention. In this convention, the parameters are
pushed onto the stack from right to left, and the callee is responsible for freeing
the stack memory they occupy. So, after 4 parameters are pushed onto the
stack, occupying 16 bytes (sizeof(uint32_t) x 4), execution enters
USER32!MessageBoxA. When the function finishes, it returns to the instruction
following call MessageBoxA and frees the 16 bytes of stack memory with ret
0x10 (the xor eax, eax that follows in the listing simply zeroes EAX as main's
return value).

The book focuses only on how the compiler generates machine instructions for
the chip and packages the program into an executable file. It does not cover
important parts of advanced compiler theory, such as semantic tree generation
and compiler optimization; these are left for the reader's further study.

In this section, we learned about the C/C++ calling conventions: how
parameters are placed in memory in order, and how that memory is freed when
the function returns.

Assembler - converting assembly code into machine code

At this point you may notice that something is missing. The processor chips we
use every day cannot execute text-based assembly code; the text must instead
be converted into the machine code of the appropriate instruction set so the
chip can perform the corresponding memory operations. Thus, during the build
process, the assembly code mentioned earlier is converted by the assembler
into machine code the chip can understand.

Fig. 1.2 shows the dynamic memory layout of a 32-bit PE:

Since the chip cannot directly parse strings such as hello world or info, data
(global variables, static strings, global arrays, and so on) is first stored in a
separate structure called a section. Each section is created with an offset
address at which it should be placed. If the code later needs a resource fixed at
compile time, the corresponding data can be retrieved from the appropriate
offset address. Here's an example:

• The info string above can be expressed as \x69\x6E\x66\x6F\x00 in ASCII
(5 bytes in total, including the trailing zero). The binary data of this string
can be stored at the beginning of the .rdata section. Similarly, the hello
string can be stored right after it, at offset +5 in the .rdata section.
• Actually, the MessageBoxA API call above is not understood by the chip
either. Therefore, the compiler creates an import address table structure,
the .idata section, to store the addresses of the system functions the
current program wants to call. When the program needs one, the
corresponding function address can be fetched from this table, allowing
the thread to jump to it and continue executing the system function.
• Generally speaking, the compiler stores the code content in the .text
section.
• A running process does not contain just one PE module. Every *.EXE or
*.DLL mounted in process memory is packaged in PE format.
• In practice, each module loaded into memory must be assigned an
image base address at which all of the module's contents are stored. For
a 32-bit *.EXE, the image base address is usually 0x400000.
• The absolute address of each piece of data in dynamic memory is the
module's image base address + the section offset + the data offset
within the section. As an example, take an image base of 0x400000: if
we want to retrieve the info string, its content will be located at address
0x402000 (0x400000 + 0x2000 + 0x00). Similarly, hello will be stored at
address 0x402005, and the MessageBoxA pointer at address 0x403018.

There is no guarantee that in practice the compiler will generate .text, .rdata,
and .idata sections, or that they will be used exactly for these purposes; most
compilers simply follow the memory layout principles described above. Visual
Studio compilers, for example, do not create executables with an .idata section
to store the function pointer table, but instead place it in an .rdata section that is
readable and writable.

For now, a rough understanding of sections and absolute addressing in
dynamic memory is enough; there is no need to get hung up on the contents
and attributes of each structure or how to fill them out correctly in practice. The
following chapters will explain the meaning of each structure in detail and show
how to design one yourself.

In this section, we learned how data is arranged into sections and offsets at
compile time, and how that data is accessed from memory later, during program
execution.

Compiling Code

As mentioned earlier, if the code contains strings or function references that the
chip does not understand, the compiler must first convert them into absolute
addresses the chip can use and store their contents in separate sections. It
must also translate the textual assembly listing into native code, machine code
the chip can recognize. How does this work in practice?

In the case of Windows x86, assembly instructions are translated according to
the x86 instruction set: each text instruction is encoded into machine code the
chip can understand. Interested readers can search for the x86 Instruction Set
on Google to find the complete encoding tables, or even encode instructions by
hand without relying on a compiler.

Once the compiler completes the section packing described above, the next
step is to take the text instructions from the listing one by one, encode them
according to the x86 instruction set, and write them into the .text section, which
stores the machine code.

As shown in Fig. 1.3, the dotted box is the text-type assembly code resulting
from compiling the C/C++ code:

You can see that the first instruction is push 0, which pushes one byte of data
onto the stack (stored as 4 bytes); the bytes 6A 00 represent this instruction.
The instruction push 0x402005 pushes 4 bytes onto the stack at once, so the
longer encoding 68 05 20 40 00 is used (opcode 68, then the operand in
little-endian order). call ds:[0x403018] calls through a 4-byte address, and the
long machine code FF 15 18 30 40 00 represents this instruction.
Although Fig. 1.3 shows the in-memory layout of the running msgbox.exe, the
file created by the compiler at this stage is not yet an executable PE file.
Rather, it is a Common Object File Format (COFF) file, or object file as some
call it, a wrapper format designed specifically to record the various sections
produced by the compiler. The following figure shows a COFF file obtained by
compiling and assembling the source code with the gcc -c command and
viewing its structure in the well-known PEview tool.

As shown in Fig. 1.4, at the beginning of the COFF file there is an
IMAGE_FILE_HEADER structure that records the number of sections the file
contains. It is followed by an array of IMAGE_SECTION_HEADER structures
recording the location and size of each section's contents within the file. The
raw contents of the sections follow immediately after this array; in practice, the
first section is usually the contents of the .text section.

In the next step, the linker is responsible for adding to the COFF file the extra
pieces needed by the application loader; the result becomes our familiar EXE
program.

On systems with an x86 chip, multi-byte values such as integers and pointers
are encoded in memory with the least significant byte at the lowest address.
This practice is called little-endian. Strings and arrays, by contrast, are laid out
element by element from lower to higher addresses regardless of endianness.
How multi-byte data is laid out depends on the chip architecture; interested
readers may refer to the article How to Write Endian-Independent Code in C
(https://developer.ibm.com/articles/au-endianc/).

In this section, we learned about COFF, the format the compiler uses to record
the contents of the various sections it produces.

Windows Linker - Packing binary data into PE format

In the previous section, we took certain memory layout choices during program
compilation for granted. For example, the default image base of an EXE module
is assumed to be 0x400000 so the executable content can be hosted there, and
the .text section is expected to sit at 0x401000, just above the image base. As
we said, the .idata section stores the import address table, so the question
becomes: who, or what, is responsible for populating that table?

The answer is that every OS has an application loader designed to perform all
of these tasks correctly when creating a process from a static program.
However, much of the necessary information is known only at compile time, not
to the system developer, for example:

• Does the program need Address Space Layout Randomization (ASLR)
or Data Execution Prevention (DEP) enabled?
• Where in the .text section is the developer's main(int, char) function
located?
• How much memory does the executable need during the dynamic
phase?

Therefore, Microsoft introduced the PE format, essentially an extension of the
COFF file with an additional optional header structure that records the
information the Windows program loader needs to set up the process. The
following chapters will focus on playing with the various PE format structures so
that you can eventually write an executable file by hand.

All you need to know for now is that a PE executable has some key features:

• Code: typically stored as machine code in the .text section
• Import table: lets the loader fill in system function addresses so that the
program can retrieve them correctly
• Optional header: lets the loader read how to fix up the current module
once it becomes dynamic

Here's an example in Figure 1.5:

msgbox.exe is a minimalistic Windows program consisting of only three
sections: .text, .rdata, and .idata. When it is run, the system's application loader
sequentially extracts the contents of the three sections and writes them at
offsets 0x1000, 0x2000, and 0x3000 relative to the base of the current PE
module (msgbox.exe).

In this section, we learned that the application loader is responsible for patching
and populating program content when turning a static program file into a
running process.

Running static PE files as dynamic processes

At this point, you have a general understanding of how a minimal program is
generated, compiled, and packaged into an executable file during the static
phase. So the next question is: what does the OS do to run a static program?

Fig. 1.6 shows how an executable program is converted from a static file into a
dynamic process on a Windows system. Note that this differs from the
process-spawning flow of the latest versions of Windows. For the sake of
explanation, we will ignore privilege escalation, the patch mechanism, and
kernel internals, and talk only about how a static program is correctly parsed
and executed.

On Windows systems, every process must be started by a parent process,
which issues a system call to jump to the kernel level. For example, suppose
the parent process currently wants to run cmd.exe /c whoami, that is, to turn the
static cmd.exe file into a dynamic process and hand it the arguments /c
whoami.

So what happens in the whole process? As shown in Figure 1.6, these are the
steps:

1. The parent process issues a request to the kernel via CreateProcess,
asking for the creation of a new (child) process.

2. The kernel then creates a new process container and maps the contents of
the executable file into it. The kernel creates a thread to assign to this child
process, usually called the main thread or GUI thread. At the same time, the
kernel also sets up a block of userland memory to store two structures: a
process environment block (PEB) recording the current process's environment
information, and a thread environment block (TEB) recording each thread's
environment information. Details of these two structures will be fully presented
in Chapter 2, “Process Memory—File Mapping, PE Parser, tinyLinker, and
Hollowing,” and Chapter 3, “Dynamic API Call—Thread, Process, and
Environment Information.”

3. The NtDLL export RtlUserThreadStart is the entry routine for all threads and
is responsible for the necessary initialization of each new thread, such as
setting up structured exception handling (SEH). The first thread of each
process, that is, the main thread, executes NtDLL!LdrInitializeThunk at the user
level on its first run and enters the NtDLL!LdrpInitializeProcess function. This is
the executable loader, responsible for the necessary fix-ups of the PE module
loaded into memory.

4. After the loader completes its fix-ups, execution proceeds to the module's
entry point (AddressOfEntryPoint), which leads to the developer's main
function.

From a code perspective, a thread can be thought of as the worker that
executes code, and a process as the container that holds the code.

The kernel level is responsible for file mapping: placing the program's content at
the addresses preferred at compile time. For example, if the image base is
0x400000 and the .text offset is 0x1000, file mapping essentially requests a
block of memory at address 0x400000 and writes the actual contents of .text to
address 0x401000.

In fact, the loader function (NtDLL!LdrpInitializeProcess) does not call
AddressOfEntryPoint directly when it finishes; instead, the loader's fix-up work
and the entry point are treated as two separate tasks (in practice, two thread
contexts are used). NtDLL!NtContinue is called after the fix-up and hands
execution over to the entry point, which continues running as the thread's
scheduled task.

The execution entry point is recorded in
NtHeaders→OptionalHeader.AddressOfEntryPoint of the PE structure, but it is
not directly equivalent to the developer's main function; think of it that way only
as a first approximation. Generally speaking, AddressOfEntryPoint points to the
CRTStartup (C++ runtime startup) function, which performs the necessary
C/C++ initialization (such as converting arguments into developer-friendly form)
before calling the developer's main function.

In this section, we learned how an EXE file is transformed from a static file into
a dynamically running process on a Windows system. With a process, a thread,
and the necessary initialization steps in place, the program is ready to run.

Summary

In this chapter, we explained how C code is converted into assembly code by
the compiler and into an executable program by the linker.

The next chapter will build on this framework and take you through hands-on
experience with the entire flowchart in several C/C++ labs. In the following
chapters, you'll learn the intricacies of PE format design by creating a compact
program loader and writing an executable program yourself.
