Tuesday, March 20, 2018

GCC output, disassembled

The CPU can only run "machine code", i.e. a sequence of bytes encoded so that the CPU's transistor-based circuitry can act on it directly. Machine code is usually undecipherable to the untrained (and often the trained) eye; to humans it looks like, at most, a jumble of hexadecimal digits. Indeed, much of the detail contained within machine code can only be appreciated at the binary level. That is why assembly was created.

Assembly assigns names to the instructions and registers in machine code. The assembler translates these names into their numeric encodings. Here is some assembly code:
mov eax, 123
xor eax, 456
ret
Here is the same code in machine language:
B8 7B 00 00 00 35 C8 01 00 00 C3
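To make the correspondence concrete, here is a minimal C sketch that walks over those eleven bytes. It assumes the usual 32-bit x86 encodings for just the three instructions used (B8 for mov eax, imm32; 35 for xor eax, imm32; C3 for ret). The names code, read_imm32, and interpret are made up for this illustration, and it is nowhere near a real disassembler.

```c
#include <assert.h>
#include <stddef.h>
#include <stdint.h>

/* The eleven bytes from the listing, grouped by instruction. */
const uint8_t code[] = {
    0xB8, 0x7B, 0x00, 0x00, 0x00,  /* mov eax, 123 */
    0x35, 0xC8, 0x01, 0x00, 0x00,  /* xor eax, 456 */
    0xC3                           /* ret */
};

/* Read a 32-bit little-endian immediate starting at p. */
uint32_t read_imm32(const uint8_t *p) {
    return (uint32_t)p[0] | (uint32_t)p[1] << 8 |
           (uint32_t)p[2] << 16 | (uint32_t)p[3] << 24;
}

/* Hypothetical helper: understands only the three opcodes above.
   Returns 1 when it reaches ret, 0 on anything it does not know. */
int interpret(const uint8_t *p, size_t len, uint32_t *eax) {
    size_t i = 0;
    assert(p != NULL && eax != NULL);
    while (i < len) {
        switch (p[i]) {
        case 0xB8:                      /* mov eax, imm32 */
            if (len - i < 5) return 0;
            *eax = read_imm32(p + i + 1);
            i += 5;
            break;
        case 0x35:                      /* xor eax, imm32 */
            if (len - i < 5) return 0;
            *eax ^= read_imm32(p + i + 1);
            i += 5;
            break;
        case 0xC3:                      /* ret */
            return 1;
        default:                        /* unknown opcode */
            return 0;
        }
    }
    return 0;  /* ran off the end without seeing ret */
}
```

With the bytes above, interpret(code, sizeof code, &eax) returns 1 and leaves 435 in eax, which is 123 XOR 456. Note how the immediates are stored least significant byte first: 7B 00 00 00 is 123, and C8 01 00 00 is 456.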
As you can see, no one (or at least no sane person) writes code in machine language. Unfortunately, it's also not much easier to get things done in assembly, the reason being that assembly is a low-level language. Low-level languages run directly on the CPU, or are converted directly into something that does. Because of this, they are very detailed. Here is an analogy:

Imagine you are painting a masterpiece, pixel by pixel. In assembly, you would have to manually place each pixel. In a medium level language, however, you could say, "Oh, place a line here!", and a line would appear, with all of its pixels in order. The disadvantage with this is that you can't control how each nitty-gritty pixel is placed. In a high level language, you would overlook the lines and say, "I want a rose here, and a tree here", and so on and so forth. The problem here is similar; you can't control how the rose (or your object of choice) is drawn. The difference between them is the level of detail.

Knowing this, today we'll be analyzing the output of a C compiler. C code cannot be run directly on the CPU. It must go through the compiler, which converts the input into machine language. Depending on your compiler, it may do this directly, or it may convert the code to assembly and offload the rest of the work onto an assembler. GCC (the GNU Compiler Collection, originally the GNU C Compiler) belongs to the latter category.

With that being said, let's dive right in. Here is one of the simplest possible C programs:
int main() {
        return 5;
}
After compilation and disassembly, here is what it looks like:
00000000 <main>:
   0:   55                      push   ebp
   1:   89 e5                   mov    ebp,esp
   3:   b8 05 00 00 00          mov    eax,0x5
   8:   5d                      pop    ebp
   9:   c3                      ret
This may seem confusing, but let's analyze it line by line.

Let's start with the first two instructions:
push ebp
mov ebp, esp
This pushes EBP onto the stack and then copies ESP into EBP. What does that mean? It saves the old value of EBP on the stack, and sets up a new stack frame whose base (indicated by the base pointer, EBP) is at the current top of the stack.

"What? I don't get it." you may be saying. Allow me to explain.

On the x86 platform, the stack grows downward in memory. When you push something to the stack, ESP, the Stack Pointer, is first decremented by the size of the data (4 bytes for a 32-bit register), and the data is then stored at the location ESP now points to. Right after the two instructions above, ESP is equal to EBP. However, as we push things, ESP starts decreasing away from EBP. Likewise, when we pop something, the data at ESP is moved to our register, and ESP is incremented.
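The push and pop behaviour can be modelled in a few lines of C. This is only a sketch under some assumptions: the Stack struct, STACK_SIZE, stack_init, push32, and pop32 are names invented here, the "stack pointer" is an offset into a small byte array rather than a real address, and every value is 4 bytes wide. On x86, push decrements ESP first and then stores the value at the new ESP, and pop reads the value at ESP and then increments it.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define STACK_SIZE 64  /* arbitrary size for the toy stack */

typedef struct {
    uint8_t  mem[STACK_SIZE];  /* simulated stack memory */
    uint32_t esp;              /* offset of the current top of the stack */
} Stack;

/* An empty stack starts with ESP just past the highest address. */
void stack_init(Stack *s) {
    memset(s->mem, 0, sizeof s->mem);
    s->esp = STACK_SIZE;
}

/* push: decrement ESP by 4, then store the value at the new ESP. */
void push32(Stack *s, uint32_t value) {
    assert(s->esp >= 4);              /* toy overflow check */
    s->esp -= 4;
    memcpy(&s->mem[s->esp], &value, 4);
}

/* pop: load the value at ESP, then increment ESP by 4. */
uint32_t pop32(Stack *s) {
    uint32_t value;
    assert(s->esp + 4 <= STACK_SIZE); /* toy underflow check */
    memcpy(&value, &s->mem[s->esp], 4);
    s->esp += 4;
    return value;
}
```

Starting from stack_init, two calls to push32 move esp from 64 down to 56, and two calls to pop32 return the values in reverse order and bring esp back to 64.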

The above assembly code puts EBP onto the stack, storing it (presumably so we can restore it later). Then, it sets up a new stack frame, whose base is at the current top of the stack. This way, the function has a clean frame to use.

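This stub is very common; in fact, GCC emits those two instructions at the beginning of nearly every function when compiling 32-bit x86 code without optimization (optimized builds often omit the frame pointer). It is known as the function prologue.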

The next line is as follows:
mov eax, 0x5
This line is quite simple. When a C function returns an integer, the compiled code puts it in EAX. By default, in assembly, functions do not return values; they just go back to the spot where they were called. Thus, the developers of Unix System V (at the time one of the most widely used versions of Unix) created a convention, the System V Application Binary Interface. Among other things, its i386 version specifies that integer and pointer return values should be stored in EAX. Today, most C compilers targeting 32-bit x86 follow this convention.

We hinted at what the last lines might do earlier. Here they are:
pop ebp
ret
These two instructions restore EBP (which we stored on the stack earlier) and return to the caller. They mirror the first two instructions, and are thus known as the function epilogue.
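Putting the prologue, body, and epilogue together, the whole function can be traced with a toy model of the registers and the stack. This is only a sketch: the Cpu struct, cpu_push, cpu_pop, and call_main are hypothetical names, the 64-byte memory is arbitrary, and the initial push of a return address stands in for the caller's call instruction, which does not appear in the disassembly above.

```c
#include <assert.h>
#include <stdint.h>
#include <string.h>

#define MEM_SIZE 64  /* arbitrary size for the toy stack memory */

typedef struct {
    uint32_t eax, ebp, esp, eip;  /* the registers this example touches */
    uint8_t  mem[MEM_SIZE];       /* simulated stack memory */
} Cpu;

/* push: decrement ESP by 4, then store the value there. */
void cpu_push(Cpu *c, uint32_t value) {
    assert(c->esp >= 4);
    c->esp -= 4;
    memcpy(&c->mem[c->esp], &value, 4);
}

/* pop: load the value at ESP, then increment ESP by 4. */
uint32_t cpu_pop(Cpu *c) {
    uint32_t value;
    assert(c->esp + 4 <= MEM_SIZE);
    memcpy(&value, &c->mem[c->esp], 4);
    c->esp += 4;
    return value;
}

/* Mimic a caller's `call main` followed by main's five instructions.
   Returns whatever ends up in EAX. */
uint32_t call_main(Cpu *c, uint32_t return_address) {
    cpu_push(c, return_address);  /* call: save where to come back to */

    cpu_push(c, c->ebp);          /* push ebp       (prologue) */
    c->ebp = c->esp;              /* mov  ebp, esp  (prologue) */

    c->eax = 5;                   /* mov  eax, 0x5  (return value) */

    c->ebp = cpu_pop(c);          /* pop  ebp       (epilogue) */
    c->eip = cpu_pop(c);          /* ret: jump back to the caller */
    return c->eax;
}
```

After call_main returns, EAX holds 5, while EBP and ESP hold the values they had before the call. That balance between prologue and epilogue is what lets the caller carry on as if its own stack frame had never been touched.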

That is all for this post. I will be posting more disassemblies soon.