diff options
author | Nathan Ringo <nathan@remexre.com> | 2023-10-05 02:27:31 -0500 |
---|---|---|
committer | Nathan Ringo <nathan@remexre.com> | 2023-10-05 02:27:31 -0500 |
commit | c33dc81d26eb5b77faf0108502ec7588a13ebad6 (patch) | |
tree | 097f63838ed76849fdae525018f554fe4777a4b5 /Compiler in a Day/05-codegen.page | |
parent | e49b67af7125cc7c5a5ec6f321057bee1036ba05 (diff) |
Adds Compiler in a Day series. This is a direct import and will need cleanup.
Diffstat (limited to 'Compiler in a Day/05-codegen.page')
-rw-r--r-- | Compiler in a Day/05-codegen.page | 136 |
1 files changed, 136 insertions, 0 deletions
diff --git a/Compiler in a Day/05-codegen.page b/Compiler in a Day/05-codegen.page new file mode 100644 index 0000000..2d21d75 --- /dev/null +++ b/Compiler in a Day/05-codegen.page @@ -0,0 +1,136 @@ ++++ +title = "Code Generation — Compiler In A Day" +prev = "04-abi.md" +next = "05-codegen.md" ++++ + +# Code Generation + +We're ready to start producing machine code! + +To reiterate, we're _not_ aiming to produce efficient machine code for now. +We're just aiming to get the simplest thing that works. + +Before we actually dive into generating code, there's a bit more to decide. +The important things to decide are: + +- How different registers are allowed to be used. +- The way we pass arguments to and receive arguments from functions. +- How the various data types in our language are represented. + +These questions (and more related ones we won't touch on) are called the *application binary interface*, or ABI. + +Usually, a platform has an ABI already defined and documented for the C programming language, and this serves as the "lowest common denominator" for other languages' ABIs. + +Even though you can make your compiler follow whatever ABI you want, it simplifies calling into other languages (and letting them call your code) if it roughly follows the C ABI. + +Although we aren't providing features for users to interoperate with any other code they want, we still need to call into our runtime. +Since our runtime will be written in C, we want to make this as painless as possible. + +We're producing code to run on Linux on x86_64 CPUs, so the most relevant C ABI document is the [AMD64 Architecture Processor Supplement of the System V ABI]. + +We can copy its choices about how registers can be used, and how arguments get passed to and return values get returned from functions: + +| Registers | Saved by | Notes | +| -------------------------- | -------- | --------------------- | +| rdi, rsi, rdx, rcx, r8, r9 | Caller | Function arguments | +| rax | Caller | Function return value | +| r10, r11 | Caller | | +| rbp | Callee | Frame pointer | +| rsp | Callee | Stack pointer | +| rbx, r12, r13, r14, r15 | Callee | | + +When a register is caller-saved, that means that the functions we generate are allowed to freely modify them. +However, so is any function we call, so we can't rely on values in them staying unchanged. + +Conversely, a callee-saved register must be restored to its original value before we return, but we can rely on functions we call doing the same. + +Next, we need to figure out how we're going to represent values. +We have three types of values: integers, strings, and arrays. + +Since we're implementing a dynamically typed language, any variable needs to be able to store values of any of these types. + +We also want values to fit in a single (64-bit) register, for simplicity and performance. + +Finally, we want values that contain memory addresses to be represented by those addresses, so that the garbage collector we're using can recognize them as pointers. + +The easiest thing to do is to simply represent all three as pointers to structures in the heap that have information about the type stored in them. + +For example, the array `{42, "foo"}` might be represented as the pointer `0x08001230`, pointing to data like: + +``` +╭────────────╮ +│ 0x08001230 │ HEAP +╰────────────╯ + │ │ ... │ + ╰─> ├────────────┤ + type tag │ ARRAY │ + data pointer │ 0x08001250 │───╮ + length │ 2 │ │ + capacity │ 4 │ │ + ├────────────┤ <─╯ + │ 0x08001270 │─────╮ + │ 0x08001280 │───╮ │ + │ 0 │ │ │ + │ 0 │ │ │ + ├────────────┤ <───╯ + type tag │ INTEGER │ │ + value │ 42 │ │ + ├────────────┤ <─╯ + type tag │ STRING │ + data pointer │ 0x080012a0 │───╮ + length │ 2 │ │ + capacity │ 8 │ │ + ├────────────┤ <─╯ + │ "foo", ... │ + ├────────────┤ + │ ... │ +``` + +We know that `0x08001230` points to an array, since when we dereference it, we get the `ARRAY` constant. + +It doesn't really matter what values the `INTEGER`, `ARRAY`, and `STRING` constants have. +For simplicity, let's make them: + +| Constant | Value | +| --------- | ----- | +| `INTEGER` | 0 | +| `ARRAY` | 1 | +| `STRING` | 2 | + +So, putting it all together, let's imagine the equivalent of this simple function in C: + +``` +fun f(arr, x) { + return arr[x] + x; +} +``` + +It might look something like: + +```c +enum type_tag { INTEGER, ARRAY, STRING }; +struct array { struct value** data_ptr; size_t len, cap; }; +struct string { char* data_ptr; size_t len, cap; }; +struct value { + enum type_tag tag; + union { + int64_t integer; + struct array array; + struct string string; + }; +}; + +struct value* f(struct value* arr, struct value* x) { + assert(arr->tag == ARRAY); + assert( x->tag == INTEGER); + assert(x->integer < arr->array.len); + struct value* arr_x = arr->array.data_ptr[x->integer]; + + assert(arr_x->tag == INTEGER); + assert( x->tag == INTEGER); + return make_int(arr_x->integer + x->integer); +} +``` + +[AMD64 Architecture Processor Supplement of the System V ABI]: https://gitlab.com/x86-psABIs/x86-64-ABI |