1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
|
+++
title = "Code Generation — Compiler In A Day"
prev = "04-abi.md"
next = "05-codegen.md"
+++
# Code Generation
We're ready to start producing machine code!
To reiterate, we're _not_ aiming to produce efficient machine code for now.
We're just aiming to get the simplest thing that works.
Before we actually dive into generating code, there's a bit more to decide.
The important things to decide are:
- How different registers are allowed to be used.
- The way we pass arguments to and receive arguments from functions.
- How the various data types in our language are represented.
These questions (and more related ones we won't touch on) are called the *application binary interface*, or ABI.
Usually, a platform has an ABI already defined and documented for the C programming language, and this serves as the "lowest common denominator" for other languages' ABIs.
Even though you can make your compiler follow whatever ABI you want, it simplifies calling into other languages (and letting them call your code) if it roughly follows the C ABI.
Although we aren't providing features for users to interoperate with any other code they want, we still need to call into our runtime.
Since our runtime will be written in C, we want to make this as painless as possible.
We're producing code to run on Linux on x86_64 CPUs, so the most relevant C ABI document is the [AMD64 Architecture Processor Supplement of the System V ABI].
We can copy its choices about how registers can be used, and how arguments get passed to and return values get returned from functions:
| Registers | Saved by | Notes |
| -------------------------- | -------- | --------------------- |
| rdi, rsi, rdx, rcx, r8, r9 | Caller | Function arguments |
| rax | Caller | Function return value |
| r10, r11 | Caller | |
| rbp | Callee | Frame pointer |
| rsp | Callee | Stack pointer |
| rbx, r12, r13, r14, r15 | Callee | |
When a register is caller-saved, that means that the functions we generate are allowed to freely modify them.
However, so is any function we call, so we can't rely on values in them staying unchanged.
Conversely, a callee-saved register must be restored to its original value before we return, but we can rely on functions we call doing the same.
Next, we need to figure out how we're going to represent values.
We have three types of values: integers, strings, and arrays.
Since we're implementing a dynamically typed language, any variable needs to be able to store values of any of these types.
We also want values to fit in a single (64-bit) register, for simplicity and performance.
Finally, we want values that contain memory addresses to be represented by those addresses, so that the garbage collector we're using can recognize them as pointers.
The easiest thing to do is to simply represent all three as pointers to structures in the heap that have information about the type stored in them.
For example, the array `{42, "foo"}` might be represented as the pointer `0x08001230`, pointing to data like:
```
╭────────────╮
│ 0x08001230 │ HEAP
╰────────────╯
│ │ ... │
╰─> ├────────────┤
type tag │ ARRAY │
data pointer │ 0x08001250 │───╮
length │ 2 │ │
capacity │ 4 │ │
├────────────┤ <─╯
│ 0x08001270 │─────╮
│ 0x08001280 │───╮ │
│ 0 │ │ │
│ 0 │ │ │
├────────────┤ <───╯
type tag │ INTEGER │ │
value │ 42 │ │
├────────────┤ <─╯
type tag │ STRING │
data pointer │ 0x080012a0 │───╮
length │ 2 │ │
capacity │ 8 │ │
├────────────┤ <─╯
│ "foo", ... │
├────────────┤
│ ... │
```
We know that `0x08001230` points to an array, since when we dereference it, we get the `ARRAY` constant.
It doesn't really matter what values the `INTEGER`, `ARRAY`, and `STRING` constants have.
For simplicity, let's make them:
| Constant | Value |
| --------- | ----- |
| `INTEGER` | 0 |
| `ARRAY` | 1 |
| `STRING` | 2 |
So, putting it all together, let's imagine the equivalent of this simple function in C:
```
fun f(arr, x) {
return arr[x] + x;
}
```
It might look something like:
```c
enum type_tag { INTEGER, ARRAY, STRING };
struct array { struct value** data_ptr; size_t len, cap; };
struct string { char* data_ptr; size_t len, cap; };
struct value {
enum type_tag tag;
union {
int64_t integer;
struct array array;
struct string string;
};
};
struct value* f(struct value* arr, struct value* x) {
assert(arr->tag == ARRAY);
assert( x->tag == INTEGER);
assert(x->integer < arr->array.len);
struct value* arr_x = arr->array.data_ptr[x->integer];
assert(arr_x->tag == INTEGER);
assert( x->tag == INTEGER);
return make_int(arr_x->integer + x->integer);
}
```
[AMD64 Architecture Processor Supplement of the System V ABI]: https://gitlab.com/x86-psABIs/x86-64-ABI
|