diff options
-rw-r--r-- | Compiler in a Day.page | 535 |
1 files changed, 534 insertions, 1 deletions
diff --git a/Compiler in a Day.page b/Compiler in a Day.page index 2a5c60b..a732d4d 100644 --- a/Compiler in a Day.page +++ b/Compiler in a Day.page @@ -21,4 +21,537 @@ They are, in the order they get run: TODO: pictures! -Before we can start looking at these steps, however, we should look at the language we'll be compiling.
\ No newline at end of file +Before we can start looking at these steps, however, we should look at the language we'll be compiling. + +# The Ciad Language + +Before we start looking at the compiler, we should see and understand the language we’ll be compiling. This language is named Ciad (pronounced like “key-add”), which is an acronym for the name of this article. + +TODO: This is [here](https://plwiki.org/Compiler%20in%20a%20Day/01-ciadlang), but I really don't like it and want to rewrite it. + +Unlike many (most?) other compiler tutorials, we _won’t_ be starting with parsing. Instead, we’ll start with how we represent the program after it’s been parsed. + +# The Abstract Syntax Tree + +Most programming languages' syntax can be described with a tree structure. +An expression like `m.a * m.d - m.b * m.c` could be represented as: + +``` + ╭────────── - ──────────╮ + ╭──── * ────╮ ╭──── * ────╮ + ╭─ . ─╮ ╭─ . ─╮ ╭─ . ─╮ ╭─ . ─╮ +var field var field var field var field + │ │ │ │ │ │ │ │ +"m" "a" "m" "d" "m" "b" "m" "c" +``` + +Such a tree is called an *abstract syntax tree*, or AST. + +TODO: Move the CST material to the "what to do better" section? + +In our compiler, the parser directly produces the AST. +In some compilers, there's an intermediate tree called the *concrete syntax tree*, or CST. + +The CST is a tree that exactly corresponds to the input program, with every token present as a child. + +``` + ╭────────────── binop ──────────────╮ + ╭───── binop ─────╮ │ ╭───── binop ─────╮ + ╭─ dot ─╮ │ ╭─ dot ─╮ │ ╭─ dot ─╮ │ ╭─ dot ─╮ +var │ field │ var │ field │ var │ field │ var │ field + │ │ │ │ │ │ │ │ │ │ │ │ │ │ │ +"m" "." "a" "*" "m" "." "d" "-" "m" "." "b" "*" "m" "." "c" +``` + +This tree is called the *concrete syntax tree*, or CST. +It gets simplified to produce the AST. + +TODO: A better example, one where there's actual meaningful simplification? + +Typically, this is done so that type-checking and other error checks can be performed with as much information about the source code present as possible. + +In a simple compiler like ours, though, we don't lose much quality of error reporting checking for errors on the AST instead. + +Now, it's finally time to see some of the code in our compiler! + +To start with, we'll just define the structure of the tree as Ruby classes. + +Recall from [the syntactic grammar](./01-ciadlang.md#syntactic-grammar) that a program is a sequence of declarations. + +```ruby +class Program + def initialize(decls) + @decls = decls + end +end +``` + +Declarations either declare a function or a variable. +Both have a name, but a function also has a list of arguments and a block of statements that comprise its body. + +```ruby +class DeclFun + def initialize(name, args, body) + @name, @args, @body = name, args, body + end +end + +class DeclVar + def initialize(name) + @name = name + end +end +``` + +We'll also define a helper class for arguments. +For now, this will just wrap the name of an argument, but later, it'll come in handy. + +```ruby +class Arg + def initialize(name) + @name = name + end +end +``` + +There are four kinds of statements: assignment statements, if statements, return statements, and while statements. +These follow what's in the grammar. + +Remember, though, that expressions and variable declarations can also show up as statements! + +In a statically typed language, we would probably define `StmtExpr` and `StmtDeclVar` classes to handle them. +However, we don't need to in Ruby, since it's dynamically typed! + +TODO: okay, I feel like I ought to justify this better... +But the justification involves the visitors we're going to introduce later. + +```ruby +class StmtAssign + def initialize(lhs, rhs) + @lhs, @rhs = lhs, rhs + end +end + +class StmtIf + attr_reader :then_stmts + attr_reader :else_stmts + def initialize(cond, then_stmts, else_stmts) + @cond = cond + @then_stmts = then_stmts + @else_stmts = else_stmts + end +end + +class StmtReturn + def initialize(expr) + @expr = expr + end +end + +class StmtWhile + attr_reader :body_stmts + def initialize(cond, body_stmts) + @cond, @body_stmts = cond, body_stmts + end +end +``` + +Finally, we have the classes for expressions. +There are 10 of these, that are exactly like what we've seen so far. + +```ruby +class ExprAppend # lhs ++ rhs + def initialize(lhs, rhs) + @lhs, @rhs = lhs, rhs + end +end + +class ExprArrayIndex # array[index] + def initialize(array, index) + @array, @index = array, index + end +end + +class ExprArrayLit # { exprs... } + def initialize(exprs) + @exprs = exprs + end +end + +class ExprCall # func(args...) + def initialize(func, args) + @func, @args = func, args + end +end + +class ExprEquals # lhs == rhs + def initialize(lhs, rhs) + @lhs, @rhs = lhs, rhs + end +end + +class ExprIntLit + def initialize(value) + @value = value + end +end + +class ExprMinus # lhs - rhs + def initialize(lhs, rhs) + @lhs, @rhs = lhs, rhs + end +end + +class ExprPlus # lhs + rhs + def initialize(lhs, rhs) + @lhs, @rhs = lhs, rhs + end +end + +class ExprStringLit + def initialize(value) + @value = value + end +end + +class ExprVariable + def initialize(name) + @name = name + end +end +``` + +Now that we have all these classes, we can build ASTs corresponding to programs. + +Let's look at a simple program and determine what its AST should look like. + +```ciad +var x; + +fun main(args) { + x = length(args); + print({x, args}); +} +``` + +Because our AST is so simple, it corresponds pretty directly to the program. + +TODO: Is there more to say here? + +```ruby +example = Program.new([ + # var x; + DeclVar.new("x"), + # fun main(args) { ... } + DeclFun.new("main", [Arg.new("args")], [ + # x = length(args); + StmtAssign.new( + ExprVariable.new("x"), + ExprCall.new("length", [ExprVariable.new("args")])), + # print({x, args}); + ExprCall.new("print", [ + ExprArrayLit.new([ + ExprVariable.new("x"), + ExprVariable.new("args"), + ]) + ]), + ]), +]) +``` + +Now that we have our AST, we'll need to process it. + +# Frame Layout + +When programs execute, their local variables' values are typically stored into a data structure called the *call stack*, or commonly just *the stack*. + +This is a stack data structure whose entries are called *stack frames*. +These contain all the information that a function might need to save while calling another one. + +For example, consider the following program. + +```ciad +fun f(x) { + var y; + var z; + + y = x + 3; + z = g(x - 1); + return y - z; +} + +fun g() { + if x == 0 { + return 42; + } else { + return f(x - 1); + } +} + +fun main(args) { + return f(2); +} +``` + +When `f` calls `g`, we need to store the value of `y` somewhere, so that it can be used again after `g` returns. +We also need to store what function `f` should return to once it finishes, and we store that on the stack as well. + +In general, most languages structure their stacks as something like: + +``` +╭───────────────────────────────╮ <-- "bottom of the stack" +│ address main should return to │ (highest address +├───────────────────────────────┤ on most current +│ saved frame pointer │ CPU architectures) +├───────────────────────────────┤ +│ │ +│ saved local variables of main │ +│ │ +┝━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥ +│ address f should return to │ +├───────────────────────────────┤ +│ saved frame pointer │ +├───────────────────────────────┤ +│ │ +│ saved local variables of f │ +│ │ +┝━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥ +│ address g should return to │ +├───────────────────────────────┤ +│ saved frame pointer │ frame pointer +├───────────────────────────────┤ <--------- when running g +│ │ +│ local variables of g │ stack pointer +│ │ when running g, aka +╰───────────────────────────────╯ <----- "top of the stack" +``` + +The stack then "grows down" as functions are called. +When a function returns, it removes its frame from the stack, shrinking it. + +The stack is kept track of with the *stack pointer*, which points to the top of the stack, i.e. the address at the _end_ of the current stack frame. + +In some programs, there's also a *frame pointer*. +This points to the address at the start of local variables of the current stack frame. + +With a smarter compiler than we're building in this series, the frame pointer isn't necessary. +However, it can still be useful for a debugger and some kinds of runtime introspection. + +In our compiler, we'll need to have a pass that computes how much storage the local variables of each function will take up in the stack frame. + +Once we know this, we can compute the address of any local variable as an offset from the frame pointer. + +Alright, let's get started with the code! + +The first thing we'll need to do is to extend `Arg` and `DeclVar` with the ability to store a slot index. +This the offset of the variable from the frame pointer. + +```ruby +class Arg + # ... + + attr_accessor :slot +end + +class DeclVar + # ... + + attr_accessor :slot +end +``` + +Then, we can add the method to perform the pass to the `Program` class. + +Computing frame layouts doesn't require any sort of information that's shared between functions, so we can just tell each function declaration to compute it independently. + +```ruby +class Program + # ... + + def compute_frame_layouts + @decls.each do |decl| + case decl + in DeclFun + decl.compute_frame_layout + in DeclVar + # Ignore it; variables don't have any kind of frame + # to be laid out. + end + end + end +end +``` + +Once we're processing the function, we start with storing a counter for the number of local variables we have in the declaration. + +We define a helper function (`assign_slot_to`) to be called on every `Arg` and `DeclVar` in the function as well. + +```ruby +class DeclFun + # ... + + def compute_frame_layout + @locals_count = 0 + def assign_slot_to(var) + var.slot = @locals_count + @locals_count += 1 + end + + @args.each { |arg| assign_slot_to arg } + + def traverse_stmt(stmt) + case stmt + in DeclVar + assign_slot_to stmt + + in StmtIf + stmt.then_stmts.each { |stmt| traverse_stmt stmt } + stmt.else_stmts.each { |stmt| traverse_stmt stmt } + in StmtWhile + stmt.body_stmts.each { |stmt| traverse_stmt stmt } + + else + # Do nothing; nothing else needs to get a slot nor + # contains anything that needs to get a slot. + end + end + + @body.each { |stmt| traverse_stmt stmt } + end +end +``` + +Once this has finished running, every `DeclStmt` will have an `@locals_count`, and every `Arg` and non-global `DeclVar` will have an `@slot`. + +Before we can start generating code, though, we should look at some details of exactly what code we'll be generating. + +# The ABI + +We're ready to start making the final decisions about the machine code we're going to produce. + +To reiterate, we're _not_ aiming to produce efficient machine code for now. +We're just aiming to get the simplest thing that works. + +The important things to decide are: + +- How different registers are allowed to be used. +- The way we pass arguments to and receive arguments from functions. +- How the various data types in our language are represented. + +These questions (and more related ones we won't touch on) are called the *application binary interface*, or ABI. + +Usually, a platform has an ABI already defined and documented for the C programming language. +This C ABI serves as the "lowest common denominator" for other languages' ABIs. + +Even though you can make your compiler follow whatever ABI you want, it simplifies calling into other languages (and letting them call your code) if it roughly follows the C ABI. + +Although we aren't providing features for users to interoperate with any other code they want, we still need to call into our runtime. +Since our runtime will be written in C, we want to make this as painless as possible. + +We're producing code to run on Linux on x86_64 CPUs, so the most relevant C ABI document is the [AMD64 Architecture Processor Supplement of the System V ABI]. + +We can copy its choices about how registers can be used, and how arguments get passed to and return values get returned from functions: + +| Registers | Saved by | Notes | +| -------------------------- | -------- | --------------------- | +| rdi, rsi, rdx, rcx, r8, r9 | Caller | Function arguments | +| rax | Caller | Function return value | +| r10, r11 | Caller | | +| rbp | Callee | Frame pointer | +| rsp | Callee | Stack pointer | +| rbx, r12, r13, r14, r15 | Callee | | + +When a register is caller-saved, that means that the functions we generate are allowed to freely modify them. +However, so is any function we call, so we can't rely on values in them staying unchanged. + +Conversely, a callee-saved register must be restored to its original value before we return, but we can rely on functions we call doing the same. + +Next, we need to figure out how we're going to represent values. +We have three types of values: integers, strings, and arrays. + +Since we're implementing a dynamically typed language, any variable needs to be able to store values of any of these types. + +We also want values to fit in a single (64-bit) register, for simplicity and performance. + +Finally, we want values that contain memory addresses to be represented by those addresses, so that the garbage collector we're using can recognize them as pointers. + +The easiest thing to do is to simply represent all three as pointers to structures in the heap that have information about the type stored in them. + +For example, the array `{42, "foo"}` might be represented as the pointer `0x08001230`, pointing to data like: + +``` +╭────────────╮ +│ 0x08001230 │ HEAP +╰────────────╯ + │ │ ... │ + ╰─> ├────────────┤ + type tag │ ARRAY │ + data pointer │ 0x08001250 │───╮ + length │ 2 │ │ + capacity │ 4 │ │ + ├────────────┤ <─╯ + │ 0x08001270 │─────╮ + │ 0x08001280 │───╮ │ + │ 0 │ │ │ + │ 0 │ │ │ + ├────────────┤ <───╯ + type tag │ INTEGER │ │ + value │ 42 │ │ + ├────────────┤ <─╯ + type tag │ STRING │ + data pointer │ 0x080012a0 │───╮ + length │ 2 │ │ + capacity │ 8 │ │ + ├────────────┤ <─╯ + │ "foo", ... │ + ├────────────┤ + │ ... │ +``` + +We know that `0x08001230` points to an array, since when we dereference it, we get the `ARRAY` constant. + +It doesn't really matter what values the `INTEGER`, `ARRAY`, and `STRING` constants have. +For simplicity, let's make them: + +| Constant | Value | +| --------- | ----- | +| `INTEGER` | 0 | +| `ARRAY` | 1 | +| `STRING` | 2 | + +So, putting it all together, let's imagine the equivalent of this simple function in C: + +``` +fun f(arr, x) { + return arr[x] + x; +} +``` + +It might look something like: + +```c +enum type_tag { INTEGER, ARRAY, STRING }; +struct array { struct value** data_ptr; size_t len, cap; }; +struct string { char* data_ptr; size_t len, cap; }; +struct value { + enum type_tag tag; + union { + int64_t integer; + struct array array; + struct string string; + }; +}; + +struct value* f(struct value* arr, struct value* x) { + assert(arr->tag == ARRAY); + assert( x->tag == INTEGER); + assert(x->integer < arr->array.len); + struct value* arr_x = arr->array.data_ptr[x->integer]; + + assert(arr_x->tag == INTEGER); + assert( x->tag == INTEGER); + return make_int(arr_x->integer + x->integer); +} +``` + +[AMD64 Architecture Processor Supplement of the System V ABI]: https://gitlab.com/x86-psABIs/x86-64-ABI |