summaryrefslogtreecommitdiff
diff options
context:
space:
mode:
-rw-r--r--Compiler in a Day.page535
1 files changed, 534 insertions, 1 deletions
diff --git a/Compiler in a Day.page b/Compiler in a Day.page
index 2a5c60b..a732d4d 100644
--- a/Compiler in a Day.page
+++ b/Compiler in a Day.page
@@ -21,4 +21,537 @@ They are, in the order they get run:
TODO: pictures!
-Before we can start looking at these steps, however, we should look at the language we'll be compiling. \ No newline at end of file
+Before we can start looking at these steps, however, we should look at the language we'll be compiling.
+
+# The Ciad Language
+
+Before we start looking at the compiler, we should see and understand the language we’ll be compiling. This language is named Ciad (pronounced like “key-add”), which is an acronym for the name of this article.
+
+TODO: This is [here](https://plwiki.org/Compiler%20in%20a%20Day/01-ciadlang), but I really don't like it and want to rewrite it.
+
+Unlike many (most?) other compiler tutorials, we _won’t_ be starting with parsing. Instead, we’ll start with how we represent the program after it’s been parsed.
+
+# The Abstract Syntax Tree
+
+Most programming languages' syntax can be described with a tree structure.
+An expression like `m.a * m.d - m.b * m.c` could be represented as:
+
+```
+ ╭────────── - ──────────╮
+ ╭──── * ────╮ ╭──── * ────╮
+ ╭─ . ─╮ ╭─ . ─╮ ╭─ . ─╮ ╭─ . ─╮
+var field var field var field var field
+ │ │ │ │ │ │ │ │
+"m" "a" "m" "d" "m" "b" "m" "c"
+```
+
+Such a tree is called an *abstract syntax tree*, or AST.
+
+TODO: Move the CST material to the "what to do better" section?
+
+In our compiler, the parser directly produces the AST.
+In some compilers, there's an intermediate tree called the *concrete syntax tree*, or CST.
+
+The CST is a tree that exactly corresponds to the input program, with every token present as a child.
+
+```
+ ╭────────────── binop ──────────────╮
+ ╭───── binop ─────╮ │ ╭───── binop ─────╮
+ ╭─ dot ─╮ │ ╭─ dot ─╮ │ ╭─ dot ─╮ │ ╭─ dot ─╮
+var │ field │ var │ field │ var │ field │ var │ field
+ │ │ │ │ │ │ │ │ │ │ │ │ │ │ │
+"m" "." "a" "*" "m" "." "d" "-" "m" "." "b" "*" "m" "." "c"
+```
+
+This tree is called the *concrete syntax tree*, or CST.
+It gets simplified to produce the AST.
+
+TODO: A better example, one where there's actual meaningful simplification?
+
+Typically, this is done so that type-checking and other error checks can be performed with as much information about the source code present as possible.
+
+In a simple compiler like ours, though, we don't lose much quality of error reporting checking for errors on the AST instead.
+
+Now, it's finally time to see some of the code in our compiler!
+
+To start with, we'll just define the structure of the tree as Ruby classes.
+
+Recall from [the syntactic grammar](./01-ciadlang.md#syntactic-grammar) that a program is a sequence of declarations.
+
+```ruby
+class Program
+ def initialize(decls)
+ @decls = decls
+ end
+end
+```
+
+Declarations either declare a function or a variable.
+Both have a name, but a function also has a list of arguments and a block of statements that comprise its body.
+
+```ruby
+class DeclFun
+ def initialize(name, args, body)
+ @name, @args, @body = name, args, body
+ end
+end
+
+class DeclVar
+ def initialize(name)
+ @name = name
+ end
+end
+```
+
+We'll also define a helper class for arguments.
+For now, this will just wrap the name of an argument, but later, it'll come in handy.
+
+```ruby
+class Arg
+ def initialize(name)
+ @name = name
+ end
+end
+```
+
+There are four kinds of statements: assignment statements, if statements, return statements, and while statements.
+These follow what's in the grammar.
+
+Remember, though, that expressions and variable declarations can also show up as statements!
+
+In a statically typed language, we would probably define `StmtExpr` and `StmtDeclVar` classes to handle them.
+However, we don't need to in Ruby, since it's dynamically typed!
+
+TODO: okay, I feel like I ought to justify this better...
+But the justification involves the visitors we're going to introduce later.
+
+```ruby
+class StmtAssign
+ def initialize(lhs, rhs)
+ @lhs, @rhs = lhs, rhs
+ end
+end
+
+class StmtIf
+ attr_reader :then_stmts
+ attr_reader :else_stmts
+ def initialize(cond, then_stmts, else_stmts)
+ @cond = cond
+ @then_stmts = then_stmts
+ @else_stmts = else_stmts
+ end
+end
+
+class StmtReturn
+ def initialize(expr)
+ @expr = expr
+ end
+end
+
+class StmtWhile
+ attr_reader :body_stmts
+ def initialize(cond, body_stmts)
+ @cond, @body_stmts = cond, body_stmts
+ end
+end
+```
+
+Finally, we have the classes for expressions.
+There are 10 of these, that are exactly like what we've seen so far.
+
+```ruby
+class ExprAppend # lhs ++ rhs
+ def initialize(lhs, rhs)
+ @lhs, @rhs = lhs, rhs
+ end
+end
+
+class ExprArrayIndex # array[index]
+ def initialize(array, index)
+ @array, @index = array, index
+ end
+end
+
+class ExprArrayLit # { exprs... }
+ def initialize(exprs)
+ @exprs = exprs
+ end
+end
+
+class ExprCall # func(args...)
+ def initialize(func, args)
+ @func, @args = func, args
+ end
+end
+
+class ExprEquals # lhs == rhs
+ def initialize(lhs, rhs)
+ @lhs, @rhs = lhs, rhs
+ end
+end
+
+class ExprIntLit
+ def initialize(value)
+ @value = value
+ end
+end
+
+class ExprMinus # lhs - rhs
+ def initialize(lhs, rhs)
+ @lhs, @rhs = lhs, rhs
+ end
+end
+
+class ExprPlus # lhs + rhs
+ def initialize(lhs, rhs)
+ @lhs, @rhs = lhs, rhs
+ end
+end
+
+class ExprStringLit
+ def initialize(value)
+ @value = value
+ end
+end
+
+class ExprVariable
+ def initialize(name)
+ @name = name
+ end
+end
+```
+
+Now that we have all these classes, we can build ASTs corresponding to programs.
+
+Let's look at a simple program and determine what its AST should look like.
+
+```ciad
+var x;
+
+fun main(args) {
+ x = length(args);
+ print({x, args});
+}
+```
+
+Because our AST is so simple, it corresponds pretty directly to the program.
+
+TODO: Is there more to say here?
+
+```ruby
+example = Program.new([
+ # var x;
+ DeclVar.new("x"),
+ # fun main(args) { ... }
+ DeclFun.new("main", [Arg.new("args")], [
+ # x = length(args);
+ StmtAssign.new(
+ ExprVariable.new("x"),
+ ExprCall.new("length", [ExprVariable.new("args")])),
+ # print({x, args});
+ ExprCall.new("print", [
+ ExprArrayLit.new([
+ ExprVariable.new("x"),
+ ExprVariable.new("args"),
+ ])
+ ]),
+ ]),
+])
+```
+
+Now that we have our AST, we'll need to process it.
+
+# Frame Layout
+
+When programs execute, their local variables' values are typically stored into a data structure called the *call stack*, or commonly just *the stack*.
+
+This is a stack data structure whose entries are called *stack frames*.
+These contain all the information that a function might need to save while calling another one.
+
+For example, consider the following program.
+
+```ciad
+fun f(x) {
+ var y;
+ var z;
+
+ y = x + 3;
+ z = g(x - 1);
+ return y - z;
+}
+
+fun g() {
+ if x == 0 {
+ return 42;
+ } else {
+ return f(x - 1);
+ }
+}
+
+fun main(args) {
+ return f(2);
+}
+```
+
+When `f` calls `g`, we need to store the value of `y` somewhere, so that it can be used again after `g` returns.
+We also need to store what function `f` should return to once it finishes, and we store that on the stack as well.
+
+In general, most languages structure their stacks as something like:
+
+```
+╭───────────────────────────────╮ <-- "bottom of the stack"
+│ address main should return to │ (highest address
+├───────────────────────────────┤ on most current
+│ saved frame pointer │ CPU architectures)
+├───────────────────────────────┤
+│ │
+│ saved local variables of main │
+│ │
+┝━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥
+│ address f should return to │
+├───────────────────────────────┤
+│ saved frame pointer │
+├───────────────────────────────┤
+│ │
+│ saved local variables of f │
+│ │
+┝━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━━┥
+│ address g should return to │
+├───────────────────────────────┤
+│ saved frame pointer │ frame pointer
+├───────────────────────────────┤ <--------- when running g
+│ │
+│ local variables of g │ stack pointer
+│ │ when running g, aka
+╰───────────────────────────────╯ <----- "top of the stack"
+```
+
+The stack then "grows down" as functions are called.
+When a function returns, it removes its frame from the stack, shrinking it.
+
+The stack is kept track of with the *stack pointer*, which points to the top of the stack, i.e. the address at the _end_ of the current stack frame.
+
+In some programs, there's also a *frame pointer*.
+This points to the address at the start of local variables of the current stack frame.
+
+With a smarter compiler than we're building in this series, the frame pointer isn't necessary.
+However, it can still be useful for a debugger and some kinds of runtime introspection.
+
+In our compiler, we'll need to have a pass that computes how much storage the local variables of each function will take up in the stack frame.
+
+Once we know this, we can compute the address of any local variable as an offset from the frame pointer.
+
+Alright, let's get started with the code!
+
+The first thing we'll need to do is to extend `Arg` and `DeclVar` with the ability to store a slot index.
+This the offset of the variable from the frame pointer.
+
+```ruby
+class Arg
+ # ...
+
+ attr_accessor :slot
+end
+
+class DeclVar
+ # ...
+
+ attr_accessor :slot
+end
+```
+
+Then, we can add the method to perform the pass to the `Program` class.
+
+Computing frame layouts doesn't require any sort of information that's shared between functions, so we can just tell each function declaration to compute it independently.
+
+```ruby
+class Program
+ # ...
+
+ def compute_frame_layouts
+ @decls.each do |decl|
+ case decl
+ in DeclFun
+ decl.compute_frame_layout
+ in DeclVar
+ # Ignore it; variables don't have any kind of frame
+ # to be laid out.
+ end
+ end
+ end
+end
+```
+
+Once we're processing the function, we start with storing a counter for the number of local variables we have in the declaration.
+
+We define a helper function (`assign_slot_to`) to be called on every `Arg` and `DeclVar` in the function as well.
+
+```ruby
+class DeclFun
+ # ...
+
+ def compute_frame_layout
+ @locals_count = 0
+ def assign_slot_to(var)
+ var.slot = @locals_count
+ @locals_count += 1
+ end
+
+ @args.each { |arg| assign_slot_to arg }
+
+ def traverse_stmt(stmt)
+ case stmt
+ in DeclVar
+ assign_slot_to stmt
+
+ in StmtIf
+ stmt.then_stmts.each { |stmt| traverse_stmt stmt }
+ stmt.else_stmts.each { |stmt| traverse_stmt stmt }
+ in StmtWhile
+ stmt.body_stmts.each { |stmt| traverse_stmt stmt }
+
+ else
+ # Do nothing; nothing else needs to get a slot nor
+ # contains anything that needs to get a slot.
+ end
+ end
+
+ @body.each { |stmt| traverse_stmt stmt }
+ end
+end
+```
+
+Once this has finished running, every `DeclStmt` will have an `@locals_count`, and every `Arg` and non-global `DeclVar` will have an `@slot`.
+
+Before we can start generating code, though, we should look at some details of exactly what code we'll be generating.
+
+# The ABI
+
+We're ready to start making the final decisions about the machine code we're going to produce.
+
+To reiterate, we're _not_ aiming to produce efficient machine code for now.
+We're just aiming to get the simplest thing that works.
+
+The important things to decide are:
+
+- How different registers are allowed to be used.
+- The way we pass arguments to and receive arguments from functions.
+- How the various data types in our language are represented.
+
+These questions (and more related ones we won't touch on) are called the *application binary interface*, or ABI.
+
+Usually, a platform has an ABI already defined and documented for the C programming language.
+This C ABI serves as the "lowest common denominator" for other languages' ABIs.
+
+Even though you can make your compiler follow whatever ABI you want, it simplifies calling into other languages (and letting them call your code) if it roughly follows the C ABI.
+
+Although we aren't providing features for users to interoperate with any other code they want, we still need to call into our runtime.
+Since our runtime will be written in C, we want to make this as painless as possible.
+
+We're producing code to run on Linux on x86_64 CPUs, so the most relevant C ABI document is the [AMD64 Architecture Processor Supplement of the System V ABI].
+
+We can copy its choices about how registers can be used, and how arguments get passed to and return values get returned from functions:
+
+| Registers | Saved by | Notes |
+| -------------------------- | -------- | --------------------- |
+| rdi, rsi, rdx, rcx, r8, r9 | Caller | Function arguments |
+| rax | Caller | Function return value |
+| r10, r11 | Caller | |
+| rbp | Callee | Frame pointer |
+| rsp | Callee | Stack pointer |
+| rbx, r12, r13, r14, r15 | Callee | |
+
+When a register is caller-saved, that means that the functions we generate are allowed to freely modify them.
+However, so is any function we call, so we can't rely on values in them staying unchanged.
+
+Conversely, a callee-saved register must be restored to its original value before we return, but we can rely on functions we call doing the same.
+
+Next, we need to figure out how we're going to represent values.
+We have three types of values: integers, strings, and arrays.
+
+Since we're implementing a dynamically typed language, any variable needs to be able to store values of any of these types.
+
+We also want values to fit in a single (64-bit) register, for simplicity and performance.
+
+Finally, we want values that contain memory addresses to be represented by those addresses, so that the garbage collector we're using can recognize them as pointers.
+
+The easiest thing to do is to simply represent all three as pointers to structures in the heap that have information about the type stored in them.
+
+For example, the array `{42, "foo"}` might be represented as the pointer `0x08001230`, pointing to data like:
+
+```
+╭────────────╮
+│ 0x08001230 │ HEAP
+╰────────────╯
+ │ │ ... │
+ ╰─> ├────────────┤
+ type tag │ ARRAY │
+ data pointer │ 0x08001250 │───╮
+ length │ 2 │ │
+ capacity │ 4 │ │
+ ├────────────┤ <─╯
+ │ 0x08001270 │─────╮
+ │ 0x08001280 │───╮ │
+ │ 0 │ │ │
+ │ 0 │ │ │
+ ├────────────┤ <───╯
+ type tag │ INTEGER │ │
+ value │ 42 │ │
+ ├────────────┤ <─╯
+ type tag │ STRING │
+ data pointer │ 0x080012a0 │───╮
+ length │ 2 │ │
+ capacity │ 8 │ │
+ ├────────────┤ <─╯
+ │ "foo", ... │
+ ├────────────┤
+ │ ... │
+```
+
+We know that `0x08001230` points to an array, since when we dereference it, we get the `ARRAY` constant.
+
+It doesn't really matter what values the `INTEGER`, `ARRAY`, and `STRING` constants have.
+For simplicity, let's make them:
+
+| Constant | Value |
+| --------- | ----- |
+| `INTEGER` | 0 |
+| `ARRAY` | 1 |
+| `STRING` | 2 |
+
+So, putting it all together, let's imagine the equivalent of this simple function in C:
+
+```
+fun f(arr, x) {
+ return arr[x] + x;
+}
+```
+
+It might look something like:
+
+```c
+enum type_tag { INTEGER, ARRAY, STRING };
+struct array { struct value** data_ptr; size_t len, cap; };
+struct string { char* data_ptr; size_t len, cap; };
+struct value {
+ enum type_tag tag;
+ union {
+ int64_t integer;
+ struct array array;
+ struct string string;
+ };
+};
+
+struct value* f(struct value* arr, struct value* x) {
+ assert(arr->tag == ARRAY);
+ assert( x->tag == INTEGER);
+ assert(x->integer < arr->array.len);
+ struct value* arr_x = arr->array.data_ptr[x->integer];
+
+ assert(arr_x->tag == INTEGER);
+ assert( x->tag == INTEGER);
+ return make_int(arr_x->integer + x->integer);
+}
+```
+
+[AMD64 Architecture Processor Supplement of the System V ABI]: https://gitlab.com/x86-psABIs/x86-64-ABI