Compiler in a Day/01-ciadlang.page


1
2
3
4
5
6
7
8
9
10
11
12
13
14
15
16
17
18
19
20
21
22
23
24
25
26
27
28
29
30
31
32
33
34
35
36
37
38
39
40
41
42
43
44
45
46
47
48
49
50
51
52
53
54
55
56
57
58
59
60
61
62
63
64
65
66
67
68
69
70
71
72
73
74
75
76
77
78
79
80
81
82
83
84
85
86
87
88
89
90
91
92
93
94
95
96
97
98
99
100
101
102
103
104
105
106
107
108
109
110
111
112
113
114
115
116
117
118
119
120
121
122
123
124
125
126
127
128
129
130
131
132
133
134
135
136
137
138
139
140
141
142
143
144
145
146
147
148
149
150
151
152
153
154
155
156
157
158
159
160
161
162
163
164
165
166
167
168
169
170
171
172
173
174
175
176
177
178
179
180
181
182
183
184
185
186
187
188
189
190
191
192
193
194
195
196
197
198
199
200
201
202
203
204
205
206
207
208
209
210
211
212
213
214
215
216
217
218
219
220
221
222
223
224
225
226
227
228
229
230
231
232
233
234
235
236
237
238
239
240
241
242
243
244
245
246
247
248
249
250
251
252
253
254
255
256
257
258
259
260
261
262
263
264
265
266
267
268
269
270
271
272
273
274
275
276
277
278
279
280
281
282
283
284
285
286
287
288
289

+++
title = "The Ciad language — Compiler In A Day"
prev = "00-overview.md"
next = "02-ast.md"
+++

# The Ciad language

Before we start looking at the compiler, we should see and understand the language we'll be compiling.
This language is named Ciad (pronounced like "key-add"), which is an acronym for the name of this series.

This article will be a bit dry; it's mostly describing what is, rather than *why* it is.
Some design decisions will be expounded on later, when their benefits are reaped.

Other design decisions take more time to explain; some of these will get their own articles.

One choice about this article that's worth discussing up front — the description of the language is done both in natural language, and in an appropriate formalism.

The main benefit of the latter is precision — with a good formal description, we can avoid needing to argue about what the compiler "should" do, or a program "should" mean; instead, we can consult the formalism.

A secondary benefit (that we won't exploit in this series) is the wide variety of tools and techniques that can give us "nice things" from formal descriptions of a programming language.

In theory, a full formal description of the syntax and semantics of a programming language can be used to "mechanically" generate a complete working compiler for the language.

In practice, almost every language has at least one "weird thing" that breaks the ability to do this.
Still, sticking to having a good formal description makes it easier to take advantage of programming language theory, even if we do have to do some "fixing up" to handle weird things in our language.

## Syntax

The formalism we'll be using to describe Ciad's syntax is [EBNF], a notation used for describing [context-free grammars].

TODO: Should I say more about CFGs?

TODO: I should definitely mention tokens.

The syntax of a programming language is typically described in two parts, with the lexical syntax described separately from the rest.

This is because it's typically simpler to discuss things like comments and whitespace once, and then state a rule like "comment and whitespace tokens are ignored during parsing," rather than noting that a comment or whitespace token could appear at any point in the actual grammar.

This also fits with the separation between lexing and parsing; typically, a lexer can use simpler algorithms than a parser.

### Lexical Grammar

There are eight categories of tokens in Ciad's lexical syntax:

- Identifiers, like `foo` and `main`. Identifiers are used for the names of variables and functions.

- Keywords, like `if` and `while`. We recognize these separately from identifiers so that users can't do things like create a function named `while`, which may be confusing to other people and hard to parse.

- Integer and string literals, like `34` and `"Hello, world!"`. We need to recognize these in order for them to be present in the syntax!

- Punctuation, like `=` and `++`. These get recognized in a manner similar to that of keywords.

- Comments and whitespace, like `// TODO: speed up this loop` and line feeds. As mentioned above, recognizing them in this separate lexing step helps simplify the syntactic grammar, since they can be ignored here.

- End of file. We have a special token to indicate that the end of the file has been reached. This isn't "really" part of the syntax, but it'll help us when we get around to implementing it.

```ebnf
token = identifier | keyword | integer literal
      | string literal | punctuation | comment
      | whitespace | ? end of file ?;
```

Identifiers are a letter or underscore, followed by zero or more letters, underscores, or digits, except for keywords.

```ebnf
identifier = identifier or keyword - keyword;
identifier or keyword = identifier start character,
                        {identifier body character};
identifier start character = letter | "_";
identifier body character = letter | "_" | digit;

letter = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H"
       | "I" | "J" | "K" | "L" | "M" | "N" | "O" | "P"
       | "Q" | "R" | "S" | "T" | "U" | "V" | "W" | "X"
       | "Y" | "Z"
       | "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h"
       | "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p"
       | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x"
       | "y" | "z";
digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7"
      | "8" | "9"; 
```

The words `else`, `fun`, `if`, `return`, `var`, and `while` are all keywords.

```ebnf
keyword = "else" | "fun" | "if" | "return" | "var"
        | "while";
```

An integer literal is one or more digits optionally preceded by a `-` character.

```ebnf
integer literal = ["-"], digit, {digit};
```

A string literal is zero or more string chunks surrounded by two `"` characters.
A string chunk can either be escaped or unescaped.
An escaped string chunk starts with a `\` character, and is then followed by one of `f`, `n`, `r`, `t`, `\`, `"`, or `x` and two hexadecimal digits.
An unescaped string chunk is one or more [ASCII printable characters] other than `\` and `"`.

```ebnf
string literal = '"', {string chunk}, '"';
string chunk = escaped string chunk
             | unescaped string chunk;

escaped string chunk = "\", escape sequence;
escape sequence = "f" | "n" | "r" | "t" | "\" | '"'
                | "x", hex digit, hex digit;
hex digit = "A" | "B" | "C" | "D" | "E" | "F"
          | "a" | "b" | "c" | "d" | "e" | "f" | digit;

unescaped string chunk = unescaped character,
                         {unescaped character};
unescaped character = printable character - ('"' | "\");

printable character = digit | letter  | " " | "!" | '"'
                    | "#" | "$" | "%" | "&" | "'" | "("
                    | ")" | "*" | "+" | "," | "-" | "."
                    | "/" | ":" | ";" | "<" | "=" | ">"
                    | "?" | "@" | "[" | "\" | "]" | "^"
                    | "_" | "`" | "{" | "|" | "}" | "~";
```

The following sequences of characters are punctuation: `(`, `)`, `[`, `]`, `{`, `}`, `,`, `=`, `==`, `-`, `+`, `++`, `;`.

Multi-character sequences are preferred to two single-character ones; this rule is often called "[maximal munch]" (and applies everywhere else in the lexical grammar, too).

```ebnf
punctuation = "(" | ")" | "[" | "]" | "{" | "}" | ","
            | "=" | "-" | "+" | ";" | "==" | "++";
```

Comments are `//` followed by zero or more ASCII printable characters or horizontal tabs and terminated by a carriage return, line feed, or the end of the file.

```ebnf
comment = "//", comment character, comment end of line;
comment character = printable character
                  | ? horizontal tab character ?;
comment end of line = ? carriage return character ?
                    | ? line feed character ?
                    | ? end of file ?;
```

Whitespace tokens are one or more carriage returns, form feeds, horizontal tabs, line feeds, and space characters.

```ebnf
whitespace = whitespace character,
             {whitespace character};
whitespace character = ? carriage return character ?
                     | ? form feed character ?
                     | ? horizontal tab character ?
                     | ? line feed character ?
                     | " ";
```

### Syntactic Grammar

The syntactic grammar of Ciad is specified using the nonterminals and terminals from the lexical grammar.

Recall that any comment or whitespace tokens that are encountered are ignored, so the syntactic grammar should be read as if there were a `{comment | whitespace}` before any other nonterminal or terminal from the lexical grammar.

A Ciad program is a series of declarations, which may be variable declarations or function declarations.

```ebnf
program = {declaration}, ? end of file ?;
declaration = variable declaration | function declaration;
```

#### Declaration Syntax

A variable declaration is the `var` keyword, followed by an identifier (the variable name) and a semicolon.  

```ebnf
variable declaration = "var", identifier, ";";
```

A function declaration is the `fun` keyword, followed by an identifier (the function name), a comma-separated (and optionally comma-terminated) list of identifiers surrounded by parentheses (the function arguments), and a block (the function body).

A block is zero or more statements surrounded by curly braces.

```ebnf
function declaration = "fun", identifier,
                       "(", [function arguments], ")",
                       block;
function arguments = identifier, {",", identifier}, [","];
block = "{", {statement}, "}";
```

#### Statement Syntax

There are five kinds of statements: assignment statements, expression statements, if statements, return statements, and while statements.
Additionally, variable declarations can be used as statements.

```ebnf
statement = assignment statement | expression statement
          | if statement | return statement
          | while statement | variable declaration;
```

An assignment statement is an expression followed by an equals sign, another expression, and a semicolon.

An expression statement is an expression followed by a semicolon.

A return statement is the `return` keyword, followed by an expression and a semicolon.

```ebnf
assignment statement = expression, "=", expression, ";";
expression statement = expression, ";";
return statement = "return", expression, ";";
```

A while statement is the `while` keyword followed by an expression, then zero or more statements surrounded by curly braces (the body).

```ebnf
while statement = "while", expression, block;
```

An if statement is the `if` keyword followed by an expression and a block, and optionally by the `else` keyword and another block.

```ebnf
if statement = "if", expression, block, ["else", block];
```

#### Expression Syntax

Unambiguously describing expressions is somewhat tricky.

For instance, consider the expression `1 - 2 - 3`.
Is this the same expression as `(1 - 2) - 3`, or as `1 - (2 - 3)`?

The ordinary rules of order of operations tell us that it's the former, but we need to specify how they work for operations that don't exist in ordinary math.
For example, should the expression `x ++ y[3]` be the same as `(x ++ y)[3]` or the same as `x ++ (y[3])`?

TODO: Should I explain precedence and associativity?
Probably, since approximately every programming language spec uses them... or I could leave them until an after-the-code-works chapter.

To help keep our description unambiguous, we describe two subcategories of expressions, atomic expressions and suffixed expressions.

Atomic expressions are those expressions that don't contain operators that aren't enclosed within delimiters.
This includes integer and string literals, variables, parenthesized expressions, and array literals.

Array literals are comma-separated (and optionally comma-terminated) lists of expressions surrounded by curly braces.

```ebnf
atomic expression = integer literal | string literal
                  | identifier
                  | "(", expression, ")"
                  | "{", [expression list], "}";
expression list = expression, {",", expression}, [","];
```

Suffixed expressions are atomic expressions followed by zero or more suffix operators.

These are the call operator (a parenthesized list of comma-separated and optionally comma-terminated expressions) and the index operator (an expression surrounded by square brackets).

```ebnf
suffixed expression = atomic expression,
                      {expression suffix};
expression suffix = call operator | index operator;
call operator = "(", [expression list], ")";
index operator = "[", expression, "]";
```

Finally, an expression is a sequence of one or more suffixed expressions separated by infix operators, which are `+`, `-`, `++`, and `==`.

These operators have equal precedence and associate to the left; that is, `1 - 2 == 3 + 4` is the same as `(((1 - 2) == 3) - 4)`.
While this isn't what most people are used to, it's easier to implement!

```ebnf
expression = suffixed expression
           | expression, infix operator,
             suffixed expression;
```

## Semantics

TODO: lol, which formalism...

---

Unlike many (most?) other compiler tutorials, we *won't* be starting with parsing.
Instead, we'll start with how we represent the program after it's been parsed.

[ASCII printable characters]: https://en.wikipedia.org/wiki/ASCII#Printable_characters
[context-free grammars]: https://en.wikipedia.org/wiki/Context-free_grammar
[EBNF]: https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form
[maximal munch]: https://en.wikipedia.org/wiki/Maximal_munch