summaryrefslogtreecommitdiff
path: root/Compiler in a Day/01-ciadlang.page
diff options
context:
space:
mode:
authorNathan Ringo <nathan@remexre.com>2023-10-05 02:27:31 -0500
committerNathan Ringo <nathan@remexre.com>2023-10-05 02:27:31 -0500
commitc33dc81d26eb5b77faf0108502ec7588a13ebad6 (patch)
tree097f63838ed76849fdae525018f554fe4777a4b5 /Compiler in a Day/01-ciadlang.page
parente49b67af7125cc7c5a5ec6f321057bee1036ba05 (diff)
Adds Compiler in a Day series. This is a direct import and will need cleanup.
Diffstat (limited to 'Compiler in a Day/01-ciadlang.page')
-rw-r--r--Compiler in a Day/01-ciadlang.page289
1 files changed, 289 insertions, 0 deletions
diff --git a/Compiler in a Day/01-ciadlang.page b/Compiler in a Day/01-ciadlang.page
new file mode 100644
index 0000000..7abba8b
--- /dev/null
+++ b/Compiler in a Day/01-ciadlang.page
@@ -0,0 +1,289 @@
++++
+title = "The Ciad language — Compiler In A Day"
+prev = "00-overview.md"
+next = "02-ast.md"
++++
+
+# The Ciad language
+
+Before we start looking at the compiler, we should see and understand the language we'll be compiling.
+This language is named Ciad (pronounced like "key-add"), which is an acronym for the name of this series.
+
+This article will be a bit dry; it's mostly describing what is, rather than *why* it is.
+Some design decisions will be expounded on later, when their benefits are reaped.
+
+Other design decisions take more time to explain; some of these will get their own articles.
+
+One choice about this article that's worth discussing up front — the description of the language is done both in natural language, and in an appropriate formalism.
+
+The main benefit of the latter is precision — with a good formal description, we can avoid needing to argue about what the compiler "should" do, or a program "should" mean; instead, we can consult the formalism.
+
+A secondary benefit (that we won't exploit in this series) is the wide variety of tools and techniques that can give us "nice things" from formal descriptions of a programming language.
+
+In theory, a full formal description of the syntax and semantics of a programming language can be used to "mechanically" generate a complete working compiler for the language.
+
+In practice, almost every language has at least one "weird thing" that breaks the ability to do this.
+Still, sticking to having a good formal description makes it easier to take advantage of programming language theory, even if we do have to do some "fixing up" to handle weird things in our language.
+
+## Syntax
+
+The formalism we'll be using to describe Ciad's syntax is [EBNF], a notation used for describing [context-free grammars].
+
+TODO: Should I say more about CFGs?
+
+TODO: I should definitely mention tokens.
+
+The syntax of a programming language is typically described in two parts, with the lexical syntax described separately from the rest.
+
+This is because it's typically simpler to discuss things like comments and whitespace once, and then state a rule like "comment and whitespace tokens are ignored during parsing," rather than noting that a comment or whitespace token could appear at any point in the actual grammar.
+
+This also fits with the separation between lexing and parsing; typically, a lexer can use simpler algorithms than a parser.
+
+### Lexical Grammar
+
+There are eight categories of tokens in Ciad's lexical syntax:
+
+- Identifiers, like `foo` and `main`. Identifiers are used for the names of variables and functions.
+
+- Keywords, like `if` and `while`. We recognize these separately from identifiers so that users can't do things like create a function named `while`, which may be confusing to other people and hard to parse.
+
+- Integer and string literals, like `34` and `"Hello, world!"`. We need to recognize these in order for them to be present in the syntax!
+
+- Punctuation, like `=` and `++`. These get recognized in a manner similar to that of keywords.
+
+- Comments and whitespace, like `// TODO: speed up this loop` and line feeds. As mentioned above, recognizing them in this separate lexing step helps simplify the syntactic grammar, since they can be ignored here.
+
+- End of file. We have a special token to indicate that the end of the file has been reached. This isn't "really" part of the syntax, but it'll help us when we get around to implementing it.
+
+```ebnf
+token = identifier | keyword | integer literal
+ | string literal | punctuation | comment
+ | whitespace | ? end of file ?;
+```
+
+Identifiers are a letter or underscore, followed by zero or more letters, underscores, or digits, except for keywords.
+
+```ebnf
+identifier = identifier or keyword - keyword;
+identifier or keyword = identifier start character,
+ {identifier body character};
+identifier start character = letter | "_";
+identifier body character = letter | "_" | digit;
+
+letter = "A" | "B" | "C" | "D" | "E" | "F" | "G" | "H"
+ | "I" | "J" | "K" | "L" | "M" | "N" | "O" | "P"
+ | "Q" | "R" | "S" | "T" | "U" | "V" | "W" | "X"
+ | "Y" | "Z"
+ | "a" | "b" | "c" | "d" | "e" | "f" | "g" | "h"
+ | "i" | "j" | "k" | "l" | "m" | "n" | "o" | "p"
+ | "q" | "r" | "s" | "t" | "u" | "v" | "w" | "x"
+ | "y" | "z";
+digit = "0" | "1" | "2" | "3" | "4" | "5" | "6" | "7"
+ | "8" | "9";
+```
+
+The words `else`, `fun`, `if`, `return`, `var`, and `while` are all keywords.
+
+```ebnf
+keyword = "else" | "fun" | "if" | "return" | "var"
+ | "while";
+```
+
+An integer literal is one or more digits optionally preceded by a `-` character.
+
+```ebnf
+integer literal = ["-"], digit, {digit};
+```
+
+A string literal is zero or more string chunks surrounded by two `"` characters.
+A string chunk can either be escaped or unescaped.
+An escaped string chunk starts with a `\` character, and is then followed by one of `f`, `n`, `r`, `t`, `\`, `"`, or `x` and two hexadecimal digits.
+An unescaped string chunk is one or more [ASCII printable characters] other than `\` and `"`.
+
+```ebnf
+string literal = '"', {string chunk}, '"';
+string chunk = escaped string chunk
+ | unescaped string chunk;
+
+escaped string chunk = "\", escape sequence;
+escape sequence = "f" | "n" | "r" | "t" | "\" | '"'
+ | "x", hex digit, hex digit;
+hex digit = "A" | "B" | "C" | "D" | "E" | "F"
+ | "a" | "b" | "c" | "d" | "e" | "f" | digit;
+
+unescaped string chunk = unescaped character,
+ {unescaped character};
+unescaped character = printable character - ('"' | "\");
+
+printable character = digit | letter | " " | "!" | '"'
+ | "#" | "$" | "%" | "&" | "'" | "("
+ | ")" | "*" | "+" | "," | "-" | "."
+ | "/" | ":" | ";" | "<" | "=" | ">"
+ | "?" | "@" | "[" | "\" | "]" | "^"
+ | "_" | "`" | "{" | "|" | "}" | "~";
+```
+
+The following sequences of characters are punctuation: `(`, `)`, `[`, `]`, `{`, `}`, `,`, `=`, `==`, `-`, `+`, `++`, `;`.
+
+Multi-character sequences are preferred to two single-character ones; this rule is often called "[maximal munch]" (and applies everywhere else in the lexical grammar, too).
+
+```ebnf
+punctuation = "(" | ")" | "[" | "]" | "{" | "}" | ","
+ | "=" | "-" | "+" | ";" | "==" | "++";
+```
+
+Comments are `//` followed by zero or more ASCII printable characters or horizontal tabs and terminated by a carriage return, line feed, or the end of the file.
+
+```ebnf
+comment = "//", comment character, comment end of line;
+comment character = printable character
+ | ? horizontal tab character ?;
+comment end of line = ? carriage return character ?
+ | ? line feed character ?
+ | ? end of file ?;
+```
+
+Whitespace tokens are one or more carriage returns, form feeds, horizontal tabs, line feeds, and space characters.
+
+```ebnf
+whitespace = whitespace character,
+ {whitespace character};
+whitespace character = ? carriage return character ?
+ | ? form feed character ?
+ | ? horizontal tab character ?
+ | ? line feed character ?
+ | " ";
+```
+
+### Syntactic Grammar
+
+The syntactic grammar of Ciad is specified using the nonterminals and terminals from the lexical grammar.
+
+Recall that any comment or whitespace tokens that are encountered are ignored, so the syntactic grammar should be read as if there were a `{comment | whitespace}` before any other nonterminal or terminal from the lexical grammar.
+
+A Ciad program is a series of declarations, which may be variable declarations or function declarations.
+
+```ebnf
+program = {declaration}, ? end of file ?;
+declaration = variable declaration | function declaration;
+```
+
+#### Declaration Syntax
+
+A variable declaration is the `var` keyword, followed by an identifier (the variable name) and a semicolon.
+
+```ebnf
+variable declaration = "var", identifier, ";";
+```
+
+A function declaration is the `fun` keyword, followed by an identifier (the function name), a comma-separated (and optionally comma-terminated) list of identifiers surrounded by parentheses (the function arguments), and a block (the function body).
+
+A block is zero or more statements surrounded by curly braces.
+
+```ebnf
+function declaration = "fun", identifier,
+ "(", [function arguments], ")",
+ block;
+function arguments = identifier, {",", identifier}, [","];
+block = "{", {statement}, "}";
+```
+
+#### Statement Syntax
+
+There are five kinds of statements: assignment statements, expression statements, if statements, return statements, and while statements.
+Additionally, variable declarations can be used as statements.
+
+```ebnf
+statement = assignment statement | expression statement
+ | if statement | return statement
+ | while statement | variable declaration;
+```
+
+An assignment statement is an expression followed by an equals sign, another expression, and a semicolon.
+
+An expression statement is an expression followed by a semicolon.
+
+A return statement is the `return` keyword, followed by an expression and a semicolon.
+
+```ebnf
+assignment statement = expression, "=", expression, ";";
+expression statement = expression, ";";
+return statement = "return", expression, ";";
+```
+
+A while statement is the `while` keyword followed by an expression, then zero or more statements surrounded by curly braces (the body).
+
+```ebnf
+while statement = "while", expression, block;
+```
+
+An if statement is the `if` keyword followed by an expression and a block, and optionally by the `else` keyword and another block.
+
+```ebnf
+if statement = "if", expression, block, ["else", block];
+```
+
+#### Expression Syntax
+
+Unambiguously describing expressions is somewhat tricky.
+
+For instance, consider the expression `1 - 2 - 3`.
+Is this the same expression as `(1 - 2) - 3`, or as `1 - (2 - 3)`?
+
+The ordinary rules of order of operations tell us that it's the former, but we need to specify how they work for operations that don't exist in ordinary math.
+For example, should the expression `x ++ y[3]` be the same as `(x ++ y)[3]` or the same as `x ++ (y[3])`?
+
+TODO: Should I explain precedence and associativity?
+Probably, since approximately every programming language spec uses them... or I could leave them until an after-the-code-works chapter.
+
+To help keep our description unambiguous, we describe two subcategories of expressions, atomic expressions and suffixed expressions.
+
+Atomic expressions are those expressions that don't contain operators that aren't enclosed within delimiters.
+This includes integer and string literals, variables, parenthesized expressions, and array literals.
+
+Array literals are comma-separated (and optionally comma-terminated) lists of expressions surrounded by curly braces.
+
+```ebnf
+atomic expression = integer literal | string literal
+ | identifier
+ | "(", expression, ")"
+ | "{", [expression list], "}";
+expression list = expression, {",", expression}, [","];
+```
+
+Suffixed expressions are atomic expressions followed by zero or more suffix operators.
+
+These are the call operator (a parenthesized list of comma-separated and optionally comma-terminated expressions) and the index operator (an expression surrounded by square brackets).
+
+```ebnf
+suffixed expression = atomic expression,
+ {expression suffix};
+expression suffix = call operator | index operator;
+call operator = "(", [expression list], ")";
+index operator = "[", expression, "]";
+```
+
+Finally, an expression is a sequence of one or more suffixed expressions separated by infix operators, which are `+`, `-`, `++`, and `==`.
+
+These operators have equal precedence and associate to the left; that is, `1 - 2 == 3 + 4` is the same as `(((1 - 2) == 3) - 4)`.
+While this isn't what most people are used to, it's easier to implement!
+
+```ebnf
+expression = suffixed expression
+ | expression, infix operator,
+ suffixed expression;
+```
+
+## Semantics
+
+TODO: lol, which formalism...
+
+---
+
+Unlike many (most?) other compiler tutorials, we *won't* be starting with parsing.
+Instead, we'll start with how we represent the program after it's been parsed.
+
+[ASCII printable characters]: https://en.wikipedia.org/wiki/ASCII#Printable_characters
+[context-free grammars]: https://en.wikipedia.org/wiki/Context-free_grammar
+[EBNF]: https://en.wikipedia.org/wiki/Extended_Backus%E2%80%93Naur_form
+[maximal munch]: https://en.wikipedia.org/wiki/Maximal_munch