imb3: A programming language that combines the best parts of Common Lisp and
Forth to make a very high-level language with easy metaprogramming.
══════════════════════════════════════════════════════════════════════════════
The "guts" of the language resemble Common Lisp (e.g. a CLOS-like object
system, a condition system, a high-performance garbage collector), while the
surface syntax is all Forth. There is also a type checker whose type system
lets dynamically-typed code coexist with fairly complex properties checked at
compile time, if desired. The cherry on top is an efficient native compiler
with the REPL interface users of both Common Lisp and Forth have come to
expect.
──────────────────────────────────────────────────────────────────────────────
This initial implementation is designed to be recompilable for a variety of
platforms, including the Nintendo 3DS.


Garbage Collector
══════════════════════════════════════════════════════════════════════════════
Currently, the garbage collector is generational, with three generations --
the "nursery," the "young-heap," and the "old-heap." Each generation uses a
different collection algorithm.

The nursery is the region almost all new objects are allocated into. There
are two exceptions:

1. Objects that are too large to fit in the nursery.

2. Objects with a finalizer. Note that finalizers are not user-exposed, to
   avoid issues with object resurrection -- they are currently only used to
   allow allocating executable code, using a custom allocator with a
   malloc/free-style interface.

These objects are instead allocated into the young-heap if they fit there,
and into the old-heap if not.

The nursery's size is chosen when the process starts, and is equal to the
size of the largest L1d cache. When the nursery fills up, Cheney's algorithm
is run to copy any live objects into the young-heap. Because Cheney's
algorithm does not traverse dead objects, objects with finalizers must be
allocated in (and, more importantly, freed from) another region.

As the data in the nursery grows upward, the addresses of slots that have
had pointers written into them are pushed down from the top of the region,
so the end of the allocatable region grows downward. (This is the only write
barrier the collector needs. Notably, the only branch it requires is "are we
out of space for the remembered set?", in which case we need to perform a
nursery GC.)

            ┌──────────────┐ 0x…8000
            │remembered set│
            ├──────────────┤ 0x…7000
            │     free     │
            │    space     │
            ├──────────────┤ 0x…5000
            │              │
            │              │
            │   objects    │
            │              │
            │              │
            └──────────────┘ 0x…0000
         The Nursery Region's Layout

When a nursery GC occurs, its remembered set is checked for references into
the nursery. These are used as roots. References to other regions are copied
into the remembered set of the young-heap, and non-references are discarded.
This discarding behavior removes another branch -- whenever a value not
statically known to be an integer is written, we can unconditionally add the
place it was written to into the remembered set.
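
As a concrete illustration, here is a minimal C sketch of this barrier and
of bump allocation out of the same region. All names (nursery_t, nursery_gc,
and so on) are hypothetical, not the actual implementation's:

    #include <stddef.h>
    #include <stdint.h>

    /* Mirrors the diagram above: objects are bump-allocated upward, while
     * remembered-set entries grow downward from the top of the region. */
    typedef struct {
        uintptr_t *alloc;    /* next free word for objects (grows up)       */
        uintptr_t *remset;   /* next free remembered-set entry (grows down) */
    } nursery_t;

    void nursery_gc(nursery_t *n);  /* provided elsewhere by the runtime */

    /* The write barrier: store the value, then unconditionally remember the
     * slot. (The compiler omits this entirely for values statically known
     * to be integers; everything else is filtered at collection time.) The
     * only branch is the space check. */
    static inline void barrier_write(nursery_t *n, uintptr_t *slot,
                                     uintptr_t value)
    {
        *slot = value;
        *--n->remset = (uintptr_t)slot;
        if (n->remset <= n->alloc)   /* remembered set met the allocator? */
            nursery_gc(n);
    }

    /* Allocation pays the same single branch, against the remembered set
     * growing down from the top. */
    static inline void *nursery_alloc(nursery_t *n, size_t words)
    {
        if (n->alloc + words > n->remset)
            nursery_gc(n);           /* empties the nursery; retry elided */
        uintptr_t *obj = n->alloc;
        n->alloc += words;
        return obj;
    }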
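
And a matching sketch of how the remembered set might be filtered when a
nursery GC runs, per the paragraph above. The predicates and hooks here are
assumptions standing in for the real tagging scheme and Cheney machinery:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical runtime hooks; the real tagging scheme may differ. */
    extern bool is_reference(uintptr_t v);          /* false for fixnums   */
    extern bool points_into_nursery(uintptr_t v);
    extern void remember_in_young_heap(uintptr_t *slot);
    extern void cheney_trace_root(uintptr_t *slot); /* copy out of nursery */

    /* Walk the remembered set at the start of a nursery GC. */
    static void scan_remembered_set(uintptr_t **remset, size_t count)
    {
        for (size_t i = 0; i < count; i++) {
            uintptr_t *slot = remset[i];
            uintptr_t  v    = *slot;
            if (!is_reference(v))
                continue;                      /* non-references: discard  */
            if (points_into_nursery(v))
                cheney_trace_root(slot);       /* a root for Cheney's copy */
            else
                remember_in_young_heap(slot);  /* promote to young-heap's
                                                  remembered set           */
        }
    }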

The young-heap is a region that is larger than the nursery, but small enough
that collections should still be fast. It is sized to be a fixed multiple of
the amount of L3 cache in the system (currently, 2). It is collected with a
mark-compact algorithm based on the Lisp 2 algorithm, which makes finalizers
essentially free. To ensure that there is always room to complete a nursery
GC, we reserve a portion of the young-heap equal in size to the nursery, and
collect the young-heap once there is not enough free memory to allocate
without cutting into that reservation.

Objects in the young-heap remain there for a tunable number of GC cycles
(currently defaulting to 3) before being tenured -- that is, moved to the
old-heap.

    ╭──────╥────────────────────────────────╮
    │ TODO ║ Is there a selection algorithm │
    ╞══════╝ we could use to make a dynamic │
    │ choice for how long objects remain in │
    │ the young-heap before being tenured?  │
    ╰───────────────────────────────────────╯

The young-heap has its own remembered set, whose entries are copied in from
the nursery's remembered set. To ensure that these entries are up to date,
a young-heap GC can only occur when the nursery is empty. After a young-heap
GC, we do not need to copy the remembered set into the old-heap, because all
remaining references will stay internal to the old-heap.

Finally, the old-heap stores objects that have outlived their time in the
other regions. It is organized into several blocks, each of which belongs to
a size class. New blocks are allocated and freed as needed, so the old-heap
dynamically grows and shrinks. Old-heap collections occur when the old-heap
is using twice as much memory as it was the last time a collection completed
(capped by a memory limit, by default the total amount of RAM in the
system).

The old-heap is collected using mark-sweep with bitmap marking, including a
per-block bit which allows an entire block to be swept at once if it
contains only dead objects without finalizers. (A separate bit is maintained
at allocation time and during sweeping, to track whether a block contains
objects with finalizers or not.)

Just as the young-heap requires the nursery to be empty before it is
collected, the old-heap requires both the nursery and the young-heap to be
empty before it can be collected. Because collecting the old-heap tends to
be so slow relative to the nursery and young-heap, this does not add
significant overhead.

If 5 consecutive old-heap collections each recover less than 2% of heap
memory while the heap is at its maximum size, the process exits with an
error. This behavior and these constants are taken from OpenJDK, and are an
attempt to avoid GC thrashing.
──────────────────────────────────────────────────────────────────────────────
Thread-based parallelism is not yet implemented. However, parallelizing this
collector design is relatively straightforward. Each thread should be pinned
to a hardware core. Threads can then get a per-thread nursery sized to their
core's L1d cache, rather than a system-wide maximum.

The young-heap and old-heap are each protected with a mutex, which prevents
multiple calls to allocate memory from interfering with one another.
Unfortunately, this also means that nursery collections cannot occur
concurrently. Whether this serialization is a problem in practice while
mutators and nursery collections execute is an open question, but it clearly
becomes an issue during young-heap collections. (A sketch of the guarded
allocation path follows.)
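
Here is one way the reservation rule and the allocation mutex might combine,
as a minimal C sketch. The names, and the from_nursery_gc distinction in
particular, are interpretive assumptions rather than the actual
implementation:

    #include <pthread.h>
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct {
        pthread_mutex_t lock;  /* serializes every allocation into this
                                  region, including nursery-GC promotion */
        char  *next, *limit;   /* compaction leaves one contiguous free
                                  area, so a bump pointer suffices       */
        size_t nursery_size;   /* the reservation: one nursery's worth   */
    } young_heap_t;

    /* Runtime-provided; the real trigger must also drain every nursery
     * first, since a young-heap GC requires the nursery to be empty. */
    extern void young_heap_gc(young_heap_t *yh);

    void *young_heap_alloc(young_heap_t *yh, size_t bytes,
                           bool from_nursery_gc)
    {
        pthread_mutex_lock(&yh->lock);
        /* Ordinary allocations may not cut into the reservation; a nursery
         * GC evacuating survivors may, since the reservation exists
         * precisely so that a nursery GC can always complete. */
        size_t reserve = from_nursery_gc ? 0 : yh->nursery_size;
        if (yh->next + bytes > yh->limit - reserve)
            young_heap_gc(yh);
        void *obj = yh->next;
        yh->next += bytes;
        pthread_mutex_unlock(&yh->lock);
        return obj;
    }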

To mitigate the serialization issue, starting a young-heap collection does
not signal the other threads immediately. Instead, each thread is
responsible for notifying the next thread just before it completes its
nursery collection, unless another thread is already waiting to collect.
This maximizes the amount of time threads spend mutating instead of waiting
to perform a nursery collection.

Once all the nursery collections have completed serially, the marking phase
can be performed in parallel with work-stealing. Of the Lisp 2 algorithm's
passes, only the "update object fields to refer to the new location" pass is
obviously parallelizable. Finalizers are also called in this pass, to allow
them to be executed in parallel as well. (Recall that finalizers are always
provided by the runtime, so they can be trusted to be thread-safe.)

Finally, the old-heap can easily be both marked and swept in parallel. Each
block is conceptually owned by a thread, and that thread is responsible for
sweeping it, as sketched below.
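
A sketch of that per-block sweep, combining block ownership with the
per-block bits from the old-heap section above. The block layout and names
are assumptions:

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct block {
        struct block *next;
        int     owner;           /* thread responsible for sweeping it   */
        bool    any_marked;      /* cleared before marking, set by marks */
        bool    has_finalizers;  /* maintained at allocation/sweep time  */
        uint8_t markbits[512];   /* one bit per allocation granule       */
        /* ... object storage follows ... */
    } block_t;

    extern void free_block(block_t *b);     /* return the block wholesale */
    extern void sweep_objects(block_t *b);  /* walk markbits, free dead
                                               objects, run finalizers    */

    /* Each thread sweeps only the blocks it owns, so no locking needed. */
    void sweep_owned_blocks(block_t *blocks, int self)
    {
        block_t *next;
        for (block_t *b = blocks; b != NULL; b = next) {
            next = b->next;                /* b may be freed below */
            if (b->owner != self)
                continue;                  /* some other thread's job */
            if (!b->any_marked && !b->has_finalizers)
                free_block(b);             /* wholly dead: one-step sweep */
            else
                sweep_objects(b);
        }
    }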
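
Returning to the nursery hand-off described at the start of this part, here
is one plausible shape for it, assuming one "collect your nursery" flag per
pinned thread. Both the flag-per-thread scheme and the reading of "unless
another thread is already waiting" are assumptions:

    #include <stdatomic.h>
    #include <stdbool.h>

    #define NTHREADS 8                     /* illustrative */

    /* Runtime-provided; serialized by the young-heap mutex. */
    extern void collect_nursery(int self);

    /* One "please collect your nursery" flag per pinned thread. */
    static atomic_bool gc_requested[NTHREADS];

    /* The thread that decides a young-heap GC is needed raises only its
     * own flag; the remaining threads are notified chain-style, below. */
    void begin_young_heap_gc(int self)
    {
        atomic_store(&gc_requested[self], true);
    }

    /* Polled by every thread at its GC safepoints. */
    void safepoint(int self)
    {
        if (!atomic_load(&gc_requested[self]))
            return;
        collect_nursery(self);
        /* Just before finishing, notify the next thread -- unless it is
         * already waiting to collect, in which case the chain is already
         * in motion. */
        int next = (self + 1) % NTHREADS;
        if (!atomic_load(&gc_requested[next]))
            atomic_store(&gc_requested[next], true);
        atomic_store(&gc_requested[self], false);
    }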

──────────────────────────────────────────────────────────────────────────────
There are some experiments it'd be nice to run with this collector:

1. Right now, the write barrier design is intended to avoid branches as
   much as possible (without relying on page faults). Instead, extra
   nursery collections will be performed, due to the remembered set being
   larger than it needs to be.

   i.  Are the branches really that expensive? In particular, does it end
       up being a net win to add a branch on:

       a. Whether the written value is a fixnum at runtime or not.

       b. Whether the slot being written to is in a non-nursery object.

       c. Whether the value being written is a pointer to a non-old-heap
          object.

       d. Whether the value being written is a pointer to an object in a
          strictly younger region than the slot it's being written to.

   ii. Is the branch on whether the nursery is full really that cheap?
       Does it end up being a net win to store the nursery separately
       from the remembered set, and use guard pages to fault when each
       one is full?

2. The constants listed above were chosen by a combination of "that sounds
   about right," "Go does that," and "OpenJDK does that." It's plausible
   that instead, good defaults could be learned with a genetic algorithm
   that tunes the constants in a much more general formula, e.g.

       nursery size = k₁ + k₂ × L1d size + k₃ × L2 size + k₄ × L3 size + …

   It would be useful to try learning good defaults on a wide variety of
   machines and workloads, and to see whether natural constants "fall
   out." (Note that this means running a single learning process whose
   loss function depends on the performance of the formula across all the
   machines.)

   This could also be used alongside any "boolean-valued" experiments, to
   learn whether they're useful and how they affect the constants.

   Finally, this could plausibly even be a useful tool to ship to users
   of the compiler, so they can tune GC constants for release builds of
   their own software.

3. The parallel variant pins threads to CPUs because many modern machines
   have heterogeneous core clusters, so different CPUs will have different
   L1d sizes. However, this is only likely to have reasonable performance
   with a work-stealing green threading runtime in userspace, and even in
   that case it may have worse mutator performance than not pinning
   threads.

   Does pinning, with optimally-cache-sized nurseries, really improve
   performance that much versus uniformly sizing the nurseries and not
   pinning?

4. Parallelizing the "update object fields to refer to the new location"
   pass of the young-heap collection requires the "assign forwarding
   pointers to objects" pass to estimate which parts of the young-heap
   are assigned to which thread, since those parts must be aligned to the
   boundaries between objects.

   In the design above, the passes are serialized, such that only after
   all threads have been assigned parts of the region do any of them
   start updating object fields. Instead, we could imagine giving threads
   non-uniformly-sized parts of the region, and having them start as soon
   as they are assigned.

   Threads could also start the "shift all objects down" pass as soon as
   all threads before them have finished the "update object fields" pass.

   i.   Does interleaving the first two passes improve performance?

   ii.  Should threads be given (approximately) equally-sized portions of
        the region, or should they follow some curve, so that threads
        that start earlier are given more work to do?

        This curve's parameters should probably be learned as in
        experiment 2.

   iii. Does interleaving the last two passes improve performance? This
        means shifting objects down while some (later) objects' fields
        are still being updated.

   iv.  Does interleaving all three passes improve performance?


The Meta-Object Protocol (i.e., the scary parts of the object system)
══════════════════════════════════════════════════════════════════════════════
The object system supports a meta-object protocol inspired by the original
MOP as implemented in most Common Lisp implementations, but with a few parts
cleaned up and simplified.

The pre-defined classes are either BUILTIN-CLASSes or STANDARD-CLASSes. The
main difference between the two is that only STANDARD-CLASSes have a slots
array. STANDARD-CLASS may be subclassed, allowing user-defined metaclasses
to be defined. BUILTIN-CLASS may not be subclassed.

The class hierarchy for pre-defined classes is as follows:

    t
    ├── builtin-object
    │   ├── integer
    │   │   ├── bignum
    │   │   └── fixnum
    │   ├── symbol
    │   ├── sequence
    │   │   ├── list
    │   │   │   ├── null
    │   │   │   └── cons
    │   │   └── vector
    │   │       ├── string
    │   │       └── …
    │   ├── character
    │   ├── hashtable
    │   │   ├── hashtable-eq
    │   │   └── hashtable-equal
    │   ├── method
    │   ├── package
    │   └── function
    │       ├── builtin-function
    │       ├── compiled-function
    │       ├── generic-function
    │       └── threaded-function
    └── standard-object
        ├── class
        │   ├── builtin-class
        │   └── standard-class
        ├── direct-slot-definition
        └── effective-slot-definition
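
To make the BUILTIN-CLASS / STANDARD-CLASS split concrete, here is a hedged
C sketch of one plausible object layout. These structs are illustrative
assumptions, not the actual representation:

    #include <stddef.h>
    #include <stdint.h>

    /* Every object begins with a pointer to its class. */
    typedef struct class_s class_t;

    typedef struct {
        class_t *klass;
    } header_t;

    /* Instances of a BUILTIN-CLASS carry only built-in fields; a cons,
     * for example, has no slots array: */
    typedef struct {
        header_t  header;        /* klass == the cons builtin-class      */
        uintptr_t car, cdr;
    } cons_t;

    /* Instances of a STANDARD-CLASS additionally carry a slots array,
     * laid out per the class's effective slot definitions. */
    typedef struct {
        header_t  header;
        uintptr_t slots[];       /* length comes from the class          */
    } standard_object_t;

    /* Classes are themselves standard objects (see the tree above);
     * subclassing STANDARD-CLASS -- but never BUILTIN-CLASS -- is what
     * makes user-defined metaclasses possible. */
    struct class_s {
        header_t     header;     /* klass == builtin-class,
                                    standard-class, or a user metaclass  */
        const char  *name;
        class_t    **supers;     /* direct superclasses                  */
        size_t       nslots;     /* 0 for builtin classes                */
    };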