imb3: A programming language that combines the best parts of Common Lisp and
Forth to make a very high-level language with easy metaprogramming.
══════════════════════════════════════════════════════════════════════════════
The "guts" of the language resemble Common Lisp (e.g. a CLOS-like object
system, a condition system, a high-performance garbage collector), while the
surface syntax is all Forth. There is also a type checker whose type system
lets dynamically-typed code coexist with fairly complex properties checked at
compile time, if desired. The cherry on top is an efficient native compiler
with the REPL interface users of both Common Lisp and Forth have come to
expect.
──────────────────────────────────────────────────────────────────────────────
This initial implementation is designed to be recompilable for a variety of
platforms, including the Nintendo 3DS.


Garbage Collector
══════════════════════════════════════════════════════════════════════════════
Currently, the garbage collector is generational, with three generations --
the "nursery," the "young-heap," and the "old-heap." Each generation uses a
different collection algorithm.

The nursery is the region almost all new objects are allocated into. There
are two exceptions:

1. Objects that are too large to fit in the nursery.

2. Objects with a finalizer. Note that finalizers are not user-exposed, to
   avoid issues with object resurrection -- they are currently only used to
   allow allocating executable code, using a custom allocator with a
   malloc/free-style interface.

These objects are instead allocated into the young-heap if they fit there,
and into the old-heap if not.

The nursery's size is chosen when the process starts, and is equal to the
size of the largest L1d cache. When the nursery fills up, Cheney's algorithm
is run to copy any live objects into the young-heap. Because Cheney's
algorithm does not traverse dead objects, objects with finalizers must be
allocated in (and, more importantly, freed from) another region.

As the data in the nursery grows upward, the addresses of slots that have
had pointers written into them are pushed down from the top of the region,
so the end of the allocatable region grows downward. (This is the only write
barrier the collector needs. Notably, the only branch it requires is "are we
out of space for the remembered set?", in which case we need to perform a
nursery GC.)

            ┌──────────────┐ 0x…8000
            │remembered set│
            ├──────────────┤ 0x…7000
            │     free     │
            │    space     │
            ├──────────────┤ 0x…5000
            │              │
            │              │
            │   objects    │
            │              │
            │              │
            └──────────────┘ 0x…0000
         The Nursery Region's Layout

When a nursery GC occurs, its remembered set is checked for references into
the nursery. These are used as roots. References to other regions are copied
into the remembered set of the young-heap, and non-references are discarded.
This discarding behavior removes another branch -- whenever a value not
statically known to be an integer is written, we can unconditionally add the
place it was written to into the remembered set.
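
As a concrete illustration, here is a minimal C sketch of this barrier and
of bump allocation out of the same region. All names (nursery_t, nursery_gc,
and so on) are hypothetical, not the actual implementation's:

    #include <stddef.h>
    #include <stdint.h>

    /* Mirrors the diagram above: objects are bump-allocated upward, while
     * remembered-set entries grow downward from the top of the region. */
    typedef struct {
        uintptr_t *alloc;    /* next free word for objects (grows up)       */
        uintptr_t *remset;   /* next free remembered-set entry (grows down) */
    } nursery_t;

    void nursery_gc(nursery_t *n);  /* provided elsewhere by the runtime */

    /* The write barrier: store the value, then unconditionally remember the
     * slot. (The compiler omits this entirely for values statically known
     * to be integers; everything else is filtered at collection time.) The
     * only branch is the space check. */
    static inline void barrier_write(nursery_t *n, uintptr_t *slot,
                                     uintptr_t value)
    {
        *slot = value;
        *--n->remset = (uintptr_t)slot;
        if (n->remset <= n->alloc)   /* remembered set met the allocator? */
            nursery_gc(n);
    }

    /* Allocation pays the same single branch, against the remembered set
     * growing down from the top. */
    static inline void *nursery_alloc(nursery_t *n, size_t words)
    {
        if (n->alloc + words > n->remset)
            nursery_gc(n);           /* empties the nursery; retry elided */
        uintptr_t *obj = n->alloc;
        n->alloc += words;
        return obj;
    }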
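
And a matching sketch of how the remembered set might be filtered when a
nursery GC runs, per the paragraph above. The predicates and hooks here are
assumptions standing in for the real tagging scheme and Cheney machinery:

    #include <stdbool.h>
    #include <stddef.h>
    #include <stdint.h>

    /* Hypothetical runtime hooks; the real tagging scheme may differ. */
    extern bool is_reference(uintptr_t v);          /* false for fixnums   */
    extern bool points_into_nursery(uintptr_t v);
    extern void remember_in_young_heap(uintptr_t *slot);
    extern void cheney_trace_root(uintptr_t *slot); /* copy out of nursery */

    /* Walk the remembered set at the start of a nursery GC. */
    static void scan_remembered_set(uintptr_t **remset, size_t count)
    {
        for (size_t i = 0; i < count; i++) {
            uintptr_t *slot = remset[i];
            uintptr_t  v    = *slot;
            if (!is_reference(v))
                continue;                      /* non-references: discard  */
            if (points_into_nursery(v))
                cheney_trace_root(slot);       /* a root for Cheney's copy */
            else
                remember_in_young_heap(slot);  /* promote to young-heap's
                                                  remembered set           */
        }
    }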

The young-heap is a region that is larger than the nursery, but small enough
that collections should still be fast. It is sized to be a fixed multiple of
the amount of L3 cache in the system (currently, 2). It is collected with a
mark-compact algorithm based on the Lisp 2 algorithm, which makes finalizers
essentially free. To ensure that there is always room to complete a nursery
GC, we reserve a portion of the young-heap equal in size to the nursery, and
collect the young-heap once there is not enough free memory to allocate
without cutting into that reservation.

Objects in the young-heap remain there for a tunable number of GC cycles
(currently defaulting to 3) before being tenured -- that is, moved to the
old-heap.

    ╭──────╥────────────────────────────────╮
    │ TODO ║ Is there a selection algorithm │
    ╞══════╝ we could use to make a dynamic │
    │ choice for how long objects remain in │
    │ the young-heap before being tenured?  │
    ╰───────────────────────────────────────╯

The young-heap has its own remembered set, whose entries are copied in from
the nursery's remembered set. To ensure that these entries are up to date,
a young-heap GC can only occur when the nursery is empty. After a young-heap
GC, we do not need to copy the remembered set into the old-heap, because all
remaining references will stay internal to the old-heap.

Finally, the old-heap stores objects that have outlived their time in the
other regions. It is organized into several blocks, each of which belongs to
a size class. New blocks are allocated and freed as needed, so the old-heap
dynamically grows and shrinks. Old-heap collections occur when the old-heap
is using twice as much memory as it was the last time a collection completed
(capped by a memory limit, by default the total amount of RAM in the
system).

The old-heap is collected using mark-sweep with bitmap marking, including a
per-block bit which allows an entire block to be swept at once if it
contains only dead objects without finalizers. (A separate bit is maintained
at allocation time and during sweeping, to track whether a block contains
objects with finalizers or not.)

Just as the young-heap requires the nursery to be empty before it is
collected, the old-heap requires both the nursery and the young-heap to be
empty before it can be collected. Because collecting the old-heap tends to
be so slow relative to the nursery and young-heap, this does not add
significant overhead.

If 5 consecutive old-heap collections each recover less than 2% of heap
memory while the heap is at its maximum size, the process exits with an
error. This behavior and these constants are taken from OpenJDK, and are an
attempt to avoid GC thrashing.
──────────────────────────────────────────────────────────────────────────────
Thread-based parallelism is not yet implemented. However, parallelizing this
collector design is relatively straightforward. Each thread should be pinned
to a hardware core. Threads can then get a per-thread nursery sized to their
core's L1d cache, rather than a system-wide maximum.

The young-heap and old-heap are each protected with a mutex, which prevents
multiple calls to allocate memory from interfering with one another.
Unfortunately, this also means that nursery collections cannot occur
concurrently. Whether this serialization is a problem in practice while
mutators and nursery collections execute is an open question, but it clearly
becomes an issue during young-heap collections. (A sketch of the guarded
allocation path follows.)
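
Here is one way the reservation rule and the allocation mutex might combine,
as a minimal C sketch. The names, and the from_nursery_gc distinction in
particular, are interpretive assumptions rather than the actual
implementation:

    #include <pthread.h>
    #include <stdbool.h>
    #include <stddef.h>

    typedef struct {
        pthread_mutex_t lock;  /* serializes every allocation into this
                                  region, including nursery-GC promotion */
        char  *next, *limit;   /* compaction leaves one contiguous free
                                  area, so a bump pointer suffices       */
        size_t nursery_size;   /* the reservation: one nursery's worth   */
    } young_heap_t;

    /* Runtime-provided; the real trigger must also drain every nursery
     * first, since a young-heap GC requires the nursery to be empty. */
    extern void young_heap_gc(young_heap_t *yh);

    void *young_heap_alloc(young_heap_t *yh, size_t bytes,
                           bool from_nursery_gc)
    {
        pthread_mutex_lock(&yh->lock);
        /* Ordinary allocations may not cut into the reservation; a nursery
         * GC evacuating survivors may, since the reservation exists
         * precisely so that a nursery GC can always complete. */
        size_t reserve = from_nursery_gc ? 0 : yh->nursery_size;
        if (yh->next + bytes > yh->limit - reserve)
            young_heap_gc(yh);
        void *obj = yh->next;
        yh->next += bytes;
        pthread_mutex_unlock(&yh->lock);
        return obj;
    }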

To mitigate the serialization issue, starting a young-heap collection does
not signal the other threads immediately. Instead, each thread is
responsible for notifying the next thread just before it completes its
nursery collection, unless another thread is already waiting to collect.
This maximizes the amount of time threads spend mutating instead of waiting
to perform a nursery collection.

Once all the nursery collections have completed serially, the marking phase
can be performed in parallel with work-stealing. Of the Lisp 2 algorithm's
passes, only the "update object fields to refer to the new location" pass is
obviously parallelizable. Finalizers are also called in this pass, to allow
them to be executed in parallel as well. (Recall that finalizers are always
provided by the runtime, so they can be trusted to be thread-safe.)

Finally, the old-heap can easily be both marked and swept in parallel. Each
block is conceptually owned by a thread, and that thread is responsible for
sweeping it, as sketched below.
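
A sketch of that per-block sweep, combining block ownership with the
per-block bits from the old-heap section above. The block layout and names
are assumptions:

    #include <stdbool.h>
    #include <stdint.h>

    typedef struct block {
        struct block *next;
        int     owner;           /* thread responsible for sweeping it   */
        bool    any_marked;      /* cleared before marking, set by marks */
        bool    has_finalizers;  /* maintained at allocation/sweep time  */
        uint8_t markbits[512];   /* one bit per allocation granule       */
        /* ... object storage follows ... */
    } block_t;

    extern void free_block(block_t *b);     /* return the block wholesale */
    extern void sweep_objects(block_t *b);  /* walk markbits, free dead
                                               objects, run finalizers    */

    /* Each thread sweeps only the blocks it owns, so no locking needed. */
    void sweep_owned_blocks(block_t *blocks, int self)
    {
        block_t *next;
        for (block_t *b = blocks; b != NULL; b = next) {
            next = b->next;                /* b may be freed below */
            if (b->owner != self)
                continue;                  /* some other thread's job */
            if (!b->any_marked && !b->has_finalizers)
                free_block(b);             /* wholly dead: one-step sweep */
            else
                sweep_objects(b);
        }
    }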
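
Returning to the nursery hand-off described at the start of this part, here
is one plausible shape for it, assuming one "collect your nursery" flag per
pinned thread. Both the flag-per-thread scheme and the reading of "unless
another thread is already waiting" are assumptions:

    #include <stdatomic.h>
    #include <stdbool.h>

    #define NTHREADS 8                     /* illustrative */

    /* Runtime-provided; serialized by the young-heap mutex. */
    extern void collect_nursery(int self);

    /* One "please collect your nursery" flag per pinned thread. */
    static atomic_bool gc_requested[NTHREADS];

    /* The thread that decides a young-heap GC is needed raises only its
     * own flag; the remaining threads are notified chain-style, below. */
    void begin_young_heap_gc(int self)
    {
        atomic_store(&gc_requested[self], true);
    }

    /* Polled by every thread at its GC safepoints. */
    void safepoint(int self)
    {
        if (!atomic_load(&gc_requested[self]))
            return;
        collect_nursery(self);
        /* Just before finishing, notify the next thread -- unless it is
         * already waiting to collect, in which case the chain is already
         * in motion. */
        int next = (self + 1) % NTHREADS;
        if (!atomic_load(&gc_requested[next]))
            atomic_store(&gc_requested[next], true);
        atomic_store(&gc_requested[self], false);
    }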

──────────────────────────────────────────────────────────────────────────────
There are some experiments it'd be nice to run with this collector:

1. Right now, the write barrier design is intended to avoid branches as
   much as possible (without relying on page faults). Instead, extra
   nursery collections will be performed, due to the remembered set being
   larger than it needs to be.

   i.  Are the branches really that expensive? In particular, does it end
       up being a net win to add a branch on:

       a. Whether the written value is a fixnum at runtime or not.

       b. Whether the slot being written to is in a non-nursery object.

       c. Whether the value being written is a pointer to a non-old-heap
          object.

       d. Whether the value being written is a pointer to an object in a
          strictly younger region than the slot it's being written to.

   ii. Is the branch on whether the nursery is full really that cheap?
       Does it end up being a net win to store the nursery separately
       from the remembered set, and use guard pages to fault when each
       one is full?

2. The constants listed above were chosen by a combination of "that sounds
   about right," "Go does that," and "OpenJDK does that." It's plausible
   that instead, good defaults could be learned with a genetic algorithm
   that tunes the constants in a much more general formula, e.g.

       nursery size = k₁ + k₂ × L1d size + k₃ × L2 size + k₄ × L3 size + …

   It would be useful to try learning good defaults on a wide variety of
   machines and workloads, and to see whether natural constants "fall
   out." (Note that this means running a single learning process whose
   loss function depends on the performance of the formula across all the
   machines.)

   This could also be used alongside any "boolean-valued" experiments, to
   learn whether they're useful and how they affect the constants.

   Finally, this could plausibly even be a useful tool to ship to users
   of the compiler, so they can tune GC constants for release builds of
   their own software.

3. The parallel variant pins threads to CPUs because many modern machines
   have heterogeneous core clusters, so different CPUs will have different
   L1d sizes. However, this is only likely to have reasonable performance
   with a work-stealing green threading runtime in userspace, and even in
   that case it may have worse mutator performance than not pinning
   threads.

   Does pinning, with optimally-cache-sized nurseries, really improve
   performance that much versus uniformly sizing the nurseries and not
   pinning?

4. Parallelizing the "update object fields to refer to the new location"
   pass of the young-heap collection requires the "assign forwarding
   pointers to objects" pass to estimate which parts of the young-heap
   are assigned to which thread, since those parts must be aligned to the
   boundaries between objects.

   In the design above, the passes are serialized, such that only after
   all threads have been assigned parts of the region do any of them
   start updating object fields. Instead, we could imagine giving threads
   non-uniformly-sized parts of the region, and having them start as soon
   as they are assigned.

   Threads could also start the "shift all objects down" pass as soon as
   all threads before them have finished the "update object fields" pass.

   i.   Does interleaving the first two passes improve performance?

   ii.  Should threads be given (approximately) equally-sized portions of
        the region, or should they follow some curve, so that threads
        that start earlier are given more work to do?

        This curve's parameters should probably be learned as in
        experiment 2.

   iii. Does interleaving the last two passes improve performance? This
        means shifting objects down while some (later) objects' fields
        are still being updated.

   iv.  Does interleaving all three passes improve performance?


The Meta-Object Protocol (i.e., the scary parts of the object system)
══════════════════════════════════════════════════════════════════════════════
The object system supports a meta-object protocol inspired by the original
MOP as implemented in most Common Lisp implementations, but with a few parts
cleaned up and simplified.

The pre-defined classes are either BUILTIN-CLASSes or STANDARD-CLASSes. The
main difference between the two is that only STANDARD-CLASSes have a slots
array. STANDARD-CLASS may be subclassed, allowing user-defined metaclasses
to be defined. BUILTIN-CLASS may not be subclassed.

The class hierarchy for pre-defined classes is as follows:

    t
    ├── builtin-object
    │   ├── integer
    │   │   ├── bignum
    │   │   └── fixnum
    │   ├── symbol
    │   ├── sequence
    │   │   ├── list
    │   │   │   ├── null
    │   │   │   └── cons
    │   │   └── vector
    │   │       ├── string
    │   │       └── …
    │   ├── character
    │   ├── hashtable
    │   │   ├── hashtable-eq
    │   │   └── hashtable-equal
    │   ├── method
    │   ├── package
    │   └── function
    │       ├── builtin-function
    │       ├── compiled-function
    │       ├── generic-function
    │       └── threaded-function
    └── standard-object
        ├── class
        │   ├── builtin-class
        │   └── standard-class
        ├── direct-slot-definition
        └── effective-slot-definition
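
To make the BUILTIN-CLASS / STANDARD-CLASS split concrete, here is a hedged
C sketch of one plausible object layout. These structs are illustrative
assumptions, not the actual representation:

    #include <stddef.h>
    #include <stdint.h>

    /* Every object begins with a pointer to its class. */
    typedef struct class_s class_t;

    typedef struct {
        class_t *klass;
    } header_t;

    /* Instances of a BUILTIN-CLASS carry only built-in fields; a cons,
     * for example, has no slots array: */
    typedef struct {
        header_t  header;        /* klass == the cons builtin-class      */
        uintptr_t car, cdr;
    } cons_t;

    /* Instances of a STANDARD-CLASS additionally carry a slots array,
     * laid out per the class's effective slot definitions. */
    typedef struct {
        header_t  header;
        uintptr_t slots[];       /* length comes from the class          */
    } standard_object_t;

    /* Classes are themselves standard objects (see the tree above);
     * subclassing STANDARD-CLASS -- but never BUILTIN-CLASS -- is what
     * makes user-defined metaclasses possible. */
    struct class_s {
        header_t     header;     /* klass == builtin-class,
                                    standard-class, or a user metaclass  */
        const char  *name;
        class_t    **supers;     /* direct superclasses                  */
        size_t       nslots;     /* 0 for builtin classes                */
    };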