diff --git a/README.txt b/README.txt
new file mode 100644
index 0000000..0221931
--- /dev/null
+++ b/README.txt
@@ -0,0 +1,218 @@
+╭──────────────────────────────────────────────────────────────────────────────╮
+│ imb3: A programming language that combines the best parts of Common Lisp and │
+│ Forth to make a very high-level language with easy metaprogramming. │
+╞══════════════════════════════════════════════════════════════════════════════╡
+│ The "guts" of the language resemble Common Lisp (e.g. a CLOS-like object │
+│ system, a condition system, a high-performance garbage collector), while the │
+│ surface syntax is all Forth. There is also a type checker whose type        │
+│ system lets dynamically-typed code coexist with code that has fairly        │
+│ complex properties checked at compile time, if desired. The cherry on       │
+│ top is an efficient native compiler with the REPL interface users of both │
+│ Common Lisp and Forth have come to expect. │
+├──────────────────────────────────────────────────────────────────────────────┤
+│ This initial implementation is designed to be recompilable for a variety of │
+│ platforms, including the Nintendo 3DS. │
+╰──────────────────────────────────────────────────────────────────────────────╯
+
+╭──────────────────────────────────────────────────────────────────────────────╮
+│ Garbage Collector │
+╞══════════════════════════════════════════════════════════════════════════════╡
+│ Currently, the collector is generational, with three generations -- the     │
+│ “nursery,” the “young-heap,” and the “old-heap.” Each one uses a            │
+│ different garbage collection algorithm.                                     │
+│ │
+│ The nursery is the region almost all new objects are allocated into. There │
+│ are two exceptions: │
+│ │
+│ 1. Objects that are too large to fit in the nursery. │
+│ 2. Objects with a finalizer. Note that finalizers are not user-exposed, to │
+│ avoid issues with object resurrection -- they are currently only used │
+│ to allow allocating executable code, using a custom allocator with a │
+│ malloc/free -style interface. │
+│ │
+│ These objects are instead allocated into the young-heap if they can fit     │
+│ there, and the old-heap if not. │
+│ │
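+│ As a rough sketch of that policy -- in C, with hypothetical type and        │
+│ helper names, since the README does not pin down the allocator's real       │
+│ interface:                                                                  │
+│                                                                              │
+│     #include <stddef.h>   /* size_t */                                      │
+│                                                                              │
+│     void *gc_alloc(size_t size, finalizer_t *fin) {                         │
+│         /* Finalizable objects skip the nursery entirely,                   │
+│          * since Cheney's algorithm never visits dead                       │
+│          * objects and so would never run finalizers. */                    │
+│         if (fin == NULL && size <= NURSERY_CAPACITY)                        │
+│             return nursery_alloc(size);    /* bump-allocated */             │
+│         if (size <= YOUNG_HEAP_CAPACITY)                                    │
+│             return young_heap_alloc(size, fin);                             │
+│         return old_heap_alloc(size, fin);                                   │
+│     }                                                                       │
+│                                                                              │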
+│ The nursery’s size is chosen when the process starts, and is equal to the │
+│ size of the largest L1d cache. When the nursery fills up, Cheney’s algorithm │
+│ is run to copy any live objects into the young-heap. Because Cheney’s │
+│ algorithm does not traverse dead objects, objects with finalizers must be │
+│ allocated in (and, more importantly, freed from) another region. │
+│ │
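+│ A minimal rendition of that copying pass (the textbook two-pointer form     │
+│ of Cheney's algorithm; the object layout and helpers are hypothetical):     │
+│                                                                              │
+│     #include <string.h>   /* memcpy, size_t */                              │
+│                                                                              │
+│     static char *scan, *next;  /* both point into the young-heap */         │
+│                                                                              │
+│     static obj_t *copy(obj_t *o) {                                          │
+│         if (o->forward) return o->forward;   /* already moved */            │
+│         obj_t *to = (obj_t *)next;                                          │
+│         memcpy(to, o, o->size);                                             │
+│         next += o->size;                                                    │
+│         return o->forward = to;                                             │
+│     }                                                                       │
+│                                                                              │
+│     void nursery_gc(obj_t **roots, size_t nroots) {                         │
+│         scan = next;                                                        │
+│         for (size_t i = 0; i < nroots; i++)                                 │
+│             roots[i] = copy(roots[i]);                                      │
+│         while (scan < next) {        /* breadth-first scan */               │
+│             obj_t *o = (obj_t *)scan;                                       │
+│             for (size_t i = 0; i < o->nfields; i++)                         │
+│                 if (is_pointer(o->field[i]))                                │
+│                     o->field[i] = copy(o->field[i]);                        │
+│             scan += o->size;                                                │
+│         }                                                                   │
+│     }                                                                       │
+│                                                                              │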
+│ As data in the nursery grows upward, the addresses ┌──────────────┐ 0x…8000  │
+│ of slots that have had pointers written into them │remembered set│ │
+│ are written to the top of its region, and the end ├──────────────┤ 0x…7000 │
+│ of the allocatable region grows down. (This is the │ free │ │
+│ only write barrier the collector needs. Notably, │ space │ │
+│ this avoids branches other than “are we out of ├──────────────┤ 0x…5000 │
+│ space for the remembered set (in which case we need │ │ │
+│ to perform a nursery GC).”) │ │ │
+│ │ objects │ │
+│ When a nursery GC occurs, its remembered set is │ │ │
+│ checked for references into the nursery. These are │ │ │
+│ used as roots. References to other regions are └──────────────┘ 0x…0000 │
+│ copied into the remembered set of the young-heap, The Nursery Region’s │
+│ and non-references are discarded. This discarding Layout │
+│ behavior removes another branch -- whenever a value │
+│ not statically known to be an integer is written, │
+│ we can unconditionally add the place it was written to into the remembered │
+│ set. │
+│ │
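+│ Sketched in C (names hypothetical), the barrier is two stores, a            │
+│ pointer bump, and the single overflow check:                                │
+│                                                                              │
+│     static char *rs_top;         /* remembered set, grows down */           │
+│     static char *nursery_next;   /* allocation ptr, grows up   */           │
+│                                                                              │
+│     void write_barrier(value_t *slot, value_t v) {                          │
+│         *slot = v;                                                          │
+│         rs_top -= sizeof(value_t *);                                        │
+│         *(value_t **)rs_top = slot;   /* the slot's address */              │
+│         /* The only branch: the remembered set has met the                  │
+│          * allocation pointer, so the nursery is full. */                   │
+│         if (rs_top <= nursery_next)                                         │
+│             collect_nursery();                                              │
+│     }                                                                       │
+│                                                                              │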
+│ The young-heap is a region that is larger than the nursery, but small enough │
+│ that collections should still be fast. It is sized to be a fixed multiple of │
+│ the amount of L3 cache in the system (currently, 2). It is collected with a │
+│ mark-compact algorithm based on the Lisp 2 algorithm, which makes finalizers │
+│ essentially free. To ensure that there is always room to complete a nursery │
+│ GC, we reserve a portion of the young-heap equal in size to the nursery, and │
+│ collect the young-heap once there is not enough free memory to allocate │
+│ without cutting into that reservation. │
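+│                                                                              │
+│ The reservation check itself is tiny (a sketch; names are again             │
+│ hypothetical):                                                              │
+│                                                                              │
+│     #include <stddef.h>   /* size_t */                                      │
+│     #include <stdbool.h>                                                    │
+│                                                                              │
+│     bool young_heap_should_collect(size_t request) {                        │
+│         /* Never cut into the nursery-sized reservation that                │
+│          * guarantees a nursery GC can always complete. */                  │
+│         return young_heap_free() < request + NURSERY_CAPACITY;              │
+│     }                                                                       │
+│                                                                              │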
+│ ╭──────╥────────────────────────────────╮ │
+│ Objects in the young-heap remain │ TODO ║ Is there a selection algorithm │ │
+│ there for a tunable number of GC ╞══════╝ we could use to make a dynamic │ │
+│ cycles (currently defaulting to 3) │ choice for how long objects remain in │ │
+│ before being tenured -- that is, │ the young-heap before being tenured? │ │
+│ moved to the old-heap. ╰───────────────────────────────────────╯ │
+│ │
+│ The young-heap has its own remembered set, whose entries are copied in from │
+│ the nursery’s remembered set. To ensure that these entries are up to date, │
+│ a young-heap GC can only occur when the nursery is empty. After a young-heap │
+│ GC, we do not need to copy the remembered set into the old-heap, because all │
+│ remaining references will stay internal to the old-heap. │
+│ │
+│ Finally, the old-heap stores objects that have outlived their time in the │
+│ other regions. It is organized into several blocks, each of which belongs to │
+│ a size class. New blocks are allocated and freed as needed, so the old-heap │
+│ dynamically grows and shrinks. Old-heap collections occur when the old-heap │
+│ is using twice as much memory as it was the last time a collection completed │
+│ (capped by a memory limit, by default the size of RAM on the entire system). │
+│ │
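+│ As a sketch (with hypothetical names), the growth trigger is:               │
+│                                                                              │
+│     #include <stddef.h>   /* size_t */                                      │
+│     #include <stdbool.h>                                                    │
+│                                                                              │
+│     /* live_at_last_gc is recorded as each collection finishes;             │
+│      * memory_limit defaults to the size of system RAM. */                  │
+│     bool old_heap_should_collect(void) {                                    │
+│         size_t target = live_at_last_gc * 2;                                │
+│         if (target > memory_limit) target = memory_limit;                   │
+│         return old_heap_in_use() >= target;                                 │
+│     }                                                                       │
+│                                                                              │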
+│ The old-heap is collected using mark-sweep with bitmap marking, including a │
+│ per-block bit which allows entire blocks to be swept at once if they contain │
+│ only dead objects without finalizers. (A separate bit is maintained when │
+│ objects are allocated and when sweeping, to track whether a block contains │
+│ objects with finalizers or not.) │
+│ │
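+│ A sweep of one block might look like this (a sketch; the block layout       │
+│ and helper names are hypothetical):                                         │
+│                                                                              │
+│     #include <stddef.h>   /* size_t */                                      │
+│     #include <stdbool.h>                                                    │
+│                                                                              │
+│     void sweep_block(block_t *b) {                                          │
+│         /* Block-level fast path: nothing is marked and nothing             │
+│          * needs finalizing, so reclaim the block wholesale. */             │
+│         if (bitmap_is_empty(b->marks) && !b->has_finalizers) {              │
+│             free_block(b);                                                  │
+│             return;                                                         │
+│         }                                                                   │
+│         bool live_finalizable = false;                                      │
+│         for (size_t i = 0; i < b->nslots; i++) {                            │
+│             obj_t *o = block_slot(b, i);                                    │
+│             if (bitmap_test(b->marks, i)) {   /* live */                    │
+│                 live_finalizable |= has_finalizer(o);                       │
+│             } else {                                                        │
+│                 if (has_finalizer(o)) run_finalizer(o);                     │
+│                 free_list_push(b, i);                                       │
+│             }                                                               │
+│         }                                                                   │
+│         bitmap_clear_all(b->marks);                                         │
+│         b->has_finalizers = live_finalizable;   /* block bit */             │
+│     }                                                                       │
+│                                                                              │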
+│ As the young-heap requires the nursery to be empty before it is collected, │
+│ so too does the old-heap require both the nursery and young-heap to be empty │
+│ before it can be collected. Because collecting the old-heap tends to be so │
+│ slow relative to the nursery and young-heap, this does not add significant │
+│ overhead. │
+│ │
+│ If 5 consecutive old-heap collections each recover less than 2% of heap     │
+│ memory while the heap is at its maximum size, the process exits with an     │
+│ error. This behavior and these constants are taken from OpenJDK, and are an │
+│ attempt to avoid GC thrashing. │
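+│                                                                              │
+│ Concretely (a sketch, with the OpenJDK-derived constants inlined and a      │
+│ hypothetical runtime_fatal):                                                │
+│                                                                              │
+│     #include <stddef.h>   /* size_t */                                      │
+│     #include <stdbool.h>                                                    │
+│                                                                              │
+│     /* Called after each old-heap collection completes. */                  │
+│     void check_gc_thrashing(size_t freed, size_t heap_size,                 │
+│                             bool at_memory_limit) {                         │
+│         static int ineffective = 0;                                         │
+│         if (at_memory_limit && freed * 50 < heap_size)                      │
+│             ineffective++;        /* freed less than 2% */                  │
+│         else                                                                │
+│             ineffective = 0;                                                │
+│         if (ineffective >= 5)                                               │
+│             runtime_fatal("GC overhead limit exceeded");                    │
+│     }                                                                       │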
+├──────────────────────────────────────────────────────────────────────────────┤
+│ Thread-based parallelism is not yet implemented. However, parallelizing this │
+│ collector design is relatively straightforward. Each thread should be pinned │
+│ to a hardware core. Threads can then get a per-thread nursery sized to their │
+│ core’s L1d cache, rather than a system-wide maximum. │
+│ │
+│ The young-heap and old-heap are each protected with a mutex, which prevents │
+│ multiple calls to allocate memory from interfering with one another. │
+│ Unfortunately, this also means that nursery collections cannot occur │
+│ concurrently. It is an open question whether this is an issue in practice as │
+│ mutators and nursery collections execute, but it becomes an issue during │
+│ young-heap collections. │
+│ │
+│ To mitigate this issue, starting a young-heap collection does not signal the │
+│ other threads immediately. Instead, each thread is responsible for notifying │
+│ the next thread just before it completes collection, unless another thread │
+│ is already waiting to collect. This maximizes the amount of time threads │
+│ spend mutating instead of waiting to perform a nursery collection. │
+│ │
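+│ A sketch of that handoff (synchronization details elided; the helper        │
+│ names are hypothetical):                                                    │
+│                                                                              │
+│     void young_gc_nursery_phase(thread_t *self) {                           │
+│         wait_for_handoff(self);      /* from the prior thread */            │
+│         collect_own_nursery(self);   /* leaves nursery empty */             │
+│         /* Notify the successor just before finishing, unless               │
+│          * a thread is already waiting to collect. */                       │
+│         if (!already_waiting(next_thread(self)))                            │
+│             handoff_to(next_thread(self));                                  │
+│         finish_nursery_gc(self);     /* final bookkeeping */                │
+│     }                                                                       │
+│                                                                              │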
+│ Once all the nursery collections have completed in serial, the marking phase │
+│ can be performed in parallel with work-stealing. Of the Lisp 2 algorithm’s │
+│ passes, only the “update object fields to refer to the new location” pass is │
+│ obviously parallelizable. Finalizers are also called in this pass, to allow │
+│ them to be executed in parallel as well. (Recall that finalizers are always │
+│ provided by the runtime, so they can be trusted to be thread-safe.) │
+│ │
+│ Finally, the old-heap can easily be both marked and swept in parallel. Each │
+│ block is conceptually owned by a thread, and that thread is responsible for │
+│ sweeping it. │
+├──────────────────────────────────────────────────────────────────────────────┤
+│ There are some experiments it’d be nice to run with this collector: │
+│ │
+│ 1. Right now, the write barrier design is intended to avoid branches as │
+│ much as possible (without relying on page faults). Instead, extra │
+│ nursery collections will be performed due to the remembered set being │
+│ larger than it needs to be. │
+│ │
+│ i. Are the branches really that expensive? In particular, does it end │
+│ up being a net win to add a branch on: │
+│ │
+│ a. Whether the written value is a fixnum at runtime or not. │
+│ │
+│ b. Whether the slot being written to is in a non-nursery object. │
+│ │
+│ c. Whether the value being written is a pointer to a non-old-heap │
+│ object. │
+│ │
+│ d. Whether the value being written is a pointer to an object in a │
+│ strictly younger region than the slot it’s being written to. │
+│ │
+│     ii. Is the branch on whether the nursery is full really that cheap?    │
+│         Does it end up being a net win to store the nursery separately     │
+│         from the remembered set, and use guard pages to fault when each    │
+│         one is full?                                                       │
+│ │
+│ 2. The constants listed above were chosen by a combination of “that sounds │
+│ about right,” “Go does that,” and “OpenJDK does that.” It’s plausible │
+│ that instead, good defaults could be learned with a genetic algorithm │
+│ that tunes the constants in a much more general formula, e.g. │
+│ │
+│ nursery size = k₁ + k₂ × L1d size + k₃ × L2 size + k₄ × L3 size + … │
+│ │
+│ It would be useful to try learning good defaults on a wide variety of │
+│ machines and workloads, and seeing if natural constants “fall out.” │
+│ (Note that this means running a single learning process whose loss │
+│ function depends on the performance of the formula across all the │
+│ machines.) │
+│ │
+│ This could also be used alongside any “boolean-valued” experiments, to │
+│ learn whether they’re useful and how they affect the constants. │
+│ │
+│ Finally, this could plausibly even be a useful tool to ship to users │
+│      of the compiler, so they can tune GC constants for release builds      │
+│ of their own software. │
+│ │
+│ 3. The parallel variant pins threads to CPUs because many modern machines │
+│      have heterogeneous core clusters, so different CPUs will have          │
+│      different L1d sizes. However, this is only likely to have reasonable   │
+│      performance with a work-stealing green threading runtime in            │
+│      userspace, and even in that case may have worse mutator performance    │
+│      than not pinning threads.                                              │
+│ │
+│ Does pinning and having optimally-cache-sized nurseries really improve │
+│ performance that much versus uniformly sizing the nurseries and not │
+│ pinning? │
+│ │
+│ 4. Parallelizing the “update object fields to refer to the new location” │
+│ pass of the young-heap collection requires the “assign forwarding │
+│ pointers to objects” pass to estimate which parts of the young-heap are │
+│ assigned to which thread, since they must be aligned to the boundaries │
+│ between objects. │
+│ │
+│ In the above design, the passes are serialized, such that only after │
+│ all threads have been assigned to parts of the region do any of them │
+│ start updating object fields. Instead, we could imagine giving threads │
+│ non-uniformly-sized parts of the region, and having them start as soon │
+│ as they are assigned. │
+│ │
+│ Threads could also start the “shift all objects down” pass as soon as │
+│ all threads before them have finished the “update object fields” pass. │
+│ │
+│ i. Does interleaving the first two passes improve performance? │
+│ │
+│ ii. Should threads be given (approximately) equally-sized portions of │
+│ the region, or should they follow some curve, so threads that start │
+│ earlier are given more work to do? │
+│ │
+│ This curve’s parameters should probably be learned as above. │
+│ │
+│ iii. Does interleaving the last two passes improve performance? This │
+│ means shifting objects down while some (later) objects’ fields are │
+│ still being updated. │
+│ │
+│ iv. Does interleaving all three passes improve performance? │
+╰──────────────────────────────────────────────────────────────────────────────╯