╭──────────────────────────────────────────────────────────────────────────────╮
│ imb3: A programming language that combines the best parts of Common Lisp and │
│ Forth to make a very high-level language with easy metaprogramming. │
╞══════════════════════════════════════════════════════════════════════════════╡
│ The "guts" of the language resemble Common Lisp (e.g. a CLOS-like object │
│ system, a condition system, a high-performance garbage collector), while the │
│ surface syntax is all Forth. There is also a type checker for a type system │
│ that allows dynamically-typed code to exist while also allowing fairly │
│ complex properties to be checked at compile-time if desired. The cherry on │
│ top is an efficient native compiler with the REPL interface users of both │
│ Common Lisp and Forth have come to expect. │
├──────────────────────────────────────────────────────────────────────────────┤
│ This initial implementation is designed to be recompilable for a variety of │
│ platforms, including the Nintendo 3DS. │
╰──────────────────────────────────────────────────────────────────────────────╯
╭──────────────────────────────────────────────────────────────────────────────╮
│ Garbage Collector │
╞══════════════════════════════════════════════════════════════════════════════╡
│ Currently, the garbage collector is a generational garbage collector with │
│ three generations -- the “nursery,” the “young-heap,” and the “old-heap.” │
│ Each one uses a different garbage collection algorithm. │
│ │
│ The nursery is the region almost all new objects are allocated into. There │
│ are two exceptions: │
│ │
│ 1. Objects that are too large to fit in the nursery. │
│ 2. Objects with a finalizer. Note that finalizers are not user-exposed, to │
│ avoid issues with object resurrection -- they are currently only used │
│ to allow allocating executable code, using a custom allocator with a │
│      malloc/free-style interface.                                            │
│ │
│ These objects are instead allocated into the young-heap if they could fit │
│ there, and the old-heap if not. │
│ │
│ The nursery’s size is chosen when the process starts, and is equal to the │
│ size of the largest L1d cache. When the nursery fills up, Cheney’s algorithm │
│ is run to copy any live objects into the young-heap. Because Cheney’s │
│ algorithm does not traverse dead objects, objects with finalizers must be │
│ allocated in (and, more importantly, freed from) another region. │
│ │
│ As the data in the nursery grows up, the addresses ┌──────────────┐ 0x…8000 │
│ of slots that have had pointers written into them │remembered set│ │
│ are written to the top of its region, and the end ├──────────────┤ 0x…7000 │
│ of the allocatable region grows down. (This is the │ free │ │
│ only write barrier the collector needs. Notably, │ space │ │
│ this avoids branches other than “are we out of ├──────────────┤ 0x…5000 │
│ space for the remembered set (in which case we need │ │ │
│ to perform a nursery GC).”) │ │ │
│ │ objects │ │
│ When a nursery GC occurs, its remembered set is │ │ │
│ checked for references into the nursery. These are │ │ │
│ used as roots. References to other regions are └──────────────┘ 0x…0000 │
│ copied into the remembered set of the young-heap, The Nursery Region’s │
│ and non-references are discarded. This discarding Layout │
│ behavior removes another branch -- whenever a value │
│ not statically known to be an integer is written, │
│ we can unconditionally add the place it was written to into the remembered │
│ set. │
│ │
│ The young-heap is a region that is larger than the nursery, but small enough │
│ that collections should still be fast. It is sized to be a fixed multiple of │
│ the amount of L3 cache in the system (currently, 2). It is collected with a │
│ mark-compact algorithm based on the Lisp 2 algorithm, which makes finalizers │
│ essentially free. To ensure that there is always room to complete a nursery │
│ GC, we reserve a portion of the young-heap equal in size to the nursery, and │
│ collect the young-heap once there is not enough free memory to allocate │
│ without cutting into that reservation. │
│ ╭──────╥────────────────────────────────╮ │
│ Objects in the young-heap remain │ TODO ║ Is there a selection algorithm │ │
│ there for a tunable number of GC ╞══════╝ we could use to make a dynamic │ │
│ cycles (currently defaulting to 3) │ choice for how long objects remain in │ │
│ before being tenured -- that is, │ the young-heap before being tenured? │ │
│ moved to the old-heap. ╰───────────────────────────────────────╯ │
│ │
│ The young-heap has its own remembered set, whose entries are copied in from │
│ the nursery’s remembered set. To ensure that these entries are up to date, │
│ a young-heap GC can only occur when the nursery is empty. After a young-heap │
│ GC, we do not need to copy the remembered set into the old-heap, because all │
│ remaining references will stay internal to the old-heap. │
│ │
│ Finally, the old-heap stores objects that have outlived their time in the │
│ other regions. It is organized into several blocks, each of which belongs to │
│ a size class. New blocks are allocated and freed as needed, so the old-heap │
│ dynamically grows and shrinks. Old-heap collections occur when the old-heap │
│ is using twice as much memory as it was the last time a collection completed │
│ (capped by a memory limit, by default the size of RAM on the entire system). │
│ │
│ The old-heap is collected using mark-sweep with bitmap marking, including a │
│ per-block bit which allows entire blocks to be swept at once if they contain │
│ only dead objects without finalizers. (A separate bit is maintained when │
│ objects are allocated and when sweeping, to track whether a block contains │
│ objects with finalizers or not.) │
│ │
│ As the young-heap requires the nursery to be empty before it is collected, │
│ so too does the old-heap require both the nursery and young-heap to be empty │
│ before it can be collected. Because collecting the old-heap tends to be so │
│ slow relative to the nursery and young-heap, this does not add significant │
│ overhead. │
│ │
│ If 5 consecutive old-heap collections each recover less than 2% of heap     │
│ memory while the heap is at its maximum size, the process exits with an     │
│ error. This behavior and these constants are taken from OpenJDK, and are an │
│ attempt to avoid GC thrashing. │
├──────────────────────────────────────────────────────────────────────────────┤
│ Thread-based parallelism is not yet implemented. However, parallelizing this │
│ collector design is relatively straightforward. Each thread should be pinned │
│ to a hardware core. Threads can then get a per-thread nursery sized to their │
│ core’s L1d cache, rather than a system-wide maximum. │
│ │
│ The young-heap and old-heap are each protected with a mutex, which prevents │
│ multiple calls to allocate memory from interfering with one another. │
│ Unfortunately, this also means that nursery collections cannot occur │
│ concurrently. It is an open question whether this is an issue in practice as │
│ mutators and nursery collections execute, but it becomes an issue during │
│ young-heap collections. │
│ │
│ To mitigate this issue, starting a young-heap collection does not signal the │
│ other threads immediately. Instead, each thread is responsible for notifying │
│ the next thread just before it completes collection, unless another thread │
│ is already waiting to collect. This maximizes the amount of time threads │
│ spend mutating instead of waiting to perform a nursery collection. │
│ │
│ Once all the nursery collections have completed in serial, the marking phase │
│ can be performed in parallel with work-stealing. Of the Lisp 2 algorithm’s │
│ passes, only the “update object fields to refer to the new location” pass is │
│ obviously parallelizable. Finalizers are also called in this pass, to allow │
│ them to be executed in parallel as well. (Recall that finalizers are always │
│ provided by the runtime, so they can be trusted to be thread-safe.) │
│ │
│ Finally, the old-heap can easily be both marked and swept in parallel. Each │
│ block is conceptually owned by a thread, and that thread is responsible for │
│ sweeping it. │
├──────────────────────────────────────────────────────────────────────────────┤
│ There are some experiments it’d be nice to run with this collector: │
│ │
│ 1. Right now, the write barrier design is intended to avoid branches as │
│ much as possible (without relying on page faults). Instead, extra │
│ nursery collections will be performed due to the remembered set being │
│ larger than it needs to be. │
│ │
│ i. Are the branches really that expensive? In particular, does it end │
│ up being a net win to add a branch on: │
│ │
│ a. Whether the written value is a fixnum at runtime or not. │
│ │
│ b. Whether the slot being written to is in a non-nursery object. │
│ │
│ c. Whether the value being written is a pointer to a non-old-heap │
│ object. │
│ │
│ d. Whether the value being written is a pointer to an object in a │
│ strictly younger region than the slot it’s being written to. │
│ │
│ ii. Is the branch on whether the nursery is full really that cheap? │
│          Does it end up being a net win to store the nursery separately     │
│          from the remembered set, and use guard pages to fault when each    │
│          one is full?                                                       │
│ │
│ 2. The constants listed above were chosen by a combination of “that sounds │
│ about right,” “Go does that,” and “OpenJDK does that.” It’s plausible │
│ that instead, good defaults could be learned with a genetic algorithm │
│ that tunes the constants in a much more general formula, e.g. │
│ │
│ nursery size = k₁ + k₂ × L1d size + k₃ × L2 size + k₄ × L3 size + … │
│ │
│ It would be useful to try learning good defaults on a wide variety of │
│ machines and workloads, and seeing if natural constants “fall out.” │
│ (Note that this means running a single learning process whose loss │
│ function depends on the performance of the formula across all the │
│ machines.) │
│ │
│ This could also be used alongside any “boolean-valued” experiments, to │
│ learn whether they’re useful and how they affect the constants. │
│ │
│ Finally, this could plausibly even be a useful tool to ship to users │
│      of the compiler, so they can tune GC constants for release builds      │
│ of their own software. │
│ │
│ 3. The parallel variant pins threads to CPUs because many modern machines │
│      have heterogeneous core clusters, so different CPUs have different     │
│ L1d sizes. However, this is only likely to have reasonable performance │
│ with a work-stealing green threading runtime in userspace, and even in │
│ that case may have worse mutator performance than not pinning threads. │
│ │
│ Does pinning and having optimally-cache-sized nurseries really improve │
│ performance that much versus uniformly sizing the nurseries and not │
│ pinning? │
│ │
│ 4. Parallelizing the “update object fields to refer to the new location” │
│ pass of the young-heap collection requires the “assign forwarding │
│ pointers to objects” pass to estimate which parts of the young-heap are │
│ assigned to which thread, since they must be aligned to the boundaries │
│ between objects. │
│ │
│ In the above design, the passes are serialized, such that only after │
│ all threads have been assigned to parts of the region do any of them │
│ start updating object fields. Instead, we could imagine giving threads │
│ non-uniformly-sized parts of the region, and having them start as soon │
│ as they are assigned. │
│ │
│ Threads could also start the “shift all objects down” pass as soon as │
│ all threads before them have finished the “update object fields” pass. │
│ │
│ i. Does interleaving the first two passes improve performance? │
│ │
│ ii. Should threads be given (approximately) equally-sized portions of │
│ the region, or should they follow some curve, so threads that start │
│ earlier are given more work to do? │
│ │
│ This curve’s parameters should probably be learned as above. │
│ │
│ iii. Does interleaving the last two passes improve performance? This │
│ means shifting objects down while some (later) objects’ fields are │
│ still being updated. │
│ │
│ iv. Does interleaving all three passes improve performance? │
╰──────────────────────────────────────────────────────────────────────────────╯