The MaPLe compiler for efficient and scalable parallel functional programming
In this release, we implement automatic parallelism management [1], which helps address the granularity control problem. In short, the cost of ForkJoin.par is reduced by multiple orders of magnitude. Because of this reduced cost, ForkJoin.par can now be used much more liberally, making it possible to write very fine-grained parallel code with good performance in practice.
In many other languages, if you write parallel code that is too fine-grained, you risk a slowdown of as much as 10-100x, or worse. This slowdown is due to the overhead of creating and managing parallel tasks, and this overhead can outweigh the benefits of parallelism if you are not careful.
In contrast, MPL-v0.5 ensures that: (1) the overhead of creating and managing parallel tasks remains low, and also (2) scalability is preserved. This is accomplished with a provably efficient scheduling algorithm (described in [1]), which decides when and where to create tasks while the program is executing. The algorithm is provably efficient in the sense that it preserves both work and span up to small constant factors.
Note that, with MPL-v0.5, tuning grain sizes can still improve performance, but the performance gap between tuned and untuned code is much smaller: in our paper [1], we measure a gap of less than 2x on average across a range of benchmarks, with a worst case of only approximately 5-6x. In other words, with MPL-v0.5, the performance penalty for ignoring granularity control is only a small constant factor. This is much better than the 10-100x penalty that you would previously risk by ignoring granularity control.
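As an illustration, here is a minimal sketch (hypothetical code, not taken from the release) of fully fine-grained code: every recursive call goes through ForkJoin.par, with no granularity control at all. Under v0.5's automatic parallelism management, most of these potential tasks are never actually created, so code in this style runs with low overhead.

```sml
(* Parallel Fibonacci with no granularity control: every recursive
 * call is a potential parallel task. With automatic parallelism
 * management, the scheduler decides at run time which of these
 * calls actually become tasks. *)
fun fib n =
  if n < 2 then n
  else
    let
      val (a, b) =
        ForkJoin.par (fn _ => fib (n - 1), fn _ => fib (n - 2))
    in
      a + b
    end
```

In MPL-v0.4 and earlier, you would typically guard such a function with a sequential cutoff (e.g., fall back to sequential fib below some threshold); with v0.5 that tuning is optional.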
For benchmarks with no granularity control, we have measured that MPL-v0.5 is significantly faster than MPL-v0.4: up to 50x in some cases. For code that already has manual granularity control, the performance of v0.5 should generally be similar to v0.4. (Any changes here are likely due to differences in whole-program compiler optimizations.)
The changes in this release were implemented primarily by @shwestrick and @MatthewFluet. The design of the automatic parallelism management system is due to @shwestrick, @MatthewFluet, @mikerainey, and @umutacar, and is described in more detail in our paper [1]. Thank you to @typerSniper for help integrating with the entanglement management system! We also made a few bugfixes along the way. See "Summary of PRs and commits" below for more details.
[1] Automatic Parallelism Management. Sam Westrick, Matthew Fluet, Mike Rainey, and Umut A. Acar. POPL 2024.
As an example, here is a performance comparison of an N-queens (wikipedia link) benchmark, with and without granularity control, between MPL versions 0.4 and 0.5. (Full code available here. These experiments were run with N=13.)
The classic N-Queens algorithm is essentially an irregular parallel tree traversal, e.g., as illustrated here. It's not immediately obvious how best to do manual granularity control for this algorithm. One idea would be to switch to a sequential tree traversal after getting a few levels deep in the tree. As long as N is big enough, this should work well, because (a) the tree has large fan-out, exposing lots of parallelism at the top of the tree, and (b) the top of the tree is mostly regular... it only becomes irregular as we get deeper into the search. This granularity control strategy is implemented here, where we switch to sequential at depth 3.
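The depth-based cutoff described above can be sketched as follows. This is hypothetical illustrative code, not the benchmark source (which is linked above); childPlacements, isComplete, and parSum are assumed helper names.

```sml
(* Depth-based granularity control for the N-queens tree search:
 * traverse the top of the tree in parallel, and switch to a
 * sequential traversal below the cutoff depth. *)
val cutoff = 3

(* Sum f over a list, splitting the work in parallel. *)
fun parSum f xs =
  case xs of
    [] => 0
  | [x] => f x
  | _ =>
      let
        val half = List.length xs div 2
        val (a, b) =
          ForkJoin.par
            (fn _ => parSum f (List.take (xs, half)),
             fn _ => parSum f (List.drop (xs, half)))
      in
        a + b
      end

(* Count the number of valid completions of a partial board. *)
fun count (board, depth) =
  case childPlacements board of
    [] => if isComplete board then 1 else 0
  | children =>
      if depth >= cutoff then
        (* sequential below the cutoff *)
        List.foldl (fn (b, acc) => acc + count (b, depth + 1)) 0 children
      else
        (* parallel at the top of the tree *)
        parSum (fn b => count (b, depth + 1)) children
```

With v0.5, the "NO GRAN CONTROL" variant simply drops the depth test and always takes the parallel branch.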
Results for this benchmark are shown below.
+-----------------------------------+-----------------------------------+-------------------------------+
| | | OVERHEAD OF NO GRAN CONTROL |
| MANUAL GRAN CONTROL | NO GRAN CONTROL | (ratio: no_gran/manual_gran) |
+-----------------------------------+-----------------------------------+-------------------------------+
| ratio | ratio | |
procs | v0.4 v0.5 (v0.4 / v0.5) | v0.4 v0.5 (v0.4 / v0.5) | v0.4 v0.5 |
+-------+-----------------------------------+-----------------------------------+-------------------------------+
| 1 | 1.64s 1.56s 1.05x | 13.8s 3.75s 3.7x | 8.4x 2.4x |
| 10 | 0.178s 0.169s 1.05x | 11.6s 0.452s 26x | 65x 2.7x |
| 20 | 0.097s 0.097s 1.0x | 11.8s 0.240s 46x | 121x 2.5x |
| 30 | 0.067s 0.067s 1.0x | 2.96s 0.163s 18x | 44x 2.4x |
| 40 | 0.051s 0.050s 1.02x | 3.82s 0.125s 31x | 75x 2.5x |
| 50 | 0.041s 0.040s 1.03x | 0.888s 0.099s 9.0x | 22x 2.5x |
| 60 | 0.035s 0.034s 1.03x | 1.55s 0.083s 19x | 44x 2.4x |
| 72 | 0.031s 0.029s 1.07x | 0.997s 0.071s 14x | 32x 2.4x |
+-------+-----------------------------------+-----------------------------------+-------------------------------+
In MPL-v0.5, there are now three new @mpl ... -- runtime args. These control the automatic parallelism management system, which is based on a heartbeat scheduling strategy. Please see the paper [1] for more details!

- heartbeat-us <N>, which controls how many microseconds should pass between each heartbeat. (Default 500.)
- heartbeat-tokens <C>, which controls how many tokens each active thread should receive on each heartbeat. (Default 30.)
- heartbeat-relayer-threshold <T>, which controls the heartbeat delivery strategy in the run-time system. (Default 16.) With P worker threads (@mpl procs P --), if P > T, then MPL will designate one worker thread as the heartbeat relayer. The relayer's only job is to deliver heartbeats at a regular rate to all other workers. When P <= T, the run-time system instead uses a regular SIGALRM signal from the OS to deliver heartbeats.

Tokens are spent to perform promotions, where each promotion creates one task. The default settings (N=500, C=30) therefore allow for creating one task approximately every 500/30 ≈ 17 microseconds on average. Increasing C results in more parallelism but potentially higher overhead. Increasing N results in less parallelism and potentially less overhead.

The default settings should work well across a range of hardware and modern OSes.
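For instance, assuming an MPL-compiled executable named main (a hypothetical name for illustration), the heartbeat parameters can be adjusted at the command line like this:

```shell
# Run on 8 worker threads, with a heartbeat every 1000 microseconds
# and 30 tokens delivered per heartbeat, i.e., roughly one promotion
# every 1000/30 ≈ 33 microseconds on average.
./main @mpl procs 8 heartbeat-us 1000 heartbeat-tokens 30 --
```

Program arguments, if any, follow the closing --.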
MPL can be compiled with -trace true -trace-runtime true, and the resulting executable can then output an execution trace which is compatible with Perfetto. See mltrace/NOTES.md for more info.

Full Changelog: https://github.com/MPLLang/mpl/compare/v0.4...v0.5
Another year, another major release. In addition to multiple bugfixes and improvements (see below), this release features a milestone:
MPL now has full support for in-place updates.
Previously, MPL was only capable of running disentangled programs, and included a dynamic detector (similar to a data-race detector) that would exit the program with an error when entanglement occurred. With this update, MPL now automatically manages entanglement on-the-fly, allowing the program to continue safely and efficiently. This work was presented recently at PLDI 2023 in our paper Efficient Parallel Functional Programming with Effects [1].
- Real.{toString,fmt,...}: make gdtoa thread-safe
- Int.{fmt,toString}, Word.{fmt,toString}, Real.{split,toManExp}, etc.: update the One structure in the basis library to be thread-safe
- support the Real32 type for primitive CAS
- disable the splitTypes{1,2} passes for now (these will need to be updated to support primitive CAS)

The changes in this release are implemented primarily by Jatin Arora (@typerSniper) and Sam Westrick (@shwestrick). The design of the entanglement manager [1] is by Jatin Arora, Sam Westrick, and Umut Acar (@umutacar). As always, thank you to @MatthewFluet for his constant support of the project!
[1] Efficient Parallel Functional Programming with Effects. Jatin Arora, Sam Westrick, and Umut A. Acar. PLDI 2023.
Over the past year and a half, we have made a number of improvements to MPL. In addition to various bug-fixes and performance improvements, there are three major changes: entanglement detection, CGC-chaining, and a new block allocator.
In v0.3, the MPL runtime system now dynamically monitors reads and writes to check for entanglement. If entanglement is detected, then MPL terminates the program with an error message.
Entanglement detection ensures that MPL is safe for all programs: if it type-checks, then it is safe to run. Specifically, when you run a program, MPL v0.3 guarantees that either (a) the program runs to completion in a disentangled manner, or (b) the program terminates with an entanglement error. The entanglement detector is precise: no false alarms!
With entanglement detection, you can now safely use in-place updates to improve efficiency. You can even use in-place updates in a non-deterministic manner, which is essential for good performance in some cases. In particular, many well-known parallel programming techniques improve efficiency by interleaving atomic in-place updates and accesses to shared memory. These techniques are inherently non-deterministic, but desirable for performance optimization. A key advantage of disentanglement (in comparison to purely functional programming, where in-place updates are disallowed) is that it allows for in-place updates to be utilized in this manner.
For example, with v0.3, you can safely use atomic read-modify-write operations to efficiently traverse a graph and reduce contention, or use deterministic reservations to ensure determinism via non-deterministic speculation, or implement lock-free data structures to allow threads to communicate on-the-fly. We have multiple benchmarks using these techniques. Our breadth-first search benchmark interleaves atomic CAS operations to visit vertices, and our low-diameter decomposition uses a similar technique to compute the boundary between adjacent clusters. Betweenness centrality aggregates sums of weights via priority updates. In Delaunay triangulation, triangles atomically reserve access to their neighbors to resolve update conflicts.
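The "atomically visit a vertex" pattern used by the breadth-first search benchmark can be sketched as follows. This is illustrative code under an assumption: casArray is a hypothetical atomic array compare-and-swap primitive (MPL exposes such primitives, but the exact module path and name may differ).

```sml
(* Assumed primitive: casArray (a, i) (old, new) atomically replaces
 * a[i] with `new` if a[i] currently equals `old`, and returns the
 * value it observed at a[i]. *)

(* Claim vertex v: returns true iff this call is the one that
 * transitions `visited[v]` from 0 to 1. Multiple tasks may race on
 * the same vertex, but exactly one CAS wins, so each vertex is
 * processed exactly once. This race is benign and disentangled. *)
fun tryVisit (visited: int array) (v: int) : bool =
  casArray (visited, v) (0, 1) = 0
```

Entanglement detection in v0.3 permits this kind of non-deterministic communication through shared memory, terminating the program only if an execution actually becomes entangled.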
Using in-place updates for efficiency, our benchmarks are able to compete in performance with low-level code written in C/C++. For example, our breadth-first search and Delaunay triangulation benchmarks are only ~2x slower than state-of-the-art PBBS implementations, while using a similar amount of memory.
It is important to note that, because MPL allows for non-determinism, entanglement detection is execution-dependent. If an execution of a program exhibits entanglement, then MPL v0.3 will detect the entanglement and terminate. But on the exact same program—even on the same input—it is possible that a different execution might not exhibit entanglement, in which case MPL will allow the execution to complete. This is possible because the outcome of a race condition might determine whether or not entanglement occurs during execution. Race conditions are not necessarily bugs, however. All of the non-deterministic programming techniques we described above have race conditions, but these race conditions are "benign" in the sense of being intentional, useful for efficiency, and disentangled (with some care).
In terms of time and space overhead, don't worry: entanglement detection has nearly zero overhead. See the table below. The space overhead is essentially zero across the board (one exception: 20% for tinykaboom on 1 processor, but the overall memory usage for this benchmark is low). The largest time overhead is 7%, and a majority of benchmarks have 2% overhead or less, with a geomean of approximately 1% across all benchmarks.
When there is overhead, it is predictable and easy to optimize away. The only operations affected are mutable dereference operations, e.g. Array.sub and Ref.!, so it is easy to avoid overhead from entanglement detection by removing unnecessary mutation. Purely functional code has no overhead.
Overhead of Entanglement Detection
TIME SPACE
--------------------------------------------
Benchmark P=1 P=72 P=1 P=72
============================================
bfs-tree +3.0% +2.1% +0.0% -0.1%
centrality -1.2% -1.4% -0.0% -0.0%
dedup-strings +3.2% +7.0% -0.0% +4.8%
delaunay +5.3% +3.7% -0.1% +2.6%
dense-matmul +0.4% +0.0% -2.1% +0.2%
game-of-life -4.4% +0.0% -0.0% +0.1%
grep -1.9% +0.0% -0.0% +0.1%
low-d-decomp +2.6% -3.4% -0.0% -10.5%
msort-strings +2.6% -3.3% +0.1% +1.5%
nearest-nbrs +0.0% +2.2% +0.0% -0.1%
nqueens +0.6% +3.6% -0.3% +0.6%
palindrome +4.3% +3.2% +2.5% +0.2%
primes +1.9% -2.4% -0.3% -0.0%
quickhull +5.6% +6.8% +0.0% +2.9%
range-query +6.9% -0.8% -0.0% +3.0%
raytracer -5.7% +0.0% +1.2% +2.1%
reverb -2.2% -2.4% +0.0% -0.1%
seam-carve +0.0% +2.4% -0.1% -0.2%
skyline +3.6% +0.4% +0.1% -1.2%
suffix-array +3.6% +1.8% -0.1% -5.1%
tinykaboom +0.4% +0.0% +20.0% +0.4%
tokens -5.1% -2.3% +0.0% -0.0%
triangle-count -3.6% -0.5% +0.0% -2.2%
These benchmarks are available in our benchmark suite. To reproduce these experiments, see https://github.com/MPLLang/entanglement-detect/tree/icfp22-artifact
This change improves the CGC collector by placing collections off the critical path.
Some background: MPL's collector performs collections along spines in the task tree. Each spine collection is divided into two pieces, collected with two different algorithms: a local collection (LGC) handles the deepest heap and must briefly pause that heap's single mutator, while concurrent collection (CGC) handles the rest of the spine and runs alongside the mutators.
This approach requires almost no synchronization. Furthermore, GC is both parallel and concurrent: we get parallelism across spines, and we allow for GC and mutators to run concurrently. To perform a single spine collection, we only have to pause one mutator, to facilitate LGC. In this way, MPL has no stop-the-world pauses. This is the case for both v0.2 as well as v0.3.
The spines of the program can be unbalanced in their memory sizes. In particular, the spine that owns the root heap is often significantly larger than the other spines (which are short-lived). In v0.2, because spine collections are scheduled on the critical path, a single large collection might cause a noticeable decrease in throughput.
In v0.3, MPL spawns CGC tasks and schedules these exactly like normal tasks. Therefore, within a spine, multiple CGCs might be executed on different processors, if the scheduler deems it profitable. To keep CGC'ed heaps separate from the main spine heaps, we extended each heap with a "chain" of secondary heaps that are undergoing CGC. The chain can grow as multiple CGCs are spawned on different portions on the same heap (as the program forks and joins concurrently with ongoing CGCs). When a CGC task completes, the corresponding heap is merged back into the spine. CGC tasks are spawned at forks in the program, and are prioritized for shallow heaps (which are generally larger). LGC is still performed along the critical path, but each LGC is fast and only needs to pause a single mutator.
With this approach, in v0.3, the collector is better parallelized and has less impact on the critical path of the computation.
In #135, we developed a new block-allocation mechanism based in part on the Hoard algorithm. This is the central allocation mechanism in MPL: all major allocations performed by the rest of the system are backed by blocks.
The results are quite impressive, to say the least. Due to reduced mmap
pressure and better reuse of memory, we get as much as 30% improvement in time and 80% improvement in max residency. See #135 for more details.
Most of the implementation work was done by @shwestrick and @typerSniper. @larry98 helped develop the order maintenance component of the entanglement detector, and @ckoparkar ported a new example from Haskell. Big thanks to @umutacar for helping guide this project, and to @MatthewFluet for his constant support (and keeping us in-sync with MLton!)
The entanglement detection algorithm is designed by Sam Westrick, Jatin Arora, and Umut Acar. Details will be available in an upcoming publication. Entanglement detection is supported by the DePa algorithm for SP-order maintenance, which was designed and implemented by Sam Westrick, Larry Wang, and Umut Acar [5]. The MPL memory management system is based on the theory of disentanglement [1,2,3,4].
[1] Hierarchical Memory Management for Parallel Programs. Ram Raghunathan, Stefan K. Muller, Umut A. Acar, and Guy Blelloch. ICFP 2016.
[2] Hierarchical Memory Management for Mutable State. Adrien Guatto, Sam Westrick, Ram Raghunathan, Umut Acar, and Matthew Fluet. PPoPP 2018.
[3] Disentanglement in Nested-Parallel Programs. Sam Westrick, Rohan Yadav, Matthew Fluet, and Umut A. Acar. POPL 2020.
[4] Provably Space-Efficient Parallel Functional Programming. Jatin Arora, Sam Westrick, and Umut A. Acar. POPL 2021.
[5] DePa: Simple, Provably Efficient, and Practical Order Maintenance for Task Parallelism. Sam Westrick, Larry Wang, and Umut A. Acar. arXiv 2022.
It's been about a year since we released MPL-v0.1, and now we're ready for round 2! Here's what's changed.
Concurrent, Parallel GC is here! (#129) This is the biggest change, and has been in the works for over a year. Awesome work Jatin @typerSniper! Many of the details are published in our recent POPL'21 paper: Provably Space-Efficient Parallel Functional Programming.
General performance improvements
Upstream MLton merge (thanks @MatthewFluet!)
New features
- Parallel file I/O based on mmap/munmap. The existing basis library I/O functions were essentially entirely sequential. By mmap-ing a file, we can then load it into a heap array in parallel.

New examples
General maintenance
Bug fixes
- a fix involving the SimplifyTypes SSA optimization pass