perf: 60% faster cold builds via type dedup, solver memoization, and a larger GC nursery #96
Open
CharlonTank wants to merge 4 commits into lamdera:lamdera-next from
Conversation
Each `Can.Type` subtree is interned into a `Shape` whose children are already `Word32` pool IDs. Hashing/comparing a `Shape` is O(small) regardless of subtree size, so the `Map.lookup` that gates dedup is no longer expensive. The walk threads `InternState` through the interface; top-level structures (`Annotation`/`Union`/`Alias`/`Binop`) produce closed-over `Put` actions that reference children by ID, so serialization needs no second lookup phase. The wire format is identical to the previous Map-based PR.
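A minimal sketch of the interning idea, under assumptions: `Ty`, `Shape`, and `Pool` below are hypothetical stand-ins for `Can.Type`, the PR's `Shape`, and `InternState`, not the actual types. The point is that the intern key (`Shape`) holds only `Word32` child IDs, so the `Map` comparison cost is independent of subtree size:

```haskell
import qualified Data.Map.Strict as Map
import Data.Word (Word32)

-- Stand-in for Can.Type: a plain recursive tree.
data Ty = TCon String [Ty]
  deriving (Eq, Ord)

-- Interned form: children are pool IDs, not recursive trees,
-- so Ord/Eq on a Shape touch only a name and a short ID list.
data Shape = Shape String [Word32]
  deriving (Eq, Ord)

data Pool = Pool
  { poolNext :: !Word32
  , poolIds  :: !(Map.Map Shape Word32)
  }

emptyPool :: Pool
emptyPool = Pool 0 Map.empty

-- Bottom-up walk: intern the children first, then look up the
-- parent's Shape, which by then contains only IDs.
intern :: Ty -> Pool -> (Word32, Pool)
intern (TCon name children) pool0 =
  let (ids, pool1) = internAll children pool0
      shape = Shape name ids
  in case Map.lookup shape (poolIds pool1) of
       Just i  -> (i, pool1)  -- already in the pool: reuse the ID
       Nothing ->
         let i = poolNext pool1
         in (i, Pool (i + 1) (Map.insert shape i (poolIds pool1)))
  where
    internAll [] p = ([], p)
    internAll (t : ts) p =
      let (i, p')   = intern t p
          (is, p'') = internAll ts p'
      in (i : is, p'')
```

Interning the same subtree twice returns the same ID without growing the pool, which is what makes repeated references to a large alias cheap.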
When a function signature like FA -> Action references large monomorphic type aliases (e.g. `FrontendModel` with 100+ transitive fields), `srcTypeToVar` walks the entire expanded type and creates fresh UnionFind variables at every single call site. On a real project, this resulted in 45M+ `srcTypeToVar` calls for only ~4000 unique `Can.Type` subtrees (11,000x redundancy).

This commit adds a per-run cache keyed on `(rank, Can.Type)` that returns the previously built `Variable` when the same monomorphic type is encountered again. The cache is gated on `flexVars` being empty to ensure correctness in the presence of polymorphism: when type variables are in scope, sharing would incorrectly conflate distinct instantiations.

On a real Lamdera project (391 modules, including dense `Effect.Test` code):

- Cold build: 120 s -> 98 s (-18%, median over 3 runs)
- typecheck UsersFlows: 125 s -> 67 s (-46%) on the bottleneck module
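The gating logic can be sketched as follows. This is a hypothetical miniature, not the PR's actual `SolveCache` API: `memoized` and its `monomorphic` flag are illustrative names for "consult the cache only when `flexVars` is empty, otherwise take the original build path":

```haskell
import Data.IORef
import qualified Data.Map.Strict as Map

-- Per-run cache: reuse a previously built result only when the
-- caller is in a monomorphic position (no flex vars in scope).
memoized :: Ord k => IORef (Map.Map k v) -> Bool -> k -> IO v -> IO v
memoized cacheRef monomorphic key build
  | not monomorphic = build  -- polymorphic position: never share
  | otherwise = do
      cache <- readIORef cacheRef
      case Map.lookup key cache of
        Just v  -> pure v    -- same (rank, type) seen before: reuse
        Nothing -> do
          v <- build
          modifyIORef' cacheRef (Map.insert key v)
          pure v
```

The monomorphic guard is what keeps this safe: distinct instantiations of a polymorphic type must still get fresh variables, so those calls bypass the cache entirely.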
Profiling on a large project (391 modules, ~30 GB live working set during type checking) showed the compiler spent 86% of wall-clock time in GC: at ~1 GB/s of allocation, a young collection fired every ~130 ms, and each collection scanned a huge live set. The nursery was the bottleneck that dwarfed every code-level optimization we tried.

Increasing the allocation area to 1 GB reduces collection frequency by ~8x. On the same project, cold full builds drop from 130 s to 51 s (-61%, median of 3 runs). Peak RSS rises by ~1 GB on top of the existing ~19 GB working set, which is comfortably below typical developer-machine RAM budgets. Smaller projects keep their existing behavior since they never fill the nursery; the change is effectively a no-op for them.
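In cabal terms the change is one RTS default baked into the binary. A hypothetical excerpt; the exact `ghc-options` line in `elm.cabal` may carry different surrounding flags, but `-N12` and `-qg` are the settings mentioned in this PR:

```
-- elm.cabal (sketch): -with-rtsopts bakes RTS defaults into the
-- compiled binary; the bump is -A128m -> -A1g in this string.
ghc-options: -threaded -O2 "-with-rtsopts=-N12 -qg -A1g"
```

Users can still override these at runtime with `+RTS ... -RTS` if the binary is built with `-rtsopts`.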
Code-review pass on the perf changes:
* Introduced `Ext.Common.envFlag :: String -> Bool` to factor out the
  cached env-var presence check that was duplicated inline in
  File.hs (`LDEBUG_FILE_TIMING`) and Interface.hs (`LDEBUG_DEDUP_TIMING`).
* Dropped `serializeNanos` (declared, read by `getDedupTimings`, never
written). `getDedupTimings` simplified to `IO Double`.
* Dropped `_unusedIndex` workaround and the `Data.Index` import that
motivated it; nothing in the new code touches `Index.ZeroBased`.
* Merged `putInterfaceDedupRaw` and `putInterfaceDedupTimed` into a
single `putInterfaceDedup`. Timing branches off into a small
`recordPoolTime` helper that forces and clocks the pool, removing
~25 lines of duplicated serialization tail.
* Renamed `putKeyedPuts` to `putMapPuts` for symmetry with the existing
`getMapWith` reader.
* Replaced two `if X > 0 then atomicPutStrLn ... else return ()`
blocks in Reporting.hs with `when`.
Tried also factoring the `intern{Types,RecordFields,AliasArgs}`
helpers through `Data.List.mapAccumL`. Measurements showed a +17%
CPU regression on the cold-build workload (the generic tuple shape
that `mapAccumL` builds adds GC pressure on the pool-building hot
path), so the direct recursive helpers are kept and a comment now
documents why. Re-benchmarked on the same project: minimum CPU time
matches the pre-cleanup binary (38.5 s vs 39.4 s, within noise), no
wall-clock regression.
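For reference, `envFlag`'s contract could look like the sketch below. This is an assumption about the helper, not the actual `Ext.Common` code; the real version caches the lookup, which the comment indicates via the usual top-level-binding trick:

```haskell
import System.Environment (lookupEnv)
import System.IO.Unsafe (unsafePerformIO)

-- Sketch of the contract: True iff the variable is set to a
-- non-empty value. The cached form would be a NOINLINE top-level
-- binding per flag, evaluated once per process, e.g.:
--
--   {-# NOINLINE debugFileTiming #-}
--   debugFileTiming :: Bool
--   debugFileTiming = envFlag "LDEBUG_FILE_TIMING"
envFlag :: String -> Bool
envFlag name =
  unsafePerformIO (maybe False (not . null) <$> lookupEnv name)
```

Centralizing this keeps the `unsafePerformIO` in one audited place instead of duplicated at each timing site.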
TL;DR
Three independent changes that compose. On a project with 391 modules and large model record aliases (Lamdera-style `FrontendModel`/`BackendModel` patterns with `Effect.Test` setups), cold builds drop from 130 s to 51 s (-61% median, best case 48 s), and total `.elmi` size drops from 419 MB to 14 MB (30x).

1. `.elmi` serialization dedup (`Elm.Interface`): 30x smaller `.elmi`, time neutral
2. Memoize `srcTypeToVar` in the type solver, monomorphic case (`Type.Solve`)
3. RTS nursery bump `-A128m` → `-A1g` in `elm.cabal` (1 line)

Benchmarks
Methodology: cleared `elm-stuff`, then `lamdera make tests/Tests.elm`. Three cold builds per setup, median reported. Same machine (Apple M2 Max, 32 GB RAM), back-to-back, no other heavy load. Variance was ±10-15 s between runs because the working set is large enough to make the test sensitive to system noise; treat any single sub-5% delta as noise.

[Benchmark table lost in the page export. It compared total `.elmi` size and cold-build time for baseline (`lamdera-next`) and the Shape/memo variants, each under `-A128m` and `-A1g`, with all changes plus `-A1g` marked ⭐ as the winning configuration.]

Interesting findings worth noting:
- Shape + `-A1g` is slightly slower than baseline + `-A1g`. Once GC is no longer the bottleneck, the dedup pool's interning overhead (~23 s) costs more than the I/O savings from a 30x smaller `.elmi`. Shape becomes a time win only when paired with memo, which clears the type-checking bottleneck so interning becomes proportionally cheap.
- `-A1g` is the biggest "effort vs reward" win: a one-line cabal change worth ~50 s on this kind of project.

Why these changes
1. Shape-based type dedup (`Elm.Interface`)

When a project shares a large type alias (a backend record with 50+ fields, used as the type of every test-runner setup, every persisted-state shape, every effect-test program type, etc.), `Can.Type` expands the alias structurally at every reference. `lamdera-next` writes that subtree verbatim into each `.elmi` file. With ~10 references and ~50 modules using it, the same subtree gets serialized hundreds of times. Result: a single `.elmi` can balloon to hundreds of MB.

This PR introduces a bottom-up interning pool. Each `Can.Type` subtree becomes a small `Shape` value whose children are already `Word32` pool IDs (not recursive `Can.Type`s). Hashing/comparing a `Shape` is O(small) regardless of subtree size, so the `Map.lookup` that gates dedup is never expensive. Top-level annotations/unions/aliases produce `Put` actions closed over the IDs, so serialization needs no second lookup phase.

The wire format is identical to the previous Map-based dedup attempt. A magic byte (`0x00` sentinel) at the start of the `Interface` payload distinguishes the new format from the old one for backward-compatible reads. Old `.elmi` files keep working until they're regenerated.

This was attempted in PR #94 with a `Map`-based intern pool that hashed full `Can.Type` subtrees per lookup; measurements then showed it actually slowed builds by ~16% for the same reason: every lookup walked the entire subtree it was looking up. The Shape approach short-circuits this by making the lookup key already contain only IDs of children.

2. Memoize
`srcTypeToVar` in the solver (`Type.Solve`)

Profiling on the same project showed the type solver's `srcTypeToVar` was called ~45M times for ~4k unique types. Every call materializes a fresh UnionFind `Variable` for each subtree, and recursion repeats this work for every reference to the same type. There's no caching because in the general case the result depends on the surrounding `flexVars` map (free type variables in the local scope).

But: when `flexVars` is empty (the surrounding context is monomorphic, which holds for all top-level type annotations and any expression whose type is fully determined), the result of `srcTypeToVar t` depends only on `t` itself. We can safely memoize on `(rank, t)` in this case.

The patch threads an `IORef`-based `SolveCache` through the solver. The cache is consulted only when `Map.null flexVars` holds (a constant-time check). For polymorphic positions the original code path is unchanged, so this is opt-in safe.

3. RTS nursery bump
`-A128m` → `-A1g` (`elm.cabal`)

Profiling with `+RTS -s` revealed the dominant cost of a cold build is garbage collection, not actual compilation work. With `-N12 -qg`, every GC stops all 12 worker threads. With a 128 MB nursery and ~1 GB/s of allocation, a young-gen collection fires every ~130 ms, and each one scans the live working set (which can grow to ~30 GB during type checking on this kind of project).

Growing the nursery to `-A1g` reduces collection frequency by ~8x. Wall-clock GC drops from 81 s to a few seconds, and the productivity ratio goes from ~14% to ~50% of elapsed time. The cost is ~3 GB of additional virtual address space pre-reserved (most of which is never actually committed to physical RAM), and ~1 GB more peak resident memory.

Why wasn't this set already? Historical context: when Elm 0.19 was released, the GHC default was `-A1m`, dev machines commonly had 8-16 GB, and `-A1g` was per-thread (so `-N12 -A1g` reserves 12 GB of address space at startup, which would have been a deal-breaker). Lamdera's existing `-A128m` was already aggressive. With modern dev hardware (16-64 GB common) and the existence of larger projects, the trade-off has shifted.

Trade-offs
- Memory: the solver memo holds onto previously built `Variable` allocations for the run. The peak virtual footprint goes up by ~3 GB (extra nursery reservation), but on macOS and Linux that's address space, not committed RAM. Users with <16 GB of RAM can't compile this size of project either way.
- Compatibility: the new `.elmi` format is detected via a magic byte and old `.elmi` files are still read correctly. The reverse (new compiler reading old `.elmi`, old compiler reading new `.elmi`) is the standard `.elmi` invalidation path: any compiler upgrade typically requires regenerating `elm-stuff/` anyway.
- Small projects: `-A1g` is effectively a no-op on small projects (the nursery is never filled before a build completes). Memo and Shape add tiny overhead but are dwarfed by build time on small projects to the point of being unmeasurable. The changes target large projects without regressing small ones.

Files changed
- `compiler/src/Elm/Interface.hs` — Shape dedup implementation (~600 lines added)
- `compiler/src/Type/Solve.hs` — `srcTypeToVar` memoization
- `elm.cabal` — RTS opts bump (1 line)

Relation to PR #94
This PR supersedes #94 (currently in draft). #94 used a `Map`-based intern pool that hashed full `Can.Type` subtrees per lookup, which proved ~16% slower than baseline despite delivering the same `.elmi` size reduction. This PR is the working version: same `.elmi` format, faster intern path, plus the orthogonal solver and RTS wins.
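As a footnote on the backward-compatible reads described in section 1, the detection side can be sketched as below. This is hypothetical; `Format` and `detectFormat` are illustrative names, and the only fact taken from this PR is that a leading `0x00` sentinel marks the new pooled format:

```haskell
import qualified Data.ByteString as BS

data Format = Pooled | Legacy
  deriving (Eq, Show)

-- Peek at the first byte of the Interface payload: a 0x00 sentinel
-- marks the new pooled format; anything else (including an empty
-- payload) falls back to the legacy decoder.
detectFormat :: BS.ByteString -> Format
detectFormat bs =
  case BS.uncons bs of
    Just (0x00, _) -> Pooled
    _              -> Legacy
```

This only works if `0x00` can never begin a valid legacy payload, which is the assumption a sentinel-byte scheme rests on.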