
perf: 60% faster cold builds via type dedup, solver memoization, and a larger GC nursery #96

Open

CharlonTank wants to merge 4 commits into lamdera:lamdera-next from CharlonTank:perf/shape-and-memo

Conversation

@CharlonTank (Contributor)

TL;DR

Three independent changes that compose. On a project with 391 modules and large model record aliases (Lamdera-style FrontendModel/BackendModel patterns with Effect.Test setups), cold builds drop from 130 s to 51 s (-61% median, best case 48 s), and total .elmi size drops from 419 MB to 14 MB (30x).

| # | Change | File(s) | Effect alone |
| --- | --- | --- | --- |
| 1 | Shape-based type dedup in .elmi serialization | Elm.Interface | -97% .elmi, time-neutral |
| 2 | Memoize srcTypeToVar in the type solver (monomorphic case) | Type.Solve | -26% cold build |
| 3 | Bump RTS allocation area -A128m → -A1g | elm.cabal (1 line) | -10% cold build |
|   | All three combined | | -61% cold build, -97% .elmi |

Benchmarks

Methodology: cleared elm-stuff, then lamdera make tests/Tests.elm. Three cold builds per setup, median reported. Same machine (Apple M2 Max, 32 GB RAM), back-to-back, no other heavy load. Variance was ±10-15 s between runs because the working set is large enough to make the test sensitive to system noise — treat any single sub-5% delta as noise.

| Setup | RTS nursery | Median | Min | vs baseline | .elmi total |
| --- | --- | --- | --- | --- | --- |
| Baseline (lamdera-next) | -A128m | 130.54 s | 119.06 s | | 419 MB |
| Baseline | -A1g | 117.69 s | 110.99 s | -10% | 419 MB |
| Memo only | -A128m | 96.19 s | 92.60 s | -26% | 465 MB |
| Memo only | -A1g | 68.37 s | 64.47 s | -48% | 465 MB |
| Shape only | -A1g | 136.47 s | 133.03 s | +5% | 13.9 MB |
| Shape + Memo | -A128m | 121.17 s | 104.86 s | -7% | 13.9 MB |
| Shape + Memo | -A1g | 51.03 s | 48.27 s | -61% | 13.9 MB |

Findings worth noting:

  • Shape-only with -A1g is slightly slower than baseline + -A1g. Once GC is no longer the bottleneck, the dedup pool's interning overhead (~23 s) costs more than the I/O savings from a 30x smaller .elmi. Shape becomes a time win only when paired with memo, which clears the type-checking bottleneck so interning becomes proportionally cheap.
  • Memo gives the largest single-change time win because the solver dominates a cold build's CPU profile on projects with many large monomorphic types.
  • -A1g is the biggest "effort vs. reward" win: a one-line cabal change worth ~50 s on this kind of project.

Why these changes

1. Shape-based type dedup (Elm.Interface)

When a project shares a large type alias (a backend record with 50+ fields, used as the type of every test-runner setup, every persisted-state shape, every effect-test program type, etc.), Can.Type expands the alias structurally at every reference. lamdera-next writes that subtree verbatim into each .elmi file. With ~10 references per module and ~50 modules using it, the same subtree gets serialized hundreds of times. Result: a single .elmi can balloon to hundreds of MB.

This PR introduces a bottom-up interning pool. Each Can.Type subtree becomes a small Shape value whose children are already Word32 pool IDs (not recursive Can.Types). Hashing/comparing a Shape is O(small) regardless of subtree size, so the Map.lookup that gates dedup is never expensive. Top-level annotations/unions/aliases produce Put actions closed over the IDs, so serialization needs no second lookup phase.
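As a rough illustration of the mechanism, here is a minimal sketch with hypothetical names (Shape, Pool, intern); the real pool in compiler/src/Elm/Interface.hs covers the full Can.Type surface, including records, aliases, and unions:

```haskell
-- Minimal sketch of bottom-up shape interning; names are illustrative,
-- not the PR's actual definitions.
import qualified Data.Map.Strict as Map
import Data.Word (Word32)

-- A Shape mirrors one type constructor, but its children are pool IDs
-- rather than recursive subtrees, so Eq/Ord on it never walks a subtree.
data Shape
  = SLambda !Word32 !Word32   -- argument ID, result ID
  | STuple ![Word32]          -- component IDs
  | SVar !String              -- simplified; real keys carry module info
  deriving (Eq, Ord)

data Pool = Pool
  { poolNext :: !Word32
  , poolIds  :: !(Map.Map Shape Word32)
  }

-- Children are interned before their parent, so the parent's lookup key
-- is already flat: the Map.lookup costs O(log n) small comparisons.
intern :: Shape -> Pool -> (Word32, Pool)
intern shape pool@(Pool next ids) =
  case Map.lookup shape ids of
    Just i  -> (i, pool)  -- duplicate subtree: reuse the existing ID
    Nothing -> (next, Pool (next + 1) (Map.insert shape next ids))
```

A 50-field record referenced hundreds of times then costs one pool entry plus hundreds of Word32 IDs, instead of hundreds of full subtrees.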

The wire format is identical to the previous Map-based dedup attempt. A magic byte (0x00 sentinel) at the start of the Interface payload distinguishes the new format from the old one for backward-compatible reads. Old .elmi files keep working until they're regenerated.
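The read side is then a one-byte dispatch. A self-contained sketch using the binary package (the compiler's own serialization monad differs, and the stand-in Interface type here is hypothetical):

```haskell
-- Sketch of sentinel-based format detection; Interface is a stand-in.
import Data.Binary.Get (Get, getWord8, lookAhead, skip)

data Interface = Legacy | Deduped   -- placeholder for the real payload
  deriving (Show)

getInterfacePayload :: Get Interface
getInterfacePayload = do
  tag <- lookAhead getWord8          -- peek without consuming
  if tag == 0x00
    then skip 1 >> pure Deduped      -- 0x00 sentinel: new pooled format
    else pure Legacy                 -- anything else: old verbatim format
```

Presumably the old format never begins with a 0x00 byte at that offset, which is what makes the sentinel unambiguous.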

This was attempted in PR #94 with a Map-based intern pool that hashed full Can.Type subtrees per lookup. Measurements then showed it actually slowed builds by ~16%: every lookup walked the entire subtree it was looking up. The Shape approach short-circuits this by making the lookup key contain only the IDs of its children.

2. Memoize srcTypeToVar in the solver (Type.Solve)

Profiling on the same project showed the type solver's srcTypeToVar was called ~45M times for ~4k unique types. Every call materializes a fresh UnionFind Variable for each subtree, and recursion repeats this work for every reference to the same type. There's no caching because in the general case the result depends on the surrounding flexVars map (free type variables in the local scope).

But: when flexVars is empty (the surrounding context is monomorphic — which holds for all top-level type annotations and any expression whose type is fully determined), the result of srcTypeToVar t depends only on t itself. We can safely memoize on (rank, t) in this case.

The patch threads an IORef-based SolveCache through the solver. The cache is consulted only when Map.null flexVars holds (a constant-time check). For polymorphic positions the original code path is untouched, so the memoization only engages where it is provably safe.
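In sketch form, with simplified types and illustrative names (the real SolveCache and srcTypeToVar in Type/Solve.hs are richer than this):

```haskell
-- Sketch of the monomorphic-only memo gate.
import Data.IORef (IORef, modifyIORef', readIORef)
import qualified Data.Map.Strict as Map

type SolveCache ty var = IORef (Map.Map (Int, ty) var)

memoizedSrcTypeToVar
  :: Ord ty
  => SolveCache ty var
  -> Map.Map String var     -- flexVars: type variables in local scope
  -> Int                    -- rank
  -> ty
  -> (Int -> ty -> IO var)  -- the original, uncached conversion
  -> IO var
memoizedSrcTypeToVar cache flexVars rank ty convert
  | Map.null flexVars = do  -- monomorphic: result depends on ty alone
      seen <- readIORef cache
      case Map.lookup (rank, ty) seen of
        Just var -> pure var
        Nothing  -> do
          var <- convert rank ty
          modifyIORef' cache (Map.insert (rank, ty) var)
          pure var
  | otherwise = convert rank ty  -- polymorphic: original path, no cache
```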

3. RTS nursery bump -A128m → -A1g (elm.cabal)

Profiling with +RTS -s revealed the dominant cost of a cold build is garbage collection, not actual compilation work:

| Phase | CPU time | Wall time |
| --- | --- | --- |
| MUT (compilation work) | 82.8 s | 14.1 s |
| GC | 0.1 s | 81.3 s |

With -N12 -qg, every GC stops all 12 worker threads. With a 128 MB nursery and ~1 GB/s of allocation, a young-gen collection fires every ~130 ms, and each one scans the live working set (which can grow to ~30 GB during type checking on this kind of project).

Growing the nursery 8x to -A1g reduces collection frequency by ~8x. Wall-clock GC drops from 81 s to a few seconds, and the productivity ratio goes from ~14% to ~50% of elapsed time. The cost is ~3 GB of additional virtual address space pre-reserved (most of which is never actually committed to physical RAM) and ~1 GB more peak resident memory.
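For reference, the change amounts to editing one RTS flag in elm.cabal, along these lines (a sketch; the exact flag list around it may differ):

```
-- hypothetical elm.cabal excerpt: only the -A value changes (was -A128m)
ghc-options:
  -threaded -rtsopts "-with-rtsopts=-N -qg -A1g"
```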

Why wasn't this set already? Historical context: when Elm 0.19 was released, the GHC default was -A1m, dev machines commonly had 8-16 GB, and -A is per-thread (so -N12 -A1g reserves 12 GB of address space at startup, which would have been a deal-breaker then). Lamdera's existing -A128m was already aggressive. With modern dev hardware (16-64 GB common) and the existence of larger projects, the trade-off has shifted.

Trade-offs

  • Memory: Peak resident memory on this benchmark project went from 19.5 GB (baseline) to 17.9 GB (this PR) — slightly lower, because Shape reduces in-memory type duplication and memo avoids redundant Variable allocations. The peak virtual footprint goes up by ~3 GB (extra nursery reservation), but on macOS and Linux that's address space, not committed RAM. Users with <16 GB of RAM can't compile this size of project either way.
  • Backward compat: The new .elmi format is detected via a magic byte, so this compiler still reads old .elmi files correctly. The reverse direction (an old compiler reading new .elmi) goes through the standard .elmi invalidation path: any compiler upgrade typically requires regenerating elm-stuff/ anyway.
  • Smaller projects: -A1g is effectively a no-op on small projects (the nursery is never filled before a build completes). Memo and Shape add overhead so small that it is unmeasurable against even a small project's build time. The changes target large projects without regressing small ones.

Files changed

  • compiler/src/Elm/Interface.hs — Shape dedup implementation (~600 lines added)
  • compiler/src/Type/Solve.hs — srcTypeToVar memoization
  • elm.cabal — RTS opts bump (1 line)

Relation to PR #94

This PR supersedes #94 (currently in draft). #94 used a Map-based intern pool that hashed full Can.Type subtrees per lookup, which proved ~16% slower than baseline despite delivering the same .elmi size reduction. This PR is the working version — same .elmi format, faster intern path, plus the orthogonal solver and RTS wins.

Commit 1/4: shape-based type dedup in .elmi serialization

Each Can.Type subtree is interned into a Shape whose children are already
Word32 pool IDs. Hashing/comparing a Shape is O(small) regardless of
subtree size, so the Map.lookup that gates dedup is no longer expensive.

The walk threads an InternState through the interface traversal; top-level
structures (Annotation/Union/Alias/Binop) produce closed-over Put actions
that reference children by ID, so serialization needs no second lookup phase.

The wire format is identical to the previous Map-based PR (#94).
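The "no second lookup phase" point, in sketch form (hypothetical names; in the PR the Put actions are built during the same walk that populates the pool):

```haskell
-- Sketch: interning hands back both the pool ID and a Put action already
-- closed over that ID, so the later write never consults the pool again.
import Data.Binary.Put (Put, putWord32be)
import Data.Word (Word32)

data Interned = Interned
  { internedId  :: !Word32
  , internedPut :: Put          -- captured the ID at intern time
  }

mkInterned :: Word32 -> Interned
mkInterned i = Interned i (putWord32be i)

-- A top-level annotation serializes as a fixed sequence of child IDs;
-- no Map.lookup happens on the serialization path.
putAnnotation :: [Interned] -> Put
putAnnotation = mapM_ internedPut
```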

Commit 2/4: memoize srcTypeToVar in the solver

When a function signature like FA -> Action references large monomorphic
type aliases (e.g. FrontendModel with 100+ transitive fields), srcTypeToVar
walks the entire expanded type and creates fresh UnionFind variables at every
single call site. On a real project, this resulted in 45M+ srcTypeToVar
calls for only ~4000 unique Can.Type subtrees (~11,000x redundancy).

This commit adds a per-run cache keyed on (rank, Can.Type) that returns the
previously-built Variable when the same monomorphic type is encountered again.
The cache is gated on flexVars being empty to ensure correctness in the
presence of polymorphism: when type variables are in scope, sharing would
incorrectly conflate distinct instantiations.

On a real Lamdera project (391 modules, including dense Effect.Test code):
  - Cold build: 120s -> 98s (-18%, median over 3 runs)
  - typecheck UsersFlows: 125s -> 67s (-46%) on the bottleneck module

Commit 3/4: bump the RTS allocation area to -A1g

Profiling on a large project (391 modules, ~30 GB live working set during
type checking) showed the compiler spent 86% of wall-clock time in GC: at
~1 GB/s of allocation, a young collection fired every ~130 ms, and each
collection scanned a huge live set. The nursery was the bottleneck that
dwarfed every code-level optimization we tried.

Increasing the allocation area to 1 GB reduces collection frequency by
~8x. On the same project, cold full builds drop from 130 s to 51 s
(-61%, median of 3 runs). Peak RSS rises by ~1 GB on top of the
existing ~19 GB working set, which is comfortably below typical
developer-machine RAM budgets.

Smaller projects keep their existing behavior since they never fill the
nursery; the change is effectively no-op for them.

Commit 4/4: code-review pass on the perf changes:

* Introduced `Ext.Common.envFlag :: String -> Bool` to factor out the
  cached env-var presence check that was duplicated inline in
  File.hs (LDEBUG_FILE_TIMING) and Interface.hs (LDEBUG_DEDUP_TIMING);
  a plausible shape is sketched after this list.
* Dropped `serializeNanos` (declared, read by `getDedupTimings`, never
  written). `getDedupTimings` simplified to `IO Double`.
* Dropped `_unusedIndex` workaround and the `Data.Index` import that
  motivated it; nothing in the new code touches `Index.ZeroBased`.
* Merged `putInterfaceDedupRaw` and `putInterfaceDedupTimed` into a
  single `putInterfaceDedup`. Timing branches off into a small
  `recordPoolTime` helper that forces and clocks the pool, removing
  ~25 lines of duplicated serialization tail.
* Renamed `putKeyedPuts` to `putMapPuts` for symmetry with the existing
  `getMapWith` reader.
* Replaced two `if X > 0 then atomicPutStrLn ... else return ()`
  blocks in Reporting.hs with `when`.
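A plausible shape for envFlag, assuming the common unsafePerformIO-cached CAF pattern (hypothetical; the actual helper in Ext.Common may differ):

```haskell
-- Hypothetical sketch of Ext.Common.envFlag: a cached check for whether
-- an environment variable is set, so repeated debug-flag tests don't
-- re-read the environment.
import Data.IORef (IORef, atomicModifyIORef', newIORef, readIORef)
import qualified Data.Map.Strict as Map
import Data.Maybe (isJust)
import System.Environment (lookupEnv)
import System.IO.Unsafe (unsafePerformIO)

{-# NOINLINE envFlagCache #-}
envFlagCache :: IORef (Map.Map String Bool)
envFlagCache = unsafePerformIO (newIORef Map.empty)

{-# NOINLINE envFlag #-}
envFlag :: String -> Bool
envFlag name = unsafePerformIO $ do
  cached <- Map.lookup name <$> readIORef envFlagCache
  case cached of
    Just b  -> pure b
    Nothing -> do
      b <- isJust <$> lookupEnv name    -- read the environment once
      atomicModifyIORef' envFlagCache (\m -> (Map.insert name b m, ()))
      pure b
```

Call sites then reduce to `envFlag "LDEBUG_DEDUP_TIMING"` in place of the inline lookups.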

Also tried factoring the `intern{Types,RecordFields,AliasArgs}` helpers
through `Data.List.mapAccumL`. Measurements showed a +17% CPU regression
on the cold-build workload (the intermediate tuples mapAccumL allocates
add GC pressure on the pool-building hot path), so the direct recursive
helpers are kept and a comment now documents why. Re-benchmarked on the
same project: minimum CPU time matches the pre-cleanup binary (38.5 s vs
39.4 s, within noise), with no wall-clock regression.
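For context, the shape of the rejected refactor versus the kept one (illustrative only; internOne stands in for the real per-element worker):

```haskell
-- Illustration of the trade-off described above; internOne is a stub.
import Data.List (mapAccumL)

internOne :: pool -> ty -> (pool, Int)
internOne = undefined  -- stand-in for the real per-element interner

-- Rejected: generic mapAccumL. Its polymorphic accumulator/result tuples
-- survived as allocations on the hot path in the measurements above.
internListGeneric :: pool -> [ty] -> (pool, [Int])
internListGeneric = mapAccumL internOne

-- Kept: direct recursion specialized to the use site, where GHC's
-- strictness/CPR analysis has a better shot at unboxing the tuples.
internListDirect :: pool -> [ty] -> (pool, [Int])
internListDirect pool [] = (pool, [])
internListDirect pool (t:ts) =
  let (pool1, i)  = internOne pool t
      (pool2, is) = internListDirect pool1 ts
  in (pool2, i : is)
```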
