
perf: 60% faster cold builds via type dedup, solver memoization, and a larger GC nursery #96

Open

CharlonTank wants to merge 4 commits into lamdera:lamdera-next from CharlonTank:perf/shape-and-memo

Conversation

@CharlonTank (Contributor)

TL;DR

Three independent changes that compose. On a project with 391 modules and large model record aliases (Lamdera-style FrontendModel/BackendModel patterns with Effect.Test setups), cold builds drop from 130 s to 51 s (-61% median, best case 48 s), and total .elmi size drops from 419 MB to 14 MB (30x).

| # | Change | File(s) | Effect alone |
| --- | --- | --- | --- |
| 1 | Shape-based type dedup in .elmi serialization | Elm.Interface | -97% .elmi, time-neutral |
| 2 | Memoize srcTypeToVar in the type solver (monomorphic case) | Type.Solve | -26% cold build |
| 3 | Bump RTS allocation area -A128m → -A1g | elm.cabal (1 line) | -10% cold build |
|   | All three combined | | -61% cold build, -97% .elmi |

Benchmarks

Methodology: cleared elm-stuff, then lamdera make tests/Tests.elm. Three cold builds per setup, median reported. Same machine (Apple M2 Max, 32 GB RAM), back-to-back, no other heavy load. Variance was ±10-15 s between runs because the working set is large enough to make the test sensitive to system noise — treat any single sub-5% delta as noise.

| Setup | RTS nursery | Median | Min | vs baseline | .elmi total |
| --- | --- | --- | --- | --- | --- |
| Baseline (lamdera-next) | -A128m | 130.54 s | 119.06 s | | 419 MB |
| Baseline | -A1g | 117.69 s | 110.99 s | -10% | 419 MB |
| Memo only | -A128m | 96.19 s | 92.60 s | -26% | 465 MB |
| Memo only | -A1g | 68.37 s | 64.47 s | -48% | 465 MB |
| Shape only | -A1g | 136.47 s | 133.03 s | +5% | 13.9 MB |
| Shape + Memo | -A128m | 121.17 s | 104.86 s | -7% | 13.9 MB |
| Shape + Memo | -A1g | 51.03 s | 48.27 s | -61% | 13.9 MB |

Findings worth noting:

  • Shape-only with -A1g is slightly slower than baseline + -A1g. Once GC is no longer the bottleneck, the dedup pool's interning overhead (~23 s) costs more than the I/O savings from a 30x smaller .elmi. Shape becomes a time win only when paired with memo, which clears the type-checking bottleneck so interning becomes proportionally cheap.
  • Memo gives the largest single-change time win because the solver dominates a cold build's CPU profile on projects with many large monomorphic types.
  • -A1g is the biggest "effort vs. reward" win: a one-line cabal change worth ~50 s on this kind of project.

Why these changes

1. Shape-based type dedup (Elm.Interface)

When a project shares a large type alias (a backend record with 50+ fields, used as the type of every test-runner setup, every persisted-state shape, every effect-test program type, etc.), Can.Type expands the alias structurally at every reference. lamdera-next writes that subtree verbatim into each .elmi file. With ~10 references per module and ~50 modules using it, the same subtree gets serialized hundreds of times. Result: a single .elmi can balloon to hundreds of MB.

This PR introduces a bottom-up interning pool. Each Can.Type subtree becomes a small Shape value whose children are already Word32 pool IDs (not recursive Can.Types). Hashing/comparing a Shape is O(small) regardless of subtree size, so the Map.lookup that gates dedup is never expensive. Top-level annotations/unions/aliases produce Put actions closed over the IDs, so serialization needs no second lookup phase.
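As a rough illustration of the mechanism, here is a minimal sketch with hypothetical names (Shape, Pool, intern); the real pool in compiler/src/Elm/Interface.hs covers the full Can.Type surface, including records, aliases, and unions:

```haskell
-- Minimal sketch of bottom-up shape interning; names are illustrative,
-- not the PR's actual definitions.
import qualified Data.Map.Strict as Map
import Data.Word (Word32)

-- A Shape mirrors one type constructor, but its children are pool IDs
-- rather than recursive subtrees, so Eq/Ord on it never walks a subtree.
data Shape
  = SLambda !Word32 !Word32   -- argument ID, result ID
  | STuple ![Word32]          -- component IDs
  | SVar !String              -- simplified; real keys carry module info
  deriving (Eq, Ord)

data Pool = Pool
  { poolNext :: !Word32
  , poolIds  :: !(Map.Map Shape Word32)
  }

-- Children are interned before their parent, so the parent's lookup key
-- is already flat: the Map.lookup costs O(log n) small comparisons.
intern :: Shape -> Pool -> (Word32, Pool)
intern shape pool@(Pool next ids) =
  case Map.lookup shape ids of
    Just i  -> (i, pool)  -- duplicate subtree: reuse the existing ID
    Nothing -> (next, Pool (next + 1) (Map.insert shape next ids))
```

A 50-field record referenced hundreds of times then costs one pool entry plus hundreds of Word32 IDs, instead of hundreds of full subtrees.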

The wire format is identical to the previous Map-based dedup attempt. A magic byte (0x00 sentinel) at the start of the Interface payload distinguishes the new format from the old one for backward-compatible reads. Old .elmi files keep working until they're regenerated.
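The read side is then a one-byte dispatch. A self-contained sketch using the binary package (the compiler's own serialization monad differs, and the stand-in Interface type here is hypothetical):

```haskell
-- Sketch of sentinel-based format detection; Interface is a stand-in.
import Data.Binary.Get (Get, getWord8, lookAhead, skip)

data Interface = Legacy | Deduped   -- placeholder for the real payload
  deriving (Show)

getInterfacePayload :: Get Interface
getInterfacePayload = do
  tag <- lookAhead getWord8          -- peek without consuming
  if tag == 0x00
    then skip 1 >> pure Deduped      -- 0x00 sentinel: new pooled format
    else pure Legacy                 -- anything else: old verbatim format
```

Presumably the old format never begins with a 0x00 byte at that offset, which is what makes the sentinel unambiguous.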

This was attempted in PR #94 with a Map-based intern pool that hashed full Can.Type subtrees per lookup. Measurements then showed it actually slowed builds by ~16%: every lookup walked the entire subtree it was looking up. The Shape approach short-circuits this by making the lookup key contain only the IDs of its children.

2. Memoize srcTypeToVar in the solver (Type.Solve)

Profiling on the same project showed the type solver's srcTypeToVar was called ~45M times for ~4k unique types. Every call materializes a fresh UnionFind Variable for each subtree, and recursion repeats this work for every reference to the same type. There's no caching because in the general case the result depends on the surrounding flexVars map (free type variables in the local scope).

But: when flexVars is empty (the surrounding context is monomorphic — which holds for all top-level type annotations and any expression whose type is fully determined), the result of srcTypeToVar t depends only on t itself. We can safely memoize on (rank, t) in this case.

The patch threads an IORef-based SolveCache through the solver. The cache is consulted only when Map.null flexVars holds (a constant-time check). For polymorphic positions the original code path is untouched, so the memoization only engages where it is provably safe.
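In sketch form, with simplified types and illustrative names (the real SolveCache and srcTypeToVar in Type/Solve.hs are richer than this):

```haskell
-- Sketch of the monomorphic-only memo gate.
import Data.IORef (IORef, modifyIORef', readIORef)
import qualified Data.Map.Strict as Map

type SolveCache ty var = IORef (Map.Map (Int, ty) var)

memoizedSrcTypeToVar
  :: Ord ty
  => SolveCache ty var
  -> Map.Map String var     -- flexVars: type variables in local scope
  -> Int                    -- rank
  -> ty
  -> (Int -> ty -> IO var)  -- the original, uncached conversion
  -> IO var
memoizedSrcTypeToVar cache flexVars rank ty convert
  | Map.null flexVars = do  -- monomorphic: result depends on ty alone
      seen <- readIORef cache
      case Map.lookup (rank, ty) seen of
        Just var -> pure var
        Nothing  -> do
          var <- convert rank ty
          modifyIORef' cache (Map.insert (rank, ty) var)
          pure var
  | otherwise = convert rank ty  -- polymorphic: original path, no cache
```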

3. RTS nursery bump -A128m → -A1g (elm.cabal)

Profiling with +RTS -s revealed the dominant cost of a cold build is garbage collection, not actual compilation work:

| Phase | CPU time | Wall time |
| --- | --- | --- |
| MUT (compilation work) | 82.8 s | 14.1 s |
| GC | 0.1 s | 81.3 s |

With -N12 -qg, every GC stops all 12 worker threads. With a 128 MB nursery and ~1 GB/s of allocation, a young-gen collection fires every ~130 ms, and each one scans the live working set (which can grow to ~30 GB during type checking on this kind of project).

Growing the nursery 8x to -A1g reduces collection frequency by ~8x. Wall-clock GC drops from 81 s to a few seconds, and the productivity ratio goes from ~14% to ~50% of elapsed time. The cost is ~3 GB of additional virtual address space pre-reserved (most of which is never actually committed to physical RAM) and ~1 GB more peak resident memory.
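For reference, the change amounts to editing one RTS flag in elm.cabal, along these lines (a sketch; the exact flag list around it may differ):

```
-- hypothetical elm.cabal excerpt: only the -A value changes (was -A128m)
ghc-options:
  -threaded -rtsopts "-with-rtsopts=-N -qg -A1g"
```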

Why wasn't this set already? Historical context: when Elm 0.19 was released, the GHC default was -A1m, dev machines commonly had 8-16 GB, and -A is per-thread (so -N12 -A1g reserves 12 GB of address space at startup, which would have been a deal-breaker then). Lamdera's existing -A128m was already aggressive. With modern dev hardware (16-64 GB common) and the existence of larger projects, the trade-off has shifted.

Trade-offs

  • Memory: Peak resident memory on this benchmark project went from 19.5 GB (baseline) to 17.9 GB (this PR) — slightly lower, because Shape reduces in-memory type duplication and memo avoids redundant Variable allocations. The peak virtual footprint goes up by ~3 GB (extra nursery reservation), but on macOS and Linux that's address space, not committed RAM. Users with <16 GB of RAM can't compile this size of project either way.
  • Backward compat: The new .elmi format is detected via a magic byte, so this compiler still reads old .elmi files correctly. The reverse direction (an old compiler reading new .elmi) goes through the standard .elmi invalidation path: any compiler upgrade typically requires regenerating elm-stuff/ anyway.
  • Smaller projects: -A1g is effectively a no-op on small projects (the nursery is never filled before a build completes). Memo and Shape add overhead so small that it is unmeasurable against even a small project's build time. The changes target large projects without regressing small ones.

Files changed

  • compiler/src/Elm/Interface.hs — Shape dedup implementation (~600 lines added)
  • compiler/src/Type/Solve.hs — srcTypeToVar memoization
  • elm.cabal — RTS opts bump (1 line)

Relation to PR #94

This PR supersedes #94 (currently in draft). #94 used a Map-based intern pool that hashed full Can.Type subtrees per lookup, which proved ~16% slower than baseline despite delivering the same .elmi size reduction. This PR is the working version — same .elmi format, faster intern path, plus the orthogonal solver and RTS wins.

Commit 1/4: shape-based type dedup in .elmi serialization

Each Can.Type subtree is interned into a Shape whose children are already
Word32 pool IDs. Hashing/comparing a Shape is O(small) regardless of
subtree size, so the Map.lookup that gates dedup is no longer expensive.

The walk threads an InternState through the interface traversal; top-level
structures (Annotation/Union/Alias/Binop) produce closed-over Put actions
that reference children by ID, so serialization needs no second lookup phase.

The wire format is identical to the previous Map-based PR (#94).
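The "no second lookup phase" point, in sketch form (hypothetical names; in the PR the Put actions are built during the same walk that populates the pool):

```haskell
-- Sketch: interning hands back both the pool ID and a Put action already
-- closed over that ID, so the later write never consults the pool again.
import Data.Binary.Put (Put, putWord32be)
import Data.Word (Word32)

data Interned = Interned
  { internedId  :: !Word32
  , internedPut :: Put          -- captured the ID at intern time
  }

mkInterned :: Word32 -> Interned
mkInterned i = Interned i (putWord32be i)

-- A top-level annotation serializes as a fixed sequence of child IDs;
-- no Map.lookup happens on the serialization path.
putAnnotation :: [Interned] -> Put
putAnnotation = mapM_ internedPut
```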

Commit 2/4: memoize srcTypeToVar in the solver

When a function signature like FA -> Action references large monomorphic
type aliases (e.g. FrontendModel with 100+ transitive fields), srcTypeToVar
walks the entire expanded type and creates fresh UnionFind variables at every
single call site. On a real project, this resulted in 45M+ srcTypeToVar
calls for only ~4000 unique Can.Type subtrees (~11,000x redundancy).

This commit adds a per-run cache keyed on (rank, Can.Type) that returns the
previously-built Variable when the same monomorphic type is encountered again.
The cache is gated on flexVars being empty to ensure correctness in the
presence of polymorphism: when type variables are in scope, sharing would
incorrectly conflate distinct instantiations.

On a real Lamdera project (391 modules, including dense Effect.Test code):
  - Cold build: 120s -> 98s (-18%, median over 3 runs)
  - typecheck UsersFlows: 125s -> 67s (-46%) on the bottleneck module

Commit 3/4: bump the RTS allocation area to -A1g

Profiling on a large project (391 modules, ~30 GB live working set during
type checking) showed the compiler spent 86% of wall-clock time in GC: at
~1 GB/s of allocation, a young collection fired every ~130 ms, and each
collection scanned a huge live set. The nursery was the bottleneck that
dwarfed every code-level optimization we tried.

Increasing the allocation area to 1 GB reduces collection frequency by
~8x. On the same project, cold full builds drop from 130 s to 51 s
(-61%, median of 3 runs). Peak RSS rises by ~1 GB on top of the
existing ~19 GB working set, which is comfortably below typical
developer-machine RAM budgets.

Smaller projects keep their existing behavior since they never fill the
nursery; the change is effectively no-op for them.

Commit 4/4: code-review pass on the perf changes:

* Introduced `Ext.Common.envFlag :: String -> Bool` to factor out the
  cached env-var presence check that was duplicated inline in
  File.hs (LDEBUG_FILE_TIMING) and Interface.hs (LDEBUG_DEDUP_TIMING);
  a plausible shape is sketched after this list.
* Dropped `serializeNanos` (declared, read by `getDedupTimings`, never
  written). `getDedupTimings` simplified to `IO Double`.
* Dropped `_unusedIndex` workaround and the `Data.Index` import that
  motivated it; nothing in the new code touches `Index.ZeroBased`.
* Merged `putInterfaceDedupRaw` and `putInterfaceDedupTimed` into a
  single `putInterfaceDedup`. Timing branches off into a small
  `recordPoolTime` helper that forces and clocks the pool, removing
  ~25 lines of duplicated serialization tail.
* Renamed `putKeyedPuts` to `putMapPuts` for symmetry with the existing
  `getMapWith` reader.
* Replaced two `if X > 0 then atomicPutStrLn ... else return ()`
  blocks in Reporting.hs with `when`.
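A plausible shape for envFlag, assuming the common unsafePerformIO-cached CAF pattern (hypothetical; the actual helper in Ext.Common may differ):

```haskell
-- Hypothetical sketch of Ext.Common.envFlag: a cached check for whether
-- an environment variable is set, so repeated debug-flag tests don't
-- re-read the environment.
import Data.IORef (IORef, atomicModifyIORef', newIORef, readIORef)
import qualified Data.Map.Strict as Map
import Data.Maybe (isJust)
import System.Environment (lookupEnv)
import System.IO.Unsafe (unsafePerformIO)

{-# NOINLINE envFlagCache #-}
envFlagCache :: IORef (Map.Map String Bool)
envFlagCache = unsafePerformIO (newIORef Map.empty)

{-# NOINLINE envFlag #-}
envFlag :: String -> Bool
envFlag name = unsafePerformIO $ do
  cached <- Map.lookup name <$> readIORef envFlagCache
  case cached of
    Just b  -> pure b
    Nothing -> do
      b <- isJust <$> lookupEnv name    -- read the environment once
      atomicModifyIORef' envFlagCache (\m -> (Map.insert name b m, ()))
      pure b
```

Call sites then reduce to `envFlag "LDEBUG_DEDUP_TIMING"` in place of the inline lookups.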

Also tried factoring the `intern{Types,RecordFields,AliasArgs}` helpers
through `Data.List.mapAccumL`. Measurements showed a +17% CPU regression
on the cold-build workload (the intermediate tuples mapAccumL allocates
add GC pressure on the pool-building hot path), so the direct recursive
helpers are kept and a comment now documents why. Re-benchmarked on the
same project: minimum CPU time matches the pre-cleanup binary (38.5 s vs
39.4 s, within noise), with no wall-clock regression.
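For context, the shape of the rejected refactor versus the kept one (illustrative only; internOne stands in for the real per-element worker):

```haskell
-- Illustration of the trade-off described above; internOne is a stub.
import Data.List (mapAccumL)

internOne :: pool -> ty -> (pool, Int)
internOne = undefined  -- stand-in for the real per-element interner

-- Rejected: generic mapAccumL. Its polymorphic accumulator/result tuples
-- survived as allocations on the hot path in the measurements above.
internListGeneric :: pool -> [ty] -> (pool, [Int])
internListGeneric = mapAccumL internOne

-- Kept: direct recursion specialized to the use site, where GHC's
-- strictness/CPR analysis has a better shot at unboxing the tuples.
internListDirect :: pool -> [ty] -> (pool, [Int])
internListDirect pool [] = (pool, [])
internListDirect pool (t:ts) =
  let (pool1, i)  = internOne pool t
      (pool2, is) = internListDirect pool1 ts
  in (pool2, i : is)
```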
