Bump sparkey to 3.7.0, add heap budget support for side inputs by spkrka · Pull Request #5901 · spotify/scio

spkrka · 2026-03-24T19:29:28Z

Summary

Upgrade sparkey from 3.5.1 to 3.7.0 (adds heap-backed reader, builder API)
Add SparkeyReadConfig with heapBudgetBytes parameter controlling how much sparkey data is loaded into JVM heap (byte[]) vs memory-mapped files
Shards are sorted largest-first and greedily allocated to heap until budget is exhausted; remaining shards use mmap (current default behavior)
Default heapBudgetBytes=0 preserves existing all-mmap behavior

Usage

val config = SparkeyReadConfig(heapBudgetBytes = 100L * 1024 * 1024 * 1024)
sc.sparkeySideInput(path, config)

Motivation

Large sparkey side inputs using mmap can cause major GC and page fault overhead on Dataflow workers. Loading sparkey data into JVM heap instead eliminates page faults and allows the JVM to manage memory directly. The heap budget lets users control the tradeoff per side input.

Test plan

Existing sparkey tests pass
Added tests for heap budget allocation (sharded and unsharded)
MiMa binary compatibility check passes
scalafmt passes

🤖 Generated with Claude Code

codecov · 2026-03-25T08:12:27Z

Codecov Report

❌ Patch coverage is 84.81013% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.67%. Comparing base (d59fbba) to head (727ae72).

Files with missing lines	Patch %	Lines
...scala/com/spotify/scio/extra/sparkey/package.scala	71.42%	10 Missing ⚠️
...la/com/spotify/scio/extra/sparkey/SparkeyUri.scala	97.61%	1 Missing ⚠️
...extra/sparkey/instances/ShardedSparkeyReader.scala	0.00%	1 Missing ⚠️

Additional details and impacted files

@@            Coverage Diff             @@
##             main    #5901      +/-   ##
==========================================
+ Coverage   61.54%   61.67%   +0.12%     
==========================================
  Files         317      318       +1     
  Lines       11653    11715      +62     
  Branches      822      844      +22     
==========================================
+ Hits         7172     7225      +53     
- Misses       4481     4490       +9

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:

❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

kellen · 2026-03-25T15:24:02Z

As written, this api only helps when loading from an externally-produced sparkey, but we also have ones that can be produced in the same pipeline as they are read e.g. via the largeHash* methods

spkrka · 2026-03-25T16:55:39Z

Good point, SparkeySideInput is not the only sparkey flow in the codebase - we also have LargeMapSideInput and LargeSetSideInput. I don't mind updating the PR to add support for those two if needed - this was the minimal to try to investigate/unblock a specific issue. I think in any case I don't really want anything merged until we have explored it a bit more.

Sorry about converting to PR form - draft form was better, I was hoping it would trigger some type of snapshot build I could use! I'll switch back.

On whether locally-produced sparkeys benefit from prefetching: I think it depends on the data flow. If the sparkey files go through GCS (write → download on each worker), the page cache will be cold and load() helps. If there's a path where a worker produces and consumes locally, it might already be warm. Either way, a load() call is cheap when pages are already resident, and since this happens on a background thread the runtime would not be affected.

- Upgrade sparkey from 3.5.1 to 3.7.0 (adds heap-backed reader, builder API) - Add SparkeyReadConfig with heapBudgetBytes parameter controlling how much sparkey data is loaded into JVM heap (byte[]) vs memory-mapped files - Shards are sorted largest-first and greedily allocated to heap until budget is exhausted; remaining shards use mmap (current default behavior) - Default heapBudgetBytes=0 preserves existing all-mmap behavior Usage: val config = SparkeyReadConfig(heapBudgetBytes = 100L * 1024 * 1024 * 1024) sc.sparkeySideInput(path, config) Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

SideInput.get() is called per DoFn clone (one per vCPU). With mmap this was fine (OS deduplicates pages), but with heap-backed readers each clone allocated its own byte[] copies, causing OOM on high-vCPU machines. Use a static ConcurrentHashMap[path, CompletableFuture[SparkeyReader]] so all DoFn instances on the same worker share a single reader. The future is created eagerly via supplyAsync so computeIfAbsent returns quickly without holding the bucket lock during slow loading. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>

spkrka · 2026-03-27T14:05:47Z

Closing, made a cleaner one here: #5903

spkrka commented Mar 24, 2026

View reviewed changes

Comment thread scio-extra/src/main/scala/com/spotify/scio/extra/sparkey/package.scala Outdated

spkrka force-pushed the krka/sparkey-load branch 2 times, most recently from 1fc0253 to f3cd63f Compare March 25, 2026 11:19

spkrka changed the title ~~Bump sparkey to 3.6.0, add LoadMode for page cache prefetch~~ Bump sparkey to 3.6.1, add LoadMode for page cache prefetch Mar 25, 2026

spkrka force-pushed the krka/sparkey-load branch 2 times, most recently from 244d4fd to 57dba3f Compare March 25, 2026 13:26

spkrka marked this pull request as ready for review March 25, 2026 14:19

spkrka marked this pull request as draft March 25, 2026 16:55

spkrka force-pushed the krka/sparkey-load branch from b0f141e to 8faf9b3 Compare March 26, 2026 09:55

spkrka changed the title ~~Bump sparkey to 3.6.1, add LoadMode for page cache prefetch~~ Bump sparkey to 3.7.0, add heap budget support for side inputs Mar 26, 2026

spkrka force-pushed the krka/sparkey-load branch from 8faf9b3 to fbe4642 Compare March 26, 2026 11:51

spkrka closed this Mar 27, 2026

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

Bump sparkey to 3.7.0, add heap budget support for side inputs#5901

Bump sparkey to 3.7.0, add heap budget support for side inputs#5901
spkrka wants to merge 2 commits intospotify:mainfrom
spkrka:krka/sparkey-load

spkrka commented Mar 24, 2026 •

edited

Loading

Uh oh!

Uh oh!

codecov bot commented Mar 25, 2026 •

edited

Loading

Uh oh!

kellen commented Mar 25, 2026

Uh oh!

spkrka commented Mar 25, 2026

Uh oh!

spkrka commented Mar 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

Conversation

spkrka commented Mar 24, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Summary

Usage

Motivation

Test plan

Uh oh!

Uh oh!

codecov bot commented Mar 25, 2026 • edited Loading Uh oh! There was an error while loading. Please reload this page.

Uh oh!

Codecov Report

Uh oh!

kellen commented Mar 25, 2026

Uh oh!

spkrka commented Mar 25, 2026

Uh oh!

spkrka commented Mar 27, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

2 participants

spkrka commented Mar 24, 2026 •

edited

Loading

codecov bot commented Mar 25, 2026 •

edited

Loading