Skip to content

Bump sparkey to 3.7.0, add heap budget support for side inputs#5901

Closed
spkrka wants to merge 2 commits intospotify:mainfrom
spkrka:krka/sparkey-load
Closed

Bump sparkey to 3.7.0, add heap budget support for side inputs#5901
spkrka wants to merge 2 commits intospotify:mainfrom
spkrka:krka/sparkey-load

Conversation

@spkrka
Copy link
Copy Markdown
Member

@spkrka spkrka commented Mar 24, 2026

Summary

  • Upgrade sparkey from 3.5.1 to 3.7.0 (adds heap-backed reader, builder API)
  • Add SparkeyReadConfig with heapBudgetBytes parameter controlling how much sparkey data is loaded into JVM heap (byte[]) vs memory-mapped files
  • Shards are sorted largest-first and greedily allocated to heap until budget is exhausted; remaining shards use mmap (current default behavior)
  • Default heapBudgetBytes=0 preserves existing all-mmap behavior

Usage

val config = SparkeyReadConfig(heapBudgetBytes = 100L * 1024 * 1024 * 1024)
sc.sparkeySideInput(path, config)

Motivation

Large sparkey side inputs using mmap can cause major GC and page fault overhead on Dataflow workers. Loading sparkey data into JVM heap instead eliminates page faults and allows the JVM to manage memory directly. The heap budget lets users control the tradeoff per side input.

Test plan

  • Existing sparkey tests pass
  • Added tests for heap budget allocation (sharded and unsharded)
  • MiMa binary compatibility check passes
  • scalafmt passes

🤖 Generated with Claude Code

Comment thread scio-extra/src/main/scala/com/spotify/scio/extra/sparkey/package.scala Outdated
@codecov
Copy link
Copy Markdown

codecov bot commented Mar 25, 2026

Codecov Report

❌ Patch coverage is 84.81013% with 12 lines in your changes missing coverage. Please review.
✅ Project coverage is 61.67%. Comparing base (d59fbba) to head (727ae72).

Files with missing lines Patch % Lines
...scala/com/spotify/scio/extra/sparkey/package.scala 71.42% 10 Missing ⚠️
...la/com/spotify/scio/extra/sparkey/SparkeyUri.scala 97.61% 1 Missing ⚠️
...extra/sparkey/instances/ShardedSparkeyReader.scala 0.00% 1 Missing ⚠️
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5901      +/-   ##
==========================================
+ Coverage   61.54%   61.67%   +0.12%     
==========================================
  Files         317      318       +1     
  Lines       11653    11715      +62     
  Branches      822      844      +22     
==========================================
+ Hits         7172     7225      +53     
- Misses       4481     4490       +9     

☔ View full report in Codecov by Sentry.
📢 Have feedback on the report? Share it here.

🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

@spkrka spkrka force-pushed the krka/sparkey-load branch 2 times, most recently from 1fc0253 to f3cd63f Compare March 25, 2026 11:19
@spkrka spkrka changed the title Bump sparkey to 3.6.0, add LoadMode for page cache prefetch Bump sparkey to 3.6.1, add LoadMode for page cache prefetch Mar 25, 2026
@spkrka spkrka force-pushed the krka/sparkey-load branch 2 times, most recently from 244d4fd to 57dba3f Compare March 25, 2026 13:26
@spkrka spkrka marked this pull request as ready for review March 25, 2026 14:19
@kellen
Copy link
Copy Markdown
Contributor

kellen commented Mar 25, 2026

As written, this api only helps when loading from an externally-produced sparkey, but we also have ones that can be produced in the same pipeline as they are read e.g. via the largeHash* methods

@spkrka
Copy link
Copy Markdown
Member Author

spkrka commented Mar 25, 2026

Good point, SparkeySideInput is not the only sparkey flow in the codebase - we also have LargeMapSideInput and LargeSetSideInput. I don't mind updating the PR to add support for those two if needed - this was the minimal to try to investigate/unblock a specific issue. I think in any case I don't really want anything merged until we have explored it a bit more.

Sorry about converting to PR form - draft form was better, I was hoping it would trigger some type of snapshot build I could use! I'll switch back.

On whether locally-produced sparkeys benefit from prefetching: I think it depends on the data flow. If the sparkey files go through GCS (write → download on each worker), the page cache will be cold and load() helps. If there's a path where a worker produces and consumes locally, it might already be warm. Either way, a load() call is cheap when pages are already resident, and since this happens on a background thread the runtime would not be affected.

@spkrka spkrka marked this pull request as draft March 25, 2026 16:55
@spkrka spkrka force-pushed the krka/sparkey-load branch from b0f141e to 8faf9b3 Compare March 26, 2026 09:55
@spkrka spkrka changed the title Bump sparkey to 3.6.1, add LoadMode for page cache prefetch Bump sparkey to 3.7.0, add heap budget support for side inputs Mar 26, 2026
- Upgrade sparkey from 3.5.1 to 3.7.0 (adds heap-backed reader, builder API)
- Add SparkeyReadConfig with heapBudgetBytes parameter controlling how much
  sparkey data is loaded into JVM heap (byte[]) vs memory-mapped files
- Shards are sorted largest-first and greedily allocated to heap until budget
  is exhausted; remaining shards use mmap (current default behavior)
- Default heapBudgetBytes=0 preserves existing all-mmap behavior

Usage:
  val config = SparkeyReadConfig(heapBudgetBytes = 100L * 1024 * 1024 * 1024)
  sc.sparkeySideInput(path, config)

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@spkrka spkrka force-pushed the krka/sparkey-load branch from 8faf9b3 to fbe4642 Compare March 26, 2026 11:51
SideInput.get() is called per DoFn clone (one per vCPU). With mmap this
was fine (OS deduplicates pages), but with heap-backed readers each clone
allocated its own byte[] copies, causing OOM on high-vCPU machines.

Use a static ConcurrentHashMap[path, CompletableFuture[SparkeyReader]]
so all DoFn instances on the same worker share a single reader. The
future is created eagerly via supplyAsync so computeIfAbsent returns
quickly without holding the bucket lock during slow loading.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
@spkrka
Copy link
Copy Markdown
Member Author

spkrka commented Mar 27, 2026

Closing, made a cleaner one here: #5903

@spkrka spkrka closed this Mar 27, 2026
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants