feat(sparkey): Deduplicate SparkeyReader across DoFn clones #5918

spkrka wants to merge 1 commit into spotify:main
Conversation
```scala
.computeIfAbsent(
  uri.path,
  _ =>
    CompletableFuture.supplyAsync { () =>
```
Note: the future is here to support loading multiple readers in parallel if they are expensive to create. Currently that's not the case, but it might be if we later add loading/locking/load-to-heap.
🤔 But is the CompletableFuture relevant if its call is always immediately followed by .join()? Removing the CompletableFuture might be simpler. If keeping it, consider leaving a comment explaining its motivation.
Hm, perhaps you're right and I reasoned incorrectly about this. My hope was that multiple threads loading the same side inputs would still allow loading multiple things in parallel, but if the file ordering is stable, that doesn't actually help. I will think more about this and potentially simplify.
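The trade-off in this thread can be made concrete. Below is a minimal sketch in plain Java (the cache types are `java.util.concurrent`; `expensiveLoad`, the counter, and the paths are illustrative stand-ins, not Scio code). The key point: `computeIfAbsent` only installs a cheap `CompletableFuture`, so the map's internal lock is never held during an expensive load, loads for *different* keys can overlap, and every caller of the *same* key shares one load even though each ends with `join()`.

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

class ReaderCacheSketch {
    // One future per path; the mapping function stays cheap, so computeIfAbsent
    // never blocks other keys while a load is in flight.
    static final ConcurrentHashMap<String, CompletableFuture<String>> cache =
        new ConcurrentHashMap<>();
    static final AtomicInteger loads = new AtomicInteger();

    // Stand-in for the expensive uri.getReader(rfu) call.
    static String expensiveLoad(String path) {
        loads.incrementAndGet();
        return "reader:" + path;
    }

    static String getOrLoad(String path) {
        return cache
            .computeIfAbsent(path, p -> CompletableFuture.supplyAsync(() -> expensiveLoad(p)))
            .join(); // every caller waits here, but only the first triggers the load
    }

    public static void main(String[] args) throws Exception {
        ExecutorService pool = Executors.newFixedThreadPool(8);
        // Eight "DoFn clones" racing on the same path, plus one other path.
        for (int i = 0; i < 8; i++) {
            pool.submit(() -> getOrLoad("gs://bucket/shard-0"));
        }
        pool.submit(() -> getOrLoad("gs://bucket/shard-1"));
        pool.shutdown();
        pool.awaitTermination(10, TimeUnit.SECONDS);
        System.out.println("loads=" + loads.get()); // one load per distinct path
    }
}
```

If the future is dropped in favor of doing the load inside the mapping function itself, the dedup guarantee stays (the function runs at most once per key), but concurrent loads of different keys that hash to the same bin can serialize, which is the scenario the future was meant to avoid.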
Codecov Report

✅ All modified and coverable lines are covered by tests.

```
@@            Coverage Diff             @@
##             main    #5918      +/-   ##
==========================================
+ Coverage   61.71%   61.72%   +0.01%
==========================================
  Files         317      317
  Lines       11654    11660       +6
  Branches      815      827      +12
==========================================
+ Hits         7192     7197       +5
- Misses       4462     4463       +1
==========================================
```

View full report in Codecov by Sentry.
Beam creates one DoFn clone per vCPU thread (e.g. 80 on n4-standard-80). Previously, each clone independently called uri.getReader(), which downloads files from GCS and opens new SparkeyReader instances, duplicating work and wasting file descriptors and mmap regions.

This adds a static ConcurrentHashMap<String, CompletableFuture<SparkeyReader>> cache so the first thread loads the reader and all others wait on the same future. The reader is reused for the lifetime of the JVM, which is safe for Dataflow batch (one pipeline per JVM). The cache is used by SparkeySideInput, LargeMapSideInput, and LargeSetSideInput. No API changes, no new dependencies.

Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Force-pushed from efdd017 to 2ce6af5.
```scala
val reader = uri.getReader(rfu)
checkMemory(reader)
reader
```

Suggested change:

```scala
checkMemory(uri.getReader(rfu))
```
Can this be simplified?
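The suggested one-liner is only equivalent if `checkMemory` returns the reader it checks. A minimal sketch of that fluent style, in plain Java with hypothetical names (`Reader`, `indexBytes`, and the threshold are stand-ins, not Scio's actual types or signatures):

```java
class FluentCheckSketch {
    // Hypothetical reader; Scio's real type is com.spotify.sparkey.SparkeyReader.
    static class Reader {
        final long indexBytes;
        Reader(long indexBytes) { this.indexBytes = indexBytes; }
    }

    // Returning the argument lets call sites collapse three lines into one,
    // as in the suggested `checkMemory(uri.getReader(rfu))`.
    static Reader checkMemory(Reader r) {
        if (r.indexBytes > (1L << 31)) {
            System.err.println("warning: index larger than 2 GiB"); // illustrative check
        }
        return r;
    }

    // Stand-in for uri.getReader(rfu).
    static Reader getReader() { return new Reader(1024); }

    public static void main(String[] args) {
        Reader reader = checkMemory(getReader()); // check and assign in one expression
        System.out.println(reader.indexBytes);
    }
}
```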
Summary

Beam creates one DoFn clone per vCPU thread (e.g. 80 on an n4-standard-80 worker). Previously, each clone independently called `uri.getReader()`, which downloads files from GCS and opens new `SparkeyReader` instances, duplicating file I/O, file descriptors, and mmap regions.

This adds a `CompletableFuture`-based reader cache so the first thread loads the reader and all others wait on the same future. The reader is reused for the JVM lifetime, which is safe for Dataflow batch (one pipeline per JVM).

Changes

- `SparkeySideInput`, `LargeMapSideInput`, and `LargeSetSideInput` now go through `SparkeySideInput.getOrCreateReader()` instead of calling `uri.getReader()` directly
- The cache is a `ConcurrentHashMap[String, CompletableFuture[SparkeyReader]]` keyed by URI path
- `checkMemory` is preserved but called once per reader (inside the cache loader) instead of on every `get()` call

Why this is safe

- The `SparkeyReader` from `Sparkey.open()` is a `PooledSparkeyReader`, which is already thread-safe
- The static cache follows the same pattern as `checkMemory`, which also uses a static reference

What this does NOT include
This is intentionally minimal. The heap-backed reader fallback and memory-aware shard placement from #5903 / #5904 are separate concerns and will be proposed separately (likely as opt-in features).
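The pieces described in this PR can be sketched end to end. The Java below uses stand-in types (`Reader`, `PooledReader`, `open`) rather than Scio's actual `SparkeyReader`/`SparkeyUri` API, and the method name `getOrCreateReader` is taken from the description above; the real implementation differs in types and signatures. It shows the two properties the changes claim: the cache is keyed by URI path, and `checkMemory` runs once per reader inside the loader rather than on every call.

```java
import java.util.concurrent.*;
import java.util.concurrent.atomic.AtomicInteger;

class GetOrCreateReaderSketch {
    interface Reader {}                            // stand-in for SparkeyReader
    static class PooledReader implements Reader {} // thread-safe, like PooledSparkeyReader

    static final ConcurrentHashMap<String, CompletableFuture<Reader>> readers =
        new ConcurrentHashMap<>();
    static final AtomicInteger checks = new AtomicInteger();

    // Stand-in for uri.getReader(rfu): download from GCS and open the reader.
    static Reader open(String path) { return new PooledReader(); }

    // Simplified: the real checkMemory inspects sizes and warns; here we just count calls.
    static Reader checkMemory(Reader r) {
        checks.incrementAndGet();
        return r;
    }

    // Keyed by URI path; checkMemory runs once, inside the cache loader.
    static Reader getOrCreateReader(String path) {
        return readers
            .computeIfAbsent(path, p -> CompletableFuture.supplyAsync(() -> checkMemory(open(p))))
            .join();
    }

    public static void main(String[] args) {
        Reader a = getOrCreateReader("gs://bucket/part-00000");
        Reader b = getOrCreateReader("gs://bucket/part-00000");
        System.out.println(a == b);            // true: same reader reused across callers
        System.out.println(checks.get());      // 1: checkMemory ran once, not per get
    }
}
```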
Test plan
🤖 Generated with Claude Code