feat(sparkey): Deduplicate SparkeyReader instances across DoFn clones
Beam creates one DoFn clone per vCPU thread (e.g. 80 on n4-standard-80).
Previously, each clone independently called uri.getReader(), which
downloads files from GCS and opens a new SparkeyReader instance. This
duplicated work and wasted file descriptors and mmap regions.
This adds a static ConcurrentHashMap<String, CompletableFuture<SparkeyReader>>
cache so the first thread loads the reader and all others wait on the same
future. The reader is reused for the lifetime of the JVM, which is safe for
Dataflow batch (one pipeline per JVM).
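
A minimal sketch of the load-once pattern described above, not the code in
this commit. The class name SharedReaderCache, the generic type R (standing
in for SparkeyReader), and the loader parameter are hypothetical. The map is
an instance field here so the sketch can be exercised in isolation, whereas
the commit describes a static one. Evicting a failed future so a later call
can retry is also an assumption.

```java
import java.util.concurrent.CompletableFuture;
import java.util.concurrent.ConcurrentHashMap;
import java.util.function.Function;

// Hypothetical illustration: R stands in for SparkeyReader.
final class SharedReaderCache<R> {
    private final ConcurrentHashMap<String, CompletableFuture<R>> cache =
        new ConcurrentHashMap<>();

    // Returns the reader for `uri`, loading it at most once per key.
    R get(String uri, Function<String, R> loader) {
        CompletableFuture<R> fresh = new CompletableFuture<>();
        CompletableFuture<R> existing = cache.putIfAbsent(uri, fresh);
        if (existing != null) {
            // Another thread owns the load; block until its future completes.
            return existing.join();
        }
        try {
            // This thread won the race, so it performs the expensive load
            // outside of any map lock.
            fresh.complete(loader.apply(uri));
        } catch (RuntimeException | Error e) {
            // Assumption: drop the failed entry so a later call can retry.
            cache.remove(uri, fresh);
            fresh.completeExceptionally(e);
            throw e;
        }
        return fresh.join();
    }
}
```

Registering an incomplete future with putIfAbsent and completing it
afterwards keeps the slow download out of the map's internal locking, so
threads asking for other keys are not blocked while one reader loads.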
The cache is used by SparkeySideInput, LargeMapSideInput, and
LargeSetSideInput. No API changes, no new dependencies.
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>