beta(0.4.0): 45 new functions + PMTiles DataSource#33
Draft
mjohns-databricks wants to merge 443 commits into
Draft
beta(0.4.0): 45 new functions + PMTiles DataSource#33mjohns-databricks wants to merge 443 commits into
mjohns-databricks wants to merge 443 commits into
Conversation
added 30 commits
June 9, 2026 08:21
The heavy (JVM) bench could not run against a UC Volume corpus: the JVM
can't do file I/O on the /Volumes object-storage FUSE mount (java.io and
java.nio open() return EPERM; Hadoop FileSystem returns UC_VOLUMES_NOT_
SUPPORTED) and GDAL can't random-access it. Only Spark is UC-aware.
- BenchIO.readBytes reads corpus.json + synth/bytes tiles via
spark.read.format("binaryFile") (the same connector the row-tile reader
uses); row tiles already read via binaryFile.
- HeavyRunner opens GDAL datasets by staging the Volume tile to a local
temp first (withVolumeDataset), since GDAL needs random access.
- JsonlAppender fsync is best-effort (FUSE can reject fsync); the heavy
shard is written to local disk, Python sinks it to Delta.
- Heavy spark-path rows recorded srid=0; record the row tiles' real srid
so they match the lightweight rows in compare_cells (was compared=0).
Validated on cluster: heavy 104/104 ok; --modes both -> compared=83.
The epilogue comparison (--modes both) failed on a cluster: spec. registered_rst and compare.pyrx_implemented resolved their inputs by walking up to repo paths (docs/tests-function-info/registered_functions.txt and pyrx/functions.py), which don't exist above the wheel in site-packages. - registered_rst falls back to importlib.resources when the repo path is absent; ship registered_functions.txt inside the bench package (package-data) so the fallback has a file to read. - pyrx_implemented falls back to the imported pyrx functions module's own __file__ when the repo path is absent. - check-binding-parity.py guards the packaged copy against drift from the canonical docs/tests-function-info/registered_functions.txt. Validated on cluster: --modes both wrote summary.md + compared=83 (SUCCESS).
Make a cluster run observable in real time, let light + heavy of one benchmark coexist in bench_results, and render results in the run UI. - Heavy streams live: HeavyBenchMain runs in a thread while Python tails its per-row-fsync'd local shard and feeds each new row to the same Delta sink. Lightweight already streams per function. - Table grain (run_id, api): --truncate-results is a scoped DELETE (run_id + the tier(s) this invocation writes) so the paired tier survives; --truncate-all empties the whole table (only this run remains). - Host poll filters the live count by (run_id, api); line-buffer stdout so the URL + progress stream live; 5s cadence; print queryable FQN + run URL right after submit. - Run UI is self-documenting: as each tier finishes, display its raw bench_results rows then its summary (light, the new heavyweight.summary.md, and the heavy-vs-light comparison when both ran). Validated: heavy streamed live; cluster-both-* held both tiers; truncate modes + api-filtered counts behaved; --modes both -> compared=83 SUCCESS.
GDAL's FillNodata writes its X/Y index scratch as a temp GTiff via CPLGenerateTempFilename. On some executors (Databricks clusters) that failed with "Could not create Y index work file. Check driver capabilities.", erroring the function. Setting TEMP_FILE_DRIVER=MEM keeps the scratch in memory (matching the /vsimem output) so it no longer depends on a writable temp dir. Co-authored-by: Isaac
Evolves the cluster heavy-vs-light raster benchmark for a clean, observable 1000-tile run and correct spark-path metrics. runner.py: - --explain-only mode: build each spark-path fn's DataFrame and print its physical plan (tee to a Volume dir for harvest) without timing. Used to diagnose the *_agg staging. - Aggregator parallelism fix: the groupBy's mandatory hash exchange supersedes (elides) any pre-groupBy repartition, so partition_size was a no-op for *_agg. Drive the agg stage width via spark.sql.shuffle.partitions = _parts instead; drop the dead repartition. (ArrowAggregatePython is non-partial -- documented.) - spark-path defaults to 1 warm-up + 1 measured iter (the N-tile sweep is the averaging); per_tile_avg_ms = iter_median_ms / num_tiles; binaryFile executor read + UDF (no driver LocalRelation crash); tile_array fns pinned to 1 tile/partition (OOM fix). results/compare.py: iter_-prefixed timing columns delineated from per_tile_avg_ms; rendered Markdown summaries with hoisted constants, tile-scale line, and corpus-pool warnings. cluster.py / launcher: one cell per (tier x mode) section for live, preserved output; function-granular --resume + --no-fix-errors; run_event_num as a monotonic progress counter; --override-partition-size; AQE off so per-fn partition counts are respected; 6h timeout. Scala bench (BenchRow/HeavyRunner/HeavyBenchMain): wall-clock fields + Volume binaryFile reads so the heavy tier streams to the same sink. Co-authored-by: Isaac
HeavyRunner coalesce(1)'d both the spark-path tile DataFrame and the aggregator's scaled groupBy, citing a local[N] GDAL driver-registration race. But the coalesce was unconditional, so on the cluster it serialized all 1000 tiles / aggregations onto ONE task -- the heavy spark-path grind that never completed. The race is already handled: GDALManager.init is synchronized + idempotent (registers once per JVM), which is why the product runs heavy rst_* functions on multi-core executors. New GdalParallelSafetyTest confirms concurrent rst_fromcontent + column eval (and a parallel groupBy().agg) in local[4] do not crash. So: repartition the spark-path tiles for parallelism (oversubscribe slots ~4x, capped at row count) and skip the limit(n) funnel when n==maxRows; and run the aggregator as a bounded parallel fan-out -- hash-partition the keys to AggKeysPerTask(=2) groups/task + broadcast the small group + drop the coalesce, so large-output aggregators (merge's union mosaic) hold few big outputs per task. shuffle.partitions set per-N so the groupBy exchange elides. Validated locally (5/5 heavy suites). Co-authored-by: Isaac
combineavg_tiles stacked all N group tiles as one float64 (N,bands,h,w) array plus a full-shape valid mask plus np.where copies -- ~4.3x the raw group bytes. A 16-tile group of 1024x1024x4 float32 peaked at ~1.17 GB, which OOMs a Spark Python worker. Rewrite to accumulate the running sum + count ONE tile at a time (open, add, close) so peak memory is ~O(one tile + 2 accumulators), independent of N: measured 1174 -> 172 MB (6.8x less) and ~2x faster, with byte-identical output (verified across f32 / NoData / multiband / int16 / big-tile; core_agg + fingerprint tests pass). Also fix _open_all: a corrupt tile mid-group threw before the caller's try/finally was entered, leaking every dataset/MemoryFile opened so far. Close them on partial-open failure before re-raising. Co-authored-by: Isaac
The lightweight *_agg perf path did group_df.crossJoin(broadcast(keys)), which broadcast the KEYS and iterated the (tiny, ~2-tile) group -- so the entire xN replication landed in group_size busy tasks while the rest of the slots idled (the 38/40-2-running straggler). Flip it: hash-partition the N keys into the fan-out and broadcast the small group, so replication runs parts-wide and the groupBy hash exchange elides (one stage). Bound the fan-out to AggKeysPerTask(=2) groups/task so large-output aggregators (merge's union mosaic) don't hold many big outputs per task and OOM the worker -- the earlier ~6/task fan-out lost rst_merge_agg to ExecutorLost. Also: the regular spark-path per-fn limit(n) injected a SinglePartition GlobalLimit funnel before the repartition; df_all is already capped at the source, so use it directly when n==maxRows (no funnel). And set per_tile_avg_ms on aggregator rows (only the regular rows had it -> 0s). Co-authored-by: Isaac
--redo-functions <csv>: force re-run an explicit fn list for the selected (run_id, api, mode) by purging their rows first, independent of --set so a resume run can also re-measure a named subset (e.g. the aggregators) while the rest completes. Targeted, unlike --resume (only MISSING) / --truncate (broad). Mutually exclusive with the truncates. Resume mode-scoping: run_light/run_heavy now scope the to-do to fns that actually HAVE the mode -- the spark-path-only *_agg aggregators no longer show as 7 'missing' (and dispatch for zero rows) on every pure-core resume. Collapse Cmd 3 (the big setup cell) by default so the run view leads with the per-section result cells, not the wall of setup code. Co-authored-by: Isaac
The combineavg_tiles streaming rewrite and the _open_all partial-open leak fix lacked dedicated coverage. Add: a NoData-aware mean over 5 tiles (exercises the one-tile-at-a-time sum+count accumulation past the 2-tile cases), a no-NoData-declared fast-path case, and a corrupt-tile-mid-group case that asserts _open_all raises cleanly rather than hanging/leaking. Co-authored-by: Isaac
Bench timings are now multi-second (a 1000-tile iteration is tens of seconds), so milliseconds read awkwardly (iter_median_ms=70785 vs 70.8 s). Convert the wall-clock metrics to SECONDS: rename avg_wall_clock_ms and all iter_*_ms to *_s (values /1000). Keep per_tile_avg_ms (sub-ms..ms scale stays readable in ms) and ADD per_tile_avg_s immediately to its left so per-tile is available in both units. Heavy _remap_iter divides the Scala jsonl ms keys by 1000 (Scala still emits ms). Summaries + compare render at second scale; per-tile shown in both s and ms. ResultRow 34->35 columns; the bench_results table was recreated (ms->s + per_tile_avg_s). Also: _show_md now renders the summary.md Volume path above the body, so the run view links to the artifact (e.g. .../bench-out/<run_id>/summary.md) for the final heavy-vs-light comparison and each per-tier section. Co-authored-by: Isaac
Parallelizing the heavy aggregator path exposed a product concurrency gap:
VectorRasterBridge.buildEmptyRaster does a raw gdal.GetDriverByName("MEM")
that races to null under concurrent tasks (NPE), and rst_gridfrompoints_agg
even sigabrt'd a worker. The 5 tile aggregators are concurrency-safe (they
only touch rst_fromcontent via GDALManager.init), but the geometry
aggregators (gridfrompoints/dtmfromgeoms/rasterize) go through the
not-thread-safe bridge. Serialize geometry-kind aggregators to one task
(parts=1) until the product bridge is made thread-safe; tile aggregators
keep the bounded parallel fan-out.
Co-authored-by: Isaac
The heavy jsonl ms->s conversion was a closure inside the run notebook
string (not unit-testable). Extract it to cluster.remap_heavy_iter_to_seconds
(the notebook imports cluster as _cl and now calls it) so the cluster path
and host tests share one implementation, and add unit tests covering the
rename+/1000 of every timing key, the per_tile_avg_{s,ms} derivation, and
the zero-rows guard (a /1000 slip would silently corrupt all heavy metrics).
Co-authored-by: Isaac
The bench harness (python/.../gbx/bench/) is internal dev/CI tooling, not user-facing API, so its functions + ResultRow schema fields will never have docs/ entries -- yet docs-match-code kept flagging them (ORDER, the second- scale columns, the remap helper), blocking pushes on false positives. Scope the check's Python symbol scan to exclude the bench subpackage. test- completeness stays ON for bench (it surfaces real coverage gaps), and the critical checks (binding-parity, secrets, etc.) are unaffected. Co-authored-by: Isaac
Extract the *_agg post-shuffle task-count decision from runAggregate into a testable HeavyRunner.aggParts(kind, n), and unit-test it: tile aggregators parallelize at ~2 keys/task (ceil(n/2), capped at n) while geometry aggregators serialize to 1 (VectorRasterBridge not thread-safe). Locks the geometry-serialization regression fix so a future edit can't silently re-parallelize the GDAL-unsafe path. Co-authored-by: Isaac
Per review, the bench_results table reads better with the headline timing metrics adjacent to the function/mode columns rather than buried after the size dims. Move avg_wall_clock_s + per_tile_avg_s + per_tile_avg_ms to immediately after mode in ORDER (tile_px/bands/... follow). The existing table was reordered in place to match. Co-authored-by: Isaac
The agg warm-up iter re-ran the FULL n-key scaled groupBy, doubling cost for slow aggregators. Pass time_iters a warmup_fn that aggregates a minimal scaled (parts keys, ~1 per partition) so the warm-up is cheap; the measured iter still runs the full n. Matches the regular spark-path's _warm_df pattern. Co-authored-by: Isaac
timeIters gains a warmBody param: warm-up iters run the cheap stand-in, the measured iters run the full job. runSparkPath warms over dfAll.limit(slots) .repartition(slots) (~1 tile/partition); runAggregate warms over a parts-key scaled (geometry aggs -> 1 key). Stops the full-N warm-up from doubling the cost of slow heavy aggregators. Unit-tested (warm vs measured run counts). Co-authored-by: Isaac
The GDAL Java bindings hold process-global OGR registry state; the new-in-0.4.0 vector/geometry functions registered OGR ad-hoc per task (raw ogr.RegisterAll() in execute() bodies), so concurrent Spark tasks in one executor JVM raced the registry -> null GetDriverByName (NPE) or a native SIGSEGV killing the executor. A local[8] stress test calling raw RegisterAll() concurrently hard-crashes the JVM in libgdal, confirming the race. Add GDALManager.initOgr() -- synchronized + idempotent (registers OGR once per JVM under the same lock as init's gdal.AllRegister) -- and route all 9 sites through it: MvtWriter, VectorRasterBridge, RST_Polygonize, RST_Contour, RST_GridFromPoints, GDALRasterize, OGR_SchemaInference, OGR_DataSource. OgrThreadSafetyTest (8 threads x 50 concurrent initOgr) is the regression guard; CLAUDE.md records the requirement. These functions are new in 0.4.0, so this is concurrency hardening of unreleased code, not a fix of released behavior. Co-authored-by: Isaac
HeavyRunner.aggParts no longer pins geometry aggregators to one task: now that OGR/GDAL registration is thread-safe via the synchronized GDALManager guards (initOgr, committed in 0bc1c9f), gridfrompoints/dtmfromgeoms/rasterize_agg use the same bounded keys/task fan-out as tile aggregators. Validated on cluster 0519 (1000-tile spark-path): rst_dtmfromgeoms_agg ran parallel OK with no sigabrt and dropped 1324s -> 74.8s (~18x); all 5 affected fns ok both tiers. cluster.py _show_md now renders summaries via markdown -> displayHTML (the Databricks job-run UI does not render IPython.display.Markdown -- that only works in Jupyter), with IPython/print fallbacks; the notebook %pip installs markdown. push_jar_to_volume.py (gbx:data:push-jar) now stages BOTH jars from one build: the product fat jar -> GBX_ARTIFACT_VOLUME (init-script source) and the bench tests.jar -> the bundle volroot (bench cluster-library source), removing the manual fs-cp step for the tests.jar. The .sh prefers the .venv-pyrx interpreter (has databricks-sdk). Co-authored-by: Isaac
The final heavy-vs-light compare summary never rendered inline in the run UI (only its summary.md path surfaced in the exit JSON), while the per-section lightweight/heavyweight summaries did render. Cause: the compare _show_md() displayHTML render shared a cell with dbutils.notebook.exit(); in a job run exit() ends the notebook immediately and the run UI keeps only the exit value, dropping that cell's display output. The per-section summaries render because each is in its own cell that completes normally. Split the exit into a trailing _EXIT cell so the epilogue cell (which builds result + renders the compare) completes and commits its HTML output before the exit cell runs (result persists across cells via shared notebook globals). Verified on the deployed notebook for run 877343400972203: cell 5 renders the compare, cell 6 is exit-only. Co-authored-by: Isaac
--redo-functions only PURGED the named fns' rows; the executed scope came solely from --set/--functions. So --redo-functions X with the default --set core (X not in core) DELETED X's rows and re-ran nothing -- the run still reported SUCCESS while silently losing data. Hit 2026-06-10 redoing the geometry aggregators: their spark-path rows were purged, none replaced. Union REDO_FUNCTIONS into the run scope in the preamble -- into fnspecs (the lightweight scope and the heavy _mode_names gate) AND FUNCTIONS (the csv run_heavy splits) -- so a redo fn outside the selected set is actually re-run, matching the documented 'force re-run, independent of --set' intent. select(functions=) ignores the tier, so out-of-set redo fns resolve. Validated against the registry: out-of-core redo fns now land in fnspecs + FUNCTIONS + the spark-path mode gate. Co-authored-by: Isaac
The desired bench iteration defaults: spark-path stays 1 warm-up + 1 measured (the N-tile sweep is the averaging); pure-core moves from 2/5 to 1/3. One warm-up is enough to absorb the cold cost (e.g. numba JIT compiles on the first call), and 3 measured gives a stable median of a fast single-tile op without overpaying. Both remain overridable via --warmup/--measured/--spark-*. Co-authored-by: Isaac
Document the one-time per-process numba JIT compilation paid on the first viewshed call (several seconds, amortized over subsequent calls; steady-state scales with tile size). From the light-outlier perf investigation: xrspatial's viewshed uses a good O(n log n) sweep but its @ngjit kernels lack cache=True, so the cold-start can't be cleanly eliminated -- it amortizes at scale and is absorbed by benchmark warmup. Not a bug; documented rather than patched. Co-authored-by: Isaac
ReTile.getTile minted a /vsimem path and shelled out to a full gdal.Translate -srcwin per output tile (UUID, command build+reparse, empty-recheck, FlushCache, failing Files.size) -- ~4x per-tile overhead shared by rst_tooverlappingtiles, rst_retile, and rst_maketiles. Replace it with WindowedExtract.extract: a direct windowed ReadRaster + Create into a /vsimem GTiff. The downstream serializer re-encodes the tile, so compression/byte-layout parity is moot; the fast path copies every pixel/attribute that is observable -- window-shifted geotransform (full formula incl. rotation terms), projection, dataset+band metadata, per-band NoData (incl. NaN), color interpretation, color table, scale, offset, unit type. Correctness over speed: the fast path runs only when simpleEnough(ds) holds (uniform per-band dtype; no real mask bands; no GCPs; no RPC/GEOLOCATION). Anything else FALLS BACK to the unchanged GDALTranslate.executeTranslate with the same -srcwin command and the source driver's extension, so e.g. a mixed-dtype VRT round-trips faithfully. getTile keeps the exact RasterAccessors.isEmpty discard + gdal.Unlink cleanup, preserving tile count. GTiff/MEM drivers are obtained via new synchronized GDALManager accessors (gtiffDriver/memDriver), never a raw per-task GetDriverByName, so concurrent retile cannot race driver registration. WindowedExtractTest covers pixel parity, window-shifted (and rotated) geotransform, NoData incl. NaN and unset, color table, scale/offset/ unit/metadata, color interpretation, projection, the three fallback triggers (mixed dtype, per-dataset mask, GCPs), partial windows, all-NoData discard, and metadata-map shape parity. Co-authored-by: Isaac
The Scala heavyweight shard (BenchRow) serializes ms-suffixed timing field names (median_ms, min_ms, p90_ms, total_wall_clock_ms, avg_wall_clock_ms) while the canonical ResultRow is second-scale (iter_*_s). read_jsonl threw TypeError on the heavy shard, crashing the local heavyweight-vs-lightweight compare. Map ms->s (÷1000) + drop unknown keys on read so both shards load. Mirrors cluster.py's _remap_heavy_iter_to_seconds for the local path. Co-authored-by: Isaac
Serverless / Spark Connect forbids spark.conf.set() and JVM-bridge access (_jvm/_jsc/sparkContext/.rdd). Serverless is a target environment for the lightweight pyrx tier, so the pyrx product must never do those -- only register UDFs + build Column expressions. Static source guard fails (with file:line) if a forbidden call is added to pyrx, before it ships and breaks on Serverless. Audit 2026-06-11: pyrx is currently clean; the only config-setting is the bench harness (repo tooling, not the product), and the only _jvm use is vectorx (the heavyweight JVM-backed tier, not the Serverless-light target). Co-authored-by: Isaac
A spark-path iteration silently capped at the pool size (tiles = pool.tiles[:max_rows] / pool.tiles.take(maxRows)), so a 100-tile pool against --row-counts 1000 reported rows=1000 while only processing 100 -- an invalid, under-filled measurement that looked complete. This produced a bad 1000-scale 1024px data point that had to be discarded. Refuse the run instead of under-filling, with the same message at all three entry points: - push_and_run_bench_on_cluster.py: launcher pre-flight reads the corpus.json row_pool size and returns non-zero before submitting the cluster job (the immediate, no-rebuild guard). - runner.py (light): ValueError before the slice. - HeavyRunner.scala (heavy): require() before take(). Co-authored-by: Isaac
Regenerate the benchmark page from two fresh 1024² runs: - Pure-core (local, 100 fns): per-tile timing in ms; consistency tally 26 exact / 47 within-tol / 23 timing-only / 4 divergent. The 4 divergences (convolve, derivedband, resample, contour) are all NoData/edge handling -- narrated honestly, interior values match. - Spark-path at scale (cluster, 83 fns, 1000 tiles): a new section showing that pure-core wins don't all survive at scale -- terrain and the H3/QuadBin grids fall to parity once tile bytes cross the Python boundary (the serialization tax). 57/83 light >= heavy, 26 heavy faster; the 7 aggregators all within tolerance. Both tables mark the speedup<1 crossover with a divider row, and both option tables (cluster + local) state defaults and the supported sets vs continuous ranges; clarify pure-core = 1 tile (ignores --row-counts) and the warmup-vs-measured iteration model. Also: retile family now fast after the windowed-read fix (13-23x in pure-core); document the harness repartition strategy and the hard Serverless caveat (pyrx sets no Spark config); note the bench refuses to run when the row pool can't fill --row-counts. Co-authored-by: Isaac
The ms->s field converter added in 4d671cc (so local heavy-vs-light compare stops crashing on the Scala BenchRow's ms-suffixed timing fields) had no direct test. Add five: all-five-fields conversion, None->0.0, setdefault not clobbering an existing seconds value, unknown-key drop (forward-compat), and an end-to-end read_jsonl over a heavyweight ms-style shard. Closes the QC test-completeness gap. Co-authored-by: Isaac
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Summary
GeoBrix v0.4.0 — 45 new functions + 1 new DataSource across 12 implementation waves merged between 2026-05-27 and 2026-05-28. This PR opens as DRAFT for review while the wave-by-wave commit history is fresh; mark ready when the diff has been audited.
What's new
See
docs/docs/beta-release-notes.mdx§ What's new in v0.4.0 for the full per-feature changelog (12 bullets). High-level groupings:VectorX (new expression surface) — first expression-level functions in VectorX:
gbx_st_asmvt— UDAF aggregating features into Mapbox Vector Tile (MVT) protobufgbx_st_asmvt_pyramid— generator: feature → many(z, x, y, mvt_bytes)rowsGridX (new quadbin subpackage):
gridx/quadbin/(pointascell,aswkb,centroid,resolution,polyfill,kring,tessellate,cellunion,distance) — CARTO quadbin v0 spec, 64-bit Long cell IDs aligned with the web-mercator XYZ tile grid.RasterX (29 new functions):
gbx_rst_rasterize,gbx_rst_polygonizeto_webmercator,tilexyz,xyzpyramidslope,aspect,hillshade,tri,tpi,roughness,color_reliefevi,savi,ndwi,nbr, plus genericindexdispatcherresample/_to_size/_to_res,gridfrompoints+_aggfillnodata,sample,setsrid,histogram,threshold,buildoverviews,bandcog_convert,proximity,contour,viewshedPMTiles (new top-level package):
gbx_pmtiles_aggUDAF — returns BINARY PMTile v3 blob.write.format("pmtiles")DataSource — streams larger pyramids to file via partitioned commit protocolTest plan
build main) green onbeta/0.4.0(latest run).maven-keys.listunchanged.cursor/rules/user-facing-docs-voice.mdcrule)META-INF/services/...DataSourceRegisterregistrationsThis pull request and its description were written by Isaac.