feat: parse kerchunk inline refs into inlined ChunkManifest entries #979

Merged
TomNicholas merged 5 commits into zarr-developers:main from TomNicholas:kerchunk-inlined-chunks
Apr 24, 2026

@TomNicholas TomNicholas commented Apr 24, 2026

Summary

Teaches the kerchunk parsers (both JSON and Parquet) to decode inline references into ChunkManifest._inlined instead of raising NotImplementedError. Both the raw-string and base64:-prefixed forms of kerchunk inline data are supported, and the single translator change covers both parsers since they share the same pipeline.

Fixes the read half of #489 — round-tripping a kerchunk-with-inlined-data file through open_virtual_dataset(..., filetype="kerchunk") now works. The write half (emitting inlined chunks from the kerchunk/icechunk writers) is deliberately left for a follow-up PR, so this PR mentions but does not close #489.

Only possible thanks to #938.
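
The two inline encodings mentioned in the summary can be sketched roughly as follows. This is a minimal illustration and not the code in virtualizarr/parsers/kerchunk/translator.py: the names `decode_inline_ref` and `BASE64_PREFIX` are hypothetical, and encoding raw strings as UTF-8 is an assumption about how "interpreted as bytes" is realised.

```python
import base64

# Prefix that kerchunk uses to mark a base64-encoded inline payload.
BASE64_PREFIX = "base64:"


def decode_inline_ref(value: str) -> bytes:
    """Decode a kerchunk inline reference string into raw chunk bytes.

    Hypothetical helper for illustration only.
    """
    if value.startswith(BASE64_PREFIX):
        # Strip the prefix and decode the remaining base64 payload.
        return base64.b64decode(value[len(BASE64_PREFIX):])
    # Assumption: a raw string is turned into bytes via UTF-8 encoding.
    return value.encode("utf-8")


payload = bytes(range(8))
encoded = BASE64_PREFIX + base64.b64encode(payload).decode("ascii")

decoded_b64 = decode_inline_ref(encoded)
decoded_raw = decode_inline_ref("abcd")
```

Here `decoded_b64` should equal the original `payload`, and `decoded_raw` should be the bytes of the string `"abcd"`.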

Test plan

  • JSON parser: parametrized test over both inline encodings (base64-prefixed and raw string) — hand-crafted refs dict with one inlined + one virtual chunk, assertions on exact manifest._inlined bytes and manifest.dict().
  • Parquet parser: same refs round-tripped through kerchunk.df.refs_to_dataframe, same assertions.
  • End-to-end: ManifestStore.get for both an inlined chunk key and a virtual chunk key returns the expected bytes.
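
For readers unfamiliar with the fixture shape, a hand-crafted refs dict with one inlined and one virtual chunk might look like the sketch below. It does not call VirtualiZarr or kerchunk; `make_refs` and `split_chunk_refs` are hypothetical helpers, the `memory://data.bin` path and the array metadata are made up, and the output keeps kerchunk-style keys rather than the exact layout of `manifest.dict()`.

```python
import base64
import json

payload = b"abcd"  # ASCII bytes, so they are valid in both inline encodings


def make_refs(inline_value: str) -> dict:
    """Build a version-1 kerchunk refs dict with one inlined and one virtual chunk."""
    zarray = {
        "shape": [2], "chunks": [1], "dtype": "<i4", "compressor": None,
        "fill_value": None, "filters": None, "order": "C", "zarr_format": 2,
    }
    return {
        "version": 1,
        "refs": {
            ".zgroup": json.dumps({"zarr_format": 2}),
            "a/.zarray": json.dumps(zarray),
            "a/0": inline_value,                 # inlined chunk (string)
            "a/1": ["memory://data.bin", 4, 4],  # virtual chunk [path, offset, length]
        },
    }


def split_chunk_refs(refs: dict) -> tuple[dict, dict]:
    """Separate chunk refs into inlined bytes and virtual byte ranges."""
    inlined, virtual = {}, {}
    for key, ref in refs["refs"].items():
        # Metadata entries (.zgroup, .zarray, ...) are also strings; skip them.
        if key.rsplit("/", 1)[-1].startswith(".z"):
            continue
        if isinstance(ref, str):
            if ref.startswith("base64:"):
                inlined[key] = base64.b64decode(ref[len("base64:"):])
            else:
                inlined[key] = ref.encode("utf-8")  # assumption: UTF-8
        else:
            # Only the three-element [path, offset, length] form is handled here.
            path, offset, length = ref
            virtual[key] = {"path": path, "offset": offset, "length": length}
    return inlined, virtual


# Mirror the parametrization over both inline encodings.
encodings = {
    "base64": "base64:" + base64.b64encode(payload).decode("ascii"),
    "raw": payload.decode("ascii"),
}
results = {name: split_chunk_refs(make_refs(v)) for name, v in encodings.items()}
```

Both entries in `results` should contain the same inlined bytes for `a/0` and the same byte range for `a/1`, which is the property the parametrized test checks against the real manifest.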

TomNicholas and others added 4 commits April 24, 2026 11:55
Replaces the two tests that locked in NotImplementedError for inlined
refs with in-memory tests that hand-craft a refs dict with one inlined
chunk and one virtual chunk, and assert the exact bytes, offsets, and
paths that should appear in the resulting ChunkManifest. Covers both
kerchunk inline encodings (base64-prefixed and raw string) for the JSON
parser, plus a parquet round-trip.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Kerchunk represents inlined chunk data as either a raw string (interpreted
as bytes) or a base64-encoded payload prefixed with 'base64:'. Previously
the translator raised NotImplementedError for either; it now decodes both
forms into a ChunkEntry with a 'data' field so the bytes flow through to
ChunkManifest._inlined. Works for both the KerchunkJSONParser and
KerchunkParquetParser since both share this translator.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Constructs a refs dict with one base64-encoded inlined chunk plus one
virtual chunk pointing at a file in a MemoryStore, parses it through
KerchunkJSONParser, then awaits ManifestStore.get for each chunk key
and asserts the bytes match. The inlined chunk is served directly from
ChunkManifest._inlined while the virtual chunk is fetched via the
object store registry.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
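
The difference between the two read paths in that end-to-end test can be illustrated with a toy store. This is only a sketch under stated assumptions: `ToyManifestStore` is hypothetical and is not VirtualiZarr's ManifestStore, a plain dict of bytes stands in for the MemoryStore and the object store registry, and the keys and path are invented.

```python
import asyncio


class ToyManifestStore:
    """Hypothetical stand-in showing inlined vs. virtual chunk lookups."""

    def __init__(self, inlined: dict, virtual: dict, files: dict):
        self._inlined = inlined  # chunk key -> bytes held in memory
        self._virtual = virtual  # chunk key -> (path, offset, length)
        self._files = files      # path -> full file contents (fake object store)

    async def get(self, key: str) -> bytes:
        # Inlined chunks are returned directly, with no store access.
        if key in self._inlined:
            return self._inlined[key]
        # Virtual chunks are resolved to a byte range in the referenced file.
        path, offset, length = self._virtual[key]
        return self._files[path][offset:offset + length]


async def main() -> tuple[bytes, bytes]:
    store = ToyManifestStore(
        inlined={"a/0": b"abcd"},
        virtual={"a/1": ("memory://data.bin", 4, 4)},
        files={"memory://data.bin": b"\x00" * 4 + b"wxyz"},
    )
    return await store.get("a/0"), await store.get("a/1")


inlined_bytes, virtual_bytes = asyncio.run(main())
```

The first lookup never touches `files`, while the second slices four bytes starting at offset 4 of the fake file.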

codecov Bot commented Apr 24, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.92%. Comparing base (3194b09) to head (caa6e94).
⚠️ Report is 1 commit behind head on main.

Additional details and impacted files
@@           Coverage Diff           @@
##             main     #979   +/-   ##
=======================================
  Coverage   89.91%   89.92%           
=======================================
  Files          33       33           
  Lines        2053     2054    +1     
=======================================
+ Hits         1846     1847    +1     
  Misses        207      207           
| Files with missing lines | Coverage Δ |
|---|---|
| virtualizarr/parsers/kerchunk/translator.py | 81.25% <100.00%> (+0.16%) ⬆️ |

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@TomNicholas TomNicholas added the Kerchunk (Relating to the kerchunk library / specification itself) and parsers labels Apr 24, 2026
@TomNicholas TomNicholas merged commit ad4a593 into zarr-developers:main Apr 24, 2026
19 checks passed
Development

Successfully merging this pull request may close these issues.

Error reading inlined reference data when trying to roundtrip virtual dataset
