Skip to content

feat: generalize ChunkManifest to hold inline chunks#938

Merged
TomNicholas merged 26 commits intozarr-developers:mainfrom
maxrjones:store-native-chunks
Apr 24, 2026
Merged

feat: generalize ChunkManifest to hold inline chunks#938
TomNicholas merged 26 commits intozarr-developers:mainfrom
maxrjones:store-native-chunks

Conversation

@maxrjones
Copy link
Copy Markdown
Member

@maxrjones maxrjones commented Mar 20, 2026

This implements the design proposed in #851 (comment) for supporting native chunks in ChunkManifest, which allows loading Kerchunk JSONs with in-lined references or eventually Icechunk stores.

I prefer this design over #794, but still don't really like the amount of if native forks needed. Still, this is such a long-overdue feature that it seems worth releasing soon even if it's not the perfect design. We can also improve the internals later on. Let's discuss at the dev meeting tomorrow.

  • Adds _native: dict[tuple[int, ...], bytes] to ChunkManifest for storing in-memory chunk data
  • Extends ChunkEntry with optional data: NotRequired[bytes] field
  • Native chunks propagate through concat/stack/broadcast, pickle, and ManifestStore.get()
  • Kerchunk parser decodes base64 inlined refs into native chunks
  • Writers serialize native chunks: base64 for kerchunk, store.set() for icechunk
  • Docs page + tests including a kerchunk→icechunk→xarray roundtrip

Acceptance criteria:

@codecov
Copy link
Copy Markdown

codecov Bot commented Mar 20, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.
✅ Project coverage is 89.91%. Comparing base (e82ac27) to head (de587a9).
⚠️ Report is 1 commits behind head on main.

Additional details and impacted files
@@            Coverage Diff             @@
##             main     #938      +/-   ##
==========================================
+ Coverage   89.35%   89.91%   +0.56%     
==========================================
  Files          33       33              
  Lines        2038     2053      +15     
==========================================
+ Hits         1821     1846      +25     
+ Misses        217      207      -10     
Files with missing lines Coverage Δ
virtualizarr/manifests/array_api.py 98.18% <100.00%> (+0.56%) ⬆️
virtualizarr/manifests/manifest.py 92.56% <100.00%> (+6.38%) ⬆️
virtualizarr/manifests/store.py 89.72% <100.00%> (+0.44%) ⬆️
🚀 New features to boost your workflow:
  • ❄️ Test Analytics: Detect flaky tests, report on failures, and find test suite problems.

Comment thread docs/inlined_references.md Outdated
Copy link
Copy Markdown
Member

@TomNicholas TomNicholas left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

This is awesome, thanks @maxrjones .

My main concern about this PR is that it changes private internals and public behaviour in one go. The more neat way to do it would be to leave any changes to the Kerchunk and Icechunk readers/writers to follow-up PRs.

Comment thread virtualizarr/manifests/manifest.py Outdated
Comment thread virtualizarr/manifests/manifest.py Outdated
_paths: np.ndarray[Any, np.dtypes.StringDType]
_offsets: np.ndarray[Any, np.dtype[np.uint64]]
_lengths: np.ndarray[Any, np.dtype[np.uint64]]
_native: dict[tuple[int, ...], bytes]
Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

How is a scalar native array handled? An empty tuple? Worth having a test for that case.

Copy link
Copy Markdown
Member Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Comment thread virtualizarr/manifests/manifest.py
Comment thread virtualizarr/manifests/manifest.py Outdated
@TomNicholas
Copy link
Copy Markdown
Member

TomNicholas commented Mar 20, 2026

I'm also not totally clear on the intended relationship between this feature and the parsers. Do parsers choose if a chunk is native? 1 IIUC the user has no way to use this feature via loadable_variables? I think that makes sense, but trying to confirm my understanding.

Footnotes

  1. (Also I think we should actually call this "inlined", because it is effectively stored in our ChunkManifest, so that nomenclature would be more consistent with Icechunk and Kerchunk.)

@maxrjones
Copy link
Copy Markdown
Member Author

maxrjones commented Mar 20, 2026

Thanks for the review @TomNicholas. I addressed all your comments in the last set of commits.

My main concern about this PR is that it changes private internals and public behaviour in one go. The more neat way to do it would be to leave any changes to the Kerchunk and Icechunk readers/writers to follow-up PRs.

Good point, I pulled out the Kerchunk reader portion. I think the writers need to know what to do with inlined data immediately, otherwise you could get buggy serialized data if someone were to write a parser that uses the inlined data feature.

I'm also not totally clear on the intended relationship between this feature and the parsers. Do parsers choose if a chunk is native? 1 IIUC the user has no way to use this feature via loadable_variables? I think that makes sense, but trying to confirm my understanding.

That's right. Parsers chose which chunks are inlined. The user chooses which arrays are loaded via loadable_variables. Parsers could offer configuration options for per-chunk inlining, but probably won't.

@TomNicholas
Copy link
Copy Markdown
Member

Good point, I pulled out the Kerchunk reader portion. I think the writers need to know what to do with inlined data immediately, otherwise you could get buggy serialized data if someone were to write a parser that uses the inlined data feature.

I mean you could just update the writers to raise NotImplementedError if there are any native chunks, but the splitting you've already done is good, thanks you.

Parsers chose which chunks are inlined.

Cool. So it is really an implementation detail. It's only relevant for Parser authors. So doesn't that mean the docs for it should only be under "Writing a custom parser"?

@TomNicholas
Copy link
Copy Markdown
Member

@maxrjones the changes don't look right any more - where is the new attribute on chunk manifest for holding the buffer?

Also please:

  • remove any changes to the Icechunk writer
  • move the docs into the custom parsers section

@maxrjones
Copy link
Copy Markdown
Member Author

@maxrjones the changes don't look right any more - where is the new attribute on chunk manifest for holding the buffer?

https://github.com/maxrjones/VirtualiZarr/blob/922335028e624c0fe4d9a49aa241d45cffb6195b/virtualizarr/manifests/manifest.py#L214

Also please:

  • remove any changes to the Icechunk writer
  • move the docs into the custom parsers section

will do

@TomNicholas
Copy link
Copy Markdown
Member

TomNicholas commented Mar 24, 2026

Ah sorry. We also need to test other basic operations on manifests in this PR, such as concatenation, broadcasting, and any slicing we support. Might be easier to change fixtures of existing tests than to add new ones.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TomNicholas and others added 9 commits April 23, 2026 14:15
Broadcast should replicate inlined chunk bytes to every position along an
expanded axis, matching the behaviour already observed for virtual chunks.
Three of the four new tests fail under the current implementation.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously _broadcast_manifest only prepended singleton dimensions to
inlined chunk keys, leaving a single dict entry even when np.broadcast_to
expanded an axis. Reads at the replicated positions would find the
INLINED_CHUNK_PATH sentinel in the paths array but miss the _inlined dict,
producing broken behaviour in ManifestStore.get.

Now we replicate each inlined entry to every target position along any
axis that was size 1 in the source, mirroring how the paths/offsets/lengths
arrays are broadcast. The bytes themselves are shared by reference, not
copied.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Locks in the existing behaviour of _concat_manifests and _stack_manifests
for manifests containing inlined chunks: keys are shifted along the concat
axis or gain the stack-axis index, and bytes are shared by reference rather
than copied.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Confirms replicated entries share the same bytes object rather than
allocating copies at each expanded position.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
When two ManifestArrays share paths/offsets/lengths but have different
inlined chunk data, ManifestArray.__eq__ falls through to its 'over-cautious'
fallback via ChunkManifest.elementwise_eq, which does not currently compare
inlined bytes. That triggers RuntimeWarning('Should not be possible to get here').

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Previously elementwise_eq only looked at paths/offsets/lengths, which all
agree for inlined chunks even when their bytes differ. That let two
ChunkManifests disagree per __eq__ but look identical per elementwise_eq,
tripping the 'Should not be possible to get here' branch in
ManifestArray.__eq__.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Covers the inlined-chunk branch in ManifestStore.get including byte-range
variants (RangeByteRequest, OffsetByteRequest, SuffixByteRequest), a mixed
manifest where inlined and virtual chunks are served from the same array,
and list_dir enumeration of inlined chunk keys.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@TomNicholas
Copy link
Copy Markdown
Member

@maxrjones I hope you don't mind but I took over this PR.

I responded to my own review by making these changes:

  • Reverted all Icechunk writer changes — kept strictly to the ChunkManifest internals
  • Docs moved out of a standalone page into data_structures.md under ## Chunk Manifests
  • Tests added for concat / stack / broadcast of manifests with inlined chunks, plus ManifestStore reads (byte-range variants, mixed inlined/virtual, list_dir)
  • Fixed a broadcast bug: _broadcast_manifest wasn't replicating inlined entries along expanded axes — a (1,2) → (3,2) broadcast would leave the replicated positions missing from _inlined even though _paths claimed __inlined__ there. Now replicates by reference, no byte copies.
  • Fixed an equality bug: elementwise_eq didn't compare inlined bytes, which broke an invariant in ManifestArray.__eq__ and tripped RuntimeWarning("Should not be possible to get here") for two manifests with identical paths/offsets/lengths but different inlined bytes.

@maxrjones
Copy link
Copy Markdown
Member Author

@maxrjones I hope you don't mind but I took over this PR.

I responded to my own review by making these changes:

  • Reverted all Icechunk writer changes — kept strictly to the ChunkManifest internals
  • Docs moved out of a standalone page into data_structures.md under ## Chunk Manifests
  • Tests added for concat / stack / broadcast of manifests with inlined chunks, plus ManifestStore reads (byte-range variants, mixed inlined/virtual, list_dir)
  • Fixed a broadcast bug: _broadcast_manifest wasn't replicating inlined entries along expanded axes — a (1,2) → (3,2) broadcast would leave the replicated positions missing from _inlined even though _paths claimed __inlined__ there. Now replicates by reference, no byte copies.
  • Fixed an equality bug: elementwise_eq didn't compare inlined bytes, which broke an invariant in ManifestArray.__eq__ and tripped RuntimeWarning("Should not be possible to get here") for two manifests with identical paths/offsets/lengths but different inlined bytes.

this is great, thank you very much for taking this on! Those changes are all very helpful. I haven't done a line-by-line review, but would be happy with you merging whenever you feel it's ready.

@TomNicholas TomNicholas changed the title feat: generalize ChunkManifest to hold native chunks feat: generalize ChunkManifest to hold inline chunks Apr 23, 2026
Validation previously used a subset check, which silently accepted entries
with unknown keys alongside the required path/offset/length. Now the entry
key set must match exactly one of the two valid shapes: virtual ({path,
offset, length}) or inlined ({path, offset, length, data}).

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TomNicholas and others added 2 commits April 24, 2026 11:11
Calls out the path-value convention used by ChunkManifest entries so parser
authors have a single, discoverable reference for distinguishing virtual,
missing, and inlined chunks.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
@TomNicholas TomNicholas merged commit 3194b09 into zarr-developers:main Apr 24, 2026
17 checks passed
TomNicholas added a commit to TomNicholas/VirtualiZarr that referenced this pull request Apr 24, 2026
Broadcasting inlined chunks not only prepends singleton dims to their
keys, but also replicates the bytes (by reference) across every position
of an expanded axis, per the fix in zarr-developers#938.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
TomNicholas added a commit that referenced this pull request Apr 25, 2026
#981)

* Add failing test for writing inlined chunks to icechunk

The icechunk writer currently sends the INLINED_CHUNK_PATH sentinel
('__inlined__') straight into icechunk's set_virtual_refs_arr, which
rejects it as a malformed virtual URL. The new test writes a manifest
containing one inlined chunk plus one virtual chunk, commits, then
re-opens via xarray and asserts the values match end-to-end.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Support writing inlined ChunkManifest entries to icechunk

Icechunk's set_virtual_refs_arr rejects the INLINED_CHUNK_PATH sentinel
('__inlined__') as a malformed URL. write_manifest_to_icechunk now writes
inlined chunks first as native chunks via store.set, then rewrites those
positions to empty strings in the paths array before calling
set_virtual_refs_arr with the cleaned view. A cheap numpy-level check
skips the virtual-refs call entirely for all-inlined or all-missing
manifests.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

* Update broadcast description to reflect bytes replication

Broadcasting inlined chunks not only prepends singleton dims to their
keys, but also replicates the bytes (by reference) across every position
of an expanded axis, per the fix in #938.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Generalize ChunkManifest to hold native chunks as well as virtual refs

2 participants