Commit 3194b09

maxrjones, TomNicholas, claude, and pre-commit-ci[bot] authored
feat: generalize ChunkManifest to hold inline chunks (#938)
* feat: generalize ChunkManifest to hold native chunks
* Rename native to inlined
* Move docs to explanation
* Rename data to inlined_data
* Better sentinel values
* Improve required entry validation
* Add scalar test
* Revert changes that should be a separate PR
* Fix mypy: avoid narrowing StringDType on np.where reassignment
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Revert icechunk writer changes; handle inlined chunks in a follow-up PR
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Move inlined chunks docs into data_structures.md
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Add failing tests for broadcasting manifests with inlined chunks
  Broadcast should replicate inlined chunk bytes to every position along an expanded axis, matching the behaviour already observed for virtual chunks. Three of the four new tests fail under the current implementation.
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Replicate inlined chunks across expanded axes in broadcast_to
  Previously _broadcast_manifest only prepended singleton dimensions to inlined chunk keys, leaving a single dict entry even when np.broadcast_to expanded an axis. Reads at the replicated positions would find the INLINED_CHUNK_PATH sentinel in the paths array but miss the _inlined dict, producing broken behaviour in ManifestStore.get.
  Now we replicate each inlined entry to every target position along any axis that was size 1 in the source, mirroring how the paths/offsets/lengths arrays are broadcast. The bytes themselves are shared by reference, not copied.
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Add tests for concat and stack with inlined chunks
  Locks in the existing behaviour of _concat_manifests and _stack_manifests for manifests containing inlined chunks: keys are shifted along the concat axis or gain the stack-axis index, and bytes are shared by reference rather than copied.
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Add bytes-identity test for broadcasting inlined chunks
  Confirms replicated entries share the same bytes object rather than allocating copies at each expanded position.
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Add failing test for ManifestArray equality with differing inlined bytes
  When two ManifestArrays share paths/offsets/lengths but have different inlined chunk data, ManifestArray.__eq__ falls through to its 'over-cautious' fallback via ChunkManifest.elementwise_eq, which does not currently compare inlined bytes. That triggers RuntimeWarning('Should not be possible to get here').
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Compare inlined bytes in ChunkManifest.elementwise_eq
  Previously elementwise_eq only looked at paths/offsets/lengths, which all agree for inlined chunks even when their bytes differ. That let two ChunkManifests disagree per __eq__ but look identical per elementwise_eq, tripping the 'Should not be possible to get here' branch in ManifestArray.__eq__.
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Add ManifestStore read tests for inlined chunks
  Covers the inlined-chunk branch in ManifestStore.get including byte-range variants (RangeByteRequest, OffsetByteRequest, SuffixByteRequest), a mixed manifest where inlined and virtual chunks are served from the same array, and list_dir enumeration of inlined chunk keys.
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Smoke test that to_virtual_variable preserves inlined chunks
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* [pre-commit.ci] auto fixes from pre-commit.com hooks
  for more information, see https://pre-commit.ci
* Reject ChunkManifest entries with extra keys
  Validation previously used a subset check, which silently accepted entries with unknown keys alongside the required path/offset/length. Now the entry key set must match exactly one of the two valid shapes: virtual ({path, offset, length}) or inlined ({path, offset, length, data}).
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Document the three chunk states (virtual, missing, inlined) in a table
  Calls out the path-value convention used by ChunkManifest entries so parser authors have a single, discoverable reference for distinguishing virtual, missing, and inlined chunks.
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
* Add release note for inlined chunks support
  Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

---------

Co-authored-by: TomNicholas <tom@earthmover.io>
Co-authored-by: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Co-authored-by: pre-commit-ci[bot] <66853113+pre-commit-ci[bot]@users.noreply.github.com>
1 parent e82ac27 commit 3194b09

8 files changed

Lines changed: 850 additions & 32 deletions

docs/data_structures.md

Lines changed: 52 additions & 0 deletions
@@ -94,6 +94,58 @@ lengths = np.asarray([100, 100], dtype=np.uint64)
 manifest = ChunkManifest.from_arrays(paths=paths, offsets=offsets, lengths=lengths)
 ```

+### Chunk states
+
+Every position in a `ChunkManifest` is in one of three states, distinguished by the value of `path` in its entry:
+
+| State | `path` | Meaning |
+|----------|---------------------------------------|----------------------------------------------------------------------------------------------|
+| Virtual | a real URI (e.g., `"s3://bucket/foo.nc"`) | Chunk lives at the given byte range in an external file. |
+| Missing | `""` (`MISSING_CHUNK_PATH`) | Chunk is absent. Reads return the array's `fill_value`. |
+| Inlined | `"__inlined__"` (`INLINED_CHUNK_PATH`)| Raw bytes for the chunk are stored in memory in the manifest's `_inlined` dict (see below). |
+
+Parser authors are free to mix all three states within a single manifest.
+
+### Inlined chunks
+
+So far every chunk in the manifest has pointed to a byte range in some external file.
+A `ChunkManifest` can also hold **inlined chunks**: the raw chunk bytes are carried directly inside the manifest itself, rather than referenced from an external file.
+
+Inlined chunks are useful for small variables (coordinate arrays, dimension labels, scalar metadata) where the overhead of a remote read exceeds the cost of just carrying the bytes along.
+
+Inlined chunks are produced by [parsers](custom_parsers.md), not by end users; there is no way to request them via `loadable_variables`. If you are writing a custom parser for a format that stores small inlined references (e.g., Kerchunk JSON), you can emit them using the constructors below.
+
+Internally, inlined chunks live in a sparse dictionary `_inlined: dict[tuple[int, ...], bytes]` on the `ChunkManifest`, keyed by chunk grid index. The corresponding entry in the paths array is set to the `INLINED_CHUNK_PATH` sentinel.
+
+To create a manifest with inlined chunks, pass entries with a `data` key:
+
+```python
+from virtualizarr.manifests import ChunkManifest
+
+manifest = ChunkManifest(
+    entries={
+        "0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
+        "0.1": {"path": "", "offset": 0, "length": 4, "data": b"\x00\x01\x02\x03"},
+    }
+)
+```
+
+Or via `from_arrays` with the `inlined` parameter:
+
+```python
+import numpy as np
+from virtualizarr.manifests import ChunkManifest
+
+manifest = ChunkManifest.from_arrays(
+    paths=np.asarray(["s3://bucket/foo.nc", ""], dtype=np.dtypes.StringDType()),
+    offsets=np.asarray([100, 0], dtype=np.uint64),
+    lengths=np.asarray([100, 4], dtype=np.uint64),
+    inlined={(1,): b"\x00\x01\x02\x03"},
+)
+```
+
+Inlined chunks participate in all manifest operations: concatenation shifts their indices along the concatenation axis, stacking inserts the new axis index into their keys, broadcasting prepends singleton dimensions to their keys and replicates each entry along every expanded axis, equality compares the inlined bytes, pickling carries the data along (for Dask/multiprocessing), `ManifestStore` reads return them directly from memory, and `nbytes` includes their size.
+
 ## `ManifestArray` class

 A Zarr array is defined not just by the location of its constituent chunk data, but by its array-level attributes such as `shape` and `dtype`.

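To make the path-value convention from the added "Chunk states" table concrete, here is a minimal sketch in plain Python. It redefines the two sentinel values locally instead of importing them, and `chunk_state` is a hypothetical helper written for illustration; it is not part of VirtualiZarr's API.

```python
# Sentinel values copied from the "Chunk states" table; redefined here so the
# sketch runs without importing virtualizarr.
MISSING_CHUNK_PATH = ""
INLINED_CHUNK_PATH = "__inlined__"


def chunk_state(path: str) -> str:
    """Hypothetical helper: classify one manifest position by its path value."""
    if path == MISSING_CHUNK_PATH:
        return "missing"
    if path == INLINED_CHUNK_PATH:
        return "inlined"
    return "virtual"


# A 1-D chunk grid holding one chunk in each state.
paths = ["s3://bucket/foo.nc", "", "__inlined__"]

# Sparse dict keyed by chunk grid index, mirroring the documented `_inlined`
# layout: only the inlined position has an entry.
inlined = {(2,): b"\x00\x01\x02\x03"}

states = [chunk_state(p) for p in paths]
print(states)
```

The point of the sketch is the invariant the docs describe: a key appears in the sparse dict only where the paths array holds the inlined sentinel.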
docs/releases.md

Lines changed: 4 additions & 0 deletions
@@ -4,6 +4,10 @@

 ### New Features

+- `ChunkManifest` can now hold inlined chunks: raw chunk bytes carried directly in memory rather than as references to external files. Intended for parser authors (e.g., loading Kerchunk references with inlined data); not exposed via `loadable_variables`.
+([#938](https://github.com/zarr-developers/VirtualiZarr/pull/938)).
+By [Max Jones](https://github.com/maxrjones) and [Tom Nicholas](https://github.com/TomNicholas).
+
 ### Breaking changes

 ### Bug fixes

virtualizarr/manifests/array_api.py

Lines changed: 41 additions & 0 deletions
@@ -1,3 +1,4 @@
+import itertools
 from typing import TYPE_CHECKING, Any, Callable, Union, cast

 import numpy as np
@@ -214,11 +215,23 @@ def _concat_manifests(manifests: list[ChunkManifest], axis: int) -> ChunkManifes
     )
     concatenated_offsets = np.concatenate([m._offsets for m in manifests], axis=axis)
     concatenated_lengths = np.concatenate([m._lengths for m in manifests], axis=axis)
+
+    # merge inlined chunk dicts with index shifting along the concat axis
+    concatenated_inlined: dict[tuple[int, ...], bytes] = {}
+    grid_offset = 0
+    for m in manifests:
+        for key, data in m._inlined.items():
+            shifted = list(key)
+            shifted[axis] += grid_offset
+            concatenated_inlined[tuple(shifted)] = data
+        grid_offset += m._paths.shape[axis]
+
     return ChunkManifest.from_arrays(
         paths=concatenated_paths,
         offsets=concatenated_offsets,
         lengths=concatenated_lengths,
         validate_paths=False,
+        inlined=concatenated_inlined if concatenated_inlined else None,
     )


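The key-shifting step in the hunk above can be tried in isolation. The following is a standalone sketch that uses plain dicts and shape tuples in place of `ChunkManifest` objects; the function name `concat_inlined` and its `(inlined_dict, grid_shape)` input format are assumptions made for illustration only.

```python
def concat_inlined(parts, axis):
    """Merge sparse inlined dicts as if their chunk grids were concatenated.

    `parts` is a list of (inlined_dict, grid_shape) pairs; this input format is
    hypothetical and stands in for the manifests in the real function.
    """
    merged = {}
    grid_offset = 0
    for inlined, grid_shape in parts:
        for key, data in inlined.items():
            shifted = list(key)
            # Move the key past all chunks contributed by earlier inputs.
            shifted[axis] += grid_offset
            merged[tuple(shifted)] = data
        # Advance by this input's extent along the concat axis.
        grid_offset += grid_shape[axis]
    return merged


payload = b"\x00\x01"
first = ({(0, 1): payload}, (1, 2))          # 1x2 grid, inlined chunk at (0, 1)
second = ({(0, 0): b"\x02\x03"}, (1, 3))     # 1x3 grid, inlined chunk at (0, 0)

merged = concat_inlined([first, second], axis=1)
print(merged)
```

Because only the keys are rebuilt, each value in the merged dict is the same `bytes` object as in the input, which matches the commit message's note that bytes are shared by reference rather than copied.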
@@ -230,11 +243,21 @@ def _stack_manifests(manifests: list[ChunkManifest], axis: int) -> ChunkManifest
     )
     stacked_offsets = np.stack([m._offsets for m in manifests], axis=axis)
     stacked_lengths = np.stack([m._lengths for m in manifests], axis=axis)
+
+    # merge inlined chunk dicts, inserting the new stacked axis
+    stacked_inlined: dict[tuple[int, ...], bytes] = {}
+    for i, m in enumerate(manifests):
+        for key, data in m._inlined.items():
+            shifted = list(key)
+            shifted.insert(axis, i)
+            stacked_inlined[tuple(shifted)] = data
+
     return ChunkManifest.from_arrays(
         paths=stacked_paths,
         offsets=stacked_offsets,
         lengths=stacked_lengths,
         validate_paths=False,
+        inlined=stacked_inlined if stacked_inlined else None,
     )


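For stacking, each key gains one extra index at the stack axis. The sketch below reproduces that step with plain dicts; `stack_inlined` is a hypothetical name, and the sketch assumes a non-negative `axis`.

```python
def stack_inlined(inlined_dicts, axis):
    """Merge sparse inlined dicts as if their chunk grids were stacked.

    Hypothetical standalone helper; assumes `axis` is non-negative.
    """
    stacked = {}
    for i, inlined in enumerate(inlined_dicts):
        for key, data in inlined.items():
            new_key = list(key)
            # The position of the source in the input list becomes the
            # index along the newly created axis.
            new_key.insert(axis, i)
            stacked[tuple(new_key)] = data
    return stacked


# Three 1-D inputs; the middle one has no inlined chunks at all.
inputs = [{(0,): b"a"}, {}, {(1,): b"b"}]

stacked = stack_inlined(inputs, axis=0)
print(stacked)
```

One caveat worth keeping in mind when adapting this: `list.insert` with a negative index inserts before the element at that position, which is not the same convention `np.stack` uses for negative axes, so a caller would need to normalize the axis first. Whether the surrounding code already does that is not visible in this diff.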
@@ -248,11 +271,29 @@ def _broadcast_manifest(
     )
     broadcasted_offsets = np.broadcast_to(manifest._offsets, shape=shape)
     broadcasted_lengths = np.broadcast_to(manifest._lengths, shape=shape)
+
+    # broadcast inlined chunks: prepend singleton dims to each key, then replicate
+    # the entry across every target position along any axis that was size 1 in the
+    # source (matching np.broadcast_to semantics for the paths/offsets/lengths arrays).
+    broadcasted_inlined: dict[tuple[int, ...], bytes] = {}
+    if manifest._inlined:
+        n_prepended = len(shape) - manifest._paths.ndim
+        source_shape_padded = (1,) * n_prepended + manifest._paths.shape
+        for key, data in manifest._inlined.items():
+            padded_key = (0,) * n_prepended + key
+            axis_ranges = [
+                range(shape[i]) if source_shape_padded[i] == 1 else (padded_key[i],)
+                for i in range(len(shape))
+            ]
+            for target_key in itertools.product(*axis_ranges):
+                broadcasted_inlined[target_key] = data
+
     return ChunkManifest.from_arrays(
         paths=broadcasted_paths,
         offsets=broadcasted_offsets,
         lengths=broadcasted_lengths,
         validate_paths=False,
+        inlined=broadcasted_inlined if broadcasted_inlined else None,
     )


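The replication logic in the broadcast hunk is the least obvious of the three, so a small worked example may help. This sketch lifts the same steps into a standalone function operating on a plain dict and shape tuples; the name `broadcast_inlined` and its signature are assumptions for illustration, not the library's interface.

```python
import itertools


def broadcast_inlined(inlined, source_shape, target_shape):
    """Replicate sparse inlined entries the way np.broadcast_to expands arrays.

    Hypothetical standalone helper operating on plain tuples and dicts.
    """
    n_prepended = len(target_shape) - len(source_shape)
    # Pad the source shape with leading 1s so it lines up with the target.
    padded_source = (1,) * n_prepended + tuple(source_shape)
    out = {}
    for key, data in inlined.items():
        padded_key = (0,) * n_prepended + tuple(key)
        # Size-1 source axes expand to every target index; all other axes
        # keep the original index.
        axis_ranges = [
            range(target_shape[i]) if padded_source[i] == 1 else (padded_key[i],)
            for i in range(len(target_shape))
        ]
        for target_key in itertools.product(*axis_ranges):
            out[target_key] = data  # same bytes object, not a copy
    return out


payload = b"\x00\x01\x02\x03"
# A 1x2 chunk grid with an inlined chunk at (0, 1), broadcast to 3x1x2.
result = broadcast_inlined({(0, 1): payload}, source_shape=(1, 2), target_shape=(3, 1, 2))
print(sorted(result))
```

Here the single source entry lands at three target positions, one per index along the new leading axis, and every position refers to the same `bytes` object.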