Merged
26 commits
54e9b48
feat: generalize ChunkManifest to hold native chunks
maxrjones Mar 20, 2026
1f1ead8
Rename native to inlined
maxrjones Mar 20, 2026
13adb46
Move docs to explanation
maxrjones Mar 20, 2026
8516604
Rename data to inlined_data
maxrjones Mar 20, 2026
04a420f
Better sentinel values
maxrjones Mar 20, 2026
e4ebc28
Improve required entry validation
maxrjones Mar 20, 2026
8fc3de3
Add scalar test
maxrjones Mar 20, 2026
9223350
Revert changes that should be a separate PR
maxrjones Mar 20, 2026
6e37005
Merge branch 'main' into store-native-chunks
maxrjones Apr 20, 2026
c97cf39
Fix mypy: avoid narrowing StringDType on np.where reassignment
TomNicholas Apr 22, 2026
d7b0abd
Revert icechunk writer changes; handle inlined chunks in a follow-up PR
TomNicholas Apr 23, 2026
e75c7f7
Move inlined chunks docs into data_structures.md
TomNicholas Apr 23, 2026
83b5c78
Add failing tests for broadcasting manifests with inlined chunks
TomNicholas Apr 23, 2026
0ba0f20
Replicate inlined chunks across expanded axes in broadcast_to
TomNicholas Apr 23, 2026
7e2506d
Add tests for concat and stack with inlined chunks
TomNicholas Apr 23, 2026
90aeeee
Add bytes-identity test for broadcasting inlined chunks
TomNicholas Apr 23, 2026
8e9f8af
Add failing test for ManifestArray equality with differing inlined bytes
TomNicholas Apr 23, 2026
96d8f17
Compare inlined bytes in ChunkManifest.elementwise_eq
TomNicholas Apr 23, 2026
c8475e2
Add ManifestStore read tests for inlined chunks
TomNicholas Apr 23, 2026
06345da
Smoke test that to_virtual_variable preserves inlined chunks
TomNicholas Apr 23, 2026
5122636
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] Apr 23, 2026
f328b11
Merge branch 'main' into store-native-chunks
maxrjones Apr 23, 2026
a51602c
Merge branch 'main' into store-native-chunks
TomNicholas Apr 24, 2026
b3cdfb7
Reject ChunkManifest entries with extra keys
TomNicholas Apr 24, 2026
746c779
Document the three chunk states (virtual, missing, inlined) in a table
TomNicholas Apr 24, 2026
de587a9
Add release note for inlined chunks support
TomNicholas Apr 24, 2026
52 changes: 52 additions & 0 deletions docs/data_structures.md
@@ -94,6 +94,58 @@ lengths = np.asarray([100, 100], dtype=np.uint64)
manifest = ChunkManifest.from_arrays(paths=paths, offsets=offsets, lengths=lengths)
```

### Chunk states

Every position in a `ChunkManifest` is in one of three states, distinguished by the value of `path` in its entry:

| State | `path` | Meaning |
|----------|---------------------------------------|----------------------------------------------------------------------------------------------|
| Virtual | a real URI (e.g., `"s3://bucket/foo.nc"`) | Chunk lives at the given byte range in an external file. |
| Missing | `""` (`MISSING_CHUNK_PATH`) | Chunk is absent. Reads return the array's `fill_value`. |
| Inlined | `"__inlined__"` (`INLINED_CHUNK_PATH`)| Raw bytes for the chunk are stored in memory in the manifest's `_inlined` dict (see below). |

Parser authors are free to mix all three states within a single manifest.
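As a rough illustration of the table above, the state of an entry can be read off its `path` alone. The sketch below redefines the two sentinel values locally (copied from the table) instead of importing them, and `chunk_state` is a hypothetical helper written for this example, not a function provided by the library.

```python
# Sentinel values as listed in the table above (redefined locally for the sketch).
MISSING_CHUNK_PATH = ""
INLINED_CHUNK_PATH = "__inlined__"


def chunk_state(path: str) -> str:
    """Classify a manifest entry by its path (hypothetical helper)."""
    if path == MISSING_CHUNK_PATH:
        return "missing"
    if path == INLINED_CHUNK_PATH:
        return "inlined"
    # Anything else is treated as a real URI pointing at an external file.
    return "virtual"


states = [chunk_state(p) for p in ["s3://bucket/foo.nc", "", "__inlined__"]]
print(states)
```

Because the state is encoded in the path itself, a single manifest can hold any mixture of the three without extra bookkeeping.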

### Inlined chunks

So far every chunk in the manifest has pointed to a byte range in some external file.
A `ChunkManifest` can also hold **inlined chunks**: the raw chunk bytes are carried directly inside the manifest itself, rather than referenced from an external file.

Inlined chunks are useful for small variables — coordinate arrays, dimension labels, scalar metadata — where the overhead of a remote read exceeds the cost of just carrying the bytes along.

Inlined chunks are produced by [parsers](custom_parsers.md), not by end users; there is no way to request them via `loadable_variables`. If you are writing a custom parser for a format that stores small inlined references (e.g., Kerchunk JSON), you can emit them using the constructors below.

Internally, inlined chunks live in a sparse dictionary `_inlined: dict[tuple[int, ...], bytes]` on the `ChunkManifest`, keyed by chunk grid index. The corresponding entry in the paths array is set to the `INLINED_CHUNK_PATH` sentinel.
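The relationship between the paths array and the sparse dictionary can be pictured with plain Python objects. This is only a toy model under stated assumptions: the nested list stands in for the NumPy paths array, and `sentinel_positions` is a hypothetical helper, not part of `ChunkManifest`.

```python
INLINED_CHUNK_PATH = "__inlined__"

# Toy stand-ins for the manifest internals: a 1x2 chunk grid of paths
# (a nested list here, not a NumPy array) and the sparse dict of raw bytes.
paths = [["s3://bucket/foo.nc", INLINED_CHUNK_PATH]]
inlined = {(0, 1): b"\x00\x01\x02\x03"}


def sentinel_positions(grid):
    """Return the 2-D grid indices whose path is the inlined sentinel."""
    return {
        (i, j)
        for i, row in enumerate(grid)
        for j, path in enumerate(row)
        if path == INLINED_CHUNK_PATH
    }


# Every sentinel position should have bytes in the dict, and vice versa.
consistent = sentinel_positions(paths) == set(inlined)
# Total size of the bytes carried in memory by this toy manifest.
inlined_nbytes = sum(len(data) for data in inlined.values())
print(consistent, inlined_nbytes)
```

Keeping the bytes in a sparse dict means a manifest with no inlined chunks carries only an empty dict.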

To create a manifest with inlined chunks, pass entries with a `data` key:

```python
from virtualizarr.manifests import ChunkManifest

manifest = ChunkManifest(
    entries={
        "0.0": {"path": "s3://bucket/foo.nc", "offset": 100, "length": 100},
        "0.1": {"path": "", "offset": 0, "length": 4, "data": b"\x00\x01\x02\x03"},
    }
)
```

Or via `from_arrays` with the `inlined` parameter:

```python
import numpy as np
from virtualizarr.manifests import ChunkManifest

manifest = ChunkManifest.from_arrays(
    paths=np.asarray(["s3://bucket/foo.nc", ""], dtype=np.dtypes.StringDType()),
    offsets=np.asarray([100, 0], dtype=np.uint64),
    lengths=np.asarray([100, 4], dtype=np.uint64),
    inlined={(1,): b"\x00\x01\x02\x03"},
)
```

Inlined chunks participate in all manifest operations: concatenation shifts their indices along the concatenation axis, and stacking inserts the new axis into their keys; broadcasting prepends singleton dimensions to their keys and replicates each chunk along every axis that had size 1 in the source; equality compares the inlined bytes; pickling carries the data along (for Dask/multiprocessing); `ManifestStore` reads return them directly from memory; and `nbytes` includes their size.

## `ManifestArray` class

A Zarr array is defined not just by the location of its constituent chunk data, but by its array-level attributes such as `shape` and `dtype`.
4 changes: 4 additions & 0 deletions docs/releases.md
@@ -4,6 +4,10 @@

### New Features

- `ChunkManifest` can now hold inlined chunks — raw chunk bytes carried directly in memory rather than as references to external files. Intended for parser authors (e.g., loading Kerchunk references with inlined data); not exposed via `loadable_variables`.
  ([#938](https://github.com/zarr-developers/VirtualiZarr/pull/938)).
  By [Max Jones](https://github.com/maxrjones) and [Tom Nicholas](https://github.com/TomNicholas).

### Breaking changes

### Bug fixes
41 changes: 41 additions & 0 deletions virtualizarr/manifests/array_api.py
@@ -1,3 +1,4 @@
import itertools
from typing import TYPE_CHECKING, Any, Callable, Union, cast

import numpy as np
@@ -214,11 +215,23 @@ def _concat_manifests(manifests: list[ChunkManifest], axis: int) -> ChunkManifes
    )
    concatenated_offsets = np.concatenate([m._offsets for m in manifests], axis=axis)
    concatenated_lengths = np.concatenate([m._lengths for m in manifests], axis=axis)

    # merge inlined chunk dicts with index shifting along the concat axis
    concatenated_inlined: dict[tuple[int, ...], bytes] = {}
    grid_offset = 0
    for m in manifests:
        for key, data in m._inlined.items():
            shifted = list(key)
            shifted[axis] += grid_offset
            concatenated_inlined[tuple(shifted)] = data
        grid_offset += m._paths.shape[axis]

    return ChunkManifest.from_arrays(
        paths=concatenated_paths,
        offsets=concatenated_offsets,
        lengths=concatenated_lengths,
        validate_paths=False,
        inlined=concatenated_inlined if concatenated_inlined else None,
    )
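To make the index shifting in the hunk above concrete, here is a standalone sketch that mirrors the loop on plain dicts. `shift_inlined_for_concat` and its `grid_lengths` argument are hypothetical names introduced for illustration; `grid_lengths[k]` plays the role of `m._paths.shape[axis]` in the diff.

```python
def shift_inlined_for_concat(inlined_dicts, grid_lengths, axis):
    """Merge per-manifest inlined dicts, shifting keys along `axis`.

    grid_lengths[k] is the chunk-grid length of manifest k along `axis`.
    """
    merged = {}
    grid_offset = 0
    for inlined, length in zip(inlined_dicts, grid_lengths):
        for key, data in inlined.items():
            shifted = list(key)
            shifted[axis] += grid_offset  # move the key past earlier manifests
            merged[tuple(shifted)] = data
        grid_offset += length  # advance by this manifest's extent
    return merged


# Two manifests with chunk grids of shape (1, 2) and (1, 3), concatenated on axis 1.
first = {(0, 1): b"aa"}
second = {(0, 0): b"bb"}
merged = shift_inlined_for_concat([first, second], grid_lengths=[2, 3], axis=1)
print(merged)
```

The offset advances by each manifest's full grid length, not by the number of inlined entries, so keys stay aligned with the concatenated paths array.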


@@ -230,11 +243,21 @@ def _stack_manifests(manifests: list[ChunkManifest], axis: int) -> ChunkManifest
    )
    stacked_offsets = np.stack([m._offsets for m in manifests], axis=axis)
    stacked_lengths = np.stack([m._lengths for m in manifests], axis=axis)

    # merge inlined chunk dicts, inserting the new stacked axis
    stacked_inlined: dict[tuple[int, ...], bytes] = {}
    for i, m in enumerate(manifests):
        for key, data in m._inlined.items():
            shifted = list(key)
            shifted.insert(axis, i)
            stacked_inlined[tuple(shifted)] = data

    return ChunkManifest.from_arrays(
        paths=stacked_paths,
        offsets=stacked_offsets,
        lengths=stacked_lengths,
        validate_paths=False,
        inlined=stacked_inlined if stacked_inlined else None,
    )
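The key rewriting for stacking can be sketched the same way on plain dicts. `insert_axis_for_stack` is a hypothetical name for this illustration, and the sketch assumes `axis` is non-negative; whether the caller normalizes negative axes is not visible in this hunk.

```python
def insert_axis_for_stack(inlined_dicts, axis):
    """Merge inlined dicts, inserting each manifest's position as a new axis."""
    stacked = {}
    for i, inlined in enumerate(inlined_dicts):
        for key, data in inlined.items():
            new_key = list(key)
            new_key.insert(axis, i)  # i is this manifest's index along the new axis
            stacked[tuple(new_key)] = data
    return stacked


first = {(0, 1): b"aa"}
second = {(0, 1): b"bb"}
stacked = insert_axis_for_stack([first, second], axis=0)
print(stacked)

# list.insert with a negative index inserts *before* that position,
# so axis=-1 would not place the new axis last.
key = [0, 1]
key.insert(-1, 9)
print(key)
```

Unlike concatenation, identical keys from different manifests never collide here, because each one gains a distinct index along the new axis.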


@@ -248,11 +271,29 @@ def _broadcast_manifest(
    )
    broadcasted_offsets = np.broadcast_to(manifest._offsets, shape=shape)
    broadcasted_lengths = np.broadcast_to(manifest._lengths, shape=shape)

    # broadcast inlined chunks: prepend singleton dims to each key, then replicate
    # the entry across every target position along any axis that was size 1 in the
    # source (matching np.broadcast_to semantics for the paths/offsets/lengths arrays).
    broadcasted_inlined: dict[tuple[int, ...], bytes] = {}
    if manifest._inlined:
        n_prepended = len(shape) - manifest._paths.ndim
        source_shape_padded = (1,) * n_prepended + manifest._paths.shape
        for key, data in manifest._inlined.items():
            padded_key = (0,) * n_prepended + key
            axis_ranges = [
                range(shape[i]) if source_shape_padded[i] == 1 else (padded_key[i],)
                for i in range(len(shape))
            ]
            for target_key in itertools.product(*axis_ranges):
                broadcasted_inlined[target_key] = data

    return ChunkManifest.from_arrays(
        paths=broadcasted_paths,
        offsets=broadcasted_offsets,
        lengths=broadcasted_lengths,
        validate_paths=False,
        inlined=broadcasted_inlined if broadcasted_inlined else None,
    )
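A small worked example may help in checking the replication logic above. The function below is a standalone restatement of the same loop on plain tuples and dicts; `broadcast_inlined` and its parameter names are hypothetical, and it assumes the source and target shapes are already broadcast-compatible.

```python
import itertools


def broadcast_inlined(inlined, source_shape, target_shape):
    """Replicate inlined entries the way np.broadcast_to replicates array values."""
    n_prepended = len(target_shape) - len(source_shape)
    padded_shape = (1,) * n_prepended + tuple(source_shape)
    out = {}
    for key, data in inlined.items():
        padded_key = (0,) * n_prepended + key
        # Size-1 source axes expand over the whole target axis; others keep their index.
        axis_ranges = [
            range(target_shape[i]) if padded_shape[i] == 1 else (padded_key[i],)
            for i in range(len(target_shape))
        ]
        for target_key in itertools.product(*axis_ranges):
            out[target_key] = data  # same bytes object, not a copy
    return out


# A 1-D grid of 2 chunks, with chunk 1 inlined, broadcast to a (3, 2) grid.
result = broadcast_inlined({(1,): b"xy"}, source_shape=(2,), target_shape=(3, 2))
print(sorted(result))
```

Each replicated key refers to the same `bytes` object, so broadcasting multiplies the number of dict entries but does not copy the underlying data.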

