feat: generalize ChunkManifest to hold inline chunks #938
Merged
TomNicholas
merged 26 commits into
zarr-developers:main
from
maxrjones:store-native-chunks
Apr 24, 2026
Changes from 1 commit
54e9b48
feat: generalize ChunkManifest to hold native chunks
maxrjones 1f1ead8
Rename native to inlined
maxrjones 13adb46
Move docs to explanation
maxrjones 8516604
Rename data to inlined_data
maxrjones 04a420f
Better sentinel values
maxrjones e4ebc28
Improve required entry validation
maxrjones 8fc3de3
Add scalar test
maxrjones 9223350
Revert changes that should be a separate PR
maxrjones 6e37005
Merge branch 'main' into store-native-chunks
maxrjones c97cf39
Fix mypy: avoid narrowing StringDType on np.where reassignment
TomNicholas d7b0abd
Revert icechunk writer changes; handle inlined chunks in a follow-up PR
TomNicholas e75c7f7
Move inlined chunks docs into data_structures.md
TomNicholas 83b5c78
Add failing tests for broadcasting manifests with inlined chunks
TomNicholas 0ba0f20
Replicate inlined chunks across expanded axes in broadcast_to
TomNicholas 7e2506d
Add tests for concat and stack with inlined chunks
TomNicholas 90aeeee
Add bytes-identity test for broadcasting inlined chunks
TomNicholas 8e9f8af
Add failing test for ManifestArray equality with differing inlined bytes
TomNicholas 96d8f17
Compare inlined bytes in ChunkManifest.elementwise_eq
TomNicholas c8475e2
Add ManifestStore read tests for inlined chunks
TomNicholas 06345da
Smoke test that to_virtual_variable preserves inlined chunks
TomNicholas 5122636
[pre-commit.ci] auto fixes from pre-commit.com hooks
pre-commit-ci[bot] f328b11
Merge branch 'main' into store-native-chunks
maxrjones a51602c
Merge branch 'main' into store-native-chunks
TomNicholas b3cdfb7
Reject ChunkManifest entries with extra keys
TomNicholas 746c779
Document the three chunk states (virtual, missing, inlined) in a table
TomNicholas de587a9
Add release note for inlined chunks support
TomNicholas
| Original file line number | Diff line number | Diff line change |
|---|---|---|
| @@ -0,0 +1,130 @@ | ||
| # Loading inlined Kerchunk references | ||
|
|
||
| Kerchunk reference files can contain two kinds of chunk references: | ||
|
|
||
| - **Virtual references** point to byte ranges in external files (e.g., `["s3://bucket/data.nc", 1024, 512]`) | ||
| - **Inlined references** embed the raw chunk data directly in the JSON as base64-encoded strings (e.g., `"base64:AAAB..."`) | ||
|
|
||
| Inlined references are common for small variables like coordinate arrays, dimension labels, and scalar metadata. Kerchunk inlines any chunk smaller than a configurable `inline_threshold` (a size in bytes). | ||
|
|
||
| VirtualiZarr can read both kinds of references. Inlined data is stored as **native chunks** directly in the [ChunkManifest][virtualizarr.manifests.ChunkManifest], so it travels with the manifest through concatenation, serialization, and pickling without needing access to any external file. | ||
|
|
||
| ## Roundtrip example | ||
|
|
||
| This example demonstrates that the full pipeline, from NetCDF to kerchunk JSON (with inlined coordinates) and back to an xarray Dataset, produces results identical to loading the NetCDF file directly. | ||
|
|
||
| ### 1. Create a sample NetCDF file | ||
|
|
||
| ```python | ||
| import tempfile, os, json | ||
| import numpy as np | ||
| import xarray as xr | ||
|
|
||
| tmpdir = tempfile.mkdtemp() | ||
| nc_path = os.path.join(tmpdir, "example.nc") | ||
|
|
||
| ds = xr.Dataset( | ||
|     {"temperature": xr.DataArray( | ||
|         np.arange(12, dtype="float32").reshape(3, 4), | ||
|         dims=["time", "x"], | ||
|     )}, | ||
|     coords={ | ||
|         "time": np.array([0, 1, 2], dtype="int64"), | ||
|         "x": np.array([10, 20, 30, 40], dtype="int64"), | ||
|     }, | ||
| ) | ||
| ds.to_netcdf(nc_path, format="NETCDF4") | ||
| ``` | ||
|
|
||
| ### 2. Virtualize and write to kerchunk JSON | ||
|
|
||
| Use the HDF parser to read the NetCDF file. Specify `loadable_variables` for the | ||
| coordinate arrays so they are loaded into memory as numpy arrays. When serialized | ||
| to kerchunk format, these loaded variables are automatically base64-encoded as | ||
| inlined references. | ||
|
|
||
| ```python | ||
| from virtualizarr import open_virtual_dataset | ||
| from virtualizarr.parsers import HDFParser | ||
| from obspec_utils.registry import ObjectStoreRegistry | ||
| from obstore.store import LocalStore | ||
|
|
||
| store = LocalStore(prefix="/") | ||
| registry = ObjectStoreRegistry({"file://": store}) | ||
|
|
||
| with open_virtual_dataset( | ||
|     url=f"file://{nc_path}", | ||
|     registry=registry, | ||
|     parser=HDFParser(), | ||
|     loadable_variables=["time", "x"], | ||
| ) as vds: | ||
|     refs = vds.vz.to_kerchunk(format="dict") | ||
|
|
||
| # Write to disk | ||
| ref_path = os.path.join(tmpdir, "refs.json") | ||
| with open(ref_path, "w") as f: | ||
|     json.dump(refs, f) | ||
| ``` | ||
|
|
||
| The resulting JSON has a mix of virtual and inlined references: | ||
|
|
||
| ```python | ||
| for key, value in refs["refs"].items(): | ||
|     if isinstance(value, str) and value.startswith("base64:"): | ||
|         print(f" Inlined: {key}") | ||
|     elif isinstance(value, list): | ||
|         print(f" Virtual: {key} -> {value[0]}") | ||
| ``` | ||
|
|
||
| ``` | ||
| Inlined: time/0 | ||
| Inlined: x/0 | ||
| Virtual: temperature/0.0 -> /tmp/.../example.nc | ||
| ``` | ||
|
|
||
| ### 3. Load the kerchunk JSON back | ||
|
|
||
| Use the `KerchunkJSONParser` to read the reference file. Inlined data is decoded | ||
| from base64 and stored as native chunks in the manifest. | ||
|
|
||
| ```python | ||
| from virtualizarr.parsers import KerchunkJSONParser | ||
|
|
||
| parser = KerchunkJSONParser() | ||
| manifest_store = parser(url=f"file://{ref_path}", registry=registry) | ||
| ``` | ||
|
|
||
| Open the manifest store as an xarray Dataset via the Zarr engine: | ||
|
|
||
| ```python | ||
| loaded = xr.open_dataset( | ||
|     manifest_store, engine="zarr", consolidated=False, zarr_format=3 | ||
| ).load() | ||
| ``` | ||
|
|
||
| ### 4. Verify the roundtrip | ||
|
|
||
| ```python | ||
| direct = xr.open_dataset(nc_path).load() | ||
| xr.testing.assert_identical(direct, loaded) | ||
| ``` | ||
|
|
||
| The two datasets are identical: coordinate values, data values, attributes, and dtypes all match. | ||
|
|
||
| ## How it works | ||
|
|
||
| When the kerchunk parser encounters a base64-encoded inlined reference, it decodes the bytes and stores them as a **native chunk** on the `ChunkManifest`. Native chunks are held in a sparse dictionary keyed by chunk grid index: | ||
|
|
||
| ```python | ||
| # After parsing, the manifest for 'time' has one native chunk: | ||
| time_manifest = manifest_store._group.arrays["time"].manifest | ||
| print(time_manifest._native) | ||
| # {(0,): b'\x00\x00\x00\x00\x00\x00\x00\x00\x01\x00\x00...'} | ||
| ``` | ||
|
|
||
| Native chunks participate in all manifest operations: | ||
|
|
||
| - **Concatenation and stacking**: indices are shifted to their new positions | ||
| - **Serialization**: included when writing back to kerchunk (re-encoded as base64) or Icechunk (written as real data) | ||
| - **Pickling**: travel with the manifest for distributed workflows (Dask, multiprocessing) | ||
| - **ManifestStore reads**: returned directly from memory without any network or disk I/O |
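
The decoding and index-shifting behaviour described in the diff above can be illustrated with a small standalone sketch. It is not VirtualiZarr's implementation: the helper names `decode_inlined`, `shift_chunks`, and `concat_chunks` are hypothetical, and the sketch assumes only what the doc states, namely that inlined references are `"base64:"`-prefixed strings and that decoded chunks live in a sparse dictionary keyed by chunk grid index.

```python
import base64
import struct

ChunkKey = tuple[int, ...]


def decode_inlined(ref: str) -> bytes:
    """Decode a kerchunk-style "base64:..." string into raw bytes."""
    prefix = "base64:"
    if not ref.startswith(prefix):
        # Assumption: this sketch only handles the base64 form shown in the doc.
        raise ValueError("expected a base64-encoded inlined reference")
    return base64.b64decode(ref[len(prefix):])


def shift_chunks(
    chunks: dict[ChunkKey, bytes], axis: int, offset: int
) -> dict[ChunkKey, bytes]:
    """Return a copy of a sparse chunk dict with keys moved by `offset` along `axis`."""
    return {
        key[:axis] + (key[axis] + offset,) + key[axis + 1:]: data
        for key, data in chunks.items()
    }


def concat_chunks(
    first: dict[ChunkKey, bytes],
    second: dict[ChunkKey, bytes],
    axis: int,
    first_len: int,
) -> dict[ChunkKey, bytes]:
    """Merge two sparse chunk dicts as if their chunk grids were concatenated.

    `first_len` is the number of chunks the first grid has along `axis`.
    """
    return {**first, **shift_chunks(second, axis, first_len)}


# Build an inlined reference for three little-endian int64 values, like `time`.
raw = struct.pack("<3q", 0, 1, 2)
ref = "base64:" + base64.b64encode(raw).decode("ascii")

decoded = decode_inlined(ref)
values = struct.unpack("<3q", decoded)

# Two one-chunk grids, each holding a chunk at index (0,), joined along axis 0.
merged = concat_chunks({(0,): decoded}, {(0,): decoded}, axis=0, first_len=1)
print(values, sorted(merged))
```

Keeping the dictionary sparse means only grid positions that actually hold inlined bytes are touched when keys are shifted; virtual and missing chunks would be tracked elsewhere in the manifest.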