Version: 6
Design document for adding rectilinear (variable) chunk grid support to zarr-python, conforming to the rectilinear chunk grid extension spec.
Related:
- #3750 (single ChunkGrid proposal)
- #3534 (rectilinear implementation)
- #3735 (chunk grid module/registry)
- ZEP0003 (variable chunking spec)
- zarr-specs#370 (sharding v1.1: non-divisible subchunks)
- zarr-extensions#25 (rectilinear extension)
- zarr-extensions#34 (sharding + rectilinear)
Chunk grids form a hierarchy — the rectilinear grid is strictly more general than the regular grid. Any regular grid is expressible as a rectilinear grid. There is no known chunk grid that is both (a) more general than rectilinear and (b) retains the axis-aligned tessellation properties Zarr assumes. All known grids are special cases:
| Grid type | Description | Example |
|---|---|---|
| Regular | Uniform chunk size, boundary chunks padded with fill_value | [10, 10, 10, 10] |
| Regular-bounded (zarrs) | Uniform chunk size, boundary chunks trimmed to array extent | [10, 10, 10, 5] |
| HPC boundary-padded | Regular interior, larger boundary chunks (VirtualiZarr#217) | [10, 8, 8, 8, 10] |
| Fully variable | Arbitrary per-chunk sizes | [5, 12, 3, 20] |
Prior iterations of the chunk grid design followed the Zarr V3 spec's framing of chunk grids as an extension point alongside codecs, dtypes, etc., so we initially designed the implementation around a similar registry-based approach. In practice, however, chunk grids are fundamentally different from codecs. Codecs are independent: supporting zstd tells you nothing about gzip. Chunk grids are not: every regular grid is a valid rectilinear grid. A registry-based plugin system makes sense for codecs but adds complexity without clear benefit for chunk grids. Here we start from basic goals and propose a more fitting design for supporting different chunk grids in zarr-python.
- Follow the zarr extension proposal. The implementation should conform to the rectilinear chunk grid spec, not innovate on the metadata format.
- Minimize changes to the public API. Users creating regular arrays should see no difference. Rectilinear is additive.
- Maintain backwards compatibility. Existing code using `RegularChunkGrid`, `.chunks`, or `isinstance` checks should continue to work (with deprecation warnings where appropriate).
- Design for future iteration. The internal architecture should allow refactoring (e.g., metadata/array separation, new dimension types) without breaking the public API.
- Minimize downstream changes. xarray, VirtualiZarr, Icechunk, Cubed, etc. should need minimal updates.
- Minimize time to stable release. Ship behind a feature flag, stabilize through real-world usage, promote to stable API.
- The new API should be useful. `read_chunk_sizes`/`write_chunk_sizes`, `ChunkGrid.__getitem__`, `is_regular` — these should solve real problems, not just expose internals.
- Extensible for other serialization structures. The per-dimension design should support future encodings (tile, temporal) without changes to indexing or codecs.
- A chunk grid is a concrete arrangement of chunks, not an abstract tiling pattern. This means the chunk grid is bound to specific array dimensions, which lets it answer any question about any chunk (offset, size, count) without external parameters.
- One implementation, multiple serialization forms. A single `ChunkGrid` class handles all chunking logic. The serialization format (`"regular"` vs `"rectilinear"`) is chosen by the metadata layer, not the grid.
- No chunk grid registry. Simple name-based dispatch in `parse_chunk_grid()`.
- Fixed vs Varying per dimension. `FixedDimension(size, extent)` for uniform chunks; `VaryingDimension(edges, extent)` for per-chunk edge lengths with precomputed prefix sums. Avoids expanding regular dimensions into lists of identical values.
- Transparent transitions. Operations like `resize()` can move an array from regular to rectilinear chunking.
```python
from __future__ import annotations

import bisect
from dataclasses import dataclass
from typing import Iterable, Sequence

import numpy as np
from numpy.typing import NDArray

# ceildiv(a, b) == -(-a // b)


@dataclass(frozen=True)
class FixedDimension:
    """Uniform chunk size. Boundary chunks contain less data but are
    encoded at full size by the codec pipeline."""

    size: int    # chunk edge length (>= 0)
    extent: int  # array dimension length

    def __post_init__(self) -> None:
        # validates size >= 0 and extent >= 0
        ...

    @property
    def nchunks(self) -> int:
        if self.size == 0:
            return 0
        return ceildiv(self.extent, self.size)

    def index_to_chunk(self, idx: int) -> int:
        return idx // self.size  # raises IndexError if OOB

    def chunk_offset(self, chunk_ix: int) -> int:
        return chunk_ix * self.size  # raises IndexError if OOB

    def chunk_size(self, chunk_ix: int) -> int:
        return self.size  # always uniform; raises IndexError if OOB

    def data_size(self, chunk_ix: int) -> int:
        return max(0, min(self.size, self.extent - chunk_ix * self.size))  # raises IndexError if OOB

    @property
    def unique_edge_lengths(self) -> Iterable[int]:
        return (self.size,)  # O(1)

    def indices_to_chunks(self, indices: NDArray) -> NDArray:
        return indices // self.size

    def with_extent(self, new_extent: int) -> FixedDimension:
        return FixedDimension(size=self.size, extent=new_extent)

    def resize(self, new_extent: int) -> FixedDimension:
        return FixedDimension(size=self.size, extent=new_extent)
```
```python
@dataclass(frozen=True)
class VaryingDimension:
    """Explicit per-chunk sizes. The last chunk may extend past the array
    extent, in which case data_size clips to the valid region while
    chunk_size returns the full edge length for codec processing."""

    edges: tuple[int, ...]       # per-chunk edge lengths (all > 0)
    cumulative: tuple[int, ...]  # prefix sums for O(log n) lookup
    extent: int                  # array dimension length (may be < sum(edges))

    def __init__(self, edges: Sequence[int], extent: int) -> None:
        # validates edges non-empty, all > 0, extent >= 0, extent <= sum(edges)
        # computes cumulative via itertools.accumulate
        # uses object.__setattr__ for frozen dataclass
        ...

    @property
    def nchunks(self) -> int:
        # number of chunks that overlap [0, extent)
        if self.extent == 0:
            return 0
        return bisect.bisect_left(self.cumulative, self.extent) + 1

    @property
    def ngridcells(self) -> int:
        return len(self.edges)

    def index_to_chunk(self, idx: int) -> int:
        return bisect.bisect_right(self.cumulative, idx)  # raises IndexError if OOB

    def chunk_offset(self, chunk_ix: int) -> int:
        return self.cumulative[chunk_ix - 1] if chunk_ix > 0 else 0  # raises IndexError if OOB

    def chunk_size(self, chunk_ix: int) -> int:
        return self.edges[chunk_ix]  # raises IndexError if OOB

    def data_size(self, chunk_ix: int) -> int:
        offset = self.chunk_offset(chunk_ix)
        return max(0, min(self.edges[chunk_ix], self.extent - offset))  # raises IndexError if OOB

    @property
    def unique_edge_lengths(self) -> Iterable[int]:
        # lazy generator: yields unseen values, short-circuits deduplication
        ...

    def indices_to_chunks(self, indices: NDArray) -> NDArray:
        return np.searchsorted(self.cumulative, indices, side='right')

    def with_extent(self, new_extent: int) -> VaryingDimension:
        # validates cumulative[-1] >= new_extent (O(1)), re-binds extent
        return VaryingDimension(self.edges, extent=new_extent)

    def resize(self, new_extent: int) -> VaryingDimension:
        # grow past edge sum: append chunk of size (new_extent - sum(edges))
        # shrink or grow within edge sum: preserve all edges, re-bind extent
        ...
```

Both types implement the `DimensionGrid` protocol: `nchunks`, `extent`, `index_to_chunk`, `chunk_offset`, `chunk_size`, `data_size`, `indices_to_chunks`, `unique_edge_lengths`, `with_extent`, `resize`. Memory usage scales with the number of varying dimensions, not total chunks.
All per-chunk methods (chunk_offset, chunk_size, data_size) raise IndexError for out-of-bounds chunk indices, providing consistent fail-fast behavior across both dimension types.
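The prefix-sum lookup underlying the varying dimension can be exercised with a minimal stand-alone sketch; the function names here are illustrative, not the zarr-python API:

```python
import bisect
from itertools import accumulate

# Illustrative sketch of the varying-dimension lookups: prefix sums over
# per-chunk edge lengths, binary search to map array indices to chunks.
edges = (10, 20, 30)
cumulative = tuple(accumulate(edges))  # (10, 30, 60)

def index_to_chunk(idx: int) -> int:
    # Which chunk contains array index idx? bisect_right over prefix sums.
    return bisect.bisect_right(cumulative, idx)

def chunk_offset(chunk_ix: int) -> int:
    # Start coordinate of a chunk: previous prefix sum (0 for the first).
    return cumulative[chunk_ix - 1] if chunk_ix > 0 else 0

print(index_to_chunk(0))   # 0 — chunk 0 covers [0, 10)
print(index_to_chunk(10))  # 1 — chunk 1 covers [10, 30)
print(chunk_offset(2))     # 30 — chunk 2 starts where chunk 1 ends
```

The O(log n) cost is paid only on varying dimensions; fixed dimensions keep integer division.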
The two size methods serve different consumers:
| Method | Returns | Consumer |
|---|---|---|
| `chunk_size` | Buffer size for codec processing | Codec pipeline (`ArraySpec.shape`) |
| `data_size` | Valid data region within the buffer | Indexing pipeline (`chunk_selection` slicing) |
For FixedDimension, these differ only at the boundary. For VaryingDimension, these differ only when the last chunk extends past the extent (i.e., extent < sum(edges)). This matches current zarr-python behavior: get_chunk_spec passes the full chunk_shape to the codec for all chunks, and the indexer generates a chunk_selection that clips the decoded buffer.
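A tiny numeric sketch of the distinction, using hypothetical free functions that mirror `FixedDimension` with `size=10`, `extent=95`:

```python
def chunk_size(size: int) -> int:
    # Codec buffer size: always the declared chunk size, even at the boundary.
    return size

def data_size(size: int, extent: int, chunk_ix: int) -> int:
    # Valid data region: clipped to the array extent at the boundary.
    return max(0, min(size, extent - chunk_ix * size))

# size=10, extent=95: ten chunks, the last holds only 5 valid elements
print(chunk_size(10))        # 10 — codec always processes a full buffer
print(data_size(10, 95, 4))  # 10 — interior chunk
print(data_size(10, 95, 9))  # 5  — boundary chunk, clipped
```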
```python
@runtime_checkable
class DimensionGrid(Protocol):
    """Structural interface shared by FixedDimension and VaryingDimension."""

    @property
    def nchunks(self) -> int: ...
    @property
    def ngridcells(self) -> int: ...
    @property
    def extent(self) -> int: ...
    def index_to_chunk(self, idx: int) -> int: ...
    def chunk_offset(self, chunk_ix: int) -> int: ...  # raises IndexError if OOB
    def chunk_size(self, chunk_ix: int) -> int: ...    # raises IndexError if OOB
    def data_size(self, chunk_ix: int) -> int: ...     # raises IndexError if OOB
    def indices_to_chunks(self, indices: NDArray[np.intp]) -> NDArray[np.intp]: ...
    @property
    def unique_edge_lengths(self) -> Iterable[int]: ...
    def with_extent(self, new_extent: int) -> DimensionGrid: ...
    def resize(self, new_extent: int) -> DimensionGrid: ...
```

The protocol is `@runtime_checkable`, so callers can test `isinstance(dim, DimensionGrid)` and handle both dimension types polymorphically, without `isinstance` checks against the concrete classes.
nchunks and ngridcells differ when extent < sum(edges): nchunks counts only chunks that overlap [0, extent), while ngridcells counts total defined grid cells (i.e., len(edges)). For FixedDimension, both are equal. For VaryingDimension, they differ after a resize that shrinks the extent below the edge sum.
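A small numeric sketch of the divergence after a shrink (illustrative values, not the zarr-python API):

```python
import bisect
from itertools import accumulate

# Three grid cells were defined, then the array was resized down to 15:
# the trailing edges survive, so grid cells outnumber live chunks.
edges = (10, 10, 10)
extent = 15
cumulative = tuple(accumulate(edges))  # (10, 20, 30)

ngridcells = len(edges)                                                      # 3
nchunks = 0 if extent == 0 else bisect.bisect_left(cumulative, extent) + 1   # 2

print(ngridcells, nchunks)  # 3 2 — only chunks 0 and 1 overlap [0, 15)
```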
```python
@dataclass(frozen=True)
class ChunkSpec:
    slices: tuple[slice, ...]       # valid data region in array coordinates
    codec_shape: tuple[int, ...]    # buffer shape for codec processing

    @property
    def shape(self) -> tuple[int, ...]:
        return tuple(s.stop - s.start for s in self.slices)

    @property
    def is_boundary(self) -> bool:
        return self.shape != self.codec_shape
```

For interior chunks, shape == codec_shape. For boundary chunks of a regular grid, codec_shape is the full declared chunk size while shape is clipped. For rectilinear grids, shape == codec_shape unless the last chunk extends past the extent.
```python
# Creating arrays
arr = zarr.create_array(shape=(100, 200), chunks=(10, 20))                              # regular
arr_rect = zarr.create_array(shape=(60, 100), chunks=[[10, 20, 30], [25, 25, 25, 25]])  # rectilinear

# ChunkGrid as a collection (examples below use the regular array)
grid = arr.chunk_grid   # behavioral ChunkGrid (bound to array shape)
grid.grid_shape         # (10, 10) — number of chunks per dimension
grid.ndim               # 2
grid.is_regular         # True if all dimensions are Fixed

spec = grid[0, 1]       # ChunkSpec for chunk at grid position (0, 1)
spec.slices             # (slice(0, 10), slice(20, 40))
spec.shape              # (10, 20) — data shape
spec.codec_shape        # (10, 20) — same for interior chunks

boundary = grid[9, 0]   # last chunk along dim 0 (extent=100, size=10)
boundary.shape          # (10, 20) — extent divides evenly, so no clipping here
boundary.codec_shape    # (10, 20) — codec always sees the full buffer

grid[99, 99]            # None — out of bounds

for spec in grid:       # iterate all chunks
    ...

# .chunks property: retained for regular grids, raises NotImplementedError for rectilinear
arr.chunks              # (10, 20)

# .read_chunk_sizes / .write_chunk_sizes: works for all grids (dask-style)
arr.write_chunk_sizes   # ((10, 10, ..., 10), (20, 20, ..., 20))
```

`ChunkGrid.__getitem__` constructs `ChunkSpec` using `chunk_size` for `codec_shape` and `data_size` for `slices`:
```python
def __getitem__(self, coords: int | tuple[int, ...]) -> ChunkSpec | None:
    if isinstance(coords, int):
        coords = (coords,)
    slices = []
    codec_shape = []
    for dim, ix in zip(self.dimensions, coords):
        if ix < 0 or ix >= dim.nchunks:
            return None
        offset = dim.chunk_offset(ix)
        slices.append(slice(offset, offset + dim.data_size(ix)))
        codec_shape.append(dim.chunk_size(ix))
    return ChunkSpec(tuple(slices), tuple(codec_shape))
```

Both `from_regular` and `from_rectilinear` require `array_shape`, binding the extent per dimension at construction time. This is a core design choice: a chunk grid is a concrete arrangement for a specific array, not an abstract tiling pattern.
```python
# Regular grid — all FixedDimension
grid = ChunkGrid.from_regular(array_shape=(100, 200), chunk_shape=(10, 20))

# Rectilinear grid — extent = sum(edges) when shape matches
grid = ChunkGrid.from_rectilinear([[10, 20, 30], [25, 25, 25, 25]], array_shape=(60, 100))

# Rectilinear grid with boundary clipping — last chunk extends past array extent
# e.g., shape=(55, 90) but edges sum to (60, 100): data_size clips at extent
grid = ChunkGrid.from_rectilinear([[10, 20, 30], [25, 25, 25, 25]], array_shape=(55, 90))

# Direct construction
grid = ChunkGrid(dimensions=(FixedDimension(10, 100), VaryingDimension([10, 20, 30], 55)))
```

When extent < sum(edges), the dimension is always stored as `VaryingDimension` (even if all edges are identical) to preserve the explicit edge count. The last chunk's `chunk_size` returns the full declared edge (codec buffer) while `data_size` clips to the extent. This mirrors how `FixedDimension` handles boundary chunks in regular grids.
```
# Regular grid:
{"name": "regular", "configuration": {"chunk_shape": [10, 20]}}

# Rectilinear grid (with RLE compression and "kind" field):
{"name": "rectilinear", "configuration": {"kind": "inline", "chunk_shapes": [[10, 20, 30], [[25, 4]]]}}
```

Both names deserialize to the same `ChunkGrid` class. The serialized form does not include the array extent — that comes from `shape` in array metadata and is passed to `parse_chunk_grid()` at construction time.
The ChunkGrid does not serialize itself. The format choice ("regular" vs "rectilinear") belongs to ArrayV3Metadata. The name is inferred from the chunk grid metadata DTO type (RegularChunkGrid → "regular", RectilinearChunkGrid → "rectilinear") or from grid.is_regular when a behavioral ChunkGrid is passed directly.
For create_array, the format is inferred from the chunks argument: a flat tuple produces "regular", a nested list produces "rectilinear". The _is_rectilinear_chunks() helper detects nested sequences like [[10, 20], [5, 5]].
The rectilinear format requires "kind": "inline" (validated by _validate_rectilinear_kind()). Per the spec, each element of chunk_shapes can be:
- A bare integer `m`: repeated until `sum >= array_extent`
- A list of bare integers: explicit per-chunk sizes
- A mixed array of bare integers and `[value, count]` RLE pairs
RLE compression is used when serializing: runs of identical sizes become [value, count] pairs, singletons stay as bare integers.
```python
# _compress_rle([10, 10, 10, 5]) -> [[10, 3], 5]
# _expand_rle([[10, 3], 5])      -> [10, 10, 10, 5]
```

For `FixedDimension` serialized as rectilinear, `_serialize_fixed_dim()` returns the bare integer `dim.size`. Per the rectilinear spec, a bare integer is repeated until the sum >= extent, preserving the full codec buffer size for boundary chunks.
Zero-extent handling: Regular grids serialize zero-extent dimensions without issue (the format encodes only chunk_shape, no edges). Rectilinear grids reject zero-extent dimensions because the spec requires at least one positive-integer edge length per axis. This asymmetry is intentional and spec-compliant — documented in serialize_chunk_grid().
The read_chunk_sizes and write_chunk_sizes properties provide universal access to per-dimension chunk data sizes, matching the dask Array.chunks convention. They work for both regular and rectilinear grids:
- `write_chunk_sizes`: always returns outer (storage) chunk sizes
- `read_chunk_sizes`: returns inner chunk sizes when sharding is used, otherwise same as `write_chunk_sizes`
```python
>>> arr = zarr.create_array(store, shape=(100, 80), chunks=(30, 40))
>>> arr.write_chunk_sizes
((30, 30, 30, 10), (40, 40))

>>> arr = zarr.create_array(store, shape=(60, 100), chunks=[[10, 20, 30], [50, 50]])
>>> arr.write_chunk_sizes
((10, 20, 30), (50, 50))
```

The underlying `ChunkGrid.chunk_sizes` property (on the grid, not the array) returns the same as `write_chunk_sizes`.
```python
arr.resize((80, 100))   # re-binds extent; FixedDimension stays fixed
arr.resize((200, 100))  # VaryingDimension grows by appending a new chunk
arr.resize((30, 100))   # VaryingDimension shrinks: preserves all edges, re-binds extent
```

Resize uses `ChunkGrid.update_shape(new_shape)`, which delegates to each dimension's `.resize()` method:
- `FixedDimension.resize()`: simply re-binds the extent (identical to `with_extent`)
- `VaryingDimension.resize()`: grow past `sum(edges)` appends a chunk covering the gap; shrink or grow within `sum(edges)` preserves all edges and re-binds the extent (the spec allows trailing edges beyond the array extent)
Known limitation (deferred): When growing a VaryingDimension, the current implementation always appends a single chunk covering the new region. For example, [10, 10, 10] resized from 30 to 45 produces [10, 10, 10, 15] instead of the more natural [10, 10, 10, 10, 10]. A future improvement should add an optional chunks parameter to resize() that controls how the new region is partitioned, with a sane default (e.g., repeating the last chunk size). This is safely deferrable because:
- `FixedDimension` already handles resize correctly (regular grids stay regular)
- The single-chunk default produces valid state, just suboptimal chunk layout
- Rectilinear arrays are behind an experimental feature flag
- Adding an optional parameter is backwards-compatible
Open design questions for the chunks parameter:
- Does it describe the new region only, or the entire post-resize array?
- Must the overlapping portion agree with existing chunks (no rechunking)?
- What is the type? Same as `chunks` in `create_array`?
The from_array() function handles both regular and rectilinear source arrays:
```python
src = zarr.create_array(store, shape=(60, 100), chunks=[[10, 20, 30], [50, 50]])
new = zarr.from_array(data=src, store=new_store, chunks="keep")
# Preserves rectilinear structure: new.write_chunk_sizes == ((10, 20, 30), (50, 50))
```

When `chunks="keep"`, the logic checks `data.chunk_grid.is_regular`:
- Regular: extracts `data.chunks` (flat tuple) and preserves shards
- Rectilinear: extracts `data.write_chunk_sizes` (nested tuples) and forces shards to `None`
The indexing pipeline is coupled to regular grid assumptions — every per-dimension indexer takes a scalar dim_chunk_len: int and uses // and *:
```python
dim_chunk_ix = self.dim_sel // self.dim_chunk_len  # IntDimIndexer
dim_offset = dim_chunk_ix * self.dim_chunk_len     # SliceDimIndexer
```

Replace `dim_chunk_len: int` with the dimension object (`FixedDimension | VaryingDimension`). The shared interface means the indexer code structure stays the same — `dim_sel // dim_chunk_len` becomes `dim_grid.index_to_chunk(dim_sel)`. O(1) for regular, binary search for varying.
Today, get_chunk_spec() returns the same ArraySpec(shape=chunk_grid.chunk_shape) for every chunk. For rectilinear grids, each chunk has a different codec shape:
```python
def get_chunk_spec(self, chunk_coords, array_config, prototype) -> ArraySpec:
    spec = self.chunk_grid[chunk_coords]
    return ArraySpec(shape=spec.codec_shape, ...)
```

Note `spec.codec_shape`, not `spec.shape`. For regular grids, `codec_shape` is uniform (preserving current behavior). The boundary clipping flow is unchanged:
Write: user data → pad to codec_shape with fill_value → encode → store
Read: store → decode to codec_shape → slice via chunk_selection → user data
The ShardingCodec constructs a ChunkGrid per shard using the shard shape as extent and the subchunk shape as FixedDimension. Each shard is self-contained — it doesn't need to know whether the outer grid is regular or rectilinear. Validation checks that every unique edge length per dimension is divisible by the inner chunk size, using dim.unique_edge_lengths for efficient polymorphic iteration (O(1) for fixed dimensions, lazy-deduplicated for varying).
Level 1 — Outer chunk grid (shard boundaries): regular or rectilinear
Level 2 — Inner subchunk grid (within each shard): always regular
Level 3 — Shard index: ceil(shard_dim / subchunk_dim) entries per dimension
zarr-specs#370 lifts the requirement that subchunk shapes evenly divide the shard shape. With the proposed ChunkGrid, this just means removing the shard_shape % subchunk_shape == 0 validation — FixedDimension already handles boundary clipping via data_size.
| Outer grid | Subchunk divisibility | Required change |
|---|---|---|
| Regular | Evenly divides (v1.0) | None |
| Regular | Non-divisible (v1.1) | Remove divisibility validation |
| Rectilinear | Evenly divides | Remove "sharding incompatible" guard |
| Rectilinear | Non-divisible | Both changes |
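The divisibility validation described above reduces to a check over each dimension's distinct edge lengths. A sketch (illustrative function name, taking an iterable of unique edge lengths as `dim.unique_edge_lengths` would yield):

```python
def inner_chunks_divide(unique_edge_lengths, inner: int) -> bool:
    # Sharding v1.0 constraint: every distinct outer edge length along a
    # dimension must be divisible by the inner (subchunk) size.
    return all(edge % inner == 0 for edge in unique_edge_lengths)

print(inner_chunks_divide({10, 20, 30}, 5))  # True  — all multiples of 5
print(inner_chunks_divide({10, 15}, 4))      # False — 10 % 4 != 0
```

Under sharding v1.1 (zarr-specs#370) this check would simply be dropped, since boundary clipping already handles non-divisible subchunks.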
| Current | Proposed |
|---|---|
| `ChunkGrid` ABC + `RegularChunkGrid` subclass | Single concrete `ChunkGrid` with `is_regular` |
| `RectilinearChunkGrid` (#3534) | Same `ChunkGrid` class |
| Chunk grid registry + entrypoints (#3735) | Direct name dispatch |
| `arr.chunks` | Retained for regular; `arr.read_chunk_sizes`/`arr.write_chunk_sizes` for general use |
| `get_chunk_shape(shape, coord)` | `grid[coord].codec_shape` or `grid[coord].shape` |
The chunk grid is a concrete arrangement, not an abstract tiling pattern. A finite collection naturally has an extent. Storing it enables __getitem__, eliminates dim_len parameters from every method, and makes the grid self-describing.
This does not mean ArrayV3Metadata.shape should delegate to the grid. The array shape remains an independent field in metadata. The extent is passed into the grid at construction time so it can answer boundary questions without external parameters. It is not serialized as part of the chunk grid JSON — it comes from the shape field in array metadata and is passed to parse_chunk_grid().
A chunk in a regular grid has two sizes. chunk_size is the buffer size the codec processes — always size for FixedDimension, even at the boundary (padded with fill_value). data_size is the valid data region — clipped to extent % size at the boundary. The indexing layer uses data_size to generate chunk_selection slices.
This matches current zarr-python behavior and matters for:
- Backward compatibility. Existing stores have boundary chunks encoded at full `chunk_shape`.
- Codec simplicity. Codecs assume uniform input shapes for regular grids.
- Shard index correctness. The index assumes `subchunk_dim`-sized entries.
For VaryingDimension, `chunk_size == data_size` when `extent == sum(edges)`. When `extent < sum(edges)` (e.g., after a resize that keeps the last chunk oversized), `data_size` clips the last chunk. This is the fundamental difference: `FixedDimension` has a declared size plus an extent that clips data; `VaryingDimension` has explicit sizes that normally sum to the extent but can also extend past it.
There is no known chunk grid outside the rectilinear family that retains the tessellation properties zarr-python assumes. A match on the grid name is sufficient.
Discussed in #3534. @d-v-b argued that RegularChunkGrid is unnecessary since rectilinear is more general; @dcherian argued that downstream libraries need a fast way to detect regular grids without inspecting potentially millions of chunk edges (see xarray#9808).
The resolution: a single ChunkGrid class with an is_regular property (O(1), cached at construction). This gives downstream code the fast-path detection @dcherian needed without the class hierarchy complexity @d-v-b wanted to avoid. The metadata document's name field ("regular" vs "rectilinear") is also available for clients who inspect JSON directly.
A RegularChunkGrid deprecation shim preserves isinstance checks for existing code — see Backwards compatibility.
The old design had ChunkGrid as an ABC with RegularChunkGrid as a subclass. #3534 added RectilinearChunkGrid as a second subclass. This branch makes ChunkGrid a single concrete class instead.
All known grids are special cases of rectilinear, so there's no need for a class hierarchy at the grid level. A ChunkGrid Protocol/ABC would mean every caller programs against an abstract interface and adding a grid type requires implementing ~15 methods. A single class is simpler.
Note: the dimension types (FixedDimension, VaryingDimension) do use a DimensionGrid Protocol — that's where the polymorphism lives. The grid-level class is concrete; the dimension-level types are polymorphic. If a genuinely novel grid type emerges that can't be expressed as a combination of per-dimension types, a grid-level Protocol can be extracted.
Debated in #3534. @d-v-b suggested making .chunks return tuple[tuple[int, ...], ...] (dask-style) for all grids. @dcherian strongly objected: every downstream consumer expects tuple[int, ...], and silently returning a different type would be worse than raising. Materializing O(10M) chunk edges into a Python tuple is also a real performance risk (xarray#8902).
The resolution:
- `.chunks` is retained for regular grids (returns `tuple[int, ...]` as before)
- `.chunks` raises `NotImplementedError` for rectilinear grids with a message pointing to `.read_chunk_sizes`/`.write_chunk_sizes`
- `.read_chunk_sizes` and `.write_chunk_sizes` return `tuple[tuple[int, ...], ...]` (dask convention) for all grids
@maxrjones noted in review that deprecating .chunks for regular grids was not desirable. The current branch does not deprecate it.
@d-v-b raised in #3534 that users need a way to say "these chunks are regular, but serialize as rectilinear" (e.g., to allow future append/extend workflows without format changes). @jhamman initially made nested-list input always produce RectilinearChunkGrid.
The current branch resolves this via _infer_chunk_grid_name(), which extracts or infers the serialization name from the chunk grid input. When metadata is deserialized, the original name (from {"name": "regular"} or {"name": "rectilinear"}) flows through to serialize_chunk_grid() at write time. When a ChunkGrid is passed directly, the name is inferred from grid.is_regular. Current inference behavior:
- `chunks=(10, 20)` (flat tuple) → infers `"regular"`
- `chunks=[[10, 20], [5, 5]]` (nested lists with varying sizes) → infers `"rectilinear"`
- `chunks=[[10, 10], [20, 20]]` (nested lists with uniform sizes) → `from_rectilinear` collapses to `FixedDimension`, so `is_regular=True` and infers `"regular"`
Open question: Should uniform nested lists preserve "rectilinear" to support future append workflows without a format change? This could be addressed by checking the input form before collapsing, or by allowing users to pass chunk_grid_name explicitly through the create_array API.
#3750 discussion identified periodic chunk patterns as a use case not efficiently served by RLE alone. RLE compresses runs of identical values (np.repeat), but periodic patterns like days-per-month ([31, 28, 31, 30, ...] repeated 30 years) need a tile encoding (np.tile). Real-world examples include:
- Oceanographic models (ROMS): HPC boundary-padded chunks like `[10, 8, 8, 8, 10]` — handled by RLE
- Temporal axes: days-per-month, hours-per-day — need tile encoding for compact metadata
- Temporal-aware grids: date/time-aware chunk grids that layer over other axes (raised by @LDeakin)
A TiledDimension prototype was built (commit 9c0f582) demonstrating that the per-dimension design supports this without changes to indexing or the codec pipeline. However, it was intentionally excluded from this release because:
- Metadata format must come first. Tile encoding requires a new `kind` value in the rectilinear spec (currently only `"inline"` is defined). This should go through zarr-extensions#25, not zarr-python unilaterally.
- The per-dimension architecture doesn't preclude it. A future `TiledDimension` can implement the `DimensionGrid` protocol alongside `FixedDimension` and `VaryingDimension` with no changes to indexing, codecs, or the `ChunkGrid` class.
- RLE covers the MVP. Most real-world variable chunk patterns (HPC boundaries, irregular partitions) are efficiently encoded with RLE. Tile encoding is an optimization for a specific (temporal) subset.
An earlier design doc proposed decoupling ChunkGrid (behavioral) from ArrayV3Metadata (data), so that metadata would store only a plain dict and the array layer would construct the ChunkGrid.
The current implementation partially realizes this separation:
- Metadata DTOs (`RegularChunkGrid`, `RectilinearChunkGrid` in `metadata/v3.py`): Pure data, frozen dataclasses, no array shape. These live on `ArrayV3Metadata.chunk_grid` and represent only what goes into `zarr.json`.
- Behavioral `ChunkGrid` (`chunk_grids.py`): Shape-bound, supports indexing, iteration, and chunk specs. Lives on `AsyncArray.chunk_grid`, constructed from metadata + `shape` via `ChunkGrid.from_metadata()`.
This means ArrayV3Metadata.chunk_grid is now a ChunkGridMetadata (the DTO union type), not the behavioral ChunkGrid. Code that previously accessed behavioral methods on metadata.chunk_grid (e.g., all_chunk_coords(), __getitem__) must now use the behavioral grid from the array layer instead.
The name controls serialization format; serialize_chunk_grid() is called by ArrayV3Metadata.to_dict(). The behavioral grid handles all runtime queries.
zarrs (Rust): Three independent grid types behind a ChunkGridTraits trait. Key patterns adopted: Fixed vs Varying per dimension, prefix sums + binary search, Option<T> for out-of-bounds, NonZeroU64 for chunk dimensions, separate subchunk grid per shard, array shape at construction.
TensorStore (C++): Stores only chunk_shape — boundary clipping via valid_data_bounds at query time. Both RegularGridRef and IrregularGrid internally. No registry.
A RegularChunkGrid deprecation shim preserves the three common usage patterns:
```python
from zarr.core.chunk_grids import RegularChunkGrid  # works (no ImportError)

# Construction emits DeprecationWarning, returns a real ChunkGrid
grid = RegularChunkGrid(chunk_shape=(10, 20))

# isinstance works via __instancecheck__ metaclass
isinstance(grid, RegularChunkGrid)  # True for any regular ChunkGrid
```

The shim uses `chunk_shape` as extent (matching the old shape-unaware behavior). The deprecation warning directs users to `ChunkGrid.from_regular()`.
Known limitation: Because the shim binds extent=chunk_shape, RegularChunkGrid(chunk_shape=(100,)).get_nchunks() returns 1 (one chunk of size 100 in a dimension of extent 100). This is intentional — the old RegularChunkGrid was shape-unaware, and the shim preserves that by using the chunk shape as a stand-in extent. Code that relied on constructing a RegularChunkGrid and later querying nchunks without binding an array shape must migrate to ChunkGrid.from_regular(array_shape, chunk_shape).
| Two-class pattern | Unified pattern |
|---|---|
| `isinstance(cg, RegularChunkGrid)` | `cg.is_regular` (or keep `isinstance` — shim handles it) |
| `isinstance(cg, RectilinearChunkGrid)` | `not cg.is_regular` |
| `cg.chunk_shape` | `cg.dimensions[i].size` or `cg[coord].shape` |
| `cg.chunk_shapes` | `tuple(d.edges for d in cg.dimensions)` |
| `RegularChunkGrid(chunk_shape=...)` | `ChunkGrid.from_regular(shape, chunks)` |
| `RectilinearChunkGrid(chunk_shapes=...)` | `ChunkGrid.from_rectilinear(edges, shape)` |
| Feature detection via class import | Version check or `hasattr(ChunkGrid, 'is_regular')` |
xarray#10880: Replace isinstance checks with .is_regular. Write path simplifies with chunks=[[...]] API.
VirtualiZarr#877: Drop vendored _is_nested_sequence. Replace isinstance checks.
Icechunk#1338: Minimal impact — format changes driven by spec, not class hierarchy.
cubed#876: Switch store creation to ChunkGrid API. @tomwhite confirmed in #3534 that rechunking with variable-sized intermediate chunks works.
HEALPix use case: @tinaok demonstrated in #3534 that variable-chunked arrays arise naturally when grouping HEALPix cells by parent pixel — the chunk sizes come from np.unique(parents, return_counts=True).
This implementation builds on prior work:
- #3534 (@jhamman) — RLE helpers, validation logic, test cases, and the review discussion that shaped the architecture.
- #3737 — extent-in-grid idea (adopted per-dimension).
- #1483 — original variable chunking POC.
- #3736 — resolved by storing extent per-dimension.
If the design is accepted, the POC branch can be split into 5 incremental PRs. PRs 1–2 are where the design decisions are reviewed; PRs 3–5 are mechanical consequences.
PR 1: Per-dimension types + ChunkSpec (purely additive)
- `FixedDimension`, `VaryingDimension`, `DimensionGrid` protocol, `ChunkSpec`
- RLE helpers (`_expand_rle`, `_compress_rle`, `_decode_dim_spec`)
- `ChunkGridName` type alias
- Unit tests for all new types
- Zero changes to existing code
PR 2: Unified ChunkGrid class + serialization (replaces hierarchy)
- `ChunkGrid` with `from_regular`, `from_rectilinear`, `__getitem__`, `__iter__`, `all_chunk_coords`, `is_regular`, `chunk_shape`, `chunk_sizes`, `unique_edge_lengths`
- `parse_chunk_grid()`, `serialize_chunk_grid()`, `_infer_chunk_grid_name()` for serialization format inference
- `RegularChunkGrid` deprecation shim
- Feature flag (`array.rectilinear_chunks`)
PR 3: Indexing generalization
- Replace `dim_chunk_len: int` with `dim_grid: DimensionGrid` in all per-dimension indexers
- Vectorized `indices_to_chunks()` in `IntArrayDimIndexer` and `CoordinateIndexer`
PR 4: Array, codec pipeline, and sharding integration
- Wire `ChunkGrid` into `create_array`/`init_array`
- `get_chunk_spec()` → `grid[chunk_coords].codec_shape`
- Sharding validation via `dim.unique_edge_lengths`
- `arr.read_chunk_sizes`, `arr.write_chunk_sizes`, `from_array` with `chunks="keep"`, resize support
- Hypothesis strategies for rectilinear grids
PR 5: End-to-end tests + docs
- Full pipeline tests (create → write → read → verify)
- V2 backwards compatibility regression tests
- Boundary/overflow/edge case tests
- Design doc and user guide updates
- Resize defaults (deferred): When growing a rectilinear array, should `resize()` accept an optional `chunks` parameter? See the Resize section for details and open design questions. Regular arrays already stay regular on resize.
- `ChunkSpec` complexity: `ChunkSpec` carries both `slices` and `codec_shape`. Should the grid expose separate methods for codec vs data queries instead?
- `__getitem__` with slices: Should `grid[0, :]` or `grid[0:3, :]` return a sub-grid or an iterator of `ChunkSpec`s?
- Uniform nested lists: Should `chunks=[[10, 10], [20, 20]]` serialize as `"rectilinear"` (preserving user intent for future append) or `"regular"` (current behavior, collapses uniform edges)? See User control over grid serialization format.
- `zarr.open` with rectilinear: @tomwhite noted in #3534 that `zarr.open(mode="w")` doesn't support rectilinear chunks directly. This could be addressed in a follow-up.
- Zarr-Python:
- Xarray:
- VirtualiZarr:
- Virtual TIFF:
- Cubed:
- Microbenchmarks: