
Commit 1c761b5

d-v-b and claude committed
docs(pipeline): document max_workers regression on small chunks
Added a Notes section to _resolve_max_workers explaining when threading helps and when it hurts:

- Large chunks (>= 1 MB): threading helps; the default is right.
- Small chunks (<= 64 KB): per-task pool overhead (~30-50 µs) dominates the per-chunk work, and threading slows things down 1.5-3x.
- Workaround: set codec_pipeline.max_workers=1 for small-chunk workloads.

Approximate breakeven: 256-512 KB per uncompressed chunk. Compressed chunks shift the threshold lower because decode is real CPU work.

No code change. Wiring an automatic threshold is deferred: 1 MB is a typical chunk size, and a hard cutoff would catch legitimate workloads.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
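The workaround the commit documents is a one-line config override. Below is a minimal usage sketch, assuming a zarr-python v3 install where the codec pipeline honors the codec_pipeline.max_workers key shown in this commit; the store path and array are illustrative placeholders.

    import zarr

    # Scoped override: disable codec-pipeline threading only while
    # reading a small-chunk array, then restore the default afterwards.
    with zarr.config.set({"codec_pipeline.max_workers": 1}):
        arr = zarr.open_array("data/small_chunks.zarr", mode="r")
        data = arr[:]  # chunks decode sequentially, avoiding per-task overhead

    # Or apply it process-wide for workloads dominated by small chunks:
    zarr.config.set({"codec_pipeline.max_workers": 1})

zarr.config is a donfig Config object, so set() also works as a context manager; the scoped form keeps the override from leaking into large-chunk code paths.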
1 parent faf44a0 commit 1c761b5

1 file changed

Lines changed: 15 additions & 0 deletions

File tree

src/zarr/core/codec_pipeline.py

@@ -45,6 +45,21 @@ def _resolve_max_workers() -> int:
 
     ``None`` means "auto" → ``os.cpu_count()`` (or 1 if unavailable).
     Values < 1 are clamped to 1 (sequential).
+
+    Notes
+    -----
+    The default (``None`` → ``cpu_count``) is tuned for large chunks
+    (≳ 1 MB encoded) where per-chunk decode + scatter is real work and
+    threading helps. For small chunks (≲ 64 KB) the per-task pool
+    overhead (≈ 30-50 µs submit + worker handoff) outweighs the work
+    and threading slows things down by 1.5-3x. If your workload uses
+    many small chunks, set ``codec_pipeline.max_workers=1`` explicitly:
+
+        zarr.config.set({"codec_pipeline.max_workers": 1})
+
+    Approximate breakeven on uncompressed reads: 256-512 KB per chunk.
+    Compressed chunks shift the threshold lower because decode is real
+    CPU work that benefits from parallelism.
     """
     import os as _os
 
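The hunk adds documentation only; the resolver body is unchanged and not shown. For orientation, here is a sketch of what the documented semantics imply (None → cpu_count, values < 1 clamped to 1). The lookup via zarr.config.get is an assumption, since the real body lies outside this hunk.

    import os

    import zarr

    def _resolve_max_workers() -> int:
        # "auto": None maps to one worker per CPU, or 1 if undetectable.
        value = zarr.config.get("codec_pipeline.max_workers", None)
        if value is None:
            return os.cpu_count() or 1
        # Values < 1 are clamped to 1, i.e. sequential execution.
        return max(1, int(value))

The quoted breakeven is also consistent with the quoted overhead: assuming a rough 10 GB/s copy bandwidth, a 64 KB uncompressed chunk is ~6 µs of work against ~30-50 µs of task overhead, while a 256-512 KB chunk is ~25-50 µs, about where the two cancel.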
