Commit 2e8887b

FAQ answer on "why still write native zarr?" (#918)
* add empty release notes
* faq answer
* add nuance
* sharding
1 parent 1ff7654 commit 2e8887b

1 file changed

Lines changed: 17 additions & 0 deletions

docs/faq.md

@@ -2,6 +2,23 @@
## Usage questions

### Why write to a cloud-native format directly if I can just virtualize later?

Virtual Zarr stores are intended as a cloud-native bridge for archival formats. They shouldn't be used to justify continuing, indefinitely, to write data to object storage in non-cloud-optimized formats (such as NetCDF, HDF5, TIFF, or GRIB).

If you have the freedom to stop supporting archival formats, we believe that **if you can write your data directly as native Zarr (or native Zarr chunks in Icechunk), you probably should!**

Some reasons are:

- Not all datasets can be virtualized, sometimes for subtle reasons (see [Can my specific data be virtualized?](#can-my-specific-data-be-virtualized)).
- Writing individual files separately means there is nothing enforcing the cross-file constraints needed for later virtualization.
- Virtualized stores are more fragile: the archival files could be moved or updated, and you won't know that a reference is stale until read time.
- Virtualization allows the metadata to differ arbitrarily from the original files. This is mostly a useful feature, but the metadata could become out of sync or misleading.
- Virtualizing later creates significant extra work for someone down the line, and that person will almost certainly know less about the details of the dataset than the data provider does at write time.
- Chunk sizes matter, and it's generally good to force data providers to think up-front about what chunk sizes would be optimal for expected user queries.
- Some other types of optimization (particularly sharding) are not supported for virtual stores.
- For static datasets, native Zarr stores scale effortlessly to arbitrary numbers of chunks today, without having to even think about things like [manifest splitting](https://icechunk.io/en/latest/performance/#splitting-manifests).
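
To illustrate the point about chunk sizes, here is a minimal sketch in plain Python that counts how many chunks a query has to touch under two different chunking schemes. The array shape, the two chunk shapes, the example queries, and the helper `chunks_touched` are all hypothetical and chosen only for illustration; they are not part of any library API.

```python
from math import prod


def chunks_touched(selection, chunk_shape):
    """Count the chunks overlapped by a rectangular selection.

    selection: one (start, stop) index pair per dimension, stop exclusive.
    chunk_shape: chunk length along each dimension.
    """
    per_dim = []
    for (start, stop), size in zip(selection, chunk_shape):
        first_chunk = start // size
        last_chunk = (stop - 1) // size
        per_dim.append(last_chunk - first_chunk + 1)
    return prod(per_dim)


# Hypothetical array with dimensions (time, lat, lon) = (8760, 720, 1440).
map_friendly = (1, 720, 1440)     # one chunk per time step
series_friendly = (8760, 10, 10)  # full time axis, small spatial tiles

# Two hypothetical user queries.
time_series_query = [(0, 8760), (100, 101), (200, 201)]  # all times, one pixel
single_map_query = [(0, 1), (0, 720), (0, 1440)]         # one time, full map

for name, chunking in [("map_friendly", map_friendly),
                       ("series_friendly", series_friendly)]:
    print(name,
          "time series:", chunks_touched(time_series_query, chunking),
          "single map:", chunks_touched(single_map_query, chunking))
```

Under these assumptions, each chunking makes one query touch a single chunk and the other query touch thousands, which is why the expected access pattern is worth settling at write time.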
### Can my specific data be virtualized?

Depends on some details of your data.
