Avoid memory copies during serialization #860
Open
Labels: Icechunk 🧊 (Relates to Icechunk library / spec), Kerchunk (Relating to the kerchunk library / specification itself), enhancement (New feature or request), performance, references formats (Storing byte range info on disk)
Same issue as earth-mover/icechunk#1574 but on the virtualizarr side. Part of #104.
tl;dr: we are creating an enormous number of Python objects during `vds.vz.to_icechunk()`.

We always knew our current implementation would be inefficient, but it's even worse than I had expected. This benchmark writes a single chunk manifest containing 10M chunk references. Those take up about 300MB in memory as numpy arrays (could be better but not terrible), but writing them to Icechunk takes 90s and 4GB(!).
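To see why the write path costs so much more than the arrays themselves, here is a rough sketch of the overhead of materializing one small Python object per chunk reference (the dict layout is illustrative, not the actual internal representation):

```python
import sys

# Hypothetical per-chunk reference as a plain Python dict, roughly what a
# naive per-reference serialization path materializes one of per chunk.
N = 1_000
refs = [{"path": "s3://bucket/key", "offset": i, "length": 100} for i in range(N)]

# Three fields whose raw payload would fit in ~24 bytes of array storage
# cost far more once wrapped in a Python dict (plus the boxed int/str
# values, which getsizeof doesn't even count here).
per_dict = sys.getsizeof(refs[0])
print(per_dict)  # well over 100 bytes per reference on CPython
```

Scaled to 10M references, that per-object overhead (plus the boxed values and list slots) plausibly accounts for the gap between ~300MB of numpy arrays and multiple GB during the write.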
I think if we pass the manifest in-memory using arrow arrays instead this should become much much better.
Actually, if we transform to arrow arrays (or maybe even just use arrow arrays from the start...), we could make use of this idea in `vds.vz.to_kerchunk()` as well.