Avoid memory copies during serialization #860

@TomNicholas

Description

Same issue as earth-mover/icechunk#1574 but on the virtualizarr side. Part of #104.

tl;dr: we are creating a huge number of Python objects during `vds.vz.to_icechunk()`.

We always knew our current implementation would be inefficient, but it's even worse than I had expected. This benchmark writes a single chunk manifest containing 10M chunk references. Those take up about 300MB in memory as numpy arrays (could be better, but not terrible), yet writing them to Icechunk takes 90s and 4GB!!!
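The blow-up is consistent with per-reference Python object overhead. Here is a minimal sketch (pure numpy/stdlib; the single `offsets` column is a hypothetical stand-in, not the real manifest layout) of how materialising each reference as a Python object multiplies the footprint:

```python
import sys
import numpy as np

# Hypothetical stand-in for one manifest column: 1M chunk offsets.
# (The real manifest also stores lengths and paths; these numbers are
# illustrative, not measured from virtualizarr itself.)
n = 1_000_000
offsets = np.arange(n, dtype=np.uint64)

array_bytes = offsets.nbytes  # 8 bytes per element, one contiguous buffer

# Converting each reference into a Python int creates n heap objects,
# each carrying object-header overhead on top of the 8-byte payload,
# plus a list of pointers to hold them.
as_objects = offsets.tolist()
object_bytes = sys.getsizeof(as_objects) + sum(sys.getsizeof(x) for x in as_objects)

print(f"numpy array:    {array_bytes / 1e6:.0f} MB")
print(f"python objects: {object_bytes / 1e6:.0f} MB")
```

On CPython the object representation comes out several times larger than the flat array, before even counting the cost of allocating and collecting the objects themselves.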

I think if we pass the manifest in-memory using arrow arrays instead, this should become much, much better.


Actually, if we transform to arrow arrays (or maybe even just use arrow arrays from the start...) we could make use of this idea in `vds.vz.to_kerchunk()` as well.


Labels

Icechunk 🧊 (Relates to Icechunk library / spec), Kerchunk (Relating to the kerchunk library / specification itself), enhancement (New feature or request), performance, references formats (Storing byte range info on disk)
