Skip to content

Orange Table-specific HDF5Reader#6791

Open
stuart-cls wants to merge 13 commits into
biolab:masterfrom
stuart-cls:orange-hdf5
Open

Orange Table-specific HDF5Reader#6791
stuart-cls wants to merge 13 commits into
biolab:masterfrom
stuart-cls:orange-hdf5

Conversation

@stuart-cls

@stuart-cls stuart-cls commented Apr 25, 2024

Copy link
Copy Markdown
Issue

Enable saving/loading the Orange Table data structure from the binary HDF5 container.
Based on the implementation used in the dask branch, but with the dask parts removed.

Related: #6356

Description of changes
  • Add HDF5Reader with both read and write_file methods.
  • Add round-trip test for HDF5Reader
  • Add h5py to requirements
  • Use existing .metadata sidecar file to store load table.attributes if present.
Includes
  • Code changes
  • Tests
  • Documentation

@stuart-cls stuart-cls changed the title Orange hdf5 Orange Table-specifc HDF5Reader Apr 25, 2024
@stuart-cls stuart-cls changed the title Orange Table-specifc HDF5Reader Orange Table-specific HDF5Reader Apr 25, 2024
Comment thread Orange/data/io.py Outdated
@stuart-cls

Copy link
Copy Markdown
Author

I couldn't find a satisfactory solution to the .attributes problem (for the use case I care about) so I re-used the .metadata sidecar files for now, which is better than nothing.

@codecov

codecov Bot commented May 20, 2024

Copy link
Copy Markdown

Codecov Report

Attention: Patch coverage is 93.67089% with 5 lines in your changes are missing coverage. Please review.

Project coverage is 88.21%. Comparing base (5ada6c4) to head (63c71b3).
Report is 21 commits behind head on master.

Additional details and impacted files
@@           Coverage Diff           @@
##           master    #6791   +/-   ##
=======================================
  Coverage   88.20%   88.21%           
=======================================
  Files         327      327           
  Lines       71223    71301   +78     
=======================================
+ Hits        62825    62900   +75     
- Misses       8398     8401    +3     

@markotoplak

Copy link
Copy Markdown
Member

Comments from @stuart-cls (from his email, just so that they do not get lost):

  • Table.attributes -> I could try again to properly store this in the HDF5, instead of the .metadata sidecar. I got stuck trying to get the round-trip on a visible image to work :)
  • Compatibility with dask branch: both opening files saved with that branch, and updating the branch to be compatible.

Comment thread Orange/data/io.py Outdated
for subdomain in ['attributes', 'class_vars', 'metas']:
parsed = [parse(feature) for feature in getattr(data.domain, subdomain)]
domain = np.array([[name, header] for name, header, _ in parsed], 'S')
domain_args = np.array([json.dumps(args) for *_, args in parsed], 'S')

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

For domain_args we should certainly use the variable length h5py string data type. Probably for domain too.

Copy link
Copy Markdown
Author

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Done.

@stuart-cls

Copy link
Copy Markdown
Author

Regarding the previous comments:

Table.attributes -> I could try again to properly store this in the HDF5, instead of the .metadata sidecar. I got stuck trying to get the round-trip on a visible image to work :)

In general this format isn't doing anything clever with nested dictionaries (see domain_args for example). It would be a lot of work to map this to HDF5, and this is the same problem with Table.attributes.

Compatibility with dask branch: both opening files saved with that branch, and updating the branch to be compatible.

I've tested both ways, it works fine. The new reader checks for "Orange" in the "creator" attribute, but falls back to checking that the "domain" group is there.

@stuart-cls

Copy link
Copy Markdown
Author

@markotoplak I added the possibility to store the Table.attributes dictionary in the file, with pickle .metadata file fallback.
Otherwise, we have been using this quite happily for some time, it is night and day performance improvement for large datasets.

Is there anything I'm missing to get this merged in?

@stuart-cls

Copy link
Copy Markdown
Author

@markotoplak Changed the last use of .metadata sidecar to store the pickled values in the hdf5 file as we discussed.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

2 participants