Wrapper for Arrow Datasets & Dataset Pieces #754

Open
aperiodic wants to merge 2 commits into uber:master from aperiodic:feat/arrow-dataset-wrapper

@aperiodic

This is a wrapper around PyArrow's ParquetDataset and ParquetDatasetPiece classes, the first part of the effort to support PyArrow's new Dataset API that was discussed in issue #613.

Since this wrapper has no functionality of its own, I didn't add unit tests specifically for the wrapper; let me know if I should.
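
If tests for the wrapper do turn out to be wanted, a delegation check could be as small as the sketch below. It assumes a wrapper that stores the legacy object and forwards attribute access to it. `ThinWrapper` and its `_legacy` attribute are hypothetical stand-ins rather than the classes in this PR, and the legacy dataset is replaced by a `MagicMock` so the sketch runs without PyArrow installed.

```python
from unittest import mock


class ThinWrapper:
    """Hypothetical stand-in for the wrapper in this PR (not the real class)."""

    def __init__(self, legacy):
        self._legacy = legacy

    @property
    def pieces(self):
        # Pure delegation: the wrapper adds no logic of its own.
        return self._legacy.pieces


def test_pieces_is_delegated():
    # A mock replaces the legacy dataset object.
    legacy = mock.MagicMock()
    legacy.pieces = ["piece-0", "piece-1"]

    wrapper = ThinWrapper(legacy)

    # The wrapper should hand back exactly what the legacy object holds.
    assert wrapper.pieces is legacy.pieces


test_pieces_is_delegated()
```

A test like this mostly guards against typos in the forwarding code, which is likely the only failure mode a pure wrapper has.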

I was unable to get the tests in petastorm/tests/test_tf_dataset.py to complete when running them inside the provided docker container, even when I left them to run for over 24 hours. Given that this is only a wrapper, and the code paths where the wrapper is used are covered by other tests, I think it's unlikely that tests in that file would fail when all other tests pass, but I'll keep an eye on the CI results. If they fail, then don't bother reviewing this until I fix them up (which may take several rounds of pushing new commits, as I can only run those tests in CI and not locally).

Add a wrapper around Arrow's ParquetDataset legacy class, to allow us to
re-implement that class's API using Arrow's new dataset class.

Add a wrapper around Arrow's ParquetDatasetPiece legacy class, to allow
us to re-implement that class's API using Arrow's new dataset "Fragment"
class.
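
To make the two commit messages concrete, here is a minimal sketch of what a pair of thin wrappers of this kind might look like. It is not the code in this PR: `DatasetWrapper`, `PieceWrapper`, and the two `FakeLegacy*` classes are hypothetical names, and the fakes model only a few attributes (`pieces`, `path`, `row_group`) so that the example runs without PyArrow.

```python
class FakeLegacyPiece:
    """Stand-in for the legacy piece class (assumption: only `path` and
    `row_group` are modelled)."""

    def __init__(self, path, row_group=None):
        self.path = path
        self.row_group = row_group


class FakeLegacyDataset:
    """Stand-in for the legacy dataset class (assumption: only `pieces`)."""

    def __init__(self, pieces):
        self.pieces = pieces


class PieceWrapper:
    """Hypothetical wrapper that re-exposes the legacy piece attributes."""

    def __init__(self, legacy_piece):
        self._legacy_piece = legacy_piece

    @property
    def path(self):
        return self._legacy_piece.path

    @property
    def row_group(self):
        return self._legacy_piece.row_group


class DatasetWrapper:
    """Hypothetical wrapper that re-exposes the legacy dataset attributes."""

    def __init__(self, legacy_dataset):
        self._legacy_dataset = legacy_dataset

    @property
    def pieces(self):
        # Wrap each piece so callers never hold the legacy class directly.
        return [PieceWrapper(p) for p in self._legacy_dataset.pieces]


dataset = DatasetWrapper(FakeLegacyDataset([FakeLegacyPiece("part-0.parquet", 0)]))
first = dataset.pieces[0]
print(first.path, first.row_group)
```

Because callers only see the wrapper types, the private `_legacy_*` members could later be swapped for objects from the new dataset API while the public properties stay the same.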
@CLAassistant

CLAassistant commented Apr 28, 2022

Thank you for your submission! We really appreciate it. Like many open source projects, we ask that you sign our Contributor License Agreement before we can accept your contribution.


Dan Lidral-Porter seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account.
You have signed the CLA already but the status is still pending? Let us recheck it.

@codecov

codecov Bot commented Apr 28, 2022

Codecov Report

❌ Patch coverage is 94.66667% with 4 lines in your changes missing coverage. Please review.
✅ Project coverage is 86.38%. Comparing base (d32709d) to head (819a127).
⚠️ Report is 29 commits behind head on master.

| Files with missing lines | Patch % | Lines |
|---|---|---|
| petastorm/pyarrow_helpers/dataset_wrapper.py | 96.42% | 0 Missing and 2 partials ⚠️ |
| petastorm/etl/metadata_util.py | 0.00% | 1 Missing ⚠️ |
| petastorm/etl/rowgroup_indexing.py | 66.66% | 1 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##           master     #754      +/-   ##
==========================================
+ Coverage   86.27%   86.38%   +0.11%     
==========================================
  Files          85       86       +1     
  Lines        5084     5141      +57     
  Branches      787      793       +6     
==========================================
+ Hits         4386     4441      +55     
  Misses        559      559              
- Partials      139      141       +2     


@selitvin (Collaborator) left a comment


LGTM! Do you think we can keep it in a PR for now until we have a followup that builds on this PR with the actual new version of a dataset being used?

from pyarrow import parquet as pq


class PetastormPyArrowDataset:

Please add a comment explaining why we need the wrapper.
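
One possible shape for such a comment is a class docstring like the one sketched here. The wording is inferred from the PR description and commit messages, not taken from the actual change, and the class body is deliberately left out.

```python
class PetastormPyArrowDataset:
    """Thin wrapper around PyArrow's legacy ``ParquetDataset`` class.

    Petastorm code uses this class instead of the PyArrow class directly,
    so that the same API can later be re-implemented on top of PyArrow's
    new dataset class (see issue #613).
    """
    # Body intentionally omitted: only the explanatory docstring is sketched.
```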

good feature
