Wrapper for Arrow Datasets & Dataset Pieces#754
Conversation
Add a wrapper around Arrow's ParquetDataset legacy class, to allow us to re-implement that class's API using Arrow's new dataset class. Add a wrapper around Arrow's ParquetDatasetPiece legacy class, to allow us to re-implement that class's API using Arrow's new dataset "Fragment" class.
|
Dan Lidral-Porter seems not to be a GitHub user. You need a GitHub account to be able to sign the CLA. If you have already a GitHub account, please add the email address used for this commit to your account. You have signed the CLA already but the status is still pending? Let us recheck it. |
Codecov Report❌ Patch coverage is Additional details and impacted files@@ Coverage Diff @@
## master #754 +/- ##
==========================================
+ Coverage 86.27% 86.38% +0.11%
==========================================
Files 85 86 +1
Lines 5084 5141 +57
Branches 787 793 +6
==========================================
+ Hits 4386 4441 +55
Misses 559 559
- Partials 139 141 +2 ☔ View full report in Codecov by Sentry. 🚀 New features to boost your workflow:
|
selitvin
left a comment
There was a problem hiding this comment.
LGTM! Do you think we can keep it in a PR for now until we have a followup that builds on this PR with the actual new version of a dataset being used?
| from pyarrow import parquet as pq | ||
|
|
||
|
|
||
| class PetastormPyArrowDataset: |
There was a problem hiding this comment.
Please add a comment explaining why do we need the wrapper.
This is a wrapper around PyArrow's ParquetDataset and ParquetDatasetPieces, the first part of the effort to support PyArrow's new Dataset API that was discussed in issue #613.
Since this wrapper has no functionality of its own, I didn't add unit tests specifically for the wrapper; let me know if I should.
I was unable to get the tests in
petastorm/tests/test_tf_dataset.pyto complete when running them inside the provided docker container, even when I left them to run for over 24 hours. Given that this is only a wrapper, and the code paths where the wrapper is used are covered by other tests, I think it's unlikely that tests in that file would fail when all other tests pass, but I'll keep an eye on the CI results. If they fail, then don't bother reviewing this until I fix them up (which may take several rounds of pushing new commits, as I can only run those tests in CI and not locally).