
feat: pushdown OFFSET to parquet for RG-level skipping #21828

Open

zhuqi-lucas wants to merge 3 commits into apache:main from zhuqi-lucas:feat/offset-pushdown

Conversation

@zhuqi-lucas (Contributor) commented Apr 24, 2026

Which issue does this PR close?

Closes #19654

Rationale for this change

SELECT * FROM table LIMIT 5 OFFSET 59000000 on a 60M-row parquet file takes 182ms because DataFusion reads 59M+ rows then discards them in GlobalLimitExec. The parquet reader has no knowledge of the offset.

What changes are included in this PR?

Push OFFSET from GlobalLimitExec down to the parquet reader for single-file, no-filter queries.

Architecture

                BEFORE                              AFTER
                ======                              =====

  GlobalLimitExec(skip=59M, fetch=5)          DataSourceExec(limit=59M+5, offset=59M)
    │                                           │
    ▼                                           ▼ prune_by_offset
  DataSourceExec(limit=59M+5)                 Skip 590 RGs (zero I/O)
    │                                           │
    ▼                                           ▼ apply_offset (RowSelection)
  Read 59M+ rows from parquet               Skip remaining rows in first RG
  (decode, decompress all)                      │
    │                                           ▼ effective_limit = 5
    ▼                                         Read 5 rows from 1 RG
  Discard 59M rows in GlobalLimitExec           │
    │                                           ▼
    ▼                                         Return 5 rows
  Return 5 rows
  
  Time: 182ms                                 Time: <1ms

Scope

| Scenario | Offset pushdown | GlobalLimitExec |
|---|---|---|
| Single file + no filter | ✅ | Eliminated |
| Multi-file | ❌ | Kept (#21915) |
| With filter | ❌ | Kept (#21916) |
| Non-parquet (CSV/JSON) | ❌ | Kept |

Implementation

  1. FileSource::supports_offset() — trait method, parquet returns true, others false
  2. FileScanConfig.offset — new field, with_offset() returns Some only for single-file + no-filter parquet
  3. LimitPushdown optimizer — pushes offset, eliminates GlobalLimitExec when with_offset() returns Some
  4. prune_by_offset() — skips leading fully-matched RGs whose cumulative rows fall within offset
  5. PreparedAccessPlan::apply_offset() — creates RowSelection for remaining offset within first surviving RG
  6. effective_limit — decoder reads only fetch rows (limit - offset)
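The row-group arithmetic behind steps 4–6 can be sketched as a standalone function. This is a hypothetical simplification of the description above, not the PR's actual code; `plan_offset` and its signature are invented for illustration:

```rust
// Hypothetical sketch of prune_by_offset / apply_offset / effective_limit:
// given per-row-group row counts, an OFFSET, and a fetch count, compute
// (whole row groups to skip, rows to skip via RowSelection in the first
// surviving row group, rows the decoder actually needs to read).
fn plan_offset(rg_rows: &[usize], offset: usize, fetch: usize) -> (usize, usize, usize) {
    let mut skipped_rgs = 0;
    let mut remaining = offset;
    for &rows in rg_rows {
        if remaining >= rows {
            remaining -= rows; // whole RG falls inside the offset: zero I/O
            skipped_rgs += 1;
        } else {
            break; // partial skip handled by a RowSelection in this RG
        }
    }
    // effective_limit: the decoder only needs `fetch` rows past the offset
    (skipped_rgs, remaining, fetch)
}

fn main() {
    // 6 row groups of 100k rows each, OFFSET 250_000 LIMIT 5:
    // skip 2 whole RGs, then 50k rows via RowSelection, then read 5 rows.
    assert_eq!(plan_offset(&[100_000; 6], 250_000, 5), (2, 50_000, 5));
    // An offset landing exactly on a RG boundary needs no RowSelection.
    assert_eq!(plan_offset(&[100_000; 6], 200_000, 5), (2, 0, 5));
    println!("ok");
}
```

With 60M rows split into 100k-row groups, OFFSET 59M skips 590 whole row groups this way, matching the AFTER diagram.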

Benchmark (60M rows, 1.5GB single parquet file)

| OFFSET | Before | After | Speedup |
|---|---|---|---|
| 0 | 3ms | 2ms | |
| 1M | 4ms | 1ms | 4x |
| 30M | 98ms | 1ms | 98x |
| 59M | 182ms | <1ms | >182x |

Are these changes tested?

  • 6 unit tests for prune_by_offset (boundary, partial, non-fully-matched, zero, exceeds, exact)
  • SLT Test N (10 sub-tests): EXPLAIN, correctness, boundary cases, WHERE interaction, multi-partition
  • Updated explain_analyze, push_down_filter_parquet, limit_pruning SLTs for new metric
  • All limit.slt tests pass (CSV/JSON offset still handled by GlobalLimitExec)

Are there any user-facing changes?

API change: FileSource::supports_offset() added (default false). FileScanConfig gains offset field. ExecutionPlan gains with_offset() and offset() methods.

Performance: faster OFFSET queries on single-file parquet without filters.

Follow-up

@github-actions github-actions Bot added optimizer Optimizer rules sqllogictest SQL Logic Tests (.slt) datasource Changes to the datasource crate physical-plan Changes to the physical-plan crate labels Apr 24, 2026
@zhuqi-lucas zhuqi-lucas force-pushed the feat/offset-pushdown branch from 39621f4 to 60c508d on April 24, 2026 08:40
@zhuqi-lucas zhuqi-lucas marked this pull request as ready for review April 24, 2026 12:36
Copilot AI left a comment


Pull request overview

This PR improves performance for LIMIT .. OFFSET .. queries on Parquet by pushing OFFSET down into the Parquet scan so it can skip entire row groups (and partially skip within a row group via RowSelection), avoiding reading and discarding large numbers of rows in GlobalLimitExec.

Changes:

  • Introduces offset pushdown plumbing (with_offset, offset, offset_fully_handled) across ExecutionPlan, DataSource, FileSource, and FileScanConfig.
  • Implements Parquet-side offset handling (row-group pruning + optional RowSelection) and adds an offset_pruned_row_groups metric.
  • Adds/updates SLT coverage and expected EXPLAIN/metrics output to reflect offset pushdown behavior.

Reviewed changes

Copilot reviewed 14 out of 14 changed files in this pull request and generated 4 comments.

| File | Description |
|---|---|
| datafusion/sqllogictest/test_files/sort_pushdown.slt | Adds SLT “Test N” covering OFFSET pushdown behavior and correctness checks. |
| datafusion/sqllogictest/test_files/push_down_filter_parquet.slt | Updates expected EXPLAIN ANALYZE metrics to include offset_pruned_row_groups. |
| datafusion/sqllogictest/test_files/limit_pruning.slt | Updates expected metrics to include offset_pruned_row_groups. |
| datafusion/sqllogictest/test_files/explain_analyze.slt | Updates expected metrics to include offset_pruned_row_groups. |
| datafusion/sqllogictest/test_files/dynamic_filter_pushdown_config.slt | Updates expected metrics to include offset_pruned_row_groups. |
| datafusion/physical-plan/src/execution_plan.rs | Adds offset-related extension points to ExecutionPlan. |
| datafusion/physical-optimizer/src/limit_pushdown.rs | Extends limit pushdown rule to attempt offset pushdown and (sometimes) remove GlobalLimitExec. |
| datafusion/datasource/src/source.rs | Wires offset methods through DataSource and DataSourceExec. |
| datafusion/datasource/src/file_scan_config/mod.rs | Adds offset to FileScanConfig + builder and implements offset pushdown behavior. |
| datafusion/datasource/src/file.rs | Adds supports_offset() to FileSource. |
| datafusion/datasource-parquet/src/source.rs | Marks Parquet as supporting offset pushdown and propagates offset into morselizer config. |
| datafusion/datasource-parquet/src/row_group_filter.rs | Adds prune_by_offset and unit tests for row-group pruning by offset. |
| datafusion/datasource-parquet/src/opener.rs | Applies offset pruning and row selection during Parquet open; adjusts effective limit when offset is fully handled. |
| datafusion/datasource-parquet/src/metrics.rs | Adds offset_pruned_row_groups metric. |


Comment on lines +270 to +285
```diff
         if global_skip > 0 {
-            add_global_limit(
-                plan_with_preserve_order,
-                global_skip,
-                Some(global_fetch),
-            )
+            // Push offset to the plan. If the plan fully handles
+            // offset (e.g. parquet without WHERE), eliminate
+            // GlobalLimitExec. Otherwise keep it for remaining skip.
+            if let Some(plan_with_offset) =
+                plan_with_preserve_order.with_offset(global_skip)
+            {
+                if plan_with_offset.offset_fully_handled() {
+                    plan_with_offset
+                } else {
+                    add_global_limit(
+                        plan_with_offset,
+                        global_skip,
+                        Some(global_fetch),
+                    )
+                }
```

Copilot AI Apr 24, 2026


When with_offset(global_skip) returns Some but offset_fully_handled() is false, this code still keeps GlobalLimitExec with skip=global_skip while also pushing the same offset into the child plan. If the child actually skips any rows (e.g. parquet skipping fully-matched row groups), the offset will be applied twice and results will be incorrect. Consider only calling/using with_offset when the plan can fully handle the offset (and otherwise leave the child unchanged), or alternatively adjust the GlobalLimitExec skip to reflect only the remaining offset actually not handled by the child (requires a way to compute/report it).

Comment on lines +889 to +893
```rust
        // Offset is fully handled when set AND no filter —
        // raw row counts are accurate for offset calculation.
        // With filters, only fully-matched RGs can be skipped,
        // GlobalLimitExec handles the rest.
        self.offset.is_some() && self.file_source.filter().is_none()
```

Copilot AI Apr 24, 2026


offset_fully_handled() currently returns true whenever offset is set and there is no filter, but it does not account for multi-partition scans. FileStream applies limit per output partition, so removing GlobalLimitExec when file_groups.len() > 1 can yield more than fetch rows (and make OFFSET/LIMIT semantics depend on partitioning). Consider requiring a single output partition (e.g. self.file_groups.len() == 1 / output_partitioning().partition_count() == 1) before reporting the offset as fully handled.

Suggested change:

```diff
-        // Offset is fully handled when set AND no filter —
-        // raw row counts are accurate for offset calculation.
-        // With filters, only fully-matched RGs can be skipped,
-        // GlobalLimitExec handles the rest.
-        self.offset.is_some() && self.file_source.filter().is_none()
+        // Offset is fully handled only when set, no filter is present,
+        // and the scan has a single output partition.
+        //
+        // With filters, only fully-matched RGs can be skipped and
+        // GlobalLimitExec handles the rest. Likewise, multi-partition
+        // scans must retain global limit enforcement because FileStream
+        // applies limits per output partition.
+        self.offset.is_some()
+            && self.file_source.filter().is_none()
+            && self.file_groups.len() == 1
```
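The per-partition hazard described in this review comment can be shown with a toy sketch (hypothetical helper, not DataFusion code): applying skip/fetch independently per output partition can emit up to `fetch` rows per partition, so a multi-partition scan still needs a global limit operator.

```rust
// Toy model of applying LIMIT/OFFSET per partition instead of globally.
// Each partition independently skips `skip` rows and takes `fetch` rows,
// then the partitions are concatenated.
fn per_partition_limit(parts: &[Vec<i32>], skip: usize, fetch: usize) -> Vec<i32> {
    parts
        .iter()
        .flat_map(|p| p.iter().copied().skip(skip).take(fetch))
        .collect()
}

fn main() {
    let parts = vec![vec![1, 2, 3, 4], vec![5, 6, 7, 8]];
    // Requested: skip 1, fetch 2. Each partition contributes 2 rows,
    // so 4 rows come out instead of the 2 the query asked for.
    let out = per_partition_limit(&parts, 1, 2);
    assert_eq!(out, vec![2, 3, 6, 7]);
    assert!(out.len() > 2); // exceeds the requested LIMIT
    println!("ok");
}
```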
Comment thread datafusion/datasource-parquet/src/opener.rs Outdated
```
8 80
9 90
10 100
```

Copilot AI Apr 24, 2026


The WHERE-clause coverage here uses OFFSET 2, which does not exercise the case where the offset spans at least one fully-matched row group under a predicate (the scenario that can break if offset is both pushed into parquet and also applied by GlobalLimitExec). Consider adding a variant like WHERE value > 50 LIMIT 3 OFFSET 7 (with 5-row row groups) to ensure correctness when the offset crosses a fully-matched row group boundary under filtering.

Suggested change:

```
# Test N.9b: OFFSET with WHERE clause crossing a fully-matched row-group boundary
# Ensures correctness when OFFSET may be pushed into parquet and also applied by GlobalLimitExec
query II
SELECT * FROM tn_offset WHERE value > 50 LIMIT 3 OFFSET 7;
----
13 130
14 140
15 150
```

@alamb left a comment


Thanks @zhuqi-lucas -- I took a look and this looks pretty nice. I think we can potentially simplify the APIs a bit though and I left some suggestions on how to do so

let me know what you think

Comment thread datafusion/datasource-parquet/src/opener.rs Outdated
Comment thread datafusion/datasource-parquet/src/opener.rs Outdated
Comment thread datafusion/datasource-parquet/src/row_group_filter.rs
```rust
    /// The number of rows to skip before returning results.
    /// When combined with `limit`, this enables efficient OFFSET handling
    /// at the file scan level by skipping entire row groups when possible.
    pub offset: Option<usize>,
```

I think technically this is a public API and thus an API change (so we should mark the PR thusly)

```rust
        self.offset
    }

    fn offset_fully_handled(&self) -> bool {
```

It would help me if you could add a comment explaining what this method is for, and maybe a more descriptive name.

I found "handled" a bit generic -- maybe a name like "can_apply_offset" would be clearer.

Comment thread datafusion/datasource/src/file.rs Outdated
```rust
    }

    /// Whether this source fully handles offset at the scan level.
    /// When true, the optimizer can eliminate GlobalLimitExec's skip.
```

In general I think it would be easier to understand if this documentation doesn't refer to GlobalLimitExec, and instead simply says "can be efficiently implemented by the file source" or something like that

Comment thread datafusion/datasource/src/source.rs Outdated
```rust
        None
    }

    /// Whether offset is fully handled (no need for GlobalLimitExec skip).
```

I am confused by this API -- if the datasource returns Some(..) for offset then I expect that the DataSource will correctly implement the offset handling. It seems error prone if callers ALSO have to remember to check offset_fully_handled

Perhaps we could change FileScanConfig so that if a filter is ever set it clears out the offset (or better yet has an enum so it is not possible to set the offset and a filter at the same time)
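The enum idea sketched in Rust (all names here are hypothetical, invented to illustrate the mutual exclusion; this is not proposed DataFusion API):

```rust
// Hypothetical sketch: make "offset pushed down" and "filter present"
// mutually exclusive by construction, so callers can never set both and
// never need a separate offset_fully_handled check.
enum ScanSkip {
    /// No skipping requested.
    None,
    /// No filter present: the scan may skip exactly this many rows.
    Offset(usize),
    /// A predicate is present: offset must stay in the global limit.
    Filtered { predicate: String },
}

impl ScanSkip {
    /// Setting a filter discards any previously pushed-down offset.
    fn with_filter(self, predicate: String) -> Self {
        ScanSkip::Filtered { predicate }
    }

    /// The offset is only available when no filter was ever set.
    fn offset(&self) -> Option<usize> {
        match self {
            ScanSkip::Offset(n) => Some(*n),
            _ => None,
        }
    }
}

fn main() {
    let skip = ScanSkip::Offset(59_000_000);
    assert_eq!(skip.offset(), Some(59_000_000));
    // Adding a filter makes the offset unavailable for pushdown.
    let filtered = skip.with_filter("value > 50".to_string());
    assert_eq!(filtered.offset(), None);
    println!("ok");
}
```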

@zhuqi-lucas (Contributor, Author) commented Apr 27, 2026

> Thanks @zhuqi-lucas -- I took a look and this looks pretty nice. I think we can potentially simplify the APIs a bit though and I left some suggestions on how to do so
>
> let me know what you think

Thanks for the review @alamb! I agree the current API is confusing. Let me explain the design intent:

There are two cases for offset + parquet:

Case 1: No filter → offset is fully handled (raw row counts are exact). Safe to eliminate GlobalLimitExec.

Case 2: Filter + fully matched RGs: prune_by_statistics marks RGs where ALL rows satisfy the filter (is_fully_matched). For these RGs, row counts are still exact — safe to skip for offset. But non-fully-matched RGs need GlobalLimitExec to handle remaining offset.

If with_offset returns None when filter is present, we lose Case 2 entirely — the offset hint is never passed to the opener, so prune_by_offset can't skip any RGs.

How about this cleaner API instead:

  • with_offset always returns Some for parquet (sets the hint)
  • Rename offset_fully_handled → can_skip_global_limit_for_offset
  • Optimizer: if can_skip_global_limit_for_offset() → eliminate GlobalLimitExec, else keep it

Or alternatively: split into two methods — with_offset(n) for full handling (no filter) and with_offset_hint(n) for partial (with filter).

WDYT?

I'll also address the other comments (move RowSelection logic to PreparedAccessPlan method, update docs/naming, mark as API change).
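A toy sketch of the Case 2 arithmetic (hypothetical function, not from the PR): under a filter, only row groups whose statistics prove every row matches have exact post-filter row counts, and only a leading run of such row groups can be counted against the offset; at the first non-fully-matched row group the count stops being exact, so the remainder is left to GlobalLimitExec.

```rust
// Hypothetical sketch: rgs[i] = (row_count, is_fully_matched).
// Returns (fully-matched RGs skipped for the offset, offset rows still
// to be skipped downstream by GlobalLimitExec).
fn leading_fully_matched_skip(rgs: &[(usize, bool)], offset: usize) -> (usize, usize) {
    let mut skipped = 0;
    let mut remaining = offset;
    for &(rows, fully_matched) in rgs {
        if fully_matched && remaining >= rows {
            remaining -= rows; // exact count: safe to charge against offset
            skipped += 1;
        } else {
            break; // counts beyond here are not exact under the filter
        }
    }
    (skipped, remaining)
}

fn main() {
    // Row groups of 5 rows each; the 3rd RG is only partially matched.
    let rgs = [(5, true), (5, true), (5, false), (5, true)];
    // OFFSET 12: skip two fully-matched RGs (10 rows); 2 rows remain
    // for GlobalLimitExec, even though RG 3 is also fully matched.
    assert_eq!(leading_fully_matched_skip(&rgs, 12), (2, 2));
    println!("ok");
}
```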

@alamb (Contributor) commented Apr 28, 2026

I think we should take a step back and maybe reconsider what we are trying to do -- maybe by defining and documenting somewhere what the intended semantics for OFFSET are.

For example, for a query like

SELECT * FROM `single_file.parquet` LIMIT 5 OFFSET 600000000

I think DataFusion would probably be "correct" to return any 5 rows (as the query doesn't specify any ORDER BY)

However, that is probably not what the user wanted / expected. The user probably wants rows starting at logical offset 600000000 of the file.

Likewise, what rows should be returned from this query (where there are multiple files)?

SELECT * FROM directory_with_multiple_files LIMIT 5 OFFSET 600000000

That is not clear to me - is there some sort of implied global order of rows across the files that users expect?

Another question: Do we expect DataFusion to always produce the same values for a given LIMIT / OFFSET combination? (Note that today it will potentially return different values for the same LIMIT when there are more than one partition and no ORDER BY clause)

> > Thanks @zhuqi-lucas -- I took a look and this looks pretty nice. I think we can potentially simplify the APIs a bit though and I left some suggestions on how to do so
> > let me know what you think
>
> Thanks for the review @alamb! I agree the current API is confusing. Let me explain the design intent:
>
> There are two cases for offset + parquet:
>
> Case 1: No filter → offset is fully handled (raw row counts are exact). Safe to eliminate GlobalLimitExec.

I think this makes sense and it is the case that @AntoinePrv gives in #19654

> Case 2: Filter + fully matched RGs: prune_by_statistics marks RGs where ALL rows satisfy the filter (is_fully_matched). For these RGs, row counts are still exact — safe to skip for offset. But non-fully-matched RGs need GlobalLimitExec to handle remaining offset.

I am confused about how this is intended to work. For example, in your AFTER (with WHERE) example, what if the skipped row groups wouldn't have been skipped if the offset had been applied as the file was read? For example, what if the last row group was entirely skipped but wouldn't have made it into the results -- should it be included in the offset calculation?

Also, I think the GlobalLimitExec needs to be updated

                AFTER (with WHERE)
                ==================

  GlobalLimitExec(skip=59M, fetch=5)    <---- shouldn't the skip be reduced by the number of rows
    │                                         that are skipped with full matching RGs? How does that work with multiple files 🤔 
    ▼                                         
  DataSourceExec(limit=59M+5, offset=59M)
    │
    ▼ prune_by_offset
  Skip fully-matched RGs (WHERE stats  <---- how do you know what order these are? Is it only the first ones?
  prove all rows match → row count exact)
    │
    ▼
  Read remaining RGs (with WHERE filter)
    │
    ▼
  GlobalLimitExec handles remaining skip

Given the potential complexity of handling / optimizing the general case, I wonder if there is some way to start simple (maybe a PR that only implements case 1, when there are no predicates and a single file, so it is clearer what the expected results should be).

@AntoinePrv

> SELECT * FROM `single_file.parquet` LIMIT 5 OFFSET 600000000
>
> I think DataFusion would probably be "correct" to return any 5 rows (as the query doesn't specify any ORDER BY)
>
> However, that is probably not what the user wanted / expected. The user probably wants rows starting at logical offset 600000000 of the file.

From a user perspective, I very much expect to be returned the rows in the logical order of the file (alternatively, is there a clause to express this as an ORDER BY?). I understand this may not be what someone coming from databases would expect. But coming more from a data science background, one may consider Parquet a "smart CSV" and wish to investigate the "last 1000 rows they just generated" (i.e. the bottom of the file). While data scientists may not use DataFusion directly, they may absolutely write queries on systems built with it.

> Likewise, what rows should be returned from this query (where there are multiple files)?
>
> SELECT * FROM directory_with_multiple_files LIMIT 5 OFFSET 600000000
>
> That is not clear to me - is there some sort of implied global order of rows within the file that users expect?

I do not have as strong an expectation here. Best case, I would say logical order is determined by something like lexicographic order of the file paths, but that does feel a bit more arbitrary. Still, I would expect that there is some order (possibly hidden from me) and that changing the offset changes the rows I am viewing in that hidden order. That is:

  • The same query twice returns the same rows
  • I am able to observe all the rows by changing the offset.

Perhaps I can say a bit more about my use case. Basically I want to paginate rows from a dataset to show to a user.
For the single-file case, this is like a file viewer, so I do prefer to have a way to express logical order (either explicitly or implicitly).
For the multi-file scenario, I believe the order is not as important, but I do need a way to get "chunks" of rows that cover the whole dataset without duplication, while retrieving a single "chunk" with maximum efficiency (e.g. most of the time they are contiguous rows from a single file).

I think the "efficiently paginate rows" can be considered a reasonable feature beyond my own use case.

@asolimando (Member) commented Apr 29, 2026

I understand where the need comes from, but there is a good reason why databases treat scans without ORDER BY as unordered: a lot of logical/physical planning optimizations depend on this assumption, and the planner can only rely on metadata to tell whether the plan changes it wants to make are safe.

If the underlying data is truly sorted on something that can be encoded similarly to what you can write with an ORDER BY (or at least producing the same metadata DataFusion uses), that could be fine. But if the order is just the order rows happen to have in the files, and we can't encode this promise anywhere, then it gets complex.

At that point, the planner would need a mode disabling every optimization that can change the result-set order without an ORDER BY. That is a non-trivial amount of scrutiny that every future contribution to the planner would have to go through, and since it is a major deviation from the SQL standard, everything would need to be re-checked for safety.

@zhuqi-lucas (Contributor, Author) commented Apr 29, 2026

Thanks @alamb for the thoughtful feedback! I agree with your suggestion to simplify.

Plan: start with Case 1 only (no filter, single file)

This PR already works for Case 1 — when there's no predicate, with_offset accepts, GlobalLimitExec is eliminated, and parquet handles all offset skipping via RG prune + RowSelection. The semantics are unchanged from current behavior (physical row order, same as today's GlobalLimitExec skip).

I'll simplify by:

  1. Remove offset_fully_handled; with_offset returns Some only when there is no filter
  2. Remove the Case 2 (filter + fully matched) complexity — tracked as follow-up
  3. Move RowSelection logic into a method on PreparedAccessPlan
  4. Update docs/naming per your suggestions
  5. Mark as API change

Re: the semantics discussion from @AntoinePrv and @asolimando — great points on both sides. This PR doesn't change any OFFSET semantics; it produces the exact same rows as today's GlobalLimitExec skip, just faster. The broader questions around ORDER BY requirements and multi-file ordering are important but orthogonal — happy to discuss those separately.

Will push the simplified version shortly.

@zhuqi-lucas zhuqi-lucas force-pushed the feat/offset-pushdown branch 2 times, most recently from 0b5d7fa to 07d54c4 on April 29, 2026 09:30
@zhuqi-lucas (Contributor, Author) commented Apr 29, 2026

Updated per review feedback — all items addressed:

Simplified scope (per @alamb suggestion):

  • Offset pushdown now only for single file + no filter
  • Multi-file and filtered queries keep GlobalLimitExec
  • Removed offset_fully_handled; with_offset returning Some now means the offset is fully handled
  • with_offset returns None when: not parquet, has filter, or multi-file

Code changes:

  1. ✅ Moved RowSelection logic into PreparedAccessPlan::apply_offset() method
  2. ✅ Removed offset_fully_handled from all traits
  3. ✅ Updated supports_offset docs — no longer references GlobalLimitExec
  4. ✅ Marked as API change in PR description
  5. ✅ Simplified to plain offset: Option<usize> (removed multi-file atomic logic)
  6. ✅ Optimizer: with_offset → Some = eliminate GlobalLimitExec, None = keep it

Follow-up issues created:

Ready for re-review.

@alamb (Contributor) commented Apr 29, 2026

I asked about the Detect breaking changes CI failure here: #21499 (comment)

@alamb (Contributor) commented Apr 29, 2026

> I understand where the need comes from, but there is a good reason why databases treat scans without order by as unordered, it's because a lot of logical/physical planning optimizations depend on this assumption, and they can only rely on metadata to tell if the plan changes they want to do are safe or not.

I agree with @asolimando on this -- and I think DataFusion should not be breaking new ground on what semantics we implement (we should follow other DB implementations as much as possible)

> If the underlying data is truly sorted over something that can be encoded similarly to what you can write with an ORDER BY (or at least producing the same metadata DataFusion uses), that could be fine, but if the order is just the order rows happen to have in the files, and we can't encode this promise anywhere, then it gets complex.

We did recently add the ability to emit row id from the parquet reader 🤔 -- maybe we could make that work and then treat row group skipping as an optimization when the data is explicitly ORDER BY row_number() 🤔

@alamb (Contributor) commented Apr 29, 2026

Thank you @zhuqi-lucas -- I am not sure this change actually solves @AntoinePrv's use case - I think the conservative checks will not trigger on a large file

For example, I tested using a 14 GB clickbench parquet file:

```shell
cd benchmarks
./bench.sh data clickbench_1
cd data
```

And then run datafusion-cli from this branch:

select * from 'hits.parquet' OFFSET 99000000 LIMIT 5;

It took 4 seconds on my laptop (to return 5 rows), which I think means this branch's optimization is not triggered:

```
andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion/benchmarks/data$ ~/Downloads/datafusion-cli-feat_offset-pushdown
DataFusion CLI v53.1.0
> select * from 'hits.parquet' OFFSET 99000000 LIMIT 5;
(wide result table elided: WatchID, JavaEnable, Title, ... columns)
....

5 row(s) fetched.
Elapsed 3.192 seconds.
```

So I think we either need to

  1. Claim DataFusion is doing the correct (though unexpected) thing and close the ticket
  2. Take a step back and see if we can implement this use case (paginate results) in a more performant way (e.g. implement ORDER BY row_number() first and then implement this as an optimization)

@alamb
Contributor

alamb commented Apr 29, 2026

BTW here is the explain plan

andrewlamb@Andrews-MacBook-Pro-3:~/Software/datafusion/benchmarks/data$ ~/Downloads/datafusion-cli-feat_offset-pushdown
DataFusion CLI v53.1.0
> explain select * from 'hits.parquet' OFFSET 99000000 LIMIT 5;
+---------------+-------------------------------+
| plan_type     | plan                          |
+---------------+-------------------------------+
| physical_plan | ┌───────────────────────────┐ |
|               | │      GlobalLimitExec      │ |
|               | │    --------------------   │ |
|               | │          limit: 5         │ |
|               | │       skip: 99000000      │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │   CoalescePartitionsExec  │ |
|               | │    --------------------   │ |
|               | │      limit: 99000005      │ |
|               | └─────────────┬─────────────┘ |
|               | ┌─────────────┴─────────────┐ |
|               | │       DataSourceExec      │ |
|               | │    --------------------   │ |
|               | │         files: 16         │ |
|               | │      format: parquet      │ |
|               | └───────────────────────────┘ |
|               |                               |
+---------------+-------------------------------+
1 row(s) fetched.
Elapsed 0.017 seconds.

> explain format indent select * from 'hits.parquet' OFFSET 99000000 LIMIT 5;
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| plan_type     | plan                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                      |
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
| logical_plan  | Limit: skip=99000000, fetch=5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                             |
|               |   TableScan: hits.parquet projection=[WatchID, JavaEnable, Title, GoodEvent, EventTime, EventDate, CounterID, ClientIP, RegionID, UserID, CounterClass, OS, UserAgent, URL, Referer, IsRefresh, RefererCategoryID, RefererRegionID, URLCategoryID, URLRegionID, ResolutionWidth, ResolutionHeight, ResolutionDepth, FlashMajor, FlashMinor, FlashMinor2, NetMajor, NetMinor, UserAgentMajor, UserAgentMinor, CookieEnable, JavascriptEnable, IsMobile, MobilePhone, MobilePhoneModel, Params, IPNetworkID, TraficSourceID, SearchEngineID, SearchPhrase, AdvEngineID, IsArtifical, WindowClientWidth, WindowClientHeight, ClientTimeZone, ClientEventTime, SilverlightVersion1, SilverlightVersion2, SilverlightVersion3, SilverlightVersion4, PageCharset, CodeVersion, IsLink, IsDownload, IsNotBounce, FUniqID, OriginalURL, HID, IsOldCounter, IsEvent, IsParameter, DontCountHits, WithHash, HitColor, LocalEventTime, Age, Sex, Income, Interests, Robotness, RemoteIP, WindowName, OpenerName, HistoryLength, BrowserLanguage, BrowserCountry, SocialNetwork, SocialAction, HTTPError, SendTiming, DNSTiming, ConnectTiming, ResponseStartTiming, ResponseEndTiming, FetchTiming, SocialSourceNetworkID, SocialSourcePage, ParamPrice, ParamOrderID, ParamCurrency, ParamCurrencyID, OpenstatServiceName, OpenstatCampaignID, OpenstatAdID, OpenstatSourceID, UTMSource, UTMMedium, UTMCampaign, UTMContent, UTMTerm, FromTag, HasGCLID, RefererHash, URLHash, CLID], fetch=99000005                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                               |
| physical_plan | GlobalLimitExec: skip=99000000, fetch=5                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                   |
|               |   CoalescePartitionsExec: fetch=99000005                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                  |
|               |     DataSourceExec: file_groups={16 groups: [[Users/andrewlamb/Software/datafusion/benchmarks/data/hits.parquet:0..923748528], [Users/andrewlamb/Software/datafusion/benchmarks/data/hits.parquet:923748528..1847497056], [Users/andrewlamb/Software/datafusion/benchmarks/data/hits.parquet:1847497056..2771245584], [Users/andrewlamb/Software/datafusion/benchmarks/data/hits.parquet:2771245584..3694994112], [Users/andrewlamb/Software/datafusion/benchmarks/data/hits.parquet:3694994112..4618742640], ...]}, projection=[WatchID, JavaEnable, Title, GoodEvent, EventTime, EventDate, CounterID, ClientIP, RegionID, UserID, CounterClass, OS, UserAgent, URL, Referer, IsRefresh, RefererCategoryID, RefererRegionID, URLCategoryID, URLRegionID, ResolutionWidth, ResolutionHeight, ResolutionDepth, FlashMajor, FlashMinor, FlashMinor2, NetMajor, NetMinor, UserAgentMajor, UserAgentMinor, CookieEnable, JavascriptEnable, IsMobile, MobilePhone, MobilePhoneModel, Params, IPNetworkID, TraficSourceID, SearchEngineID, SearchPhrase, AdvEngineID, IsArtifical, WindowClientWidth, WindowClientHeight, ClientTimeZone, ClientEventTime, SilverlightVersion1, SilverlightVersion2, SilverlightVersion3, SilverlightVersion4, PageCharset, CodeVersion, IsLink, IsDownload, IsNotBounce, FUniqID, OriginalURL, HID, IsOldCounter, IsEvent, IsParameter, DontCountHits, WithHash, HitColor, LocalEventTime, Age, Sex, Income, Interests, Robotness, RemoteIP, WindowName, OpenerName, HistoryLength, BrowserLanguage, BrowserCountry, SocialNetwork, SocialAction, HTTPError, SendTiming, DNSTiming, ConnectTiming, ResponseStartTiming, ResponseEndTiming, FetchTiming, SocialSourceNetworkID, SocialSourcePage, ParamPrice, ParamOrderID, ParamCurrency, ParamCurrencyID, OpenstatServiceName, OpenstatCampaignID, OpenstatAdID, OpenstatSourceID, UTMSource, UTMMedium, UTMCampaign, UTMContent, UTMTerm, FromTag, HasGCLID, RefererHash, URLHash, CLID], limit=99000005, file_type=parquet |
|               |                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                                           |
+---------------+-----------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------------+
2 row(s) fetched.
Elapsed 0.009 seconds.

@zhuqi-lucas
Contributor Author

zhuqi-lucas commented Apr 30, 2026

Thanks @alamb for testing! You found the key issue — the optimization doesn't trigger because hits.parquet is split into 16 file groups (byte-range partitioning by target_partitions). My guard incorrectly treats this as "multi-file" and falls back to GlobalLimitExec.

The root cause: with multiple partitions, each partition reads different byte ranges of the same file independently. The offset needs cross-partition coordination because CoalescePartitionsExec merges partition outputs in non-deterministic order — so we can't know which partition's rows come "first" in the merged stream.

Working on a fix that handles single-file multi-partition correctly. Will test locally with the ClickBench 14GB file before pushing.
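The cross-partition coordination described above can be sketched with a shared atomic counter. Note that `claim_offset` is a hypothetical helper written for illustration, not DataFusion's actual API; the PR itself threads an Arc<AtomicUsize> through the parquet reader:

```rust
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::Arc;

/// Hypothetical helper: a partition atomically claims up to
/// `rows_available` rows from the shared remaining offset, returning
/// how many rows it must skip before emitting any output.
fn claim_offset(remaining: &AtomicUsize, rows_available: usize) -> usize {
    let mut current = remaining.load(Ordering::SeqCst);
    loop {
        let claim = current.min(rows_available);
        // CAS loop: another partition may consume offset concurrently.
        match remaining.compare_exchange(
            current,
            current - claim,
            Ordering::SeqCst,
            Ordering::SeqCst,
        ) {
            Ok(_) => return claim,
            Err(actual) => current = actual,
        }
    }
}

fn main() {
    let remaining = Arc::new(AtomicUsize::new(100));
    // Partition A reaches the counter first with 60 rows: skips all 60.
    assert_eq!(claim_offset(&remaining, 60), 60);
    // Partition B has 70 rows: skips the remaining 40, emits the rest.
    assert_eq!(claim_offset(&remaining, 70), 40);
    assert_eq!(remaining.load(Ordering::SeqCst), 0);
}
```

Which partition reaches the counter first is non-deterministic, but that matches the existing semantics: OFFSET without ORDER BY makes no ordering guarantee either way.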

@zhuqi-lucas zhuqi-lucas force-pushed the feat/offset-pushdown branch from fba644f to eee57af Compare April 30, 2026 03:25
@AntoinePrv

We did recently add the ability to emit row id from the parquet reader 🤔 -- maybe we could make that work and then treat row group skipping as an optimization when the data is explicitly ORDER BY row_number() 🤔

ORDER BY row_number() (and associated Rust / Python functions) as a way to explicitly describe what I mean absolutely works for me.

@rluvaton
Member

The fix for the detect-breaking-changes CI job is now in place, so I updated your branch with main. Sorry for the trouble.

@github-actions

github-actions Bot commented Apr 30, 2026

Thank you for opening this pull request!

Reviewer note: cargo-semver-checks reported the current version number is not SemVer-compatible with the changes in this pull request (compared against the base branch).

Details
     Cloning origin/main
    Building datafusion-datasource v53.1.0 (current)
       Built [  37.917s] (current)
     Parsing datafusion-datasource v53.1.0 (current)
      Parsed [   0.031s] (current)
    Building datafusion-datasource v53.1.0 (baseline)
       Built [  36.464s] (baseline)
     Parsing datafusion-datasource v53.1.0 (baseline)
      Parsed [   0.030s] (baseline)
    Checking datafusion-datasource v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.232s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [  76.061s] datafusion-datasource
    Building datafusion-datasource-parquet v53.1.0 (current)
       Built [  42.180s] (current)
     Parsing datafusion-datasource-parquet v53.1.0 (current)
      Parsed [   0.026s] (current)
    Building datafusion-datasource-parquet v53.1.0 (baseline)
       Built [  42.060s] (baseline)
     Parsing datafusion-datasource-parquet v53.1.0 (baseline)
      Parsed [   0.027s] (baseline)
    Checking datafusion-datasource-parquet v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.148s] 222 checks: 221 pass, 1 fail, 0 warn, 30 skip

--- failure constructible_struct_adds_field: externally-constructible struct adds field ---

Description:
A pub struct constructible with a struct literal has a new pub field. Existing struct literals must be updated to include the new field.
        ref: https://doc.rust-lang.org/reference/expressions/struct-expr.html
       impl: https://github.com/obi1kenobi/cargo-semver-checks/tree/v0.47.0/src/lints/constructible_struct_adds_field.ron

Failed in:
  field ParquetFileMetrics.offset_pruned_row_groups in /home/runner/work/datafusion/datafusion/datafusion/datasource-parquet/src/metrics.rs:53

     Summary semver requires new major version: 1 major and 0 minor checks failed
    Finished [  85.598s] datafusion-datasource-parquet
    Building datafusion-physical-optimizer v53.1.0 (current)
       Built [  36.494s] (current)
     Parsing datafusion-physical-optimizer v53.1.0 (current)
      Parsed [   0.021s] (current)
    Building datafusion-physical-optimizer v53.1.0 (baseline)
       Built [  36.475s] (baseline)
     Parsing datafusion-physical-optimizer v53.1.0 (baseline)
      Parsed [   0.022s] (baseline)
    Checking datafusion-physical-optimizer v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.120s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [  74.237s] datafusion-physical-optimizer
    Building datafusion-physical-plan v53.1.0 (current)
       Built [  31.895s] (current)
     Parsing datafusion-physical-plan v53.1.0 (current)
      Parsed [   0.120s] (current)
    Building datafusion-physical-plan v53.1.0 (baseline)
       Built [  32.144s] (baseline)
     Parsing datafusion-physical-plan v53.1.0 (baseline)
      Parsed [   0.120s] (baseline)
    Checking datafusion-physical-plan v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.545s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [  65.984s] datafusion-physical-plan
    Building datafusion-sqllogictest v53.1.0 (current)
       Built [ 133.122s] (current)
     Parsing datafusion-sqllogictest v53.1.0 (current)
      Parsed [   0.020s] (current)
    Building datafusion-sqllogictest v53.1.0 (baseline)
       Built [ 133.799s] (baseline)
     Parsing datafusion-sqllogictest v53.1.0 (baseline)
      Parsed [   0.023s] (baseline)
    Checking datafusion-sqllogictest v53.1.0 -> v53.1.0 (no change; assume patch)
     Checked [   0.092s] 222 checks: 222 pass, 30 skip
     Summary no semver update required
    Finished [ 269.588s] datafusion-sqllogictest

@rluvaton
Member

rluvaton commented Apr 30, 2026

Great, the breaking-changes check is working!
Note however that it does not add the "api change" (Changes the API exposed to users of the crate) label.

Do you think we should add it?

@rluvaton rluvaton added the "api change" (Changes the API exposed to users of the crate) label Apr 30, 2026
Push OFFSET from GlobalLimitExec down to DataSourceExec/ParquetOpener.
Uses shared Arc<AtomicUsize> counter across partitions so multi-partition
single-file queries (byte-range partitioning) are handled correctly.

Design:
- with_offset accepted for parquet + no filter (any file count)
- SharedCount: each partition atomically consumes offset by skipping RGs
- RowSelection for partial RG skip (remaining offset within first RG)
- Optimizer eliminates GlobalLimitExec when offset is pushed
- effective_limit adjusted per partition based on consumed offset

Implementation:
- FileSource::supports_offset() (parquet=true, others=false)
- FileScanConfig: offset field, with_offset (no filter guard)
- LimitPushdown: push offset, eliminate GlobalLimitExec
- prune_by_offset: skip leading fully-matched RGs
- PreparedAccessPlan::apply_offset() for RowSelection
- Shared Arc<AtomicUsize> remaining_offset in ParquetMorselizer
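As a rough illustration of the prune_by_offset step listed above (a sketch only; the real implementation works against parquet row-group metadata and the reader's access plan):

```rust
/// Sketch of row-group-level offset pruning: given each row group's row
/// count, return (index of the first row group to actually read, rows
/// still to skip inside it via a RowSelection).
fn prune_by_offset(rg_row_counts: &[usize], offset: usize) -> (usize, usize) {
    let mut remaining = offset;
    for (idx, &rows) in rg_row_counts.iter().enumerate() {
        if remaining < rows {
            // Partial skip: the residual lands inside this row group.
            return (idx, remaining);
        }
        // Whole row group skipped: no I/O, no decode, no decompress.
        remaining -= rows;
    }
    // Offset is at or past end of file: nothing left to read.
    (rg_row_counts.len(), 0)
}

fn main() {
    // A 60M-row file in 100k-row groups with OFFSET 59_000_050:
    // 590 whole row groups are pruned and 50 rows remain to skip.
    let rgs = vec![100_000; 600];
    assert_eq!(prune_by_offset(&rgs, 59_000_050), (590, 50));
}
```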
@zhuqi-lucas zhuqi-lucas force-pushed the feat/offset-pushdown branch from 121064f to fb717c5 Compare April 30, 2026 07:45
When a single parquet file is split into multiple byte-range partitions,
CoalescePartitionsExec sits between GlobalLimitExec and DataSourceExec.
Previously, the optimizer would keep the GlobalLimitExec for the skip,
meaning offset wasn't pushed to the parquet reader.

Now we try to push the offset through the combining operator to its
DataSourceExec child, which uses a shared Arc<AtomicUsize> counter to
coordinate offset consumption across partitions. This eliminates the
GlobalLimitExec entirely for supported sources.

Result on 14GB hits.parquet: OFFSET 99M LIMIT 5 from 3.2s → 29ms.

Co-Authored-By: Claude Opus 4.6 (1M context) <noreply@anthropic.com>
@zhuqi-lucas zhuqi-lucas force-pushed the feat/offset-pushdown branch from fb717c5 to 1691b07 Compare April 30, 2026 07:55
@zhuqi-lucas
Contributor Author

Thanks @alamb for the great discussion and testing!

Update: multi-partition issue fixed

The optimization now works on large files with byte-range partitioning. The key fix was pushing offset through CoalescePartitionsExec to its DataSourceExec child, using a shared Arc<AtomicUsize> counter so all partitions coordinate offset consumption atomically.

Testing on the same 14GB hits.parquet:

> select * from 'hits.parquet' OFFSET 99000000 LIMIT 5;
....

5 row(s) fetched.
Elapsed 0.002 seconds.
> explain select * from 'hits.parquet' OFFSET 99000000 LIMIT 5;
CoalescePartitionsExec: fetch=99000005
  DataSourceExec: file_groups={12 groups: [...]}, limit=99000005, offset=99000000
-- No GlobalLimitExec needed

On semantics

I agree with you and @asolimando that OFFSET without ORDER BY has no guaranteed ordering per SQL standard. This PR does not change the semantics — it produces exactly the same results as the current GlobalLimitExec skip (physical row order). The only difference is we skip rows at the parquet reader level (RG pruning + RowSelection) instead of reading and discarding them.
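To make the "RG pruning + RowSelection" step above concrete, here is a minimal sketch; `Run` is a hypothetical stand-in for the parquet crate's RowSelection/RowSelector types, not the actual API:

```rust
/// Hypothetical stand-in for the parquet crate's RowSelection:
/// alternating runs of rows to skip or to read within a row group.
#[derive(Debug, Clone, Copy, PartialEq)]
enum Run {
    Skip(usize),
    Select(usize),
}

/// After whole row groups are pruned, turn the residual offset into a
/// selection over the first kept row group: skip `offset` rows, then
/// read at most `limit` of what remains.
fn offset_selection(rg_rows: usize, offset: usize, limit: usize) -> Vec<Run> {
    let readable = rg_rows.saturating_sub(offset);
    vec![Run::Skip(offset.min(rg_rows)), Run::Select(readable.min(limit))]
}

fn main() {
    // Residual offset of 50 inside a 100k-row group, LIMIT 5:
    assert_eq!(
        offset_selection(100_000, 50, 5),
        vec![Run::Skip(50), Run::Select(5)]
    );
}
```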

@asolimando
Member

On semantics

I agree with you and @asolimando that OFFSET without ORDER BY has no guaranteed ordering per SQL standard. This PR does not change the semantics — it produces exactly the same results as the current GlobalLimitExec skip (physical row order). The only difference is we skip rows at the parquet reader level (RG pruning + RowSelection) instead of reading and discarding them.

Apologies if my comment read as referring to the PR; I only meant to address @AntoinePrv's comment, as I was wary of the implications. That said, I definitely see where the ask is coming from, having worked closely with data scientists and data engineers in the past.

This is very exciting work, and @alamb's proposal around using ORDER BY row_number() is an elegant way to optimize this even further for pagination.


Labels

api change (Changes the API exposed to users of the crate), datasource (Changes to the datasource crate), optimizer (Optimizer rules), physical-plan (Changes to the physical-plan crate), sqllogictest (SQL Logic Tests (.slt))


Development

Successfully merging this pull request may close these issues.

Poor performance of parquet pushdown with offset
