feat(mem_wal): local-scoring FTS search for LSM scanner #6951

Open: touch-of-grey wants to merge 5 commits into lance-format:main from touch-of-grey:LsmFtsLocal
@touch-of-grey
Contributor

Summary

Adds local-scoring full-text search to the LSM scanner via LsmScanner::full_text_search(column, query, k). Each LSM source — base table, flushed generations, and the active memtable — is scored independently with its own BM25 corpus stats, then the per-source results are unioned and merged by a top-k sort. Contained entirely within the mem_wal module, with no changes to lance-index.
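
To make "scored independently with its own BM25 corpus stats" concrete, here is a minimal sketch of why the same document can score differently depending on which source holds it. It is not the code in this PR: `SourceStats`, `idf`, and `bm25` are hypothetical names, the IDF is the common Lucene-style variant, and `K1 = 1.2` / `B = 0.75` are the usual textbook defaults, assumed here rather than taken from Lance.

```rust
/// Corpus statistics for one LSM source (hypothetical struct, for illustration only).
struct SourceStats {
    num_docs: f64,    // documents in this source
    doc_freq: f64,    // documents in this source that contain the term
    avg_doc_len: f64, // average document length in this source
}

/// Lucene-style BM25 IDF: ln(1 + (N - n + 0.5) / (n + 0.5)).
/// Assumption: the exact IDF variant used by Lance may differ.
fn idf(s: &SourceStats) -> f64 {
    (1.0 + (s.num_docs - s.doc_freq + 0.5) / (s.doc_freq + 0.5)).ln()
}

/// BM25 score of a single query term for one document, using only the
/// statistics of the source that holds the document (local scoring).
fn bm25(s: &SourceStats, term_freq: f64, doc_len: f64) -> f64 {
    const K1: f64 = 1.2; // assumed default
    const B: f64 = 0.75; // assumed default
    let norm = K1 * (1.0 - B + B * doc_len / s.avg_doc_len);
    idf(s) * term_freq * (K1 + 1.0) / (term_freq + norm)
}
```

With a large base table where the term is rare and a small memtable where it is common, an identical document (same term frequency, same length) gets a higher score in the base table than in the memtable. That gap is the accuracy cost of local scoring that the benchmark's Jaccard comparison is meant to measure.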

Split out of #6910 per @jackye1995's review (#6910 (comment)): land the local mode first with minimal lance-index impact; the global-rescore mode stays as a follow-up.

Changes

  • scanner/fts_search.rs: LsmFtsSearchPlanner (per-source FTS plans → union → per-partition top-k sort → sort-preserving merge).
  • scanner/builder.rs: LsmScanner::full_text_search(column, query, k).
  • memtable/scanner/exec/fts.rs: nullable _score so the active-memtable arm unions with base/flushed sources.
  • benches/mem_wal/fts/: read benchmark + storage sweep (latency + local-vs-merged Jaccard accuracy across flushed-generation counts).
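
The planner shape in the first bullet can be sketched on plain vectors. This is an illustration under assumptions, not the planner itself: `Hit`, `top_k`, `merge_sorted`, and `search` are hypothetical names, each inner `Vec` stands in for one source's plan output, and `score` is an `Option<f32>` only to mirror the nullable `_score` column. Placing missing scores last is also an assumption.

```rust
use std::cmp::Ordering;

/// One search hit. `score` is optional to mirror a nullable `_score` column.
#[derive(Debug, Clone, PartialEq)]
struct Hit {
    row_id: u64,
    score: Option<f32>,
}

/// Order by score descending; hits without a score sort last (assumed policy).
fn by_score_desc(a: &Hit, b: &Hit) -> Ordering {
    match (a.score, b.score) {
        (Some(x), Some(y)) => y.total_cmp(&x),
        (Some(_), None) => Ordering::Less,
        (None, Some(_)) => Ordering::Greater,
        (None, None) => Ordering::Equal,
    }
}

/// Per-partition step: sort one source's hits and keep only the best `k`.
fn top_k(mut hits: Vec<Hit>, k: usize) -> Vec<Hit> {
    hits.sort_by(by_score_desc);
    hits.truncate(k);
    hits
}

/// Merge step: every partition is already sorted, so repeatedly taking the
/// best head preserves the order without re-sorting the union.
fn merge_sorted(partitions: &[Vec<Hit>], k: usize) -> Vec<Hit> {
    let mut cursors = vec![0usize; partitions.len()];
    let mut out = Vec::with_capacity(k);
    while out.len() < k {
        // Find the partition whose next unread hit ranks first.
        let mut best: Option<usize> = None;
        for (p, part) in partitions.iter().enumerate() {
            let Some(candidate) = part.get(cursors[p]) else { continue };
            let better = match best {
                None => true,
                Some(b) => by_score_desc(candidate, &partitions[b][cursors[b]]) == Ordering::Less,
            };
            if better {
                best = Some(p);
            }
        }
        let Some(p) = best else { break }; // all partitions exhausted
        out.push(partitions[p][cursors[p]].clone());
        cursors[p] += 1;
    }
    out
}

/// Union of per-source results, capped at `k` per source, then merged.
fn search(sources: Vec<Vec<Hit>>, k: usize) -> Vec<Hit> {
    let partitions: Vec<Vec<Hit>> = sources.into_iter().map(|s| top_k(s, k)).collect();
    merge_sorted(&partitions, k)
}
```

Capping each partition at `k` before the merge is safe because no source can contribute more than `k` rows to the final top-k. The linear scan over partition heads is chosen for readability; with many sources a heap-based merge would be the usual choice.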


@claude claude Bot left a comment

Claude Code Review

This pull request is from a fork — automated review is disabled. A repository maintainer can comment @claude review to run a one-time review.

@touch-of-grey
Contributor Author

@jackye1995 PTAL. This is the local-mode-first split from #6910, per your review there: #6910 (comment)

Local scoring only, kept entirely inside mem_wal with no lance-index changes. The global-rescore mode from the original PR is left as a follow-up.

@github-actions github-actions Bot added the enhancement New feature or request label May 27, 2026
@codecov
codecov Bot commented May 27, 2026

Codecov Report

❌ Patch coverage is 78.07808% with 73 lines in your changes missing coverage. Please review.

Files with missing lines | Patch % | Lines
...st/lance/src/dataset/mem_wal/scanner/fts_search.rs | 83.01% | 43 Missing and 10 partials ⚠️
rust/lance/src/dataset/mem_wal/scanner/builder.rs | 0.00% | 20 Missing ⚠️

Adds full-text search to LsmScanner spanning the base table, flushed
memtable generations, and active/frozen in-memory memtables. Each source
scores with its own corpus statistics (local BM25); the planner unions
the per-source plans and merges by _score DESC with a top-k cap pushed
into each partition. Single-pass, no cross-source stat coordination.

Contained entirely in the mem_wal module: reuses scanner.full_text_search
for base/flushed Lance sources and MemTableScanner for the active arm. No
lance-index changes. A globally-consistent scoring mode is deferred as a
follow-up (the earlier draft showed it carries a real latency penalty).

- LsmFtsSearchPlanner + LsmScanner::full_text_search(column, query, k).
- Align FtsIndexExec `_score` nullability with the on-disk FTS schema so
  the active arm unions with base/flushed arms.
- mem_wal_fts_read_bench: local-scoring latency panel (ShardWriter
  ingestion, FineWeb text) with an optional --with-baseline merged-index
  ground truth reporting top-k Jaccard of local-LSM vs a globally merged
  index; run_fts_read_sweep.sh drives NVMe + S3.
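
For reference, the top-k Jaccard metric mentioned above can be computed as below. This is a sketch, not the benchmark's code: `top_k_jaccard` is a hypothetical name, results are assumed to be identified by `u64` row ids, and treating two empty result sets as a perfect match (1.0) is a convention chosen here.

```rust
use std::collections::HashSet;

/// Jaccard similarity |A ∩ B| / |A ∪ B| between two top-k result sets,
/// ignoring rank order. Returns 1.0 when both sets are empty (assumed convention).
fn top_k_jaccard(local: &[u64], merged: &[u64]) -> f64 {
    let a: HashSet<u64> = local.iter().copied().collect();
    let b: HashSet<u64> = merged.iter().copied().collect();
    let union = a.union(&b).count();
    if union == 0 {
        return 1.0;
    }
    a.intersection(&b).count() as f64 / union as f64
}
```

For example, if local scoring returns rows {1, 2, 3, 4} and the merged-index baseline returns {2, 3, 4, 5}, three rows are shared out of five distinct rows, so the similarity is 3/5. Because the metric is set-based, it does not penalize the same rows appearing in a different order.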

run_fts_read_sweep.sh applied a single fixed flushed-generations value, so
the earlier sweep only exercised 2 flushed layers. Add a GENS_LIST
dimension (default "1 2 5") to the (backend, base, k) matrix and tag it
into the config name (g<gens>), reusing the prepared base across gens
since each search ingests into its own fresh shard.
…o sweep

Add an s3express storage tier (S3 Express One Zone directory bucket,
auto-detected by lance via the --x-s3 suffix; instance must be co-located
in the bucket's AZ). Restrict the storage-independent merged-index
accuracy baseline to BASELINE_BACKEND (default nvme) so it isn't rebuilt
redundantly across tiers.

Mirror LsmVectorSearchPlanner and the post-lance-format#6929 planner shape: thread
session and FlushedMemTableCache into LsmFtsSearchPlanner so repeated
searches against the same flushed generation skip the dataset re-open.
Replace the raw DatasetBuilder open with open_flushed_dataset.
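
The effect of threading a cache through the planner can be sketched as follows. This does not reflect the real FlushedMemTableCache or open_flushed_dataset signatures, which are not shown in this conversation: `FlushedDataset`, `GenerationCache`, and `get_or_open` are hypothetical stand-ins, and keying by a `u64` generation id is an assumption.

```rust
use std::collections::HashMap;
use std::sync::{Arc, Mutex};

/// Stand-in for an opened flushed-generation dataset (hypothetical type).
struct FlushedDataset {
    generation: u64,
}

/// Hypothetical cache of opened datasets, keyed by flushed-generation id.
#[derive(Default)]
struct GenerationCache {
    opened: Mutex<HashMap<u64, Arc<FlushedDataset>>>,
}

impl GenerationCache {
    /// Return the cached handle for `generation`, calling `open` only on a miss.
    /// Simplification: the lock is held while `open` runs, so concurrent
    /// opens of different generations are serialized in this sketch.
    fn get_or_open<F>(&self, generation: u64, open: F) -> Arc<FlushedDataset>
    where
        F: FnOnce() -> FlushedDataset,
    {
        let mut opened = self.opened.lock().unwrap();
        let entry = opened.entry(generation).or_insert_with(|| Arc::new(open()));
        Arc::clone(entry)
    }
}
```

A second search against the same generation then gets the shared `Arc` back without running the open closure again, which is the re-open cost this commit is meant to avoid.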