feat(mem_wal): local-scoring FTS search for LSM scanner#6951
Open
touch-of-grey wants to merge 5 commits into
@jackye1995 PTAL. This is the local-mode-first split from #6910, per your review there: #6910 (comment) Local scoring only, kept entirely inside |
Adds full-text search to LsmScanner spanning the base table, flushed memtable generations, and active/frozen in-memory memtables. Each source scores with its own corpus statistics (local BM25); the planner unions the per-source plans and merges by _score DESC with a top-k cap pushed into each partition. Single-pass, no cross-source stat coordination.

Contained entirely in the mem_wal module: reuses scanner.full_text_search for base/flushed Lance sources and MemTableScanner for the active arm. No lance-index changes. A globally-consistent scoring mode is deferred as a follow-up (the earlier draft showed it carries a real latency penalty).

- LsmFtsSearchPlanner + LsmScanner::full_text_search(column, query, k).
- Align FtsIndexExec `_score` nullability with the on-disk FTS schema so the active arm unions with base/flushed arms.
- mem_wal_fts_read_bench: local-scoring latency panel (ShardWriter ingestion, FineWeb text) with an optional --with-baseline merged-index ground truth reporting top-k Jaccard of local-LSM vs a globally merged index; run_fts_read_sweep.sh drives NVMe + S3.
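For reviewers who want the merge semantics spelled out, here is a minimal, standalone sketch of "top-k cap per source, then union and merge by score descending". It is not the planner code: `Hit`, `top_k`, and `merge_top_k` are hypothetical names, the scores are made up, and a plain in-memory sort stands in for the streaming sort-preserving merge.

```rust
use std::cmp::Ordering;

/// One scored hit. `source` is a hypothetical label for the LSM source it came from.
#[derive(Debug, Clone, PartialEq)]
struct Hit {
    source: &'static str,
    row_id: u64,
    score: f32,
}

/// Sort hits by score descending and keep at most `k` of them.
fn top_k(mut hits: Vec<Hit>, k: usize) -> Vec<Hit> {
    hits.sort_by(|a, b| b.score.partial_cmp(&a.score).unwrap_or(Ordering::Equal));
    hits.truncate(k);
    hits
}

/// Cap every source at `k` first, then union the survivors and keep the overall top `k`.
/// Scores are compared as-is: each source computed them from its own corpus statistics.
fn merge_top_k(per_source: Vec<Vec<Hit>>, k: usize) -> Vec<Hit> {
    let capped: Vec<Hit> = per_source
        .into_iter()
        .flat_map(|hits| top_k(hits, k))
        .collect();
    top_k(capped, k)
}

fn main() {
    let base = vec![
        Hit { source: "base", row_id: 1, score: 2.5 },
        Hit { source: "base", row_id: 2, score: 0.5 },
        Hit { source: "base", row_id: 3, score: 1.5 },
    ];
    let flushed = vec![Hit { source: "flushed", row_id: 10, score: 2.0 }];
    let active = vec![Hit { source: "active", row_id: 20, score: 3.0 }];

    for hit in merge_top_k(vec![base, flushed, active], 2) {
        println!("{:?}", hit);
    }
}
```

Capping each source at k before the merge is safe for the final top-k, since no source can contribute more than k rows to it. What the sketch cannot show is the accuracy cost of comparing scores computed from different corpus statistics, which is what the bench's merged-index baseline measures.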
run_fts_read_sweep.sh applied a single fixed flushed-generations value, so the earlier sweep only exercised 2 flushed layers. Add a GENS_LIST dimension (default "1 2 5") to the (backend, base, k) matrix and tag it into the config name (g<gens>), reusing the prepared base across gens since each search ingests into its own fresh shard.
…o sweep Add an s3express storage tier (S3 Express One Zone directory bucket, auto-detected by lance via the --x-s3 suffix; instance must be co-located in the bucket's AZ). Restrict the storage-independent merged-index accuracy baseline to BASELINE_BACKEND (default nvme) so it isn't rebuilt redundantly across tiers.
Mirror LsmVectorSearchPlanner and the post-lance-format#6929 planner shape: thread session and FlushedMemTableCache into LsmFtsSearchPlanner so repeated searches against the same flushed generation skip the dataset re-open. Replace the raw DatasetBuilder open with open_flushed_dataset.
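The effect of caching opened flushed generations can be illustrated with a small standalone sketch. This is not the `FlushedMemTableCache` API: `GenerationCache`, `OpenedGeneration`, and `get_or_open` are hypothetical names, and the "open" is a stand-in for the real dataset open.

```rust
use std::collections::HashMap;
use std::sync::atomic::{AtomicUsize, Ordering};
use std::sync::{Arc, Mutex};

/// Stand-in for an opened flushed-generation dataset (hypothetical).
#[derive(Debug)]
struct OpenedGeneration {
    generation: u64,
}

/// Hypothetical cache keyed by flushed generation number.
#[derive(Default)]
struct GenerationCache {
    entries: Mutex<HashMap<u64, Arc<OpenedGeneration>>>,
    opens: AtomicUsize, // number of simulated opens actually performed
}

impl GenerationCache {
    /// Return the cached handle for `generation`, opening it only on a miss.
    fn get_or_open(&self, generation: u64) -> Arc<OpenedGeneration> {
        let mut entries = self.entries.lock().unwrap();
        if let Some(hit) = entries.get(&generation) {
            return Arc::clone(hit);
        }
        // Cache miss: this is where the expensive open would happen.
        self.opens.fetch_add(1, Ordering::Relaxed);
        let opened = Arc::new(OpenedGeneration { generation });
        entries.insert(generation, Arc::clone(&opened));
        opened
    }

    fn open_count(&self) -> usize {
        self.opens.load(Ordering::Relaxed)
    }
}

fn main() {
    let cache = GenerationCache::default();
    let first = cache.get_or_open(3);
    let second = cache.get_or_open(3);
    println!(
        "generation {} shared: {}, opens: {}",
        first.generation,
        Arc::ptr_eq(&first, &second),
        cache.open_count()
    );
}
```

A second search against the same generation reuses the shared handle, so the open cost is paid once per generation rather than once per query.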
c296c46 to 7a128c7
Summary
Adds local-scoring full-text search to the LSM scanner via `LsmScanner::full_text_search(column, query, k)`. Each LSM source (base table, flushed generations, and the active memtable) is scored independently with its own BM25 corpus stats, then the per-source results are unioned and merged by a top-k sort. Contained entirely within the `mem_wal` module, with no changes to `lance-index`.

Split out of #6910 per @jackye1995's review (#6910 (comment)): land the local mode first with minimal lance-index impact; the global-rescore mode stays as a follow-up.
Changes
- `scanner/fts_search.rs`: `LsmFtsSearchPlanner`: per-source FTS plans → union → per-partition top-k sort → sort-preserving merge.
- `scanner/builder.rs`: `LsmScanner::full_text_search(column, query, k)`.
- `memtable/scanner/exec/fts.rs`: nullable `_score` so the active-memtable arm unions with base/flushed sources.
- `benches/mem_wal/fts/`: read benchmark + storage sweep (latency + local-vs-merged Jaccard accuracy across flushed-generation counts).
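For readers unfamiliar with the accuracy metric, top-k Jaccard is the size of the intersection of the two top-k result sets divided by the size of their union. A minimal sketch follows, assuming results are identified by row id; `top_k_jaccard` is a hypothetical helper, not the bench's function, and treating two empty result sets as identical is an assumption.

```rust
use std::collections::HashSet;

/// Jaccard similarity of two top-k result sets, identified here by row id.
fn top_k_jaccard(local: &[u64], baseline: &[u64]) -> f64 {
    let a: HashSet<u64> = local.iter().copied().collect();
    let b: HashSet<u64> = baseline.iter().copied().collect();
    let union = a.union(&b).count();
    if union == 0 {
        // Assumption: two empty result sets count as identical.
        return 1.0;
    }
    a.intersection(&b).count() as f64 / union as f64
}

fn main() {
    // Made-up row ids: local-LSM top-4 vs merged-index top-4.
    let local: [u64; 4] = [1, 2, 3, 4];
    let merged: [u64; 4] = [2, 3, 4, 5];
    println!("top-k Jaccard = {:.2}", top_k_jaccard(&local, &merged));
}
```

The metric ignores rank order within the top k, so it measures only whether local scoring retrieves the same rows as the merged index, not whether it orders them the same way.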