perf(lexical/search): wire the decoded posting-list cache #770
Merged
`CacheManager::posting_cache` was declared `#[allow(dead_code)]` and the `enable_posting_cache` config flag was unused, so every `SegmentReader::postings` call re-opened the segment's `.post` file, re-decoded the varint posting list, re-applied deletions, and rebuilt the skip table on every query; the read alone dominates on cloud/remote storage. Wire a real posting cache:

- New `posting_cache.rs`: `PostingCache`, a per-`SegmentReader` byte-budget LRU over `Arc<DecodedPostingList>`. It wraps `lru::LruCache::unbounded()` with a `cur_bytes`/`max_bytes` accountant: `put` evicts the least-recently-used entries until under budget, a single list larger than the whole budget is not cached, and a budget of 0 disables it.
- `SegmentReader` caches the decoded, deletion-filtered list per `(field, term)`. A segment is immutable for a reader snapshot (deletions are fixed), so the filtered list is always consistent; a commit builds fresh readers with empty caches. A hit clones the list to build the existing owned-data iterator (a Vec memcpy, cheaper than the decode it replaces) and skips the storage read.
- The cache is disabled by default in `SegmentReader::open` (budget 0), so merge-engine and test readers, which read each term once, are byte-for-byte unchanged. `InvertedIndexReader::new` enables it per query reader, gated by `enable_posting_cache` and budgeted by `max_cache_memory` (no new config).
- Removed the dead `CacheManager::posting_cache` field.

Tests: `PostingCache` unit tests (hit/miss, byte-budget eviction, oversized skip, disabled) and a reader test (a repeated `postings` is a cache hit; a delete + commit excludes the doc in the new snapshot, so no stale list).

Docs (en/ja) updated.

No public API change (internal; `enable_posting_cache` is now honoured).

Closes #612
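The byte-budget accounting described above can be illustrated with a minimal, std-only sketch. This is not the PR's code: the PR wraps `lru::LruCache::unbounded()`, whereas this sketch tracks recency with a plain `VecDeque` (linear-time touch) to stay dependency-free. The `DecodedPostingList` fields, `byte_size`, `TermKey`, and the accessor names are assumptions made for illustration.

```rust
use std::collections::{HashMap, VecDeque};
use std::sync::Arc;

/// Hypothetical stand-in for the PR's `DecodedPostingList`.
pub struct DecodedPostingList {
    pub doc_ids: Vec<u64>,
}

impl DecodedPostingList {
    /// Approximate heap footprint used for budget accounting (assumed).
    pub fn byte_size(&self) -> usize {
        self.doc_ids.len() * std::mem::size_of::<u64>()
    }
}

/// Assumed key shape: `(field, term)`.
pub type TermKey = (String, String);

/// Byte-budget LRU sketch. `order` holds keys from least to most recently used.
pub struct PostingCache {
    map: HashMap<TermKey, Arc<DecodedPostingList>>,
    order: VecDeque<TermKey>,
    cur_bytes: usize,
    max_bytes: usize,
}

impl PostingCache {
    pub fn new(max_bytes: usize) -> Self {
        Self { map: HashMap::new(), order: VecDeque::new(), cur_bytes: 0, max_bytes }
    }

    /// Drops `key` from the recency queue, if present.
    fn forget(&mut self, key: &TermKey) {
        if let Some(pos) = self.order.iter().position(|k| k == key) {
            self.order.remove(pos);
        }
    }

    /// Returns a shared handle and marks the entry most recently used.
    pub fn get(&mut self, key: &TermKey) -> Option<Arc<DecodedPostingList>> {
        let hit = self.map.get(key).cloned()?;
        self.forget(key);
        self.order.push_back(key.clone());
        Some(hit)
    }

    pub fn put(&mut self, key: TermKey, list: Arc<DecodedPostingList>) {
        let size = list.byte_size();
        // A budget of 0 disables the cache; an oversized list is never cached.
        if self.max_bytes == 0 || size > self.max_bytes {
            return;
        }
        // Replacing an existing entry releases its bytes first.
        if let Some(old) = self.map.remove(&key) {
            self.cur_bytes -= old.byte_size();
            self.forget(&key);
        }
        // Evict least-recently-used entries until the new list fits.
        while self.cur_bytes + size > self.max_bytes {
            match self.order.pop_front() {
                Some(victim) => {
                    if let Some(evicted) = self.map.remove(&victim) {
                        self.cur_bytes -= evicted.byte_size();
                    }
                }
                None => break,
            }
        }
        self.cur_bytes += size;
        self.order.push_back(key.clone());
        self.map.insert(key, list);
    }

    pub fn len(&self) -> usize {
        self.map.len()
    }

    pub fn cur_bytes(&self) -> usize {
        self.cur_bytes
    }
}
```

With a 32-byte budget and three 16-byte lists, inserting the third list evicts whichever of the first two was touched least recently, while a 40-byte list is skipped without disturbing the existing entries.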
Closes #612 (audit task LS-35, umbrella #534).
Problem
`CacheManager::posting_cache` was declared `#[allow(dead_code)]` and the `enable_posting_cache` config flag was unused. Every `SegmentReader::postings` call therefore re-opened the segment's `.post` file, re-decoded the varint posting list, re-applied deletions, and rebuilt the skip table on every query; on cloud/remote storage the read alone dominates.
Change
- New `posting_cache.rs`, `PostingCache`: a per-`SegmentReader` byte-budget LRU over `Arc<DecodedPostingList>`. It wraps `lru::LruCache::unbounded()` with a `cur_bytes`/`max_bytes` accountant: `put` evicts least-recently-used entries until under budget, a single list larger than the whole budget is not cached (avoids thrash), and a budget of `0` disables it. The budget is in bytes (not entry count) because posting lists range from a few bytes to many megabytes.
- `SegmentReader` caches the decoded, deletion-filtered list per `(field, term)`. A segment is immutable for a reader snapshot (deletions are fixed: `has_deletions` is set at open, the bitmap is load-once), so the filtered list is always consistent; a commit builds fresh segment readers with empty caches. A hit clones the list to build the existing owned-data iterator (`from_decoded_soa`). That clone is a `Vec` memcpy, cheaper than the varint decode it replaces, so it is never a regression vs today, and it skips the storage read entirely.
- The cache is disabled by default in `SegmentReader::open` (budget 0), so merge-engine and test readers, which read each term once, are byte-for-byte unchanged (no key allocation, no clone). `InvertedIndexReader::new` enables it per query reader, gated by `enable_posting_cache` (default `true`) and budgeted by `max_cache_memory`; no new config fields.
- Removed the dead `CacheManager::posting_cache` field.

A zero-copy `Arc`-sharing iterator (no clone on hit) is a larger hot-path refactor, noted as a follow-up.
Scope / API
`CacheManager::posting_cache` and `enable_posting_cache` are internal (not exposed in any binding); no public API change. The previously-dead `enable_posting_cache` flag is now honoured.
Verification
- `cargo build` (full workspace + bindings) ✅
- `cargo clippy --all-targets -- -D warnings`: zero warnings ✅
- `cargo fmt --check`: clean ✅
- `cargo test -p laurus --lib`: 1102 passed / 0 failed (+6: 5 unit + 1 integration); `cargo test --workspace`: exit 0, 51 binaries ✅
- `markdownlint-cli2`: 0 errors; docs (en + ja) updated ✅

Tests:
`PostingCache` unit tests (hit/miss, byte-budget LRU eviction, oversized-entry skip, disabled no-op), and a reader integration test (a repeated `postings(field, term)` is a cache hit via stats; a delete + commit excludes the doc in the new snapshot, so the cache never serves a stale, pre-deletion list).

🤖 Generated with Claude Code
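
The snapshot argument behind the reader integration test (a repeated lookup is a hit, and a delete + commit is served by a fresh reader with an empty cache) can be sketched as follows. This is a simplified model under stated assumptions, not the PR's implementation: `Index`, `CacheStats`, the in-memory `stored` map standing in for the `.post` file, and the unbudgeted `HashMap` cache are all hypothetical, and the key is reduced to the term alone.

```rust
use std::cell::RefCell;
use std::collections::{HashMap, HashSet};
use std::sync::Arc;

/// Hypothetical hit/miss counters; the PR's real stats type is not shown.
#[derive(Debug, Default, Clone, Copy, PartialEq)]
pub struct CacheStats {
    pub hits: u64,
    pub misses: u64,
}

/// Simplified reader snapshot: postings and deletions are frozen at open.
pub struct SegmentReader {
    stored: Arc<HashMap<String, Vec<u64>>>, // stands in for the `.post` file
    deleted: HashSet<u64>,                  // load-once deletion set
    cache: RefCell<HashMap<String, Arc<Vec<u64>>>>,
    stats: RefCell<CacheStats>,
}

impl SegmentReader {
    pub fn open(stored: Arc<HashMap<String, Vec<u64>>>, deleted: HashSet<u64>) -> Self {
        Self {
            stored,
            deleted,
            cache: RefCell::new(HashMap::new()),
            stats: RefCell::new(CacheStats::default()),
        }
    }

    /// Returns an owned, deletion-filtered list; a hit clones the cached Vec
    /// and never touches `stored`.
    pub fn postings(&self, term: &str) -> Vec<u64> {
        let cached = self.cache.borrow().get(term).cloned();
        if let Some(hit) = cached {
            self.stats.borrow_mut().hits += 1;
            return (*hit).clone();
        }
        self.stats.borrow_mut().misses += 1;
        // Miss: "decode" from storage and filter out deleted documents.
        let filtered: Vec<u64> = self
            .stored
            .get(term)
            .into_iter()
            .flatten()
            .copied()
            .filter(|doc| !self.deleted.contains(doc))
            .collect();
        self.cache
            .borrow_mut()
            .insert(term.to_string(), Arc::new(filtered.clone()));
        filtered
    }

    pub fn stats(&self) -> CacheStats {
        *self.stats.borrow()
    }
}

/// Hypothetical index handle that hands out reader snapshots.
pub struct Index {
    stored: Arc<HashMap<String, Vec<u64>>>,
    deleted: HashSet<u64>,
}

impl Index {
    pub fn new(stored: HashMap<String, Vec<u64>>) -> Self {
        Self { stored: Arc::new(stored), deleted: HashSet::new() }
    }

    pub fn delete(&mut self, doc: u64) {
        self.deleted.insert(doc);
    }

    /// Models a commit: a new snapshot that starts with an empty cache.
    pub fn reader(&self) -> SegmentReader {
        SegmentReader::open(Arc::clone(&self.stored), self.deleted.clone())
    }
}
```

Because each reader copies the deletion set at open and owns its cache, a list cached before a delete can only be served by the old snapshot; the reader built after the commit starts from a miss and filters against the updated deletions.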