Reduce update-path spec-lock contention: hoist preprocessing, relabel unchanged vectors#10012
Draft
ofiryanai wants to merge 4 commits into
Draft
Reduce update-path spec-lock contention: hoist preprocessing, relabel unchanged vectors#10012ofiryanai wants to merge 4 commits into
ofiryanai wants to merge 4 commits into
Conversation
… unchanged vectors Two changes to the HSET-update indexing path (IndexSpec_UpdateDoc) that shorten how long the spec write lock is held while a document is re-indexed, reducing contention with concurrent searches (which take the read side of the same lock). 1. Hoist preprocessing out of the write lock. Split the per-field preprocessor loop out of Document_AddToIndexes into a new Document_Preprocess, and run it (plus NewAddDocumentCtx and Document_MakeStringsOwner) BEFORE taking the spec write lock. Preprocessing -- tokenization, vector blob copy/normalize, tag splitting, etc. -- is pure per-document CPU work that only writes aCtx-local scratch and reads the (main-thread-stable) schema, so it does not need the lock. Only the actual index mutations (IndexDocument) now run under it. 2. Relabel unchanged vectors instead of delete + re-insert. Every HSET assigns a new internal doc id, which previously forced every vector to be deleted from and re-inserted into the (tiered) HNSW index even when the vector value was unchanged -- churning the graph and burning background-worker CPU. The command filter already records which hash fields the command modified; for a vector field NOT in that set, makeDocumentId now relabels the existing vector from the old doc id to the new one via the new VecSimIndex_RelabelVector API (O(1), no graph churn) and marks the field (skipVectorAdd) so the indexer skips re-adding the identical vector. Fields that changed -- or when the changed set is unknown (filter disabled) -- keep the existing delete + re-add path. This requires VecSimIndex_RelabelVector and bumps the VectorSimilarity submodule to the branch carrying it. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
🛡️ Jit Security Scan Results✅ No security findings were detected in this PR
Security scan by Jit
|
The command filter (search-partial-indexed-docs) opened the key only to check it is an existing hash before capturing the modified field names. That is a keyspace lookup plus a key-string allocation on every hash-write command, including writes to keys no index covers. It is unnecessary: hashFields is consumed only when the matching keyspace notification fires (so the command really did modify a hash), and freeHashFields() clears any stale capture at the top of the next hash-write. A new-key HSET now goes through the changed-field gate instead of a forced full reindex, which is correct and slightly cheaper. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
With PARTIAL_INDEXED_DOCS enabled, exercises the relabel path end-to-end: an HSET that changes only non-vector fields must keep the vector intact (KNN still returns the doc at distance 0), while an HSET that changes the vector must reflect the new one. Also covers the new-key gating after dropping the filter's OpenKey, and DEL. Runs over HNSW (tiered) and FLAT. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
FT.SEARCH returned the document's vector blob, which the decode_responses test client failed to UTF-8-decode (UnicodeDecodeError). RETURN only the distance and the text field. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
|
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.



Summary
Two changes to the
HSET-update indexing path (IndexSpec_UpdateDoc) that shorten how long the spec write lock is held while a document is re-indexed, reducing contention with concurrent searches (which take the read side of the same lock). Targets the common pattern of frequent partial updates to documents that contain large, rarely-changing vector fields.Changes
1. Hoist preprocessing out of the write lock (
document.c,document.h,spec.c)The per-field preprocessor loop is split out of
Document_AddToIndexesinto a newDocument_Preprocess, and run — together withNewAddDocumentCtxandDocument_MakeStringsOwner— beforeRedisSearchCtx_LockSpecWrite. Preprocessing (tokenization, vector blob copy/normalize, tag splitting, …) is pure per-document CPU work that only writesaCtx-local scratch and reads the main-thread-stable schema, so it doesn't need the lock. Only the actual index mutations (IndexDocument) now run under it.2. Relabel unchanged vectors instead of delete + re-insert (
indexer.c,indexer.h,document.c,spec.c)Every
HSETassigns a new internal doc id, which previously forced every vector to beVecSimIndex_DeleteVector+VecSimIndex_AddVectorinto the tiered HNSW index even when the vector value was unchanged — churning the graph (ghost/marked-deleted nodes, repair jobs) and burning background-worker CPU. The command filter already records which hash fields the command modified (aCtx->hashFields); for a vector field not in that set,makeDocumentIdnow relabels the existing vector from the old doc id to the new one via the newVecSimIndex_RelabelVectorAPI (O(1), no graph churn) and setsskipVectorAddso the indexer skips re-adding the identical vector. Changed fields — or when the changed set is unknown (filterCommandsdisabled) — keep the existing delete + re-add path, so behavior is unchanged in that case.This implements the long-standing
// TODO: use VecSimReplaceinmakeDocumentId.Dependency
Requires the new
VecSimIndex_RelabelVectorAPI: RedisAI/VectorSimilarity#978. This PR bumps thedeps/VectorSimilaritysubmodule to that branch (8ec72b63); it must be repointed to the merged commit before this is mergeable.Validation
The VecSim API has unit tests (single/multi, flat/HNSW/tiered incl. the pending-insert-job rekey case). The RediSearch changes were not built/tested locally — the local (macOS) build is blocked by the VectorSimilarity/SVS dependency chain — so please rely on CI to build and run the suite. Suggested focused coverage: HSET partial-update flows with vector + text/tag fields under
filterCommands, and concurrent search/update.🤖 Generated with Claude Code