Skip to content

Reduce update-path spec-lock contention: hoist preprocessing, relabel unchanged vectors#10012

Draft
ofiryanai wants to merge 4 commits into
8.4from
ofir-partial-update-lock
Draft

Reduce update-path spec-lock contention: hoist preprocessing, relabel unchanged vectors#10012
ofiryanai wants to merge 4 commits into
8.4from
ofir-partial-update-lock

Conversation

@ofiryanai

Copy link
Copy Markdown
Collaborator

Summary

Two changes to the HSET-update indexing path (IndexSpec_UpdateDoc) that shorten how long the spec write lock is held while a document is re-indexed, reducing contention with concurrent searches (which take the read side of the same lock). Targets the common pattern of frequent partial updates to documents that contain large, rarely-changing vector fields.

Changes

1. Hoist preprocessing out of the write lock (document.c, document.h, spec.c)
The per-field preprocessor loop is split out of Document_AddToIndexes into a new Document_Preprocess, and run — together with NewAddDocumentCtx and Document_MakeStringsOwnerbefore RedisSearchCtx_LockSpecWrite. Preprocessing (tokenization, vector blob copy/normalize, tag splitting, …) is pure per-document CPU work that only writes aCtx-local scratch and reads the main-thread-stable schema, so it doesn't need the lock. Only the actual index mutations (IndexDocument) now run under it.

2. Relabel unchanged vectors instead of delete + re-insert (indexer.c, indexer.h, document.c, spec.c)
Every HSET assigns a new internal doc id, which previously forced every vector to be VecSimIndex_DeleteVector + VecSimIndex_AddVector into the tiered HNSW index even when the vector value was unchanged — churning the graph (ghost/marked-deleted nodes, repair jobs) and burning background-worker CPU. The command filter already records which hash fields the command modified (aCtx->hashFields); for a vector field not in that set, makeDocumentId now relabels the existing vector from the old doc id to the new one via the new VecSimIndex_RelabelVector API (O(1), no graph churn) and sets skipVectorAdd so the indexer skips re-adding the identical vector. Changed fields — or when the changed set is unknown (filterCommands disabled) — keep the existing delete + re-add path, so behavior is unchanged in that case.

This implements the long-standing // TODO: use VecSimReplace in makeDocumentId.

Dependency

Requires the new VecSimIndex_RelabelVector API: RedisAI/VectorSimilarity#978. This PR bumps the deps/VectorSimilarity submodule to that branch (8ec72b63); it must be repointed to the merged commit before this is mergeable.

Validation

The VecSim API has unit tests (single/multi, flat/HNSW/tiered incl. the pending-insert-job rekey case). The RediSearch changes were not built/tested locally — the local (macOS) build is blocked by the VectorSimilarity/SVS dependency chain — so please rely on CI to build and run the suite. Suggested focused coverage: HSET partial-update flows with vector + text/tag fields under filterCommands, and concurrent search/update.

  • This PR requires release notes
  • This PR does not require release notes

🤖 Generated with Claude Code

… unchanged vectors

Two changes to the HSET-update indexing path (IndexSpec_UpdateDoc) that shorten how
long the spec write lock is held while a document is re-indexed, reducing contention
with concurrent searches (which take the read side of the same lock).

1. Hoist preprocessing out of the write lock.
   Split the per-field preprocessor loop out of Document_AddToIndexes into a new
   Document_Preprocess, and run it (plus NewAddDocumentCtx and Document_MakeStringsOwner)
   BEFORE taking the spec write lock. Preprocessing -- tokenization, vector blob
   copy/normalize, tag splitting, etc. -- is pure per-document CPU work that only writes
   aCtx-local scratch and reads the (main-thread-stable) schema, so it does not need the
   lock. Only the actual index mutations (IndexDocument) now run under it.

2. Relabel unchanged vectors instead of delete + re-insert.
   Every HSET assigns a new internal doc id, which previously forced every vector to be
   deleted from and re-inserted into the (tiered) HNSW index even when the vector value
   was unchanged -- churning the graph and burning background-worker CPU. The command
   filter already records which hash fields the command modified; for a vector field NOT
   in that set, makeDocumentId now relabels the existing vector from the old doc id to the
   new one via the new VecSimIndex_RelabelVector API (O(1), no graph churn) and marks the
   field (skipVectorAdd) so the indexer skips re-adding the identical vector. Fields that
   changed -- or when the changed set is unknown (filter disabled) -- keep the existing
   delete + re-add path. This requires VecSimIndex_RelabelVector and bumps the
   VectorSimilarity submodule to the branch carrying it.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@github-actions github-actions Bot added the size:M label Jun 7, 2026
@jit-ci

jit-ci Bot commented Jun 7, 2026

Copy link
Copy Markdown

🛡️ Jit Security Scan Results

CRITICAL HIGH MEDIUM

✅ No security findings were detected in this PR


Security scan by Jit

ofiryanai and others added 3 commits June 7, 2026 14:34
The command filter (search-partial-indexed-docs) opened the key only to check it
is an existing hash before capturing the modified field names. That is a keyspace
lookup plus a key-string allocation on every hash-write command, including writes
to keys no index covers. It is unnecessary: hashFields is consumed only when the
matching keyspace notification fires (so the command really did modify a hash),
and freeHashFields() clears any stale capture at the top of the next hash-write.
A new-key HSET now goes through the changed-field gate instead of a forced full
reindex, which is correct and slightly cheaper.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
With PARTIAL_INDEXED_DOCS enabled, exercises the relabel path end-to-end: an HSET
that changes only non-vector fields must keep the vector intact (KNN still returns
the doc at distance 0), while an HSET that changes the vector must reflect the new
one. Also covers the new-key gating after dropping the filter's OpenKey, and DEL.
Runs over HNSW (tiered) and FLAT.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
FT.SEARCH returned the document's vector blob, which the decode_responses test
client failed to UTF-8-decode (UnicodeDecodeError). RETURN only the distance and
the text field.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
@sonarqubecloud

sonarqubecloud Bot commented Jun 7, 2026

Copy link
Copy Markdown

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

1 participant