feat: Add RebuildHNSWIndex command for index maintenance (#833)

himanalot wants to merge 5 commits into HelixDB:main
Conversation
Adds a new built-in command to rebuild HNSW graph edges, fixing
fragmentation caused by vector deletions and re-insertions.
Changes:
- Add reconnect_vector method to HNSW trait
- Implement reconnect_vector in VectorCore
- Add RebuildHNSWIndex handler as built-in endpoint
The command reads all existing vectors, clears HNSW edges, then
reconnects all vectors to rebuild a fully connected graph. Processes
in configurable batches to control memory usage.
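The batched rebuild described above can be sketched as follows. This is an illustrative outline, not the actual HelixDB handler: the function name and the comments marking where transactions and arenas live are assumptions for illustration.

```rust
// Hypothetical sketch of the rebuild loop: process vector IDs in
// fixed-size batches so each write transaction stays small and the
// per-batch arena memory is released between batches.
fn rebuild_in_batches(vector_ids: &[u128], batch_size: usize) -> usize {
    let mut rebuilt = 0;
    for batch in vector_ids.chunks(batch_size) {
        // In the real handler: open a write txn and a fresh bump arena here.
        for &id in batch {
            // In the real handler: load the vector for `id` and call
            // reconnect_vector on it.
            let _ = id;
            rebuilt += 1;
        }
        // In the real handler: commit the txn and drop the arena here.
    }
    rebuilt
}
```

Chunking keeps the last batch correct even when the ID count is not a multiple of `batch_size`.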
Usage:
POST /RebuildHNSWIndex
Returns: {"status": "success", "vectors_rebuilt": N}
```rust
let vector_ids: Vec<u128> = {
    let txn = db.graph_env.read_txn().map_err(GraphError::from)?;
    let arena = bumpalo::Bump::new();
    let vectors = db
        .vectors
        .get_all_vectors(&txn, None, &arena)
        .map_err(|e| GraphError::New(format!("Failed to get vectors: {}", e)))?;
    vectors.iter().map(|v| v.id).collect()
};
```
Loading all vectors into memory defeats the purpose of the batch processing mentioned in the PR description. get_all_vectors loads full vector data for all vectors, which could cause OOM on large datasets.
Consider iterating over vector IDs directly from the database using a prefix iterator (similar to the existing implementation in get_all_vectors at vector_core.rs:446-489) to truly avoid loading all data at once.
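The key-only iteration the reviewer suggests can be sketched like this. It is a minimal stand-in, not the HelixDB implementation: a `BTreeMap` plays the role of the LMDB database and its prefix cursor, and keys are assumed to be big-endian `u128` IDs.

```rust
use std::collections::BTreeMap;

// Illustrative sketch: collect vector IDs by walking keys only,
// never deserializing the (potentially large) stored vector values.
// Big-endian keys make lexicographic order equal numeric order.
fn get_all_vector_ids(db: &BTreeMap<[u8; 16], Vec<u8>>) -> Vec<u128> {
    db.keys() // key-only cursor: values are never touched
        .map(|k| u128::from_be_bytes(*k))
        .collect()
}
```

Memory then scales with the number of IDs (16 bytes each), not with the total size of the vector data.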
…ectors

Addresses review feedback: adds a `get_all_vector_ids` method that iterates over database keys without loading full vector data into memory.
Users can now pass batch_size in the request body to tune performance:
{"batch_size": 10}
Default remains 5 vectors per transaction. Response now includes
the batch_size that was used.
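The defaulting behavior can be sketched as below. The function name is hypothetical, and treating a `batch_size` of 0 as "use the default" is an assumption, since `chunks(0)` would panic; the actual handler's validation may differ.

```rust
// Sketch of batch_size handling: fall back to the default of 5 when
// the request body omits the field or supplies 0 (assumption: 0 is
// rejected because a zero-sized batch is meaningless).
const DEFAULT_BATCH_SIZE: usize = 5;

fn effective_batch_size(requested: Option<usize>) -> usize {
    match requested {
        Some(n) if n > 0 => n,
        _ => DEFAULT_BATCH_SIZE,
    }
}
```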
@greptile
Changed reconnect_vector to return an error when search_level returns empty results instead of silently continuing. This makes debugging easier and ensures graph integrity.
Force-pushed from 3081264 to 8e4aa54.
That last change was because greptile messed up the diff in its suggested commit.

@greptile
```rust
if let Some(closest) = nearest.peek() {
    curr_ep = *closest;
}
```
Silent failure allows continuing with stale curr_ep, diverging from insert's error handling at vector_core.rs:648-650. This could produce incorrect graph connections.
```diff
-if let Some(closest) = nearest.peek() {
-    curr_ep = *closest;
-}
+curr_ep = *nearest.peek().ok_or(VectorError::VectorCoreError(
+    "empty search result".to_string(),
+))?;
```
Apply same error handling to the connection level loop - return error instead of silently continuing with stale curr_ep.
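The same pattern applied to a descent loop can be sketched in isolation. This is a simplified stand-in: `descend`, the slice-of-levels shape, and `first()` in place of a heap's `peek()` are all illustrative; only the `ok_or` error-propagation pattern and the `VectorError::VectorCoreError` variant come from the suggestion above.

```rust
#[derive(Debug)]
enum VectorError {
    VectorCoreError(String),
}

// Walk down the levels, moving the entry point to the closest result
// at each level; propagate an error on an empty search result instead
// of silently continuing with the stale curr_ep.
fn descend(levels: &[Vec<u64>], mut curr_ep: u64) -> Result<u64, VectorError> {
    for nearest in levels {
        curr_ep = *nearest.first().ok_or(VectorError::VectorCoreError(
            "empty search result".to_string(),
        ))?;
    }
    Ok(curr_ep)
}
```

With this shape, an empty level surfaces immediately as an error rather than producing incorrect graph connections downstream.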
@greptile
### Summary

Adds a new built-in command `RebuildHNSWIndex` that rebuilds HNSW graph edges, fixing fragmentation caused by vector deletions and re-insertions.

### Problem

When vectors are deleted and re-added to an HNSW index, the graph becomes fragmented. The entry point only connects to vectors inserted after it, leaving older vectors unreachable via search. This causes search quality degradation over time.

### Solution

The `RebuildHNSWIndex` command reads all existing vector IDs, clears the HNSW edges, then reconnects every vector in configurable batches to rebuild a fully connected graph.

### Changes

- `helix-db/src/helix_engine/vector_core/hnsw.rs`: added the `reconnect_vector` method to the HNSW trait
- `helix-db/src/helix_engine/vector_core/vector_core.rs`: implemented `reconnect_vector` and added `get_all_vector_ids` for memory-efficient ID iteration
- `helix-db/src/helix_gateway/builtin/rebuild_hnsw_index.rs`: new `RebuildHNSWIndex` handler
- `helix-db/src/helix_gateway/builtin/mod.rs`: registered the built-in endpoint

### Usage

POST /RebuildHNSWIndex

Returns: `{"status": "success", "vectors_rebuilt": 97862, "batch_size": 5}`

### Performance Notes

Uses `get_all_vector_ids()` to iterate over keys without loading full vector data.

### What It Does NOT Do
🤖 Generated with Claude Code
Greptile Overview
### Greptile Summary

Adds the `RebuildHNSWIndex` command to fix graph fragmentation caused by vector deletions and re-insertions. The implementation properly reconnects all vectors in memory-efficient batches.

Key changes:

- Added the `reconnect_vector()` trait method and implementation to rebuild HNSW edges while preserving vector IDs and levels
- Added `get_all_vector_ids()` for memory-efficient ID iteration without loading full vector data
- Added the `/RebuildHNSWIndex` HTTP endpoint with a configurable `batch_size` parameter
- Matched the `insert` method's behavior (returns errors instead of silently continuing)

Quality:

### Important Files Changed

- Added the `reconnect_vector` trait method for index rebuilding - well-documented and properly integrated
- Implemented `reconnect_vector` with proper error handling and added the memory-efficient `get_all_vector_ids` method

### Sequence Diagram
```mermaid
sequenceDiagram
    participant Client
    participant Handler as RebuildHNSWIndex Handler
    participant DB as Database
    participant VectorCore
    participant HNSW as HNSW Graph
    Client->>Handler: POST /RebuildHNSWIndex {batch_size: 10}
    Handler->>Handler: Parse batch_size (default: 5)
    Note over Handler,DB: Step 1: Collect Vector IDs
    Handler->>DB: read_txn()
    Handler->>VectorCore: get_all_vector_ids()
    VectorCore->>DB: prefix_iter(VECTOR_PREFIX)
    DB-->>VectorCore: Keys only (no vector data)
    VectorCore-->>Handler: Vec<u128> IDs
    Note over Handler,HNSW: Step 2: Clear Existing Graph
    Handler->>DB: write_txn()
    Handler->>HNSW: edges_db.clear()
    Handler->>HNSW: Delete entry_point
    Handler->>DB: commit()
    Note over Handler,HNSW: Step 3: Reconnect in Batches
    loop For each batch of vectors
        Handler->>Handler: Create fresh arena
        Handler->>DB: write_txn()
        loop For each vector_id in batch
            Handler->>VectorCore: get_full_vector(id)
            VectorCore-->>Handler: HVector with existing level
            Handler->>VectorCore: reconnect_vector(vector)
            alt No entry point yet
                VectorCore->>HNSW: set_entry_point(vector)
            else Entry point exists
                VectorCore->>HNSW: get_entry_point()
                loop Navigate to insertion level
                    VectorCore->>HNSW: search_level(vector, level)
                    HNSW-->>VectorCore: nearest neighbors
                end
                loop Connect at each level
                    VectorCore->>HNSW: search_level(vector, level)
                    VectorCore->>HNSW: select_neighbors()
                    VectorCore->>HNSW: set_neighbours(vector, neighbors)
                    loop Update each neighbor
                        VectorCore->>HNSW: get_neighbors(neighbor)
                        VectorCore->>HNSW: select_neighbors(neighbor)
                        VectorCore->>HNSW: set_neighbours(neighbor)
                    end
                end
            end
        end
        Handler->>DB: commit()
        Handler->>Handler: Drop arena (free memory)
    end
    Handler-->>Client: {status: "success", vectors_rebuilt: N, batch_size: 10}
```