
feat: Add RebuildHNSWIndex command for index maintenance #833

Open
himanalot wants to merge 5 commits into HelixDB:main from himanalot:feature/rebuild-hnsw-index

Conversation

@himanalot himanalot commented Jan 27, 2026

Summary

Adds a new built-in command RebuildHNSWIndex that rebuilds HNSW graph edges, fixing fragmentation caused by vector deletions and re-insertions.

Problem

When vectors are deleted and re-added to an HNSW index, the graph becomes fragmented. The entry point only connects to vectors inserted after it, leaving older vectors unreachable via search. This causes search quality degradation over time.

Solution

The RebuildHNSWIndex command:

  1. Reads all existing vector IDs efficiently (without loading full vector data)
  2. Clears all HNSW edges and entry point
  3. Reconnects all vectors in configurable batches to rebuild a fully connected graph
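The batching in step 3 can be sketched in plain Rust. This is an illustrative stand-in, not the actual HelixDB code: `rebuild_in_batches` and the commented-out transaction/arena steps are hypothetical names standing in for the real handler's logic.

```rust
// Sketch of the batch-rebuild loop described above. All names here are
// illustrative stand-ins, not the actual HelixDB API.

/// Walk IDs in fixed-size batches; in the real handler each batch maps to
/// one write transaction plus a fresh arena, bounding memory per batch.
fn rebuild_in_batches(vector_ids: &[u128], batch_size: usize) -> usize {
    let mut rebuilt = 0;
    for batch in vector_ids.chunks(batch_size.max(1)) {
        // Real handler: open a write txn and a fresh bumpalo arena here.
        for &_id in batch {
            // Real handler: reconnect_vector(&mut txn, id)? runs here.
            rebuilt += 1;
        }
        // Real handler: commit the txn and drop the arena.
    }
    rebuilt
}

fn main() {
    let ids: Vec<u128> = (0..12).collect();
    println!("vectors_rebuilt = {}", rebuild_in_batches(&ids, 5));
}
```

With 12 IDs and `batch_size` 5 this runs three batches (5 + 5 + 2), which is why a larger batch size trades memory for fewer transactions.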

Changes

File | Change
helix-db/src/helix_engine/vector_core/hnsw.rs | Added reconnect_vector method to HNSW trait
helix-db/src/helix_engine/vector_core/vector_core.rs | Implemented reconnect_vector + added get_all_vector_ids for memory-efficient ID iteration
helix-db/src/helix_gateway/builtin/rebuild_hnsw_index.rs | NEW - Handler for RebuildHNSWIndex endpoint
helix-db/src/helix_gateway/builtin/mod.rs | Added module export

Usage

# Default batch size (5)
curl -X POST http://localhost:6969/RebuildHNSWIndex \
  -H "Content-Type: application/json" -d '{}'

# Custom batch size for performance tuning
curl -X POST http://localhost:6969/RebuildHNSWIndex \
  -H "Content-Type: application/json" -d '{"batch_size": 10}'

Returns:

{"status": "success", "vectors_rebuilt": 97862, "batch_size": 5}

Performance Notes

  • batch_size parameter: Controls vectors per transaction. Higher values reduce transaction overhead but increase memory usage per batch. Default of 5 balances speed and memory.
  • Memory efficient: Uses get_all_vector_ids() to iterate over keys without loading full vector data
  • Progress logged every 500 vectors
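The key-only iteration idea behind `get_all_vector_ids()` can be demonstrated with a self-contained sketch. The `BTreeMap` here is a stand-in for the LMDB-style key-value store, and `all_vector_ids` plus the `b"v:"` prefix are hypothetical; the point is that only keys are touched, never the (large) vector values.

```rust
use std::collections::BTreeMap;

/// Illustrative stand-in for get_all_vector_ids: scan keys under a prefix
/// and decode the trailing 16-byte ID, without ever reading the values.
fn all_vector_ids(db: &BTreeMap<Vec<u8>, Vec<u8>>, prefix: &[u8]) -> Vec<u128> {
    db.range(prefix.to_vec()..)
        .take_while(|(k, _)| k.starts_with(prefix))
        .filter_map(|(k, _)| {
            let id_bytes: [u8; 16] = k[prefix.len()..].try_into().ok()?;
            Some(u128::from_be_bytes(id_bytes))
        })
        .collect()
}

fn main() {
    let mut db = BTreeMap::new();
    let prefix: &[u8] = b"v:";
    for id in [7u128, 42] {
        let mut key = prefix.to_vec();
        key.extend_from_slice(&id.to_be_bytes());
        db.insert(key, vec![0u8; 1024]); // value = "vector data", never read
    }
    println!("{:?}", all_vector_ids(&db, prefix)); // IDs only, values untouched
}
```

A real LMDB prefix iterator gives the same property: the cursor walks keys in order, so only the 16-byte IDs are materialized, not the embeddings.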

What It Does NOT Do

  • Does NOT re-generate embeddings (preserves existing vectors)
  • Does NOT change vector IDs
  • Only rebuilds the HNSW graph edges (navigation structure)

🤖 Generated with Claude Code

Greptile Overview

Greptile Summary

Adds RebuildHNSWIndex command to fix graph fragmentation caused by vector deletions and re-insertions. The implementation properly reconnects all vectors in memory-efficient batches.

Key changes:

  • Added reconnect_vector() trait method and implementation to rebuild HNSW edges while preserving vector IDs and levels
  • Added get_all_vector_ids() for memory-efficient ID iteration without loading full vector data
  • Created /RebuildHNSWIndex HTTP endpoint with configurable batch_size parameter
  • Uses fresh arena per batch to prevent memory growth during rebuild operations
  • Fixed error handling in reconnection loops to match insert method behavior (returns errors instead of silently continuing)

Quality:

  • All previously identified issues have been addressed in the latest commits
  • Proper error handling throughout the reconnection process
  • Memory-efficient batch processing with progress logging
  • Clear API documentation and usage examples

Important Files Changed

Filename | Overview
helix-db/src/helix_engine/vector_core/hnsw.rs | Added reconnect_vector trait method for index rebuilding - well-documented and properly integrated
helix-db/src/helix_engine/vector_core/vector_core.rs | Implemented reconnect_vector with proper error handling and added memory-efficient get_all_vector_ids method
helix-db/src/helix_gateway/builtin/rebuild_hnsw_index.rs | New handler for rebuilding HNSW index with proper batch processing and memory management

Sequence Diagram

sequenceDiagram
    participant Client
    participant Handler as RebuildHNSWIndex Handler
    participant DB as Database
    participant VectorCore
    participant HNSW as HNSW Graph

    Client->>Handler: POST /RebuildHNSWIndex {batch_size: 10}
    Handler->>Handler: Parse batch_size (default: 5)
    
    Note over Handler,DB: Step 1: Collect Vector IDs
    Handler->>DB: read_txn()
    Handler->>VectorCore: get_all_vector_ids()
    VectorCore->>DB: prefix_iter(VECTOR_PREFIX)
    DB-->>VectorCore: Keys only (no vector data)
    VectorCore-->>Handler: Vec<u128> IDs
    
    Note over Handler,HNSW: Step 2: Clear Existing Graph
    Handler->>DB: write_txn()
    Handler->>HNSW: edges_db.clear()
    Handler->>HNSW: Delete entry_point
    Handler->>DB: commit()
    
    Note over Handler,HNSW: Step 3: Reconnect in Batches
    loop For each batch of vectors
        Handler->>Handler: Create fresh arena
        Handler->>DB: write_txn()
        
        loop For each vector_id in batch
            Handler->>VectorCore: get_full_vector(id)
            VectorCore-->>Handler: HVector with existing level
            Handler->>VectorCore: reconnect_vector(vector)
            
            alt No entry point yet
                VectorCore->>HNSW: set_entry_point(vector)
            else Entry point exists
                VectorCore->>HNSW: get_entry_point()
                
                loop Navigate to insertion level
                    VectorCore->>HNSW: search_level(vector, level)
                    HNSW-->>VectorCore: nearest neighbors
                end
                
                loop Connect at each level
                    VectorCore->>HNSW: search_level(vector, level)
                    VectorCore->>HNSW: select_neighbors()
                    VectorCore->>HNSW: set_neighbours(vector, neighbors)
                    
                    loop Update each neighbor
                        VectorCore->>HNSW: get_neighbors(neighbor)
                        VectorCore->>HNSW: select_neighbors(neighbor)
                        VectorCore->>HNSW: set_neighbours(neighbor)
                    end
                end
            end
        end
        
        Handler->>DB: commit()
        Handler->>Handler: Drop arena (free memory)
    end
    
    Handler-->>Client: {status: "success", vectors_rebuilt: N, batch_size: 10}

Adds a new built-in command to rebuild HNSW graph edges, fixing
fragmentation caused by vector deletions and re-insertions.

Changes:
- Add reconnect_vector method to HNSW trait
- Implement reconnect_vector in VectorCore
- Add RebuildHNSWIndex handler as built-in endpoint

The command reads all existing vectors, clears HNSW edges, then
reconnects all vectors to rebuild a fully connected graph. Processes
in configurable batches to control memory usage.

Usage:
  POST /RebuildHNSWIndex
  Returns: {"status": "success", "vectors_rebuilt": N}

@greptile-apps greptile-apps Bot left a comment


3 files reviewed, 1 comment


Comment on lines +23 to +31
let vector_ids: Vec<u128> = {
    let txn = db.graph_env.read_txn().map_err(GraphError::from)?;
    let arena = bumpalo::Bump::new();
    let vectors = db
        .vectors
        .get_all_vectors(&txn, None, &arena)
        .map_err(|e| GraphError::New(format!("Failed to get vectors: {}", e)))?;
    vectors.iter().map(|v| v.id).collect()
};

Loading all vectors into memory defeats the purpose of the batch processing mentioned in the PR description. get_all_vectors loads full vector data for all vectors, which could cause OOM on large datasets.

Consider iterating over vector IDs directly from the database using a prefix iterator (similar to the existing implementation in get_all_vectors at vector_core.rs:446-489) to truly avoid loading all data at once.

…ectors

Addresses review feedback - adds get_all_vector_ids method that iterates
over database keys without loading full vector data into memory.
Users can now pass batch_size in the request body to tune performance:
  {"batch_size": 10}

Default remains 5 vectors per transaction. Response now includes
the batch_size that was used.
@himanalot

@greptile


@greptile-apps greptile-apps Bot left a comment


1 file reviewed, 1 comment


Comment thread helix-db/src/helix_engine/vector_core/vector_core.rs Outdated
Changed reconnect_vector to return an error when search_level returns
empty results instead of silently continuing. This makes debugging easier
and ensures graph integrity.
@himanalot himanalot force-pushed the feature/rebuild-hnsw-index branch from 3081264 to 8e4aa54 Compare January 27, 2026 19:53
@himanalot

that last change was bc greptile messed up the diff in its suggested commit

@himanalot

@greptile


@greptile-apps greptile-apps Bot left a comment


1 file reviewed, 1 comment


Comment on lines +726 to +728
if let Some(closest) = nearest.peek() {
    curr_ep = *closest;
}

Silent failure allows continuing with stale curr_ep, diverging from insert's error handling at vector_core.rs:648-650. This could produce incorrect graph connections.

Suggested change
if let Some(closest) = nearest.peek() {
    curr_ep = *closest;
}
curr_ep = *nearest.peek().ok_or(VectorError::VectorCoreError(
    "empty search result".to_string(),
))?;

Apply same error handling to the connection level loop - return error
instead of silently continuing with stale curr_ep.
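The pattern the suggestion asks for can be shown in isolation. This is a minimal sketch, not the HelixDB source: `next_entry_point` and the single-variant `VectorError` enum are hypothetical, standing in for the real `search_level` result handling.

```rust
use std::collections::BinaryHeap;

#[derive(Debug)]
enum VectorError {
    VectorCoreError(String),
}

/// Illustrative version of the fix: fail loudly on an empty search result
/// instead of silently reusing the stale entry point.
fn next_entry_point(nearest: &BinaryHeap<u64>) -> Result<u64, VectorError> {
    nearest
        .peek()
        .copied()
        .ok_or(VectorError::VectorCoreError("empty search result".to_string()))
}

fn main() {
    let nearest: BinaryHeap<u64> = [3, 9, 1].into_iter().collect();
    assert_eq!(next_entry_point(&nearest).unwrap(), 9); // heap peek = max
    assert!(next_entry_point(&BinaryHeap::new()).is_err());
}
```

The point of the change is that an empty `search_level` result now surfaces as an error at the call site, matching `insert`'s behavior, rather than leaving `curr_ep` pointing at a node from the wrong level.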
@himanalot

@greptile

