
feat: Add RebuildHNSWIndex command for index maintenance #833

Open
himanalot wants to merge 5 commits into HelixDB:main from himanalot:feature/rebuild-hnsw-index

Conversation

@himanalot himanalot commented Jan 27, 2026

Summary

Adds a new built-in command RebuildHNSWIndex that rebuilds HNSW graph edges, fixing fragmentation caused by vector deletions and re-insertions.

Problem

When vectors are deleted and re-added to an HNSW index, the graph becomes fragmented. The entry point only connects to vectors inserted after it, leaving older vectors unreachable via search. This causes search quality degradation over time.

Solution

The RebuildHNSWIndex command:

  1. Reads all existing vector IDs efficiently (without loading full vector data)
  2. Clears all HNSW edges and entry point
  3. Reconnects all vectors in configurable batches to rebuild a fully connected graph
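The batching in step 3 can be sketched in plain Rust. This is an illustrative stand-in, not the actual HelixDB code: `rebuild_in_batches` and the commented-out transaction/arena steps are hypothetical names standing in for the real handler's logic.

```rust
// Sketch of the batch-rebuild loop described above. All names here are
// illustrative stand-ins, not the actual HelixDB API.

/// Walk IDs in fixed-size batches; in the real handler each batch maps to
/// one write transaction plus a fresh arena, bounding memory per batch.
fn rebuild_in_batches(vector_ids: &[u128], batch_size: usize) -> usize {
    let mut rebuilt = 0;
    for batch in vector_ids.chunks(batch_size.max(1)) {
        // Real handler: open a write txn and a fresh bumpalo arena here.
        for &_id in batch {
            // Real handler: reconnect_vector(&mut txn, id)? runs here.
            rebuilt += 1;
        }
        // Real handler: commit the txn and drop the arena.
    }
    rebuilt
}

fn main() {
    let ids: Vec<u128> = (0..12).collect();
    println!("vectors_rebuilt = {}", rebuild_in_batches(&ids, 5));
}
```

With 12 IDs and `batch_size` 5 this runs three batches (5 + 5 + 2), which is why a larger batch size trades memory for fewer transactions.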

Changes

File | Change
helix-db/src/helix_engine/vector_core/hnsw.rs | Added reconnect_vector method to HNSW trait
helix-db/src/helix_engine/vector_core/vector_core.rs | Implemented reconnect_vector + added get_all_vector_ids for memory-efficient ID iteration
helix-db/src/helix_gateway/builtin/rebuild_hnsw_index.rs | NEW - Handler for RebuildHNSWIndex endpoint
helix-db/src/helix_gateway/builtin/mod.rs | Added module export

Usage

# Default batch size (5)
curl -X POST http://localhost:6969/RebuildHNSWIndex \
  -H "Content-Type: application/json" -d '{}'

# Custom batch size for performance tuning
curl -X POST http://localhost:6969/RebuildHNSWIndex \
  -H "Content-Type: application/json" -d '{"batch_size": 10}'

Returns:

{"status": "success", "vectors_rebuilt": 97862, "batch_size": 5}

Performance Notes

  • batch_size parameter: Controls vectors per transaction. Higher values reduce transaction overhead but increase memory usage per batch. Default of 5 balances speed and memory.
  • Memory efficient: Uses get_all_vector_ids() to iterate over keys without loading full vector data
  • Progress logged every 500 vectors
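The key-only iteration idea behind `get_all_vector_ids()` can be demonstrated with a self-contained sketch. The `BTreeMap` here is a stand-in for the LMDB-style key-value store, and `all_vector_ids` plus the `b"v:"` prefix are hypothetical; the point is that only keys are touched, never the (large) vector values.

```rust
use std::collections::BTreeMap;

/// Illustrative stand-in for get_all_vector_ids: scan keys under a prefix
/// and decode the trailing 16-byte ID, without ever reading the values.
fn all_vector_ids(db: &BTreeMap<Vec<u8>, Vec<u8>>, prefix: &[u8]) -> Vec<u128> {
    db.range(prefix.to_vec()..)
        .take_while(|(k, _)| k.starts_with(prefix))
        .filter_map(|(k, _)| {
            let id_bytes: [u8; 16] = k[prefix.len()..].try_into().ok()?;
            Some(u128::from_be_bytes(id_bytes))
        })
        .collect()
}

fn main() {
    let mut db = BTreeMap::new();
    let prefix: &[u8] = b"v:";
    for id in [7u128, 42] {
        let mut key = prefix.to_vec();
        key.extend_from_slice(&id.to_be_bytes());
        db.insert(key, vec![0u8; 1024]); // value = "vector data", never read
    }
    println!("{:?}", all_vector_ids(&db, prefix)); // IDs only, values untouched
}
```

A real LMDB prefix iterator gives the same property: the cursor walks keys in order, so only the 16-byte IDs are materialized, not the embeddings.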

What It Does NOT Do

  • Does NOT re-generate embeddings (preserves existing vectors)
  • Does NOT change vector IDs
  • Only rebuilds the HNSW graph edges (navigation structure)

🤖 Generated with Claude Code

Greptile Overview

Greptile Summary

Adds RebuildHNSWIndex command to fix graph fragmentation caused by vector deletions and re-insertions. The implementation properly reconnects all vectors in memory-efficient batches.

Key changes:

  • Added reconnect_vector() trait method and implementation to rebuild HNSW edges while preserving vector IDs and levels
  • Added get_all_vector_ids() for memory-efficient ID iteration without loading full vector data
  • Created /RebuildHNSWIndex HTTP endpoint with configurable batch_size parameter
  • Uses fresh arena per batch to prevent memory growth during rebuild operations
  • Fixed error handling in reconnection loops to match insert method behavior (returns errors instead of silently continuing)

Quality:

  • All previously identified issues have been addressed in the latest commits
  • Proper error handling throughout the reconnection process
  • Memory-efficient batch processing with progress logging
  • Clear API documentation and usage examples

Important Files Changed

Filename | Overview
helix-db/src/helix_engine/vector_core/hnsw.rs | Added reconnect_vector trait method for index rebuilding - well-documented and properly integrated
helix-db/src/helix_engine/vector_core/vector_core.rs | Implemented reconnect_vector with proper error handling and added memory-efficient get_all_vector_ids method
helix-db/src/helix_gateway/builtin/rebuild_hnsw_index.rs | New handler for rebuilding HNSW index with proper batch processing and memory management

Sequence Diagram

sequenceDiagram
    participant Client
    participant Handler as RebuildHNSWIndex Handler
    participant DB as Database
    participant VectorCore
    participant HNSW as HNSW Graph

    Client->>Handler: POST /RebuildHNSWIndex {batch_size: 10}
    Handler->>Handler: Parse batch_size (default: 5)
    
    Note over Handler,DB: Step 1: Collect Vector IDs
    Handler->>DB: read_txn()
    Handler->>VectorCore: get_all_vector_ids()
    VectorCore->>DB: prefix_iter(VECTOR_PREFIX)
    DB-->>VectorCore: Keys only (no vector data)
    VectorCore-->>Handler: Vec<u128> IDs
    
    Note over Handler,HNSW: Step 2: Clear Existing Graph
    Handler->>DB: write_txn()
    Handler->>HNSW: edges_db.clear()
    Handler->>HNSW: Delete entry_point
    Handler->>DB: commit()
    
    Note over Handler,HNSW: Step 3: Reconnect in Batches
    loop For each batch of vectors
        Handler->>Handler: Create fresh arena
        Handler->>DB: write_txn()
        
        loop For each vector_id in batch
            Handler->>VectorCore: get_full_vector(id)
            VectorCore-->>Handler: HVector with existing level
            Handler->>VectorCore: reconnect_vector(vector)
            
            alt No entry point yet
                VectorCore->>HNSW: set_entry_point(vector)
            else Entry point exists
                VectorCore->>HNSW: get_entry_point()
                
                loop Navigate to insertion level
                    VectorCore->>HNSW: search_level(vector, level)
                    HNSW-->>VectorCore: nearest neighbors
                end
                
                loop Connect at each level
                    VectorCore->>HNSW: search_level(vector, level)
                    VectorCore->>HNSW: select_neighbors()
                    VectorCore->>HNSW: set_neighbours(vector, neighbors)
                    
                    loop Update each neighbor
                        VectorCore->>HNSW: get_neighbors(neighbor)
                        VectorCore->>HNSW: select_neighbors(neighbor)
                        VectorCore->>HNSW: set_neighbours(neighbor)
                    end
                end
            end
        end
        
        Handler->>DB: commit()
        Handler->>Handler: Drop arena (free memory)
    end
    
    Handler-->>Client: {status: "success", vectors_rebuilt: N, batch_size: 10}

Adds a new built-in command to rebuild HNSW graph edges, fixing
fragmentation caused by vector deletions and re-insertions.

Changes:
- Add reconnect_vector method to HNSW trait
- Implement reconnect_vector in VectorCore
- Add RebuildHNSWIndex handler as built-in endpoint

The command reads all existing vectors, clears HNSW edges, then
reconnects all vectors to rebuild a fully connected graph. Processes
in configurable batches to control memory usage.

Usage:
  POST /RebuildHNSWIndex
  Returns: {"status": "success", "vectors_rebuilt": N}

@greptile-apps greptile-apps Bot left a comment


3 files reviewed, 1 comment


Comment on lines +23 to +31
let vector_ids: Vec<u128> = {
    let txn = db.graph_env.read_txn().map_err(GraphError::from)?;
    let arena = bumpalo::Bump::new();
    let vectors = db
        .vectors
        .get_all_vectors(&txn, None, &arena)
        .map_err(|e| GraphError::New(format!("Failed to get vectors: {}", e)))?;
    vectors.iter().map(|v| v.id).collect()
};

Loading all vectors into memory defeats the purpose of the batch processing mentioned in the PR description. get_all_vectors loads full vector data for all vectors, which could cause OOM on large datasets.

Consider iterating over vector IDs directly from the database using a prefix iterator (similar to the existing implementation in get_all_vectors at vector_core.rs:446-489) to truly avoid loading all data at once.

…ectors

Addresses review feedback - adds get_all_vector_ids method that iterates
over database keys without loading full vector data into memory.
Users can now pass batch_size in the request body to tune performance:
  {"batch_size": 10}

Default remains 5 vectors per transaction. Response now includes
the batch_size that was used.
@himanalot

@greptile


@greptile-apps greptile-apps Bot left a comment


1 file reviewed, 1 comment


Comment thread helix-db/src/helix_engine/vector_core/vector_core.rs Outdated
Changed reconnect_vector to return an error when search_level returns
empty results instead of silently continuing. This makes debugging easier
and ensures graph integrity.
@himanalot himanalot force-pushed the feature/rebuild-hnsw-index branch from 3081264 to 8e4aa54 Compare January 27, 2026 19:53
@himanalot

that last change was bc greptile messed up the diff in its suggested commit

@himanalot

@greptile


@greptile-apps greptile-apps Bot left a comment


1 file reviewed, 1 comment


Comment on lines +726 to +728
if let Some(closest) = nearest.peek() {
    curr_ep = *closest;
}

Silent failure allows continuing with stale curr_ep, diverging from insert's error handling at vector_core.rs:648-650. This could produce incorrect graph connections.

Suggested change
if let Some(closest) = nearest.peek() {
    curr_ep = *closest;
}
curr_ep = *nearest.peek().ok_or(VectorError::VectorCoreError(
    "empty search result".to_string(),
))?;

Apply same error handling to the connection level loop - return error
instead of silently continuing with stale curr_ep.
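The pattern the suggestion asks for can be shown in isolation. This is a minimal sketch, not the HelixDB source: `next_entry_point` and the single-variant `VectorError` enum are hypothetical, standing in for the real `search_level` result handling.

```rust
use std::collections::BinaryHeap;

#[derive(Debug)]
enum VectorError {
    VectorCoreError(String),
}

/// Illustrative version of the fix: fail loudly on an empty search result
/// instead of silently reusing the stale entry point.
fn next_entry_point(nearest: &BinaryHeap<u64>) -> Result<u64, VectorError> {
    nearest
        .peek()
        .copied()
        .ok_or(VectorError::VectorCoreError("empty search result".to_string()))
}

fn main() {
    let nearest: BinaryHeap<u64> = [3, 9, 1].into_iter().collect();
    assert_eq!(next_entry_point(&nearest).unwrap(), 9); // heap peek = max
    assert!(next_entry_point(&BinaryHeap::new()).is_err());
}
```

The point of the change is that an empty `search_level` result now surfaces as an error at the call site, matching `insert`'s behavior, rather than leaving `curr_ep` pointing at a node from the wrong level.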
@himanalot

@greptile

