
feat: Hybrid Search with RRF (SearchBM25 + SearchV) #837

Open
AmaanBilwar wants to merge 17 commits into HelixDB:main from AmaanBilwar:feat/new-hybrid-search


@AmaanBilwar AmaanBilwar commented Jan 28, 2026

Description

Implements a new SearchHybrid operator that runs both a vector (HNSW) search and a BM25 keyword search, returning combined results that can be fused using the existing RerankRRF or RerankMMR steps.

This feature lets users combine vector search and keyword-based search in a single query.

Related Issues

N/A

Closes #828

Checklist when merging to main

  • No compiler warnings (if applicable)
  • Code is formatted with rustfmt
  • No useless or dead code (if applicable)
  • Code is easy to understand
  • Doc comments are used for all functions, enums, structs, and fields (where appropriate)
  • All tests pass
  • Performance has not regressed (assuming change was not to fix a bug)
  • Version number has been updated in helix-cli/Cargo.toml and helixdb/Cargo.toml

Additional Notes

Greptile Overview

Greptile Summary

This PR adds a new SearchHybrid operator that combines vector similarity search (HNSW) with BM25 keyword search, allowing users to leverage both semantic and keyword-based retrieval in a single query. Results are returned as a combined iterator that can be fused using existing RerankRRF or RerankMMR operators.

Key Changes:

  • Implemented SearchHybridAdapter trait in runtime that executes both vector and BM25 searches, returning combined results
  • Extended grammar, parser, analyzer, and code generator to support SearchHybrid<Label>(vector_data, query_text, k) syntax
  • Added comprehensive test coverage with 9 test queries covering various usage patterns
  • Fixed string literal formatting for BM25 queries (wrapping strings in quotes)
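
To illustrate the string-literal fix in the last bullet, here is a minimal sketch of how a code generator might quote a BM25 query literal while leaving identifiers untouched. The `QueryArg` enum and both function names are hypothetical, not taken from the PR, and passing the label as a quoted string is also an assumption. Only the output shape `search_hybrid(label, vec, query, k)?` comes from the sequence diagram.

```rust
/// Hypothetical stand-in for the analyzed query argument: either a string
/// literal written in the query source or the name of a variable.
enum QueryArg {
    Literal(String),
    Identifier(String),
}

/// Renders the argument as Rust source text. Literals go through `{:?}`,
/// which wraps them in double quotes and escapes special characters;
/// identifiers are emitted verbatim.
fn render_query_arg(arg: &QueryArg) -> String {
    match arg {
        QueryArg::Literal(text) => format!("{:?}", text),
        QueryArg::Identifier(name) => name.clone(),
    }
}

/// Builds the call text in the shape shown in the sequence diagram.
/// Emitting the label as a quoted string is an assumption of this sketch.
fn render_search_hybrid(label: &str, vec_expr: &str, query: &QueryArg, k: usize) -> String {
    format!(
        "search_hybrid({:?}, {}, {}, {})?",
        label,
        vec_expr,
        render_query_arg(query),
        k
    )
}
```

Without the quoting step, a literal such as `machine learning` would land in the generated code as two bare tokens and fail to compile, which is presumably the class of bug the fix addresses.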

Architecture:
The implementation follows the existing pattern of SearchVector and SearchBM25, integrating cleanly into the compiler pipeline from parsing through code generation. Vector results are returned first, followed by BM25 results, allowing downstream rerankers to properly fuse them based on position.
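
For readers unfamiliar with position-based fusion, the following is a sketch of textbook Reciprocal Rank Fusion, where each list contributes `1 / (k + rank)` to a document's score. It is not HelixDB's RerankRRF implementation: the function name `rrf_fuse`, the use of plain `u64` ids, and the assumption that the two ranked lists are available separately are all illustrative choices.

```rust
use std::collections::HashMap;

/// Fuses ranked lists of document ids with Reciprocal Rank Fusion.
/// Each list adds 1 / (k + rank) to a document's score, with rank
/// starting at 1 for the first entry of that list.
fn rrf_fuse(lists: &[Vec<u64>], k: f64) -> Vec<(u64, f64)> {
    let mut scores: HashMap<u64, f64> = HashMap::new();
    for list in lists {
        for (index, id) in list.iter().enumerate() {
            let rank = (index + 1) as f64;
            *scores.entry(*id).or_insert(0.0) += 1.0 / (k + rank);
        }
    }
    let mut fused: Vec<(u64, f64)> = scores.into_iter().collect();
    // Highest score first; ties are broken by id so the order is deterministic.
    fused.sort_by(|a, b| b.1.total_cmp(&a.1).then(a.0.cmp(&b.0)));
    fused
}
```

With a vector ranking of `[1, 2, 3]`, a BM25 ranking of `[3, 1, 4]`, and the commonly used `k = 60.0`, documents 1 and 3 appear in both lists and therefore rank ahead of documents 2 and 4, which each appear in only one.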

Testing:
Tests cover basic usage, Embed() for vector generation, variable parameters, and chaining with RerankRRF and RerankMMR operators.

Checklist Status:
Two checklist items remain unchecked: doc comments and version number updates. Consider adding doc comments to key public functions and updating version numbers before merging.

Important Files Changed

| Filename | Overview |
| --- | --- |
| helix-db/src/helix_engine/traversal_core/ops/hybrid/search_hybrid.rs | New hybrid search implementation combining vector and BM25 search. Returns combined results as chained iterators. Implementation looks solid with proper error handling. |
| helix-db/src/grammar.pest | Grammar updated to add search_hybrid rule with proper syntax for vector_data, query text, and k parameter. Follows existing patterns for SearchV and SearchBM25. |
| helix-db/src/helixc/analyzer/methods/infer_expr_type.rs | Added type inference for SearchHybrid expression. Validates vector data, query string, and k parameter. Includes string formatting fix for BM25 literals. |
| helix-db/src/helixc/analyzer/methods/traversal_validation.rs | Added traversal validation for SearchHybrid. Validates that the label exists in the vector set and processes the vector_data, query, and k parameters, similar to SearchVector and SearchBM25. |
| helix-db/src/helixc/parser/expression_parse_methods.rs | Added parse_search_hybrid method to parse SearchHybrid syntax including label, vector_data, query, and k. Properly handles identifiers, string literals, and Embed expressions. |

Sequence Diagram

sequenceDiagram
    participant User
    participant Parser
    participant Analyzer
    participant Generator
    participant Runtime
    participant VectorSearch
    participant BM25Search

    User->>Parser: SearchHybrid<Document>(vec, "query", 10)
    Parser->>Parser: parse_search_hybrid()
    Parser->>Parser: Extract label, vector_data, query, k
    Parser-->>Analyzer: SearchHybrid AST

    Analyzer->>Analyzer: infer_expr_type()
    Analyzer->>Analyzer: Validate label exists in vector_set
    Analyzer->>Analyzer: Process vector_data (literal/identifier/Embed)
    Analyzer->>Analyzer: Process query (string/identifier)
    Analyzer->>Analyzer: Process k parameter
    Analyzer-->>Generator: GeneratedSearchHybrid

    Generator->>Generator: Generate Rust code
    Generator->>Generator: Format: search_hybrid(label, vec, query, k)?
    Generator-->>Runtime: Compiled trait method call

    Runtime->>Runtime: search_hybrid() execution
    Runtime->>VectorSearch: vectors.search(query_vec, k, label)
    VectorSearch-->>Runtime: Vec<HVector> with scores
    
    Runtime->>BM25Search: bm25.search(query_text, k)
    BM25Search-->>Runtime: Vec<(doc_id, score)>
    
    Runtime->>Runtime: Convert vectors to TraversalValue::Vector
    Runtime->>Runtime: Lookup BM25 doc_ids in nodes_db
    Runtime->>Runtime: Filter by label
    Runtime->>Runtime: Convert to TraversalValue::NodeWithScore
    Runtime->>Runtime: Chain vector_results + bm25_results
    Runtime-->>User: RoTraversalIterator with combined results
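
The runtime half of the diagram (convert vector hits, look up BM25 doc ids, filter by label, chain the two streams) can be sketched as below. This is a simplified model under stated assumptions, not the PR's code: `Hit` stands in for `TraversalValue`, a `HashMap` from id to label stands in for `nodes_db`, and `combine_results` is a hypothetical name.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the two result variants named in the diagram.
#[derive(Debug, Clone, PartialEq)]
enum Hit {
    Vector { id: u64, score: f64 },
    NodeWithScore { id: u64, score: f64 },
}

/// Chains vector hits with BM25 hits, vector results first.
/// `nodes` maps a node id to its label and stands in for the node store.
fn combine_results(
    vector_hits: Vec<(u64, f64)>,
    bm25_hits: Vec<(u64, f64)>,
    nodes: &HashMap<u64, &str>,
    label: &str,
) -> Vec<Hit> {
    // Every vector hit is kept and wrapped in the vector variant.
    let vector_results = vector_hits
        .into_iter()
        .map(|(id, score)| Hit::Vector { id, score });

    // BM25 hits survive only if the node exists and carries the requested label.
    let bm25_results = bm25_hits
        .into_iter()
        .filter_map(|(id, score)| match nodes.get(&id) {
            Some(node_label) if *node_label == label => Some(Hit::NodeWithScore { id, score }),
            _ => None,
        });

    // Order matters: downstream rerankers see vector results before BM25 results.
    vector_results.chain(bm25_results).collect()
}
```

Because the label filter runs after the BM25 search in this sketch, the BM25 side can return fewer than `k` results even when more matching nodes exist; whether the PR behaves the same way would need to be confirmed against the actual implementation.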

@AmaanBilwar
Author

@xav-db whenever u have the time 🫡

@AmaanBilwar AmaanBilwar marked this pull request as ready for review February 6, 2026 18:49
@AmaanBilwar
Author

not sure what other ways there are to test it out icl
