[BUG] lucene_sanitize() escapes individual uppercase letters, breaking BM25 for most queries #1302

@iamalvisng

Description

lucene_sanitize() in graphiti_core/helpers.py escapes the individual uppercase letters O, R, N, T, A, and D. The intent is to neutralize the Lucene boolean operators AND, OR, and NOT, but str.maketrans builds a per-character translation table, so every occurrence of these letters is escaped, not only those inside the operator keywords.

This destroys BM25 fulltext search for any query containing these common uppercase letters.

Examples

| Query | After lucene_sanitize() | Impact |
| --- | --- | --- |
| "EBITDA forecast" | "EBI\T\D\A forecast" | No BM25 match |
| "UIR 2 and UIR 3" | "UI\R 2 and UI\R 3" | No match on "UIR" |
| "KKR is competing" | "KK\R is competing" | No match on "KKR" |
| "NORD stream" | "\N\O\R\D stream" | Completely mangled |
| "deal team members" | "deal team members" | Works (no uppercase) |

Root cause

PR #233 (Dec 9, 2024) added:

escape_map = str.maketrans({
    ...
    'O': r'\O',
    'R': r'\R',
    'N': r'\N',
    'T': r'\T',
    'A': r'\A',
    'D': r'\D',
})

str.maketrans operates per-character, so every T, D, A, etc. in the query gets escaped — not just when they form AND/OR/NOT.
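
The per-character behaviour can be reproduced in isolation. The following is a minimal sketch that uses only the six letter entries quoted above. The helper name escape_letters is hypothetical, and the punctuation entries of the real map are left out for brevity.

```python
# Reduced reproduction: only the six letter entries from the escape map.
# The real map also escapes Lucene punctuation, which is omitted here.
letter_map = str.maketrans({
    'O': r'\O',
    'R': r'\R',
    'N': r'\N',
    'T': r'\T',
    'A': r'\A',
    'D': r'\D',
})


def escape_letters(query: str) -> str:
    """Hypothetical helper: apply only the single-letter escapes."""
    return query.translate(letter_map)


# None of these queries contains AND, OR, or NOT as a word,
# yet every uppercase O/R/N/T/A/D is still escaped.
for q in ["EBITDA forecast", "KKR is competing", "NORD stream", "deal team members"]:
    print(q, "->", escape_letters(q))
```

Only the all-lowercase query comes back unchanged, which matches the table in the Examples section.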

The test in the same PR reveals awareness of this: the test input was changed from 'This has no escape characters' to lowercase 'this has no escape characters' to avoid the new escaping.

Impact

The graph_search hybrid retrieval combines BM25 and cosine_similarity with RRF reranking. With the BM25 component returning no matches for most real-world queries, ranking falls back to cosine similarity over single-sentence edge facts alone, so the hybrid search effectively loses one of its two retrieval signals.
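
To illustrate why an empty BM25 result list matters, here is a rough sketch of reciprocal rank fusion. It is not graphiti's actual implementation: the function name rrf_fuse, the constant k = 60 (a commonly used default), and the edge ids are all assumptions made for illustration.

```python
from collections import defaultdict


def rrf_fuse(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Reciprocal rank fusion: score(d) = sum over lists of 1 / (k + rank)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism
    return sorted(scores, key=lambda d: (-scores[d], d))


cosine = ["e3", "e1", "e2"]  # hypothetical edge ids ranked by cosine similarity
bm25_ok = ["e1", "e4"]       # hypothetical BM25 hits when the query is intact
bm25_broken: list[str] = []  # BM25 finds nothing for the mangled query

print(rrf_fuse([bm25_ok, cosine]))      # BM25 promotes e1 and contributes e4
print(rrf_fuse([bm25_broken, cosine]))  # identical to the cosine-only order
```

When the BM25 list is empty, the fused ranking is simply the cosine ranking, so keyword-only matches such as e4 never surface.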

In our benchmark (22 queries over PE investment documents), graph_search scored 0.130 average recall. After lowercasing the query to bypass this bug, hybrid recall improved from 0.613 to 0.668 (about +9% relative). A direct Cypher fulltext query (bypassing lucene_sanitize entirely) scored 0.257 standalone.

Suggested fix

Replace per-character escaping with whole-word regex:

import re

def lucene_sanitize(query: str) -> str:
    # Escape Lucene special characters
    escape_map = str.maketrans({
        '+': r'\+', '-': r'\-', '&': r'\&', '|': r'\|',
        '!': r'\!', '(': r'\(', ')': r'\)', '{': r'\{',
        '}': r'\}', '[': r'\[', ']': r'\]', '^': r'\^',
        '"': r'\"', '~': r'\~', '*': r'\*', '?': r'\?',
        ':': r'\:', '\\': r'\\', '/': r'\/',
    })
    sanitized = query.translate(escape_map)
    # Escape Lucene boolean operators as whole words only
    sanitized = re.sub(r'\bAND\b', r'\\AND', sanitized)
    sanitized = re.sub(r'\bOR\b', r'\\OR', sanitized)
    sanitized = re.sub(r'\bNOT\b', r'\\NOT', sanitized)
    return sanitized
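
A possible single-pass variant of the same idea is sketched below, together with a few checks that could serve as a mixed-case regression test. The name lucene_sanitize_sketch and the module-level constants are hypothetical and not part of graphiti-core. The set of escaped punctuation is the same as in the suggested fix above.

```python
import re

# Lucene special characters, each mapped to itself with a leading backslash
_SPECIAL = str.maketrans({c: '\\' + c for c in '+-&|!(){}[]^"~*?:\\/'})
# Boolean operators are matched as whole uppercase words only
_OPERATORS = re.compile(r'\b(AND|OR|NOT)\b')


def lucene_sanitize_sketch(query: str) -> str:
    """Escape Lucene punctuation, then whole-word AND/OR/NOT."""
    escaped = query.translate(_SPECIAL)
    # A callable replacement sidesteps backslash handling in template strings
    return _OPERATORS.sub(lambda m: '\\' + m.group(1), escaped)


# Uppercase letters inside ordinary words are left alone
assert lucene_sanitize_sketch("EBITDA forecast") == "EBITDA forecast"
assert lucene_sanitize_sketch("NORD stream") == "NORD stream"
# Whole-word operators and punctuation are still escaped
assert lucene_sanitize_sketch("cats AND dogs") == r"cats \AND dogs"
assert lucene_sanitize_sketch("NOT (a OR b)") == r"\NOT \(a \OR b\)"
```

Lowercase "and", "or", and "not" are not treated as operators by the classic Lucene query parser, so the sketch leaves them untouched.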

Environment

  • graphiti-core 0.28.1 (latest)
  • Neo4j 5.x
  • Python 3.12
