Description
lucene_sanitize() in graphiti_core/helpers.py escapes individual uppercase letters O, R, N, T, A, D — intending to escape Lucene boolean operators (AND, OR, NOT). However, str.maketrans maps every occurrence of these characters, not just the keywords.
This destroys BM25 fulltext search for any query containing these common uppercase letters.
Examples
| Query | After lucene_sanitize() | Impact |
|---|---|---|
| "EBITDA forecast" | "EBI\T\D\A forecast" | No BM25 match |
| "UIR 2 and UIR 3" | "UI\R 2 and UI\R 3" | No match on "UIR" |
| "KKR is competing" | "KK\R is competing" | No match on "KKR" |
| "NORD stream" | "\N\O\R\D stream" | Completely mangled |
| "deal team members" | "deal team members" | Works (no uppercase) |
Root cause
PR #233 (Dec 9, 2024) added:
    escape_map = str.maketrans({
        ...
        'O': r'\O',
        'R': r'\R',
        'N': r'\N',
        'T': r'\T',
        'A': r'\A',
        'D': r'\D',
    })
str.maketrans operates per-character, so every T, D, A, etc. in the query gets escaped — not just when they form AND/OR/NOT.
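The per-character behavior can be reproduced in isolation. The sketch below includes only the six letter entries quoted above (the rest of the real map is elided in this issue and is not reconstructed here), and `escape_letters` is a hypothetical helper name, not a function in graphiti_core:

```python
# Only the six letter entries from PR #233 are reproduced here; the real
# escape_map contains additional entries that are elided in this issue.
letter_map = str.maketrans({
    'O': r'\O', 'R': r'\R', 'N': r'\N',
    'T': r'\T', 'A': r'\A', 'D': r'\D',
})


def escape_letters(query: str) -> str:
    """Hypothetical helper: apply just the letter entries of the map."""
    # translate() rewrites each character independently; it has no notion
    # of whole words, so AND/OR/NOT are never matched as units.
    return query.translate(letter_map)


print(escape_letters("EBITDA forecast"))    # EBI\T\D\A forecast
print(escape_letters("NORD stream"))        # \N\O\R\D stream
print(escape_letters("deal team members"))  # deal team members
```

The outputs match the rows in the Examples table: any uppercase O, R, N, T, A, or D is escaped regardless of the word it appears in.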
The test in the same PR reveals awareness of this: the test input was changed from 'This has no escape characters' to lowercase 'this has no escape characters' to avoid the new escaping.
Impact
The graph_search hybrid retrieval uses BM25 + cosine_similarity with RRF reranking. The BM25 component is destroyed for most real-world queries, leaving only cosine similarity on single-sentence edge facts — effectively halving the hybrid search quality.
In our benchmark (22 queries over PE investment documents), graph_search scored 0.130 average recall. After lowercasing the query to bypass this bug, hybrid recall improved from 0.613 to 0.668 (about +9% relative). A direct Cypher fulltext query (bypassing lucene_sanitize entirely) scored 0.257 standalone.
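To illustrate why an empty BM25 result list degrades the fused ranking, here is a minimal sketch of reciprocal rank fusion. It is not graphiti's implementation: the function name `rrf`, the constant `k = 60` (a commonly used default), and the edge ids are all assumptions made for the example.

```python
from collections import defaultdict


def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked id lists with reciprocal rank fusion (illustrative only)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first; ties broken by id for determinism.
    return sorted(scores, key=lambda d: (-scores[d], d))


# Made-up ids: BM25 ranks the exact-term match first, cosine ranks it last.
bm25 = ["kkr_edge", "other"]
cosine = ["other", "unrelated", "kkr_edge"]

print(rrf([bm25, cosine]))  # ['other', 'kkr_edge', 'unrelated']
print(rrf([[], cosine]))    # ['other', 'unrelated', 'kkr_edge']
```

When the BM25 list comes back empty because the query was mangled, the fused order collapses to the cosine order, and the exact-term match loses the boost it would otherwise get.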
Suggested fix
Replace per-character escaping with whole-word regex:
    import re

    def lucene_sanitize(query: str) -> str:
        # Escape Lucene special characters
        escape_map = str.maketrans({
            '+': r'\+', '-': r'\-', '&': r'\&', '|': r'\|',
            '!': r'\!', '(': r'\(', ')': r'\)', '{': r'\{',
            '}': r'\}', '[': r'\[', ']': r'\]', '^': r'\^',
            '"': r'\"', '~': r'\~', '*': r'\*', '?': r'\?',
            ':': r'\:', '\\': r'\\', '/': r'\/',
        })
        sanitized = query.translate(escape_map)

        # Escape Lucene boolean operators as whole words only
        sanitized = re.sub(r'\bAND\b', r'\\AND', sanitized)
        sanitized = re.sub(r'\bOR\b', r'\\OR', sanitized)
        sanitized = re.sub(r'\bNOT\b', r'\\NOT', sanitized)

        return sanitized
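A regression check for the proposed behavior could look like the sketch below. It is a compact variant of the fix above, using one combined pattern instead of three `re.sub` calls. The name `lucene_sanitize_sketch` is hypothetical, and the sketch assumes that a leading backslash (`\AND`) is the right way to neutralize an operator, as this issue proposes.

```python
import re

# The same 19 special characters as in the suggested fix, each mapped to
# a backslash-escaped version of itself.
_SPECIAL = str.maketrans({ch: '\\' + ch for ch in '+-&|!(){}[]^"~*?:\\/'})

# Case-sensitive, whole-word match for the three boolean operators.
_OPERATORS = re.compile(r'\b(AND|OR|NOT)\b')


def lucene_sanitize_sketch(query: str) -> str:
    """Hypothetical variant of the proposed fix, for illustration only."""
    sanitized = query.translate(_SPECIAL)
    # Prefix a matched operator with one backslash, e.g. AND -> \AND.
    return _OPERATORS.sub(r'\\\1', sanitized)


# Uppercase letters inside ordinary words are left alone.
assert lucene_sanitize_sketch("EBITDA forecast") == "EBITDA forecast"
assert lucene_sanitize_sketch("NORD stream") == "NORD stream"

# Standalone operators and special characters are still escaped.
assert lucene_sanitize_sketch("cats AND dogs") == r"cats \AND dogs"
assert lucene_sanitize_sketch("KKR (PE)") == r"KKR \(PE\)"
```

Restoring a mixed-case input in the existing test (the one lowercased in PR #233) would guard against this regression.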
Environment
- graphiti-core 0.28.1 (latest)
- Neo4j 5.x
- Python 3.12