[BUG] lucene_sanitize() escapes individual uppercase letters, breaking BM25 for most queries

## Description

`lucene_sanitize()` in `graphiti_core/helpers.py` escapes the individual uppercase letters `O`, `R`, `N`, `T`, `A`, and `D`. The intent is to escape the Lucene boolean operators (`AND`, `OR`, `NOT`), but the translation table built with `str.maketrans` maps **every occurrence** of these characters, not just those inside the keywords.

This destroys BM25 fulltext search for any query containing these common uppercase letters.

## Examples

| Query | After `lucene_sanitize()` | Impact |
|-------|--------------------------|--------|
| `"EBITDA forecast"` | `"EBI\T\D\A forecast"` | No BM25 match |
| `"UIR 2 and UIR 3"` | `"UI\R 2 and UI\R 3"` | No match on "UIR" |
| `"KKR is competing"` | `"KK\R is competing"` | No match on "KKR" |
| `"NORD stream"` | `"\N\O\R\D stream"` | Completely mangled |
| `"deal team members"` | `"deal team members"` | Works (no uppercase) |

## Root cause

[PR #233](https://github.com/getzep/graphiti/pull/233) (Dec 9, 2024) added:

```python
escape_map = str.maketrans({
    # ... (other entries elided)
    'O': r'\O',
    'R': r'\R',
    'N': r'\N',
    'T': r'\T',
    'A': r'\A',
    'D': r'\D',
})
```

A table built with `str.maketrans` is applied per character by `str.translate`, so **every** `T`, `D`, `A`, etc. in the query gets escaped, not just the ones that form `AND`/`OR`/`NOT`.
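The behavior can be reproduced in isolation. The sketch below uses only the six letter entries from the excerpt; the helper name `escape_letters` is hypothetical and stands in for the letter-escaping part of `lucene_sanitize()`, not the full function.

```python
# Only the six letter entries shown in the excerpt; the special-character
# entries are omitted because they do not affect this demonstration.
letter_escapes = str.maketrans({
    'O': r'\O',
    'R': r'\R',
    'N': r'\N',
    'T': r'\T',
    'A': r'\A',
    'D': r'\D',
})


def escape_letters(query: str) -> str:
    """Hypothetical helper: apply the per-character letter escapes."""
    return query.translate(letter_escapes)


for q in ["EBITDA forecast", "KKR is competing", "NORD stream", "deal team members"]:
    # translate() rewrites each matching character on its own; it has no
    # notion of whole words such as AND / OR / NOT.
    print(q, "->", escape_letters(q))
```

Running it prints the same mangled strings as the Examples table, and leaves the all-lowercase query untouched.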

The test in the same PR reveals awareness of this: the test input was changed from `'This has no escape characters'` to lowercase `'this has no escape characters'` to avoid the new escaping.

## Impact

The `graph_search` hybrid retrieval uses `BM25 + cosine_similarity` with RRF reranking. The BM25 component is lost for most real-world queries, so ranking falls back to cosine similarity over single-sentence edge facts and one of the two retrieval signals is effectively gone.
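To illustrate why an empty BM25 result list matters, here is a minimal sketch of reciprocal rank fusion using the commonly cited formula `score(d) = sum(1 / (k + rank))`. The function name `rrf`, the constant `k = 60`, and the result IDs are illustrative assumptions; this is not graphiti's implementation.

```python
from collections import defaultdict


def rrf(rankings: list[list[str]], k: int = 60) -> list[str]:
    """Fuse ranked ID lists with reciprocal rank fusion (illustrative sketch)."""
    scores: dict[str, float] = defaultdict(float)
    for ranking in rankings:
        for rank, doc_id in enumerate(ranking, start=1):
            scores[doc_id] += 1.0 / (k + rank)
    # Highest fused score first.
    return sorted(scores, key=scores.get, reverse=True)


# Made-up result IDs, for illustration only.
cosine_hits = ["e3", "e1", "e7"]
bm25_hits = ["e1", "e9"]

fused_with_bm25 = rrf([bm25_hits, cosine_hits])
fused_without_bm25 = rrf([[], cosine_hits])  # BM25 returned nothing

print(fused_with_bm25)
print(fused_without_bm25)
```

When the BM25 list is empty, the fused order is simply the cosine order, so results that only a keyword match would surface (such as `e9` here) never appear.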

In our benchmark (22 queries over PE investment documents), `graph_search` scored **0.130 average recall**. After lowercasing the query to bypass this bug, hybrid recall improved from 0.613 to **0.668** (+9% relative). A direct Cypher fulltext query (bypassing `lucene_sanitize` entirely) scored **0.257** standalone.

## Suggested fix

Keep per-character escaping for the Lucene special characters, and escape the boolean operators with a whole-word regex instead:

```python
import re

def lucene_sanitize(query: str) -> str:
    # Escape Lucene special characters
    escape_map = str.maketrans({
        '+': r'\+', '-': r'\-', '&': r'\&', '|': r'\|',
        '!': r'\!', '(': r'\(', ')': r'\)', '{': r'\{',
        '}': r'\}', '[': r'\[', ']': r'\]', '^': r'\^',
        '"': r'\"', '~': r'\~', '*': r'\*', '?': r'\?',
        ':': r'\:', '\\': r'\\', '/': r'\/',
    })
    sanitized = query.translate(escape_map)
    # Escape Lucene boolean operators as whole words only
    sanitized = re.sub(r'\bAND\b', r'\\AND', sanitized)
    sanitized = re.sub(r'\bOR\b', r'\\OR', sanitized)
    sanitized = re.sub(r'\bNOT\b', r'\\NOT', sanitized)
    return sanitized
```
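A quick way to check the whole-word approach is a small self-contained regression sketch. `sanitize_sketch` below is a hypothetical condensed stand-in for the suggested function (same special characters, one alternation regex instead of three substitutions); the sample queries are illustrative.

```python
import re

# Same special characters as the suggested fix, built from a string for brevity.
_SPECIAL = str.maketrans({ch: '\\' + ch for ch in '+-&|!(){}[]^"~*?:\\/'})
# Boolean operators, matched only as whole uppercase words.
_OPERATOR = re.compile(r'\b(AND|OR|NOT)\b')


def sanitize_sketch(query: str) -> str:
    """Hypothetical stand-in for the suggested lucene_sanitize()."""
    escaped = query.translate(_SPECIAL)
    # r'\\\1' emits a literal backslash followed by the matched operator.
    return _OPERATOR.sub(r'\\\1', escaped)


# Acronyms and ordinary words pass through unchanged.
assert sanitize_sketch("EBITDA forecast") == "EBITDA forecast"
assert sanitize_sketch("NORD stream") == "NORD stream"
# Whole-word operators and special characters are still escaped.
assert sanitize_sketch("cats AND dogs") == r"cats \AND dogs"
assert sanitize_sketch("NOT (a OR b)") == r"\NOT \(a \OR b\)"
assert sanitize_sketch("R&D") == r"R\&D"
```

Assertions like these could also replace the lowercase-only test input mentioned under Root cause, so that uppercase queries are covered.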

## Environment

- graphiti-core 0.28.1 (latest)
- Neo4j 5.x
- Python 3.12
