fix(context): enforce hard token-position cap in context window#2055

Open
AlexanderSanin wants to merge 1 commit into
microsoft:mainfrom
AlexanderSanin:fix/context-window-boundary

Conversation

@AlexanderSanin

Summary

  • Fixes #1444 (Context words are used outside the suffix/prefix window): context words were being matched outside the configured context_prefix_count / context_suffix_count window.
  • Root cause: _add_n_words only decremented the keyword budget (remaining) when a keyword was matched, so stop-words and punctuation were invisible to the budget. With a small prefix count (e.g. context_prefix_count=0) this allowed tokens arbitrarily far from the entity to contribute.
  • Fix: introduce max_token_positions = n_words * 2 + 1 as a hard cap on the total number of token positions scanned in either direction. This bounds the window to a predictable neighbourhood of the entity while still allowing enough room for typical stop-word and punctuation density (multiplier of 2).
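
The root cause and the fix described above can be compared side by side in a standalone sketch. This is an illustration modelled on the diff in this PR, not the actual Presidio method: the function name `scan_context` and its `hard_cap` flag are hypothetical, introduced only so that the old and new loop conditions can be run against the same input.

```python
from typing import List


def scan_context(
    index: int,
    n_words: int,
    lemmas: List[str],
    keywords: List[str],
    is_backward: bool,
    hard_cap: bool = True,
) -> List[str]:
    """Collect up to n_words keywords (plus the entity token) around index.

    hard_cap=False mimics the old loop, hard_cap=True mimics the patched one.
    """
    context_words = []
    remaining = n_words + 1  # keyword budget, +1 for the entity token
    # Position budget: every visited token costs one, keyword or not.
    # Without the cap the scan may run to the end of the lemma list.
    positions = n_words * 2 + 1 if hard_cap else len(lemmas)
    i = index
    while 0 <= i < len(lemmas) and remaining > 0 and positions > 0:
        lower_lemma = lemmas[i].lower()
        if lower_lemma in keywords:
            context_words.append(lower_lemma)
            remaining -= 1
        positions -= 1
        i = i - 1 if is_backward else i + 1
    return context_words


# Illustrative input: three non-keyword tokens separate the entity
# from the only keyword.
lemmas = ["entity", "the", "of", "a", "far_keyword"]
keywords = ["far_keyword"]

uncapped = scan_context(0, 1, lemmas, keywords, is_backward=False, hard_cap=False)
capped = scan_context(0, 1, lemmas, keywords, is_backward=False, hard_cap=True)
print(uncapped)  # ['far_keyword']
print(capped)    # []
```

With `n_words=1` the capped scan visits only three positions, so the keyword at index 4 is never reached, whereas the uncapped scan keeps walking because non-keyword tokens never consume the keyword budget.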

Test plan

  • All existing unit tests pass (test_lemma_context_aware_enhancer.py, test_context_support.py).
  • Three new unit tests added:
    • test_add_n_words_forward_respects_token_window: the hard cap blocks a keyword beyond the window boundary.
    • test_add_n_words_backward_respects_token_window: the same check in the backward direction.
    • test_add_n_words_zero_window_only_includes_entity: n_words=0 returns only the entity token.
  • Manually verify with the reproducer from #1444 (Context words are used outside the suffix/prefix window): entities near Dollars/cents should respect the configured prefix/suffix counts.
cd presidio-analyzer
pip install -e .
python -m pytest tests/test_lemma_context_aware_enhancer.py tests/test_context_support.py -v
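
The three boundary checks listed in the test plan might look roughly like the sketch below. It exercises a local stand-in, `capped_scan` (a hypothetical name), instead of importing `LemmaContextAwareEnhancer._add_n_words`, and the token lists and expected values are illustrative assumptions rather than the PR's actual test data.

```python
from typing import List


def capped_scan(
    index: int,
    n_words: int,
    lemmas: List[str],
    keywords: List[str],
    is_backward: bool,
) -> List[str]:
    """Stand-in for the patched loop: keyword budget plus position cap."""
    context_words = []
    remaining = n_words + 1
    max_token_positions = n_words * 2 + 1
    i = index
    while 0 <= i < len(lemmas) and remaining > 0 and max_token_positions > 0:
        lower_lemma = lemmas[i].lower()
        if lower_lemma in keywords:
            context_words.append(lower_lemma)
            remaining -= 1
        max_token_positions -= 1
        i = i - 1 if is_backward else i + 1
    return context_words


def test_forward_respects_token_window():
    # n_words=1 gives a cap of 3 positions: indices 0, 1 and 2.
    lemmas = ["entity", "the", "a", "far_keyword"]
    keywords = ["entity", "far_keyword"]
    assert capped_scan(0, 1, lemmas, keywords, is_backward=False) == ["entity"]


def test_backward_respects_token_window():
    # Same layout mirrored: the scan starts at the last index and walks left.
    lemmas = ["far_keyword", "a", "the", "entity"]
    keywords = ["entity", "far_keyword"]
    assert capped_scan(3, 1, lemmas, keywords, is_backward=True) == ["entity"]


def test_zero_window_only_includes_entity():
    # n_words=0 gives a cap of 1 position: the entity token itself.
    lemmas = ["entity", "near_keyword"]
    keywords = ["entity", "near_keyword"]
    assert capped_scan(0, 0, lemmas, keywords, is_backward=False) == ["entity"]


if __name__ == "__main__":
    test_forward_respects_token_window()
    test_backward_respects_token_window()
    test_zero_window_only_includes_entity()
```

In the forward and backward cases the far keyword is separated from the entity by two non-keyword tokens, so the keyword budget alone would not stop the scan; only the position cap does.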

Copilot AI review requested due to automatic review settings June 4, 2026 09:28
@AlexanderSanin
Author

Hey @Surya-5555 @omri374 @SharonHart, could you please take a look at this?

The `_add_n_words` helper only decremented `remaining` when a
keyword was matched, so stop-words and punctuation between the
entity and a distant context word were "invisible" to the budget.
With a small `context_prefix_count` (e.g. 0) this allowed tokens
far outside the intended window to contribute to the context,
causing incorrect score boosts.

Introduce `max_token_positions = n_words * 2 + 1` as a hard cap
on the total number of token positions scanned in either direction.
The cap is large enough to accommodate typical stop-word and
punctuation density while bounding the window to a predictable
neighbourhood of the entity.

Add three unit tests that exercise the forward/backward cap and
the n_words=0 (prefix/suffix disabled) edge case.

Closes microsoft#1444

Signed-off-by: Oleksandr Sanin <alexaaander.sanin@gmail.com>
@AlexanderSanin AlexanderSanin force-pushed the fix/context-window-boundary branch from 525311e to f1ee6c2 Compare June 4, 2026 09:29
Contributor

Copilot AI left a comment

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR introduces a hard cap on token scanning in LemmaContextAwareEnhancer._add_n_words to prevent context keywords from being collected outside the intended prefix/suffix window, and adds regression tests to validate the new boundary behavior.

Changes:

  • Add a max_token_positions = n_words * 2 + 1 scan limit to _add_n_words
  • Add unit tests covering forward/backward scan caps and the n_words=0 edge case
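
To make the scan limit concrete, the formula can be evaluated for a few window sizes. This is a small arithmetic sketch; the helper name `max_token_positions_for` is hypothetical and the sample values of `n_words` are arbitrary.

```python
def max_token_positions_for(n_words: int) -> int:
    """Hard cap on visited token positions, as given in the PR description."""
    return n_words * 2 + 1


for n_words in (0, 1, 2, 5):
    print(n_words, max_token_positions_for(n_words))
# 0 1
# 1 3
# 2 5
# 5 11
```

The cap counts the entity token itself, so `n_words=0` leaves room for exactly one position.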

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.

| File | Description |
| --- | --- |
| presidio-analyzer/presidio_analyzer/context_aware_enhancers/lemma_context_aware_enhancer.py | Adds a hard cap on token positions scanned when collecting context words |
| presidio-analyzer/tests/test_lemma_context_aware_enhancer.py | Adds regression tests ensuring _add_n_words respects the new scan window |

Comment on lines +340 to 352
+        # collect at most n_words keywords (plus the entity token itself).
+        # A hard cap on total token positions prevents scanning arbitrarily
+        # far when n_words is small (e.g. 0), which would otherwise allow
+        # context words far outside the intended window to contribute.
         remaining = n_words + 1
-        while 0 <= i < len(lemmas) and remaining > 0:
+        max_token_positions = n_words * 2 + 1
+        while 0 <= i < len(lemmas) and remaining > 0 and max_token_positions > 0:
             lower_lemma = lemmas[i].lower()
             if lower_lemma in lemmatized_filtered_keywords:
                 context_words.append(lower_lemma)
                 remaining -= 1
+            max_token_positions -= 1
             i = i - 1 if is_backward else i + 1
         # the number of collected words

-        # collect at most n words (in lower case)
+        # collect at most n_words keywords (plus the entity token itself).
Comment on lines 344 to 350
         remaining = n_words + 1
-        while 0 <= i < len(lemmas) and remaining > 0:
+        max_token_positions = n_words * 2 + 1
+        while 0 <= i < len(lemmas) and remaining > 0 and max_token_positions > 0:
             lower_lemma = lemmas[i].lower()
             if lower_lemma in lemmatized_filtered_keywords:
                 context_words.append(lower_lemma)
                 remaining -= 1
Comment on lines +291 to +293
With n_words=1 the hard cap is 3 token positions. A keyword sitting at
position 4 (beyond the cap) must not be collected even when stop-words
and punctuation fill the earlier positions.
)


def test_add_n_words_forward_respects_token_window():
    lemmas = ["entity", "the", "near_keyword", "far_keyword"]
    lemmatized_filtered_keywords = ["entity", "near_keyword", "far_keyword"]

    result = LemmaContextAwareEnhancer._add_n_words(
@@ -1,3 +1,5 @@
import pytest