fix(context): enforce hard token-position cap in context window by AlexanderSanin · Pull Request #2055 · microsoft/presidio

AlexanderSanin · 2026-06-04T09:28:24Z

Summary

Fixes #1444 (Context words are used outside the suffix/prefix window): context words were being matched outside the configured context_prefix_count / context_suffix_count window.
Root cause: _add_n_words only decremented the keyword budget (remaining) when a keyword was matched, so stop-words and punctuation were invisible to the budget. With a small prefix count (e.g. context_prefix_count=0) this allowed tokens arbitrarily far from the entity to contribute.
Fix: introduce max_token_positions = n_words * 2 + 1 as a hard cap on the total number of token positions scanned in either direction. This bounds the window to a predictable neighbourhood of the entity while still allowing enough room for typical stop-word and punctuation density (multiplier of 2).
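The effect of the cap can be sketched with a stand-alone approximation of the scan loop. This is not the library code: `add_n_words`, its `use_cap` switch, and the sample lemmas are assumptions made for illustration, with `use_cap=False` mimicking the pre-fix behaviour.

```python
from typing import List


def add_n_words(
    index: int,
    n_words: int,
    lemmas: List[str],
    keywords: List[str],
    is_backward: bool,
    use_cap: bool = True,
) -> List[str]:
    """Hypothetical stand-in for _add_n_words, with the cap made optional."""
    i = index
    context_words = []
    remaining = n_words + 1  # keyword budget, plus one for the entity token
    max_token_positions = n_words * 2 + 1  # hard cap proposed in this PR
    while 0 <= i < len(lemmas) and remaining > 0:
        if use_cap and max_token_positions <= 0:
            break  # window exhausted, regardless of how many keywords matched
        lower_lemma = lemmas[i].lower()
        if lower_lemma in keywords:
            context_words.append(lower_lemma)
            remaining -= 1
        max_token_positions -= 1  # every visited token consumes a position
        i = i - 1 if is_backward else i + 1
    return context_words


# Assumed example: the entity is at index 0 and the only keyword is six tokens away.
lemmas = ["1234", "be", "the", "number", "of", "my", "account"]
keywords = ["account"]

print(add_n_words(0, 0, lemmas, keywords, False, use_cap=False))  # ['account']
print(add_n_words(0, 0, lemmas, keywords, False))  # []
```

Without the cap, non-keyword tokens never consume budget, so the scan runs on until it reaches the distant keyword even though the window is 0. With the cap, n_words=0 allows one position (the entity itself) and n_words=1 allows three.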

Test plan

All existing unit tests pass (test_lemma_context_aware_enhancer.py, test_context_support.py).
Three new unit tests added:
- test_add_n_words_forward_respects_token_window — hard cap blocks a keyword beyond the window boundary.
- test_add_n_words_backward_respects_token_window — same check in the backward direction.
- test_add_n_words_zero_window_only_includes_entity — n_words=0 returns only the entity token.
Manually verify with the reproducer from #1444 (Context words are used outside the suffix/prefix window): entities near Dollars/cents should respect the configured prefix/suffix counts.

cd presidio-analyzer
pip install -e .
python -m pytest tests/test_lemma_context_aware_enhancer.py tests/test_context_support.py -v
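As a rough illustration of what the three boundary tests check, the sketch below runs the same kind of assertions against a local stand-in rather than the real class. `capped_scan` and the test names here are hypothetical; the inputs are chosen so that, in the forward and backward cases, the keyword budget alone would not stop the scan and only the position cap does.

```python
def capped_scan(index, n_words, lemmas, keywords, is_backward):
    """Hypothetical stand-in mirroring the patched loop described in this PR."""
    collected, i = [], index
    remaining = n_words + 1
    max_token_positions = n_words * 2 + 1
    while 0 <= i < len(lemmas) and remaining > 0 and max_token_positions > 0:
        lemma = lemmas[i].lower()
        if lemma in keywords:
            collected.append(lemma)
            remaining -= 1
        max_token_positions -= 1
        i = i - 1 if is_backward else i + 1
    return collected


def test_forward_cap_blocks_far_keyword():
    # n_words=1 gives a cap of 3 positions (indices 0, 1, 2);
    # the only other keyword sits at index 3, just past the cap.
    lemmas = ["entity", "the", ",", "far_keyword"]
    result = capped_scan(0, 1, lemmas, ["entity", "far_keyword"], False)
    assert result == ["entity"]


def test_backward_cap_blocks_far_keyword():
    # Same layout mirrored: scan leftwards from the entity at index 3.
    lemmas = ["far_keyword", ",", "the", "entity"]
    result = capped_scan(3, 1, lemmas, ["entity", "far_keyword"], True)
    assert result == ["entity"]


def test_zero_window_only_visits_entity():
    # n_words=0 gives a cap of 1 position: only the entity token is inspected.
    lemmas = ["entity", "keyword"]
    assert capped_scan(0, 0, lemmas, ["entity", "keyword"], False) == ["entity"]
    # If the entity is not itself a keyword, nothing is collected.
    assert capped_scan(0, 0, lemmas, ["keyword"], False) == []
```

Placing a non-keyword token at every position between the entity and the far keyword matters: if a second keyword fell inside the window, `remaining` would reach zero first and the test would pass even without the cap.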

AlexanderSanin · 2026-06-04T09:28:42Z

Hey @Surya-5555 @omri374 @SharonHart. Could you please take a look at this?

The `_add_n_words` helper only decremented `remaining` when a keyword was matched, so stop-words and punctuation between the entity and a distant context word were "invisible" to the budget. With a small `context_prefix_count` (e.g. 0) this allowed tokens far outside the intended window to contribute to the context, causing incorrect score boosts.

Introduce `max_token_positions = n_words * 2 + 1` as a hard cap on the total number of token positions scanned in either direction. The cap is large enough to accommodate typical stop-word and punctuation density while bounding the window to a predictable neighbourhood of the entity.

Add three unit tests that exercise the forward/backward cap and the n_words=0 (prefix/suffix disabled) edge case.

Closes microsoft#1444
Signed-off-by: Oleksandr Sanin <alexaaander.sanin@gmail.com>

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR introduces a hard cap on token scanning in LemmaContextAwareEnhancer._add_n_words to prevent context keywords from being collected outside the intended prefix/suffix window, and adds regression tests to validate the new boundary behavior.

Changes:

- Add a max_token_positions = n_words * 2 + 1 scan limit to _add_n_words
- Add unit tests covering forward/backward scan caps and the n_words=0 edge case

Reviewed changes

Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.

File	Description
presidio-analyzer/presidio_analyzer/context_aware_enhancers/lemma_context_aware_enhancer.py	Adds a hard cap on token positions scanned when collecting context words
presidio-analyzer/tests/test_lemma_context_aware_enhancer.py	Adds regression tests ensuring `_add_n_words` respects the new scan window

         # the number of collected words

-        # collect at most n words (in lower case)
+        # collect at most n_words keywords (plus the entity token itself).
+        # A hard cap on total token positions prevents scanning arbitrarily
+        # far when n_words is small (e.g. 0), which would otherwise allow
+        # context words far outside the intended window to contribute.
         remaining = n_words + 1
-        while 0 <= i < len(lemmas) and remaining > 0:
+        max_token_positions = n_words * 2 + 1
+        while 0 <= i < len(lemmas) and remaining > 0 and max_token_positions > 0:
             lower_lemma = lemmas[i].lower()
             if lower_lemma in lemmatized_filtered_keywords:
                 context_words.append(lower_lemma)
                 remaining -= 1
+            max_token_positions -= 1
             i = i - 1 if is_backward else i + 1


+    With n_words=1 the hard cap is 3 token positions.  A keyword sitting at
+    position 4 (beyond the cap) must not be collected even when stop-words
+    and punctuation fill the earlier positions.


    )
+
+
+def test_add_n_words_forward_respects_token_window():


+    lemmas = ["entity", "the", "near_keyword", "far_keyword"]
+    lemmatized_filtered_keywords = ["entity", "near_keyword", "far_keyword"]
+
+    result = LemmaContextAwareEnhancer._add_n_words(


@@ -1,3 +1,5 @@
+import pytest


Copilot AI review requested due to automatic review settings June 4, 2026 09:28

AlexanderSanin mentioned this pull request Jun 4, 2026

Context words are used outside the suffix/prefix window #1444

Open

AlexanderSanin force-pushed the fix/context-window-boundary branch from 525311e to f1ee6c2 Compare June 4, 2026 09:29

github-actions Bot added the external label Jun 4, 2026

Copilot AI reviewed Jun 4, 2026

AlexanderSanin wants to merge 1 commit into microsoft:main from AlexanderSanin:fix/context-window-boundary

3 participants
