fix(context): enforce hard token-position cap in context window #2055
Open
AlexanderSanin wants to merge 1 commit into
Conversation
Author
Hey @Surya-5555 @omri374 @SharonHart, could you please have a look at this?
The `_add_n_words` helper only decremented `remaining` when a keyword was matched, so stop-words and punctuation between the entity and a distant context word were "invisible" to the budget. With a small `context_prefix_count` (e.g. 0), this allowed tokens far outside the intended window to contribute to the context, causing incorrect score boosts.

Introduce `max_token_positions = n_words * 2 + 1` as a hard cap on the total number of token positions scanned in either direction. The cap is large enough to accommodate typical stop-word and punctuation density while bounding the window to a predictable neighbourhood of the entity.

Add three unit tests that exercise the forward/backward cap and the `n_words=0` (prefix/suffix disabled) edge case.

Closes microsoft#1444

Signed-off-by: Oleksandr Sanin <alexaaander.sanin@gmail.com>
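The failure mode described in the commit message can be reproduced in isolation. The sketch below is not the Presidio source: `scan_keyword_budget_only` and `scan_with_position_cap` are hypothetical stand-ins that mirror the loop before and after this change, and they assume a forward scan that starts at the entity token.

```python
from typing import List


def scan_keyword_budget_only(
    index: int, n_words: int, lemmas: List[str], keywords: List[str]
) -> List[str]:
    """Hypothetical stand-in for the loop before this change (forward only)."""
    context_words: List[str] = []
    remaining = n_words + 1  # keyword budget: n_words plus the entity token
    i = index
    while 0 <= i < len(lemmas) and remaining > 0:
        lemma = lemmas[i].lower()
        if lemma in keywords:
            context_words.append(lemma)
            remaining -= 1  # only keyword matches consume the budget
        i += 1
    return context_words


def scan_with_position_cap(
    index: int, n_words: int, lemmas: List[str], keywords: List[str]
) -> List[str]:
    """Hypothetical stand-in for the loop after this change (forward only)."""
    context_words: List[str] = []
    remaining = n_words + 1
    max_token_positions = n_words * 2 + 1  # hard cap on scanned positions
    i = index
    while 0 <= i < len(lemmas) and remaining > 0 and max_token_positions > 0:
        lemma = lemmas[i].lower()
        if lemma in keywords:
            context_words.append(lemma)
            remaining -= 1
        max_token_positions -= 1  # every position consumes the cap
        i += 1
    return context_words


# Made-up lemmas: the only keyword sits six positions after the entity.
lemmas = ["1234", ",", "which", "be", "not", "a", "phone", "number"]
keywords = ["phone"]

before = scan_keyword_budget_only(0, 0, lemmas, keywords)
after = scan_with_position_cap(0, 0, lemmas, keywords)
print(before, after)
```

With `n_words=0` the budget-only loop keeps walking until it finds its first keyword, however far away that is, while the capped loop inspects only the entity position.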
525311e to f1ee6c2
Contributor
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR introduces a hard cap on token scanning in LemmaContextAwareEnhancer._add_n_words to prevent context keywords from being collected outside the intended prefix/suffix window, and adds regression tests to validate the new boundary behavior.
Changes:
- Add a `max_token_positions = n_words * 2 + 1` scan limit to `_add_n_words`
- Add unit tests covering forward/backward scan caps and the `n_words=0` edge case
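The arithmetic behind the `n_words * 2 + 1` limit can be made concrete. The loop in this PR starts with a keyword budget of `n_words + 1` (the extra one covers the entity token), so a cap of `2 * n_words + 1` positions leaves room for at most `n_words` non-keyword tokens if the whole budget is to be spent. The helper names below are hypothetical and exist only for this illustration.

```python
def max_token_positions(n_words: int) -> int:
    """Scan limit as stated in this PR: n_words * 2 + 1."""
    return n_words * 2 + 1


def filler_slack(n_words: int) -> int:
    """Non-keyword positions that fit under the cap while the full
    keyword budget (n_words + 1, counting the entity token) is spent."""
    keyword_budget = n_words + 1
    return max_token_positions(n_words) - keyword_budget


# Example window sizes chosen for illustration only.
table = [(n, max_token_positions(n), filler_slack(n)) for n in (0, 1, 5)]
for n, cap, slack in table:
    print(f"n_words={n}: cap={cap} positions, filler slack={slack}")
```

Put differently, the cap tolerates roughly one filler token per budgeted keyword, and none at all when `n_words=0`.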
Reviewed changes
Copilot reviewed 2 out of 2 changed files in this pull request and generated 7 comments.
| File | Description |
|---|---|
| presidio-analyzer/presidio_analyzer/context_aware_enhancers/lemma_context_aware_enhancer.py | Adds a hard cap on token positions scanned when collecting context words |
| presidio-analyzer/tests/test_lemma_context_aware_enhancer.py | Adds regression tests ensuring _add_n_words respects the new scan window |
Comment on lines +340 to 352
+ # collect at most n_words keywords (plus the entity token itself).
+ # A hard cap on total token positions prevents scanning arbitrarily
+ # far when n_words is small (e.g. 0), which would otherwise allow
+ # context words far outside the intended window to contribute.
  remaining = n_words + 1
- while 0 <= i < len(lemmas) and remaining > 0:
+ max_token_positions = n_words * 2 + 1
+ while 0 <= i < len(lemmas) and remaining > 0 and max_token_positions > 0:
      lower_lemma = lemmas[i].lower()
      if lower_lemma in lemmatized_filtered_keywords:
          context_words.append(lower_lemma)
          remaining -= 1
+     max_token_positions -= 1
      i = i - 1 if is_backward else i + 1

  # the number of collected words

- # collect at most n words (in lower case)
+ # collect at most n_words keywords (plus the entity token itself).
Comment on lines 344 to 350
  remaining = n_words + 1
- while 0 <= i < len(lemmas) and remaining > 0:
+ max_token_positions = n_words * 2 + 1
+ while 0 <= i < len(lemmas) and remaining > 0 and max_token_positions > 0:
      lower_lemma = lemmas[i].lower()
      if lower_lemma in lemmatized_filtered_keywords:
          context_words.append(lower_lemma)
          remaining -= 1
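The updated `while` condition has three independent ways to terminate: the token list ends, the keyword budget is spent, or the position cap is reached. A small sketch can report which one fired. `scan_stop_reason` is a hypothetical helper that copies the loop from the hunk above; it is not part of Presidio.

```python
from typing import List, Tuple


def scan_stop_reason(
    index: int,
    n_words: int,
    lemmas: List[str],
    keywords: List[str],
    is_backward: bool,
) -> Tuple[List[str], str]:
    """Run the capped scan and report why it stopped (illustrative only)."""
    context_words: List[str] = []
    remaining = n_words + 1
    max_token_positions = n_words * 2 + 1
    i = index
    while 0 <= i < len(lemmas) and remaining > 0 and max_token_positions > 0:
        lower_lemma = lemmas[i].lower()
        if lower_lemma in keywords:
            context_words.append(lower_lemma)
            remaining -= 1
        max_token_positions -= 1
        i = i - 1 if is_backward else i + 1

    # If the budget and the cap run out on the same step, the budget is reported.
    if remaining == 0:
        reason = "keyword budget spent"
    elif max_token_positions == 0:
        reason = "position cap reached"
    else:
        reason = "ran off the token list"
    return context_words, reason


# Made-up inputs, one per stop condition.
dense = scan_stop_reason(0, 1, ["ssn", "number"], ["ssn", "number"], False)
sparse = scan_stop_reason(0, 1, ["123", "be", "the", "ssn"], ["ssn"], False)
short = scan_stop_reason(1, 2, ["ssn", "123"], ["ssn"], True)
print(dense, sparse, short, sep="\n")
```

Only the second case depends on the new cap; the other two stop exactly as they did before this change.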
Comment on lines +291 to +293
    With n_words=1 the hard cap is 3 token positions. A keyword sitting at
    position 4 (beyond the cap) must not be collected even when stop-words
    and punctuation fill the earlier positions.

)


def test_add_n_words_forward_respects_token_window():

    lemmas = ["entity", "the", "near_keyword", "far_keyword"]
    lemmatized_filtered_keywords = ["entity", "near_keyword", "far_keyword"]

    result = LemmaContextAwareEnhancer._add_n_words(
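The test body is cut off in this view. As a rough illustration of the kind of check it describes, the sketch below runs against a local stand-in, `add_n_words_capped` (hypothetical: it copies the loop from the diff instead of importing `LemmaContextAwareEnhancer`, and its argument order is an assumption). Assuming the scan starts at index 0 with `n_words=1`, the lemmas listed above exhaust the keyword budget and the position cap on the same step, at `near_keyword`. The sketch therefore also uses a variant with only filler tokens before the far keyword, so that the cap alone has to stop the scan.

```python
from typing import List


def add_n_words_capped(
    index: int,
    n_words: int,
    lemmas: List[str],
    lemmatized_filtered_keywords: List[str],
    is_backward: bool,
) -> List[str]:
    """Stand-in for the patched helper; copies the loop shown in the diff."""
    i = index
    context_words: List[str] = []
    remaining = n_words + 1
    max_token_positions = n_words * 2 + 1
    while 0 <= i < len(lemmas) and remaining > 0 and max_token_positions > 0:
        lower_lemma = lemmas[i].lower()
        if lower_lemma in lemmatized_filtered_keywords:
            context_words.append(lower_lemma)
            remaining -= 1
        max_token_positions -= 1
        i = i - 1 if is_backward else i + 1
    return context_words


def test_forward_cap_blocks_far_keyword():
    # n_words=1 gives a cap of 3 positions (indices 0..2);
    # "far_keyword" sits at index 3, just outside the cap.
    lemmas = ["entity", "the", ",", "far_keyword"]
    keywords = ["entity", "far_keyword"]
    result = add_n_words_capped(0, 1, lemmas, keywords, is_backward=False)
    assert result == ["entity"]


def test_forward_with_lemmas_from_the_pr():
    # Same lemmas as the fragment above; budget and cap end together here.
    lemmas = ["entity", "the", "near_keyword", "far_keyword"]
    keywords = ["entity", "near_keyword", "far_keyword"]
    result = add_n_words_capped(0, 1, lemmas, keywords, is_backward=False)
    assert result == ["entity", "near_keyword"]


test_forward_cap_blocks_far_keyword()
test_forward_with_lemmas_from_the_pr()
```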
@@ -1,3 +1,5 @@
import pytest
Summary
- `context_prefix_count`/`context_suffix_count` window.
- `_add_n_words` only decremented the keyword budget (`remaining`) when a keyword was matched, so stop-words and punctuation were invisible to the budget. With a small prefix count (e.g. `context_prefix_count=0`), this allowed tokens arbitrarily far from the entity to contribute.
- Introduce `max_token_positions = n_words * 2 + 1` as a hard cap on the total number of token positions scanned in either direction. This bounds the window to a predictable neighbourhood of the entity while still allowing enough room for typical stop-word and punctuation density (multiplier of 2).

Test plan
- `test_lemma_context_aware_enhancer.py`, `test_context_support.py`).
- `test_add_n_words_forward_respects_token_window`: hard cap blocks a keyword beyond the window boundary.
- `test_add_n_words_backward_respects_token_window`: same check in the backward direction.
- `test_add_n_words_zero_window_only_includes_entity`: `n_words=0` returns only the entity token.
- `Dollars/cents` should respect the configured prefix/suffix counts.
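The backward and zero-window items in the plan can be illustrated with a small table of cases. This is a sketch under assumptions, not the tests from the PR: `capped_scan` is a hypothetical stand-in that copies the patched loop, and the lemma lists are made up.

```python
from typing import List


def capped_scan(
    index: int,
    n_words: int,
    lemmas: List[str],
    keywords: List[str],
    is_backward: bool,
) -> List[str]:
    """Hypothetical copy of the patched loop, used only for these cases."""
    context_words: List[str] = []
    remaining = n_words + 1
    max_token_positions = n_words * 2 + 1
    i = index
    while 0 <= i < len(lemmas) and remaining > 0 and max_token_positions > 0:
        lower_lemma = lemmas[i].lower()
        if lower_lemma in keywords:
            context_words.append(lower_lemma)
            remaining -= 1
        max_token_positions -= 1
        i = i - 1 if is_backward else i + 1
    return context_words


# (description, index, n_words, lemmas, keywords, is_backward, expected)
cases = [
    ("backward: far keyword lies outside the 3-position cap",
     3, 1, ["far_keyword", ",", "the", "entity"],
     ["entity", "far_keyword"], True, ["entity"]),
    ("zero window: only the entity position is inspected",
     0, 0, ["entity", "keyword"],
     ["entity", "keyword"], False, ["entity"]),
    ("zero window: entity is not a keyword, so nothing is collected",
     0, 0, ["4095", "keyword"],
     ["keyword"], False, []),
]

results = [capped_scan(i, n, lem, kw, back) for _, i, n, lem, kw, back, _ in cases]
for (description, *_rest, expected), got in zip(cases, results):
    assert got == expected, description
```

The third case goes beyond what the plan lists: when the entity token is not itself a keyword, a zero window yields an empty context under this loop.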