feat(analyzer): merge adjacent same-type entities separated by whitespace#2046
Open
AlexanderSanin wants to merge 1 commit into
…pace

When an NER model tokenizes a multi-word entity (e.g. "Dave Jones") it may return two consecutive spans of the same entity type with only whitespace between them. Previously Presidio would emit two separate placeholders (e.g. <PERSON> <PERSON>) instead of a single one, breaking anonymization quality and downstream synthetic-data generation.

A new static method EntityRecognizer.merge_adjacent_text_entities(results, text) is added. It sorts results by start offset and greedily merges consecutive spans of the same entity type whose intervening gap consists solely of whitespace, assigning the maximum score to the fused span.

The method is called in AnalyzerEngine.analyze() immediately after remove_duplicates(), so it integrates transparently into the existing pipeline without breaking any existing behaviour.

Six unit tests are added to test_entity_recognizer.py covering: basic two-token merge, score preservation, three-token chain merge, different entity types not merged, non-whitespace gap not merged, and empty input.

Closes microsoft#1090

Signed-off-by: Oleksandr Sanin <alexaaander.sanin@gmail.com>
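The merge step described in the commit message can be sketched roughly as follows. This is an illustration only, not the code from the PR: `Span` is a hypothetical stand-in for `RecognizerResult`, `merge_adjacent` is a hypothetical name, and taking `max()` of the two end offsets is a choice made for this sketch.

```python
from dataclasses import dataclass
from typing import List


@dataclass
class Span:
    """Hypothetical stand-in for RecognizerResult (illustration only)."""

    entity_type: str
    start: int
    end: int
    score: float


def merge_adjacent(results: List[Span], text: str) -> List[Span]:
    """Greedily merge same-type spans separated only by whitespace."""
    if not results:
        return []
    # Sort by start offset so that neighbours in the list are neighbours in the text.
    ordered = sorted(results, key=lambda r: r.start)
    merged = [ordered[0]]
    for nxt in ordered[1:]:
        current = merged[-1]
        gap = text[current.end:nxt.start]
        if current.entity_type == nxt.entity_type and gap.strip() == "":
            # Fuse the two spans and keep the higher score.
            # max() on the end offset is an assumption of this sketch; it keeps
            # the span from shrinking if nxt lies inside current.
            merged[-1] = Span(
                entity_type=current.entity_type,
                start=current.start,
                end=max(current.end, nxt.end),
                score=max(current.score, nxt.score),
            )
        else:
            merged.append(nxt)
    return merged


text = "Call Dave Jones today"
spans = [Span("PERSON", 10, 15, 0.6), Span("PERSON", 5, 9, 0.85)]
print(merge_adjacent(spans, text))
```

With the input above, the two `PERSON` spans for "Dave" and "Jones" collapse into one span covering offsets 5 to 15 with score 0.85.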
Author
Hey @Surya-5555 @omri374 @SharonHart. Could you please have a look at this?
Contributor
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
Adds a utility to merge adjacent same-type entity spans separated only by whitespace, and wires it into the analyzer pipeline so multi-token NER detections produce a single span.
Changes:
- New `EntityRecognizer.merge_adjacent_text_entities` static method.
- Unit tests covering merge behavior, score preservation, multi-token merges, type/gap rejection, and empty input.
- `AnalyzerEngine.analyze` now calls the merge step after deduplication.
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.
| File | Description |
|---|---|
| presidio-analyzer/presidio_analyzer/entity_recognizer.py | Implements the new merge helper. |
| presidio-analyzer/presidio_analyzer/analyzer_engine.py | Invokes the merge helper inside the analyze pipeline. |
| presidio-analyzer/tests/test_entity_recognizer.py | Adds tests for the new merge helper. |
Comment on lines 258 to 260

    results = EntityRecognizer.remove_duplicates(results)
    results = EntityRecognizer.merge_adjacent_text_entities(results, text)
    results = self.__remove_low_scores(results, score_threshold)
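One consequence of running the merge before the low-score filter, given that the merged span takes the maximum score, is that a low-scoring token adjacent to a high-scoring one survives the threshold as part of the merged span. The sketch below illustrates this with plain tuples standing in for `RecognizerResult`. The helpers `merge` and `drop_low` are hypothetical, and the `>=` comparison in `drop_low` is an assumption about how the threshold is applied.

```python
from typing import List, Tuple

# (entity_type, start, end, score); a stand-in for RecognizerResult.
Span = Tuple[str, int, int, float]


def merge(spans: List[Span], text: str) -> List[Span]:
    """Merge same-type spans whose gap is whitespace only (illustrative)."""
    merged: List[Span] = []
    for span in sorted(spans, key=lambda s: s[1]):
        if merged:
            etype, start, end, score = merged[-1]
            if etype == span[0] and text[end:span[1]].strip() == "":
                merged[-1] = (etype, start, max(end, span[2]), max(score, span[3]))
                continue
        merged.append(span)
    return merged


def drop_low(spans: List[Span], threshold: float) -> List[Span]:
    """Keep spans scoring at or above the threshold (assumed comparison)."""
    return [s for s in spans if s[3] >= threshold]


text = "Call Dave Jones today"
spans = [("PERSON", 5, 9, 0.85), ("PERSON", 10, 15, 0.3)]

# Order used in the snippet above: merge first, then filter.
merge_then_filter = drop_low(merge(spans, text), 0.5)
# Alternative order, shown only for comparison.
filter_then_merge = merge(drop_low(spans, 0.5), text)

print(merge_then_filter)
print(filter_then_merge)
```

With these inputs, merging first yields one span covering "Dave Jones", while filtering first leaves only "Dave". Which of the two is preferable depends on how much trust the low-scoring token deserves.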
Comment on lines +319 to +326

    current = RecognizerResult(
        entity_type=current.entity_type,
        start=current.start,
        end=nxt.end,
        score=max(current.score, nxt.score),
        analysis_explanation=current.analysis_explanation,
        recognition_metadata=current.recognition_metadata,
    )
Comment on lines +316 to +318

    for nxt in sorted_results[1:]:
        gap = text[current.end : nxt.start]
        if current.entity_type == nxt.entity_type and gap.strip() == "":
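The condition above relies on two Python behaviours: `str.strip()` with no arguments removes all leading and trailing whitespace, including newlines and tabs, and a slice whose start is at or past its stop evaluates to the empty string. The sketch below checks both using a hypothetical helper, `gap_is_blank`, which is not part of the PR.

```python
def gap_is_blank(text: str, prev_end: int, next_start: int) -> bool:
    """Return True when text[prev_end:next_start] has no non-whitespace characters."""
    return text[prev_end:next_start].strip() == ""


text = "Dave Jones\n\nSmith"

print(gap_is_blank(text, 4, 5))    # single space between "Dave" and "Jones"
print(gap_is_blank(text, 10, 12))  # two newlines before "Smith"
print(gap_is_blank(text, 4, 4))    # touching spans: the slice is empty
print(gap_is_blank(text, 6, 3))    # overlapping spans: start > stop also slices to ""
print(gap_is_blank(text, 0, 5))    # "Dave " contains letters
```

Under these semantics, same-type spans that touch or overlap, or that sit on different lines, would also pass the check. Whether that is the intended behaviour is for the author to confirm.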
Comment on lines +192 to +195

    def test_merge_empty_results():
        """Empty input returns empty output without error."""
        merged = EntityRecognizer.merge_adjacent_text_entities([], "some text")
        assert merged == []
Summary
- […] `PERSON` spans) would produce redundant placeholders like `<PERSON> <PERSON>` instead of a single `<PERSON>`.
- `EntityRecognizer.merge_adjacent_text_entities(results, text)`: a static method that sorts results by start offset and greedily fuses consecutive spans of the same entity type when the text between them is purely whitespace. The merged span takes `max(score_a, score_b)`.
- `AnalyzerEngine.analyze()` immediately after `remove_duplicates()`, so it is applied transparently to every analysis call.

Test plan
- `test_merge_adjacent_same_type_entities`: two PERSON spans `"Dave"` + `"Jones"` are merged into one
- `test_merge_adjacent_preserves_max_score`: merged entity uses the higher of the two scores
- `test_merge_adjacent_three_tokens`: chain of three same-type spans ("Jean Luc Picard") collapses to one
- `test_no_merge_when_different_entity_types`: PERSON + LOCATION are not merged
- `test_no_merge_when_gap_has_non_whitespace`: spans separated by punctuation are not merged
- `test_merge_empty_results`: empty input returns empty output without error

Closes #1090