Skip to content

feat(analyzer): merge adjacent same-type entities separated by whitespace#2046

Open
AlexanderSanin wants to merge 1 commit into
microsoft:mainfrom
AlexanderSanin:fix/merge-adjacent-same-type-entities
Open

feat(analyzer): merge adjacent same-type entities separated by whitespace#2046
AlexanderSanin wants to merge 1 commit into
microsoft:mainfrom
AlexanderSanin:fix/merge-adjacent-same-type-entities

Conversation

@AlexanderSanin
Copy link
Copy Markdown

Summary

  • Fixes the long-standing issue where NER models that tokenise multi-word entities (e.g. spaCy detecting "Dave" and "Jones" as two separate PERSON spans) would produce redundant placeholders like <PERSON> <PERSON> instead of a single <PERSON>.
  • Adds EntityRecognizer.merge_adjacent_text_entities(results, text) — a static method that sorts results by start offset and greedily fuses consecutive spans of the same entity type when the text between them is purely whitespace. The merged span takes max(score_a, score_b).
  • Wires the new method into AnalyzerEngine.analyze() immediately after remove_duplicates(), so it is applied transparently to every analysis call.

Test plan

  • test_merge_adjacent_same_type_entities — two PERSON spans "Dave" + "Jones" are merged into one
  • test_merge_adjacent_preserves_max_score — merged entity uses the higher of the two scores
  • test_merge_adjacent_three_tokens — chain of three same-type spans ("Jean Luc Picard") collapses to one
  • test_no_merge_when_different_entity_types — PERSON + LOCATION are not merged
  • test_no_merge_when_gap_has_non_whitespace — spans separated by punctuation are not merged
  • test_merge_empty_results — empty input returns empty output without error
  • All pre-existing tests continue to pass (13/13 green)

Closes #1090

…pace

When an NER model tokenizes a multi-word entity (e.g. "Dave Jones") it may
return two consecutive spans of the same entity type with only whitespace
between them.  Previously Presidio would emit two separate placeholders
(e.g. <PERSON> <PERSON>) instead of a single one, breaking anonymization
quality and downstream synthetic-data generation.

A new static method EntityRecognizer.merge_adjacent_text_entities(results,
text) is added.  It sorts results by start offset and greedily merges
consecutive spans of the same entity type whose intervening gap consists
solely of whitespace, assigning the maximum score to the fused span.  The
method is called in AnalyzerEngine.analyze() immediately after
remove_duplicates(), so it integrates transparently into the existing
pipeline without breaking any existing behaviour.

Six unit tests are added to test_entity_recognizer.py covering: basic two-
token merge, score preservation, three-token chain merge, different entity
types not merged, non-whitespace gap not merged, and empty input.

Closes microsoft#1090

Signed-off-by: Oleksandr Sanin <alexaaander.sanin@gmail.com>
Copilot AI review requested due to automatic review settings May 29, 2026 09:33
@AlexanderSanin
Copy link
Copy Markdown
Author

Hey @Surya-5555 @omri374 @SharonHart. Could you, please, have a look at this?

Copy link
Copy Markdown
Contributor

Copilot AI left a comment

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a utility to merge adjacent same-type entity spans separated only by whitespace, and wires it into the analyzer pipeline so multi-token NER detections produce a single span.

Changes:

  • New EntityRecognizer.merge_adjacent_text_entities static method.
  • Unit tests covering merge behavior, score preservation, multi-token merges, type/gap rejection, and empty input.
  • AnalyzerEngine.analyze now calls the merge step after deduplication.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File Description
presidio-analyzer/presidio_analyzer/entity_recognizer.py Implements the new merge helper.
presidio-analyzer/presidio_analyzer/analyzer_engine.py Invokes the merge helper inside the analyze pipeline.
presidio-analyzer/tests/test_entity_recognizer.py Adds tests for the new merge helper.

Comment on lines 258 to 260
results = EntityRecognizer.remove_duplicates(results)
results = EntityRecognizer.merge_adjacent_text_entities(results, text)
results = self.__remove_low_scores(results, score_threshold)
Comment on lines +319 to +326
current = RecognizerResult(
entity_type=current.entity_type,
start=current.start,
end=nxt.end,
score=max(current.score, nxt.score),
analysis_explanation=current.analysis_explanation,
recognition_metadata=current.recognition_metadata,
)
Comment on lines +316 to +318
for nxt in sorted_results[1:]:
gap = text[current.end : nxt.start]
if current.entity_type == nxt.entity_type and gap.strip() == "":
Comment on lines +192 to +195
def test_merge_empty_results():
"""Empty input returns empty output without error."""
merged = EntityRecognizer.merge_adjacent_text_entities([], "some text")
assert merged == []
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Merge two entities from the same type with whitespace between them

2 participants