feat(analyzer): merge adjacent same-type entities separated by whitespace by AlexanderSanin · Pull Request #2046 · microsoft/presidio

AlexanderSanin · 2026-05-29T09:33:26Z

Summary

Fixes the long-standing issue where NER models that tokenise multi-word entities (e.g. spaCy detecting "Dave" and "Jones" as two separate PERSON spans) would produce redundant placeholders like <PERSON> <PERSON> instead of a single <PERSON>.
Adds EntityRecognizer.merge_adjacent_text_entities(results, text) — a static method that sorts results by start offset and greedily fuses consecutive spans of the same entity type when the text between them is purely whitespace. The merged span takes max(score_a, score_b).
Wires the new method into AnalyzerEngine.analyze() immediately after remove_duplicates(), so it is applied transparently to every analysis call.
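The merge described above can be sketched in isolation. This is a minimal illustration rather than the PR's actual code: `Span` is a hypothetical stand-in for `RecognizerResult`, `merge_adjacent` is a hypothetical name, and the sketch assumes the input spans do not overlap (the PR runs the merge after `remove_duplicates()`).

```python
from dataclasses import dataclass, replace
from typing import List


@dataclass(frozen=True)
class Span:
    """Hypothetical stand-in for presidio_analyzer.RecognizerResult."""

    entity_type: str
    start: int
    end: int
    score: float


def merge_adjacent(results: List[Span], text: str) -> List[Span]:
    """Fuse consecutive same-type spans separated only by whitespace."""
    if not results:
        return []
    # Sort by start offset so that neighbours in the list are neighbours in the text.
    ordered = sorted(results, key=lambda r: r.start)
    merged: List[Span] = []
    current = ordered[0]
    for nxt in ordered[1:]:
        gap = text[current.end:nxt.start]
        if current.entity_type == nxt.entity_type and gap.strip() == "":
            # Extend the current span to the end of the next one, keep the higher score.
            current = replace(current, end=nxt.end, score=max(current.score, nxt.score))
        else:
            merged.append(current)
            current = nxt
    merged.append(current)
    return merged


text = "Dave Jones lives in Paris"
spans = [
    Span("PERSON", 5, 10, 0.6),      # "Jones"
    Span("PERSON", 0, 4, 0.85),      # "Dave"
    Span("LOCATION", 20, 25, 0.85),  # "Paris"
]
# -> one PERSON span covering 0..10 with score 0.85, plus the untouched LOCATION span
print(merge_adjacent(spans, text))
```

Because the loop keeps extending `current`, a chain such as "Jean Luc Picard" collapses in a single pass without a second iteration over the results.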

Test plan

test_merge_adjacent_same_type_entities — two PERSON spans "Dave" + "Jones" are merged into one
test_merge_adjacent_preserves_max_score — merged entity uses the higher of the two scores
test_merge_adjacent_three_tokens — chain of three same-type spans ("Jean Luc Picard") collapses to one
test_no_merge_when_different_entity_types — PERSON + LOCATION are not merged
test_no_merge_when_gap_has_non_whitespace — spans separated by punctuation are not merged
test_merge_empty_results — empty input returns empty output without error
All pre-existing tests continue to pass (13/13 green)

…pace

When an NER model tokenizes a multi-word entity (e.g. "Dave Jones") it may return two consecutive spans of the same entity type with only whitespace between them. Previously Presidio would emit two separate placeholders (e.g. <PERSON> <PERSON>) instead of a single one, breaking anonymization quality and downstream synthetic-data generation.

A new static method EntityRecognizer.merge_adjacent_text_entities(results, text) is added. It sorts results by start offset and greedily merges consecutive spans of the same entity type whose intervening gap consists solely of whitespace, assigning the maximum score to the fused span.

The method is called in AnalyzerEngine.analyze() immediately after remove_duplicates(), so it integrates transparently into the existing pipeline without breaking any existing behaviour.

Six unit tests are added to test_entity_recognizer.py covering: basic two-token merge, score preservation, three-token chain merge, different entity types not merged, non-whitespace gap not merged, and empty input.

Closes microsoft#1090

Signed-off-by: Oleksandr Sanin <alexaaander.sanin@gmail.com>

AlexanderSanin · 2026-05-29T09:33:44Z

Hey @Surya-5555 @omri374 @SharonHart, could you please take a look at this?

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

Adds a utility to merge adjacent same-type entity spans separated only by whitespace, and wires it into the analyzer pipeline so multi-token NER detections produce a single span.

Changes:

New EntityRecognizer.merge_adjacent_text_entities static method.
Unit tests covering merge behavior, score preservation, multi-token merges, type/gap rejection, and empty input.
AnalyzerEngine.analyze now calls the merge step after deduplication.

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 4 comments.

File	Description
presidio-analyzer/presidio_analyzer/entity_recognizer.py	Implements the new merge helper.
presidio-analyzer/presidio_analyzer/analyzer_engine.py	Invokes the merge helper inside the analyze pipeline.
presidio-analyzer/tests/test_entity_recognizer.py	Adds tests for the new merge helper.

        results = EntityRecognizer.remove_duplicates(results)
+        results = EntityRecognizer.merge_adjacent_text_entities(results, text)
        results = self.__remove_low_scores(results, score_threshold)


+                current = RecognizerResult(
+                    entity_type=current.entity_type,
+                    start=current.start,
+                    end=nxt.end,
+                    score=max(current.score, nxt.score),
+                    analysis_explanation=current.analysis_explanation,
+                    recognition_metadata=current.recognition_metadata,
+                )


+        for nxt in sorted_results[1:]:
+            gap = text[current.end : nxt.start]
+            if current.entity_type == nxt.entity_type and gap.strip() == "":
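The gap condition in this fragment accepts more than a single space. A small sketch using only built-in string methods shows what `gap.strip() == ""` evaluates to for a few gaps; the helper name `is_mergeable_gap` is hypothetical and not part of the PR.

```python
def is_mergeable_gap(gap: str) -> bool:
    """Hypothetical helper mirroring the diff's check: gap.strip() == ""."""
    return gap.strip() == ""


# Map each candidate gap to whether the check would allow a merge.
gap_results = {gap: is_mergeable_gap(gap) for gap in [" ", "", "\n", " \t ", ", ", "-"]}

for gap, ok in gap_results.items():
    print(repr(gap), ok)
# ' '    True
# ''     True   (spans that touch directly also pass)
# '\n'   True   (a line break counts as whitespace)
# ' \t ' True
# ', '   False
# '-'    False
```

Whether same-type spans on either side of a line break, or spans with no gap at all, should be fused is a design question for the reviewers; the sketch only shows what the expression returns.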


+def test_merge_empty_results():
+    """Empty input returns empty output without error."""
+    merged = EntityRecognizer.merge_adjacent_text_entities([], "some text")
+    assert merged == []


Copilot AI review requested due to automatic review settings May 29, 2026 09:33

AlexanderSanin mentioned this pull request May 29, 2026

Merge two entities from the same type with whitespace between them #1090

Open

github-actions Bot added the external label May 29, 2026

Copilot AI reviewed May 29, 2026

AlexanderSanin wants to merge 1 commit into microsoft:main from AlexanderSanin:fix/merge-adjacent-same-type-entities
