fix(analyzer): prevent UrlRecognizer matching partial TLD labels (#1498) by Lawson-Darrow · Pull Request #2044 · microsoft/presidio

Lawson-Darrow · 2026-05-29T06:06:54Z

Fixes #1498

Problem

UrlRecognizer.BASE_URL_REGEX matches a TLD even when it is only the prefix of a longer label. The TLD alternation has no trailing boundary, so scheme-less <word>.<word> tokens, which are common in source code, module paths, and filenames, produce false positives:

Input	Matched (before)
`os.system`	`os.sy` (`.sy` = Syria)
`zeus.mtia.local`	`zeus.mt` (`.mt` = Malta)

Fix

Add a negative lookahead (?![a-z0-9-]) immediately after the TLD alternation, requiring the matched TLD to be a complete domain label (i.e. not immediately followed by another label character).

Genuine bare domains (example.sy, microsoft.com, google.co.il) still match.
URLs with schemes/paths are unaffected — a path begins with /, which the lookahead permits.
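
To make the effect of the lookahead concrete, here is a minimal sketch. The pattern is a hypothetical, heavily simplified stand-in (a handful of sample TLDs and an optional path), not the actual `BASE_URL_REGEX`; `LABELS`, `TLDS`, `PATH`, and `first_match` are names invented for this illustration.

```python
import re

# Hypothetical, simplified stand-in for the recognizer's pattern:
# one or more dot-terminated labels, then a TLD from a tiny sample list,
# then an optional path. The real TLD alternation is far longer.
LABELS = r"(?:[a-z0-9-]+\.)+"
TLDS = r"(?:com|io|il|sy|mt|py)"
PATH = r"(?:/\S*)?"

# "before": nothing stops the TLD from ending in the middle of a label.
before = re.compile(LABELS + TLDS + PATH, re.IGNORECASE)

# "after": the negative lookahead rejects a TLD that is immediately
# followed by another label character (letter, digit, or hyphen).
after = re.compile(LABELS + TLDS + r"(?![a-z0-9-])" + PATH, re.IGNORECASE)


def first_match(pattern, text):
    """Return the first matched substring, or None when nothing matches."""
    found = pattern.search(text)
    return found.group(0) if found else None


# Partial-label false positives disappear with the lookahead:
#   first_match(before, "os.system")  -> "os.sy"
#   first_match(after, "os.system")   -> None
# Complete labels and URLs with paths are still matched:
#   first_match(after, "example.sy")  -> "example.sy"
```

The lookahead consumes no characters, so a following `/`, `.`, whitespace, or end of input leaves the match itself unchanged.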

Tests

Added regression cases to test_url_recognizer.py:

The reported false positives (os.system, zeus.mtia.local, and a code-snippet line) now yield zero matches.
A guard case (example.sy) ensures a real ccTLD used as a complete label still matches.

All cases in test_url_recognizer.py pass; ruff check reports no new findings on the changed files. CHANGELOG updated under Analyzer → Fixed.
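
As a rough illustration of what the regression cases assert, the sketch below checks the same inputs against a hypothetical simplified pattern rather than the real recognizer. The tuple layout follows the entries visible in the diff with the score column left out; `URL_SKETCH`, `find_spans`, and `check_cases` are names assumed for this example and are not part of the test suite.

```python
import re

# Hypothetical simplified pattern (NOT the recognizer's real regex): labels,
# a TLD from a small sample list, then the complete-label lookahead.
URL_SKETCH = re.compile(
    r"(?:[a-z0-9-]+\.)+(?:com|io|il|sy|mt|py)(?![a-z0-9-])",
    re.IGNORECASE,
)


def find_spans(text):
    """Return a tuple of (start, end) offsets for every match in text."""
    return tuple(m.span() for m in URL_SKETCH.finditer(text))


# (text, expected_len, expected_positions)
CASES = [
    ("example.sy", 1, ((0, 10),)),        # guard: complete ccTLD label still matches
    ("zeus.mtia.local", 0, ()),           # reported false positive
    ("return os.system, (cmd,)", 0, ()),  # reported code-snippet line
]


def check_cases(cases):
    """Return the inputs whose spans differ from the expectation."""
    failures = []
    for text, expected_len, expected_positions in cases:
        spans = find_spans(text)
        if len(spans) != expected_len or spans != expected_positions:
            failures.append(text)
    return failures
```

An empty list from `check_cases(CASES)` means every case behaves as the PR describes under this simplified pattern.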

Note on scope

Tokens that are syntactically valid bare domains remain ambiguous by design — e.g. rpc.py (a Python file) is indistinguishable from the Paraguay domain rpc.py without surrounding context, so it intentionally still matches. This PR removes only the unambiguous partial-label matches.
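
The boundary of that scope can be sketched as follows, again with a hypothetical simplified pattern and an assumed helper name (`is_flagged`). The lookahead inspects only the character after the TLD, so it cannot tell a filename from a domain when the extension is itself a complete TLD.

```python
import re

# Hypothetical simplified pattern with a tiny sample TLD list.
SKETCH = re.compile(
    r"(?:[a-z0-9-]+\.)+(?:com|sy|py)(?![a-z0-9-])",
    re.IGNORECASE,
)


def is_flagged(token):
    """True when the whole token looks like a bare domain to the sketch."""
    return SKETCH.fullmatch(token) is not None


# "rpc.py" ends in a complete TLD label, so it is still flagged.
# "rpc.pyc" only has "py" as a prefix of a longer label, so it is not.
```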

🤖 Generated with Claude Code

The URL regex matched a ccTLD even when it was only the prefix of a longer label, producing false positives on `<word>.<word>` tokens common in code and filenames (e.g. `os.sy` in `os.system`, `zeus.mt` in `zeus.mtia`). A negative lookahead now requires the matched TLD to be a complete domain label. Genuine bare domains (e.g. `example.sy`) and URLs with paths are unaffected.

Adds regression tests for the reported false positives plus a guard ensuring real ccTLD domains still match.

Fixes microsoft#1498

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot

Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR fixes UrlRecognizer false positives where a TLD match was accepted even when it was only the prefix of a longer label (e.g., matching os.sy inside os.system), and adds regression tests and a changelog entry for the fix.

Changes:

Tightened UrlRecognizer’s base URL regex with a negative lookahead so the matched TLD must be a complete label.
Added regression tests to ensure code-like tokens (e.g. os.system, zeus.mtia.local) are not recognized as URLs, while bare ccTLD domains still are.
Documented the fix in the changelog (linked to #1498).

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

File	Description
presidio-analyzer/tests/test_url_recognizer.py	Adds regression tests for #1498 and a positive control for a bare ccTLD domain.
presidio-analyzer/presidio_analyzer/predefined_recognizers/generic/url_recognizer.py	Updates the URL regex to require the TLD be a complete label (negative lookahead).
CHANGELOG.md	Documents the behavior change and references the fixing issue.

        ('"https://microsoft.github.io/presidio/"', 1, ((0, 39),), 0.6),  
        ("'https://microsoft.github.io/presidio/'", 1, ((0, 39),), 0.6),
+        # A genuine ccTLD as a complete label must still match (#1498 guard)
+        ("example.sy", 1, ((0, 10),), 0.5,),


+        ("zeus.mtia.local", 0, (), 0,),
+        ("return os.system, (cmd,)", 0, (), 0,),


 - Added `supported_entity` parameter to `PhoneRecognizer`. Previously, this recognizer hard-coded `["PHONE_NUMBER"]` as the only possible supported entity.

 #### Fixed
+- Fixed `UrlRecognizer` matching a TLD that is only the prefix of a longer label, producing false positives on `<word>.<word>` tokens common in code and filenames (e.g. `os.sy` in `os.system`, `zeus.mt` in `zeus.mtia`). A negative lookahead now requires the matched TLD to be a complete domain label. Genuine bare domains (e.g. `example.sy`) and URLs with paths are unaffected. Fixes [#1498](https://github.com/microsoft/presidio/issues/1498).


Lawson-Darrow · 2026-05-29T06:09:19Z

@microsoft-github-policy-service agree

Copilot AI review requested due to automatic review settings May 29, 2026 06:06

Lawson-Darrow mentioned this pull request May 29, 2026

UrlRecognizer detects many false positives when analyzing code snippets #1498

Open

github-actions Bot added the external label May 29, 2026

Copilot AI reviewed May 29, 2026

Address review: test-tuple style and CHANGELOG wording

27a8382

- Drop the trailing commas inside the new test-case tuples to match the surrounding entries' style.
- Reword the CHANGELOG entry to describe the false positive as a `<word>.<tld>` substring matched inside a longer identifier.

Lawson-Darrow wants to merge 2 commits into microsoft:main from Lawson-Darrow:fix/url-recognizer-partial-tld-1498
