
fix(analyzer): prevent UrlRecognizer matching partial TLD labels (#1498) #2044

Open
Lawson-Darrow wants to merge 2 commits into microsoft:main from Lawson-Darrow:fix/url-recognizer-partial-tld-1498


@Lawson-Darrow


Fixes #1498

Problem

UrlRecognizer.BASE_URL_REGEX matches a TLD even when it is only the prefix of a longer label. The TLD alternation has no trailing boundary, so schema-less <word>.<word> tokens — common in source code, module paths, and filenames — produce false positives:

| Input | Matched (before) |
| --- | --- |
| os.system | os.sy (.sy = Syria) |
| zeus.mtia.local | zeus.mt (.mt = Malta) |

Fix

Add a negative lookahead (?![a-z0-9-]) immediately after the TLD alternation, requiring the matched TLD to be a complete domain label (i.e. not immediately followed by another label character).

  • Genuine bare domains (example.sy, microsoft.com, google.co.il) still match.
  • URLs with schemes/paths are unaffected — a path begins with /, which the lookahead permits.
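
The effect of the lookahead can be reproduced in isolation. The sketch below is only an illustration: the two patterns are simplified stand-ins rather than Presidio's actual `BASE_URL_REGEX`, the TLD list is a tiny hand-picked subset, and the names `BEFORE`, `AFTER`, and `first_match` are hypothetical.

```python
import re

# Assumption: a tiny subset of TLDs, just enough to reproduce the reported cases.
TLDS = "|".join(("com", "il", "mt", "sy"))

# Simplified stand-ins for the recognizer's pattern (NOT the real BASE_URL_REGEX).
BEFORE = re.compile(rf"[a-z0-9.\-]+[.](?:{TLDS})", re.IGNORECASE)
# Same pattern plus the negative lookahead described above.
AFTER = re.compile(rf"[a-z0-9.\-]+[.](?:{TLDS})(?![a-z0-9-])", re.IGNORECASE)


def first_match(pattern, text):
    """Return the first matched substring, or None when nothing matches."""
    m = pattern.search(text)
    return m.group() if m else None


for text in ("os.system", "zeus.mtia.local", "example.sy", "google.co.il", "example.sy/path"):
    print(f"{text!r}: before={first_match(BEFORE, text)!r} after={first_match(AFTER, text)!r}")
```

With the lookahead, the regex engine can no longer stop in the middle of `system` or `mtia`, because the character after the candidate TLD is still a label character; a following `/`, whitespace, or end of input leaves the match intact.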

Tests

Added regression cases to test_url_recognizer.py:

  • The reported false positives (os.system, zeus.mtia.local, and a code-snippet line) now yield zero matches.
  • A guard case (example.sy) ensures a real ccTLD used as a complete label still matches.

All cases in test_url_recognizer.py pass; ruff check reports no new findings on the changed files. CHANGELOG updated under Analyzer → Fixed.
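
The shape of these regression cases can be sketched as a small table-driven check. This is a standalone approximation, not the code in test_url_recognizer.py: `PATTERN` is a simplified stand-in for the patched regex with a hypothetical four-entry TLD list, `find_spans` is a hypothetical helper, and the score column used by the real test tuples is omitted.

```python
import re

# Simplified stand-in for the patched pattern (NOT Presidio's real BASE_URL_REGEX).
PATTERN = re.compile(r"[a-z0-9.\-]+[.](?:com|il|mt|sy)(?![a-z0-9-])", re.IGNORECASE)

# (text, expected number of matches, expected (start, end) spans)
CASES = [
    ("example.sy", 1, ((0, 10),)),        # guard: a real ccTLD as a complete label
    ("zeus.mtia.local", 0, ()),           # reported false positive
    ("return os.system, (cmd,)", 0, ()),  # code-snippet line
]


def find_spans(text):
    """Return the (start, end) span of every match in the text."""
    return tuple(m.span() for m in PATTERN.finditer(text))


for text, expected_len, expected_positions in CASES:
    spans = find_spans(text)
    assert len(spans) == expected_len, text
    assert spans == expected_positions, text
```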

Note on scope

Tokens that are syntactically valid bare domains remain ambiguous by design — e.g. rpc.py (a Python file) is indistinguishable from the Paraguay domain rpc.py without surrounding context, so it intentionally still matches. This PR removes only the unambiguous partial-label matches.
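
A short sketch of that boundary, again using a simplified stand-in pattern rather than the recognizer's real regex (the TLD subset and the names `PATCHED` and `first_match` are assumptions made for illustration):

```python
import re

# Simplified stand-in for the patched pattern (NOT Presidio's real BASE_URL_REGEX).
PATCHED = re.compile(r"[a-z0-9.\-]+[.](?:com|mt|py|sy)(?![a-z0-9-])", re.IGNORECASE)


def first_match(text):
    """Return the first matched substring, or None when nothing matches."""
    m = PATCHED.search(text)
    return m.group() if m else None


# "py" is a complete label here, so the token still looks like a bare domain.
print(first_match("see rpc.py for details"))
# "py" is only a prefix of "pyc", so the lookahead rejects it.
print(first_match("stale rpc.pyc removed"))
```

A regex alone has no way to tell the filename from the domain in the first case; separating them would need surrounding context, which is outside the scope described above.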

🤖 Generated with Claude Code

The URL regex matched a ccTLD even when it was only the prefix of a
longer label, producing false positives on `<word>.<word>` tokens
common in code and filenames (e.g. `os.sy` in `os.system`, `zeus.mt`
in `zeus.mtia`). A negative lookahead now requires the matched TLD to
be a complete domain label. Genuine bare domains (e.g. `example.sy`)
and URLs with paths are unaffected.

Adds regression tests for the reported false positives plus a guard
ensuring real ccTLD domains still match.

Fixes microsoft#1498

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Copilot AI left a comment

Contributor


Pull request overview

Note

Copilot was unable to run its full agentic suite in this review.

This PR fixes UrlRecognizer false positives where a TLD match was accepted even when it was only the prefix of a longer label (e.g., matching os.sy inside os.system), and adds regression tests and a changelog entry for the fix.

Changes:

  • Tightened UrlRecognizer’s base URL regex with a negative lookahead so the matched TLD must be a complete label.
  • Added regression tests to ensure code-like tokens (e.g. os.system, zeus.mtia.local) are not recognized as URLs, while bare ccTLD domains still are.
  • Documented the fix in the changelog (linked to #1498).

Reviewed changes

Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.

| File | Description |
| --- | --- |
| presidio-analyzer/tests/test_url_recognizer.py | Adds regression tests for #1498 and a positive control for a bare ccTLD domain. |
| presidio-analyzer/presidio_analyzer/predefined_recognizers/generic/url_recognizer.py | Updates the URL regex to require the TLD be a complete label (negative lookahead). |
| CHANGELOG.md | Documents the behavior change and references the fixing issue. |

('"https://microsoft.github.io/presidio/"', 1, ((0, 39),), 0.6),
("'https://microsoft.github.io/presidio/'", 1, ((0, 39),), 0.6),
# A genuine ccTLD as a complete label must still match (#1498 guard)
("example.sy", 1, ((0, 10),), 0.5,),
Comment on lines +45 to +46
("zeus.mtia.local", 0, (), 0,),
("return os.system, (cmd,)", 0, (), 0,),
Comment thread CHANGELOG.md Outdated
- Added `supported_entity` parameter to `PhoneRecognizer`. Previously, this recognizer hard-coded `["PHONE_NUMBER"]` as the only possible supported entity.

#### Fixed
- Fixed `UrlRecognizer` matching a TLD that is only the prefix of a longer label, producing false positives on `<word>.<word>` tokens common in code and filenames (e.g. `os.sy` in `os.system`, `zeus.mt` in `zeus.mtia`). A negative lookahead now requires the matched TLD to be a complete domain label. Genuine bare domains (e.g. `example.sy`) and URLs with paths are unaffected. Fixes [#1498](https://github.com/microsoft/presidio/issues/1498).
@Lawson-Darrow

Author

@microsoft-github-policy-service agree

- Drop the trailing commas inside the new test-case tuples to match the
  surrounding entries' style.
- Reword the CHANGELOG entry to describe the false positive as a
  `<word>.<tld>` substring matched inside a longer identifier.

Development

Successfully merging this pull request may close these issues.

UrlRecognizer detects many false positives when analyzing code snippets
