fix(analyzer): prevent UrlRecognizer matching partial TLD labels (#1498)#2044
Open
Lawson-Darrow wants to merge 2 commits into
Open
fix(analyzer): prevent UrlRecognizer matching partial TLD labels (#1498)#2044Lawson-Darrow wants to merge 2 commits into
Lawson-Darrow wants to merge 2 commits into
Conversation
The URL regex matched a ccTLD even when it was only the prefix of a longer label, producing false positives on `<word>.<word>` tokens common in code and filenames (e.g. `os.sy` in `os.system`, `zeus.mt` in `zeus.mtia`). A negative lookahead now requires the matched TLD to be a complete domain label. Genuine bare domains (e.g. `example.sy`) and URLs with paths are unaffected. Adds regression tests for the reported false positives plus a guard ensuring real ccTLD domains still match. Fixes microsoft#1498 Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Contributor
There was a problem hiding this comment.
Pull request overview
Note
Copilot was unable to run its full agentic suite in this review.
This PR fixes UrlRecognizer false positives where a TLD match was accepted even when it was only the prefix of a longer label (e.g., matching os.sy inside os.system), and adds regression tests and a changelog entry for the fix.
Changes:
- Tightened
UrlRecognizer’s base URL regex with a negative lookahead so the matched TLD must be a complete label. - Added regression tests to ensure code-like tokens (e.g.
os.system,zeus.mtia.local) are not recognized as URLs, while bare ccTLD domains still are. - Documented the fix in the changelog (linked to #1498).
Reviewed changes
Copilot reviewed 3 out of 3 changed files in this pull request and generated 3 comments.
| File | Description |
|---|---|
| presidio-analyzer/tests/test_url_recognizer.py | Adds regression tests for #1498 and a positive control for a bare ccTLD domain. |
| presidio-analyzer/presidio_analyzer/predefined_recognizers/generic/url_recognizer.py | Updates the URL regex to require the TLD be a complete label (negative lookahead). |
| CHANGELOG.md | Documents the behavior change and references the fixing issue. |
| ('"https://microsoft.github.io/presidio/"', 1, ((0, 39),), 0.6), | ||
| ("'https://microsoft.github.io/presidio/'", 1, ((0, 39),), 0.6), | ||
| # A genuine ccTLD as a complete label must still match (#1498 guard) | ||
| ("example.sy", 1, ((0, 10),), 0.5,), |
Comment on lines
+45
to
+46
| ("zeus.mtia.local", 0, (), 0,), | ||
| ("return os.system, (cmd,)", 0, (), 0,), |
| - Added `supported_entity` parameter to `PhoneRecognizer`. Previously, this recognizer hard-coded `["PHONE_NUMBER"]` as the only possible supported entity. | ||
|
|
||
| #### Fixed | ||
| - Fixed `UrlRecognizer` matching a TLD that is only the prefix of a longer label, producing false positives on `<word>.<word>` tokens common in code and filenames (e.g. `os.sy` in `os.system`, `zeus.mt` in `zeus.mtia`). A negative lookahead now requires the matched TLD to be a complete domain label. Genuine bare domains (e.g. `example.sy`) and URLs with paths are unaffected. Fixes [#1498](https://github.com/microsoft/presidio/issues/1498). |
Author
|
@microsoft-github-policy-service agree |
- Drop the trailing commas inside the new test-case tuples to match the surrounding entries' style. - Reword the CHANGELOG entry to describe the false positive as a `<word>.<tld>` substring matched inside a longer identifier.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #1498
Problem
UrlRecognizer.BASE_URL_REGEXmatches a TLD even when it is only the prefix of a longer label. The TLD alternation has no trailing boundary, so schema-less<word>.<word>tokens — common in source code, module paths, and filenames — produce false positives:os.systemos.sy(.sy= Syria)zeus.mtia.localzeus.mt(.mt= Malta)Fix
Add a negative lookahead
(?![a-z0-9-])immediately after the TLD alternation, requiring the matched TLD to be a complete domain label (i.e. not immediately followed by another label character).example.sy,microsoft.com,google.co.il) still match./, which the lookahead permits.Tests
Added regression cases to
test_url_recognizer.py:os.system,zeus.mtia.local, and a code-snippet line) now yield zero matches.example.sy) ensures a real ccTLD used as a complete label still matches.All cases in
test_url_recognizer.pypass;ruff checkreports no new findings on the changed files. CHANGELOG updated under Analyzer → Fixed.Note on scope
Tokens that are syntactically valid bare domains remain ambiguous by design — e.g.
rpc.py(a Python file) is indistinguishable from the Paraguay domainrpc.pywithout surrounding context, so it intentionally still matches. This PR removes only the unambiguous partial-label matches.🤖 Generated with Claude Code