fix(safaa): repair broken email regex via verbose flag #50
Open
Valyrian-Code wants to merge 1 commit into
The email regex in _perform_text_substitutions has never matched any
real email address. The pattern was written as a triple-quoted raw
string and formatted with indentation for readability, but Python
includes that indentation (newlines and leading spaces) as literal
characters in the regex. The engine therefore tried to match
`(?:[...]+...|"\n (?:[...])` and so on, which no real
input can satisfy.
Add the (?x) inline verbose flag at the start of the pattern. In
verbose mode, unescaped whitespace in the pattern is ignored by the
regex engine, restoring the author's original intent without
restructuring the expression.
With the fix:
>>> import re
>>> re.sub(pattern, " EMAIL ", "contact john@example.com")
'contact EMAIL '
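The failure mode and the fix can be reproduced in isolation. The sketch below uses a deliberately simplified stand-in pattern (an assumption for illustration, not the actual regex from Safaa.py) laid out across indented lines the same way. It also checks that verbose mode keeps whitespace inside a character class.

```python
import re

# Simplified stand-in pattern (hypothetical, NOT the regex from Safaa.py),
# written across indented lines like the original.
INDENTED_PATTERN = r"""
    [A-Za-z0-9._+-]+
    @
    [A-Za-z0-9-]+
    (?:\.[A-Za-z0-9-]+)+
"""

# Without verbose mode the newlines and leading spaces are literal characters,
# so the pattern only matches input that contains that exact whitespace.
BROKEN = INDENTED_PATTERN

# Prepending (?x) makes the engine ignore the unescaped layout whitespace.
FIXED = "(?x)" + INDENTED_PATTERN

text = "contact john@example.com"
broken_result = re.sub(BROKEN, " EMAIL ", text)
fixed_result = re.sub(FIXED, " EMAIL ", text)

print(broken_result == text)   # the broken pattern leaves the text untouched
print(fixed_result.split())    # tokens after the substitution

# Whitespace inside a character class is still significant in verbose mode:
# the class [a b] matches "a", " ", and "b".
class_match = re.fullmatch(r"(?x) [a b]+ ", "a b")
print(class_match is not None)
```

The results are compared as token lists because the replacement string itself carries surrounding spaces.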
Add pytest-based tests covering simple addresses, dotted local parts,
plus tags, subdomains, hyphenated domains, multiple emails per string,
no-email strings, stray @ symbols, and a documented known limitation
where the \d{4} year rule corrupts emails containing 4+ consecutive
digits before this regex runs.
Signed-off-by: RAJVEER42 <irajveer.bishnoi2310@gmail.com>
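The test categories listed in the commit message might look roughly like the sketch below. It is not the contents of tests/test_safaa.py: `replace_emails` and the simplified `EMAIL_RE` are hypothetical stand-ins for `_perform_text_substitutions` and the real pattern. The functions use plain `assert` statements, so pytest can collect them without any extra imports.

```python
import re

# Hypothetical simplified pattern and helper, used only for this sketch.
EMAIL_RE = re.compile(r"""(?x)
    [A-Za-z0-9._+-]+        # local part
    @
    [A-Za-z0-9-]+           # first domain label
    (?:\.[A-Za-z0-9-]+)+    # one or more further labels
""")


def replace_emails(text: str) -> str:
    """Replace every email-like token with the EMAIL placeholder."""
    return EMAIL_RE.sub(" EMAIL ", text)


def test_single_addresses_are_replaced():
    addresses = [
        "john@example.com",          # simple address
        "john.doe@example.com",      # dotted local part
        "john+tag@example.com",      # plus tag
        "john@mail.example.co.uk",   # subdomains
        "john@my-host.example.com",  # hyphenated domain
    ]
    for addr in addresses:
        assert replace_emails(f"contact {addr}").split() == ["contact", "EMAIL"]


def test_multiple_emails_in_one_string():
    result = replace_emails("a@x.org and b@y.org")
    assert result.split() == ["EMAIL", "and", "EMAIL"]


def test_text_without_email_is_unchanged():
    assert replace_emails("no address here") == "no address here"


def test_stray_at_symbols_are_unchanged():
    for text in ["meet @ noon", "user@", "@handle"]:
        assert replace_emails(text) == text
```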
Author
Hi @GMishx & @Kaushl2208! I opened a small PR with a bug fix and regression tests. I'd appreciate a review whenever you have time. Thanks for maintaining the project!
Member
@Valyrian-Code I'd really appreciate it if you did not spam with tags at each step. You have just opened the PRs. Someone will come and have a look. Have some patience. Spamming @Valyrian-Code will not help.
Author
|
@GMishx

Description
The email-replacement regex in `_perform_text_substitutions` was written as a multi-line triple-quoted raw string with indentation for readability. Because the pattern was not compiled in verbose mode, Python preserved the embedded whitespace as literal characters, causing the regex to expect newlines and spaces inside email addresses. As a result, valid emails never matched and were never replaced with the `EMAIL` token.
This PR prepends the inline verbose flag `(?x)` to the pattern so formatting whitespace is ignored while preserving whitespace inside character classes. This restores the original intended behavior without restructuring the regex.
Changes
[Safaa.py#L213](https://github.com/fossology/safaa/blob/main/Safaa/src/safaa/Safaa.py?utm_source=chatgpt.com#L213)
Add `(?x)` to the email regex and document why verbose mode is required
tests/__init__.py, tests/test_safaa.py
Add pytest coverage for: simple addresses, dotted local parts, plus tags, subdomains, hyphenated domains, multiple emails per string, no-email strings, and stray @ symbols
[pyproject.toml](https://github.com/fossology/safaa/blob/main/pyproject.toml?utm_source=chatgpt.com)
Add `pytest` to dev dependencies
Known limitation
The `\d{4}` year substitution currently runs before email normalization. Emails containing 4+ consecutive digits (for example `author5565@example.com`) are therefore partially transformed before the email regex executes and no longer match.
This behavior is documented in: `test_email_with_four_plus_digits_known_limitation`
A follow-up PR can address this by reordering substitutions.
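The ordering problem can be illustrated with a small sketch. The ` YEAR ` replacement token, the simplified email pattern, and both helper functions are assumptions made for this example; the source only states that a `\d{4}` rule runs before the email regex.

```python
import re

# Hypothetical stand-ins: the real rules live in _perform_text_substitutions.
YEAR_RE = re.compile(r"\d{4}")
EMAIL_RE = re.compile(r"""(?x)
    [A-Za-z0-9._+-]+
    @
    [A-Za-z0-9-]+
    (?:\.[A-Za-z0-9-]+)+
""")


def year_then_email(text: str) -> str:
    """Current order: the digit rule rewrites part of the address first."""
    text = YEAR_RE.sub(" YEAR ", text)    # " YEAR " is an assumed token
    return EMAIL_RE.sub(" EMAIL ", text)


def email_then_year(text: str) -> str:
    """Reordered: addresses are replaced before the digit rule runs."""
    text = EMAIL_RE.sub(" EMAIL ", text)
    return YEAR_RE.sub(" YEAR ", text)


sample = "author5565@example.com"
print(year_then_email(sample).split())   # address is broken apart, no EMAIL token
print(email_then_year(sample).split())   # address is replaced as a whole
```

With the email rule first, a genuine year elsewhere in the string is still handled by the digit rule afterwards, so reordering should not lose the year substitution in this simplified setting.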
How to test
Manual reproduction
Before:
['contact john example com']
After:
['contact EMAIL']
This closes #49.