Skip to content

fix(safaa): repair broken email regex via verbose flag#50

Open
Valyrian-Code wants to merge 1 commit into
fossology:mainfrom
Valyrian-Code:fix/email-regex
Open

fix(safaa): repair broken email regex via verbose flag#50
Valyrian-Code wants to merge 1 commit into
fossology:mainfrom
Valyrian-Code:fix/email-regex

Conversation

@Valyrian-Code

@Valyrian-Code Valyrian-Code commented May 23, 2026

Copy link
Copy Markdown

Description

The email-replacement regex in _perform_text_substitutions was written as a multi-line triple-quoted raw string with indentation for readability. Because the pattern was not compiled in verbose mode, Python preserved the embedded whitespace as literal characters, causing the regex to expect newlines and spaces inside email addresses. As a result, valid emails never matched and were never replaced with the EMAIL token.

This PR prepends the inline verbose flag (?x) to the pattern so formatting whitespace is ignored while preserving whitespace inside character classes. This restores the original intended behavior without restructuring the regex.

Changes

Known limitation

The \d{4} year substitution currently runs before email normalization. Emails containing 4+ consecutive digits (for example author5565@example.com) therefore become partially transformed before the email regex executes and no longer match.

This behavior is documented in:

test_email_with_four_plus_digits_known_limitation

A follow-up PR can address this by reordering substitutions.

How to test

poetry install
poetry run pytest tests/test_safaa.py -v

Manual reproduction

from safaa.Safaa import SafaaAgent

agent = SafaaAgent()

print(agent._perform_text_substitutions(["contact john@example.com"]))

Before:

['contact john example com']

After:

['contact EMAIL']

This closes #49.

Note: this PR shares the tests/ scaffolding introduced in #48. Depending on merge order, a small rebase may be needed for tests/__init__.py and the test module header.

The email regex in _perform_text_substitutions has never matched any
real email address. The pattern was written as a triple-quoted raw
string and formatted with indentation for readability, but Python
includes that indentation (newlines and leading spaces) as literal
characters in the regex. The engine therefore tried to match
`(?:[...]+...|"\n                (?:[...])` and so on, which no real
input can satisfy.

Add the (?x) inline verbose flag at the start of the pattern. In
verbose mode, unescaped whitespace in the pattern is ignored by the
regex engine, restoring the author's original intent without
restructuring the expression.

With the fix:

  >>> import re
  >>> re.sub(pattern, " EMAIL ", "contact john@example.com")
  'contact  EMAIL '

Add pytest-based tests covering simple addresses, dotted local parts,
plus tags, subdomains, hyphenated domains, multiple emails per string,
no-email strings, stray @ symbols, and a documented known limitation
where the \\d{4} year rule corrupts emails containing 4+ consecutive
digits before this regex runs.

Signed-off-by: RAJVEER42 <irajveer.bishnoi2310@gmail.com>
Copilot AI review requested due to automatic review settings May 23, 2026 11:01
@Valyrian-Code

Valyrian-Code commented May 23, 2026

Copy link
Copy Markdown
Author

Hi @GMishx & @Kaushl2208

Hi! I opened a small PR with bug fixe and regression tests. I’d appreciate a review whenever you have time thanks for maintaining the project!

GIF

Copilot AI left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Copilot encountered an error and was unable to review this pull request. You can try again by re-requesting a review.

@GMishx

GMishx commented May 23, 2026

Copy link
Copy Markdown
Member

@Valyrian-Code I'd really appreciate to not spam with tags at each step. You have just opened the PRs. Someone will come and have a look. Have some patience. Spamming @Valyrian-Code will not help.

@Valyrian-Code

Copy link
Copy Markdown
Author

@GMishx
Understood, my bad. I didn’t mean to spam or rush anyone . I’ll wait patiently for the review. Thanks for pointing it out.

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

Email regex in _perform_text_substitutions has never matched any email address

3 participants