HTML API: Serialize decoded carriage returns as character references #42
Open
sirreal wants to merge 7 commits into
sirreal added a commit that referenced this pull request Jun 10, 2026
…ation # Conflicts: # src/wp-includes/html-api/class-wp-html-processor.php
Red TDD step: decoded carriage returns in text and attribute values must serialize as character references so that normalized output is idempotent. A raw CR in serialized output would be normalized to a line feed when parsed again. The raw-CR attribute and class-update cases already pass, thanks to the preprocessing-correct getters, and pin that behavior. See #65372.
The serializer emitted decoded carriage returns raw into text and attribute values, where input preprocessing turns them into line feeds on the next parse. As a result, normalized output never reached a fixed point for documents containing decoded carriage returns. Escaping CR after htmlspecialchars() keeps the character through parse/serialize round trips. Attribute values are read through get_attribute(), whose input preprocessing guarantees that raw source carriage returns already arrive normalized to line feeds, so only genuinely decoded CRs are escaped. See #65372.
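The round-trip failure this commit message describes can be sketched outside of WordPress. The following is a minimal Python model, not the HTML API's implementation: `preprocess`, `parse_text`, `serialize_raw`, and `serialize_escaped` are hypothetical names, `html.escape` stands in for htmlspecialchars(), and `&#13;` is an assumed spelling of the carriage-return reference.

```python
import html

CR_REFERENCE = "&#13;"  # assumed spelling; the PR may emit a different form


def preprocess(source: str) -> str:
    # HTML input preprocessing: CRLF pairs and lone CRs become LF.
    return source.replace("\r\n", "\n").replace("\r", "\n")


def parse_text(source: str) -> str:
    # Toy parser for a text-only fragment: preprocess, then decode references.
    return html.unescape(preprocess(source))


def serialize_raw(text: str) -> str:
    # Old behavior in this model: escape syntax characters, emit CR as a raw byte.
    return html.escape(text, quote=True)


def serialize_escaped(text: str) -> str:
    # New behavior in this model: replace CR only after the usual escaping step.
    return html.escape(text, quote=True).replace("\r", CR_REFERENCE)


decoded = parse_text("a&#13;b")                   # the reference decodes to a real CR
lossy = parse_text(serialize_raw(decoded))        # raw CR is preprocessed into LF
stable = parse_text(serialize_escaped(decoded))   # the CR survives the round trip
```

The order matters in this sketch: replacing CR before the `html.escape` call would turn the reference's `&` into `&amp;`, which is presumably why the commit escapes CR after htmlspecialchars().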
An attribute value set through set_attribute() may contain NULL bytes; serializing them as U+FFFD keeps normalized output idempotent, where browsers' innerHTML emits the raw byte and loses it to replacement on the next parse. This pins the behavior ahead of consolidating the serializer's NULL handling. See #65372.
The getters now expose tag and attribute names with NULL bytes already replaced by U+FFFD, leaving the serializer's name scrubbing dead, and the only live input to the per-attribute whole-buffer scrub was an API-supplied attribute value. That replacement moves into serialize_decoded_text() next to the carriage-return escaping, which exists for the same reason: emitting bytes the next parse would transform. UTF-8 scrubbing of qualified names remains, as invalid sequences can still reach serialization through source names. See #65372.
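A rough model of the consolidated handling may help: NULL replacement and carriage-return escaping sit in one function because both prevent output that the next parse would transform. This is a Python sketch under assumptions, not the PHP code in the PR: `serialize_decoded_text_sketch` and `parse_attribute_value` are hypothetical names, `html.escape` stands in for htmlspecialchars(), and `&#13;` is an assumed reference form.

```python
import html

REPLACEMENT = "\ufffd"  # U+FFFD REPLACEMENT CHARACTER


def parse_attribute_value(raw: str) -> str:
    # Toy model of what a parser does to a quoted attribute value:
    # newline normalization, NULL replaced by U+FFFD, references decoded.
    normalized = raw.replace("\r\n", "\n").replace("\r", "\n")
    return html.unescape(normalized.replace("\x00", REPLACEMENT))


def serialize_decoded_text_sketch(text: str) -> str:
    # Replace NULL first, then escape syntax characters, then write CR
    # as a character reference (assumed form) so it survives reparsing.
    scrubbed = text.replace("\x00", REPLACEMENT)
    return html.escape(scrubbed, quote=True).replace("\r", "&#13;")


api_value = "a\x00b\rc"  # a value as it might be supplied through an API call
first = serialize_decoded_text_sketch(api_value)
second = serialize_decoded_text_sketch(parse_attribute_value(first))
```

In this model `second` equals `first`, which is the idempotence property the commit messages aim for.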
From adversarial review: pins that SCRIPT and STYLE contents serialize without escaping, where character references do not decode, and that serialize_token() output for modified class and NULL-containing attribute values parses back to the same decoded values. See #65372.
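The SCRIPT and STYLE case can be illustrated with a small sketch. It assumes a hypothetical `serialize_text` helper and uses Python's `html.escape` in place of the PHP escaping. It shows only the branching on the parent element, not the PR's actual serializer.

```python
import html

# Elements whose text content is emitted verbatim in this sketch, because
# character references inside them are not decoded by the parser.
RAW_TEXT_ELEMENTS = {"script", "style"}


def serialize_text(parent_tag: str, text: str) -> str:
    # Emit text unescaped inside SCRIPT/STYLE; escape it everywhere else.
    if parent_tag.lower() in RAW_TEXT_ELEMENTS:
        return text
    return html.escape(text, quote=False)


source = "if (a < b && c) {}"
in_script = serialize_text("script", source)
in_paragraph = serialize_text("p", source)
```

Escaping inside SCRIPT would change the program text, since `&lt;` there would stay as the literal five characters after reparsing.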
c86078c to 3a74497
sirreal added a commit that referenced this pull request Jun 11, 2026
# Conflicts: # src/wp-includes/html-api/class-wp-html-processor.php # tests/phpunit/tests/html-api/wpHtmlProcessor-serialize.php
This fixes a serialization issue and depends on #53. The input
<div attr="
">
is normalized to the following (␍␊ representing CRLF bytes):
<div attr="␍␊">␍␊</div>
This is then parsed as (that CRLF in the attribute value is also surprising):
Summary
Serialize decoded carriage returns as character references so normalization reaches a fixed point.
Testing
codex review --base trunk.
Trac ticket: https://core.trac.wordpress.org/ticket/65372
Use of AI Tools
AI assistance: Yes
Tool(s): Codex
Model(s): GPT-5.5
Used for: PR description cleanup and code review.
This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.