HTML API: Serialize decoded carriage returns as character references #42

Open
sirreal wants to merge 7 commits into spec-compliant-getters from html-api-fuzz-fiz/decoded-cr

@sirreal sirreal commented Jun 10, 2026

Owner

Summary

Testing

  • Regression coverage for decoded CR handling across text, attributes, RCDATA, tables, and templates.
  • HTML API and html5lib PHPUnit groups and PHPCS pass.
  • codex review --base trunk.

Trac ticket: https://core.trac.wordpress.org/ticket/65372

Use of AI Tools

AI assistance: Yes
Tool(s): Codex
Model(s): GPT-5.5
Used for: PR description cleanup and code review.


This Pull Request is for code review only. Please keep all other discussion in the Trac ticket. Do not merge this Pull Request. See GitHub Pull Requests for Code Review in the Core Handbook for more details.

@sirreal sirreal marked this pull request as ready for review June 10, 2026 09:28
@github-actions

github-actions Bot commented Jun 10, 2026


The following accounts have interacted with this PR and/or linked issues. I will continue to update these lists as activity occurs. You can also manually ask me to refresh this list by adding the props-bot label.

Core Committers: Use this line as a base for the props when committing in SVN:

Props jonsurrell.

To understand the WordPress project's expectations around crediting contributors, please review the Contributor Attribution page in the Core Handbook.

@sirreal sirreal changed the base branch from trunk to html-api-fuzz-fiz/decoded-cr-base June 10, 2026 09:55
@sirreal

This comment was marked as outdated.

sirreal added a commit that referenced this pull request Jun 10, 2026
…ation

# Conflicts:
#	src/wp-includes/html-api/class-wp-html-processor.php
sirreal added 5 commits June 11, 2026 19:22
Red TDD step: decoded carriage returns in text and attribute values
must serialize as a character reference such as &#xD; so that normalized output is idempotent: a
raw CR in serialized output would be normalized to a line feed when
parsed again. The raw-CR attribute and class-update cases pass already
through the preprocessing-correct getters and pin that behavior.

See #65372.
The serializer emitted decoded carriage returns raw into text and
attribute values, where input preprocessing turns them into line feeds
on the next parse: normalized output never reached a fixed point for
documents containing a carriage-return character reference such as &#xD;. Escaping CR after htmlspecialchars() keeps
the character through parse/serialize round trips. Attribute values
read through get_attribute(), whose input preprocessing guarantees raw
source carriage returns already arrive normalized to line feeds, so
only genuinely decoded CRs are escaped.

See #65372.
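
The fixed-point argument in this commit message can be sketched in a few lines. This is a minimal model in Python, not the PHP implementation: `html.escape()` stands in for htmlspecialchars(), and the helper names `preprocess_input`, `parse_text`, `serialize_text_naive`, and `serialize_text_escaping_cr` are hypothetical.

```python
import html


def preprocess_input(source: str) -> str:
    """Model of HTML input preprocessing: CRLF and lone CR become LF."""
    return source.replace("\r\n", "\n").replace("\r", "\n")


def parse_text(source: str) -> str:
    """Hypothetical parse step: preprocess the raw source, then decode references."""
    return html.unescape(preprocess_input(source))


def serialize_text_naive(decoded: str) -> str:
    """Escapes markup characters only, so a decoded CR is emitted raw."""
    return html.escape(decoded, quote=True)


def serialize_text_escaping_cr(decoded: str) -> str:
    """Escapes markup characters, then turns any remaining CR into a reference."""
    return html.escape(decoded, quote=True).replace("\r", "&#xD;")


decoded = parse_text("a&#xD;b")  # the reference decodes to a real CR

# Naive output contains a raw CR, which the next parse turns into LF.
reparsed_naive = parse_text(serialize_text_naive(decoded))

# Escaped output keeps the CR through the round trip.
reparsed_escaped = parse_text(serialize_text_escaping_cr(decoded))
```

In this model, `reparsed_naive` differs from `decoded` while `reparsed_escaped` equals it, which is the idempotence property the commit describes.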
An attribute value set through set_attribute() may contain NULL bytes;
serializing them as U+FFFD keeps normalized output idempotent, where
browsers' innerHTML emits the raw byte and loses it to replacement on
the next parse. This pins the behavior ahead of consolidating the
serializer's NULL handling.

See #65372.
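
The NULL-byte case follows the same pattern. The sketch below is an illustrative Python model under stated assumptions: the function names are hypothetical, `html.escape()` again stands in for htmlspecialchars(), and the parse step only models the replacement of NULL with U+FFFD in attribute values.

```python
import html


def serialize_attribute_value(decoded: str) -> str:
    """Escape markup, replace NULL with U+FFFD, and escape decoded CRs."""
    escaped = html.escape(decoded, quote=True)
    return escaped.replace("\x00", "\ufffd").replace("\r", "&#xD;")


def parse_attribute_value(source: str) -> str:
    """Hypothetical parse step: normalize newlines, replace NULL, decode references."""
    preprocessed = source.replace("\r\n", "\n").replace("\r", "\n")
    return html.unescape(preprocessed.replace("\x00", "\ufffd"))


value = "a\x00b"  # a value as it might be supplied through an API call

first = serialize_attribute_value(value)
second = serialize_attribute_value(parse_attribute_value(first))
```

Because the replacement happens at serialization time, `first` and `second` are identical in this model; emitting the raw byte instead would only reach that form after a second pass.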
The getters now expose tag and attribute names with NULL bytes already
replaced by U+FFFD, leaving the serializer's name scrubbing dead, and
the only live input to the per-attribute whole-buffer scrub was an
API-supplied attribute value. That replacement moves into
serialize_decoded_text() next to the carriage-return escaping, which
exists for the same reason: emitting bytes the next parse would
transform. UTF-8 scrubbing of qualified names remains, as invalid
sequences can still reach serialization through source names.

See #65372.
From adversarial review: pins that SCRIPT and STYLE contents serialize
without escaping, where character references do not decode, and that
serialize_token() output for modified class and NULL-containing
attribute values parses back to the same decoded values.

See #65372.
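
The SCRIPT and STYLE distinction pinned by this commit can be illustrated with a small Python model. It is a sketch only: `RAW_TEXT_ELEMENTS` and `serialize_element_text` are hypothetical names, and the real serializer handles many more cases.

```python
import html

# Elements whose contents are not scanned for character references.
RAW_TEXT_ELEMENTS = {"script", "style"}


def serialize_element_text(tag_name: str, decoded: str) -> str:
    """Emit raw-text element contents verbatim; escape everything else."""
    if tag_name.lower() in RAW_TEXT_ELEMENTS:
        # A reference written here would never decode on the next parse,
        # so escaping would change the content instead of preserving it.
        return decoded
    return html.escape(decoded, quote=True).replace("\r", "&#xD;")


script_output = serialize_element_text("SCRIPT", "if (a && b < c) {}")
paragraph_output = serialize_element_text("p", "a && b < c")
```

Here `script_output` is the input unchanged, while `paragraph_output` has `&` and `<` escaped.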
@sirreal sirreal force-pushed the html-api-fuzz-fiz/decoded-cr branch from c86078c to 3a74497 Compare June 11, 2026 19:04
@sirreal sirreal changed the title from "HTML API: Preserve decoded carriage returns in serialization" to "HTML API: Serialize decoded carriage returns as character references" Jun 11, 2026
@sirreal sirreal changed the base branch from html-api-fuzz-fiz/decoded-cr-base to trunk June 11, 2026 19:07
sirreal added a commit that referenced this pull request Jun 11, 2026
# Conflicts:
#	src/wp-includes/html-api/class-wp-html-processor.php
#	tests/phpunit/tests/html-api/wpHtmlProcessor-serialize.php
@sirreal sirreal added this to the HTML API confirmed fuzz PRs milestone Jun 17, 2026
@sirreal

sirreal commented Jun 17, 2026

Owner Author

This fixes a serialization issue and depends on #53.

<div attr="&#x0D;&#x0A;">&#x0D;&#x0A;

is normalized to the following (␍␊ represents raw CRLF bytes):

<div attr="␍␊">␍␊</div>

This is then parsed as follows (the CRLF surviving in the attribute value is also surprising):

<div attr="␍␊">␊</div>

@sirreal sirreal changed the base branch from trunk to spec-compliant-getters July 1, 2026 18:51
