fix: use StringDecoder to handle UTF-8 chunk boundaries in setEncoding #5035

Open
398651434 wants to merge 1 commit into nodejs:main from 398651434:main

@398651434

Description

Fixes a bug where response.body.setEncoding('utf8') corrupts multi-byte UTF-8 characters that span chunk boundaries.

Root Cause

Each chunk was converted to a string on its own via buffer.utf8Slice() (or toString()). When a multi-byte UTF-8 character (e.g., a typical Chinese character, which takes 3 bytes) is split across two HTTP response chunks, the incomplete byte sequence at the end of the first chunk is decoded as a replacement character (U+FFFD), and the leftover bytes at the start of the second chunk are decoded as another, separate replacement character.
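
The failure mode can be reproduced without any HTTP involved. The snippet below is a minimal sketch, not code from this PR: it splits the 3 bytes of one character by hand (the split point is an assumption chosen for illustration) and compares per-chunk decoding with decoding the concatenated bytes.

```javascript
// One character, U+4E2D, encoded in UTF-8 as three bytes: e4 b8 ad
const text = '中'
const encoded = Buffer.from(text, 'utf8')

// Simulate a chunk boundary that falls inside the character.
const firstChunk = encoded.subarray(0, 2)
const secondChunk = encoded.subarray(2)

// Decoding each chunk on its own loses the character.
const perChunk = firstChunk.toString('utf8') + secondChunk.toString('utf8')

// Decoding the joined bytes keeps it intact.
const whole = Buffer.concat([firstChunk, secondChunk]).toString('utf8')

console.log(perChunk === text) // false
console.log(perChunk.includes('\uFFFD')) // true
console.log(whole === text) // true
```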

Fix

Use Node.js's built-in StringDecoder (from node:string_decoder), which buffers incomplete byte sequences between write() calls and emits a character only once all of its bytes have arrived:

  1. setEncoding(encoding): Initialize a StringDecoder when encoding is set
  2. consumePush: When a decoder exists, use decoder.write(chunk) instead of storing the raw buffer — this accumulates incomplete UTF-8 bytes internally
  3. consumeFinish: Reset the decoder to allow garbage collection
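
The three steps can be sketched roughly as follows. This is not the actual change to lib/api/readable.js: ChunkCollector and its push/finish methods are hypothetical stand-ins for the consume logic, and the sketch assumes setEncoding is called before any chunk arrives. It also calls decoder.end() before dropping the decoder, which flushes any bytes still buffered; whether the real patch does that is not stated here.

```javascript
const { StringDecoder } = require('node:string_decoder')

// Hypothetical stand-in for the body's consume state.
class ChunkCollector {
  constructor () {
    this.decoder = null
    this.parts = []
  }

  // Step 1: create a decoder when an encoding is set.
  setEncoding (encoding) {
    this.decoder = new StringDecoder(encoding)
  }

  // Step 2: decode through the decoder so that trailing partial
  // characters are held back until the next chunk completes them.
  push (chunk) {
    this.parts.push(this.decoder ? this.decoder.write(chunk) : chunk)
  }

  // Step 3: flush any remaining bytes, then release the decoder.
  finish () {
    if (this.decoder === null) {
      return Buffer.concat(this.parts)
    }
    const tail = this.decoder.end()
    this.decoder = null
    return this.parts.join('') + tail
  }
}

const collector = new ChunkCollector()
collector.setEncoding('utf8')
const bytes = Buffer.from('中', 'utf8')
collector.push(bytes.subarray(0, 2)) // incomplete: nothing emitted yet
collector.push(bytes.subarray(2)) // completes the character
console.log(collector.finish() === '中') // true
```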

Testing

The bug manifests when:

  • HTTP response contains multi-byte UTF-8 text (e.g., Chinese characters, emoji)
  • setEncoding('utf8') is called on the body
  • A multi-byte character is split across two TCP packets/chunks

After the fix, characters are correctly reassembled across chunk boundaries.


Closes #5002

When setEncoding('utf8') was called, each chunk was converted to
a string individually, which corrupted multi-byte UTF-8 characters that
span chunk boundaries.

This fix:
- Initializes a StringDecoder when setEncoding is called
- Uses StringDecoder.write() in consumePush to properly handle
  incomplete UTF-8 sequences at chunk boundaries
- Resets the decoder in consumeFinish to allow garbage collection

Closes nodejs#5002
Member

@metcoder95 metcoder95 left a comment

Can you add a regression test for it?

@codecov-commenter

Codecov Report

❌ Patch coverage is 58.33333% with 5 lines in your changes missing coverage. Please review.
✅ Project coverage is 93.03%. Comparing base (bc0a19c) to head (f788349).
⚠️ Report is 31 commits behind head on main.

| Files with missing lines | Patch % | Lines |
| lib/api/readable.js | 58.33% | 5 Missing ⚠️ |
Additional details and impacted files
@@            Coverage Diff             @@
##             main    #5035      +/-   ##
==========================================
- Coverage   93.03%   93.03%   -0.01%     
==========================================
  Files         110      110              
  Lines       35793    35803      +10     
==========================================
+ Hits        33301    33309       +8     
- Misses       2492     2494       +2     


Development

Successfully merging this pull request may close these issues.

setEncoding('utf8') on response body corrupts multi-byte UTF-8 characters at chunk boundaries

3 participants