fix(py): make encode_with_unstable handle surrogates like encode (#541) #567

Open
MasterOfLogic1 wants to merge 1 commit into openai:main from MasterOfLogic1:fix/encode-with-unstable-surrogates
MasterOfLogic1 commented Jun 7, 2026

Summary

Fixes #541.

Encoding.encode and Encoding.encode_ordinary already catch the UnicodeEncodeError raised when a string containing surrogates reaches the Rust BPE layer, then retry after a UTF-16 round-trip (encode with errors="surrogatepass", decode with errors="replace"). Encoding.encode_with_unstable did not, so inputs that worked through encode raised UnicodeEncodeError there.

This PR mirrors the same try/except/repair pattern in encode_with_unstable so all three encode paths accept the same surrogate inputs.
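
For readers unfamiliar with the round-trip, a minimal standalone sketch of the repair step follows. The helper name `repair_surrogates` is hypothetical and not part of tiktoken's API; the sketch uses only the standard library and is meant to illustrate the behavior, not to reproduce the code in core.py.

```python
def repair_surrogates(text: str) -> str:
    """Round-trip through UTF-16 so surrogate code points become valid text.

    Hypothetical helper for illustration only; not part of tiktoken.
    """
    # "surrogatepass" lets paired or lone surrogates through the encoder
    # instead of raising UnicodeEncodeError.
    raw = text.encode("utf-16", "surrogatepass")
    # Decoding joins a valid high+low pair into a single code point;
    # "replace" turns anything undecodable into U+FFFD.
    return raw.decode("utf-16", "replace")


paired = repair_surrogates("\ud83d\udc4d")  # high + low surrogate
lone = repair_surrogates("\ud83d")          # high surrogate only
```

Under this sketch, the paired input should come back as the single thumbs-up code point and the lone surrogate as the replacement character, which matches the equalities asserted in the test plan.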

Repro (before)

import tiktoken

enc = tiktoken.get_encoding("cl100k_base")

enc.encode("\ud83d\udc4d")               # works
enc.encode_with_unstable("\ud83d\udc4d") # raised UnicodeEncodeError before this PR

enc.encode("\ud83d")                     # works
enc.encode_with_unstable("\ud83d")       # raised UnicodeEncodeError before this PR

Changes

Wrap self._core_bpe.encode_with_unstable(...) in the same surrogate repair logic used by encode() in tiktoken/core.py
Add test_encode_with_unstable_surrogate_pairs in tests/test_encoding.py
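
The try/except/repair pattern described above can be sketched in isolation as follows. This is an illustration under stated assumptions, not the diff itself: `call_with_surrogate_repair` is a hypothetical wrapper, and `strict_utf8` is a stand-in for the Rust call that simply rejects surrogates the way a strict UTF-8 boundary would.

```python
from typing import Callable, TypeVar

T = TypeVar("T")


def call_with_surrogate_repair(fn: Callable[[str], T], text: str) -> T:
    """Call fn(text); on UnicodeEncodeError, repair surrogates and retry once.

    Hypothetical wrapper for illustration; not part of tiktoken.
    """
    try:
        return fn(text)
    except UnicodeEncodeError:
        # Same round-trip the summary describes: pass surrogates through
        # UTF-16, then replace whatever still cannot be decoded.
        text = text.encode("utf-16", "surrogatepass").decode("utf-16", "replace")
        return fn(text)


def strict_utf8(text: str) -> bytes:
    # Stand-in for the Rust BPE call (assumption): strict UTF-8 encoding
    # raises UnicodeEncodeError on any surrogate code point.
    return text.encode("utf-8")


repaired_pair = call_with_surrogate_repair(strict_utf8, "\ud83d\udc4d")
repaired_lone = call_with_surrogate_repair(strict_utf8, "\ud83d")
```

The retry happens only on the failure path, so well-formed input pays no extra cost for the repair.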

Test Plan

import tiktoken


def test_encode_with_unstable_surrogate_pairs():
    enc = tiktoken.get_encoding("cl100k_base")

    # Both calls raised UnicodeEncodeError before the fix in core.py.
    enc.encode_with_unstable("\ud83d\udc4d")
    enc.encode_with_unstable("\ud83d")

    # A surrogate pair is repaired to the real code point; a lone
    # surrogate becomes U+FFFD (the replacement character).
    assert enc.encode("\ud83d\udc4d") == enc.encode("👍")
    assert enc.encode("\ud83d") == enc.encode("\ufffd")

Development

Successfully merging this pull request may close these issues.

encode_with_unstable does not handle surrogate pairs like encode
