fix: reject encoders missing single-byte tokens to prevent reachable panic (#568) #570
Open
yen0304 wants to merge 1 commit into
…panic
A CoreBPE built from an untrusted or malformed encoder that lacks tokens for some single byte values caused a reachable panic (denial of service) during tokenization: byte_pair_encode reduces a piece to its constituent tokens and indexes ranks[...] for any leftover single byte, which panics when that byte has no token.
Validate at construction (new_internal) that the encoder contains a token for every byte 0..=255, returning a clear error instead. This surfaces as a ValueError from Python rather than a process-killing panic later in encode().
All standard tiktoken encoders already include every single-byte token, so no valid encoder is affected.
Summary
Fixes #568 — a reachable Rust panic (denial of service) when tokenizing with an untrusted/malformed encoder that is missing single-byte tokens.
Root cause.
byte_pair_encode reduces a piece down to the tokens it is made of. Any segment that does not get merged is a single byte, looked up directly via ranks[&piece[..]]. BPE only ever merges pairs that exist in the encoder, so the only lookups that can fail are single bytes that have no token. When such a byte is encountered, the indexing panics.
With an untrusted or malformed encoder (e.g. a custom mergeable_ranks passed to tiktoken.Encoding), tokenizing ordinary text reaches this panic and kills the process: a denial of service rather than a catchable error.
I confirmed the panic empirically with a #[should_panic] reproduction calling byte_pair_encode(b"xyz", &ranks) against an encoder lacking single-byte tokens.
The fix
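The motivation for the fix can be seen in a small sketch: indexing a HashMap panics on a missing key, whereas get returns an Option. The Ranks alias and both helper functions below are hypothetical stand-ins for illustration, not tiktoken's actual code; only the ranks[&piece[..]] indexing pattern comes from the description above.

```rust
use std::collections::HashMap;

// Hypothetical stand-in for the encoder's rank table: token bytes -> rank.
type Ranks = HashMap<Vec<u8>, u32>;

/// Looks up a single byte by indexing. HashMap's Index impl panics when
/// the key is absent, which is the failure mode described in this PR.
fn rank_by_index(ranks: &Ranks, byte: u8) -> u32 {
    ranks[&[byte][..]]
}

/// Non-panicking probe: `get` returns None when the byte has no token.
fn rank_checked(ranks: &Ranks, byte: u8) -> Option<u32> {
    ranks.get(&[byte][..]).copied()
}
```

With a table that only contains the multi-byte token b"xyz", rank_checked(&ranks, b'x') yields None, while rank_by_index(&ranks, b'x') panics.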
Validate at construction (new_internal, the single chokepoint behind CoreBPE::new and the Python CoreBPE(...) constructor) that the encoder contains a token for every byte 0..=255. If any are missing, return a clear error instead of allowing a later panic. From Python this surfaces as a ValueError at load time.
A complete byte-level BPE encoder must define a token for every byte for tokenization to be total, and all standard tiktoken encoders (gpt2, cl100k_base, o200k_base, …) already include every single-byte token, so no valid encoder is affected, only malformed ones that would otherwise panic.
Changes
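For readers who want the shape of the check without opening the diff, the construction-time validation described above might look roughly like the following. This is a sketch under assumptions: missing_single_bytes and validate_encoder are hypothetical names, the encoder is modelled as a plain HashMap<Vec<u8>, u32>, and the error is a String rather than whatever error type new_internal actually returns.

```rust
use std::collections::HashMap;

/// Hypothetical helper (not the PR's actual code): returns the byte
/// values that have no single-byte token in `encoder`.
fn missing_single_bytes(encoder: &HashMap<Vec<u8>, u32>) -> Vec<u8> {
    (0..=255u8)
        .filter(|b| !encoder.contains_key(&[*b][..]))
        .collect()
}

/// Hypothetical construction-time check: Ok if all 256 single-byte
/// tokens exist, otherwise an error with a count and a few examples.
fn validate_encoder(encoder: &HashMap<Vec<u8>, u32>) -> Result<(), String> {
    let missing = missing_single_bytes(encoder);
    if missing.is_empty() {
        return Ok(());
    }
    // Show at most three offending bytes to keep the message short.
    let examples: Vec<String> = missing
        .iter()
        .take(3)
        .map(|b| format!("0x{b:02X}"))
        .collect();
    Err(format!(
        "encoder is missing {} single-byte token(s), e.g. {}",
        missing.len(),
        examples.join(", ")
    ))
}
```

Running the check once at construction keeps the hot tokenization path unchanged, at the cost of 256 map lookups per encoder load.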
- new_internal: reject encoders missing any single-byte token, with an error naming how many bytes are missing and a few examples.
- test_new_rejects_encoder_missing_single_byte_tokens and test_new_accepts_complete_byte_encoder.
Test plan
- cargo test --lib: all pass
- cargo fmt --check: clean
- cargo clippy --lib: clean
Note for maintainers
This enforces the contract "a complete byte-level encoder defines all 256 single-byte tokens" at construction. Strictly, valid UTF-8 input can never contain a handful of byte values (0xC0, 0xC1, 0xF5–0xFF), so a minimal encoder could omit those and still tokenize all valid &str input. Requiring all 256 is the simpler, readable rule that matches every shipped encoder; happy to narrow the check to UTF-8-reachable bytes, or move it behind an explicit validation entry point, if you prefer.
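
The claim about UTF-8-unreachable bytes can be checked by brute force. The sketch below is illustrative only and not part of the PR (bytes_absent_from_utf8 is a hypothetical name). It encodes every Unicode scalar value and records which byte values ever occur.

```rust
/// Returns every byte value that never appears in the UTF-8 encoding of
/// any Unicode scalar value, found by exhaustive enumeration.
fn bytes_absent_from_utf8() -> Vec<u8> {
    let mut seen = [false; 256];
    let mut buf = [0u8; 4];
    for cp in 0..=0x10FFFFu32 {
        // char::from_u32 returns None for surrogates, which are not
        // valid scalar values and cannot occur in a &str.
        if let Some(c) = char::from_u32(cp) {
            for &b in c.encode_utf8(&mut buf).as_bytes() {
                seen[b as usize] = true;
            }
        }
    }
    (0..=255u8).filter(|&b| !seen[b as usize]).collect()
}
```

The result is the 13 bytes listed in the note (0xC0, 0xC1, and 0xF5 through 0xFF), so a check narrowed to UTF-8-reachable bytes would require 243 single-byte tokens instead of 256.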