
fix: reject encoders missing single-byte tokens to prevent reachable panic (#568) #570

Open
yen0304 wants to merge 1 commit into openai:main from yen0304:fix/reject-encoder-missing-byte-tokens

yen0304 commented Jun 14, 2026

Summary

Fixes #568 — a reachable Rust panic (denial of service) when tokenizing with an untrusted/malformed encoder that is missing single-byte tokens.

Root cause. byte_pair_encode reduces a piece down to the tokens it is made of. Any segment that does not get merged is a single byte, looked up directly via ranks[&piece[..]]. BPE only ever merges pairs that exist in the encoder, so the only lookups that can fail are single bytes that have no token. When such a byte is encountered, the indexing panics:

// src/lib.rs, byte_pair_encode
if piece_len == 1 {
    return vec![ranks[piece]];          // panics if this byte has no token
}
...
.map(|part| ranks[&piece[part[0].0..part[1].0]])  // same

With an untrusted or malformed encoder (e.g. a custom mergeable_ranks passed to tiktoken.Encoding), tokenizing ordinary text reaches this panic and kills the process — a denial of service rather than a catchable error.

I confirmed the panic empirically with a #[should_panic] reproduction calling byte_pair_encode(b"xyz", &ranks) against an encoder lacking single-byte tokens.
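
For readers unfamiliar with the failure mode, here is a minimal, self-contained sketch of the behavior described above. It does not use tiktoken's code: `lookup_or_panic` and `try_lookup` are hypothetical helper names, and a plain `std::collections::HashMap` stands in for the encoder's rank table. It only illustrates that indexing a map with a missing key panics, whereas `get` returns an `Option` the caller can handle.

```rust
use std::collections::HashMap;

/// Hypothetical stand-in for the failing lookup: indexing a HashMap
/// panics when the key is absent, just like `ranks[piece]` above.
fn lookup_or_panic(ranks: &HashMap<Vec<u8>, u32>, piece: &[u8]) -> u32 {
    ranks[piece]
}

/// Fallible alternative: a missing key becomes `None` instead of a panic.
fn try_lookup(ranks: &HashMap<Vec<u8>, u32>, piece: &[u8]) -> Option<u32> {
    ranks.get(piece).copied()
}
```

With a table that only contains `b"x"`, `try_lookup(&ranks, b"y")` returns `None`, while `lookup_or_panic(&ranks, b"y")` unwinds.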

The fix

Validate at construction (new_internal, the single chokepoint behind CoreBPE::new and the Python CoreBPE(...) constructor) that the encoder contains a token for every byte 0..=255. If any are missing, return a clear error instead of allowing a later panic. From Python this surfaces as a ValueError at load time.

A complete byte-level BPE encoder must define a token for every byte for tokenization to be total, and all standard tiktoken encoders (gpt2, cl100k_base, o200k_base, …) already include every single-byte token, so no valid encoder is affected — only malformed ones that would otherwise panic.
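
The construction-time check can be sketched roughly as follows. This is an illustration under stated assumptions, not the code in this PR: the function names `missing_single_bytes` and `validate_byte_coverage`, the `String` error type, and the wording of the message are all hypothetical, and a standard `HashMap` stands in for the encoder map.

```rust
use std::collections::HashMap;

/// Returns every byte value 0..=255 that has no single-byte token in `ranks`.
fn missing_single_bytes(ranks: &HashMap<Vec<u8>, u32>) -> Vec<u8> {
    (0..=u8::MAX)
        // `from_ref` turns `&u8` into a one-element `&[u8]` key.
        .filter(|b| !ranks.contains_key(std::slice::from_ref(b)))
        .collect()
}

/// Hypothetical validation step: fail early with a descriptive error
/// (count plus a few examples) instead of panicking later during encoding.
fn validate_byte_coverage(ranks: &HashMap<Vec<u8>, u32>) -> Result<(), String> {
    let missing = missing_single_bytes(ranks);
    if missing.is_empty() {
        return Ok(());
    }
    let examples: Vec<String> = missing
        .iter()
        .take(5)
        .map(|b| format!("0x{:02X}", b))
        .collect();
    Err(format!(
        "encoder is missing {} single-byte token(s), e.g. {}",
        missing.len(),
        examples.join(", ")
    ))
}
```

The check costs 256 map lookups per constructed encoder, which is negligible next to loading the rank table itself.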

Changes

  • new_internal: reject encoders missing any single-byte token, with an error naming how many bytes are missing and a few examples.
  • Tests: test_new_rejects_encoder_missing_single_byte_tokens and test_new_accepts_complete_byte_encoder.

Test plan

  • cargo test --lib — all pass
  • cargo fmt --check clean
  • cargo clippy --lib clean
  • Empirically reproduced the original panic before the fix

Note for maintainers

This enforces the contract "a complete byte-level encoder defines all 256 single-byte tokens" at construction. Strictly speaking, valid UTF-8 input can never contain 13 of those byte values (0xC0, 0xC1, 0xF5–0xFF), so a minimal encoder could omit them and still tokenize all valid &str input. Requiring all 256 is the simpler, more readable rule and matches every shipped encoder; I am happy to narrow the check to UTF-8-reachable bytes, or to move it behind an explicit validation entry point, if you prefer.
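
The claim about UTF-8-unreachable bytes can be checked by brute force. The sketch below is independent of tiktoken and uses hypothetical function names: it encodes every Unicode scalar value with the standard library and records which byte values ever appear.

```rust
/// Marks every byte value that appears in the UTF-8 encoding of at least
/// one Unicode scalar value (the `char` range skips surrogates).
fn utf8_reachable_bytes() -> [bool; 256] {
    let mut seen = [false; 256];
    let mut buf = [0u8; 4];
    for c in '\0'..=char::MAX {
        for &b in c.encode_utf8(&mut buf).as_bytes() {
            seen[b as usize] = true;
        }
    }
    seen
}

/// Byte values that can never occur in valid UTF-8, in ascending order.
fn utf8_unreachable_bytes() -> Vec<u8> {
    let seen = utf8_reachable_bytes();
    (0..=u8::MAX).filter(|&b| !seen[b as usize]).collect()
}
```

The result is 0xC0, 0xC1, and 0xF5 through 0xFF, 13 values in total, which is the set a narrower check would be allowed to skip.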

…panic

A CoreBPE built from an untrusted or malformed encoder that lacks tokens for
some single byte values caused a reachable panic (denial of service) during
tokenization: byte_pair_encode reduces a piece to its constituent tokens and
indexes ranks[...] for any leftover single byte, which panics when that byte
has no token.

Validate at construction (new_internal) that the encoder contains a token for
every byte 0..=255, returning a clear error instead. This surfaces as a
ValueError from Python rather than a process-killing panic later in encode().
All standard tiktoken encoders already include every single-byte token, so no
valid encoder is affected.
Development

Successfully merging this pull request may close these issues.

[Security] Reachable Rust panic (denial of service) when tokenizing with an untrusted encoder missing single-byte tokens
