fix: reject encoders missing single-byte tokens to prevent reachable panic (#568) by yen0304 · Pull Request #570 · openai/tiktoken

yen0304 · 2026-06-14T15:02:11Z

Summary

Fixes #568 — a reachable Rust panic (denial of service) when tokenizing with an untrusted/malformed encoder that is missing single-byte tokens.

Root cause. byte_pair_encode reduces a piece down to the tokens it is made of. Any segment that does not get merged is a single byte, looked up directly via ranks[&piece[..]]. BPE only ever merges pairs that exist in the encoder, so the only lookups that can fail are single bytes that have no token. When such a byte is encountered, the indexing panics:

// src/lib.rs, byte_pair_encode
if piece_len == 1 {
    return vec![ranks[piece]];          // panics if this byte has no token
}
...
.map(|part| ranks[&piece[part[0].0..part[1].0]])  // same

With an untrusted or malformed encoder (e.g. a custom mergeable_ranks passed to tiktoken.Encoding), tokenizing ordinary text reaches this panic and kills the process — a denial of service rather than a catchable error.

I confirmed the panic empirically with a #[should_panic] reproduction calling byte_pair_encode(b"xyz", &ranks) against an encoder lacking single-byte tokens.
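For readers who want to see the failure mode in isolation, here is a minimal sketch of it. The function `byte_pair_encode_sketch` is a hypothetical, simplified stand-in written for illustration; it is not the tiktoken implementation and only mirrors the two properties described above: pairs are merged only when they exist in `ranks`, and the remaining parts are looked up by indexing.

```rust
use std::collections::HashMap;

type Rank = u32;

/// Simplified stand-in for `byte_pair_encode` (an assumption for illustration,
/// not tiktoken's code): greedily merge the adjacent pair with the lowest rank,
/// then look up every remaining part by indexing into `ranks`.
fn byte_pair_encode_sketch(piece: &[u8], ranks: &HashMap<Vec<u8>, Rank>) -> Vec<Rank> {
    // Part boundaries: parts[i]..parts[i + 1] is one part. Start with single bytes.
    let mut parts: Vec<usize> = (0..=piece.len()).collect();
    loop {
        // Find the adjacent pair whose concatenation has the lowest rank.
        let mut best: Option<(Rank, usize)> = None;
        for i in 0..parts.len().saturating_sub(2) {
            if let Some(&rank) = ranks.get(&piece[parts[i]..parts[i + 2]]) {
                if best.map_or(true, |(r, _)| rank < r) {
                    best = Some((rank, i));
                }
            }
        }
        match best {
            // Merge parts i and i + 1 by dropping the boundary between them.
            Some((_, i)) => {
                parts.remove(i + 1);
            }
            None => break,
        }
    }
    parts
        .windows(2)
        // Indexing panics if the key is absent. Merged parts are always present,
        // so only a leftover single byte without a token can fail here.
        .map(|w| ranks[&piece[w[0]..w[1]]])
        .collect()
}
```

With an encoder that contains only the token `xy`, encoding `b"xyz"` merges `x` and `y` and then panics on the lookup for the leftover `z`. With all 256 single-byte tokens present, the same call returns normally.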

The fix

Validate at construction (new_internal, the single chokepoint behind CoreBPE::new and the Python CoreBPE(...) constructor) that the encoder contains a token for every byte 0..=255. If any are missing, return a clear error instead of allowing a later panic. From Python this surfaces as a ValueError at load time.
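The check described above could look roughly like the following. This is a sketch under assumptions: the helper name `check_single_byte_tokens`, the `String` error type, and the exact message wording are hypothetical and may differ from what the PR's `new_internal` actually does.

```rust
use std::collections::HashMap;

type Rank = u32;

/// Hypothetical helper (not the PR's code): returns an error if the encoder
/// lacks a token for any single byte value in 0..=255.
fn check_single_byte_tokens(encoder: &HashMap<Vec<u8>, Rank>) -> Result<(), String> {
    // Collect every byte value that has no one-byte token.
    let missing: Vec<u8> = (0..=255u8)
        .filter(|b| !encoder.contains_key(&[*b][..]))
        .collect();
    if missing.is_empty() {
        return Ok(());
    }
    // Report how many bytes are missing and show a few of them as examples.
    let examples: Vec<String> = missing
        .iter()
        .take(5)
        .map(|b| format!("0x{:02X}", b))
        .collect();
    Err(format!(
        "encoder is missing tokens for {} single-byte value(s), e.g. {}",
        missing.len(),
        examples.join(", ")
    ))
}
```

Running the check once at construction keeps the hot tokenization path unchanged: the indexing in `byte_pair_encode` can stay as it is, because the constructor has already ruled out the only keys that could be absent.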

A complete byte-level BPE encoder must define a token for every byte for tokenization to be total, and all standard tiktoken encoders (gpt2, cl100k_base, o200k_base, …) already include every single-byte token, so no valid encoder is affected — only malformed ones that would otherwise panic.

Changes

new_internal: reject encoders missing any single-byte token, with an error naming how many bytes are missing and a few examples.
Tests: test_new_rejects_encoder_missing_single_byte_tokens and test_new_accepts_complete_byte_encoder.

Test plan

cargo test --lib — all pass
cargo fmt --check clean
cargo clippy --lib clean
Empirically reproduced the original panic before the fix

Note for maintainers

This enforces the contract "a complete byte-level encoder defines all 256 single-byte tokens" at construction. Strictly, valid UTF-8 input can never contain a handful of byte values (0xC0, 0xC1, 0xF5–0xFF), so a minimal encoder could omit those and still tokenize all valid &str input. Requiring all 256 is the simpler, more readable rule and matches every shipped encoder; happy to narrow the check to UTF-8-reachable bytes, or move it behind an explicit validation entry point, if you prefer.
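The set of byte values that valid UTF-8 can never contain can be checked by brute force. The sketch below uses hypothetical helper names (`utf8_reachable_bytes`, `utf8_unreachable_bytes`) and only the standard library; it is meant to illustrate what a narrowed check would have to exempt, not to propose an implementation.

```rust
/// For each byte value, records whether it occurs in the UTF-8 encoding of at
/// least one Unicode scalar value. (Hypothetical helper for illustration.)
fn utf8_reachable_bytes() -> [bool; 256] {
    let mut reachable = [false; 256];
    let mut buf = [0u8; 4];
    // char::from_u32 returns None for surrogates, so only valid chars are visited.
    for c in (0..=0x10FFFFu32).filter_map(char::from_u32) {
        for &b in c.encode_utf8(&mut buf).as_bytes() {
            reachable[b as usize] = true;
        }
    }
    reachable
}

/// Byte values that never appear in any valid UTF-8 string, in ascending order.
fn utf8_unreachable_bytes() -> Vec<u8> {
    let reachable = utf8_reachable_bytes();
    (0..=255u8).filter(|&b| !reachable[b as usize]).collect()
}
```

The result is the 13 values listed above: 0xC0, 0xC1, and 0xF5 through 0xFF. A narrowed check would therefore require tokens for the other 243 byte values.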

…panic

A CoreBPE built from an untrusted or malformed encoder that lacks tokens for some single byte values caused a reachable panic (denial of service) during tokenization: byte_pair_encode reduces a piece to its constituent tokens and indexes ranks[...] for any leftover single byte, which panics when that byte has no token.

Validate at construction (new_internal) that the encoder contains a token for every byte 0..=255, returning a clear error instead. This surfaces as a ValueError from Python rather than a process-killing panic later in encode(). All standard tiktoken encoders already include every single-byte token, so no valid encoder is affected.

yen0304 wants to merge 1 commit into openai:main from yen0304:fix/reject-encoder-missing-byte-tokens
