fix: correct stop_sequence reporting and multi-token stop handling in Claude Messages API #2123

Open
morisil wants to merge 1 commit into exo-explore:main from morisil:fix/stop-sequences-reporting-and-multitoken

morisil commented May 27, 2026

Fixes #2122.

The Claude Messages API mishandled stop_sequences in three ways. This PR fixes all three and adds the missing test coverage.

The bugs

  1. stop_reason was always "end_turn", never "stop_sequence". A matched stop sequence collapsed into the generic "stop" finish reason, which finish_reason_to_claude_stop_reason maps to "end_turn". The matched string was discarded, so "stop_sequence" was unreachable.
  2. The stop_sequence response field was never populated (always null).
  3. Multi-token stop sequences leaked their leading bytes. Text was emitted one token at a time and the stop check was a substring test over accumulated text, so "END" arriving as "E" then "ND" streamed the "E" before the full match was recognised.
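
To make the third bug concrete, here is a minimal sketch of the leaky pattern described above. It is a hypothetical reconstruction for illustration, not the project's previous code: `naive_stream` is an invented name, and tokens are modelled as plain strings.

```python
def naive_stream(tokens, stop):
    """Emit tokens one at a time, checking the accumulated text for `stop`.

    Hypothetical illustration of the leak; not the project's actual code.
    """
    emitted = []
    accumulated = ""
    for token in tokens:
        accumulated += token
        if stop in accumulated:
            # The match is only visible now, but any earlier token that
            # carried the start of the stop string has already been emitted.
            break
        emitted.append(token)
    return "".join(emitted)


# Stop string arrives as a single token: nothing leaks.
print(naive_stream(["XYZ", "END"], "END"))      # XYZ
# Stop string is split across two tokens: the leading "E" leaks.
print(naive_stream(["XYZ", "E", "ND"], "END"))  # XYZE
```

The second call shows why a substring test over accumulated text is not enough on its own: by the time the full match exists, part of it has already been sent to the client.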

Before / after

curl http://localhost:52415/v1/messages -H "Content-Type: application/json" -d '{
  "model": "<mlx model>", "max_tokens": 1024,
  "messages": [{"role":"user","content":[{"type":"text",
    "text":"Output the alphabet A to Z, no spaces, then immediately output END"}]}],
  "stop_sequences": ["END"], "thinking": {"type":"disabled"}}'

| | before | after |
|---|---|---|
| content[0].text | ABC…XYZ (single-token END) / leaks E… (multi-token) | ABC…XYZ |
| stop_reason | "end_turn" | "stop_sequence" |
| stop_sequence | null | "END" |

The fix

  • scan_stop_sequences (new, pure module): a streaming-safe scanner that holds back any trailing partial match until it is known safe to emit, reports the matched sequence, and bounds held-back text to len(longest_stop) - 1.
  • generate.py / batch_generate.py: use the scanner, flushing held-back text on a natural EOS / length limit. Removes the leaky accumulated-text truncation.
  • GenerationResponse / TokenChunk: add matched_stop_sequence, propagated through map_responses_to_chunks, so the adapter can tell a stop-sequence stop from a natural EOS.
  • Claude adapter: reports stop_reason="stop_sequence" and populates stop_sequence for both streaming and non-streaming responses. The generic "stop" -> "end_turn" mapping is unchanged (natural EOS).
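
The hold-back idea behind the scanner can be sketched as follows. This is an assumption-laden illustration, not the contents of the new module: `ScanResult`, `scan`, and `stream` are hypothetical names, the signature is invented, and picking the earliest match when several stop sequences are present is one plausible policy rather than a confirmed one.

```python
from dataclasses import dataclass
from typing import List, Optional, Sequence, Tuple


@dataclass
class ScanResult:
    emit: str                # text that is safe to send now
    pending: str             # held-back tail that may still grow into a stop
    matched: Optional[str]   # the stop sequence that matched, if any


def scan(pending: str, new_text: str, stops: Sequence[str]) -> ScanResult:
    """Hypothetical scanner step: consume new_text, hold back partial matches."""
    buffer = pending + new_text

    # 1. Full match: emit everything before the earliest match and stop.
    best_idx, best_stop = None, None
    for stop in stops:
        idx = buffer.find(stop) if stop else -1
        if idx != -1 and (best_idx is None or idx < best_idx):
            best_idx, best_stop = idx, stop
    if best_stop is not None:
        return ScanResult(buffer[:best_idx], "", best_stop)

    # 2. No match: hold back the longest suffix of the buffer that is a
    #    proper prefix of some stop sequence (at most len(longest_stop) - 1).
    hold = 0
    for stop in stops:
        for k in range(min(len(stop) - 1, len(buffer)), hold, -1):
            if buffer.endswith(stop[:k]):
                hold = k
                break
    split = len(buffer) - hold
    return ScanResult(buffer[:split], buffer[split:], None)


def stream(tokens: Sequence[str], stops: Sequence[str]) -> Tuple[str, Optional[str]]:
    """Simulate streaming: returns (emitted text, matched stop or None)."""
    out: List[str] = []
    pending = ""
    for token in tokens:
        result = scan(pending, token, stops)
        out.append(result.emit)
        pending = result.pending
        if result.matched is not None:
            return "".join(out), result.matched
    out.append(pending)  # natural EOS / length limit: flush held-back text
    return "".join(out), None


print(stream(["XYZ", "E", "ND"], ["END"]))  # ('XYZ', 'END')
print(stream(["XYZ", "E", "X"], ["END"]))   # ('XYZEX', None)
```

In the first call the "E" is held back until "ND" completes the match, so nothing leaks. In the second call the "E" turns out to be a false partial and is released together with the "X".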

OpenAI/Ollama adapters inherit the multi-token fix; "stop" is already correct for them, so their reported finish reasons are unchanged.
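
The difference between the adapters can be sketched roughly like this. The function names below are hypothetical stand-ins (the real mapping lives in finish_reason_to_claude_stop_reason and the adapter code), and the fallback to "end_turn" for unknown finish reasons is an assumption made only to keep the sketch total.

```python
from typing import Optional, Tuple


def claude_stop_fields(
    finish_reason: str, matched_stop_sequence: Optional[str]
) -> Tuple[str, Optional[str]]:
    """Hypothetical helper: returns (stop_reason, stop_sequence) for Claude."""
    if matched_stop_sequence is not None:
        # A stop-sequence stop is reported as such and echoes the match.
        return "stop_sequence", matched_stop_sequence
    # Natural EOS and length limits keep their existing mapping.
    mapping = {"stop": "end_turn", "length": "max_tokens"}
    # Assumption: unknown finish reasons fall back to "end_turn".
    return mapping.get(finish_reason, "end_turn"), None


def openai_finish_reason(
    finish_reason: str, matched_stop_sequence: Optional[str]
) -> str:
    """Hypothetical helper: OpenAI-style output keeps "stop" either way."""
    return finish_reason


print(claude_stop_fields("stop", "END"))    # ('stop_sequence', 'END')
print(claude_stop_fields("stop", None))     # ('end_turn', None)
print(claude_stop_fields("length", None))   # ('max_tokens', None)
print(openai_finish_reason("stop", "END"))  # stop
```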

Tests

  • test_stop_sequences.py: scanner contract + a streaming simulation covering the multi-token leak, splits across tokens, false-partial release, EOS flush, and pending boundedness. Runs in CI without a model.
  • test_claude_stop_sequence.py: drives the real adapter functions and asserts stop_reason="stop_sequence" + the stop_sequence echo (streaming and non-streaming), and that natural EOS / length still report end_turn / max_tokens.

Checks

ruff check, ruff format, basedpyright (on changed files), and the affected test suites all pass locally (22 new tests + existing Claude adapter suites, no regressions). The MLX generator integration tests require uv sync --extra mlx to run.

🤖 Generated with Claude Code

The Claude Messages API mishandled stop sequences in three ways:

1. stop_reason was always reported as "end_turn", never "stop_sequence".
   A matched stop sequence collapsed into the generic "stop" finish_reason,
   which finish_reason_to_claude_stop_reason maps to "end_turn". The matched
   sequence was never tracked, so "stop_sequence" was unreachable.

2. The stop_sequence response field was never populated (always null), as the
   matched string was discarded in the generator and never threaded through.

3. Multi-token stop sequences leaked their leading bytes. Text was emitted one
   token at a time and the stop check was a substring test on accumulated text,
   so a sequence like "END" arriving as "E" then "ND" streamed the "E" before
   the full match was recognised.

Fixes:

- Add scan_stop_sequences: a pure, streaming-safe scanner that holds back any
  trailing partial match until it is known safe to emit, reports the matched
  sequence, and bounds held-back text to len(longest_stop) - 1.
- Use it in generate.py and batch_generate.py, flushing held-back text on a
  natural EOS / length limit. Removes the leaky accumulated-text truncation.
- Thread matched_stop_sequence through GenerationResponse and TokenChunk so the
  adapter can distinguish a stop-sequence stop from a natural EOS.
- The Claude adapter now reports stop_reason="stop_sequence" and populates the
  stop_sequence field for both non-streaming and streaming responses. The
  generic "stop" -> "end_turn" mapping is unchanged (natural EOS).

OpenAI/Ollama adapters inherit the multi-token fix; "stop" is already correct
for them, so their reported finish reasons are unchanged.

Tests:

- test_stop_sequences.py: scanner contract + streaming simulation covering the
  multi-token leak, splits across tokens, false-partial release, EOS flush, and
  pending boundedness (runs in CI without a model).
- test_claude_stop_sequence.py: drives the real adapter functions and asserts
  stop_reason="stop_sequence" + stop_sequence echo (streaming and non-streaming),
  and that natural EOS / length still report end_turn / max_tokens.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Development

Successfully merging this pull request may close these issues.

[BUG] Claude Messages API: stop_sequences report stop_reason=end_turn / stop_sequence=null, and multi-token stop sequences leak into output