fix: correct stop_sequence reporting and multi-token stop handling in Claude Messages API by morisil · Pull Request #2123 · exo-explore/exo

morisil · 2026-05-27T21:16:28Z

The Claude Messages API mishandled stop_sequences in three ways. This PR fixes all three and adds the missing test coverage.

The bugs

1. stop_reason was always "end_turn", never "stop_sequence". A matched stop sequence collapsed into the generic "stop" finish reason, which finish_reason_to_claude_stop_reason maps to "end_turn". The matched string was discarded, so "stop_sequence" was unreachable.
2. The stop_sequence response field was never populated (always null).
3. Multi-token stop sequences leaked their leading bytes. Text was emitted one token at a time and the stop check was a substring test over accumulated text, so "END" arriving as "E" then "ND" streamed the "E" before the full match was recognised.
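The leak in the third bug can be reproduced with a few lines. This is a minimal sketch of the emit-then-check pattern described above, not the removed code itself; `naive_stream` is a hypothetical name used only for illustration.

```python
def naive_stream(tokens: list[str], stop: str) -> str:
    """Emit each token as it arrives and test the accumulated text for `stop`.

    Hypothetical reconstruction of the old behaviour, for illustration only.
    """
    emitted: list[str] = []
    accumulated = ""
    for token in tokens:
        accumulated += token
        if stop in accumulated:
            # The match is only visible now; tokens that began it are already out.
            break
        emitted.append(token)
    return "".join(emitted)


# Stop sequence arrives as one token: nothing leaks.
print(naive_stream(["XYZ", "END"], "END"))      # XYZ
# Stop sequence split across two tokens: the leading "E" has already been sent.
print(naive_stream(["XYZ", "E", "ND"], "END"))  # XYZE
```

In a non-streaming response the accumulated text can still be truncated after the fact, but a streaming client has already received the "E", which is why the check has to hold text back rather than trim it afterwards.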

Before / after

curl http://localhost:52415/v1/messages -H "Content-Type: application/json" -d '{
  "model": "<mlx model>", "max_tokens": 1024,
  "messages": [{"role":"user","content":[{"type":"text",
    "text":"Output the alphabet A to Z, no spaces, then immediately output END"}]}],
  "stop_sequences": ["END"], "thinking": {"type":"disabled"}}'

| field | before | after |
|---|---|---|
| `content[0].text` | `ABC…XYZ` (single-token `END`) / leaks `E…` (multi-token) | `ABC…XYZ` |
| `stop_reason` | `"end_turn"` | `"stop_sequence"` |
| `stop_sequence` | `null` | `"END"` |
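To check a response against the table, the three fields can be pulled out of the JSON body as below. This is a small sketch that assumes the response follows the Messages shape used in the table (`content` as a list of typed blocks, plus top-level `stop_reason` and `stop_sequence`); `stop_fields` and the sample dictionaries are hypothetical, with values copied from the table rather than captured from a real run.

```python
from typing import Optional


def stop_fields(response: dict) -> tuple[Optional[str], Optional[str], str]:
    """Return (stop_reason, stop_sequence, concatenated text) from a response dict."""
    text = "".join(
        block.get("text", "")
        for block in response.get("content", [])
        if block.get("type") == "text"
    )
    return response.get("stop_reason"), response.get("stop_sequence"), text


# Sample bodies built from the table values (the text is the table's own placeholder).
before = {"content": [{"type": "text", "text": "ABC…XYZ"}],
          "stop_reason": "end_turn", "stop_sequence": None}
after = {"content": [{"type": "text", "text": "ABC…XYZ"}],
         "stop_reason": "stop_sequence", "stop_sequence": "END"}

for label, body in (("before", before), ("after", after)):
    reason, matched, text = stop_fields(body)
    print(label, reason, matched, text.endswith("END"))
```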

The fix

- scan_stop_sequences (new, pure module): a streaming-safe scanner that holds back any trailing partial match until it is known safe to emit, reports the matched sequence, and bounds held-back text to len(longest_stop) - 1.
- generate.py / batch_generate.py: use the scanner, flushing held-back text on a natural EOS / length limit. Removes the leaky accumulated-text truncation.
- GenerationResponse / TokenChunk: add matched_stop_sequence, propagated through map_responses_to_chunks, so the adapter can tell a stop-sequence stop from a natural EOS.
- Claude adapter: reports stop_reason="stop_sequence" and populates stop_sequence for both streaming and non-streaming responses. The generic "stop" -> "end_turn" mapping is unchanged (natural EOS).
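The hold-back idea behind the scanner can be sketched as follows. This is an illustrative implementation written from the description above, not the module added in this PR; `ScanResult`, `scan_step`, and `stream` are hypothetical names and the real signature of scan_stop_sequences may differ.

```python
from dataclasses import dataclass
from typing import Optional


@dataclass
class ScanResult:
    emit: str               # text that is safe to send now
    pending: str            # held-back tail that could still start a stop sequence
    matched: Optional[str]  # the stop sequence that was hit, if any


def scan_step(pending: str, new_text: str, stops: list[str]) -> ScanResult:
    """Scan one chunk of new text, given the tail held back from earlier chunks."""
    buffer = pending + new_text
    # Full match: cut at the earliest occurrence of any stop sequence.
    best: Optional[tuple[int, str]] = None
    for stop in stops:
        idx = buffer.find(stop) if stop else -1
        if idx != -1 and (best is None or idx < best[0]):
            best = (idx, stop)
    if best is not None:
        return ScanResult(buffer[: best[0]], "", best[1])
    # No match: hold back the longest suffix that is a proper prefix of a stop
    # sequence, so at most len(longest_stop) - 1 characters are ever pending.
    hold = 0
    for stop in stops:
        for k in range(min(len(stop) - 1, len(buffer)), hold, -1):
            if buffer.endswith(stop[:k]):
                hold = k
                break
    cut = len(buffer) - hold
    return ScanResult(buffer[:cut], buffer[cut:], None)


def stream(tokens: list[str], stops: list[str]) -> tuple[str, Optional[str]]:
    """Drive scan_step over decoded token texts, as a generator loop might."""
    pending, out = "", []
    for token in tokens:
        result = scan_step(pending, token, stops)
        out.append(result.emit)
        pending = result.pending
        if result.matched is not None:
            return "".join(out), result.matched
    out.append(pending)  # natural EOS / length limit: flush the held-back text
    return "".join(out), None


print(stream(["XYZ", "E", "ND"], ["END"]))  # ('XYZ', 'END')
print(stream(["E", "gg"], ["END"]))         # ('Egg', None)
print(stream(["XYZ", "EN"], ["END"]))       # ('XYZEN', None)
```

The three calls correspond to the multi-token match, the release of a false partial match, and the flush on a natural end of generation. Keeping the function free of generator state is what lets this kind of logic be tested without a model.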

OpenAI/Ollama adapters inherit the multi-token fix; "stop" is already correct for them, so their reported finish reasons are unchanged.
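The reporting rule for the Claude adapter, and the reason the other adapters are unaffected, can be summarised in a short sketch. The function names here are hypothetical; only the "stop" -> "end_turn" mapping and the matched_stop_sequence field come from this PR, and the "length" -> "max_tokens" entry is an assumption based on the behaviour the tests below describe.

```python
from typing import Optional

# Generic finish reason -> Claude stop_reason. "length" is an assumed name for
# the generic length-limit reason; "stop" -> "end_turn" is stated in the PR.
_GENERIC_TO_CLAUDE = {"stop": "end_turn", "length": "max_tokens"}


def claude_stop_fields(
    finish_reason: str, matched_stop_sequence: Optional[str]
) -> tuple[str, Optional[str]]:
    """Return (stop_reason, stop_sequence) for a Claude-style response."""
    if finish_reason == "stop" and matched_stop_sequence is not None:
        # A stop-sequence stop: report it as such and echo the matched string.
        return "stop_sequence", matched_stop_sequence
    # Natural EOS or length limit: the generic mapping applies, nothing to echo.
    return _GENERIC_TO_CLAUDE.get(finish_reason, "end_turn"), None


def openai_finish_reason(
    finish_reason: str, matched_stop_sequence: Optional[str]
) -> str:
    """OpenAI-style responses report "stop" either way, so the field is ignored."""
    return finish_reason


print(claude_stop_fields("stop", "END"))    # ('stop_sequence', 'END')
print(claude_stop_fields("stop", None))     # ('end_turn', None)
print(claude_stop_fields("length", None))   # ('max_tokens', None)
print(openai_finish_reason("stop", "END"))  # stop
```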

Tests

- test_stop_sequences.py: scanner contract + a streaming simulation covering the multi-token leak, splits across tokens, false-partial release, EOS flush, and pending boundedness. Runs in CI without a model.
- test_claude_stop_sequence.py: drives the real adapter functions and asserts stop_reason="stop_sequence" + the stop_sequence echo (streaming and non-streaming), and that natural EOS / length still report end_turn / max_tokens.

Checks

ruff check, ruff format, basedpyright (on changed files), and the affected test suites all pass locally (22 new tests + existing Claude adapter suites, no regressions). The MLX generator integration tests require uv sync --extra mlx to run.

🤖 Generated with Claude Code

The Claude Messages API mishandled stop sequences in three ways:

1. stop_reason was always reported as "end_turn", never "stop_sequence". A matched stop sequence collapsed into the generic "stop" finish_reason, which finish_reason_to_claude_stop_reason maps to "end_turn". The matched sequence was never tracked, so "stop_sequence" was unreachable.
2. The stop_sequence response field was never populated (always null), as the matched string was discarded in the generator and never threaded through.
3. Multi-token stop sequences leaked their leading bytes. Text was emitted one token at a time and the stop check was a substring test on accumulated text, so a sequence like "END" arriving as "E" then "ND" streamed the "E" before the full match was recognised.

Fixes:

- Add scan_stop_sequences: a pure, streaming-safe scanner that holds back any trailing partial match until it is known safe to emit, reports the matched sequence, and bounds held-back text to len(longest_stop) - 1.
- Use it in generate.py and batch_generate.py, flushing held-back text on a natural EOS / length limit. Removes the leaky accumulated-text truncation.
- Thread matched_stop_sequence through GenerationResponse and TokenChunk so the adapter can distinguish a stop-sequence stop from a natural EOS.
- The Claude adapter now reports stop_reason="stop_sequence" and populates the stop_sequence field for both non-streaming and streaming responses. The generic "stop" -> "end_turn" mapping is unchanged (natural EOS).

OpenAI/Ollama adapters inherit the multi-token fix; "stop" is already correct for them, so their reported finish reasons are unchanged.

Tests:

- test_stop_sequences.py: scanner contract + streaming simulation covering the multi-token leak, splits across tokens, false-partial release, EOS flush, and pending boundedness (runs in CI without a model).
- test_claude_stop_sequence.py: drives the real adapter functions and asserts stop_reason="stop_sequence" + stop_sequence echo (streaming and non-streaming), and that natural EOS / length still report end_turn / max_tokens.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

morisil wants to merge 1 commit into exo-explore:main from morisil:fix/stop-sequences-reporting-and-multitoken