fix: correct stop_sequence reporting and multi-token stop handling in Claude Messages API#2123
Open
morisil wants to merge 1 commit into
Open
Conversation
The Claude Messages API mishandled stop sequences in three ways: 1. stop_reason was always reported as "end_turn", never "stop_sequence". A matched stop sequence collapsed into the generic "stop" finish_reason, which finish_reason_to_claude_stop_reason maps to "end_turn". The matched sequence was never tracked, so "stop_sequence" was unreachable. 2. The stop_sequence response field was never populated (always null), as the matched string was discarded in the generator and never threaded through. 3. Multi-token stop sequences leaked their leading bytes. Text was emitted one token at a time and the stop check was a substring test on accumulated text, so a sequence like "END" arriving as "E" then "ND" streamed the "E" before the full match was recognised. Fixes: - Add scan_stop_sequences: a pure, streaming-safe scanner that holds back any trailing partial match until it is known safe to emit, reports the matched sequence, and bounds held-back text to len(longest_stop) - 1. - Use it in generate.py and batch_generate.py, flushing held-back text on a natural EOS / length limit. Removes the leaky accumulated-text truncation. - Thread matched_stop_sequence through GenerationResponse and TokenChunk so the adapter can distinguish a stop-sequence stop from a natural EOS. - The Claude adapter now reports stop_reason="stop_sequence" and populates the stop_sequence field for both non-streaming and streaming responses. The generic "stop" -> "end_turn" mapping is unchanged (natural EOS). OpenAI/Ollama adapters inherit the multi-token fix; "stop" is already correct for them, so their reported finish reasons are unchanged. Tests: - test_stop_sequences.py: scanner contract + streaming simulation covering the multi-token leak, splits across tokens, false-partial release, EOS flush, and pending boundedness (runs in CI without a model). - test_claude_stop_sequence.py: drives the real adapter functions and asserts stop_reason="stop_sequence" + stop_sequence echo (streaming and non-streaming), and that natural EOS / length still report end_turn / max_tokens. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
Fixes #2122.
The Claude Messages API mishandled
stop_sequencesin three ways. This PR fixes all three and adds the missing test coverage.The bugs
stop_reasonwas always"end_turn", never"stop_sequence". A matched stop sequence collapsed into the generic"stop"finish reason, whichfinish_reason_to_claude_stop_reasonmaps to"end_turn". The matched string was discarded, so"stop_sequence"was unreachable.stop_sequenceresponse field was never populated (alwaysnull)."END"arriving as"E"then"ND"streamed the"E"before the full match was recognised.Before / after
content[0].textABC…XYZ(single-tokenEND) / leaksE…(multi-token)ABC…XYZstop_reason"end_turn""stop_sequence"stop_sequencenull"END"The fix
scan_stop_sequences(new, pure module): a streaming-safe scanner that holds back any trailing partial match until it is known safe to emit, reports the matched sequence, and bounds held-back text tolen(longest_stop) - 1.generate.py/batch_generate.py: use the scanner, flushing held-back text on a natural EOS / length limit. Removes the leaky accumulated-text truncation.GenerationResponse/TokenChunk: addmatched_stop_sequence, propagated throughmap_responses_to_chunks, so the adapter can tell a stop-sequence stop from a natural EOS.stop_reason="stop_sequence"and populatesstop_sequencefor both streaming and non-streaming responses. The generic"stop" -> "end_turn"mapping is unchanged (natural EOS).OpenAI/Ollama adapters inherit the multi-token fix;
"stop"is already correct for them, so their reported finish reasons are unchanged.Tests
test_stop_sequences.py: scanner contract + a streaming simulation covering the multi-token leak, splits across tokens, false-partial release, EOS flush, and pending boundedness. Runs in CI without a model.test_claude_stop_sequence.py: drives the real adapter functions and assertsstop_reason="stop_sequence"+ thestop_sequenceecho (streaming and non-streaming), and that natural EOS / length still reportend_turn/max_tokens.Checks
ruff check,ruff format,basedpyright(on changed files), and the affected test suites all pass locally (22 new tests + existing Claude adapter suites, no regressions). The MLX generator integration tests requireuv sync --extra mlxto run.🤖 Generated with Claude Code