feat(clients): add dynamo_chat renderer transport (TITO over Dynamo)#1574
Open
biswapanda wants to merge 19 commits into
Open
feat(clients): add dynamo_chat renderer transport (TITO over Dynamo)#1574biswapanda wants to merge 19 commits into
biswapanda wants to merge 19 commits into
Conversation
…tokens Dynamo's vLLM and SGLang backends emit engine-emitted token IDs and per-token logprobs under `response.nvext.engine_data` when the client opts in via `nvext.extra_fields=["engine_data"]` (PR #8119). The vLLM-native path uses non-standard top-level fields (`choices[0].token_ids`, `response.prompt_token_ids`). Add a small graft inside `from_native_response.parse_tokens` that copies the engine_data fields onto the OpenAI-shaped response when present and the top-level fields are absent. The rest of parse_tokens then reads via the standard SDK attribute path regardless of backend.
The verifiers TITO client previously only spoke vLLM's TITO surface
(POST /v1/chat/completions/tokens with tokens=prompt_ids; bridge tokens
via /tokenize). Dynamo serves neither route, so multi-turn TITO against
Dynamo silently degraded to MITO every turn-2+.
This teaches OpenAIChatCompletionsTokenClient to read
ClientConfig.renderer_transport and route accordingly:
* prime_vllm_generate (default): unchanged. POST /v1/chat/completions/tokens
with tokens=prompt_ids; bridge tokens via /tokenize HTTP. Requires vLLM
>= 0.20.
* dynamo_chat_nvext: POST /v1/chat/completions with placeholder messages +
nvext.token_data=prompt_ids. Bridge tokens are computed locally via the
model's HF fast tokenizer (no /tokenize HTTP round-trip). Server returns
engine-side token IDs and logprobs under nvext.engine_data (PR #8119
channel), parsed by the OpenAIChatCompletionsClient.from_native_response
graft so the rest of the pipeline is transport-agnostic.
Also fix the normalize_for_comparison asymmetry that caused get_prompt_ids
to never match for vf.Message-shaped input (the form MultiTurnEnv produces
after maybe_normalize_messages). Drop None-valued keys so model_dump's
exhaustive view is equivalent to to_native_prompt's slimmer view.
…ken_ids (plan B3)
…ChatCompletion, scrub return_token_ids, forward sampling args, graft engine_data logprobs) + rename to dynamo_chat
… content-less; trim test comments
…p fixed allowlist) for vLLM-path parity
…prob length, tokenizer override, drop dead renderer field
…route dynamo TITO through routed-experts sidecar helper
…_pretrained must not block the event loop)
…er key-order robust
…ect; document dtype field
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 2 potential issues.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 17c819b. Configure here.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

Description
Adds a
dynamo_chatrenderer transport so the verifiers TITO (tokens-in/tokens-out) client can run multi-turn against NVIDIA Dynamo, alongside the existing vLLM TITO path. Previously the TITO client only spoke vLLM's surface (POST /v1/chat/completions/tokens+/tokenize); Dynamo serves neither route, so multi-turn TITO against Dynamo silently degraded to MITO from turn 2 onward.Changes
RendererTransport = Literal["vllm_generate", "dynamo_chat"]andClientConfig.renderer_transport(defaultvllm_generate— the new path is opt-in).renderer_transportthrough torenderers.generate()and route by transport.vllm_generate(default): unchanged —POST /v1/chat/completions/tokens, bridge tokens via/tokenize.dynamo_chat:POST /v1/chat/completionswith placeholder messages +nvext.token_data=prompt_ids; bridge tokens computed locally via the model's HF fast tokenizer (no/tokenizeround-trip). Engine token IDs + logprobs come back undernvext.engine_data.nvext.engine_data(engine token IDs + per-token logprobs) onto the OpenAI-shaped response when present and the vLLM-native fields are absent, keeping the rest of the pipeline transport-agnostic.RoutedExpertsPayloadgainsdtype: NotRequired[Literal["uint8", "uint16", "int32"]]so the routed-experts buffer is self-describing (≤256 experts → uint8, larger → uint16/int32) instead of consumers assuming a fixed width; the JSON-gate sidecar stripper is bounded to the routed_experts object and made key-order robust.normalize_for_comparisonasymmetry soget_prompt_idsmatchesvf.Message-shaped input (dropsNone-valued keys).Type of Change
Review
Codex adversarial review: SIGN-OFF (head
ea53210). All review threads resolved.Notes
Default behavior is unchanged (
renderer_transportdefaults tovllm_generate). Companion to PrimeIntellect-ai/renderers#79 and PrimeIntellect-ai/prime-rl#2737.Note
Medium Risk
Changes multi-turn token stitching, inference request shapes, and response parsing for Dynamo backends; misaligned local vs server tokenization could still break TITO, but default vLLM behavior is unchanged.
Overview
Adds
renderer_transport("vllm"default,"dynamo"opt-in) onClientConfigso TITO (openai_chat_completions_token) andRendererClientcan target NVIDIA Dynamo without vLLM’s/chat/completions/tokensor/tokenizeroutes.For
renderer_transport="dynamo", the token client posts stitched prompts vianvext.token_dataon/v1/chat/completions, requestsnvext.extra_fields=["engine_data"], strips vLLM-only sampling keys, and computes bridge tokens locally with a cached HuggingFace tokenizer (renderer_model_nameoverride supported).OpenAIChatCompletionsClientgraftsnvext.engine_data(prompt/completion token IDs, logprobs, routed experts) onto the OpenAI-shaped response soparse_tokensstays unchanged, including synthesizing logprobs when the choice has empty content and dropping tokens when logprob lengths mismatch.RoutedExpertsPayloadgains optionaldtype;strip_routed_experts_dataand the routed-experts sidecar now handle varying JSON key order and attach blobs under choice ornvext/engine_data, raising if a blob was stripped but no container exists. TITO prefix matching dropsNonekeys in message normalization so multi-turn stitching no longer falls back to MITO every turn after the first.Reviewed by Cursor Bugbot for commit b658883. Bugbot is set up for automated code reviews on this repo. Configure here.
Note
Add Dynamo renderer transport (TITO over Dynamo) to chat completions token client
renderer_transportfield toClientConfig(default"vllm") and aRendererTransporttype alias inverifiers/types.py, allowing per-client selection of either"vllm"or"dynamo"transport."dynamo",OpenAIChatCompletionsTokenClienttokenizes locally via a cached HF fast tokenizer, posts to/v1/chat/completionswithnvext.token_datacontaining prompt IDs, and strips vLLM-only sampling keys (return_token_ids,spaces_between_special_tokens,priority)._graft_engine_datahelper toOpenAIChatCompletionsClientto read token IDs and logprobs fromnvext.engine_data, synthesize missinglogprobs.content, and widenrouted_expertsdiscovery to additionalnvextpaths.strip_routed_experts_datato findrouted_experts.dataregardless of key order by bounding the search within the object span.post_chat_completion_with_routed_experts_sidecarto reattach the routed experts memoryview for both vLLM and Dynamo response shapes, raising an error if no container is found.parse_tokensnow returnsNonewhencompletion_logprobslength mismatchescompletion_token_ids, which is a new failure mode for misaligned responses.Macroscope summarized b658883.