feat: dynamo inference backend integration#2737
Open
biswapanda wants to merge 19 commits into
… (qwen3_moe/glm_moe)
…p_recent for LoRA
…nsport (vllm + dynamo nvext)
…) + Dockerfile.dynamo
…t; bump verifiers/renderers deps to rl-sdk-4 heads
…t16 to int32, auto-triton on router replay
Cursor Bugbot has reviewed your changes and found 1 potential issue.
There are 3 total unresolved issues (including 2 from previous reviews).
Reviewed by Cursor Bugbot for commit 2c61937. Configure here.

Description
Replaced by #2773.
End-to-end support for running prime-rl RL training against NVIDIA Dynamo (GB200/GB300) alongside the existing vLLM path. Adds a Dynamo inference backend, NCCL/filesystem weight transfer for GB200, vLLM 0.22 patches, MoE routed-experts capture + replay, and the deploy tooling (image, helm, k8s manifests) to run it.
Highlights
- `AdminAPI` abstraction + backend selector (`client.backend=vllm|dynamo`) + RL worker discovery (`GET /v1/rl/workers`) for Dynamo-served inference.
- … (`qwen3_moe`/`glm_moe`), plus an NFS-safe filesystem broadcast path with `weight_broadcast.keep_recent`.
- … `{data, shape, start, dtype}` payload dtype-aware (uint8/uint16, normalizing uint16→int32 for the trainer; int32 fallback for >65535 experts), and the trainer replays the captured routing so recomputed logprobs match inference. Inference forwards `moe_backend` and auto-selects `triton` when router replay is enabled: the default FlashInfer fused MoE kernel bypasses the capture hook (→ all-zero routing), so a non-fused backend is required.
- … `compute_teacher_logprobs` by `renderer_transport` (vLLM generate vs Dynamo `nvext` TITO); stop sending `return_token_ids` for Dynamo compatibility.
- … `silu_mul_quant`, padded scrub.
- `Dockerfile.cuda.runtime` (vLLM 0.22, DeepGEMM) + `Dockerfile.dynamo`, helm chart updates, Dynamo k8s manifests (client example sets `backend=dynamo`), and `tools/dynamo` run/smoke scripts.
Type of Change
Review
Codex adversarial review: SIGN-OFF (head `1b5917a`). The 2 remaining review threads are non-`routed_experts` production-path follow-ups, flagged with fixes: weight-update pause retries, and broadcast `keep_recent` should be ≥ `orchestrator.max_off_policy_steps`.
Validation
3-GPU GB200 (1 inference + 2 FSDP trainer), Qwen3-30B-A3B-Thinking, router replay + `moe_backend=triton`: 10-step RL run with Mismatch KL 0.0002–0.0005 every step (faithful routing replay, no drift), no errors/OOM, stable memory.
Notes
Companion to PrimeIntellect-ai/verifiers#1574 and PrimeIntellect-ai/renderers#79 (the `dynamo_chat` TITO transport this orchestrator path drives). The deps commit repoints the `verifiers`/`renderers` submodules at `biswapanda` forks pending those PRs merging.
Note
High Risk
Touches NCCL weight broadcast, inference weight reload (E8M0/FP8), orchestrator–inference admin contracts, and large vLLM runtime patches; misconfiguration can break training sync or serving on GPU clusters.
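One misconfiguration the review threads call out is a filesystem-broadcast `keep_recent` smaller than `orchestrator.max_off_policy_steps`. A minimal sketch of a startup check for that constraint follows. The `SyncConfig` dataclass and `check_retention` function are hypothetical names used only for illustration; they are not part of prime-rl, and the real config objects may be shaped differently.

```python
from dataclasses import dataclass


@dataclass(frozen=True)
class SyncConfig:
    """Hypothetical container for the two settings named in the review."""

    keep_recent: int           # stands in for weight_broadcast.keep_recent
    max_off_policy_steps: int  # stands in for orchestrator.max_off_policy_steps


def check_retention(cfg: SyncConfig) -> None:
    """Raise if fewer checkpoints are retained than the off-policy window.

    Assumption: a rollout may lag the trainer by up to max_off_policy_steps,
    so at least that many recent weight checkpoints should stay on disk.
    """
    if cfg.keep_recent < cfg.max_off_policy_steps:
        raise ValueError(
            f"keep_recent={cfg.keep_recent} is smaller than "
            f"max_off_policy_steps={cfg.max_off_policy_steps}; "
            "checkpoints could be cleaned up before lagging workers load them"
        )


if __name__ == "__main__":
    check_retention(SyncConfig(keep_recent=4, max_off_policy_steps=4))  # passes
    try:
        check_retention(SyncConfig(keep_recent=2, max_off_policy_steps=4))
    except ValueError as err:
        print(f"rejected: {err}")
```

Failing fast at config load time is one option; the PR itself only records this constraint as a flagged follow-up.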
Overview
Adds NVIDIA Dynamo as an alternate inference backend (`client.backend`: `vllm` | `dynamo`) via an `AdminAPI` abstraction (`VLLMAdminAPI` vs `DynamoAdminAPI` on `/engine/*`), RL worker discovery (`GET /v1/rl/workers`, `rl_base_url`), and `renderer_transport=dynamo_chat` for nvext rollouts. Orchestrator stops defaulting `return_token_ids` for Dynamo; teacher logprobs dispatch on transport (vLLM generate vs Dynamo chat/nvext).
Weight sync & GB200: Filesystem broadcast gains configurable `keep_recent`, fsync-before-STABLE, and retention-aware cleanup; NCCL broadcast adds per-layer `dist.barrier` + CUDA sync. Inference reload handles DeepGEMM E8M0 scale layout; Qwen3 MoE can export vLLM kernel/FP8 weights. vLLM patches add int64 DeepGEMM SiLU/mul quant, fp32 lm-head idempotency, and dtype-aware routed_experts capture/replay (`moe_backend`, auto `triton` when router replay is on).
Deploy: New `Dockerfile.cuda.runtime` (cuda-dl-base devel for NVRTC/tilelang + `python3.12-dev`), Dynamo k8s examples (DGD, ConfigMap, Helm values with inference disabled), Helm chart extensions (ConfigMap mounts, `existingClaim`, DRA resource claims, tolerations/pull secrets), and `tools/dynamo` launch/smoke scripts.
Reviewed by Cursor Bugbot for commit 08bb4ea. Bugbot is set up for automated code reviews on this repo. Configure here.
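The filesystem broadcast behaviour summarized above (fsync before writing the STABLE marker, then retention-aware cleanup driven by `keep_recent`) can be sketched roughly as follows. This is an illustration under assumptions, not the PR's implementation: the `step_<n>` directory layout, the `weights.bin` file name, and the `publish_step` and `cleanup` helpers are all hypothetical, and the sketch assumes a POSIX filesystem where a directory can be opened and fsynced.

```python
import os
import shutil
import tempfile
from pathlib import Path


def fsync_path(path: Path) -> None:
    """Flush a file's or directory's metadata and data to stable storage."""
    fd = os.open(path, os.O_RDONLY)
    try:
        os.fsync(fd)
    finally:
        os.close(fd)


def publish_step(root: Path, step: int, payload: bytes) -> Path:
    """Write weights, fsync them, and only then create the STABLE marker."""
    step_dir = root / f"step_{step}"  # hypothetical layout
    step_dir.mkdir(parents=True, exist_ok=True)
    with open(step_dir / "weights.bin", "wb") as f:  # hypothetical file name
        f.write(payload)
        f.flush()
        os.fsync(f.fileno())
    fsync_path(step_dir)  # make the new directory entry durable
    (step_dir / "STABLE").touch()  # readers treat this as "safe to load"
    fsync_path(step_dir)
    return step_dir


def cleanup(root: Path, keep_recent: int) -> list[int]:
    """Delete all but the newest keep_recent stable steps; return removed steps."""
    steps = sorted(
        int(p.name.split("_")[1])
        for p in root.glob("step_*")
        if (p / "STABLE").exists()
    )
    # Assumption: keep_recent <= 0 means "retain everything".
    stale = steps[:-keep_recent] if keep_recent > 0 else []
    for step in stale:
        shutil.rmtree(root / f"step_{step}")
    return stale


if __name__ == "__main__":
    with tempfile.TemporaryDirectory() as tmp:
        root = Path(tmp)
        for s in range(5):
            publish_step(root, s, b"fake-weights")
        print(cleanup(root, keep_recent=2))
```

Writing the marker last means a reader on a shared filesystem never sees STABLE before the weight bytes have been flushed, which is the property "fsync-before-STABLE" appears to target.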