
feat: consume native multimodal from the v1 trace (v0 + v1 VLM training)#2751

Draft
mikasenghaas wants to merge 2 commits into feat/nano-as-v1 from fix/v0-multimodal


@mikasenghaas mikasenghaas commented Jun 10, 2026


Summary

The orchestrator/trainer-feeding side of native v1 multimodal, plus the v1 example wiring (pairs with verifiers #1601).

  • mm consumer. trace_to_samples turns each turn's TurnTokens.multi_modal_data (vf.MMData) into mm_kwargs — concatenating each HF-processor kwarg (pixel_values, image_grid_thw) over the sample's images — and builds mm_token_type_ids from the renderer's placeholder→type map.
  • Per-turn delta. A native v1 turn re-renders the whole prompt (cumulative multi_modal_data) while the v0 bridge ships deltas; _pack_mm_kwargs contributes each image once, aligned to the placeholder tokens (which appear once, in the turn the image is introduced). Repeated images (e.g. two squares of the same color) keep distinct slots — matched by position, not deduped by hash.
  • Type-id map. mm_token_type_id_map (in orchestrator/utils.py) derives the map transiently from the renderer config (the orchestrator keeps no renderer; the old self.renderer hook was dead). Gated on model.is_vlm, so text runs pay nothing.
  • v1 example. Registers color-codeword-v1 (pyproject) + configs/debug/v1/multimodal.toml (the v1 port of the multimodal debug config). Bumps deps/verifiers to the mm-enabled bridge + taskset.
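
To make the consumer and type-id bullets concrete, here is a minimal sketch of the two operations they describe: concatenating each processor kwarg across a sample's images, and labelling tokens from a placeholder→type map. It is not the code in this PR. The helper names (concat_mm_kwargs, token_type_ids), the placeholder id 9, and the convention that text tokens get type 0 and image placeholders type 1 are all assumptions made for illustration, and plain lists stand in for the encoded tensors.

```python
# Hypothetical helpers; names and values are illustrative, not from the PR.


def concat_mm_kwargs(per_image: list[dict[str, list]]) -> dict[str, list]:
    """Concatenate each kwarg along its first axis, in image order."""
    out: dict[str, list] = {}
    for kwargs in per_image:
        for name, rows in kwargs.items():
            out.setdefault(name, []).extend(rows)
    return out


def token_type_ids(token_ids: list[int], type_map: dict[int, int]) -> list[int]:
    """Label each token with its type; tokens not in the map get 0 (assumed text)."""
    return [type_map.get(tok, 0) for tok in token_ids]


# Two images: the first has a 1x2x2 patch grid (4 rows), the second 1x1x2 (2 rows).
per_image = [
    {"pixel_values": [[0.1], [0.2], [0.3], [0.4]], "image_grid_thw": [[1, 2, 2]]},
    {"pixel_values": [[0.5], [0.6]], "image_grid_thw": [[1, 1, 2]]},
]
merged = concat_mm_kwargs(per_image)

# Assumed placeholder token id 9 mapped to type 1 (image).
type_ids = token_type_ids([1, 9, 9, 2], {9: 1})
```

In this sketch merged["pixel_values"] holds six rows, matching the total patch count implied by the two grids, and merged["image_grid_thw"] keeps one row per image.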

Depends on

verifiers #1601 (native MMData types, renderer/bridge emission, image-input message types, the color-codeword-v1 taskset). The submodule is pinned to that branch tip; it'll re-pin to the merged feat/nano-as-v1 commit once #1601 lands.

Verification

  • v1 multimodal (color-codeword-v1, Qwen3-VL-4B, native): rollout reward ~0.83 (the VLM reads the squares), the run ends with "Training finished!", and both trainer steps complete the M-RoPE path.
  • v0 multimodal (color-codeword via bridge): re-verified through the shared delta-aware packing — trains cleanly (no regression).
  • Unit-tested the renderer→MMData→mm_kwargs round-trip (dtype/shape/values) and the cumulative-vs-delta per-turn handling, including repeated colors.
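
The cumulative-vs-delta case can be illustrated with a small sketch. It assumes each turn is tagged as cumulative or delta, which may differ from how _pack_mm_kwargs actually tells the two apart; Turn and pack_new_images are hypothetical names, and strings stand in for image payloads. The point it shows is that both shapes of trace resolve to the same slots, and that a repeated image keeps its own slot because matching is by position.

```python
from dataclasses import dataclass, field


@dataclass
class Turn:
    # Hypothetical stand-in for a turn's multimodal payload: one entry per image.
    images: list[str] = field(default_factory=list)
    # True when the list covers every image in the prompt so far (assumed flag).
    cumulative: bool = False


def pack_new_images(turns: list[Turn]) -> list[str]:
    """Return one slot per image, in the order the images were introduced."""
    slots: list[str] = []
    for turn in turns:
        if turn.cumulative:
            # A cumulative turn repeats earlier images; skip those already packed.
            new = turn.images[len(slots):]
        else:
            # A delta turn carries only the images it introduces.
            new = turn.images
        # Append by position; identical images are not merged.
        slots.extend(new)
    return slots


# The same conversation in both encodings: red, then a second red and a blue.
cumulative_turns = [Turn(["red"], True), Turn(["red", "red", "blue"], True)]
delta_turns = [Turn(["red"]), Turn(["red", "blue"])]

cumulative_slots = pack_new_images(cumulative_turns)
delta_slots = pack_new_images(delta_turns)
```

A hash-based dedupe would collapse the two red squares into one slot and leave the placeholder tokens with one image too few, which is why the sketch slices by count instead.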

trace_to_samples now unions each turn's TurnTokens.multi_modal_data into mm_kwargs
(pixel_values/image_grid_thw EncodedTensors) and builds mm_token_type_ids from the
renderer's placeholder->type map. The map is derived transiently from the renderer config
(mm_token_type_id_map in utils) since the orchestrator keeps no renderer. Bumps verifiers
to the mm-enabled bridge (depends on verifiers #1601).
…mm packing

Registers the color-codeword-v1 taskset + a v1 debug config (configs/debug/v1/multimodal.toml),
and makes _pack_mm_kwargs take each turn's *new* images: a v1 turn re-renders the whole prompt
(cumulative multi_modal_data) while the v0 bridge ships deltas — both resolve to one slot per
image, aligned to the placeholder tokens (repeated colors kept by position). Bumps verifiers to
the input-side + taskset (depends on #1601).