feat: consume native multimodal from the v1 trace (v0 + v1 VLM training)#2751
Draft
mikasenghaas wants to merge 2 commits into
trace_to_samples now unions each turn's TurnTokens.multi_modal_data into mm_kwargs (pixel_values/image_grid_thw EncodedTensors) and builds mm_token_type_ids from the renderer's placeholder->type map. The map is derived transiently from the renderer config (mm_token_type_id_map in utils) since the orchestrator keeps no renderer. Bumps verifiers to the mm-enabled bridge (depends on verifiers #1601).
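A minimal sketch of how a placeholder→type map could be turned into mm_token_type_ids. The function name, the token id 900, and the type id 1 are illustrative assumptions, not the actual mm_token_type_id_map code or real vocabulary values.

```python
def build_mm_token_type_ids(token_ids, placeholder_to_type):
    """Label each token with a modality type id.

    placeholder_to_type maps a placeholder token id to its type id
    (hypothetical values here); every other token is treated as text (0).
    """
    return [placeholder_to_type.get(tok, 0) for tok in token_ids]


# Hypothetical ids: 900 stands in for an image placeholder token, type 1 for "image".
example_ids = [11, 900, 900, 900, 12, 13]
example_types = build_mm_token_type_ids(example_ids, {900: 1})
```

With an empty map every token is labeled 0, which is the behavior one would expect for a text-only run.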
…mm packing
Registers the color-codeword-v1 taskset + a v1 debug config (configs/debug/v1/multimodal.toml), and makes _pack_mm_kwargs take each turn's *new* images: a v1 turn re-renders the whole prompt (cumulative multi_modal_data) while the v0 bridge ships deltas. Both resolve to one slot per image, aligned to the placeholder tokens (repeated colors kept by position). Bumps verifiers to the input-side + taskset (depends on #1601).
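The cumulative-versus-delta handling can be sketched as below. This is an illustration under stated assumptions, not the real _pack_mm_kwargs: the function name and the explicit cumulative flag are hypothetical, and images are stood in for by color strings.

```python
def collect_new_images(turns, cumulative):
    """Return one slot per image, in the order the images were introduced.

    turns: list of per-turn image lists.
    cumulative: True if each turn repeats all earlier images (v1-style
    re-render), False if each turn carries only its new images (v0-style delta).
    Images are matched by position, so repeated images are never merged.
    """
    slots = []
    for images in turns:
        # In cumulative mode the first len(slots) entries were already taken.
        new = images[len(slots):] if cumulative else images
        slots.extend(new)
    return slots


# The same conversation expressed both ways; "red" appears twice on purpose.
v1_turns = [["red"], ["red", "red"], ["red", "red", "blue"]]
v0_turns = [["red"], ["red"], ["blue"]]
v1_slots = collect_new_images(v1_turns, cumulative=True)
v0_slots = collect_new_images(v0_turns, cumulative=False)
```

Slicing by position rather than deduplicating by content is what keeps the two red squares in separate slots.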
Summary
The orchestrator/trainer-feeding side of native v1 multimodal, plus the v1 example wiring (pairs with verifiers #1601).
- trace_to_samples turns each turn's TurnTokens.multi_modal_data (vf.MMData) into mm_kwargs, concatenating each HF-processor kwarg (pixel_values, image_grid_thw) over the sample's images, and builds mm_token_type_ids from the renderer's placeholder→type map.
- A v1 turn re-renders the whole prompt (cumulative multi_modal_data) while the v0 bridge ships deltas; _pack_mm_kwargs contributes each image once, aligned to the placeholder tokens (which appear once, in the turn the image is introduced). Repeated images (e.g. two squares of the same color) keep distinct slots: they are matched by position, not deduped by hash.
- mm_token_type_id_map (in orchestrator/utils.py) derives the map transiently from the renderer config (the orchestrator keeps no renderer; the old self.renderer hook was dead). Gated on model.is_vlm, so text runs pay nothing.
- Registers color-codeword-v1 (pyproject) + configs/debug/v1/multimodal.toml (the v1 port of the multimodal debug config). Bumps deps/verifiers to the mm-enabled bridge + taskset.
Depends on
verifiers #1601 (native MMData types, renderer/bridge emission, image-input message types, the color-codeword-v1 taskset). The submodule is pinned to that branch tip; it'll re-pin to the merged feat/nano-as-v1 commit once #1601 lands.
Verification
- Training finished!, both trainer steps complete the M-RoPE path.
- MMData → mm_kwargs round-trip (dtype/shape/values) and the cumulative-vs-delta per-turn handling, including repeated colors.
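For reference, the per-kwarg concatenation mentioned in the Summary could look roughly like this. It is a pure-Python sketch: nested lists stand in for the EncodedTensors, the function name is hypothetical, and the row values are made up for illustration.

```python
def concat_mm_kwargs(per_image_kwargs):
    """Concatenate each processor kwarg over a sample's images, in order.

    per_image_kwargs: one dict per image, mapping a kwarg name
    (e.g. "pixel_values", "image_grid_thw") to a list of rows.
    Rows are appended along the first axis, so image order is preserved.
    """
    merged = {}
    for kwargs in per_image_kwargs:
        for name, rows in kwargs.items():
            merged.setdefault(name, []).extend(rows)
    return merged


# Two illustrative images with made-up values (not real processor output).
image_a = {"pixel_values": [[0.1, 0.2], [0.3, 0.4]], "image_grid_thw": [[1, 1, 2]]}
image_b = {"pixel_values": [[0.5, 0.6]], "image_grid_thw": [[1, 1, 1]]}
merged_kwargs = concat_mm_kwargs([image_a, image_b])
```

A round-trip check in this style would compare the merged rows against the per-image inputs, alongside dtype and shape in the tensor-based version.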