[diff-only] Feat/ephemeral mm pixels#2758
Draft
hubert-marek wants to merge 32 commits into
Draft
Conversation
The env worker now ships descriptor-only mm_data, so re-derive pixel_values in the orchestrator when building training samples: reprocess the offloaded (file://) window images via the renderer, matched by content hash with a grid_thw assert, before packing mm_kwargs. Thread the rollout renderer into interleave_rollout; trainer unchanged. _pack_mm_kwargs_from_renderer now normalizes decoded payloads via torch.as_tensor so reconstructed numpy pixels batch alongside the existing encoded-wire path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Point the submodules at the descriptor-only mm_data work: - renderers: pixels ephemeral on the rollout path + bridge no-mutate fix - verifiers: descriptor-only delta regression test Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Apply verifiers' native-thread limits (OMP/MKL/BLAS=1, MALLOC_ARENA_MAX=2, tokenizers parallelism off) to the orchestrator subprocess env and around the env-worker `process.start()`, so high-core hosts don't explode native thread teams + glibc arenas during multimodal image processing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…mp deps Reduce orchestrator memory on VLM RL batches when converting rollouts to training samples: - Bound the interleave fan-out with a semaphore (new `OrchestratorConfig.mm_materialize_concurrency`, default 4) so only N rollouts reconstruct pixels at once — caps the transient build spike that otherwise stacks across the whole batch. - Skip pixel materialization for filtered rollouts (their samples are dropped before the trainer), and auto-wire `VF_RENDERER_IMAGE_OFFLOAD_DIR` to the run's assets dir so the env-worker live offload needs no override. - `offload_images_to_disk` now also normalizes `file://` images (leave if already under the run dir, else copy in) with hashing aligned to the env-worker writer (sha256 of decoded bytes + media-type ext). - Drop the masked_scatter repro instrumentation. Bump deps/renderers -> 73345a2 (descriptor-aware hash-only/full serialization) and deps/verifiers -> bff4cb78 (lean ephemeral-MM policy + live image offload), so prime-rl imports the consolidated multimodal path with no PYTHONPATH override. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…erifiers The offloaded image path becomes a file:// URL; a relative --output-dir yielded a malformed URI (file://rel/...) that the renderer could not load, failing turn 0 of every multimodal rollout. Resolve to absolute in both offload-dir wiring spots. Bump deps/verifiers bff4cb78 -> 4112bc0a (same fix in _image_offload_dir + regression test). Validated end-to-end: orchestrator offloads 11 unique images and converts rollouts to training examples cleanly (smoke stage1c). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ir setdefault; bump deps/verifiers - Memory leak: the end-of-step cleanup deleted train_examples/training_batch but not 'results', which holds the same samples' mm_kwargs (full pixel byte-copies). That left the batch's pixels pinned past gc.collect()+malloc_trim, defeating the trim until 'results' was rebound a step later (multimodal-specific RSS ratchet). Add results + samples (and filter_df/timing_df) to the del. - Remove the VF_RENDERER_IMAGE_OFFLOAD_DIR setdefault: dead now that verifiers derives the offload dir from RUN_ID, and useless in prod anyway (env workers are separate pods that don't inherit os.environ). - Bump deps/verifiers 4112bc0a -> 91555323 (RUN_ID-derived offload dir). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- deps/renderers -> d6ed224 (shared mm_store module; feature offload default-on + placeholder_length self-repair). - deps/verifiers -> 7ec7169c (offload dir from mm_store, prompt-only offload; token-prefix delta-baseline for descriptor-only mm_data). Co-Authored-By: Codex <codex@openai.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…metric When defer_mm_materialization is on, the orchestrator ships lightweight image refs (MMRefs: descriptor + file:// uris) instead of heavy pixel mm_kwargs, and the trainer materializes pixels in its data loader: - transport: MMRefs struct (array_like, appended LAST), carried on TrainingSample + MicroBatch alongside mm_kwargs. - utils/mm.py: materialize_mm_refs (hash-deduped decode-once-populate-all-slots via the renderer), reconstruct/pack/encode helpers, defer-validation hook. - trainer: build the renderer once (VLM runs only), materialize in _micro_batch_to_tensor, carry mm_refs through packing/prepare_batch; always-on pre-forward all-reduce so a materialization error fails every rank before the forward collective; log time/mm_materialize + mm/images_materialized. Co-Authored-By: Codex <codex@openai.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…weep - trajectories: _collect_mm_refs normalizes the per-sample descriptor (strip pixel_values, grids -> nested lists, hashes -> str, image-only guard) and interleave_rollout(defer_materialization) ships MMRefs instead of materializing pixels; offload_images_to_disk writes under mm_store's assets/images subdir. - orchestrator: resolve a canonical asset_root = run_dir(RUN_ID) (falls back to output_dir locally) and use it for BOTH the image offload and the per-step TTL sweep, so the env worker, orchestrator, and sweeper share one /data/outputs/run_<RUN_ID>/assets tree (no re-copy, sweep covers features too). Also free per-step results/samples to drop the multimodal RSS ratchet. Co-Authored-By: Codex <codex@openai.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
serving_tokens parses mmfile:v1 refs (via mm_store.split_mmfile_ref), validates the artifact (safe-id regexes, slot modality/hash match, version-pinned fingerprint compat) and loads the processed MultiModalKwargsItem from /data/outputs/run_<RUN_ID>/assets/mm_features, with structured errors on a missing/invalid artifact. Shares the format + envelope helpers with the writer via renderers.mm_store. Co-Authored-By: Codex <codex@openai.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…scoped) - defer_mm_materialization defaults to True on both orchestrator + trainer; the renderer requirement is scoped to VLM runs (model.vlm is not None) so text-only runs are a no-op. Trainer renderer defaults to AutoRendererConfig() (mirrors the orchestrator); SFT force-sets defer off. - mm_artifact_ttl_seconds (default 3600s) drives the orchestrator's per-step sweep. Co-Authored-By: Codex <codex@openai.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…el.vlm
The orchestrator ships mm_refs based on whether rollouts have images, not on the
trainer's optional [model.vlm] block — which hosted VLM configs leave unset
(model.vlm=None). The previous gate (renderer only when model.vlm is not None)
discarded the auto-resolved renderer, so the trainer built none and crashed on
the first mm_refs ("trainer has no renderer"). Pass config.renderer
unconditionally; data.py still builds only when defer is on and a renderer is
configured (default AutoRendererConfig; explicit renderer=None still opts out).
Co-Authored-By: Codex <codex@openai.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The bare `python:3.12-slim` tag tracks Debian stable, which moved to trixie (glibc ~2.41). A rebuild drifted the runtime glibc, and at training time the FLA gated-delta-rule backward JIT-compiles a TileLang CUDA kernel via the hostPath-mounted node nvcc — which then fails with `bits/mathcalls.h: exception specification is incompatible ... cospi/sinpi` (new glibc's noexcept decls vs the CUDA host math headers). Pinning bookworm (glibc 2.36) matches the ubuntu22.04 builder's glibc 2.35 and the known-good CUDA 12.x combo. Runtime-only change; deps unchanged. Co-Authored-By: Codex <codex@openai.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Stop sweeping raw images; only evict the expensive mm_features cache. Images are the run's recoverable source of truth (terminal, non-regenerable) and the trainer rebuilds pixels from them, so they are kept; features are regenerable and disposable. Drop mm_artifact_ttl_seconds 7200 -> 1800: the horizon only needs to exceed the write->vLLM-admit window (seconds), so 30 min is a large safety margin against racing in-flight reads while keeping disk bounded. offload_images_to_disk now refreshes mtime on the skip-if-exists path so all content-addressed writers (images + features) share last-use semantics, making a future image sweep safe by default. Bumps deps/renderers (features-only sweep + feature-writer mtime) and deps/verifiers (image-writer mtime). Co-Authored-By: Codex <codex@openai.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Before the image-embed masked_scatter, assert that the image-token count equals the image-feature count and raise a descriptive ValueError (token id, both counts, input_ids/pixel_values/grid shapes) when they differ. A mismatch otherwise surfaces as an opaque CUDA masked_scatter device-side assert that is near-impossible to diagnose. This is a loud tripwire, not a masking guard: it fails fast with the context needed to find the bad mm sidecar. Co-Authored-By: Codex <codex@openai.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Cover the cross-repo contract with the verifiers trajectory delta encoder: when a stale-prefix step-back starts a new TrainingSample, prime-rl will not merge it into the prior sample, so its mm_data delta must carry the full cumulative window. Guards against a partial delta that would drop images on step-back. Co-Authored-By: Codex <codex@openai.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Allow multimodal samples to share a packed microbatch (gated by trainer.pack_multimodal) instead of each MM sample getting its own. Packing is enabled only where it is provably correct: the model advertises the pass_1d position strategy, flash-attn varlen is active (block-diagonal cu_seqlens from per-sample-reset positions isolates samples), context parallelism is off, and a single run per microbatch (the MoE LoRA path applies one adapter per microbatch). MM samples still never pack with text (FSDP per-step modality invariant) nor mix refs/kwargs sidecars; sidecars are concatenated in sample order. Cleanups to the packing primitive: - drop the redundant EncodedTensor payload-length validators (a malformed payload already fails loudly downstream at frombuffer(...).reshape); keep the dtype/shape concat precondition. - dedup the sidecar dispatch in _append_micro_batch (the kind match is already guaranteed by _can_pack_sample). Bumps deps/renderers and deps/verifiers to their OOM-hunt-cleanup heads. Co-Authored-By: Codex <codex@openai.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… + coordinated fail-fast Ship packed per-rank micro-batches (trainer master -> data ranks) over ZMQ instead of the shared filesystem. Hardened for multi-node and made fail-fast: - bind 0.0.0.0 / connect MASTER_ADDR (or configured host) so PUB/SUB + the READY barrier work across nodes; port+1 = data, port+2 = startup READY barrier. - bounded timeouts (recv/ready/publish) replace infinite blocking, with a publish_grace to reduce PUB/SUB slow-joiner races at startup. - per-message step tag + receiver mismatch assertion: a dropped message becomes a loud crash, never silent training on misaligned data. - torch.distributed Store publish-status barrier (master -> ranks) plus an all-reduce(MAX) fail-fast (any rank -> all): a master pack failure or a single rank's transport failure crashes all ranks coordinated instead of hanging on a collective. synchronize_state runs only after that barrier. micro_batch_transport defaults to ZMQ; rollouts (orchestrator -> trainer) stay filesystem. Adds unit fan-out/timeout/step-mismatch tests and a torchrun 2-rank integration smoke. Residual risk is multi-node PUB/SUB startup; validate the slow-joiner grace on the real topology. Co-Authored-By: Codex <codex@openai.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
After image offload, logged samples carry file:// URIs pointing at local trainer/orchestrator disk, which the Prime platform can't resolve (broken images in the dashboard). When building the sample parquet, re-inline local image files as data: URLs so the dashboard can display them. Purely the platform sample-logging path; does not touch training, offload, or the materialize path. Guarded: file://+image-mime only, 2 MB cap, OSError -> skip (handles a swept/missing file), per-call dedup cache, and the original rollout is not mutated. Pairs path.as_uri() (offload) with urlparse/unquote (monitor) so file:// URIs round-trip. Co-Authored-By: Codex <codex@openai.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Conservative one-time startup grace for the first prod runs: gives PUB/SUB subscriptions ample time to propagate before the first publish, avoiding step-0 slow-joiner drops (which would crash-loop at startup). It is a one-time cost at job start, so erring large is essentially free; dial down once a topology is observed to start cleanly. Co-Authored-By: Codex <codex@openai.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tch wait The worker wait on the master's micro-batch publish status was bounded by a fixed timeout (publish_timeout_seconds=1800), but the master sets that status only after pack() — which blocks on the orchestrator for generation. Slow multimodal steps (observed 1178-3137s) legitimately exceed 1800s, so every rank timed out -> coordinated crash -> deterministic crash-loop on the slow step. The old filesystem path had no such timeout and simply idled. Wait on the publish key with no deadline instead. Liveness is already covered: a wedged master is killed by the packer watchdog (-> torchrun tears down the group) and a master pack error sets ok=False for an immediate coordinated fail. Genuine ZMQ delivery stays bounded by the receiver's recv_timeout once published. Net deletes the publish_timeout_seconds knob and its plumbing. Co-Authored-By: Codex <codex@openai.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Large multimodal sample parquets (inlined base64 images) were PUT to R2 as one in-RAM body, failing on a memory-tight orchestrator with WriteTimeout / OpenSSL [BUF] malloc failure. Gated on VLM runs (run_config model/student.model is_vlm), write the parquet straight to a disk-backed temp file (co-located with output_dir to avoid tmpfs) and stream it: an async byte iterator with an explicit Content-Length keeps the presigned PUT on Content-Length (httpx drops chunked transfer-encoding) and bounds each TLS write; a generous write timeout replaces the flat 30s; concurrent streamed uploads are serialized; the temp file is unlinked on every exit. Text runs keep the inline byte PUT unchanged. Co-Authored-By: Codex <codex@openai.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Both submodules merged origin/main on their feat/ephemeral-mm-pixels branches (renderers f7696cd, verifiers 00d83204); these supersede main's pins and carry our mm work on top. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> # Conflicts: # deps/renderers # deps/verifiers # src/prime_rl/orchestrator/orchestrator.py # src/prime_rl/orchestrator/trajectories.py # src/prime_rl/trainer/rl/train.py
Picks up: no-VCS `fallback-version` so the Docker editable build resolves (fixes `uv sync` LookupError on verifiers), plus worldsims #1557 — multimodal tool content + reasoning_content passthrough in the in-sandbox base runner. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds raw-image ref handling to the token serving route: parse mmraw refs, load the raw image from shared disk, reprocess via the HF image processor + vLLM field factory, and validate hash/fingerprint/grid/placeholder before caching. Bumps the renderers submodule to the matching emit-side commit. Co-Authored-By: Codex <codex@openai.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The auto formula (max(1, ceil(max_inflight_rollouts/256))) gave a single env worker for any max_inflight <= 256, so one worker absorbed all in-flight rollouts; its event loop saturated during concurrent sandbox setup and missed the 30s heartbeat -> EnvRouter restart loop (surfaces as 'env server unhealthy'). Fix the train resolver to a fixed pool of 32 (explicit per-env/group values still win) so setup load spreads across workers. Requires bumping the orchestrator pod's memory/CPU to host 32 workers. Co-Authored-By: Codex <codex@openai.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
R2 sample uploads still hit OpenSSL [BUF] malloc failure (OOM) because the streaming path was gated on a manual config flag (`student.model.vlm`/`is_vlm`) that is silently off when `[model.vlm]` isn't set, and because the parquet was built as a full in-RAM Arrow table before spilling. Always build straight to a disk-backed temp file via ParquetWriter, one rollout at a time, with a per-rollout image cache so base64-inlined screenshots are never all held in RAM; always stream the upload (Content-Length, chunked TLS writes). Removes the fragile config gate and the in-RAM table/bytes build. Co-Authored-By: Codex <codex@openai.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This reverts commit 7414cd7. The auto->32 train default was dead code: the earlier num_workers resolver (num_examples == -1 -> 4) sets the value before the auto->32 check runs, so it never applied (train envs still got ~4 workers -> heartbeat timeouts). The correct lever is now per-run run_config.env_server.num_workers (platform #2838, bounds 1-64), which sets it explicitly and bypasses both auto resolvers. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… after tokenization (#2752) * fix(orchestrator): drop per-step prompt arrays from buffered rollouts after tokenization Every trajectory step ships prompt_ids/prompt_mask for its ENTIRE prompt prefix — O(turns x context) boxed ints, 100-370MB per long browser rollout — but their only readers run at arrival (backfill_rollout_tokens / interleave_rollout). The rollout then buffers in pending_groups (sibling wait) and pending_batch until its batch ships, so at the ~250 buffered rollouts observed on worldsims gflights runs this dead weight alone overflows a 128GB orchestrator. Strip the prompt arrays at the arrival boundary in TrainSink.add, mirroring the offload_images_to_disk treatment of image bytes: - keep a num_prompt_tokens count (the prime monitor's per-step num_input_tokens stat reads it, falling back to len(prompt_ids)) - keep completion arrays on every step (entropy/rare-token filters scan them) - keep the final step whole (the wandb sample table decodes it) save_rollouts already excludes the trajectory from disk artifacts on both the train and eval paths, so on-disk output is unchanged. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * style: ruff format monitor num_input_tokens expression Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> --------- Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
…oad fallback) (#2753) renderers#82: raw-image layout no longer hard-fails on hosted workers when the model id misses the local HF cache — falls back to hf_hub_download for preprocessor_config.json. Explicit image_* renderer-config overrides remain the hermetic path; this is the safety net for configs without them. Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
(cherry picked from commit 77b8567) Co-authored-by: Christian <cdreetz@gmail.com>
| try: | ||
| for port in (base + 1, base + 2): | ||
| sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM) | ||
| sock.bind(("0.0.0.0", port)) |
| try: | ||
| for port in (base + 1, base + 2): | ||
| sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM) | ||
| sock.bind(("0.0.0.0", port)) |
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.
No description provided.