[diff-only] Feat/ephemeral mm pixels by hubert-marek · Pull Request #2758 · PrimeIntellect-ai/prime-rl

hubert-marek · 2026-06-10T08:34:00Z

No description provided.

The env worker now ships descriptor-only mm_data, so re-derive pixel_values in the orchestrator when building training samples: reprocess the offloaded (file://) window images via the renderer, matched by content hash with a grid_thw assert, before packing mm_kwargs. Thread the rollout renderer into interleave_rollout; trainer unchanged. _pack_mm_kwargs_from_renderer now normalizes decoded payloads via torch.as_tensor so reconstructed numpy pixels batch alongside the existing encoded-wire path. Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Point the submodules at the descriptor-only mm_data work: - renderers: pixels ephemeral on the rollout path + bridge no-mutate fix - verifiers: descriptor-only delta regression test Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>

Apply verifiers' native-thread limits (OMP/MKL/BLAS=1, MALLOC_ARENA_MAX=2, tokenizers parallelism off) to the orchestrator subprocess env and around the env-worker `process.start()`, so high-core hosts don't explode native thread teams + glibc arenas during multimodal image processing. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…mp deps Reduce orchestrator memory on VLM RL batches when converting rollouts to training samples: - Bound the interleave fan-out with a semaphore (new `OrchestratorConfig.mm_materialize_concurrency`, default 4) so only N rollouts reconstruct pixels at once — caps the transient build spike that otherwise stacks across the whole batch. - Skip pixel materialization for filtered rollouts (their samples are dropped before the trainer), and auto-wire `VF_RENDERER_IMAGE_OFFLOAD_DIR` to the run's assets dir so the env-worker live offload needs no override. - `offload_images_to_disk` now also normalizes `file://` images (leave if already under the run dir, else copy in) with hashing aligned to the env-worker writer (sha256 of decoded bytes + media-type ext). - Drop the masked_scatter repro instrumentation. Bump deps/renderers -> 73345a2 (descriptor-aware hash-only/full serialization) and deps/verifiers -> bff4cb78 (lean ephemeral-MM policy + live image offload), so prime-rl imports the consolidated multimodal path with no PYTHONPATH override. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…erifiers The offloaded image path becomes a file:// URL; a relative --output-dir yielded a malformed URI (file://rel/...) that the renderer could not load, failing turn 0 of every multimodal rollout. Resolve to absolute in both offload-dir wiring spots. Bump deps/verifiers bff4cb78 -> 4112bc0a (same fix in _image_offload_dir + regression test). Validated end-to-end: orchestrator offloads 11 unique images and converts rollouts to training examples cleanly (smoke stage1c). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…ir setdefault; bump deps/verifiers - Memory leak: the end-of-step cleanup deleted train_examples/training_batch but not 'results', which holds the same samples' mm_kwargs (full pixel byte-copies). That left the batch's pixels pinned past gc.collect()+malloc_trim, defeating the trim until 'results' was rebound a step later (multimodal-specific RSS ratchet). Add results + samples (and filter_df/timing_df) to the del. - Remove the VF_RENDERER_IMAGE_OFFLOAD_DIR setdefault: dead now that verifiers derives the offload dir from RUN_ID, and useless in prod anyway (env workers are separate pods that don't inherit os.environ). - Bump deps/verifiers 4112bc0a -> 91555323 (RUN_ID-derived offload dir). Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

- deps/renderers -> d6ed224 (shared mm_store module; feature offload default-on + placeholder_length self-repair). - deps/verifiers -> 7ec7169c (offload dir from mm_store, prompt-only offload; token-prefix delta-baseline for descriptor-only mm_data). Co-Authored-By: Codex <codex@openai.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…metric When defer_mm_materialization is on, the orchestrator ships lightweight image refs (MMRefs: descriptor + file:// uris) instead of heavy pixel mm_kwargs, and the trainer materializes pixels in its data loader: - transport: MMRefs struct (array_like, appended LAST), carried on TrainingSample + MicroBatch alongside mm_kwargs. - utils/mm.py: materialize_mm_refs (hash-deduped decode-once-populate-all-slots via the renderer), reconstruct/pack/encode helpers, defer-validation hook. - trainer: build the renderer once (VLM runs only), materialize in _micro_batch_to_tensor, carry mm_refs through packing/prepare_batch; always-on pre-forward all-reduce so a materialization error fails every rank before the forward collective; log time/mm_materialize + mm/images_materialized. Co-Authored-By: Codex <codex@openai.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…weep - trajectories: _collect_mm_refs normalizes the per-sample descriptor (strip pixel_values, grids -> nested lists, hashes -> str, image-only guard) and interleave_rollout(defer_materialization) ships MMRefs instead of materializing pixels; offload_images_to_disk writes under mm_store's assets/images subdir. - orchestrator: resolve a canonical asset_root = run_dir(RUN_ID) (falls back to output_dir locally) and use it for BOTH the image offload and the per-step TTL sweep, so the env worker, orchestrator, and sweeper share one /data/outputs/run_<RUN_ID>/assets tree (no re-copy, sweep covers features too). Also free per-step results/samples to drop the multimodal RSS ratchet. Co-Authored-By: Codex <codex@openai.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

serving_tokens parses mmfile:v1 refs (via mm_store.split_mmfile_ref), validates the artifact (safe-id regexes, slot modality/hash match, version-pinned fingerprint compat) and loads the processed MultiModalKwargsItem from /data/outputs/run_<RUN_ID>/assets/mm_features, with structured errors on a missing/invalid artifact. Shares the format + envelope helpers with the writer via renderers.mm_store. Co-Authored-By: Codex <codex@openai.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…scoped) - defer_mm_materialization defaults to True on both orchestrator + trainer; the renderer requirement is scoped to VLM runs (model.vlm is not None) so text-only runs are a no-op. Trainer renderer defaults to AutoRendererConfig() (mirrors the orchestrator); SFT force-sets defer off. - mm_artifact_ttl_seconds (default 3600s) drives the orchestrator's per-step sweep. Co-Authored-By: Codex <codex@openai.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…el.vlm The orchestrator ships mm_refs based on whether rollouts have images, not on the trainer's optional [model.vlm] block — which hosted VLM configs leave unset (model.vlm=None). The previous gate (renderer only when model.vlm is not None) discarded the auto-resolved renderer, so the trainer built none and crashed on the first mm_refs ("trainer has no renderer"). Pass config.renderer unconditionally; data.py still builds only when defer is on and a renderer is configured (default AutoRendererConfig; explicit renderer=None still opts out). Co-Authored-By: Codex <codex@openai.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The bare `python:3.12-slim` tag tracks Debian stable, which moved to trixie (glibc ~2.41). A rebuild drifted the runtime glibc, and at training time the FLA gated-delta-rule backward JIT-compiles a TileLang CUDA kernel via the hostPath-mounted node nvcc — which then fails with `bits/mathcalls.h: exception specification is incompatible ... cospi/sinpi` (new glibc's noexcept decls vs the CUDA host math headers). Pinning bookworm (glibc 2.36) matches the ubuntu22.04 builder's glibc 2.35 and the known-good CUDA 12.x combo. Runtime-only change; deps unchanged. Co-Authored-By: Codex <codex@openai.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Stop sweeping raw images; only evict the expensive mm_features cache. Images are the run's recoverable source of truth (terminal, non-regenerable) and the trainer rebuilds pixels from them, so they are kept; features are regenerable and disposable. Drop mm_artifact_ttl_seconds 7200 -> 1800: the horizon only needs to exceed the write->vLLM-admit window (seconds), so 30 min is a large safety margin against racing in-flight reads while keeping disk bounded. offload_images_to_disk now refreshes mtime on the skip-if-exists path so all content-addressed writers (images + features) share last-use semantics, making a future image sweep safe by default. Bumps deps/renderers (features-only sweep + feature-writer mtime) and deps/verifiers (image-writer mtime). Co-Authored-By: Codex <codex@openai.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Before the image-embed masked_scatter, assert that the image-token count equals the image-feature count and raise a descriptive ValueError (token id, both counts, input_ids/pixel_values/grid shapes) when they differ. A mismatch otherwise surfaces as an opaque CUDA masked_scatter device-side assert that is near-impossible to diagnose. This is a loud tripwire, not a masking guard: it fails fast with the context needed to find the bad mm sidecar. Co-Authored-By: Codex <codex@openai.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Cover the cross-repo contract with the verifiers trajectory delta encoder: when a stale-prefix step-back starts a new TrainingSample, prime-rl will not merge it into the prior sample, so its mm_data delta must carry the full cumulative window. Guards against a partial delta that would drop images on step-back. Co-Authored-By: Codex <codex@openai.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Allow multimodal samples to share a packed microbatch (gated by trainer.pack_multimodal) instead of each MM sample getting its own. Packing is enabled only where it is provably correct: the model advertises the pass_1d position strategy, flash-attn varlen is active (block-diagonal cu_seqlens from per-sample-reset positions isolates samples), context parallelism is off, and a single run per microbatch (the MoE LoRA path applies one adapter per microbatch). MM samples still never pack with text (FSDP per-step modality invariant) nor mix refs/kwargs sidecars; sidecars are concatenated in sample order. Cleanups to the packing primitive: - drop the redundant EncodedTensor payload-length validators (a malformed payload already fails loudly downstream at frombuffer(...).reshape); keep the dtype/shape concat precondition. - dedup the sidecar dispatch in _append_micro_batch (the kind match is already guaranteed by _can_pack_sample). Bumps deps/renderers and deps/verifiers to their OOM-hunt-cleanup heads. Co-Authored-By: Codex <codex@openai.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… + coordinated fail-fast Ship packed per-rank micro-batches (trainer master -> data ranks) over ZMQ instead of the shared filesystem. Hardened for multi-node and made fail-fast: - bind 0.0.0.0 / connect MASTER_ADDR (or configured host) so PUB/SUB + the READY barrier work across nodes; port+1 = data, port+2 = startup READY barrier. - bounded timeouts (recv/ready/publish) replace infinite blocking, with a publish_grace to reduce PUB/SUB slow-joiner races at startup. - per-message step tag + receiver mismatch assertion: a dropped message becomes a loud crash, never silent training on misaligned data. - torch.distributed Store publish-status barrier (master -> ranks) plus an all-reduce(MAX) fail-fast (any rank -> all): a master pack failure or a single rank's transport failure crashes all ranks coordinated instead of hanging on a collective. synchronize_state runs only after that barrier. micro_batch_transport defaults to ZMQ; rollouts (orchestrator -> trainer) stay filesystem. Adds unit fan-out/timeout/step-mismatch tests and a torchrun 2-rank integration smoke. Residual risk is multi-node PUB/SUB startup; validate the slow-joiner grace on the real topology. Co-Authored-By: Codex <codex@openai.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

After image offload, logged samples carry file:// URIs pointing at local trainer/orchestrator disk, which the Prime platform can't resolve (broken images in the dashboard). When building the sample parquet, re-inline local image files as data: URLs so the dashboard can display them. Purely the platform sample-logging path; does not touch training, offload, or the materialize path. Guarded: file://+image-mime only, 2 MB cap, OSError -> skip (handles a swept/missing file), per-call dedup cache, and the original rollout is not mutated. Pairs path.as_uri() (offload) with urlparse/unquote (monitor) so file:// URIs round-trip. Co-Authored-By: Codex <codex@openai.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Conservative one-time startup grace for the first prod runs: gives PUB/SUB subscriptions ample time to propagate before the first publish, avoiding step-0 slow-joiner drops (which would crash-loop at startup). It is a one-time cost at job start, so erring large is essentially free; dial down once a topology is observed to start cleanly. Co-Authored-By: Codex <codex@openai.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

…tch wait The worker wait on the master's micro-batch publish status was bounded by a fixed timeout (publish_timeout_seconds=1800), but the master sets that status only after pack() — which blocks on the orchestrator for generation. Slow multimodal steps (observed 1178-3137s) legitimately exceed 1800s, so every rank timed out -> coordinated crash -> deterministic crash-loop on the slow step. The old filesystem path had no such timeout and simply idled. Wait on the publish key with no deadline instead. Liveness is already covered: a wedged master is killed by the packer watchdog (-> torchrun tears down the group) and a master pack error sets ok=False for an immediate coordinated fail. Genuine ZMQ delivery stays bounded by the receiver's recv_timeout once published. Net deletes the publish_timeout_seconds knob and its plumbing. Co-Authored-By: Codex <codex@openai.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Large multimodal sample parquets (inlined base64 images) were PUT to R2 as one in-RAM body, failing on a memory-tight orchestrator with WriteTimeout / OpenSSL [BUF] malloc failure. Gated on VLM runs (run_config model/student.model is_vlm), write the parquet straight to a disk-backed temp file (co-located with output_dir to avoid tmpfs) and stream it: an async byte iterator with an explicit Content-Length keeps the presigned PUT on Content-Length (httpx drops chunked transfer-encoding) and bounds each TLS write; a generous write timeout replaces the flat 30s; concurrent streamed uploads are serialized; the temp file is unlinked on every exit. Text runs keep the inline byte PUT unchanged. Co-Authored-By: Codex <codex@openai.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Both submodules merged origin/main on their feat/ephemeral-mm-pixels branches (renderers f7696cd, verifiers 00d83204); these supersede main's pins and carry our mm work on top. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com> # Conflicts: # deps/renderers # deps/verifiers # src/prime_rl/orchestrator/orchestrator.py # src/prime_rl/orchestrator/trajectories.py # src/prime_rl/trainer/rl/train.py

Picks up: no-VCS `fallback-version` so the Docker editable build resolves (fixes `uv sync` LookupError on verifiers), plus worldsims #1557 — multimodal tool content + reasoning_content passthrough in the in-sandbox base runner. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

Adds raw-image ref handling to the token serving route: parse mmraw refs, load the raw image from shared disk, reprocess via the HF image processor + vLLM field factory, and validate hash/fingerprint/grid/placeholder before caching. Bumps the renderers submodule to the matching emit-side commit. Co-Authored-By: Codex <codex@openai.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

The auto formula (max(1, ceil(max_inflight_rollouts/256))) gave a single env worker for any max_inflight <= 256, so one worker absorbed all in-flight rollouts; its event loop saturated during concurrent sandbox setup and missed the 30s heartbeat -> EnvRouter restart loop (surfaces as 'env server unhealthy'). Fix the train resolver to a fixed pool of 32 (explicit per-env/group values still win) so setup load spreads across workers. Requires bumping the orchestrator pod's memory/CPU to host 32 workers. Co-Authored-By: Codex <codex@openai.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

R2 sample uploads still hit OpenSSL [BUF] malloc failure (OOM) because the streaming path was gated on a manual config flag (`student.model.vlm`/`is_vlm`) that is silently off when `[model.vlm]` isn't set, and because the parquet was built as a full in-RAM Arrow table before spilling. Always build straight to a disk-backed temp file via ParquetWriter, one rollout at a time, with a per-rollout image cache so base64-inlined screenshots are never all held in RAM; always stream the upload (Content-Length, chunked TLS writes). Removes the fragile config gate and the in-RAM table/bytes build. Co-Authored-By: Codex <codex@openai.com> Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

This reverts commit 7414cd7. The auto->32 train default was dead code: the earlier num_workers resolver (num_examples == -1 -> 4) sets the value before the auto->32 check runs, so it never applied (train envs still got ~4 workers -> heartbeat timeouts). The correct lever is now per-run run_config.env_server.num_workers (platform #2838, bounds 1-64), which sets it explicitly and bypasses both auto resolvers. Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

… after tokenization (#2752) * fix(orchestrator): drop per-step prompt arrays from buffered rollouts after tokenization Every trajectory step ships prompt_ids/prompt_mask for its ENTIRE prompt prefix — O(turns x context) boxed ints, 100-370MB per long browser rollout — but their only readers run at arrival (backfill_rollout_tokens / interleave_rollout). The rollout then buffers in pending_groups (sibling wait) and pending_batch until its batch ships, so at the ~250 buffered rollouts observed on worldsims gflights runs this dead weight alone overflows a 128GB orchestrator. Strip the prompt arrays at the arrival boundary in TrainSink.add, mirroring the offload_images_to_disk treatment of image bytes: - keep a num_prompt_tokens count (the prime monitor's per-step num_input_tokens stat reads it, falling back to len(prompt_ids)) - keep completion arrays on every step (entropy/rare-token filters scan them) - keep the final step whole (the wandb sample table decodes it) save_rollouts already excludes the trajectory from disk artifacts on both the train and eval paths, so on-disk output is unchanged. Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> * style: ruff format monitor num_input_tokens expression Co-Authored-By: Claude Fable 5 <noreply@anthropic.com> --------- Co-authored-by: Claude Fable 5 <noreply@anthropic.com>

…oad fallback) (#2753) renderers#82: raw-image layout no longer hard-fails on hosted workers when the model id misses the local HF cache — falls back to hf_hub_download for preprocessor_config.json. Explicit image_* renderer-config overrides remain the hermetic path; this is the safety net for configs without them. Co-authored-by: Claude Fable 5 <noreply@anthropic.com>

(cherry picked from commit 77b8567) Co-authored-by: Christian <cdreetz@gmail.com>

+        try:
+            for port in (base + 1, base + 2):
+                sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
+                sock.bind(("0.0.0.0", port))


+        try:
+            for port in (base + 1, base + 2):
+                sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
+                sock.bind(("0.0.0.0", port))


eligotts and others added 30 commits May 27, 2026 02:58

hubert-marek and others added 2 commits June 9, 2026 22:05

explicit del and malloc (#2757)

b101d22

(cherry picked from commit 77b8567) Co-authored-by: Christian <cdreetz@gmail.com>

hubert-marek changed the title ~~Feat/ephemeral mm pixels~~ [diff-only] Feat/ephemeral mm pixels Jun 10, 2026

github-advanced-security AI found potential problems Jun 10, 2026

View reviewed changes

Provide feedback

Saved searches

Use saved searches to filter your results more quickly

[diff-only] Feat/ephemeral mm pixels#2758

[diff-only] Feat/ephemeral mm pixels#2758
hubert-marek wants to merge 32 commits into
mainfrom
feat/ephemeral-mm-pixels

hubert-marek commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants

Conversation

hubert-marek commented Jun 10, 2026

Uh oh!

Reviewers

Assignees

Labels

Projects

Milestone

Development

Uh oh!

4 participants