Skip to content

[diff-only] Feat/ephemeral mm pixels#2758

Draft
hubert-marek wants to merge 32 commits into
mainfrom
feat/ephemeral-mm-pixels
Draft

[diff-only] Feat/ephemeral mm pixels#2758
hubert-marek wants to merge 32 commits into
mainfrom
feat/ephemeral-mm-pixels

Conversation

@hubert-marek

Copy link
Copy Markdown
Contributor

No description provided.

eligotts and others added 30 commits May 27, 2026 02:58
The env worker now ships descriptor-only mm_data, so re-derive pixel_values in
the orchestrator when building training samples: reprocess the offloaded
(file://) window images via the renderer, matched by content hash with a
grid_thw assert, before packing mm_kwargs. Thread the rollout renderer into
interleave_rollout; trainer unchanged. _pack_mm_kwargs_from_renderer now
normalizes decoded payloads via torch.as_tensor so reconstructed numpy pixels
batch alongside the existing encoded-wire path.

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Point the submodules at the descriptor-only mm_data work:
- renderers: pixels ephemeral on the rollout path + bridge no-mutate fix
- verifiers: descriptor-only delta regression test

Co-Authored-By: Claude Opus 4.7 (1M context) <noreply@anthropic.com>
Apply verifiers' native-thread limits (OMP/MKL/BLAS=1, MALLOC_ARENA_MAX=2,
tokenizers parallelism off) to the orchestrator subprocess env and around the
env-worker `process.start()`, so high-core hosts don't explode native thread
teams + glibc arenas during multimodal image processing.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…mp deps

Reduce orchestrator memory on VLM RL batches when converting rollouts to
training samples:

- Bound the interleave fan-out with a semaphore (new
  `OrchestratorConfig.mm_materialize_concurrency`, default 4) so only N rollouts
  reconstruct pixels at once — caps the transient build spike that otherwise
  stacks across the whole batch.
- Skip pixel materialization for filtered rollouts (their samples are dropped
  before the trainer), and auto-wire `VF_RENDERER_IMAGE_OFFLOAD_DIR` to the
  run's assets dir so the env-worker live offload needs no override.
- `offload_images_to_disk` now also normalizes `file://` images (leave if
  already under the run dir, else copy in) with hashing aligned to the
  env-worker writer (sha256 of decoded bytes + media-type ext).
- Drop the masked_scatter repro instrumentation.

Bump deps/renderers -> 73345a2 (descriptor-aware hash-only/full serialization)
and deps/verifiers -> bff4cb78 (lean ephemeral-MM policy + live image offload),
so prime-rl imports the consolidated multimodal path with no PYTHONPATH override.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…erifiers

The offloaded image path becomes a file:// URL; a relative --output-dir yielded
a malformed URI (file://rel/...) that the renderer could not load, failing turn 0
of every multimodal rollout. Resolve to absolute in both offload-dir wiring spots.
Bump deps/verifiers bff4cb78 -> 4112bc0a (same fix in _image_offload_dir + regression test).

Validated end-to-end: orchestrator offloads 11 unique images and converts rollouts
to training examples cleanly (smoke stage1c).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…ir setdefault; bump deps/verifiers

- Memory leak: the end-of-step cleanup deleted train_examples/training_batch but
  not 'results', which holds the same samples' mm_kwargs (full pixel byte-copies).
  That left the batch's pixels pinned past gc.collect()+malloc_trim, defeating the
  trim until 'results' was rebound a step later (multimodal-specific RSS ratchet).
  Add results + samples (and filter_df/timing_df) to the del.
- Remove the VF_RENDERER_IMAGE_OFFLOAD_DIR setdefault: dead now that verifiers
  derives the offload dir from RUN_ID, and useless in prod anyway (env workers are
  separate pods that don't inherit os.environ).
- Bump deps/verifiers 4112bc0a -> 91555323 (RUN_ID-derived offload dir).

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
- deps/renderers -> d6ed224 (shared mm_store module; feature offload default-on
  + placeholder_length self-repair).
- deps/verifiers -> 7ec7169c (offload dir from mm_store, prompt-only offload;
  token-prefix delta-baseline for descriptor-only mm_data).

Co-Authored-By: Codex <codex@openai.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…metric

When defer_mm_materialization is on, the orchestrator ships lightweight image
refs (MMRefs: descriptor + file:// uris) instead of heavy pixel mm_kwargs, and
the trainer materializes pixels in its data loader:
- transport: MMRefs struct (array_like, appended LAST), carried on
  TrainingSample + MicroBatch alongside mm_kwargs.
- utils/mm.py: materialize_mm_refs (hash-deduped decode-once-populate-all-slots
  via the renderer), reconstruct/pack/encode helpers, defer-validation hook.
- trainer: build the renderer once (VLM runs only), materialize in
  _micro_batch_to_tensor, carry mm_refs through packing/prepare_batch; always-on
  pre-forward all-reduce so a materialization error fails every rank before the
  forward collective; log time/mm_materialize + mm/images_materialized.

Co-Authored-By: Codex <codex@openai.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…weep

- trajectories: _collect_mm_refs normalizes the per-sample descriptor (strip
  pixel_values, grids -> nested lists, hashes -> str, image-only guard) and
  interleave_rollout(defer_materialization) ships MMRefs instead of materializing
  pixels; offload_images_to_disk writes under mm_store's assets/images subdir.
- orchestrator: resolve a canonical asset_root = run_dir(RUN_ID) (falls back to
  output_dir locally) and use it for BOTH the image offload and the per-step TTL
  sweep, so the env worker, orchestrator, and sweeper share one
  /data/outputs/run_<RUN_ID>/assets tree (no re-copy, sweep covers features too).
  Also free per-step results/samples to drop the multimodal RSS ratchet.

Co-Authored-By: Codex <codex@openai.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
serving_tokens parses mmfile:v1 refs (via mm_store.split_mmfile_ref), validates
the artifact (safe-id regexes, slot modality/hash match, version-pinned
fingerprint compat) and loads the processed MultiModalKwargsItem from
/data/outputs/run_<RUN_ID>/assets/mm_features, with structured errors on a
missing/invalid artifact. Shares the format + envelope helpers with the writer
via renderers.mm_store.

Co-Authored-By: Codex <codex@openai.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…scoped)

- defer_mm_materialization defaults to True on both orchestrator + trainer; the
  renderer requirement is scoped to VLM runs (model.vlm is not None) so text-only
  runs are a no-op. Trainer renderer defaults to AutoRendererConfig() (mirrors the
  orchestrator); SFT force-sets defer off.
- mm_artifact_ttl_seconds (default 3600s) drives the orchestrator's per-step sweep.

Co-Authored-By: Codex <codex@openai.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…el.vlm

The orchestrator ships mm_refs based on whether rollouts have images, not on the
trainer's optional [model.vlm] block — which hosted VLM configs leave unset
(model.vlm=None). The previous gate (renderer only when model.vlm is not None)
discarded the auto-resolved renderer, so the trainer built none and crashed on
the first mm_refs ("trainer has no renderer"). Pass config.renderer
unconditionally; data.py still builds only when defer is on and a renderer is
configured (default AutoRendererConfig; explicit renderer=None still opts out).

Co-Authored-By: Codex <codex@openai.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The bare `python:3.12-slim` tag tracks Debian stable, which moved to trixie
(glibc ~2.41). A rebuild drifted the runtime glibc, and at training time the FLA
gated-delta-rule backward JIT-compiles a TileLang CUDA kernel via the
hostPath-mounted node nvcc — which then fails with
`bits/mathcalls.h: exception specification is incompatible ... cospi/sinpi`
(new glibc's noexcept decls vs the CUDA host math headers). Pinning bookworm
(glibc 2.36) matches the ubuntu22.04 builder's glibc 2.35 and the
known-good CUDA 12.x combo. Runtime-only change; deps unchanged.

Co-Authored-By: Codex <codex@openai.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Stop sweeping raw images; only evict the expensive mm_features cache. Images are
the run's recoverable source of truth (terminal, non-regenerable) and the
trainer rebuilds pixels from them, so they are kept; features are regenerable
and disposable. Drop mm_artifact_ttl_seconds 7200 -> 1800: the horizon only
needs to exceed the write->vLLM-admit window (seconds), so 30 min is a large
safety margin against racing in-flight reads while keeping disk bounded.

offload_images_to_disk now refreshes mtime on the skip-if-exists path so all
content-addressed writers (images + features) share last-use semantics, making a
future image sweep safe by default.

Bumps deps/renderers (features-only sweep + feature-writer mtime) and
deps/verifiers (image-writer mtime).

Co-Authored-By: Codex <codex@openai.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Before the image-embed masked_scatter, assert that the image-token count equals
the image-feature count and raise a descriptive ValueError (token id, both
counts, input_ids/pixel_values/grid shapes) when they differ. A mismatch
otherwise surfaces as an opaque CUDA masked_scatter device-side assert that is
near-impossible to diagnose. This is a loud tripwire, not a masking guard: it
fails fast with the context needed to find the bad mm sidecar.

Co-Authored-By: Codex <codex@openai.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Cover the cross-repo contract with the verifiers trajectory delta encoder: when
a stale-prefix step-back starts a new TrainingSample, prime-rl will not merge it
into the prior sample, so its mm_data delta must carry the full cumulative
window. Guards against a partial delta that would drop images on step-back.

Co-Authored-By: Codex <codex@openai.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Allow multimodal samples to share a packed microbatch (gated by
trainer.pack_multimodal) instead of each MM sample getting its own. Packing is
enabled only where it is provably correct: the model advertises the pass_1d
position strategy, flash-attn varlen is active (block-diagonal cu_seqlens from
per-sample-reset positions isolates samples), context parallelism is off, and a
single run per microbatch (the MoE LoRA path applies one adapter per microbatch).
MM samples still never pack with text (FSDP per-step modality invariant) nor mix
refs/kwargs sidecars; sidecars are concatenated in sample order.

Cleanups to the packing primitive:
- drop the redundant EncodedTensor payload-length validators (a malformed
  payload already fails loudly downstream at frombuffer(...).reshape); keep the
  dtype/shape concat precondition.
- dedup the sidecar dispatch in _append_micro_batch (the kind match is already
  guaranteed by _can_pack_sample).

Bumps deps/renderers and deps/verifiers to their OOM-hunt-cleanup heads.

Co-Authored-By: Codex <codex@openai.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… + coordinated fail-fast

Ship packed per-rank micro-batches (trainer master -> data ranks) over ZMQ
instead of the shared filesystem. Hardened for multi-node and made fail-fast:

- bind 0.0.0.0 / connect MASTER_ADDR (or configured host) so PUB/SUB + the READY
  barrier work across nodes; port+1 = data, port+2 = startup READY barrier.
- bounded timeouts (recv/ready/publish) replace infinite blocking, with a
  publish_grace to reduce PUB/SUB slow-joiner races at startup.
- per-message step tag + receiver mismatch assertion: a dropped message becomes a
  loud crash, never silent training on misaligned data.
- torch.distributed Store publish-status barrier (master -> ranks) plus an
  all-reduce(MAX) fail-fast (any rank -> all): a master pack failure or a single
  rank's transport failure crashes all ranks coordinated instead of hanging on a
  collective. synchronize_state runs only after that barrier.

micro_batch_transport defaults to ZMQ; rollouts (orchestrator -> trainer) stay
filesystem. Adds unit fan-out/timeout/step-mismatch tests and a torchrun 2-rank
integration smoke. Residual risk is multi-node PUB/SUB startup; validate the
slow-joiner grace on the real topology.

Co-Authored-By: Codex <codex@openai.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
After image offload, logged samples carry file:// URIs pointing at local
trainer/orchestrator disk, which the Prime platform can't resolve (broken images
in the dashboard). When building the sample parquet, re-inline local image
files as data: URLs so the dashboard can display them.

Purely the platform sample-logging path; does not touch training, offload, or
the materialize path. Guarded: file://+image-mime only, 2 MB cap, OSError ->
skip (handles a swept/missing file), per-call dedup cache, and the original
rollout is not mutated. Pairs path.as_uri() (offload) with urlparse/unquote
(monitor) so file:// URIs round-trip.

Co-Authored-By: Codex <codex@openai.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Conservative one-time startup grace for the first prod runs: gives PUB/SUB
subscriptions ample time to propagate before the first publish, avoiding step-0
slow-joiner drops (which would crash-loop at startup). It is a one-time cost at
job start, so erring large is essentially free; dial down once a topology is
observed to start cleanly.

Co-Authored-By: Codex <codex@openai.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
…tch wait

The worker wait on the master's micro-batch publish status was bounded by a fixed
timeout (publish_timeout_seconds=1800), but the master sets that status only after
pack() — which blocks on the orchestrator for generation. Slow multimodal steps
(observed 1178-3137s) legitimately exceed 1800s, so every rank timed out ->
coordinated crash -> deterministic crash-loop on the slow step. The old
filesystem path had no such timeout and simply idled.

Wait on the publish key with no deadline instead. Liveness is already covered: a
wedged master is killed by the packer watchdog (-> torchrun tears down the group)
and a master pack error sets ok=False for an immediate coordinated fail. Genuine
ZMQ delivery stays bounded by the receiver's recv_timeout once published. Net
deletes the publish_timeout_seconds knob and its plumbing.

Co-Authored-By: Codex <codex@openai.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Large multimodal sample parquets (inlined base64 images) were PUT to R2 as one
in-RAM body, failing on a memory-tight orchestrator with WriteTimeout / OpenSSL
[BUF] malloc failure. Gated on VLM runs (run_config model/student.model is_vlm),
write the parquet straight to a disk-backed temp file (co-located with output_dir
to avoid tmpfs) and stream it: an async byte iterator with an explicit
Content-Length keeps the presigned PUT on Content-Length (httpx drops chunked
transfer-encoding) and bounds each TLS write; a generous write timeout replaces
the flat 30s; concurrent streamed uploads are serialized; the temp file is
unlinked on every exit. Text runs keep the inline byte PUT unchanged.

Co-Authored-By: Codex <codex@openai.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Both submodules merged origin/main on their feat/ephemeral-mm-pixels
branches (renderers f7696cd, verifiers 00d83204); these supersede main's
pins and carry our mm work on top.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>

# Conflicts:
#	deps/renderers
#	deps/verifiers
#	src/prime_rl/orchestrator/orchestrator.py
#	src/prime_rl/orchestrator/trajectories.py
#	src/prime_rl/trainer/rl/train.py
Picks up: no-VCS `fallback-version` so the Docker editable build resolves
(fixes `uv sync` LookupError on verifiers), plus worldsims #1557 — multimodal
tool content + reasoning_content passthrough in the in-sandbox base runner.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
Adds raw-image ref handling to the token serving route: parse mmraw refs,
load the raw image from shared disk, reprocess via the HF image processor +
vLLM field factory, and validate hash/fingerprint/grid/placeholder before
caching. Bumps the renderers submodule to the matching emit-side commit.

Co-Authored-By: Codex <codex@openai.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
The auto formula (max(1, ceil(max_inflight_rollouts/256))) gave a single env
worker for any max_inflight <= 256, so one worker absorbed all in-flight
rollouts; its event loop saturated during concurrent sandbox setup and missed
the 30s heartbeat -> EnvRouter restart loop (surfaces as 'env server unhealthy').
Fix the train resolver to a fixed pool of 32 (explicit per-env/group values
still win) so setup load spreads across workers. Requires bumping the
orchestrator pod's memory/CPU to host 32 workers.

Co-Authored-By: Codex <codex@openai.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
R2 sample uploads still hit OpenSSL [BUF] malloc failure (OOM) because the
streaming path was gated on a manual config flag (`student.model.vlm`/`is_vlm`)
that is silently off when `[model.vlm]` isn't set, and because the parquet was
built as a full in-RAM Arrow table before spilling. Always build straight to a
disk-backed temp file via ParquetWriter, one rollout at a time, with a per-rollout
image cache so base64-inlined screenshots are never all held in RAM; always stream
the upload (Content-Length, chunked TLS writes). Removes the fragile config gate
and the in-RAM table/bytes build.

Co-Authored-By: Codex <codex@openai.com>
Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
This reverts commit 7414cd7.

The auto->32 train default was dead code: the earlier num_workers resolver
(num_examples == -1 -> 4) sets the value before the auto->32 check runs, so it
never applied (train envs still got ~4 workers -> heartbeat timeouts). The
correct lever is now per-run run_config.env_server.num_workers (platform #2838,
bounds 1-64), which sets it explicitly and bypasses both auto resolvers.

Co-Authored-By: Claude Opus 4.8 (1M context) <noreply@anthropic.com>
… after tokenization (#2752)

* fix(orchestrator): drop per-step prompt arrays from buffered rollouts after tokenization

Every trajectory step ships prompt_ids/prompt_mask for its ENTIRE prompt
prefix — O(turns x context) boxed ints, 100-370MB per long browser rollout —
but their only readers run at arrival (backfill_rollout_tokens /
interleave_rollout). The rollout then buffers in pending_groups (sibling
wait) and pending_batch until its batch ships, so at the ~250 buffered
rollouts observed on worldsims gflights runs this dead weight alone
overflows a 128GB orchestrator.

Strip the prompt arrays at the arrival boundary in TrainSink.add, mirroring
the offload_images_to_disk treatment of image bytes:
- keep a num_prompt_tokens count (the prime monitor's per-step
  num_input_tokens stat reads it, falling back to len(prompt_ids))
- keep completion arrays on every step (entropy/rare-token filters scan them)
- keep the final step whole (the wandb sample table decodes it)

save_rollouts already excludes the trajectory from disk artifacts on both
the train and eval paths, so on-disk output is unchanged.

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

* style: ruff format monitor num_input_tokens expression

Co-Authored-By: Claude Fable 5 <noreply@anthropic.com>

---------

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
hubert-marek and others added 2 commits June 9, 2026 22:05
…oad fallback) (#2753)

renderers#82: raw-image layout no longer hard-fails on hosted workers when
the model id misses the local HF cache — falls back to hf_hub_download for
preprocessor_config.json. Explicit image_* renderer-config overrides remain
the hermetic path; this is the safety net for configs without them.

Co-authored-by: Claude Fable 5 <noreply@anthropic.com>
(cherry picked from commit 77b8567)

Co-authored-by: Christian <cdreetz@gmail.com>
@hubert-marek hubert-marek changed the title Feat/ephemeral mm pixels [diff-only] Feat/ephemeral mm pixels Jun 10, 2026
try:
for port in (base + 1, base + 2):
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.bind(("0.0.0.0", port))
try:
for port in (base + 1, base + 2):
sock = socket.socket(socket.AF_INET, socket.SOCK_STREAM)
sock.bind(("0.0.0.0", port))
Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

4 participants