feat(orchestrator): EnvMixStrategy seam for env selection by hallerite · Pull Request #2743 · PrimeIntellect-ai/prime-rl

hallerite · 2026-06-09T17:53:53Z

What

Extract TrainSource's weighted round-robin env selection into a swappable EnvMixStrategy seam (default WeightedRoundRobin). Example selection — the per-env reshuffling cursor — stays in TrainSource.

This is slice (b) of the composable algorithm abstraction: cleanly separate which env (global EnvMixStrategy) from which example (per-env, slice c builds on this).

Changes

orchestrator/sampling.py (new): EnvMixStrategy ABC + WeightedRoundRobin default. pick() returns the next env name via weighted random choice.
TrainSource delegates the env pick to self.env_mix.pick(); it still owns dataset loading, the per-env cursor, reshuffle-on-exhaustion, and env_costs.
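
For readers without the diff open, the seam described above could look roughly like the sketch below. This is not the code from the PR: the constructor signature, the attribute names, and the ValueError on an empty mix are assumptions for illustration. Only the names EnvMixStrategy, WeightedRoundRobin, and pick() come from the description.

```python
import random
from abc import ABC, abstractmethod


class EnvMixStrategy(ABC):
    """Global decision: which env does the next training example come from?"""

    @abstractmethod
    def pick(self) -> str:
        """Return the name of the next env."""


class WeightedRoundRobin(EnvMixStrategy):
    """Weighted random choice over env names (hypothetical constructor)."""

    def __init__(self, weights: dict[str, float], rng: random.Random) -> None:
        if not weights:
            # Assumed guard; the PR only says empty envs are validated.
            raise ValueError("at least one env is required")
        self._names = list(weights)
        self._weights = [weights[name] for name in self._names]
        self._rng = rng  # injected by the caller, not owned by the strategy

    def pick(self) -> str:
        # random.choices returns a list of length k; take its single element.
        return self._rng.choices(self._names, weights=self._weights, k=1)[0]
```

Under this shape, TrainSource would build the strategy once with its own RNG and call pick() wherever it previously called rng.choices inline.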

Behavior

Behavior-preserving. WeightedRoundRobin draws from TrainSource's existing RNG (injected), so env selection stays in the same stream as the dataset shuffles — the example sequence is identical to before. (Partitioning RNG per-env is a slice-(c) change, not here.) The public API (TrainSource(train_envs, seed) + next_example) is unchanged.
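
A toy illustration of why injecting the RNG preserves the sequence: if the picker consumes draws from the same random.Random instance, in the same order as the old inlined call, every later shuffle sees the same generator state. The env names, weights, and the shuffle-on-every-step loop are assumptions chosen to keep the sketch small; they do not mirror TrainSource's real cursor logic.

```python
import random

ENVS = ["math", "code"]  # hypothetical env names
WEIGHTS = [3.0, 1.0]     # hypothetical weights


def inlined(seed: int, steps: int) -> list[tuple[str, list[int]]]:
    """Pre-refactor shape: env pick and shuffle both inlined on one RNG."""
    rng = random.Random(seed)
    trace = []
    for _ in range(steps):
        env = rng.choices(ENVS, weights=WEIGHTS, k=1)[0]
        order = list(range(4))
        rng.shuffle(order)
        trace.append((env, order))
    return trace


class InjectedPicker:
    """Stand-in strategy that draws from an RNG it is handed."""

    def __init__(self, rng: random.Random) -> None:
        self._rng = rng

    def pick(self) -> str:
        return self._rng.choices(ENVS, weights=WEIGHTS, k=1)[0]


def delegated(seed: int, steps: int) -> list[tuple[str, list[int]]]:
    """Post-refactor shape: the pick is delegated, the RNG is shared."""
    rng = random.Random(seed)
    picker = InjectedPicker(rng)
    trace = []
    for _ in range(steps):
        env = picker.pick()
        order = list(range(4))
        rng.shuffle(order)
        trace.append((env, order))
    return trace


# Same seed, same draw order, same generator: the traces match.
assert inlined(0, 20) == delegated(0, 20)
```

Giving the picker its own random.Random would change how many draws the shuffles see from the shared stream, which is why per-env RNG partitioning is treated as a separate change.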

Testing

ruff check + ruff format --check clean.
tests/unit/orchestrator/test_sampling.py (new): WeightedRoundRobin determinism per seed, weight respecting (incl. zero-weight never picked), empty-envs guard.
tests/unit/test_configs.py passes (106 tests); imports resolve.
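
The three properties listed for test_sampling.py can be sketched as plain assert-based tests. The make_picker helper below is a hypothetical stand-in so the example runs on its own; the real tests exercise WeightedRoundRobin, and the exception type for an empty mix is an assumption.

```python
import random


def make_picker(weights: dict[str, float], seed: int):
    """Hypothetical stand-in: weighted choice over env names on a seeded RNG."""
    if not weights:
        raise ValueError("no envs configured")
    rng = random.Random(seed)
    names = list(weights)
    values = list(weights.values())
    return lambda: rng.choices(names, weights=values, k=1)[0]


def test_deterministic_per_seed() -> None:
    a = make_picker({"math": 2.0, "code": 1.0}, seed=7)
    b = make_picker({"math": 2.0, "code": 1.0}, seed=7)
    assert [a() for _ in range(50)] == [b() for _ in range(50)]


def test_zero_weight_never_picked() -> None:
    pick = make_picker({"math": 1.0, "code": 0.0}, seed=0)
    assert all(pick() == "math" for _ in range(1000))


def test_empty_envs_rejected() -> None:
    try:
        make_picker({}, seed=0)
    except ValueError:
        return
    raise AssertionError("expected ValueError for an empty env mix")
```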

🤖 Generated with Claude Code

Note

Low Risk
Refactor with default implementation matching prior logic; TrainSource API and dispatcher wiring unchanged.

Overview
Introduces a swappable env mix seam for training rollouts: global “which env next?” is no longer inlined in TrainSource.

New orchestrator/sampling.py defines EnvMixStrategy and default WeightedRoundRobin, which picks an env name via weighted random choice (same weight rules as before: per-env ratio when all set, else dataset size). TrainSource now calls self.env_mix.pick() instead of rng.choices directly; it still owns datasets, per-env cursors, reshuffle-on-exhaustion, permit costs, and next_example’s public contract.
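
The weight rule summarized above (explicit per-env ratio when every env sets one, otherwise dataset size) can be sketched as a small helper. The function name and argument shapes are hypothetical; only the rule itself is taken from the summary.

```python
from typing import Optional, Sequence


def resolve_weights(
    ratios: Sequence[Optional[float]],
    dataset_sizes: Sequence[int],
) -> list[float]:
    """Pick sampling weights for a list of envs.

    Explicit ratios are used only if every env sets one; a single missing
    ratio makes all envs fall back to dataset size.
    """
    if len(ratios) != len(dataset_sizes):
        raise ValueError("ratios and dataset_sizes must have the same length")
    if all(r is not None for r in ratios):
        return [float(r) for r in ratios]
    return [float(n) for n in dataset_sizes]
```

For example, resolve_weights([0.7, 0.3], [10, 1000]) keeps the ratios, while resolve_weights([0.7, None], [10, 1000]) ignores them and weights by size.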

Behavior-preserving: WeightedRoundRobin uses TrainSource’s existing random.Random instance so env picks stay in the same RNG stream as dataset shuffles.

Tests: tests/unit/orchestrator/test_sampling.py covers determinism, weight behavior (including zero weight), and empty-env validation.

Reviewed by Cursor Bugbot for commit 84370ab. Bugbot is set up for automated code reviews on this repo.

Extract TrainSource's weighted round-robin env pick into a swappable EnvMixStrategy (default WeightedRoundRobin). Example selection (the reshuffling cursor) stays in TrainSource. The strategy draws from TrainSource's RNG, so the example sequence is unchanged: pure extraction, no behavior delta. Separates 'which env' from 'which example' as the seam slice (c) builds on.

Co-Authored-By: Claude Opus 4.8 <noreply@anthropic.com>

hallerite marked this pull request as ready for review June 9, 2026 17:58

hallerite requested a review from mikasenghaas June 9, 2026 17:58

hallerite wants to merge 1 commit into main from feat/env-mix-strategy
