feat: algorithm abstraction — named algorithm classes + inline frozen-model references (grpo, opd, sft_distill, self_distill, echo) #2746
Open: hallerite wants to merge 47 commits into main from feat/algorithm-abstraction
Changes from 8 commits
Commits (47, all by hallerite)
f35e3a8 feat: algorithm abstraction — sampling/scoring/loss presets (grpo, op…
5893d71 refactor: kill the teacher concept — model registry + runtime Algorit…
fc79ae1 refactor: unify advantage and token scorers into one advantage-strate…
3c3c3d1 Merge remote-tracking branch 'origin/main' into feat/algorithm-abstra…
54864f4 fix(trainer): keep ref_logprobs position-aligned when packing mixed bins
fc79651 feat(trainer): assert per-token array alignment after packing
9b41d1b refactor: rename config key algorithm -> algo, loss_core -> loss_type
207ad13 refactor(configs): make [orchestrator.model] canonical for the policy
5f91d7d feat(configs): mixed grpo+opd debug config
8eb5c33 docs: state the role principle — roles are algorithm-local labels ove…
fceac30 feat(orchestrator): inline algorithm-owned model references, drop the…
96beac3 refactor(configs): flatten FrozenModelConfig to ClientConfig + name
c3a7d6f chore(configs): drop the POLICY_MODEL constant
abbcb8b refactor(transport): LossType IntEnum for loss type scalars
d21c9f9 feat(orchestrator): per-token advantages from custom advantage strate…
51fa6b1 refactor(orchestrator): split algorithms.py into the algo/ package
5083441 chore(orchestrator): rename setup_frozen_pool/owned_pools to connect_…
06d50ed feat(trainer): loss partition -> sum of three weighted components
4a743e3 refactor(configs): presets as data deltas, shorthand folding on raw i…
34bdd15 feat(orchestrator): named algorithm classes own assign/score; declare…
f8f73cb refactor(orchestrator): algorithms take (policy_pool, renderer); toke…
3eba9f4 refactor: merge loss routing into the advantage; split sampling into …
d8b7e89 refactor(orchestrator): OPD ships no scalar advantage, like OPSD
fdb6b87 chore: parsimony pass over the algorithm abstraction
d322bcc Merge remote-tracking branch 'origin/main' into feat/algorithm-abstra…
cfc1041 feat(orchestrator): MaxRL advantage strategy
618b605 refactor(configs): presets are atomic — select whole or assemble your…
ba05022 Merge branch 'feat/algorithm-abstraction' into feat/maxrl-advantage
64304b9 docs: module docstring matches atomic presets
3ff860c Merge branch 'feat/algorithm-abstraction' into feat/maxrl-advantage
1d7e76c feat(orchestrator): echo trains tool-response tokens by default
eeaf4f0 Merge branch 'feat/algorithm-abstraction' into feat/maxrl-advantage
1389227 test: shorthand-assembly fixture needs a renderer for tool-mode echo
b9af98c Merge branch 'feat/algorithm-abstraction' into feat/maxrl-advantage
9157076 fix(orchestrator): accept attribution as dict or RenderedTokens
1395858 Merge branch 'feat/algorithm-abstraction' into feat/maxrl-advantage
8a7b8a6 refactor(configs): flatten the policy config — orchestrator.model.nam…
26cdbd3 Merge branch 'feat/algorithm-abstraction' into feat/maxrl-advantage
3126d46 fix: flat-policy leftovers — elastic client section, setup mocks, docs
a718d59 Merge branch 'feat/algorithm-abstraction' into feat/maxrl-advantage
262baf2 chore: section-style algo config in max_rl debug toml
509d6a3 feat(echo): per-role echo weights + user-supplied token filter
e55f067 Merge branch 'feat/echo-selection' into feat/algorithm-abstraction
8e7e51f feat(algo): advantage types are the algorithm names; preset layer del…
48ac7ff docs: scrub stale algorithm names and preset vocabulary
f5be6f4 fix(trainer): anchor the loss to the graph when every component is empty
0606f12 fix(trainer): namespace ref_kl loss metrics; harden batch preparation
@@ -0,0 +1,57 @@
# Algorithm — Debug Configs

Minimal end-to-end configs for the algorithm presets against bundled verifiers envs, using `PrimeIntellect/Qwen3-0.6B-Reverse-Text-SFT` as the policy.

| Config | Algorithm | Frozen model | Notes |
|---|---|---|---|
| `grpo.toml` | `grpo` | none | |
| `opd.toml` | `opd` | local vLLM (`Qwen3-0.6B-Reverse-Text-RL`) | |
| `opd_lora.toml` | `opd` | local vLLM (`Qwen3-0.6B-Reverse-Text-RL`) | trains a LoRA adapter (rank 8) |
| `sft_distill.toml` | `sft_distill` | local vLLM (`Qwen3-0.6B-Reverse-Text-RL`) | |
| `sft_distill_lora.toml` | `sft_distill` | local vLLM (`Qwen3-0.6B-Reverse-Text-RL`) | trains a LoRA adapter (rank 8) |
| `sft_distill_external.toml` | `sft_distill` | PI inference (`openai/gpt-5-mini`) | external OAI endpoint; no local server |
| `self_distill.toml` | `self_distill` | none (`model = "policy"`) | SDFT against the live policy; demo from reverse-text's `answer` field |
| `echo.toml` | `echo` | none | multi-turn `alphabet-sort`; CE on observation tokens |

The policy inference server is auto-launched on GPU 0 at `http://localhost:8000/v1` with `gpu_memory_utilization=0.5`. The local frozen model (used by `opd*.toml` and `sft_distill.toml` / `sft_distill_lora.toml`) is **not** auto-launched — start it manually on GPU 1.

Frozen models are plain `[orchestrator.models.<key>]` entries; the algorithm points at them by key (`algo = { name = "opd", model = "reverse-text-rl" }`). There is no dedicated teacher slot — the same entry can serve any number of envs' algorithms.
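
The key-based lookup described above can be sketched in plain Python dicts. This is only an illustration, not the project's config classes: the `resolve_model` helper, the `base_url` field, and the `/v1` path on port 8001 are assumptions made for the example. Only the `[orchestrator.models.<key>]` layout, the `algo` table, and the `model = "policy"` value come from the README and configs in this PR.

```python
from typing import Optional

# `model = "policy"` (as in self_distill.toml) refers to the live policy,
# not to a frozen-model entry.
POLICY = "policy"


def resolve_model(orchestrator: dict, algo: dict) -> Optional[dict]:
    """Hypothetical helper: return the frozen-model entry an algorithm points at.

    Returns None when the algorithm names no model (grpo, echo) or names the
    live policy; raises KeyError when the key is not a registered entry.
    """
    key = algo.get("model")
    if key is None or key == POLICY:
        return None
    models = orchestrator.get("models", {})
    if key not in models:
        raise KeyError(f"algo {algo['name']!r} references unknown model {key!r}")
    return models[key]


# Stand-in for a parsed [orchestrator.models.reverse-text-rl] table.
# The field names inside the entry are assumptions for this sketch.
orchestrator = {
    "models": {
        "reverse-text-rl": {
            "name": "PrimeIntellect/Qwen3-0.6B-Reverse-Text-RL",
            "base_url": "http://localhost:8001/v1",
        }
    }
}

opd = {"name": "opd", "model": "reverse-text-rl"}
sft_distill = {"name": "sft_distill", "model": "reverse-text-rl"}
grpo = {"name": "grpo"}

# Two algorithms share one entry: there is no dedicated teacher slot.
shared = resolve_model(orchestrator, opd) is resolve_model(orchestrator, sft_distill)
print(shared, resolve_model(orchestrator, grpo))
```

Resolving by key at lookup time is what lets one entry serve several envs' algorithms without duplicating the endpoint details.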
## Start the local frozen model

Needed for `opd*.toml` and `sft_distill.toml` / `sft_distill_lora.toml`:

```bash
CUDA_VISIBLE_DEVICES=1 uv run inference \
  --model.name PrimeIntellect/Qwen3-0.6B-Reverse-Text-RL \
  --server.port 8001 \
  --gpu-memory-utilization 0.5 \
  --model.enforce-eager
```
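
Because the frozen model is not auto-launched, it can help to confirm that something is listening on port 8001 before starting a run that depends on it. The sketch below is a generic TCP poll using only the standard library; `wait_for_port` is a hypothetical helper, not part of this repository. An open port only shows that the server process is accepting connections, not that the model has finished loading.

```python
import socket
import time


def wait_for_port(host: str, port: int, timeout_s: float = 60.0,
                  interval_s: float = 1.0) -> bool:
    """Poll until a TCP connection to (host, port) succeeds or the timeout elapses."""
    deadline = time.monotonic() + timeout_s
    while True:
        try:
            # create_connection raises OSError (e.g. connection refused) on failure.
            with socket.create_connection((host, port), timeout=interval_s):
                return True
        except OSError:
            if time.monotonic() >= deadline:
                return False
            time.sleep(interval_s)
```

For example, `wait_for_port("localhost", 8001, timeout_s=120.0)` returns `True` once the server started with `--server.port 8001` accepts connections, and `False` if nothing is listening after two minutes.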
## Run the debug configs

```bash
# GRPO (no frozen model)
uv run rl @ configs/debug/algorithms/grpo.toml

# OPD (needs the frozen model on port 8001)
uv run rl @ configs/debug/algorithms/opd.toml
uv run rl @ configs/debug/algorithms/opd_lora.toml

# SFT distillation (needs the frozen model on port 8001)
uv run rl @ configs/debug/algorithms/sft_distill.toml
uv run rl @ configs/debug/algorithms/sft_distill_lora.toml

# SFT distillation from openai/gpt-5-mini via PI inference
# (requires PRIME_API_KEY + PRIME_TEAM_ID in env; no local frozen model needed)
uv run rl @ configs/debug/algorithms/sft_distill_external.toml

# Self-distillation against the live policy (no frozen model)
uv run rl @ configs/debug/algorithms/self_distill.toml

# ECHO (no frozen model; multi-turn env)
uv run rl @ configs/debug/algorithms/echo.toml
```

See [docs/algorithms.md](../../../docs/algorithms.md) for what each algorithm does and how to compose custom ones.
@@ -0,0 +1,48 @@
# ECHO on the multi-turn alphabet-sort env (bundled with verifiers): GRPO on
# action tokens + weighted CE on the env's observation tokens.
# uv run rl @ configs/debug/algorithms/echo.toml

max_steps = 20
seq_len = 4096

[model]
name = "PrimeIntellect/Qwen3-0.6B-Reverse-Text-SFT"

[wandb]
project = "algorithms-debug"
name = "debug-echo"

[orchestrator]
batch_size = 32
group_size = 4

[orchestrator.algo]
name = "echo"

[orchestrator.algo.loss]
observation_weight = 0.1

[[orchestrator.train.env]]
id = "alphabet-sort"
args = { min_turns = 3, max_turns = 5, power_per_turn = false }

[orchestrator.train.sampling]
max_completion_tokens = 512

# ECHO learns from observation tokens even when the GRPO advantage collapses
# to zero — keep zero-advantage rollouts in the batch.
[[orchestrator.post_batch_filters]]
type = "zero_advantage"
enforce = false

# Qwen3 finetune with the standard PI template patch; always re-emits prior
# <think> blocks, matched by the qwen3 renderer's preserve_all_thinking.
[orchestrator.renderer]
name = "qwen3"
preserve_all_thinking = true

[trainer.optim]
lr = 1e-6

[inference]
gpu_memory_utilization = 0.5
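
The reason the config keeps zero-advantage rollouts can be seen in a toy version of the loss the header comment describes. The sketch below is a deliberate simplification and not the trainer's implementation: it uses mean-centred group advantages without any scaling, a plain policy-gradient term without importance ratios or clipping, and hypothetical function names. Only the split into action and observation tokens and the `observation_weight = 0.1` value come from the config.

```python
from typing import List


def group_advantages(rewards: List[float]) -> List[float]:
    """Mean-centred advantages within one group of rollouts (no std scaling)."""
    mean = sum(rewards) / len(rewards)
    return [r - mean for r in rewards]


def echo_loss(logprobs: List[float], is_action: List[bool], advantage: float,
              observation_weight: float = 0.1) -> float:
    """Toy per-rollout loss: policy gradient on action tokens plus weighted
    cross-entropy on observation tokens."""
    pg = 0.0  # policy-gradient term, action tokens only
    ce = 0.0  # cross-entropy term, observation tokens only
    for lp, act in zip(logprobs, is_action):
        if act:
            pg += -advantage * lp
        else:
            ce += -lp
    return pg + observation_weight * ce


# A group of 4 rollouts (group_size = 4) that all earn the same reward:
# every advantage is zero, so the policy-gradient term vanishes.
advantages = group_advantages([1.0, 1.0, 1.0, 1.0])
print(advantages)  # [0.0, 0.0, 0.0, 0.0]

# The observation token (log-prob -1.0) still contributes through the CE term,
# so the rollout carries a nonzero loss and is worth keeping in the batch.
loss = echo_loss([-0.5, -1.0, -2.0], [True, False, True], advantages[0])
print(loss)
```

Under these assumptions a rollout with zero advantage still yields a training signal from its observation tokens, which is consistent with setting `enforce = false` on the `zero_advantage` filter.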
@@ -0,0 +1,48 @@
# Self-distillation (SDFT, https://arxiv.org/abs/2601.19897) against the live
# policy itself: the reference for each completion is the current model
# conditioned on the expert demonstration — no extra deployment needed.
# reverse-text carries the demonstration in its top-level `answer` field.
# uv run rl @ configs/debug/algorithms/self_distill.toml

max_steps = 20
seq_len = 2048

[model]
name = "PrimeIntellect/Qwen3-0.6B-Reverse-Text-SFT"

[wandb]
project = "algorithms-debug"
name = "debug-self-distill"

[orchestrator]
batch_size = 32
group_size = 1

[orchestrator.algo]
name = "self_distill"
advantage = { type = "demo_ref_kl", model = "policy", demo_key = "answer" }

[orchestrator.renderer]
name = "qwen3"

[orchestrator.train.sampling]
max_completion_tokens = 128

[[orchestrator.train.env]]
id = "reverse-text"

[orchestrator.eval]
interval = 1
num_examples = 128

[orchestrator.eval.sampling]
max_completion_tokens = 128

[[orchestrator.eval.env]]
id = "reverse-text"

[trainer.optim]
lr = 3e-6

[inference]
gpu_memory_utilization = 0.5

Review thread (outdated), attached after the `advantage` line under `[orchestrator.algo]`: cursor[bot] marked this conversation as resolved.
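
The header comment says the reference for each completion is the current model conditioned on the expert demonstration. A rough sketch of that idea follows; it is an assumption-laden illustration and not the PR's `demo_ref_kl` implementation. The prompt template, the `prompt` field name, and both helper names are hypothetical. Only `demo_key = "answer"` comes from the config.

```python
from typing import List


def reference_prompt(example: dict, demo_key: str = "answer") -> str:
    """Hypothetical template: the reference pass sees the prompt plus the demo."""
    return f"{example['prompt']}\n\nExpert demonstration:\n{example[demo_key]}"


def token_log_ratios(policy_logprobs: List[float],
                     reference_logprobs: List[float]) -> List[float]:
    """Per-token log p_ref(token) - log p_policy(token) on the sampled completion.

    Positive values mark tokens that the demo-conditioned reference finds more
    likely than the unconditioned policy did.
    """
    if len(policy_logprobs) != len(reference_logprobs):
        raise ValueError("logprob arrays must be position-aligned")
    return [ref - pol for pol, ref in zip(policy_logprobs, reference_logprobs)]


example = {"prompt": "Reverse: abc", "answer": "cba"}
print(reference_prompt(example))

# Made-up log-probs for a two-token completion, scored once by the policy on
# the plain prompt and once by the same model on the demo-conditioned prompt.
ratios = token_log_ratios([-1.0, -0.5], [-0.25, -0.5])
print(ratios)  # [0.75, 0.0]
```

The length check mirrors the concern in the commit list about keeping reference log-probs position-aligned with the policy's tokens; how the trainer turns such per-token values into a loss is not shown in this diff.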