feat(orchestrator): add penalize action for gibberish/repetition filters#2775
Open
anravich13-cloud wants to merge 4 commits into
Open
feat(orchestrator): add penalize action for gibberish/repetition filters#2775anravich13-cloud wants to merge 4 commits into
anravich13-cloud wants to merge 4 commits into
Conversation
Adds an opt-in `penalize` action for rollout filters. Gibberish and
repetition filters can now cap detected rollout rewards to a configurable
negative value (`penalty_reward`, default -1.0) before advantage
computation, creating negative training signal without dropping the
rollout.
- Filter configs gain `action` (monitor|drop|penalize) + `penalty_reward`;
legacy `enforce` configs still parse (true→drop, false→monitor),
conflicting combinations raise a validation error.
- Filters gain a phase: gibberish/repetition are pre_advantage,
zero_advantage is post_advantage. TrainSink.process_group now applies
pre-advantage filters before assign_advantages so penalized rewards are
visible to the group baseline (and to sample.reward propagation).
- Penalties preserve the original reward (rollout.raw_reward) and record
per-filter metadata (rollout.reward_penalties); both flow into saved
rollouts via to_dict.
- New metrics: filters/all/{name}_penalized and raw_reward/all/{mean,max,min}
when penalties fire.
- Defaults unchanged: gibberish/repetition monitor, zero_advantage drop.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 5208ce8. Configure here.
A penalize filter in post_batch_filters caps rollout.raw['reward'] in process_batch, after process_group already propagated reward onto the trainer-bound TrainingSamples. Re-stamp sample.reward for penalized rollouts so shipped samples agree with the rollout reward used in metrics. Advantage is intentionally untouched: post-batch runs after advantage computation, so a penalty there is metadata-only.
Covers the sink paths the filter unit tests can't reach: - process_group applies pre-advantage penalize before the group baseline and stamps post-penalty reward/advantage onto trainer-bound samples; - an equally-penalized group collapses to zero advantage and is dropped by the post-advantage zero_advantage filter (drop attribution excludes the penalty filter); - process_batch re-syncs sample.reward after a post-batch penalize (regression test for stale TrainingSample rewards); - post-batch drop still excludes samples from the shipped batch. Bypasses tokenizer/renderer by pre-building rollout.samples and driving process_group / process_batch directly.
samsja
reviewed
Jun 12, 2026
Comment on lines
+1
to
+24
| """Sink-level tests for filter actions: ``process_group`` ordering (penalty | ||
| visible to the advantage baseline + sample propagation) and ``process_batch`` | ||
| post-filter sample re-sync. | ||
|
|
||
| ``TrainSink``'s heavy constructor args (tokenizer / renderer / real envs) | ||
| are only used by ``add()`` / ``process_rollout``; these tests bypass them by | ||
| pre-building ``rollout.samples`` and driving ``process_group`` / | ||
| ``process_batch`` directly. | ||
| """ | ||
|
|
||
| import math | ||
| import uuid | ||
| from types import SimpleNamespace | ||
|
|
||
| from prime_rl.configs.orchestrator import DefaultAdvantageConfig | ||
| from prime_rl.orchestrator.advantage import setup_advantage_fn | ||
| from prime_rl.orchestrator.filters import GibberishFilter, ZeroAdvantageFilter | ||
| from prime_rl.orchestrator.train_sink import TrainSink | ||
| from prime_rl.orchestrator.types import TrainRollout | ||
| from prime_rl.transport import TrainingSample | ||
|
|
||
| VOCAB_SIZE = 128_000 | ||
|
|
||
|
|
samsja
reviewed
Jun 12, 2026
Comment on lines
+223
to
+259
| def test_config_default_actions(): | ||
| assert GibberishFilterConfig().resolved_action == "monitor" | ||
| assert RepetitionFilterConfig().resolved_action == "monitor" | ||
| assert ZeroAdvantageFilterConfig().resolved_action == "drop" | ||
|
|
||
|
|
||
| def test_config_legacy_enforce_true_resolves_to_drop(): | ||
| assert GibberishFilterConfig(enforce=True).resolved_action == "drop" | ||
| assert RepetitionFilterConfig(enforce=True).resolved_action == "drop" | ||
| assert ZeroAdvantageFilterConfig(enforce=True).resolved_action == "drop" | ||
|
|
||
|
|
||
| def test_config_legacy_enforce_false_resolves_to_monitor(): | ||
| assert GibberishFilterConfig(enforce=False).resolved_action == "monitor" | ||
| assert RepetitionFilterConfig(enforce=False).resolved_action == "monitor" | ||
| assert ZeroAdvantageFilterConfig(enforce=False).resolved_action == "monitor" | ||
|
|
||
|
|
||
| def test_config_penalize_parses_with_penalty_reward(): | ||
| config = GibberishFilterConfig(action="penalize", penalty_reward=-0.5) | ||
| assert config.resolved_action == "penalize" | ||
| assert config.penalty_reward == -0.5 | ||
|
|
||
|
|
||
| def test_config_conflicting_action_and_enforce_raises(): | ||
| with pytest.raises(ValidationError): | ||
| GibberishFilterConfig(action="penalize", enforce=True) | ||
| with pytest.raises(ValidationError): | ||
| RepetitionFilterConfig(action="monitor", enforce=True) | ||
| with pytest.raises(ValidationError): | ||
| ZeroAdvantageFilterConfig(action="drop", enforce=False) | ||
|
|
||
|
|
||
| def test_config_consistent_action_and_enforce_ok(): | ||
| assert GibberishFilterConfig(action="drop", enforce=True).resolved_action == "drop" | ||
| assert RepetitionFilterConfig(action="monitor", enforce=False).resolved_action == "monitor" | ||
|
|
samsja
reviewed
Jun 12, 2026
Comment on lines
+459
to
+465
| def resolved_action(self) -> FilterAction: | ||
| """The effective action: explicit ``action`` wins, then legacy ``enforce``, then the per-type default.""" | ||
| if self.action is not None: | ||
| return self.action | ||
| if self.enforce is not None: | ||
| return "drop" if self.enforce else "monitor" | ||
| return self._default_action |
Member
There was a problem hiding this comment.
lets rather use proper default value that having | None and doing this logic here
hallerite
reviewed
Jun 12, 2026
| class BaseFilterConfig(BaseConfig): | ||
| """Shared action fields for rollout filters. | ||
|
|
||
| Exactly one of ``action`` / ``enforce`` should be set (``enforce`` is legacy |
Member
There was a problem hiding this comment.
I think we can just remove the backwards compat here
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.

Adds an opt-in
penalizeaction for rollout filters. Gibberish and repetition filters can now cap detected rollout rewards to a configurable negative value (penalty_reward, default -1.0) before advantage computation, creating negative training signal without dropping the rollout.action(monitor|drop|penalize) +penalty_reward; legacyenforceconfigs still parse (true→drop, false→monitor), conflicting combinations raise a validation error.Note
Medium Risk
Changes GRPO training signal timing (reward caps before advantages) and filter config semantics; defaults are preserved but misconfigured penalize could skew advantages.
Overview
Rollout filters now use
action(monitor|drop|penalize) instead of a booleanenforce, via sharedBaseFilterConfigwithresolved_action,penalty_reward, and validation whenactionandenforcedisagree.penalizecaps reward withmin(raw, penalty_reward), keeps rollouts trainable, and recordsraw_reward/reward_penaltiesonTrainRollout(surfaced in saved rollouts).TrainSink.process_groupruns pre-advantage filters (gibberish/repetition) beforeassign_advantages, then post-advantage filters (zero advantage); pre-batch drop stats count onlydropactions.process_batchre-syncssample.rewardafter post-batch penalize. Metrics add penalized rates andraw_reward/all/*when penalties occur.Defaults stay monitor for gibberish/repetition and drop for zero advantage; legacy
enforcestill maps to monitor/drop.Reviewed by Cursor Bugbot for commit 85b696e. Bugbot is set up for automated code reviews on this repo. Configure here.