Skip to content

feat(orchestrator): add penalize action for gibberish/repetition filters#2775

Open
anravich13-cloud wants to merge 4 commits into
mainfrom
feat/repetition-gibberish-penalty
Open

feat(orchestrator): add penalize action for gibberish/repetition filters#2775
anravich13-cloud wants to merge 4 commits into
mainfrom
feat/repetition-gibberish-penalty

Conversation

@anravich13-cloud

@anravich13-cloud anravich13-cloud commented Jun 11, 2026

Copy link
Copy Markdown

Adds an opt-in penalize action for rollout filters. Gibberish and repetition filters can now cap detected rollout rewards to a configurable negative value (penalty_reward, default -1.0) before advantage computation, creating negative training signal without dropping the rollout.

  • Filter configs gain action (monitor|drop|penalize) + penalty_reward; legacy enforce configs still parse (true→drop, false→monitor), conflicting combinations raise a validation error.
  • Filters gain a phase: gibberish/repetition are pre_advantage, zero_advantage is post_advantage. TrainSink.process_group now applies pre-advantage filters before assign_advantages so penalized rewards are visible to the group baseline (and to sample.reward propagation).
  • Penalties preserve the original reward (rollout.raw_reward) and record per-filter metadata (rollout.reward_penalties); both flow into saved rollouts via to_dict.
  • New metrics: filters/all/{name}_penalized and raw_reward/all/{mean,max,min} when penalties fire.
  • Defaults unchanged: gibberish/repetition monitor, zero_advantage drop.

Note

Medium Risk
Changes GRPO training signal timing (reward caps before advantages) and filter config semantics; defaults are preserved but misconfigured penalize could skew advantages.

Overview
Rollout filters now use action (monitor | drop | penalize) instead of a boolean enforce, via shared BaseFilterConfig with resolved_action, penalty_reward, and validation when action and enforce disagree. penalize caps reward with min(raw, penalty_reward), keeps rollouts trainable, and records raw_reward / reward_penalties on TrainRollout (surfaced in saved rollouts).

TrainSink.process_group runs pre-advantage filters (gibberish/repetition) before assign_advantages, then post-advantage filters (zero advantage); pre-batch drop stats count only drop actions. process_batch re-syncs sample.reward after post-batch penalize. Metrics add penalized rates and raw_reward/all/* when penalties occur.

Defaults stay monitor for gibberish/repetition and drop for zero advantage; legacy enforce still maps to monitor/drop.

Reviewed by Cursor Bugbot for commit 85b696e. Bugbot is set up for automated code reviews on this repo. Configure here.

Adds an opt-in `penalize` action for rollout filters. Gibberish and
repetition filters can now cap detected rollout rewards to a configurable
negative value (`penalty_reward`, default -1.0) before advantage
computation, creating negative training signal without dropping the
rollout.

- Filter configs gain `action` (monitor|drop|penalize) + `penalty_reward`;
  legacy `enforce` configs still parse (true→drop, false→monitor),
  conflicting combinations raise a validation error.
- Filters gain a phase: gibberish/repetition are pre_advantage,
  zero_advantage is post_advantage. TrainSink.process_group now applies
  pre-advantage filters before assign_advantages so penalized rewards are
  visible to the group baseline (and to sample.reward propagation).
- Penalties preserve the original reward (rollout.raw_reward) and record
  per-filter metadata (rollout.reward_penalties); both flow into saved
  rollouts via to_dict.
- New metrics: filters/all/{name}_penalized and raw_reward/all/{mean,max,min}
  when penalties fire.
- Defaults unchanged: gibberish/repetition monitor, zero_advantage drop.

@cursor cursor Bot left a comment

Copy link
Copy Markdown

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

Cursor Bugbot has reviewed your changes and found 1 potential issue.

Fix All in Cursor

❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.

Reviewed by Cursor Bugbot for commit 5208ce8. Configure here.

Comment thread src/prime_rl/orchestrator/train_sink.py
A penalize filter in post_batch_filters caps rollout.raw['reward'] in
process_batch, after process_group already propagated reward onto the
trainer-bound TrainingSamples. Re-stamp sample.reward for penalized
rollouts so shipped samples agree with the rollout reward used in
metrics. Advantage is intentionally untouched: post-batch runs after
advantage computation, so a penalty there is metadata-only.
Covers the sink paths the filter unit tests can't reach:
- process_group applies pre-advantage penalize before the group baseline
  and stamps post-penalty reward/advantage onto trainer-bound samples;
- an equally-penalized group collapses to zero advantage and is dropped
  by the post-advantage zero_advantage filter (drop attribution excludes
  the penalty filter);
- process_batch re-syncs sample.reward after a post-batch penalize
  (regression test for stale TrainingSample rewards);
- post-batch drop still excludes samples from the shipped batch.

Bypasses tokenizer/renderer by pre-building rollout.samples and driving
process_group / process_batch directly.
Comment on lines +1 to +24
"""Sink-level tests for filter actions: ``process_group`` ordering (penalty
visible to the advantage baseline + sample propagation) and ``process_batch``
post-filter sample re-sync.

``TrainSink``'s heavy constructor args (tokenizer / renderer / real envs)
are only used by ``add()`` / ``process_rollout``; these tests bypass them by
pre-building ``rollout.samples`` and driving ``process_group`` /
``process_batch`` directly.
"""

import math
import uuid
from types import SimpleNamespace

from prime_rl.configs.orchestrator import DefaultAdvantageConfig
from prime_rl.orchestrator.advantage import setup_advantage_fn
from prime_rl.orchestrator.filters import GibberishFilter, ZeroAdvantageFilter
from prime_rl.orchestrator.train_sink import TrainSink
from prime_rl.orchestrator.types import TrainRollout
from prime_rl.transport import TrainingSample

VOCAB_SIZE = 128_000


Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lest remove tests

Comment on lines +223 to +259
def test_config_default_actions():
assert GibberishFilterConfig().resolved_action == "monitor"
assert RepetitionFilterConfig().resolved_action == "monitor"
assert ZeroAdvantageFilterConfig().resolved_action == "drop"


def test_config_legacy_enforce_true_resolves_to_drop():
assert GibberishFilterConfig(enforce=True).resolved_action == "drop"
assert RepetitionFilterConfig(enforce=True).resolved_action == "drop"
assert ZeroAdvantageFilterConfig(enforce=True).resolved_action == "drop"


def test_config_legacy_enforce_false_resolves_to_monitor():
assert GibberishFilterConfig(enforce=False).resolved_action == "monitor"
assert RepetitionFilterConfig(enforce=False).resolved_action == "monitor"
assert ZeroAdvantageFilterConfig(enforce=False).resolved_action == "monitor"


def test_config_penalize_parses_with_penalty_reward():
config = GibberishFilterConfig(action="penalize", penalty_reward=-0.5)
assert config.resolved_action == "penalize"
assert config.penalty_reward == -0.5


def test_config_conflicting_action_and_enforce_raises():
with pytest.raises(ValidationError):
GibberishFilterConfig(action="penalize", enforce=True)
with pytest.raises(ValidationError):
RepetitionFilterConfig(action="monitor", enforce=True)
with pytest.raises(ValidationError):
ZeroAdvantageFilterConfig(action="drop", enforce=False)


def test_config_consistent_action_and_enforce_ok():
assert GibberishFilterConfig(action="drop", enforce=True).resolved_action == "drop"
assert RepetitionFilterConfig(action="monitor", enforce=False).resolved_action == "monitor"

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets remove

Comment on lines +459 to +465
def resolved_action(self) -> FilterAction:
"""The effective action: explicit ``action`` wins, then legacy ``enforce``, then the per-type default."""
if self.action is not None:
return self.action
if self.enforce is not None:
return "drop" if self.enforce else "monitor"
return self._default_action

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

lets rather use proper default value that having | None and doing this logic here

class BaseFilterConfig(BaseConfig):
"""Shared action fields for rollout filters.

Exactly one of ``action`` / ``enforce`` should be set (``enforce`` is legacy

Copy link
Copy Markdown
Member

Choose a reason for hiding this comment

The reason will be displayed to describe this comment to others. Learn more.

I think we can just remove the backwards compat here

Sign up for free to join this conversation on GitHub. Already have an account? Sign in to comment

Labels

None yet

Projects

None yet

Development

Successfully merging this pull request may close these issues.

3 participants