feat(orchestrator): add penalize action for gibberish/repetition filters #2775

Open
anravich13-cloud wants to merge 1 commit into main from feat/repetition-gibberish-penalty

anravich13-cloud commented Jun 11, 2026

Adds an opt-in `penalize` action for rollout filters. The gibberish and repetition filters can now cap the reward of a flagged rollout at a configurable negative value (`penalty_reward`, default -1.0) before advantage computation, which creates a negative training signal without dropping the rollout.

  • Filter configs gain `action` (monitor|drop|penalize) and `penalty_reward`. Legacy `enforce` configs still parse (true→drop, false→monitor); conflicting combinations raise a validation error.
  • Filters gain a phase: gibberish/repetition are pre_advantage, zero_advantage is post_advantage. TrainSink.process_group now applies pre-advantage filters before assign_advantages so penalized rewards are visible to the group baseline (and to sample.reward propagation).
  • Penalties preserve the original reward (rollout.raw_reward) and record per-filter metadata (rollout.reward_penalties); both flow into saved rollouts via to_dict.
  • New metrics: filters/all/{name}_penalized and raw_reward/all/{mean,max,min} when penalties fire.
  • Defaults unchanged: gibberish/repetition monitor, zero_advantage drop.
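
The `action`/`enforce` resolution described in the first bullet could look roughly like the sketch below. It is not the PR's `BaseFilterConfig`: `FilterConfigSketch`, `default_action`, and the use of a plain dataclass with `ValueError` are assumptions made for illustration, and "conflicting" is read here as "`action` and `enforce` both set and mapping to different actions".

```python
from dataclasses import dataclass
from typing import Optional

VALID_ACTIONS = ("monitor", "drop", "penalize")


@dataclass
class FilterConfigSketch:
    """Hypothetical stand-in for the PR's filter config; field names follow the description."""

    action: Optional[str] = None     # "monitor" | "drop" | "penalize"
    enforce: Optional[bool] = None   # legacy flag: True -> drop, False -> monitor
    penalty_reward: float = -1.0     # only meaningful when the action is "penalize"
    default_action: str = "monitor"  # assumption: per-filter default when nothing is set

    def _legacy_action(self) -> Optional[str]:
        # Translate the legacy boolean into the new action vocabulary.
        if self.enforce is None:
            return None
        return "drop" if self.enforce else "monitor"

    def __post_init__(self) -> None:
        if self.action is not None and self.action not in VALID_ACTIONS:
            raise ValueError(f"unknown action: {self.action!r}")
        legacy = self._legacy_action()
        # Reject configs where the new and legacy fields disagree.
        if self.action is not None and legacy is not None and self.action != legacy:
            raise ValueError(
                f"action={self.action!r} conflicts with enforce={self.enforce!r}"
            )

    @property
    def resolved_action(self) -> str:
        # Explicit action wins, then the legacy flag, then the filter's default.
        return self.action or self._legacy_action() or self.default_action
```

Under these assumptions, `FilterConfigSketch(enforce=True).resolved_action` resolves to `"drop"`, while `FilterConfigSketch(action="penalize", enforce=True)` raises at construction time.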

Note

Medium Risk
Changes GRPO training signal ordering (reward caps before advantages) and filter config semantics; defaults stay monitor/drop but misconfigured penalize could skew reward and advantage stats.

Overview
Adds a penalize rollout-filter action so gibberish/repetition detections can cap reward (default penalty_reward=-1.0) instead of only monitoring or dropping rollouts.

Filter config is unified under BaseFilterConfig: action (monitor | drop | penalize), optional penalty_reward, and resolved_action with legacy enforce (true→drop, false→monitor) plus validation when action and enforce disagree. Filters are split into pre_advantage (gibberish/repetition) vs post_advantage (zero advantage); TrainSink.process_group runs pre-advantage filters before assign_advantages so penalized rewards affect GRPO baselines, then post-advantage filters.
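
To see why the ordering matters, here is a small worked example. It assumes a simplified advantage (reward minus the group mean, with no standard-deviation normalization) and reads "cap" as `min(reward, penalty_reward)`; `assign_advantages_sketch` and `penalize_sketch` are illustrative helpers, not the functions in this PR.

```python
from statistics import mean


def assign_advantages_sketch(rewards):
    """Simplified group-relative advantage: reward minus the group mean."""
    baseline = mean(rewards)
    return [r - baseline for r in rewards]


def penalize_sketch(rewards, flagged, penalty_reward=-1.0):
    """Cap the reward of each flagged rollout at penalty_reward."""
    return [min(r, penalty_reward) if hit else r for r, hit in zip(rewards, flagged)]


rewards = [1.0, 0.5, 0.0, 0.5]
flagged = [False, False, False, True]  # the last rollout tripped a filter

# Penalty applied before advantages: the baseline drops from 0.5 to 0.125,
# so every rollout in the group sees a different advantage.
pre = assign_advantages_sketch(penalize_sketch(rewards, flagged))
print(pre)  # [0.875, 0.375, -0.125, -1.125]

# Advantages computed on the unpenalized rewards, for comparison.
unpenalized = assign_advantages_sketch(rewards)
print(unpenalized)  # [0.5, 0.0, -0.5, 0.0]
```

The flagged rollout moves from a neutral advantage to a strongly negative one, and the unflagged rollouts gain advantage because the group baseline falls. That shift is the behavior the risk note above refers to.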

Penalties update raw["reward"], stash raw_reward / reward_penalties on TrainRollout, and surface in saved rollouts via to_dict. apply_filters is safe across multiple passes; pre-batch drop stats count only drop actions. New logging: filters/all/{name}_penalized and raw_reward/all/* when penalties fire.
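
A minimal sketch of the bookkeeping described above, assuming a rollout object with a `raw` dict plus `raw_reward` and `reward_penalties` attributes. `RolloutSketch`, `apply_penalty_sketch`, and the shape of the per-filter metadata are hypothetical; only the attribute names come from the summary. Skipping a filter that has already been recorded is one way to keep repeated passes safe.

```python
from dataclasses import dataclass, field
from typing import Any, Optional


@dataclass
class RolloutSketch:
    """Hypothetical stand-in for TrainRollout with only the fields used here."""

    raw: dict[str, Any]
    raw_reward: Optional[float] = None  # original reward, set on first penalty
    reward_penalties: dict[str, dict[str, float]] = field(default_factory=dict)

    def to_dict(self) -> dict[str, Any]:
        out = dict(self.raw)
        if self.raw_reward is not None:
            out["raw_reward"] = self.raw_reward
            out["reward_penalties"] = self.reward_penalties
        return out


def apply_penalty_sketch(rollout, filter_name, penalty_reward=-1.0):
    """Cap the reward once per filter and remember the original value."""
    if filter_name in rollout.reward_penalties:
        return  # already penalized by this filter; a second pass is a no-op
    before = rollout.raw["reward"]
    if rollout.raw_reward is None:
        rollout.raw_reward = before  # preserve the pre-penalty reward
    after = min(before, penalty_reward)
    rollout.raw["reward"] = after
    rollout.reward_penalties[filter_name] = {"before": before, "after": after}


rollout = RolloutSketch(raw={"reward": 0.7})
apply_penalty_sketch(rollout, "gibberish")
apply_penalty_sketch(rollout, "gibberish")  # repeated pass changes nothing
print(rollout.to_dict())
```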

Reviewed by Cursor Bugbot for commit 5208ce8.

@cursor cursor Bot left a comment

Cursor Bugbot has reviewed your changes and found 1 potential issue.

 num_filtered += 1
 for name, hit in r.filter_results.items():
-    if hit:
+    if hit and name in drop_filter_names:

Post-batch penalize stale samples

Medium Severity

If a penalize filter is configured on post_batch_filters, process_batch caps rollout.raw["reward"] via apply_filters but never refreshes TrainingSample.reward (or advantage) after that pass. Shipped samples can still carry pre-penalty values from process_group, so trainer-bound data disagrees with the rollout reward used in metrics.
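
The mismatch can be reproduced with a toy example, along with one possible remedy: recompute the sample fields after the post-batch pass. This is a sketch under assumptions, not the PR's fix. `SampleSketch` and `refresh_samples_sketch` are hypothetical, samples are assumed to map one-to-one to rollouts, and the advantage is simplified to reward minus the group mean. Rejecting `penalize` on `post_batch_filters` at config validation would be another option.

```python
from dataclasses import dataclass
from statistics import mean


@dataclass
class SampleSketch:
    """Hypothetical trainer-bound sample with only the two fields named above."""

    reward: float
    advantage: float


def refresh_samples_sketch(samples, rollout_rewards):
    """Recompute reward and mean-baseline advantage from the current rollout rewards."""
    baseline = mean(rollout_rewards)
    for sample, reward in zip(samples, rollout_rewards):
        sample.reward = reward
        sample.advantage = reward - baseline


# Samples built during the group pass from the pre-penalty rewards [1.0, 0.0].
samples = [SampleSketch(reward=1.0, advantage=0.5), SampleSketch(reward=0.0, advantage=-0.5)]

# A post-batch penalize pass later caps the second rollout's reward at -1.0.
rollout_rewards = [1.0, -1.0]

# The second sample now disagrees with its rollout.
stale = [s.reward != r for s, r in zip(samples, rollout_rewards)]
print(stale)  # [False, True]

refresh_samples_sketch(samples, rollout_rewards)
print(samples)
```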
