feat(orchestrator): add penalize action for gibberish/repetition filters by anravich13-cloud · Pull Request #2775 · PrimeIntellect-ai/prime-rl

anravich13-cloud · 2026-06-11T19:11:26Z

Adds an opt-in `penalize` action for rollout filters. Gibberish and repetition filters can now cap detected rollout rewards to a configurable negative value (`penalty_reward`, default -1.0) before advantage computation, creating negative training signal without dropping the rollout.

- Filter configs gain `action` (monitor|drop|penalize) + `penalty_reward`; legacy `enforce` configs still parse (true→drop, false→monitor), and conflicting combinations raise a validation error.
- Filters gain a phase: gibberish/repetition are pre_advantage, zero_advantage is post_advantage. TrainSink.process_group now applies pre-advantage filters before assign_advantages so penalized rewards are visible to the group baseline (and to sample.reward propagation).
- Penalties preserve the original reward (rollout.raw_reward) and record per-filter metadata (rollout.reward_penalties); both flow into saved rollouts via to_dict.
- New metrics: filters/all/{name}_penalized and raw_reward/all/{mean,max,min} when penalties fire.
- Defaults unchanged: gibberish/repetition monitor, zero_advantage drop.
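To make the config semantics above concrete, here is a minimal sketch of how `action` and the legacy `enforce` flag could resolve to one effective action. `FilterConfigSketch` and its `default_action` field are hypothetical stand-ins, not the PR's actual classes, and it assumes that "conflicting" means an explicit `action` that differs from what `enforce` would map to.

```python
from dataclasses import dataclass
from typing import Literal, Optional

Action = Literal["monitor", "drop", "penalize"]


@dataclass
class FilterConfigSketch:
    """Hypothetical stand-in for a filter config; not the PR's real class."""

    action: Optional[Action] = None   # new-style setting
    enforce: Optional[bool] = None    # legacy flag: True -> drop, False -> monitor
    penalty_reward: float = -1.0      # default taken from the PR description
    default_action: Action = "monitor"  # assumption: per-filter default lives here

    def resolved_action(self) -> Action:
        # Map the legacy flag onto the new vocabulary, if it was set at all.
        legacy: Optional[Action] = None
        if self.enforce is not None:
            legacy = "drop" if self.enforce else "monitor"
        # Assumed conflict rule: both set and they disagree.
        if self.action is not None and legacy is not None and self.action != legacy:
            raise ValueError(
                f"action={self.action!r} conflicts with enforce={self.enforce!r}"
            )
        return self.action or legacy or self.default_action
```

Under these assumptions, `FilterConfigSketch(enforce=True).resolved_action()` yields `"drop"`, while `FilterConfigSketch(action="penalize", enforce=True)` raises on resolution.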

Note

Medium Risk
Changes GRPO training signal ordering (reward caps before advantages) and filter config semantics; defaults stay monitor/drop but misconfigured penalize could skew reward and advantage stats.

Overview
Adds a penalize rollout-filter action so gibberish/repetition detections can cap reward (default penalty_reward=-1.0) instead of only monitoring or dropping rollouts.

Filter config is unified under BaseFilterConfig: action (monitor | drop | penalize), optional penalty_reward, and resolved_action with legacy enforce (true→drop, false→monitor) plus validation when action and enforce disagree. Filters are split into pre_advantage (gibberish/repetition) vs post_advantage (zero advantage); TrainSink.process_group runs pre-advantage filters before assign_advantages so penalized rewards affect GRPO baselines, then post-advantage filters.
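A small worked example may help show why the ordering matters. The sketch below assumes a simplified GRPO-style advantage (reward minus the group mean, with no standard-deviation normalization) and assumes that "cap" means `min(reward, penalty_reward)`; both function names are illustrative, not the repository's.

```python
def group_advantages(rewards):
    """Simplified baseline: advantage = reward - group mean (an assumption)."""
    baseline = sum(rewards) / len(rewards)
    return [r - baseline for r in rewards]


def penalize(rewards, flagged, penalty_reward=-1.0):
    """Cap the reward of every flagged rollout at penalty_reward."""
    return [min(r, penalty_reward) if hit else r for r, hit in zip(rewards, flagged)]


rewards = [1.0, 1.0, 0.0, 1.0]
flagged = [False, False, False, True]  # last rollout tripped a filter

# Without the penalty, the flagged rollout looks as good as its peers.
before = group_advantages(rewards)
# before == [0.25, 0.25, -0.75, 0.25]

# Penalizing first moves the group mean from 0.75 to 0.25, so every
# advantage in the group shifts, not only the flagged rollout's.
after = group_advantages(penalize(rewards, flagged))
# after == [0.75, 0.75, -0.25, -1.25]
```

If the cap were applied only after advantages had been assigned, the baseline would still be 0.75 and the flagged rollout would keep its positive advantage of 0.25.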

Penalties update raw["reward"], stash raw_reward / reward_penalties on TrainRollout, and surface in saved rollouts via to_dict. apply_filters is safe across multiple passes; pre-batch drop stats count only drop actions. New logging: filters/all/{name}_penalized and raw_reward/all/* when penalties fire.
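The bookkeeping described above can be pictured roughly as follows. This is a sketch only: `RolloutSketch` and `apply_penalty` are hypothetical names, the metadata layout is invented for illustration, and the idempotence check (skip a filter that has already recorded a penalty) is just one way a repeated pass could be made safe.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, Optional


@dataclass
class RolloutSketch:
    """Hypothetical stand-in for TrainRollout; only the relevant fields."""

    raw: Dict[str, Any]
    raw_reward: Optional[float] = None
    reward_penalties: Dict[str, Dict[str, float]] = field(default_factory=dict)

    def to_dict(self) -> Dict[str, Any]:
        out = dict(self.raw)
        if self.raw_reward is not None:  # only emitted once a penalty fired
            out["raw_reward"] = self.raw_reward
            out["reward_penalties"] = dict(self.reward_penalties)
        return out


def apply_penalty(rollout: RolloutSketch, filter_name: str, penalty_reward: float = -1.0) -> None:
    # A second pass with the same filter is a no-op.
    if filter_name in rollout.reward_penalties:
        return
    # Stash the original reward once, before the first penalty touches it.
    if rollout.raw_reward is None:
        rollout.raw_reward = rollout.raw["reward"]
    rollout.raw["reward"] = min(rollout.raw["reward"], penalty_reward)
    rollout.reward_penalties[filter_name] = {"penalty_reward": penalty_reward}


rollout = RolloutSketch(raw={"reward": 0.8})
apply_penalty(rollout, "repetition")
apply_penalty(rollout, "repetition")  # repeated pass changes nothing
saved = rollout.to_dict()
```

After both calls, `saved["reward"]` is -1.0 and `saved["raw_reward"]` is still 0.8, so the pre-penalty value remains available for metrics such as raw_reward/all/mean.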

Reviewed by Cursor Bugbot for commit 5208ce8.


cursor

Cursor Bugbot has reviewed your changes and found 1 potential issue.


cursor · 2026-06-11T19:13:57Z

                num_filtered += 1
                for name, hit in r.filter_results.items():
-                    if hit:
+                    if hit and name in drop_filter_names:
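The changed condition in the hunk above restricts per-filter drop counts to filters whose action is drop. A standalone sketch of that counting logic, with a hypothetical function name and plain dicts standing in for `r.filter_results`, might look like this:

```python
from collections import Counter
from typing import Dict, Iterable, Set


def count_drop_hits(
    filter_results: Iterable[Dict[str, bool]],
    drop_filter_names: Set[str],
) -> Counter:
    """Count hits per filter, ignoring filters that only monitor or penalize."""
    counts: Counter = Counter()
    for results in filter_results:
        for name, hit in results.items():
            # Mirrors the diff: a hit counts only if the filter actually drops.
            if hit and name in drop_filter_names:
                counts[name] += 1
    return counts


hits = count_drop_hits(
    [
        {"gibberish": True, "zero_advantage": True},
        {"gibberish": True, "zero_advantage": False},
    ],
    drop_filter_names={"zero_advantage"},
)
```

Here `hits` contains one count for zero_advantage and none for gibberish, since gibberish is assumed to be configured as monitor or penalize in this example.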


Post-batch penalize stale samples

Medium Severity

If a penalize filter is configured on post_batch_filters, process_batch caps rollout.raw["reward"] via apply_filters but never refreshes TrainingSample.reward (or advantage) after that pass. Shipped samples can still carry pre-penalty values from process_group, so trainer-bound data disagrees with the rollout reward used in metrics.
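The mismatch Bugbot describes can be illustrated with a toy model. Everything below is hypothetical (the class names, the `samples` attribute, and the refresh helper are not from the PR), and it only resynchronizes rewards; advantages would additionally need to be recomputed against the group baseline, which is not shown.

```python
from dataclasses import dataclass, field
from typing import Any, Dict, List


@dataclass
class SampleSketch:
    """Hypothetical stand-in for TrainingSample."""

    reward: float


@dataclass
class RolloutWithSamples:
    """Hypothetical rollout that owns the samples built from it."""

    raw: Dict[str, Any]
    samples: List[SampleSketch] = field(default_factory=list)


def refresh_sample_rewards(rollout: RolloutWithSamples) -> int:
    """Copy the rollout's current reward onto its samples; return how many changed."""
    changed = 0
    for sample in rollout.samples:
        if sample.reward != rollout.raw["reward"]:
            sample.reward = rollout.raw["reward"]
            changed += 1
    return changed


# Samples were built in the group phase with the pre-penalty reward.
rollout = RolloutWithSamples(raw={"reward": 0.8}, samples=[SampleSketch(0.8)])
# A post-batch penalize pass then caps the rollout reward only.
rollout.raw["reward"] = min(rollout.raw["reward"], -1.0)
stale = rollout.samples[0].reward  # still the pre-penalty value
num_refreshed = refresh_sample_rewards(rollout)
```

Before the refresh the sample still carries 0.8 while the rollout reports -1.0, which is the disagreement between trainer-bound data and metrics that the comment flags.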


anravich13-cloud wants to merge 1 commit into main from feat/repetition-gibberish-penalty
