feat(orchestrator): add penalize action for gibberish/repetition filters#2775
Open
anravich13-cloud wants to merge 1 commit into
Open
feat(orchestrator): add penalize action for gibberish/repetition filters#2775anravich13-cloud wants to merge 1 commit into
anravich13-cloud wants to merge 1 commit into
Conversation
Adds an opt-in `penalize` action for rollout filters. Gibberish and
repetition filters can now cap detected rollout rewards to a configurable
negative value (`penalty_reward`, default -1.0) before advantage
computation, creating negative training signal without dropping the
rollout.
- Filter configs gain `action` (monitor|drop|penalize) + `penalty_reward`;
legacy `enforce` configs still parse (true→drop, false→monitor),
conflicting combinations raise a validation error.
- Filters gain a phase: gibberish/repetition are pre_advantage,
zero_advantage is post_advantage. TrainSink.process_group now applies
pre-advantage filters before assign_advantages so penalized rewards are
visible to the group baseline (and to sample.reward propagation).
- Penalties preserve the original reward (rollout.raw_reward) and record
per-filter metadata (rollout.reward_penalties); both flow into saved
rollouts via to_dict.
- New metrics: filters/all/{name}_penalized and raw_reward/all/{mean,max,min}
when penalties fire.
- Defaults unchanged: gibberish/repetition monitor, zero_advantage drop.
There was a problem hiding this comment.
Cursor Bugbot has reviewed your changes and found 1 potential issue.
❌ Bugbot Autofix is OFF. To automatically fix reported issues with cloud agents, enable autofix in the Cursor dashboard.
Reviewed by Cursor Bugbot for commit 5208ce8. Configure here.
| num_filtered += 1 | ||
| for name, hit in r.filter_results.items(): | ||
| if hit: | ||
| if hit and name in drop_filter_names: |
There was a problem hiding this comment.
Post-batch penalize stale samples
Medium Severity
If a penalize filter is configured on post_batch_filters, process_batch caps rollout.raw["reward"] via apply_filters but never refreshes TrainingSample.reward (or advantage) after that pass. Shipped samples can still carry pre-penalty values from process_group, so trainer-bound data disagrees with the rollout reward used in metrics.
Reviewed by Cursor Bugbot for commit 5208ce8. Configure here.
This file contains hidden or bidirectional Unicode text that may be interpreted or compiled differently than what appears below. To review, open the file in an editor that reveals hidden Unicode characters.
Learn more about bidirectional Unicode characters
Sign up for free
to join this conversation on GitHub.
Already have an account?
Sign in to comment
Add this suggestion to a batch that can be applied as a single commit.This suggestion is invalid because no changes were made to the code.Suggestions cannot be applied while the pull request is closed.Suggestions cannot be applied while viewing a subset of changes.Only one suggestion per line can be applied in a batch.Add this suggestion to a batch that can be applied as a single commit.Applying suggestions on deleted lines is not supported.You must change the existing code in this line in order to create a valid suggestion.Outdated suggestions cannot be applied.This suggestion has been applied or marked resolved.Suggestions cannot be applied from pending reviews.Suggestions cannot be applied on multi-line comments.Suggestions cannot be applied while the pull request is queued to merge.Suggestion cannot be applied right now. Please check back later.


Adds an opt-in
penalizeaction for rollout filters. Gibberish and repetition filters can now cap detected rollout rewards to a configurable negative value (penalty_reward, default -1.0) before advantage computation, creating negative training signal without dropping the rollout.action(monitor|drop|penalize) +penalty_reward; legacyenforceconfigs still parse (true→drop, false→monitor), conflicting combinations raise a validation error.Note
Medium Risk
Changes GRPO training signal ordering (reward caps before advantages) and filter config semantics; defaults stay monitor/drop but misconfigured
penalizecould skew reward and advantage stats.Overview
Adds a
penalizerollout-filter action so gibberish/repetition detections can cap reward (defaultpenalty_reward=-1.0) instead of only monitoring or dropping rollouts.Filter config is unified under
BaseFilterConfig:action(monitor|drop|penalize), optionalpenalty_reward, andresolved_actionwith legacyenforce(true→drop,false→monitor) plus validation whenactionandenforcedisagree. Filters are split intopre_advantage(gibberish/repetition) vspost_advantage(zero advantage);TrainSink.process_groupruns pre-advantage filters beforeassign_advantagesso penalized rewards affect GRPO baselines, then post-advantage filters.Penalties update
raw["reward"], stashraw_reward/reward_penaltiesonTrainRollout, and surface in saved rollouts viato_dict.apply_filtersis safe across multiple passes; pre-batch drop stats count onlydropactions. New logging:filters/all/{name}_penalizedandraw_reward/all/*when penalties fire.Reviewed by Cursor Bugbot for commit 5208ce8. Bugbot is set up for automated code reviews on this repo. Configure here.