fix(weave): exclude op methods from LLMAsAJudgeScorer publish payload#7183
Open
ro31337 wants to merge 3 commits into
Programmatically created LLMAsAJudgeScorer objects published to an online-scoring monitor carried op refs (score, the inherited summarize, and the nested model's predict) that serialize to CustomWeaveType(Op) and trip the scoring worker's safety guard, so the monitor silently never scored. Building the same scorer in the UI worked. Set _weave_exclude_ops_from_record on LLMAsAJudgeScorer and LLMStructuredCompletionModel — the same opt-out RemoteScorer uses (#7036) — so the published shape matches what the UI already persists. The ops still run and trace at runtime; only the unused stored ref is dropped. SDK-only: a ClassVar is not a pydantic field, so there is no schema change.
Codecov Report: ✅ All modified and coverable lines are covered by tests.
Preview this PR with FeatureBee: https://beta.wandb.ai/?betaVersion=57f63f865e6639aac59ee478c33e61a82a5f163d
Resolves the library_cases.py serialization fixture after #6914 (auto_summarize mixed-types fix) landed on master: keep #6914's Scorer.summarize / MyScorer digests, re-apply this branch's LLMAsAJudgeScorer/LLMStructuredCompletionModel op exclusions on top. Regenerated and verified green under Python 3.13 and 3.12.
# Runtime serialization produces version-dependent digest (for tests not explicitly using legacy)
# When following the directions in test_serialization_correctness.py, it will be necessary to set is_legacy=True.
Contributor
It looks like this PR updates the test case instead of creating a new one. I don't think that matches the original design of these tests.
From test_serialization_correctness.py:22:
"""
IMPORTANT RULES: Once a SerializationTestCase is created, it should never be modified.
As the code base evolves, it is expected that some of these test cases will break (since
the serialization format changes, op code changes, etc...). In such cases:
1. Copy the failing test case to a new test case.
2. Set the is_legacy flag to True on the new test case.
3. Rerun the test: this should PASS. If it does not, then it means you have made a
backwards incompatible change and data written by older clients will not be able to
be deserialized by newer clients.
4. Now you can modify the original test case to pass.
This methodology allows us to lock in the legacy serialization formats as a contract,
independent of the actual code that is used to serialize the data.
"""
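The four steps quoted above can be sketched as follows. This is a minimal illustration, not the repository's code: the SerializationTestCase fields shown here (id, exp_json), the snapshot_as_legacy helper, and the "op-ref" placeholder values are all hypothetical stand-ins chosen to make the workflow visible.

```python
from dataclasses import dataclass, replace
from typing import Any


@dataclass(frozen=True)
class SerializationTestCase:
    # Hypothetical, simplified fields; the real class may look different.
    id: str
    exp_json: dict[str, Any]
    is_legacy: bool = False


def snapshot_as_legacy(case: SerializationTestCase, version: int) -> SerializationTestCase:
    """Steps 1-2: copy the failing case and flag the copy as legacy."""
    return replace(case, id=f"{case.id} (legacy v{version})", is_legacy=True)


# The live case as it stood before the change: op refs are in the payload.
live = SerializationTestCase(
    id="judge-scorer",
    exp_json={"scoring_prompt": "Rate it", "score": "op-ref", "summarize": "op-ref"},
)

# Steps 1-2: freeze the old wire format under a new, legacy-flagged case.
legacy = snapshot_as_legacy(live, version=6)

# Step 3 would rerun the suite here; the legacy case must still deserialize.

# Step 4: only now edit the live case to the new, op-free shape.
updated_live = replace(
    live,
    exp_json={k: v for k, v in live.exp_json.items() if v != "op-ref"},
)

# Both cases stay in the suite: one guards the old format, one the new.
cases = [updated_live, legacy]
```

The point of the ordering is that the old expected payload is captured before the live case is touched, so the previous format remains under test as a deserialization check.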
Contributor
Also, the fact that this PR had to change this file while the RemoteScorer PR did not demonstrates that we have a hole in our test coverage. I've filed a new ticket: https://coreweave.atlassian.net/browse/WB-35583
…case
The serialization tests require copying a case to a new is_legacy=True case before modifying the live one, so the prior wire format stays covered as a deserialization contract. The op-excluding change modified the live case but skipped that snapshot step; add it as legacy v6 (the with-ops shape). It passes, confirming data written by older clients still round-trips under the new code.
JIRA Issue(s)
WB-35184
Description
Creating an LLMAsAJudgeScorer in the SDK and publishing it to attach to an online-scoring monitor produced a payload the scoring worker rejects, so the monitor silently never scored, even though building the same scorer through the UI worked. On publish, the client serializes each @op method as a ref that deserializes to a CustomWeaveType(Op), and the worker's safety check fails closed on those. A published judge scorer carried three of them: its own score, the inherited summarize, and (because the nested model is published as its own object) LLMStructuredCompletionModel.predict.
This sets _weave_exclude_ops_from_record = True on both LLMAsAJudgeScorer and LLMStructuredCompletionModel, the same opt-out RemoteScorer already uses (#7036). The op methods still run and trace at runtime; only their stored ref, which nothing reads, is dropped, so the published shape now matches what the UI already persists. It is SDK-only: a ClassVar is not a pydantic field, so there is no schema/zod change and no core-side dependency.
Testing
Added a regression test that pydantic_object_record drops score/summarize/predict and that a published scorer round-trips with no op refs in the stored payload. Regenerated the one serialization fixture this affects and ran the serialization suite on Python 3.13 (the version that produces it) and 3.12, plus the judge-scorer, structured-model, evaluation-ref-get, and remote-scorer suites, all green.
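For readers unfamiliar with the opt-out, here is a rough stdlib-only sketch of the idea in the Description: a class-level flag makes the record builder skip op methods, while the methods themselves remain callable. Only the flag name _weave_exclude_ops_from_record comes from this PR. The op decorator, object_record function, and class names below are hypothetical stand-ins and do not reproduce weave's actual pydantic_object_record.

```python
from typing import Any, Callable, ClassVar


def op(fn: Callable) -> Callable:
    """Hypothetical stand-in for an op decorator: it only tags the function."""
    fn._is_op = True
    return fn


def object_record(obj: Any) -> dict[str, Any]:
    """Build a publishable record; illustrative, not weave's implementation."""
    cls = type(obj)
    exclude_ops = getattr(cls, "_weave_exclude_ops_from_record", False)
    record: dict[str, Any] = dict(vars(obj))  # plain instance data
    if exclude_ops:
        return record
    # Walk the MRO so inherited ops (like summarize) are picked up too.
    for klass in cls.__mro__:
        for name, attr in vars(klass).items():
            if getattr(attr, "_is_op", False):
                record.setdefault(name, f"op-ref:{klass.__name__}.{name}")
    return record


class Scorer:
    @op
    def summarize(self, rows: list) -> dict:
        return {"n": len(rows)}


class WithOpsScorer(Scorer):
    """No opt-out: op refs end up in the record."""

    def __init__(self, prompt: str) -> None:
        self.scoring_prompt = prompt

    @op
    def score(self, output: str) -> dict:
        return {"ok": bool(output)}


class OpFreeScorer(WithOpsScorer):
    # Class-level flag only; it is not instance data, so it never
    # appears in the record itself.
    _weave_exclude_ops_from_record: ClassVar[bool] = True


with_ops = object_record(WithOpsScorer("Rate it"))
op_free = object_record(OpFreeScorer("Rate it"))
still_callable = OpFreeScorer("Rate it").score("hello")
```

In this sketch, with_ops carries entries for score and summarize, while op_free contains only scoring_prompt, and the excluded score method still runs normally.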