fix(weave): exclude op methods from LLMAsAJudgeScorer publish payload by ro31337 · Pull Request #7183 · wandb/weave

ro31337 · 2026-06-11T16:21:29Z

JIRA Issue(s)

Description

Creating an LLMAsAJudgeScorer in the SDK and publishing it to attach to an online-scoring monitor produced a payload the scoring worker rejects, so the monitor silently never scored — even though building the same scorer through the UI worked. On publish, the client serializes each @op method as a ref that deserializes to a CustomWeaveType(Op), and the worker's safety check fails closed on those. A published judge scorer carried three of them: its own score, the inherited summarize, and — because the nested model is published as its own object — LLMStructuredCompletionModel.predict.

This sets _weave_exclude_ops_from_record = True on both LLMAsAJudgeScorer and LLMStructuredCompletionModel, the same opt-out RemoteScorer already uses (#7036). The op methods still run and trace at runtime; only their stored ref — which nothing reads — is dropped, so the published shape now matches what the UI already persists. It is SDK-only: a ClassVar is not a pydantic field, so there is no schema/zod change and no core-side dependency.
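To make the opt-out concrete, here is a minimal sketch of the idea in plain Python. It is not weave's implementation: the `op` decorator, `object_record`, and the `ref:op/...` strings are hypothetical stand-ins, and only the flag name `_weave_exclude_ops_from_record` comes from this PR. It assumes a record builder that checks the class-level flag before attaching refs for op-tagged methods.

```python
from typing import Any, Callable, ClassVar


def op(fn: Callable) -> Callable:
    """Hypothetical stand-in for an @op decorator: it only tags the function."""
    fn._is_op = True
    return fn


def object_record(obj: Any) -> dict[str, Any]:
    """Toy record builder: instance fields, plus op-method refs unless the class opts out."""
    record = dict(vars(obj))
    cls = type(obj)
    if getattr(cls, "_weave_exclude_ops_from_record", False):
        return record  # opt-out: store fields only
    for name in dir(cls):
        attr = getattr(cls, name, None)
        if callable(attr) and getattr(attr, "_is_op", False):
            record[name] = f"ref:op/{cls.__name__}.{name}"  # made-up ref format
    return record


class Scorer:
    def __init__(self, prompt: str) -> None:
        self.prompt = prompt

    @op
    def summarize(self, rows: list) -> int:
        return len(rows)


class JudgeWithOps(Scorer):
    @op
    def score(self, output: str) -> bool:
        return bool(output)


class JudgeWithoutOps(JudgeWithOps):
    # A class-level flag, not an instance field, so it never appears in the record.
    _weave_exclude_ops_from_record: ClassVar[bool] = True


with_ops = object_record(JudgeWithOps("rate this"))
without_ops = object_record(JudgeWithoutOps("rate this"))
```

In this sketch `with_ops` carries entries for both `score` and the inherited `summarize`, while `without_ops` holds only `prompt`. The methods on `JudgeWithoutOps` remain callable, which mirrors the claim that only the stored ref is dropped.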

Testing

Added a regression test that pydantic_object_record drops score/summarize/predict and that a published scorer round-trips with no op refs in the stored payload. Regenerated the one serialization fixture this affects and ran the serialization suite on Python 3.13 (the version that produces it) and 3.12, plus the judge-scorer, structured-model, evaluation-ref-get, and remote-scorer suites — all green.

Programmatically created LLMAsAJudgeScorer objects published to an online-scoring monitor carried op refs (score, the inherited summarize, and the nested model's predict) that serialize to CustomWeaveType(Op) and trip the scoring worker's safety guard, so the monitor silently never scored. Building the same scorer in the UI worked. Set _weave_exclude_ops_from_record on LLMAsAJudgeScorer and LLMStructuredCompletionModel — the same opt-out RemoteScorer uses (#7036) — so the published shape matches what the UI already persists. The ops still run and trace at runtime; only the unused stored ref is dropped. SDK-only: a ClassVar is not a pydantic field, so there is no schema change.

codecov · 2026-06-11T16:24:21Z

Codecov Report

✅ All modified and coverable lines are covered by tests.


wandbot-3000 · 2026-06-11T16:26:49Z

Preview this PR with FeatureBee: https://beta.wandb.ai/?betaVersion=57f63f865e6639aac59ee478c33e61a82a5f163d

Resolves the library_cases.py serialization fixture after #6914 (auto_summarize mixed-types fix) landed on master: keep #6914's Scorer.summarize / MyScorer digests, re-apply this branch's LLMAsAJudgeScorer/LLMStructuredCompletionModel op exclusions on top. Regenerated and verified green under Python 3.13 and 3.12.

mscavezze-cw · 2026-06-11T23:01:15Z



 # Runtime serialization produces version-dependent digest (for tests not explicitly using legacy)
 # When following the directions in test_serialization_correctness.py, it will be necessary to set is_legacy=True.


It looks like this PR updates the test case instead of creating a new one. I don't think that matches the original design of these tests.

From test_serialization_correctness.py:22:

"""
IMPORTANT RULES:
Once a SerializationTestCase is created, it should never be modified. As the code base evolves, it is expected that some of these test cases will break (since the serialization format changes, op code changes, etc...). In such cases:
1. Copy the failing test case to a new test case.
2. Set the is_legacy flag to True on the new test case.
3. Rerun the test: this should PASS. If it does not, then it means you have made a backwards incompatible change and data written by older clients will not be able to be deserialized by newer clients.
4. Now you can modify the original test case to pass.

This methodology allows us to lock in the legacy serialization formats as a contract, independent of the actual code that is used to serialize the data.
"""

Also, the fact that this PR had to change this file while the RemoteScorer PR did not shows that we have a hole in our test coverage. I've filed a new ticket: https://coreweave.atlassian.net/browse/WB-35583

…case The serialization tests require copying a case to a new is_legacy=True case before modifying the live one, so the prior wire format stays covered as a deserialization contract. The op-excluding change modified the live case but skipped that snapshot step; add it as legacy v6 (the with-ops shape). It passes, confirming data written by older clients still round-trips under the new code.
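The snapshot step described above can be sketched roughly as follows. This is a toy model, not the repository's test harness: `Case`, `serialize`, `deserialize`, and `check` are hypothetical names, and the JSON payloads are invented. It assumes a legacy case only has to deserialize under current code, while a live case must also match what the current writer produces.

```python
import json
from dataclasses import dataclass
from typing import Any


@dataclass(frozen=True)
class Case:
    """Hypothetical test case: a frozen stored payload and the value it should load to."""
    name: str
    stored: str
    expected: dict[str, Any]
    is_legacy: bool = False


def _is_op_ref(value: Any) -> bool:
    return isinstance(value, str) and value.startswith("ref:op/")


def serialize(value: dict[str, Any]) -> str:
    """Toy current writer: never emits op refs."""
    kept = {k: v for k, v in value.items() if not _is_op_ref(v)}
    return json.dumps(kept, sort_keys=True)


def deserialize(stored: str) -> dict[str, Any]:
    """Toy current reader: ignores op refs written by older clients."""
    return {k: v for k, v in json.loads(stored).items() if not _is_op_ref(v)}


def check(case: Case) -> bool:
    # Every case, legacy or live, must still deserialize to the expected value.
    if deserialize(case.stored) != case.expected:
        return False
    # Only live cases also pin the current writer's output.
    return case.is_legacy or serialize(case.expected) == case.stored


old_payload = '{"prompt": "rate this", "score": "ref:op/Judge.score"}'
legacy = Case("judge_legacy", old_payload, {"prompt": "rate this"}, is_legacy=True)
live = Case("judge", '{"prompt": "rate this"}', {"prompt": "rate this"})
stale = Case("judge_unmodified", old_payload, {"prompt": "rate this"})

results = {c.name: check(c) for c in (legacy, live, stale)}
```

Here `legacy` and `live` pass, while `stale` (the old with-ops payload left as a live case) fails. That is why the old shape is copied into a legacy case before the live one is updated.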

ro31337 marked this pull request as ready for review June 11, 2026 20:20

ro31337 requested a review from a team as a code owner June 11, 2026 20:20

ro31337 requested a review from mscavezze-cw June 11, 2026 20:28

mscavezze-cw reviewed Jun 11, 2026

ro31337 wants to merge 3 commits into master from roman/wb-35184-llm-judge-exclude-ops
