fix(weave): exclude op methods from LLMAsAJudgeScorer publish payload #7183

Open
ro31337 wants to merge 3 commits into master from roman/wb-35184-llm-judge-exclude-ops

Conversation

@ro31337 ro31337 commented Jun 11, 2026

Contributor

JIRA Issue(s)

WB-35184

Description

Creating an LLMAsAJudgeScorer in the SDK and publishing it to attach to an online-scoring monitor produced a payload the scoring worker rejects, so the monitor silently never scored — even though building the same scorer through the UI worked. On publish, the client serializes each @op method as a ref that deserializes to a CustomWeaveType(Op), and the worker's safety check fails closed on those. A published judge scorer carried three of them: its own score, the inherited summarize, and — because the nested model is published as its own object — LLMStructuredCompletionModel.predict.

This sets _weave_exclude_ops_from_record = True on both LLMAsAJudgeScorer and LLMStructuredCompletionModel, the same opt-out RemoteScorer already uses (#7036). The op methods still run and trace at runtime; only their stored ref — which nothing reads — is dropped, so the published shape now matches what the UI already persists. It is SDK-only: a ClassVar is not a pydantic field, so there is no schema/zod change and no core-side dependency.
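
A minimal, standard-library-only sketch of the opt-out described above. The `op` decorator, the `object_record` helper, and both classes are hypothetical stand-ins rather than Weave's actual implementation; only the flag name `_weave_exclude_ops_from_record` comes from this PR.

```python
from typing import Any, Callable, ClassVar


def op(fn: Callable) -> Callable:
    """Hypothetical stand-in for an op decorator: it only tags the function."""
    fn._is_op = True
    return fn


def object_record(obj: Any) -> dict[str, Any]:
    """Hypothetical record builder: instance fields, plus op refs unless opted out."""
    record = dict(vars(obj))  # instance data only; class-level attributes are not included
    if getattr(type(obj), "_weave_exclude_ops_from_record", False):
        return record
    for name in dir(type(obj)):
        attr = getattr(type(obj), name, None)
        if getattr(attr, "_is_op", False):
            record[name] = f"op-ref:{name}"  # placeholder for a serialized op ref
    return record


class Scorer:
    def __init__(self, name: str) -> None:
        self.name = name

    @op
    def score(self, output: str) -> bool:
        return bool(output)

    @op
    def summarize(self, rows: list[bool]) -> float:
        return sum(rows) / len(rows) if rows else 0.0


class JudgeScorer(Scorer):
    # Class-level flag, not instance data, so it never shows up in the record.
    _weave_exclude_ops_from_record: ClassVar[bool] = True


with_ops = object_record(Scorer("plain"))
without_ops = object_record(JudgeScorer("judge"))
```

In this sketch the opted-out class still exposes `score` and `summarize` as callable methods; only the record omits them, which mirrors the "ops still run, stored ref is dropped" behavior the description claims.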

Testing

Added a regression test asserting that pydantic_object_record drops score/summarize/predict and that a published scorer round-trips with no op refs in the stored payload. Regenerated the one serialization fixture this affects and ran the serialization suite on Python 3.13 (the version that produces it) and 3.12, plus the judge-scorer, structured-model, evaluation-ref-get, and remote-scorer suites. All pass.

Programmatically created LLMAsAJudgeScorer objects published to an online-scoring
monitor carried op refs (score, the inherited summarize, and the nested model's
predict) that serialize to CustomWeaveType(Op) and trip the scoring worker's
safety guard, so the monitor silently never scored. Building the same scorer in
the UI worked.

Set _weave_exclude_ops_from_record on LLMAsAJudgeScorer and
LLMStructuredCompletionModel — the same opt-out RemoteScorer uses (#7036) — so the
published shape matches what the UI already persists. The ops still run and trace
at runtime; only the unused stored ref is dropped. SDK-only: a ClassVar is not a
pydantic field, so there is no schema change.

codecov Bot commented Jun 11, 2026

Codecov Report

✅ All modified and coverable lines are covered by tests.

@ro31337 ro31337 marked this pull request as ready for review June 11, 2026 20:20
@ro31337 ro31337 requested a review from a team as a code owner June 11, 2026 20:20
Resolves the library_cases.py serialization fixture after #6914 (auto_summarize
mixed-types fix) landed on master: keep #6914's Scorer.summarize / MyScorer
digests, re-apply this branch's LLMAsAJudgeScorer/LLMStructuredCompletionModel
op exclusions on top. Regenerated and verified green under Python 3.13 and 3.12.
@ro31337 ro31337 requested a review from mscavezze-cw June 11, 2026 20:28


# Runtime serialization produces version-dependent digest (for tests not explicitly using legacy)
# When following the directions in test_serialization_correctness.py, it will be necessary to set is_legacy=True.

Contributor

It looks like this PR updates the test case instead of creating a new one. I don't think that matches the original design of these tests.

From test_serialization_correctness.py:22:

"""
IMPORTANT RULES: Once a SerializationTestCase is created, it should never be modified.
As the code base evolves, it is expected that some of these test cases will break (since
the serialization format changes, op code changes, etc...). In such cases:
1. Copy the failing test case to a new test case.
2. Set the is_legacy flag to True on the new test case.
3. Rerun the test: this should PASS. If it does not, then it means you have made a
backwards incompatible change and data written by older clients will not be able to
be deserialized by newer clients.
4. Now you can modify the original test case to pass.

This methodology allows us to lock in the legacy serialization formats as a contract,
independent of the actual code that is used to serialize the data.
"""
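
The copy-then-modify workflow in the quoted rules can be sketched as follows. The `SerializationTestCase` shape here is a simplified, hypothetical stand-in (the real class is not shown in this thread), and the case ids and payloads are illustrative only.

```python
from dataclasses import dataclass, replace
from typing import Any


@dataclass(frozen=True)
class SerializationTestCase:
    """Hypothetical, simplified shape; the real test case class may differ."""
    id: str
    serialized: dict[str, Any]
    is_legacy: bool = False


# The live case as it existed before the change (op ref still in the payload).
live_case = SerializationTestCase(
    id="judge_scorer",
    serialized={"name": "judge", "score": "op-ref:score"},
)

# Steps 1-2: copy the failing case and mark the copy as legacy.
legacy_case = replace(live_case, id="judge_scorer_legacy", is_legacy=True)

# Step 3 (not shown): rerun the suite; the legacy case must still deserialize.

# Step 4: only now update the live case to the new format (op ref dropped).
live_case = replace(live_case, serialized={"name": "judge"})
```

Freezing the dataclass in this sketch is a design choice that echoes the "never modified" rule: a change has to go through an explicit copy, so the old payload stays pinned on the legacy case.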

Contributor

Also, the fact that this PR had to change this file while the RemoteScorer PR did not shows that we have a hole in our test coverage. I've filed a new ticket: https://coreweave.atlassian.net/browse/WB-35583

…case

The serialization tests require copying a case to a new is_legacy=True case
before modifying the live one, so the prior wire format stays covered as a
deserialization contract. The op-excluding change modified the live case but
skipped that snapshot step; add it as legacy v6 (the with-ops shape). It passes,
confirming data written by older clients still round-trips under the new code.