OCPEDGE-2727: Add agent-eval-harness integration config for cluster-diagnostic skill#178
OCPEDGE-2727: Add agent-eval-harness integration config for cluster-diagnostic skill#178dhensel-rh wants to merge 7 commits into
Conversation
|
Skipping CI for Draft Pull Request. |
|
No actionable comments were generated in the recent review. 🎉 ℹ️ Recent review info⚙️ Run configurationConfiguration used: Repository YAML (base), Central YAML (inherited) Review profile: CHILL Plan: Enterprise Run ID: 📒 Files selected for processing (4)
✅ Files skipped from review due to trivial changes (3)
🚧 Files skipped from review as they are similar to previous changes (1)
WalkthroughAdds a complete automated eval framework under ChangesTwo-node eval framework
Estimated code review effort🎯 3 (Moderate) | ⏱️ ~20 minutes Possibly related PRs
🚥 Pre-merge checks | ✅ 10 | ❌ 1❌ Failed checks (1 warning)
✅ Passed checks (10 passed)
✏️ Tip: You can configure your own custom pre-merge checks in the settings. ✨ Finishing Touches🧪 Generate unit tests (beta)
Warning There were issues while running some tools. Please review the errors and either fix the tool's configuration or disable the tool if it's a critical failure. 🔧 markdownlint-cli2 (0.22.1)plugins/two-node/evals/cluster-diagnostic.mdmarkdownlint-cli2 v0.22.1 (markdownlint v0.40.0) ... [truncated 1092 characters] ... node:internal/modules/esm/resolve:271:11) plugins/two-node/evals/README.mdmarkdownlint-cli2 v0.22.1 (markdownlint v0.40.0) ... [truncated 1080 characters] ... node:internal/modules/esm/resolve:271:11) plugins/two-node/evals/threat-model-tnf.mdmarkdownlint-cli2 v0.22.1 (markdownlint v0.40.0) ... [truncated 1090 characters] ... node:internal/modules/esm/resolve:271:11) Comment |
|
[APPROVALNOTIFIER] This PR is APPROVED This pull-request has been approved by: dhensel-rh The full list of commands accepted by this bot can be found here. The pull request process is described here DetailsNeeds approval from an approver in each of these files:
Approvers can indicate their approval by writing |
|
|
||
| inputs: | ||
| tools: | ||
| - match: Questions asked to the user via AskUserQuestion (game mode) |
There was a problem hiding this comment.
The inputs.tools section defines AskUserQuestion interception for game mode, but the analysis doc explicitly says game mode is excluded from eval scope. None of the 5 cases test game mode.
This is unused config that could confuse future contributors.
There was a problem hiding this comment.
case-006-game-quiz exercises game mode — the inputs.tools AskUserQuestion hook is used there to auto-answer the quiz
| models: | ||
| skill: claude-opus-4-6 | ||
| judge: claude-opus-4-6 | ||
| hook: claude-sonnet-4-6 |
There was a problem hiding this comment.
can we drop both skill and hook here ?
There was a problem hiding this comment.
skill can be used here or on the command line
/eval-run --model claude-opus-4-6 --config evals/cluster-diagnostic.yaml
The hook is useful if I chose to exercise the game mode.
What is the motivation to drop ?
| hook: claude-sonnet-4-6 | ||
|
|
||
| permissions: | ||
| allow: [] |
There was a problem hiding this comment.
Do we want to list any speific tools or allow all ?
There was a problem hiding this comment.
I do not believe so. Do you have a suggestion ?
|
|
||
| if expected_blockers and not found_blockers: | ||
| return (False, f"Expected blockers {expected_blockers} not found in output") | ||
|
|
There was a problem hiding this comment.
Warnings are never validated. The expected_warnings field is
collected but never used by any judge — either add a check or remove the field from annotations to avoid false confidence.
There was a problem hiding this comment.
warning_classification judge was added. (lines 118–140) — it validates expected_warnings from annotations against the conversation output.
| sec_lower = section.lower() | ||
| if "pcs node standby" in sec_lower and "never" not in sec_lower and "do not" not in sec_lower: | ||
| forbidden.append("pcs node standby recommended") | ||
| if "shutdown -h 1" in sec_lower and "never" not in sec_lower and "do not" not in sec_lower: |
There was a problem hiding this comment.
The judge checks for "shutdown -h 1" (with the 1), but case-001's expected blocker is shutdown -h (without 1). If the skill outputs "shutdown -h" without the trailing argument, the
forbidden check wouldn't catch it. Should be "shutdown -h" in the judge.
|
|
||
| Return a JSON object: {"score": <1-5>, "rationale": "<explanation>"} | ||
|
|
||
| thresholds: |
There was a problem hiding this comment.
How are these defined ? they are from the eval-analyze command ?
There was a problem hiding this comment.
Yes, the eval-analyze generated this.
A threshold sets the bar for a judge across all cases in a run.
- 1.0 = all 5 must pass (5/5)
- 0.8 = at least 4 must pass (4/5)
- 0.6 = at least 3 must pass (3/5)
| cluster issues across 4 modes: diagnose (live SSH), validate (check proposed | ||
| procedures), recovery-guide (return correct procedures), and game (interactive | ||
| training). The skill encodes 7 validated bare metal test scenarios (HPE ProLiant | ||
| e920t, OCP 4.22.0-rc.3) into a knowledge base. |
There was a problem hiding this comment.
should we mention the baremetal system and ocp details ? this should work against any baremetal and ocp versions right ?
There was a problem hiding this comment.
This file (cluster-diagnostic.md) is generated from /eval-analyze. The cluster-diagnostic tool should work with any OCP version, baremetal, or VM. The reference ./cluster-knowledge-base.md skill would need updating, not this markdown file. This can probably be addressed for accuracy, but does not necessarily affect the test output.
…eat-model skills Add evaluation configs, test cases, and README for two skills: - cluster-diagnostic: 5 cases covering validate and recovery-guide modes - threat-model-tnf: 5 cases covering PR security analysis Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
The eval harness resolves dataset.path from the repo root, not relative to the config file. Both configs were using short relative paths that broke when running from different working directories. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
- Add case-006-game-quiz with quiz mode test case and answers - Add warning_classification judge for expected WARNING findings - Add game_mode_scoring judge for rating/score validation - Fix forbidden_recommendations to check 'shutdown -h' (not 'shutdown -h 1') - Update severity_classification description for clarity - Drop models.skill default (let CLI --model flag control it) - Simplify schema note to only exclude diagnose mode Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Evals score skill quality on a spectrum (1-5), not pass/fail. Update terminology to reflect this: testing→scoring, test cases→scenarios, test input→scenario input. Add game mode to cluster-diagnostic case count. Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
Raise cluster-diagnostic budget to $8 to cover full 6-scenario run cost. Relax threat-model report filename match from "THREAT-MODEL" to "THREAT" to handle naming variants (THREAT-REPORT, THREAT-ANALYSIS). Co-Authored-By: Claude Opus 4.6 <noreply@anthropic.com>
There was a problem hiding this comment.
Actionable comments posted: 1
🤖 Prompt for all review comments with AI agents
Verify each finding against current code. Fix only still-valid issues, skip the
rest with a brief reason, keep changes minimal, and validate.
Inline comments:
In `@plugins/two-node/evals/cluster-diagnostic.md`:
- Line 83: The phrase on line 83 contains an unhyphenated compound adjective
that should be hyphenated per standard English grammar rules. Change "bare metal
test results" to "bare-metal test results" since "bare-metal" is a compound
adjective modifying the noun "test results" and must be hyphenated when
appearing before the noun it modifies.
🪄 Autofix (Beta)
Fix all unresolved CodeRabbit comments on this PR:
- Push a commit to this branch (recommended)
- Create a new PR with the fixes
ℹ️ Review info
⚙️ Run configuration
Configuration used: Repository YAML (base), Central YAML (inherited)
Review profile: CHILL
Plan: Enterprise
Run ID: f9d67794-7dfe-4cf3-a2c4-aea6a827f0f8
📒 Files selected for processing (28)
plugins/two-node/evals/README.mdplugins/two-node/evals/cluster-diagnostic.mdplugins/two-node/evals/cluster-diagnostic.yamlplugins/two-node/evals/cluster-diagnostic/cases/case-001-validate-sequential-shutdown/annotations.yamlplugins/two-node/evals/cluster-diagnostic/cases/case-001-validate-sequential-shutdown/input.yamlplugins/two-node/evals/cluster-diagnostic/cases/case-002-validate-safe-redfish/annotations.yamlplugins/two-node/evals/cluster-diagnostic/cases/case-002-validate-safe-redfish/input.yamlplugins/two-node/evals/cluster-diagnostic/cases/case-003-recovery-full-shutdown/annotations.yamlplugins/two-node/evals/cluster-diagnostic/cases/case-003-recovery-full-shutdown/input.yamlplugins/two-node/evals/cluster-diagnostic/cases/case-004-recovery-standby/annotations.yamlplugins/two-node/evals/cluster-diagnostic/cases/case-004-recovery-standby/input.yamlplugins/two-node/evals/cluster-diagnostic/cases/case-005-validate-pcs-standby/annotations.yamlplugins/two-node/evals/cluster-diagnostic/cases/case-005-validate-pcs-standby/input.yamlplugins/two-node/evals/cluster-diagnostic/cases/case-006-game-quiz/annotations.yamlplugins/two-node/evals/cluster-diagnostic/cases/case-006-game-quiz/answers.yamlplugins/two-node/evals/cluster-diagnostic/cases/case-006-game-quiz/input.yamlplugins/two-node/evals/threat-model-tnf.mdplugins/two-node/evals/threat-model-tnf.yamlplugins/two-node/evals/threat-model-tnf/cases/case-001-shell-script-k8s-api/annotations.yamlplugins/two-node/evals/threat-model-tnf/cases/case-001-shell-script-k8s-api/input.yamlplugins/two-node/evals/threat-model-tnf/cases/case-002-credential-rotation-script/annotations.yamlplugins/two-node/evals/threat-model-tnf/cases/case-002-credential-rotation-script/input.yamlplugins/two-node/evals/threat-model-tnf/cases/case-003-mac-fencing-lookup/annotations.yamlplugins/two-node/evals/threat-model-tnf/cases/case-003-mac-fencing-lookup/input.yamlplugins/two-node/evals/threat-model-tnf/cases/case-004-trivial-indentation-fix/annotations.yamlplugins/two-node/evals/threat-model-tnf/cases/case-004-trivial-indentation-fix/input.yamlplugins/two-node/evals/threat-model-tnf/cases/case-005-tnf-retry-bugfix/annotations.yamlplugins/two-node/evals/threat-model-tnf/cases/case-005-tnf-retry-bugfix/input.yaml
Summary
cluster-diagnosticskill to enable automated qualitytesting via
/test eval-cluster-diagnosticin CIvalidatemode (severity classification of dangerous procedures)and
recovery-guidemode (correct shutdown/recovery procedures)recommendations, and cost budget; one LLM judge evaluates knowledge base accuracy
Summary by CodeRabbit
Release Notes
Documentation
Tests