openshift-eng · dhensel-rh · Jun 6, 2026 · Jun 7, 2026 · Jun 8, 2026 · Jun 11, 2026
diff --git a/plugins/two-node/evals/README.md b/plugins/two-node/evals/README.md
@@ -0,0 +1,86 @@
+# Evaluation Configs
+
+Automated quality scoring for two-node plugin skills using the
+[agent-eval-harness](https://github.com/opendatahub-io/agent-eval-harness)
+Claude Code plugin.
+
+Evals measure skill quality on a spectrum (judges score 1-5, not
+pass/fail) — they catch regressions and drift, not exact-match
+correctness.
+
+## Available Evals
+
+| Config | Skill | Modes Tested | Cases |
+|--------|-------|--------------|-------|
+| `cluster-diagnostic.yaml` | `two-node:cluster-diagnostic` | validate, recovery-guide, game | 6 |
+| `threat-model-tnf.yaml` | `threat-model:tnf` | PR analysis | 5 |
+
+## Running Locally
+
+```bash
+# Install the eval harness plugin
+/plugin marketplace add opendatahub-skills/agent-eval-harness
+
+# Run an existing eval
+/eval-run --model claude-opus-4-6 --config evals/cluster-diagnostic.yaml
+```
+
+To create a new eval, see [Adding a New Eval](#adding-a-new-eval) below.
+
+## Running in CI
+
+Comment `/test eval-cluster-diagnostic` on a PR to trigger the eval job.
+The CI workflow is defined in
+[openshift/release](https://github.com/openshift/release) under
+`ci-operator/config/openshift-eng/edge-tooling/`.
+
+## Directory Structure
+
+```text
+evals/
+├── <skill-name>.yaml           # Eval config (judges, thresholds, schema)
+├── <skill-name>.md             # Cached skill analysis
+└── <skill-name>/
+    └── cases/
+        └── case-NNN-<slug>/
+            ├── input.yaml      # Scenario input
+            └── annotations.yaml # Expected outcomes
+```
+
+## Adding a New Eval
+
+1. **Analyze the skill** — reads SKILL.md, designs judges, writes the eval config
+
+   ```bash
+   /eval-analyze --skill <name> --config evals/<name>.yaml
+   ```
+
+2. **Generate scenarios** — creates `input.yaml` + `annotations.yaml` per case
+
+   ```bash
+   /eval-dataset --config evals/<name>.yaml
+   ```
+
+3. **Run the eval** — executes the skill against each case, scores with judges, generates HTML report
+
+   ```bash
+   /eval-run --model claude-opus-4-6 --config evals/<name>.yaml
+   ```
+
+4. **Review results** — walk through cases, collect human feedback
+
+   ```bash
+   /eval-review --run-id <run-id> --config evals/<name>.yaml
+   ```
+
+5. **(Optional) Optimize** — auto-fix SKILL.md based on judge failures, re-run to verify
+
+   ```bash
+   /eval-optimize --config evals/<name>.yaml
+   ```
+
+6. **Commit and CI**
+   - Commit `evals/<name>.yaml`, `evals/<name>.md`, and `evals/<name>/cases/` to this repo
+   - Add a CI entry in [openshift/release](https://github.com/openshift/release)
+     pointing `EVAL_CONFIG` to the yaml path
+   - PR reviewers can then trigger the eval with `/test eval-<name>`
diff --git a/plugins/two-node/evals/cluster-diagnostic.md b/plugins/two-node/evals/cluster-diagnostic.md
@@ -0,0 +1,87 @@
+---
+# Auto-generated by /eval-analyze — edit to override
+skill: two-node:cluster-diagnostic
+analyzed_at: 2026-06-05T23:00:00Z
+skill_hash: bb04c2fed029
+
+# Discovered skill capabilities
+execution_mode: case
+headless: true
+dry_run: false
+
+# Suggested judges (summary from analysis)
+suggested_judges:
+  - name: budget_check
+    type: builtin
+    description: "Cost stays within $3.00 per case"
+  - name: severity_classification
+    type: check
+    description: "Validate mode assigns correct BLOCKER/WARNING/INFO severity"
+  - name: procedure_completeness
+    type: check
+    description: "Recovery-guide mode returns bash commands, verification steps, parameter templates"
+  - name: forbidden_recommendations
+    type: check
+    description: "Never recommends pcs standby, sequential shutdown, or shutdown -h"
+  - name: knowledge_base_accuracy
+    type: llm
+    description: "Response accurately reflects TNF knowledge base content"
+---
+
+# Skill Analysis
+
+The `two-node:cluster-diagnostic` skill diagnoses TNF (Two-Node Fencing)
+cluster issues across 4 modes: diagnose (live SSH), validate (check proposed
+procedures), recovery-guide (return correct procedures), and game (interactive
+training). The skill encodes 7 validated bare metal test scenarios (HPE ProLiant
+e920t, OCP 4.22.0-rc.3) into a knowledge base.
+
+**Eval scope**: Only `validate` and `recovery-guide` modes are testable in eval
+because `diagnose` requires live SSH access and `game` requires interactive
+AskUserQuestion handling. Game mode can be tested with tool interception but
+adds complexity.
+
+## Inputs
+
+Each test case has `input.yaml` with:
+
+- `command_input`: Full argument string (e.g., `validate "cordon, drain, shutdown"`,
+  `recovery-guide full-shutdown`)
+- `mode`: Which mode is being tested (`validate`, `recovery-guide`, `game`)
+
+And `annotations.yaml` with expected outcomes:
+
+- `expected_blockers`: List of BLOCKER findings expected (validate mode)
+- `expected_warnings`: List of WARNING findings expected
+- `expected_scenario`: Scenario name (recovery-guide mode)
+- `should_reject`: Whether the procedure should be rejected (validate mode)
+
+## Outputs
+
+All output is conversational — the skill writes nothing to disk. Judges use
+`{{ conversation }}` to evaluate the assistant's response text.
+
+## Pipeline Flow
+
+1. Parse argument to determine mode
+2. Read `cluster-knowledge-base.md` (800+ lines with 7 failure modes, severity
+   table, correct procedures, edge cases)
+3. For validate: parse procedure text → check each step against 7 failure modes
+   → report BLOCKER/WARNING/INFO with explanations
+4. For recovery-guide: look up scenario → return step-by-step bash commands with
+   parameter templates and verification steps
+5. For game: read `game-mode.md` → present questions via AskUserQuestion → score
+
+## Quality Criteria
+
+**Deterministic** (code-checkable):
+
+- Severity classification matches knowledge base table
+- Never recommends pcs standby, sequential shutdown, or shutdown -h
+- Recovery procedures include bash commands and verification steps
+
+**LLM judgment** (requires reasoning):
+
+- Response accurately reflects TNF architecture facts
+- Failure mode explanations reference correct root causes
+- Recovery procedures match validated bare metal test results