86 changes: 86 additions & 0 deletions plugins/two-node/evals/README.md
@@ -0,0 +1,86 @@
# Evaluation Configs

Automated quality scoring for two-node plugin skills using the
[agent-eval-harness](https://github.com/opendatahub-io/agent-eval-harness)
Claude Code plugin.

Evals measure skill quality on a spectrum (judges score 1-5, not
pass/fail) — they catch regressions and drift, not exact-match
correctness.

## Available Evals

| Config | Skill | Modes Tested | Cases |
|--------|-------|--------------|-------|
| `cluster-diagnostic.yaml` | `two-node:cluster-diagnostic` | validate, recovery-guide, game | 6 |
| `threat-model-tnf.yaml` | `threat-model:tnf` | PR analysis | 5 |

## Running Locally

```bash
# Install the eval harness plugin
/plugin marketplace add opendatahub-skills/agent-eval-harness

# Run an existing eval
/eval-run --model claude-opus-4-6 --config evals/cluster-diagnostic.yaml
```

To create a new eval, see [Adding a New Eval](#adding-a-new-eval) below.

## Running in CI

Comment `/test eval-cluster-diagnostic` on a PR to trigger the eval job.
The CI workflow is defined in
[openshift/release](https://github.com/openshift/release) under
`ci-operator/config/openshift-eng/edge-tooling/`.

## Directory Structure

```text
evals/
├── <skill-name>.yaml              # Eval config (judges, thresholds, schema)
├── <skill-name>.md                # Cached skill analysis
└── <skill-name>/
    └── cases/
        └── case-NNN-<slug>/
            ├── input.yaml         # Scenario input
            └── annotations.yaml   # Expected outcomes
```
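
As a rough illustration of the layout above, the following sketch walks a skill's `cases/` directory and reports any case that lacks one of the two expected files. The function name `find_incomplete_cases` is hypothetical and is not part of the eval harness; it only assumes the directory structure shown in this README.

```python
from pathlib import Path

# The two files every case directory is expected to contain, per the layout above.
REQUIRED_FILES = ("input.yaml", "annotations.yaml")


def find_incomplete_cases(evals_dir: Path, skill_name: str) -> dict[str, list[str]]:
    """Map each case directory name to the required files it is missing.

    Hypothetical helper: not provided by agent-eval-harness.
    """
    cases_dir = evals_dir / skill_name / "cases"
    missing: dict[str, list[str]] = {}
    if not cases_dir.is_dir():
        return missing
    for case_dir in sorted(cases_dir.glob("case-*")):
        if not case_dir.is_dir():
            continue
        absent = [name for name in REQUIRED_FILES if not (case_dir / name).is_file()]
        if absent:
            missing[case_dir.name] = absent
    return missing
```

Calling `find_incomplete_cases(Path("evals"), "cluster-diagnostic")` before committing would return an empty dict when every case is complete.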

## Adding a New Eval

1. **Analyze the skill**: reads SKILL.md, designs judges, writes the eval config

   ```bash
   /eval-analyze --skill <name> --config evals/<name>.yaml
   ```

2. **Generate scenarios**: creates `input.yaml` + `annotations.yaml` per case

   ```bash
   /eval-dataset --config evals/<name>.yaml
   ```

3. **Run the eval**: executes the skill against each case, scores with judges, generates HTML report

   ```bash
   /eval-run --model claude-opus-4-6 --config evals/<name>.yaml
   ```

4. **Review results**: walk through cases, collect human feedback

   ```bash
   /eval-review --run-id <run-id> --config evals/<name>.yaml
   ```

5. **(Optional) Optimize**: auto-fix SKILL.md based on judge failures, re-run to verify

   ```bash
   /eval-optimize --config evals/<name>.yaml
   ```

6. **Commit and CI**
   - Commit `evals/<name>.yaml`, `evals/<name>.md`, and `evals/<name>/cases/` to this repo
   - Add a CI entry in [openshift/release](https://github.com/openshift/release)
     pointing `EVAL_CONFIG` to the yaml path
   - PR reviewers can then trigger the eval with `/test eval-<name>`
87 changes: 87 additions & 0 deletions plugins/two-node/evals/cluster-diagnostic.md
@@ -0,0 +1,87 @@
---
# Auto-generated by /eval-analyze — edit to override
skill: two-node:cluster-diagnostic
analyzed_at: 2026-06-05T23:00:00Z
skill_hash: bb04c2fed029

# Discovered skill capabilities
execution_mode: case
headless: true
dry_run: false

# Suggested judges (summary from analysis)
suggested_judges:
  - name: budget_check
    type: builtin
    description: "Cost stays within $3.00 per case"
  - name: severity_classification
    type: check
    description: "Validate mode assigns correct BLOCKER/WARNING/INFO severity"
  - name: procedure_completeness
    type: check
    description: "Recovery-guide mode returns bash commands, verification steps, parameter templates"
  - name: forbidden_recommendations
    type: check
    description: "Never recommends pcs standby, sequential shutdown, or shutdown -h"
  - name: knowledge_base_accuracy
    type: llm
    description: "Response accurately reflects TNF knowledge base content"
---

# Skill Analysis

The `two-node:cluster-diagnostic` skill diagnoses TNF (Two-Node Fencing)
cluster issues across 4 modes: diagnose (live SSH), validate (check proposed
procedures), recovery-guide (return correct procedures), and game (interactive
training). The skill encodes 7 validated bare metal test scenarios (HPE ProLiant
e920t, OCP 4.22.0-rc.3) into a knowledge base.

Contributor

Should we mention the bare metal system and OCP details? This should work against any bare metal system and OCP version, right?

Contributor Author

This file (cluster-diagnostic.md) is generated by /eval-analyze. The cluster-diagnostic tool should work with any OCP version, on bare metal or VMs. If the hardware and version details need changing, the referenced ./cluster-knowledge-base.md is what would need updating, not this markdown file. This can probably be addressed for accuracy, but it does not necessarily affect the test output.


**Eval scope**: `validate` and `recovery-guide` are the modes directly testable
in eval. `diagnose` requires live SSH access, and `game` requires interactive
AskUserQuestion handling; game mode can be tested with tool interception, but
that adds complexity.

## Inputs

Each test case has `input.yaml` with:

- `command_input`: Full argument string (e.g., `validate "cordon, drain, shutdown"`,
`recovery-guide full-shutdown`)
- `mode`: Which mode is being tested (`validate`, `recovery-guide`, `game`)

And `annotations.yaml` with expected outcomes:

- `expected_blockers`: List of BLOCKER findings expected (validate mode)
- `expected_warnings`: List of WARNING findings expected
- `expected_scenario`: Scenario name (recovery-guide mode)
- `should_reject`: Whether the procedure should be rejected (validate mode)
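
To make the field lists above concrete, here is a minimal sketch that checks an already-parsed case (two plain dicts) against them. It assumes the four annotation fields listed here are the complete set, which the analysis does not state, and `case_problems` is a hypothetical helper, not something the harness provides.

```python
# Modes and field names are taken from the lists above; treating the
# annotation fields as exhaustive is an assumption made for this sketch.
ALLOWED_MODES = {"validate", "recovery-guide", "game"}
INPUT_KEYS = {"command_input", "mode"}
ANNOTATION_KEYS = {
    "expected_blockers",
    "expected_warnings",
    "expected_scenario",
    "should_reject",
}


def case_problems(case_input: dict, annotations: dict) -> list[str]:
    """Return human-readable problems found in one parsed test case."""
    problems = []
    # Both input fields are treated as required.
    for key in sorted(INPUT_KEYS - set(case_input)):
        problems.append(f"input.yaml: missing '{key}'")
    mode = case_input.get("mode")
    if mode is not None and mode not in ALLOWED_MODES:
        problems.append(f"input.yaml: unknown mode '{mode}'")
    # Flag likely typos in annotation field names.
    for key in sorted(set(annotations) - ANNOTATION_KEYS):
        problems.append(f"annotations.yaml: unexpected key '{key}'")
    return problems
```

A misspelled key such as `expected_blocker` would be flagged by this check. Without one, a typo like that could go unnoticed if a judge only looks up the correctly spelled field.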

## Outputs

All output is conversational — the skill writes nothing to disk. Judges use
`{{ conversation }}` to evaluate the assistant's response text.

## Pipeline Flow

1. Parse argument to determine mode
2. Read `cluster-knowledge-base.md` (800+ lines with 7 failure modes, severity
table, correct procedures, edge cases)
3. For validate: parse procedure text → check each step against 7 failure modes
→ report BLOCKER/WARNING/INFO with explanations
4. For recovery-guide: look up scenario → return step-by-step bash commands with
parameter templates and verification steps
5. For game: read `game-mode.md` → present questions via AskUserQuestion → score
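
Step 1 of the flow can be pictured with the short sketch below. The skill itself is driven by SKILL.md instructions rather than by code like this, so `split_mode` is purely illustrative and hypothetical; it shows how a `command_input` string such as the examples in the Inputs section separates into a mode and its remaining arguments.

```python
import shlex

# The four modes named in the skill analysis.
KNOWN_MODES = ("diagnose", "validate", "recovery-guide", "game")


def split_mode(command_input: str) -> tuple[str, list[str]]:
    """Split an argument string into (mode, remaining arguments).

    Illustrative only: the real skill interprets its arguments via SKILL.md.
    """
    # shlex keeps a quoted procedure such as "cordon, drain, shutdown" as one token.
    tokens = shlex.split(command_input)
    if not tokens or tokens[0] not in KNOWN_MODES:
        raise ValueError(f"unrecognized mode in {command_input!r}")
    return tokens[0], tokens[1:]
```

With this sketch, `split_mode('validate "cordon, drain, shutdown"')` yields the mode `validate` and a single argument holding the whole quoted procedure.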

## Quality Criteria

**Deterministic** (code-checkable):

- Severity classification matches knowledge base table
- Never recommends pcs standby, sequential shutdown, or shutdown -h
- Recovery procedures include bash commands and verification steps
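
A code-checkable criterion like the forbidden-recommendations rule could look roughly like the sketch below. This is not the judge that `/eval-analyze` generates. The patterns are assumptions, including the choice to also match the `pcs node standby` spelling, and "sequential shutdown" is left out because it describes a procedure rather than a literal command.

```python
import re

# Hypothetical patterns for two of the three forbidden recommendations.
FORBIDDEN_PATTERNS = {
    "pcs standby": re.compile(r"\bpcs\s+(?:node\s+)?standby\b", re.IGNORECASE),
    "shutdown -h": re.compile(r"\bshutdown\s+-h\b", re.IGNORECASE),
}


def forbidden_hits(conversation: str) -> list[str]:
    """Return the names of forbidden commands that appear in the response text."""
    return [
        name
        for name, pattern in FORBIDDEN_PATTERNS.items()
        if pattern.search(conversation)
    ]
```

A plain text match like this cannot tell a recommendation from a warning. A response that says "never run `shutdown -h`" would still be flagged, so a real check would need context-aware handling or an LLM judge as a fallback.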

**LLM judgment** (requires reasoning):

- Response accurately reflects TNF architecture facts
- Failure mode explanations reference correct root causes
- Recovery procedures match validated bare metal test results