diff --git a/plugins/two-node/evals/README.md b/plugins/two-node/evals/README.md
new file mode 100644
index 00000000..ecf3f815
--- /dev/null
+++ b/plugins/two-node/evals/README.md
@@ -0,0 +1,86 @@
+# Evaluation Configs
+
+Automated quality scoring for two-node plugin skills using the
+[agent-eval-harness](https://github.com/opendatahub-io/agent-eval-harness)
+Claude Code plugin.
+
+Evals measure skill quality on a spectrum rather than by exact match: check
+judges report pass rates and LLM judges score 1-5, so the results catch
+regressions and drift.
+
+## Available Evals
+
+| Config | Skill | Modes Tested | Cases |
+|--------|-------|--------------|-------|
+| `cluster-diagnostic.yaml` | `two-node:cluster-diagnostic` | validate, recovery-guide, game | 6 |
+| `threat-model-tnf.yaml` | `threat-model:tnf` | PR analysis | 5 |
+
+## Running Locally
+
+```bash
+# Install the eval harness plugin
+/plugin marketplace add opendatahub-skills/agent-eval-harness
+
+# Run an existing eval
+/eval-run --model claude-opus-4-6 --config evals/cluster-diagnostic.yaml
+```
+
+To create a new eval, see [Adding a New Eval](#adding-a-new-eval) below.
+
+## Running in CI
+
+Comment `/test eval-cluster-diagnostic` on a PR to trigger the eval job.
+The CI workflow is defined in
+[openshift/release](https://github.com/openshift/release) under
+`ci-operator/config/openshift-eng/edge-tooling/`.
+
+## Directory Structure
+
+```text
+evals/
+├── <skill-name>.yaml           # Eval config (judges, thresholds, schema)
+├── <skill-name>.md             # Cached skill analysis
+└── <skill-name>/
+    └── cases/
+        └── case-NNN-<slug>/
+            ├── input.yaml      # Scenario input
+            └── annotations.yaml # Expected outcomes
+```
+
+## Adding a New Eval
+
+1. **Analyze the skill** — reads SKILL.md, designs judges, writes the eval config
+
+   ```bash
+   /eval-analyze --skill <name> --config evals/<name>.yaml
+   ```
+
+2. **Generate scenarios** — creates `input.yaml` + `annotations.yaml` per case
+
+   ```bash
+   /eval-dataset --config evals/<name>.yaml
+   ```
+
+3. **Run the eval** — executes the skill against each case, scores with judges, generates HTML report
+
+   ```bash
+   /eval-run --model claude-opus-4-6 --config evals/<name>.yaml
+   ```
+
+4. **Review results** — walk through cases, collect human feedback
+
+   ```bash
+   /eval-review --run-id <run-id> --config evals/<name>.yaml
+   ```
+
+5. **(Optional) Optimize** — auto-fix SKILL.md based on judge failures, re-run to verify
+
+   ```bash
+   /eval-optimize --config evals/<name>.yaml
+   ```
+
+6. **Commit and CI**
+   - Commit `evals/<name>.yaml`, `evals/<name>.md`, and `evals/<name>/cases/` to this repo
+   - Add a CI entry in [openshift/release](https://github.com/openshift/release)
+     pointing `EVAL_CONFIG` to the yaml path
+   - PR reviewers can then trigger the eval with `/test eval-<name>`
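The case layout shown in the README's Directory Structure section can also be scaffolded by hand when `/eval-dataset` is not used. The following is a minimal sketch, assuming cases are plain directories named `case-NNN-<slug>` as shown there. `scaffold_case` and `write_yaml_mapping` are hypothetical helpers, not part of agent-eval-harness, and values are written with `json.dumps` on the assumption that JSON scalars and lists are acceptable YAML flow syntax for the harness.

```python
import json
import tempfile
from pathlib import Path


def write_yaml_mapping(path, fields):
    """Write a flat mapping as 'key: value' lines (hypothetical helper)."""
    # JSON scalars and lists are valid YAML flow syntax, so no YAML library is needed.
    lines = [f"{key}: {json.dumps(value)}" for key, value in fields.items()]
    path.write_text("\n".join(lines) + "\n", encoding="utf-8")


def scaffold_case(cases_dir, number, slug, input_fields, annotations):
    """Create case-NNN-<slug>/ containing input.yaml and annotations.yaml."""
    case_dir = Path(cases_dir) / f"case-{number:03d}-{slug}"
    case_dir.mkdir(parents=True, exist_ok=True)
    write_yaml_mapping(case_dir / "input.yaml", input_fields)
    write_yaml_mapping(case_dir / "annotations.yaml", annotations)
    return case_dir


if __name__ == "__main__":
    # Demo in a throwaway directory; field names mirror the cluster-diagnostic cases.
    with tempfile.TemporaryDirectory() as tmp:
        created = scaffold_case(
            tmp,
            7,
            "recovery-example",
            {"command_input": "recovery-guide full-shutdown", "mode": "recovery-guide"},
            {"mode": "recovery-guide", "expected_scenario": "full-shutdown", "should_reject": False},
        )
        print(created.name)
        print((created / "annotations.yaml").read_text(encoding="utf-8"))
```

Writing through `json.dumps` keeps quoting predictable for command strings that contain quotes or commas, at the cost of less hand-friendly YAML than the folded scalars used in the committed cases.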
diff --git a/plugins/two-node/evals/cluster-diagnostic.md b/plugins/two-node/evals/cluster-diagnostic.md
new file mode 100644
index 00000000..eb442663
--- /dev/null
+++ b/plugins/two-node/evals/cluster-diagnostic.md
@@ -0,0 +1,87 @@
+---
+# Auto-generated by /eval-analyze — edit to override
+skill: two-node:cluster-diagnostic
+analyzed_at: 2026-06-05T23:00:00Z
+skill_hash: bb04c2fed029
+
+# Discovered skill capabilities
+execution_mode: case
+headless: true
+dry_run: false
+
+# Suggested judges (summary from analysis)
+suggested_judges:
+  - name: budget_check
+    type: builtin
+    description: "Cost stays within $8.00 per case"
+  - name: severity_classification
+    type: check
+    description: "Validate mode assigns correct BLOCKER/WARNING/INFO severity"
+  - name: procedure_completeness
+    type: check
+    description: "Recovery-guide mode returns bash commands, verification steps, parameter templates"
+  - name: forbidden_recommendations
+    type: check
+    description: "Never recommends pcs node standby, sequential shutdown, or shutdown -h"
+  - name: knowledge_base_accuracy
+    type: llm
+    description: "Response accurately reflects TNF knowledge base content"
+---
+
+# Skill Analysis
+
+The `two-node:cluster-diagnostic` skill diagnoses TNF (Two-Node Fencing)
+cluster issues across 4 modes: diagnose (live SSH), validate (check proposed
+procedures), recovery-guide (return correct procedures), and game (interactive
+training). The skill encodes 7 validated bare metal test scenarios (HPE ProLiant
+e920t, OCP 4.22.0-rc.3) into a knowledge base.
+
+**Eval scope**: `validate` and `recovery-guide` run headlessly and make up most
+of the cases. `diagnose` is excluded because it requires live SSH access. `game`
+requires interactive AskUserQuestion handling, so it is tested through tool
+interception, which adds complexity.
+
+## Inputs
+
+Each test case has `input.yaml` with:
+
+- `command_input`: Full argument string (e.g., `validate "cordon, drain, shutdown"`,
+  `recovery-guide full-shutdown`)
+- `mode`: Which mode is being tested (`validate`, `recovery-guide`, `game`)
+
+And `annotations.yaml` with `mode` plus expected outcomes:
+
+- `expected_blockers`: List of BLOCKER findings expected (validate mode)
+- `expected_warnings`: List of WARNING findings expected
+- `expected_scenario`: Scenario name (recovery-guide mode)
+- `should_reject`: Whether the procedure should be rejected (validate mode)
+
+## Outputs
+
+All output is conversational — the skill writes nothing to disk. Judges use
+`{{ conversation }}` to evaluate the assistant's response text.
+
+## Pipeline Flow
+
+1. Parse argument to determine mode
+2. Read `cluster-knowledge-base.md` (800+ lines with 7 failure modes, severity
+   table, correct procedures, edge cases)
+3. For validate: parse procedure text → check each step against 7 failure modes
+   → report BLOCKER/WARNING/INFO with explanations
+4. For recovery-guide: look up scenario → return step-by-step bash commands with
+   parameter templates and verification steps
+5. For game: read `game-mode.md` → present questions via AskUserQuestion → score
+
+## Quality Criteria
+
+**Deterministic** (code-checkable):
+
+- Severity classification matches knowledge base table
+- Never recommends pcs node standby, sequential shutdown, or shutdown -h
+- Recovery procedures include bash commands and verification steps
+
+**LLM judgment** (requires reasoning):
+
+- Response accurately reflects TNF architecture facts
+- Failure mode explanations reference correct root causes
+- Recovery procedures match validated bare metal test results
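The deterministic criteria listed in the analysis come down to case-insensitive substring matching of annotated findings against the response text. The following is a minimal standalone sketch for trying that logic outside the harness, assuming only what the eval config suggests: a check yields a `(passed, message)` tuple. `match_expected` is a hypothetical name, and the sample text is invented for illustration.

```python
def match_expected(conversation, expected):
    """Return (passed, message); every expected phrase must appear, ignoring case."""
    if not conversation:
        return (False, "No conversation output found")
    conv_lower = conversation.lower()
    found = [e for e in expected if e.lower() in conv_lower]
    missing = [e for e in expected if e.lower() not in conv_lower]
    if missing:
        return (False, f"Expected findings not found: {missing}. Found: {found}")
    return (True, f"All expected findings found: {found}")


if __name__ == "__main__":
    # Invented sample response, not real skill output.
    sample = "BLOCKER: sequential shutdown via shutdown -h is unsafe on two nodes."
    print(match_expected(sample, ["sequential shutdown", "shutdown -h"]))
    print(match_expected(sample, ["pcs node standby"]))
```

Substring matching is cheap and deterministic, but it cannot tell a recommendation from a warning that quotes the same phrase, which is why the config pairs these checks with an LLM judge.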
diff --git a/plugins/two-node/evals/cluster-diagnostic.yaml b/plugins/two-node/evals/cluster-diagnostic.yaml
new file mode 100644
index 00000000..02ef9263
--- /dev/null
+++ b/plugins/two-node/evals/cluster-diagnostic.yaml
@@ -0,0 +1,267 @@
+name: cluster-diagnostic-eval
+description: Evaluate the cluster-diagnostic skill across validate, recovery-guide, and game modes
+skill: two-node:cluster-diagnostic
+
+execution:
+  mode: case
+  arguments: "{command_input}"
+  timeout: 300
+  max_budget_usd: 8.0
+
+runner:
+  type: claude-code
+  plugin_dirs:
+    - plugins/two-node
+
+models:
+  judge: claude-opus-4-6
+  hook: claude-sonnet-4-6
+
+permissions:
+  allow: []
+  deny: []
+
+mlflow:
+  experiment: cluster-diagnostic-eval
+
+dataset:
+  path: plugins/two-node/evals/cluster-diagnostic/cases
+  schema: |
+    Each case directory contains:
+    - input.yaml: YAML file with:
+      - 'command_input' (string): The full argument string passed to the skill.
+        For validate mode: 'validate <procedure text>'
+        For recovery-guide mode: 'recovery-guide <scenario-name>'
+        For game mode: 'game' (AskUserQuestion is intercepted; replies follow answers.yaml)
+      - 'mode' (string): One of 'validate', 'recovery-guide', 'game'
+        Used by annotation-aware judges to apply mode-specific checks.
+    - annotations.yaml: Expected outcomes for the test case:
+      - 'mode' (string): validate | recovery-guide | game
+      - 'expected_blockers' (list): BLOCKER findings expected (validate mode)
+      - 'expected_warnings' (list): WARNING findings expected (validate mode)
+      - 'expected_scenario' (string): scenario name (recovery-guide mode)
+      - 'should_reject' (bool): whether the procedure should be rejected (validate mode)
+
+    Note: diagnose mode is excluded from eval because it requires live SSH
+    access to cluster nodes.
+
+inputs:
+  tools:
+    - match: Questions asked to the user via AskUserQuestion (game mode)
+      prompt: |
+        Answer based on the test case context in input.yaml and answers.yaml.
+        For game mode selection, pick 'quiz' unless answers.yaml specifies otherwise.
+        For quiz/scenario/rapid-fire answers, use answers.yaml guidance.
+        Default: pick the first option.
+
+outputs:
+  - path: output
+    schema: |
+      This skill produces conversation output only — no files are written to disk.
+      Judges should use {{ conversation }} to evaluate the assistant's response text.
+
+      For validate mode: expect a findings list with BLOCKER/WARNING/INFO severity
+      classifications, each referencing a failure mode from the knowledge base.
+
+      For recovery-guide mode: expect step-by-step markdown with bash commands
+      using parameter templates ($BMC_USER, $BMC_PASS, etc.) and verification steps.
+
+      For game mode: expect interactive questions, scoring, and a final rating
+      (Novice/Operator/Expert/TNF Master).
+
+traces:
+  stdout: true
+  stderr: true
+  events: true
+  metrics: true
+
+judges:
+  - name: budget_check
+    builtin: cost_budget
+    arguments:
+      max_cost_usd: 8.0
+
+  - name: severity_classification
+    description: |
+      For validate mode: checks that expected BLOCKER findings are present
+      and procedures with blockers are rejected. Sequential shutdown and
+      pcs node standby must be BLOCKER.
+    if: "annotations.get('mode') == 'validate'"
+    check: |
+      conversation = outputs.get("conversation", "")
+      ann = outputs.get("annotations", {})
+      expected_blockers = ann.get("expected_blockers", [])
+      should_reject = ann.get("should_reject", False)
+
+      if not conversation:
+          return (False, "No conversation output found")
+
+      conv_upper = conversation.upper()
+      has_blocker = "BLOCKER" in conv_upper
+
+      if should_reject and not has_blocker:
+          return (False, "Procedure should have been rejected with BLOCKER but no BLOCKER found")
+
+      if not should_reject and has_blocker:
+          return (False, "Procedure should NOT have BLOCKER findings but BLOCKER was found")
+
+      found_blockers = []
+      for b in expected_blockers:
+          if b.lower() in conversation.lower():
+              found_blockers.append(b)
+
+      if expected_blockers and not found_blockers:
+          return (False, f"Expected blockers {expected_blockers} not found in output")
+
+      return (True, f"Severity classification correct. Blockers found: {found_blockers}")
+
+  - name: warning_classification
+    description: |
+      For validate mode: checks that expected WARNING findings are present
+      in the output. Verifies the skill identifies non-blocking issues.
+    if: "annotations.get('mode') == 'validate'"
+    check: |
+      conversation = outputs.get("conversation", "")
+      ann = outputs.get("annotations", {})
+      expected_warnings = ann.get("expected_warnings", [])
+
+      if not conversation:
+          return (False, "No conversation output found")
+
+      if not expected_warnings:
+          return (True, "No warnings expected for this case")
+
+      conv_lower = conversation.lower()
+      found = [w for w in expected_warnings if w.lower() in conv_lower]
+      missing = [w for w in expected_warnings if w.lower() not in conv_lower]
+
+      if missing:
+          return (False, f"Expected warnings not found: {missing}. Found: {found}")
+      return (True, f"All expected warnings found: {found}")
+
+  - name: procedure_completeness
+    description: |
+      For recovery-guide mode: checks that the returned procedure includes
+      bash commands, verification steps, and parameter templates.
+    if: "annotations.get('mode') == 'recovery-guide'"
+    check: |
+      conversation = outputs.get("conversation", "")
+
+      if not conversation:
+          return (False, "No conversation output found")
+
+      checks = {
+          "bash_commands": any(marker in conversation for marker in ["```bash", "```sh", "curl ", "pcs ", "oc "]),
+          "has_verification": any(w in conversation.lower() for w in ["verify", "confirm", "check", "poll", "wait"]),
+          "has_parameters": any(p in conversation for p in ["$BMC", "$NODE", "BMC_USER", "BMC_PASS", "BMC_HOST"]),
+      }
+
+      passed = sum(checks.values())
+      total = len(checks)
+      failed = [k for k, v in checks.items() if not v]
+
+      if passed == total:
+          return (True, f"All {total} completeness checks passed")
+      else:
+          return (False, f"{passed}/{total} checks passed. Missing: {failed}")
+
+  - name: forbidden_recommendations
+    description: |
+      Checks that the skill never recommends procedures that violate known
+      failure modes. Matches pcs node standby and shutdown -h; sequential shutdown is not matched.
+    check: |
+      conversation = outputs.get("conversation", "")
+      ann = outputs.get("annotations", {})
+      mode = ann.get("mode", "")
+
+      if not conversation:
+          return (False, "No conversation output found")
+
+      # Only check recommendations sections, not quoted failure descriptions
+      forbidden = []
+
+      # Check if skill RECOMMENDS (not just mentions) dangerous procedures
+      conv_lower = conversation.lower()
+
+      recommend_sections = []
+      for marker in ["recommend", "suggested", "recovery", "procedure", "steps to"]:
+          idx = conv_lower.find(marker)
+          if idx >= 0:
+              recommend_sections.append(conversation[idx:idx+500])
+
+      for section in recommend_sections:
+          sec_lower = section.lower()
+          if "pcs node standby" in sec_lower and "never" not in sec_lower and "do not" not in sec_lower:
+              forbidden.append("pcs node standby recommended")
+          if "shutdown -h" in sec_lower and "never" not in sec_lower and "do not" not in sec_lower:
+              forbidden.append("shutdown -h recommended")
+
+      if forbidden:
+          return (False, f"Forbidden recommendations found: {forbidden}")
+      return (True, "No forbidden procedures recommended")
+
+  - name: game_mode_scoring
+    description: |
+      For game mode: checks that the skill produces a score and a
+      final rating (Novice/Operator/Expert/TNF Master).
+    if: "annotations.get('mode') == 'game'"
+    check: |
+      conversation = outputs.get("conversation", "")
+
+      if not conversation:
+          return (False, "No conversation output found")
+
+      conv_lower = conversation.lower()
+      ratings = ["novice", "operator", "expert", "tnf master"]
+      found_rating = [r for r in ratings if r in conv_lower]
+      import re
+      has_score = "score" in conv_lower or "points" in conv_lower or bool(re.search(r"\d+\s*/\s*\d+", conv_lower))
+
+      if not found_rating:
+          return (False, "No rating (Novice/Operator/Expert/TNF Master) found")
+      if not has_score:
+          return (False, "No score or points found in output")
+      return (True, f"Game completed with rating: {found_rating[0]}")
+
+  - name: knowledge_base_accuracy
+    description: |
+      LLM judge that evaluates whether the skill's response accurately
+      reflects the TNF knowledge base content — correct failure modes,
+      proper severity classification reasoning, and accurate recovery procedures.
+    prompt: |
+      Evaluate whether this cluster-diagnostic skill response is accurate
+      and complete for the given mode.
+
+      ## Skill Response
+      {{ conversation }}
+
+      ## Test Case Annotations
+      {{ annotations }}
+
+      ## Scoring Criteria
+
+      Score 1-5:
+      - 5: Response is fully accurate, references correct failure modes,
+           severity is properly justified, procedures match tested bare metal results
+      - 4: Minor omissions but no inaccuracies, severity is correct
+      - 3: Mostly accurate but missing important details or has minor inaccuracy
+      - 2: Contains inaccurate claims about TNF behavior or recommends untested procedures
+      - 1: Fundamentally incorrect — wrong failure modes, wrong severity, dangerous recommendations
+
+      Return a JSON object: {"score": <1-5>, "rationale": "<explanation>"}
+
+thresholds:
+  budget_check:
+    min_pass_rate: 1.0
+  severity_classification:
+    min_pass_rate: 0.8
+  warning_classification:
+    min_pass_rate: 0.8
+  procedure_completeness:
+    min_pass_rate: 0.8
+  forbidden_recommendations:
+    min_pass_rate: 1.0
+  game_mode_scoring:
+    min_pass_rate: 1.0
+  knowledge_base_accuracy:
+    min_mean: 3.5
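The `thresholds` block pairs each judge with either `min_pass_rate` or `min_mean`. The following is a minimal sketch of how such thresholds could be evaluated. It assumes that pass rate means passed cases divided by the cases where the judge ran, and that `min_mean` applies to the arithmetic mean of 1-5 scores. The harness's real aggregation is not shown in this diff and may differ, and `meets_thresholds` is a hypothetical name.

```python
from statistics import mean


def meets_thresholds(results, thresholds):
    """Map each judge to True/False depending on whether its threshold is met.

    results: judge name -> list of per-case outcomes (bool for checks, number for LLM scores).
    thresholds: judge name -> {"min_pass_rate": float} or {"min_mean": float}.
    """
    report = {}
    for judge, rule in thresholds.items():
        values = results.get(judge, [])
        if not values:
            # Assumption: a judge with no recorded outcomes counts as not met.
            report[judge] = False
        elif "min_pass_rate" in rule:
            rate = sum(1 for v in values if v) / len(values)
            report[judge] = rate >= rule["min_pass_rate"]
        elif "min_mean" in rule:
            report[judge] = mean(values) >= rule["min_mean"]
    return report


if __name__ == "__main__":
    # Invented outcomes for illustration only.
    outcomes = {
        "severity_classification": [True, True, True, False, True],
        "knowledge_base_accuracy": [4, 3, 3],
    }
    limits = {
        "severity_classification": {"min_pass_rate": 0.8},
        "knowledge_base_accuracy": {"min_mean": 3.5},
    }
    print(meets_thresholds(outcomes, limits))
```

Under these assumptions, a `min_pass_rate` of 1.0, as used for `forbidden_recommendations`, fails the run on a single failing case, while 0.8 tolerates one failure in five.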
diff --git a/plugins/two-node/evals/cluster-diagnostic/cases/case-001-validate-sequential-shutdown/annotations.yaml b/plugins/two-node/evals/cluster-diagnostic/cases/case-001-validate-sequential-shutdown/annotations.yaml
new file mode 100644
index 00000000..217fedab
--- /dev/null
+++ b/plugins/two-node/evals/cluster-diagnostic/cases/case-001-validate-sequential-shutdown/annotations.yaml
@@ -0,0 +1,9 @@
+mode: validate
+expected_blockers:
+  - sequential shutdown
+  - shutdown -h
+expected_warnings:
+  - cordon
+  - drain
+expected_scenario: null
+should_reject: true
diff --git a/plugins/two-node/evals/cluster-diagnostic/cases/case-001-validate-sequential-shutdown/input.yaml b/plugins/two-node/evals/cluster-diagnostic/cases/case-001-validate-sequential-shutdown/input.yaml
new file mode 100644
index 00000000..4186c5e7
--- /dev/null
+++ b/plugins/two-node/evals/cluster-diagnostic/cases/case-001-validate-sequential-shutdown/input.yaml
@@ -0,0 +1,4 @@
+command_input: >-
+  validate "cordon all nodes, drain workloads, then shut down each node
+  one at a time using shutdown -h 1 via oc debug"
+mode: validate
diff --git a/plugins/two-node/evals/cluster-diagnostic/cases/case-002-validate-safe-redfish/annotations.yaml b/plugins/two-node/evals/cluster-diagnostic/cases/case-002-validate-safe-redfish/annotations.yaml
new file mode 100644
index 00000000..c762f6b6
--- /dev/null
+++ b/plugins/two-node/evals/cluster-diagnostic/cases/case-002-validate-safe-redfish/annotations.yaml
@@ -0,0 +1,5 @@
+mode: validate
+expected_blockers: []
+expected_warnings: []
+expected_scenario: null
+should_reject: false
diff --git a/plugins/two-node/evals/cluster-diagnostic/cases/case-002-validate-safe-redfish/input.yaml b/plugins/two-node/evals/cluster-diagnostic/cases/case-002-validate-safe-redfish/input.yaml
new file mode 100644
index 00000000..13b06b84
--- /dev/null
+++ b/plugins/two-node/evals/cluster-diagnostic/cases/case-002-validate-safe-redfish/input.yaml
@@ -0,0 +1,5 @@
+command_input: >-
+  validate "Send Redfish GracefulShutdown to both nodes simultaneously
+  using curl, poll PowerState until Off, then send On to both nodes
+  to restart"
+mode: validate
diff --git a/plugins/two-node/evals/cluster-diagnostic/cases/case-003-recovery-full-shutdown/annotations.yaml b/plugins/two-node/evals/cluster-diagnostic/cases/case-003-recovery-full-shutdown/annotations.yaml
new file mode 100644
index 00000000..4b741409
--- /dev/null
+++ b/plugins/two-node/evals/cluster-diagnostic/cases/case-003-recovery-full-shutdown/annotations.yaml
@@ -0,0 +1,5 @@
+mode: recovery-guide
+expected_blockers: []
+expected_warnings: []
+expected_scenario: full-shutdown
+should_reject: false
diff --git a/plugins/two-node/evals/cluster-diagnostic/cases/case-003-recovery-full-shutdown/input.yaml b/plugins/two-node/evals/cluster-diagnostic/cases/case-003-recovery-full-shutdown/input.yaml
new file mode 100644
index 00000000..58326c61
--- /dev/null
+++ b/plugins/two-node/evals/cluster-diagnostic/cases/case-003-recovery-full-shutdown/input.yaml
@@ -0,0 +1,2 @@
+command_input: recovery-guide full-shutdown
+mode: recovery-guide
diff --git a/plugins/two-node/evals/cluster-diagnostic/cases/case-004-recovery-standby/annotations.yaml b/plugins/two-node/evals/cluster-diagnostic/cases/case-004-recovery-standby/annotations.yaml
new file mode 100644
index 00000000..66d323bb
--- /dev/null
+++ b/plugins/two-node/evals/cluster-diagnostic/cases/case-004-recovery-standby/annotations.yaml
@@ -0,0 +1,5 @@
+mode: recovery-guide
+expected_blockers: []
+expected_warnings: []
+expected_scenario: standby
+should_reject: false
diff --git a/plugins/two-node/evals/cluster-diagnostic/cases/case-004-recovery-standby/input.yaml b/plugins/two-node/evals/cluster-diagnostic/cases/case-004-recovery-standby/input.yaml
new file mode 100644
index 00000000..dda90015
--- /dev/null
+++ b/plugins/two-node/evals/cluster-diagnostic/cases/case-004-recovery-standby/input.yaml
@@ -0,0 +1,2 @@
+command_input: recovery-guide standby
+mode: recovery-guide
diff --git a/plugins/two-node/evals/cluster-diagnostic/cases/case-005-validate-pcs-standby/annotations.yaml b/plugins/two-node/evals/cluster-diagnostic/cases/case-005-validate-pcs-standby/annotations.yaml
new file mode 100644
index 00000000..f0ee61ee
--- /dev/null
+++ b/plugins/two-node/evals/cluster-diagnostic/cases/case-005-validate-pcs-standby/annotations.yaml
@@ -0,0 +1,6 @@
+mode: validate
+expected_blockers:
+  - pcs node standby
+expected_warnings: []
+expected_scenario: null
+should_reject: true
diff --git a/plugins/two-node/evals/cluster-diagnostic/cases/case-005-validate-pcs-standby/input.yaml b/plugins/two-node/evals/cluster-diagnostic/cases/case-005-validate-pcs-standby/input.yaml
new file mode 100644
index 00000000..9ef90652
--- /dev/null
+++ b/plugins/two-node/evals/cluster-diagnostic/cases/case-005-validate-pcs-standby/input.yaml
@@ -0,0 +1,4 @@
+command_input: >-
+  validate "Put both nodes in standby using pcs node standby --all,
+  wait for resources to stop, then power off the servers"
+mode: validate
diff --git a/plugins/two-node/evals/cluster-diagnostic/cases/case-006-game-quiz/annotations.yaml b/plugins/two-node/evals/cluster-diagnostic/cases/case-006-game-quiz/annotations.yaml
new file mode 100644
index 00000000..3b953d22
--- /dev/null
+++ b/plugins/two-node/evals/cluster-diagnostic/cases/case-006-game-quiz/annotations.yaml
@@ -0,0 +1,5 @@
+mode: game
+expected_blockers: []
+expected_warnings: []
+expected_scenario: null
+should_reject: false
diff --git a/plugins/two-node/evals/cluster-diagnostic/cases/case-006-game-quiz/answers.yaml b/plugins/two-node/evals/cluster-diagnostic/cases/case-006-game-quiz/answers.yaml
new file mode 100644
index 00000000..44678d18
--- /dev/null
+++ b/plugins/two-node/evals/cluster-diagnostic/cases/case-006-game-quiz/answers.yaml
@@ -0,0 +1,6 @@
+game_mode: quiz
+answer_correctly: true
+difficulty_guidance: >
+  Answer TNF knowledge questions accurately based on the
+  cluster-knowledge-base content. Pick the most correct option
+  for each question.
diff --git a/plugins/two-node/evals/cluster-diagnostic/cases/case-006-game-quiz/input.yaml b/plugins/two-node/evals/cluster-diagnostic/cases/case-006-game-quiz/input.yaml
new file mode 100644
index 00000000..d7e8057b
--- /dev/null
+++ b/plugins/two-node/evals/cluster-diagnostic/cases/case-006-game-quiz/input.yaml
@@ -0,0 +1,2 @@
+command_input: "game"
+mode: game
diff --git a/plugins/two-node/evals/threat-model-tnf.md b/plugins/two-node/evals/threat-model-tnf.md
new file mode 100644
index 00000000..39a7df7b
--- /dev/null
+++ b/plugins/two-node/evals/threat-model-tnf.md
@@ -0,0 +1,103 @@
+---
+# Auto-generated by /eval-analyze — edit to override
+skill: threat-model:tnf
+analyzed_at: 2026-06-05T00:00:00Z
+skill_hash: ca8e410b0d9b
+
+# Discovered skill capabilities
+execution_mode: case
+headless: true
+dry_run: false
+
+# Suggested judges (summary from analysis)
+suggested_judges:
+  - name: budget_check
+    type: builtin
+    description: "Cost stays under $8 per invocation"
+  - name: report_exists
+    type: check
+    description: "PR<N>-THREAT-MODEL-<repo>.md file was generated"
+  - name: report_sections_complete
+    type: check
+    description: "All 9 required sections present in report"
+  - name: dfd_elements_mapped
+    type: check
+    description: "DFD element IDs (P/DS/DF/EE/TB) referenced in report"
+  - name: stride_matrix_present
+    type: check
+    description: "Per-element STRIDE matrix has X/~/-/N/A markers"
+  - name: mitre_techniques_assigned
+    type: check
+    description: "MITRE ATT&CK technique IDs (T####) are present"
+  - name: threat_analysis_quality
+    type: llm
+    description: "Overall quality: severity accuracy, DFD mapping, STRIDE completeness, recommendations"
+  - name: findings_tracker_updated
+    type: check
+    description: "Cumulative findings tracker was appended"
+---
+
+# Skill Analysis
+
+The `threat-model:tnf` skill performs security threat analysis on GitHub PRs affecting the TNF (Two-Node Fencing) OpenShift topology. It combines three approaches:
+
+1. **Automated scanning** — runs ShellCheck on shell scripts in the PR diff
+2. **Pattern detection** — searches for command injection, credential exposure, privilege escalation, and 7 other security pattern categories
+3. **Formal threat modeling** — maps code changes to TNF DFD elements (8 processes, 5 data stores, 12 data flows, 3 external entities, 6 trust boundaries), applies per-element STRIDE analysis, and cross-references against the formal TNF threat model
+
+Output is a formal threat model report with MITRE ATT&CK technique mappings, OWASP Top 10:2025 categorization, risk assessment, and actionable recommendations for developers and customers.
+
+## Inputs
+
+Each test case provides a single PR identifier via `input.yaml`:
+
+- **`pr_input`** — the PR to analyze, in one of three formats:
+  - PR number only: `2156` (repo detected from working directory)
+  - GitHub URL: `https://github.com/ClusterLabs/resource-agents/pull/2156`
+  - Repo + number: `resource-agents 2156`
+
+The PR must be a real, accessible GitHub PR. The skill uses `gh pr view` and `gh pr diff` to fetch PR data.
+
+Optional fields: `repo` (repository name), `org` (GitHub organization).
+
+## Outputs
+
+The skill writes to a `reports/` directory (resolved via workspace discovery):
+
+- **`PR<number>-THREAT-MODEL-<repo>.md`** — main threat model report (~200-500 lines)
+- **`VULN-PR<number>-<desc>.md`** — individual vulnerability tickets (Critical/High only, optional)
+
+It also appends to a cumulative findings tracker at `$WORKSPACE/.claude/skills/threat-model/mitre-findings-tnf.md`.
+
+## Pipeline Flow
+
+1. **Workspace discovery** — walk up from CWD looking for `repos/` directory; set WORKSPACE, REPOS, THREAT_MODEL_DIR, REPORT_DIR, FINDINGS_FILE
+2. **Parse input** — extract org, repo, PR number from the three input formats
+3. **Fetch PR** — `gh pr view` for metadata, `gh pr diff` for the full diff
+4. **ShellCheck** — run on any .sh files; map security codes (SC2086→T1059) to MITRE
+5. **Pattern analysis** — search diff for 10 security pattern categories
+6. **DFD mapping** — match code paths to TNF elements using the mapping table in `dfd-elements-tnf.md`
+7. **STRIDE analysis** — per-element threat assessment; cross-reference against TNF-THREAT-MODEL.md if available
+8. **Combine findings** — deduplicate, assign VULN-N IDs, determine severity
+9. **MITRE/OWASP mapping** — assign technique IDs and OWASP categories using reference files
+10. **Generate report** — write markdown report using report-templates.md format
+11. **Append tracker** — add findings block to cumulative mitre-findings-tnf.md
+
+## Quality Criteria
+
+A **good** report:
+
+- Correctly identifies all affected DFD elements from the code paths in the PR
+- Applies STRIDE systematically to each element (all 6 categories for processes, T/I/D for stores and flows)
+- Assigns accurate severity levels matching MITRE/OWASP standards
+- Identifies trust boundary crossings (especially TB3→TB4, TB4→TB5)
+- Provides specific, actionable recommendations with code-level guidance
+- Maps findings to correct MITRE techniques (T1059 for injection, T1552 for credentials, T1611 for container escape)
+
+A **bad** report:
+
+- Misses affected DFD elements or assigns wrong elements to code paths
+- Has incomplete STRIDE matrix (missing categories or missing rationale)
+- Over/under-rates severity (e.g., calling a minor code quality issue "Critical")
+- Provides vague recommendations ("improve security") without specific guidance
+- Omits sections or uses an incorrect report structure
diff --git a/plugins/two-node/evals/threat-model-tnf.yaml b/plugins/two-node/evals/threat-model-tnf.yaml
new file mode 100644
index 00000000..ae628262
--- /dev/null
+++ b/plugins/two-node/evals/threat-model-tnf.yaml
@@ -0,0 +1,247 @@
+name: threat-model-tnf-eval
+description: Evaluate the threat-model:tnf skill — PR security analysis with STRIDE/DFD, MITRE ATT&CK, and OWASP mapping for TNF topology
+skill: threat-model:tnf
+
+execution:
+  mode: case
+  arguments: "{pr_input}"
+  timeout: 600
+  max_budget_usd: 8.0
+
+runner:
+  type: claude-code
+  plugin_dirs:
+    - plugins/threat-model
+
+models:
+  skill: claude-opus-4-6
+  judge: claude-opus-4-6
+
+permissions:
+  allow: []
+  deny: []
+
+mlflow:
+  experiment: threat-model-tnf-eval
+
+dataset:
+  path: plugins/two-node/evals/threat-model-tnf/cases
+  schema: |
+    Each case directory contains:
+    - input.yaml: YAML file with fields:
+      - 'pr_input': the PR identifier to analyze — one of:
+        - A PR number (e.g., '2156')
+        - A GitHub URL (e.g., 'https://github.com/ClusterLabs/resource-agents/pull/2156')
+        - A 'repo number' pair (e.g., 'resource-agents 2156')
+        [EXTERNAL: GitHub] — must be a real, accessible PR on GitHub
+      - 'repo' (optional): repository name for context (e.g., 'resource-agents')
+      - 'org' (optional): GitHub org (e.g., 'ClusterLabs', 'openshift')
+    - reference.md (optional): gold-standard threat model report for comparison.
+      Uses the report template format: Executive Summary, DFD Impact Analysis,
+      Per-Element STRIDE matrix, Threat Analysis (VULN-N sections), MITRE/OWASP
+      mapping, Risk Assessment, and Recommendations.
+    - annotations.yaml (optional): expected metadata for outcome-aware scoring:
+      - 'expected_vuln_count': expected number of findings
+      - 'expected_severities': list of expected severity levels
+      - 'affected_dfd_elements': list of expected DFD element IDs (e.g., ['P5', 'P7', 'DS3'])
+      - 'expected_mitre_techniques': list of expected MITRE technique IDs
+      - 'has_shell_scripts': whether the PR contains shell scripts for ShellCheck
+      - 'has_trust_boundary_crossing': whether the PR crosses trust boundaries
+
+outputs:
+  - path: reports
+    schema: |
+      The skill writes threat model reports as markdown files:
+      - PR<number>-THREAT-MODEL-<repo>.md: main report with sections:
+        - Document header (version, date, classification, repo, topology, author, URL)
+        - Executive Summary with findings count table (Critical/High/Medium/Low)
+        - Change Overview describing the PR and security-relevant changes
+        - Affected Files table (file path, line changes, security relevance)
+        - DFD Impact Analysis:
+          - Affected DFD Elements table (Element ID, Name, Impact, Trust Boundary)
+          - Trust Boundary Crossings narrative
+          - Per-Element STRIDE matrix (S/T/R/I/D/E per element, X/~/-/N/A)
+          - Threat Model Cross-Reference table (PE-* IDs if formal model exists)
+        - Automated Scanner Results (ShellCheck table or "skipped" note)
+        - Threat Analysis: per-VULN section with Severity, OWASP, MITRE, CWE,
+          Affected Code, Description, Attack Vector, Impact (CIA), Recommended Fix
+        - OWASP & MITRE ATT&CK Mapping table
+        - Risk Assessment table (Likelihood, Impact, Risk)
+        - Recommendations (Developers: Before/After Merge; Customers: Config/Ops)
+        - References
+      - VULN-PR<number>-<desc>.md (optional): individual vulnerability tickets
+        for Critical/High findings only
+
+traces:
+  stdout: true
+  stderr: true
+  events: true
+  metrics: true
+
+judges:
+  - name: budget_check
+    builtin: cost_budget
+    arguments:
+      max_cost_usd: 8.0
+
+  - name: report_exists
+    description: Verify that the main threat model report markdown file was generated
+    check: |
+      files = outputs.get("files", {})
+      reports = [k for k in files if "THREAT" in k.upper() and k.endswith(".md")]
+      if not reports:
+          return (False, "No threat model report file found")
+      return (True, f"Report generated: {reports[0]}")
+
+  - name: report_sections_complete
+    description: Verify all required report sections are present in the generated report
+    check: |
+      files = outputs.get("files", {})
+      reports = {k: v for k, v in files.items() if "THREAT" in k.upper() and k.endswith(".md")}
+      if not reports:
+          return (False, "No report file found")
+      content = list(reports.values())[0]
+      required = [
+          "Executive Summary",
+          "Change Overview",
+          "Affected Files",
+          "DFD Impact Analysis",
+          "STRIDE",
+          "Threat Analysis",
+          "MITRE",
+          "Risk Assessment",
+          "Recommendations",
+      ]
+      missing = [s for s in required if s not in content]
+      if missing:
+          return (False, f"Missing sections: {', '.join(missing)}")
+      return (True, f"All {len(required)} required sections present")
+
+  - name: dfd_elements_mapped
+    description: Verify that DFD elements (P1-P8, DS1-DS5, DF1-DF12, EE1-EE3) or trust boundaries (TB1-TB6) are referenced in the report
+    check: |
+      import re
+      files = outputs.get("files", {})
+      reports = {k: v for k, v in files.items() if "THREAT" in k.upper() and k.endswith(".md")}
+      if not reports:
+          return (False, "No report file found")
+      content = list(reports.values())[0]
+      elements = re.findall(r'\b(P[1-8]|DS[1-5]|DF(?:1[0-2]|[1-9])|EE[1-3]|TB[1-6])\b', content)
+      unique = set(elements)
+      if not unique:
+          return (False, "No DFD elements found in report")
+      return (True, f"DFD elements referenced: {sorted(unique)}")
+
+  - name: stride_matrix_present
+    description: Verify the per-element STRIDE matrix is populated with X, ~, -, or N/A cell markers
+    check: |
+      import re
+      files = outputs.get("files", {})
+      reports = {k: v for k, v in files.items() if "THREAT" in k.upper() and k.endswith(".md")}
+      if not reports:
+          return (False, "No report file found")
+      content = list(reports.values())[0]
+      stride_section = content.split("Per-Element STRIDE")
+      if len(stride_section) < 2:
+          return (False, "No Per-Element STRIDE section found")
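+      # Count table cells whose whole content is a marker; \b never matches around "~" or "-"
+      markers = re.findall(r'(?<=\|)\s*(?:[Xx~-]|[Nn]/[Aa])\s*(?=\|)', stride_section[1][:2000])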
+      if len(markers) < 3:
+          return (False, f"STRIDE matrix appears empty or minimal ({len(markers)} markers)")
+      return (True, f"STRIDE matrix populated ({len(markers)} cell markers found)")
+
+  - name: mitre_techniques_assigned
+    description: Verify MITRE ATT&CK technique IDs (T####) are present and mapped to findings
+    check: |
+      import re
+      files = outputs.get("files", {})
+      reports = {k: v for k, v in files.items() if "THREAT" in k.upper() and k.endswith(".md")}
+      if not reports:
+          return (False, "No report file found")
+      content = list(reports.values())[0]
+      techniques = set(re.findall(r'\bT\d{4}\b', content))
+      if not techniques:
+          return (False, "No MITRE ATT&CK technique IDs found")
+      return (True, f"MITRE techniques: {sorted(techniques)}")
+
+  - name: threat_analysis_quality
+    description: |
+      LLM judge assessing overall threat analysis quality: severity accuracy,
+      DFD mapping correctness, STRIDE completeness, and recommendation actionability
+    prompt: |
+      You are evaluating a TNF (Two-Node Fencing) PR threat analysis report.
+
+      ## Report output:
+
+      {{ outputs }}
+
+      ## Skill conversation:
+
+      {{ conversation }}
+
+      ## Evaluation criteria
+
+      Score on a 1-5 scale across these dimensions, then give an overall score:
+
+      **1. Severity accuracy** — do assigned severities (Critical/High/Medium/Low) match the actual risk?
+      - Critical: RCE, credential exposure at high-trust boundary (P5/P6/P8), STONITH bypass
+      - High: command injection with exploitation path, new credential dependency, missing validation on network boundary
+      - Medium: fail-open behavior, non-critical info disclosure, potential race condition
+      - Low: minor code quality, non-exploitable pattern
+
+      **2. DFD mapping correctness** — are code changes correctly mapped to TNF DFD elements (P1-P8, DS1-DS5, DF1-DF12)?
+      - Code paths should match the element mapping table (e.g., cluster-etcd-operator/pkg/tnf/fencing/ → P5)
+      - Trust boundary crossings should be identified (TB2→TB3, TB3→TB4, TB4→TB5)
+
+      **3. STRIDE completeness** — is each affected element analyzed across all applicable STRIDE categories?
+      - Processes: all 6 (S,T,R,I,D,E)
+      - Data Stores: T,I,D
+      - Data Flows: T,I,D
+      - External Entities: S,R
+
+      **4. MITRE/OWASP accuracy** — are technique assignments correct?
+      - T1059 for command injection, T1552 for credential exposure, T1611 for container escape
+      - OWASP categories should match (A05 for injection, A07 for auth failures)
+
+      **5. Recommendation quality** — are recommendations specific and actionable?
+      - Developer recommendations should include code-level guidance
+      - Customer recommendations should include hardening or monitoring steps
+      - Vague recommendations ("improve security") score low
+
+      ## Scoring
+      Score 1: Report is missing major sections, contains incorrect mappings, or has no useful findings
+      Score 2: Report exists but has significant gaps — missing STRIDE analysis, wrong DFD elements, or vague recommendations
+      Score 3: Adequate report covering basics — correct elements identified, some STRIDE analysis, generic recommendations
+      Score 4: Good report — accurate DFD mapping, thorough STRIDE, relevant MITRE techniques, specific recommendations
+      Score 5: Excellent — comprehensive coverage, all trust boundaries analyzed, accurate severity, actionable recommendations with code examples
+
+      Respond with a single integer score (1-5) on the first line, then explain your reasoning.
+
+  - name: findings_tracker_updated
+    description: Verify the findings tracker was updated (passes if a mitre-findings file is present, otherwise checks the conversation for an append confirmation)
+    check: |
+      conv = outputs.get("conversation", "")
+      files = outputs.get("files", {})
+      tracker_files = [k for k in files if "mitre-findings" in k.lower()]
+      if tracker_files:
+          return (True, f"Findings tracker file found: {tracker_files[0]}")
+      if "findings" in conv.lower() and ("append" in conv.lower() or "tracker" in conv.lower()):
+          return (True, "Findings tracker update mentioned in conversation")
+      return (False, "No evidence of findings tracker update")
+
+thresholds:
+  budget_check:
+    min_pass_rate: 1.0
+  report_exists:
+    min_pass_rate: 1.0
+  report_sections_complete:
+    min_pass_rate: 1.0
+  dfd_elements_mapped:
+    min_pass_rate: 1.0
+  stride_matrix_present:
+    min_pass_rate: 0.8
+  mitre_techniques_assigned:
+    min_pass_rate: 1.0
+  threat_analysis_quality:
+    min_mean: 3.5
+  findings_tracker_updated:
+    min_pass_rate: 0.8
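
Reviewer note on `stride_matrix_present`: the check counts cell markers in the Per-Element STRIDE table. The following is a minimal sketch of cell-anchored counting, assuming the matrix is a pipe-delimited markdown table. The helper name `count_stride_markers` and the sample table are hypothetical and are not part of the eval harness.

```python
import re

# Hypothetical helper: a cell counts only if its entire content is one of
# the markers X, ~, - or N/A. Lookarounds keep the pipes out of the match
# so adjacent cells can each be matched.
CELL_MARKER = re.compile(r"(?<=\|)\s*(?:[Xx~-]|[Nn]/[Aa])\s*(?=\|)")


def count_stride_markers(section: str) -> int:
    """Return the number of table cells that hold a STRIDE marker."""
    return len(CELL_MARKER.findall(section))


# Illustrative table only; real reports may lay the matrix out differently.
sample = """
| Element | S | T | R | I | D | E |
|---------|---|---|---|---|---|---|
| P5      | X | X | ~ | X | - | X |
| DS3     | N/A | X | N/A | X | ~ | N/A |
"""

print(count_stride_markers(sample))
```

Anchoring on the cell delimiters means header letters, separator rows, and stray hyphens in prose are not counted, and `N/A` counts as one marker rather than three.
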
diff --git a/plugins/two-node/evals/threat-model-tnf/cases/case-001-shell-script-k8s-api/annotations.yaml b/plugins/two-node/evals/threat-model-tnf/cases/case-001-shell-script-k8s-api/annotations.yaml
new file mode 100644
index 00000000..751425b7
--- /dev/null
+++ b/plugins/two-node/evals/threat-model-tnf/cases/case-001-shell-script-k8s-api/annotations.yaml
@@ -0,0 +1,18 @@
+description: >
+  Shell script PR that adds kubeconfig-based K8s API access to the podman-etcd
+  OCF agent. Introduces new trust boundary crossing (TB4→TB2) and credential
+  dependency. Should trigger ShellCheck analysis and identify credential exposure.
+has_shell_scripts: true
+has_trust_boundary_crossing: true
+expected_severities:
+  - High
+  - Medium
+  - Low
+affected_dfd_elements:
+  - P7
+  - DS5
+  - DF11
+expected_mitre_techniques:
+  - T1552
+  - T1078
+  - T1005
diff --git a/plugins/two-node/evals/threat-model-tnf/cases/case-001-shell-script-k8s-api/input.yaml b/plugins/two-node/evals/threat-model-tnf/cases/case-001-shell-script-k8s-api/input.yaml
new file mode 100644
index 00000000..37361080
--- /dev/null
+++ b/plugins/two-node/evals/threat-model-tnf/cases/case-001-shell-script-k8s-api/input.yaml
@@ -0,0 +1,3 @@
+pr_input: "https://github.com/ClusterLabs/resource-agents/pull/2156"
+repo: resource-agents
+org: ClusterLabs
diff --git a/plugins/two-node/evals/threat-model-tnf/cases/case-002-credential-rotation-script/annotations.yaml b/plugins/two-node/evals/threat-model-tnf/cases/case-002-credential-rotation-script/annotations.yaml
new file mode 100644
index 00000000..7d6e2436
--- /dev/null
+++ b/plugins/two-node/evals/threat-model-tnf/cases/case-002-credential-rotation-script/annotations.yaml
@@ -0,0 +1,21 @@
+description: >
+  Adds a TNF fencing credentials rotation script. This touches the full credential
+  flow path (DS2→P5→DS3) and should identify high-severity findings around
+  credential handling, STONITH configuration, and BMC access. Complex case with
+  multiple DFD elements affected.
+has_shell_scripts: true
+has_trust_boundary_crossing: true
+expected_severities:
+  - Critical
+  - High
+  - Medium
+affected_dfd_elements:
+  - P5
+  - DS2
+  - DS3
+  - DF4
+  - DF7
+expected_mitre_techniques:
+  - T1552
+  - T1059
+  - T1529
diff --git a/plugins/two-node/evals/threat-model-tnf/cases/case-002-credential-rotation-script/input.yaml b/plugins/two-node/evals/threat-model-tnf/cases/case-002-credential-rotation-script/input.yaml
new file mode 100644
index 00000000..cd609686
--- /dev/null
+++ b/plugins/two-node/evals/threat-model-tnf/cases/case-002-credential-rotation-script/input.yaml
@@ -0,0 +1,3 @@
+pr_input: "cluster-etcd-operator 1611"
+repo: cluster-etcd-operator
+org: openshift
diff --git a/plugins/two-node/evals/threat-model-tnf/cases/case-003-mac-fencing-lookup/annotations.yaml b/plugins/two-node/evals/threat-model-tnf/cases/case-003-mac-fencing-lookup/annotations.yaml
new file mode 100644
index 00000000..b62f1b9c
--- /dev/null
+++ b/plugins/two-node/evals/threat-model-tnf/cases/case-003-mac-fencing-lookup/annotations.yaml
@@ -0,0 +1,16 @@
+description: >
+  Adds MAC-address based fencing credentials lookup. Introduces a new data flow
+  for credential resolution and modifies the fencing job's credential discovery
+  path. Tests DFD mapping for P5 and the credential pipeline.
+has_shell_scripts: false
+has_trust_boundary_crossing: true
+expected_severities:
+  - High
+  - Medium
+affected_dfd_elements:
+  - P5
+  - DS2
+  - DF4
+expected_mitre_techniques:
+  - T1552
+  - T1078
diff --git a/plugins/two-node/evals/threat-model-tnf/cases/case-003-mac-fencing-lookup/input.yaml b/plugins/two-node/evals/threat-model-tnf/cases/case-003-mac-fencing-lookup/input.yaml
new file mode 100644
index 00000000..34abba7b
--- /dev/null
+++ b/plugins/two-node/evals/threat-model-tnf/cases/case-003-mac-fencing-lookup/input.yaml
@@ -0,0 +1,3 @@
+pr_input: "https://github.com/openshift/cluster-etcd-operator/pull/1600"
+repo: cluster-etcd-operator
+org: openshift
diff --git a/plugins/two-node/evals/threat-model-tnf/cases/case-004-trivial-indentation-fix/annotations.yaml b/plugins/two-node/evals/threat-model-tnf/cases/case-004-trivial-indentation-fix/annotations.yaml
new file mode 100644
index 00000000..7a2fc3ac
--- /dev/null
+++ b/plugins/two-node/evals/threat-model-tnf/cases/case-004-trivial-indentation-fix/annotations.yaml
@@ -0,0 +1,9 @@
+description: >
+  Trivial indentation fix in nfsserver — not TNF-specific, no shell scripts
+  relevant to TNF. Should produce a report with minimal or no security findings.
+  Edge case testing the skill's handling of low-risk, non-TNF PRs.
+has_shell_scripts: false
+has_trust_boundary_crossing: false
+expected_severities: []
+affected_dfd_elements: []
+expected_mitre_techniques: []
diff --git a/plugins/two-node/evals/threat-model-tnf/cases/case-004-trivial-indentation-fix/input.yaml b/plugins/two-node/evals/threat-model-tnf/cases/case-004-trivial-indentation-fix/input.yaml
new file mode 100644
index 00000000..f883d7a3
--- /dev/null
+++ b/plugins/two-node/evals/threat-model-tnf/cases/case-004-trivial-indentation-fix/input.yaml
@@ -0,0 +1,3 @@
+pr_input: "https://github.com/ClusterLabs/resource-agents/pull/2168"
+repo: resource-agents
+org: ClusterLabs
diff --git a/plugins/two-node/evals/threat-model-tnf/cases/case-005-tnf-retry-bugfix/annotations.yaml b/plugins/two-node/evals/threat-model-tnf/cases/case-005-tnf-retry-bugfix/annotations.yaml
new file mode 100644
index 00000000..d9d3e7b2
--- /dev/null
+++ b/plugins/two-node/evals/threat-model-tnf/cases/case-005-tnf-retry-bugfix/annotations.yaml
@@ -0,0 +1,15 @@
+description: >
+  Bug fix gating dual-replica setup and adding retry logic in TNF pipeline.
+  Modifies P4 (Setup Job) behavior. Tests whether the skill correctly identifies
+  denial-of-service risk from retry logic changes and setup gate modifications.
+  Uses bare PR number format — tests repo auto-detection from CWD.
+has_shell_scripts: false
+has_trust_boundary_crossing: false
+expected_severities:
+  - Medium
+  - Low
+affected_dfd_elements:
+  - P4
+  - P2
+expected_mitre_techniques:
+  - T1499
diff --git a/plugins/two-node/evals/threat-model-tnf/cases/case-005-tnf-retry-bugfix/input.yaml b/plugins/two-node/evals/threat-model-tnf/cases/case-005-tnf-retry-bugfix/input.yaml
new file mode 100644
index 00000000..2ccb92a2
--- /dev/null
+++ b/plugins/two-node/evals/threat-model-tnf/cases/case-005-tnf-retry-bugfix/input.yaml
@@ -0,0 +1,3 @@
+pr_input: "1620"
+repo: cluster-etcd-operator
+org: openshift
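
Reviewer note on the annotations: the `expected_mitre_techniques` lists could feed an outcome-aware check. The following is a rough sketch under the assumption that the report text and the parsed annotations are available as a string and a dict. The function `technique_recall` and the sample report line are hypothetical, and nothing here reflects how the harness actually passes annotations to judges.

```python
import re


def technique_recall(report: str, expected: list[str]) -> float:
    """Fraction of expected MITRE technique IDs that appear in the report."""
    if not expected:
        # Nothing expected (for example the trivial indentation case),
        # so there is nothing to miss.
        return 1.0
    found = set(re.findall(r"\bT\d{4}\b", report))
    return len(found & set(expected)) / len(expected)


# Values copied from case-001 annotations; the report line is made up.
annotations = {"expected_mitre_techniques": ["T1552", "T1078", "T1005"]}
report = "VULN-1 maps to T1552 and T1078.001; VULN-2 maps to T1059."

print(technique_recall(report, annotations["expected_mitre_techniques"]))
```

Recall is used rather than exact match so that extra, unanticipated techniques in a report do not count against it, which fits the graded (not pass/fail) scoring described in the README.
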