submariner-io · dfarrell07 · Apr 9, 2026
@@ -0,0 +1,215 @@
+---
+sep: 32
+title: "CVE Fix Skill Refactoring"
+status: implemented
+authors: [dfarrell07]
+created: 2026-04-09
+components: [shipyard]
+repos: [submariner-io/shipyard]
+---
+
+# CVE Fix Skill Refactoring
+
+## Summary
+
+Extract the CVE fix skill from a monolithic SKILL.md into shellcheck-linted
+scripts under `scripts/cve/`, with a `make cve-fix` target and Claude subagent
+review. The happy path is fully deterministic — scan, fix, verify, commit — with
+agent review as a quality layer on all results.
+
+## Motivation
+
+The CVE fix skill embeds all bash inline in markdown code blocks. This means
+no shellcheck linting, no standalone execution, no composability with make
+targets, and the agent orchestrates every step even when no judgment is needed.
+
+CVE fixing is almost always deterministic: scan with grype, `go get` the
+FIXED-IN version, verify, commit. An agent should review the results, not
+drive every step. The release-management repo's patterns (extracted scripts,
+make targets, pre-fetched evidence + subagent review) solve this well.
+
+## Design
+
+### Architecture
+
+```bash
+make cve-fix BRANCH=release-0.23                                  # Current repo
+make cve-fix REPO=../submariner-operator BRANCH=release-0.23      # Cross-repo
+/cve-fix release-0.23                                             # Via skill
+/cve-fix release-0.23 ../submariner-operator                      # Cross-repo via skill
+```
+
+One orchestrator (`fix-all.sh`) runs three phases:
+
+1. **Deterministic**: detect config, create fix branch, clean, scan (JSON),
+   group CVEs by package, fix each, run tests. CVEs without a fix version
+   are left for agent review.
+2. **Agent review**: invoke Claude subagent with pre-fetched evidence (scan
+   results, locate output, fix summary). The agent verifies fixes, handles
+   unfixed CVEs, and can call `ignore.sh` or retry with different versions.
+   Skipped with a warning if `claude` CLI is not available.
+3. **Summary**: print what was fixed, ignored, flagged, and the PR command.
+
+### Key Design Decisions
+
+**Dynamic go.mod discovery.** Repos like lighthouse have go.mod files beyond
+root and tools (`coredns/go.mod`). All scripts discover modules dynamically
+via `find_gomods()` rather than hardcoding root + tools. Uses `git ls-files`
+to respect `.gitignore` (avoiding build artifacts like `ovn-kubernetes/`),
+with a `find` fallback outside git repos. This means locate, fix, tidy,
+clean, and staging all work on any repo layout without config.
+
+**Verify by checking go.mod, not scan output.** After fixing a package,
+grype still reports it if any go.sum in the repo references the old version
+as a transitive dependency. Verification checks that the go.mod files we
+updated now contain the fixed version, not whether the package name vanishes
+from the scan.
+
+**stdout is for data, stderr is for progress.** When `--json` is used,
+only the grype JSON goes to stdout. Docker commands, `make clean`, DB updates,
+and warnings all go to stderr. Data piped between scripts uses `printf '%s\n'`
+instead of `echo` to avoid backslash interpretation.
+
+**Non-fatal external dependencies.** Docker image pulls (for Go version
+detection) and Dapper builds (for binary scanning) can fail without aborting
+the workflow. Container-based grype scans fall back to local grype on failure.
+Source-level scanning still works.
+
+**Insert, don't append.** New `.grype.yaml` ignore entries are inserted
+before the `exclude:` section, not appended to the end of the file.
+
+### Scripts
+
+```text
+scripts/cve/
+  lib.sh              # Shared: find_gomods, run_grype, clean_gomod, state mgmt
+  detect.sh           # Config detection -> state file
+  scan.sh             # Grype scan (container or local), JSON + table output
+  locate.sh           # Find package across all go.mod files
+  fix-package.sh      # Fix one package: update, verify, commit
+  fix-stdlib.sh       # Fix stdlib CVEs: update go directive, commit
+  fix-all.sh          # Orchestrator: phases 1-3
+  ignore.sh           # Add to .grype.yaml ignore list, commit
+  review.sh           # Pre-fetch evidence, invoke Claude subagent
+  review-prompt.md    # Subagent prompt template (envsubst)
+  test-lib.sh         # Unit tests (29 tests, no docker/network needed)
+```
+
+### State File
+
+Scripts share state via a sourced env file keyed by repo, branch,
+and PID (so cleanup from one run doesn't delete another's state file):
+
+```bash
+# /tmp/cve-fix-{repo-basename}-{sanitized-branch}-{pid}.env
+REPO="/home/user/go/src/submariner-io/submariner-operator"
+BRANCH="release-0.23"
+FIX_BRANCH="fix-0.23-cves-2026-04-09"
+HAS_TOOLS_GOMOD=true
+GENERATED_FILE="./pkg/embeddedyamls/yamls.go"
+DIFF_IGNORE_ARGS="-Icontroller-gen.kubebuilder.io/version"
+NEEDS_BUILD_FOR_SCAN=true
+CONTAINER_CMD="docker"
+HAS_LOCAL_GRYPE=true
+# ... plus image, Go version, fetch state, script path
+```
+
+Each consumer script calls `load_state "$1"` which sources the file
+and `cd`s to `$REPO`.
+
+### Script Interfaces
+
+| Script | Usage | Exit codes |
+|--------|-------|------------|
+| `detect.sh` | `[repo] [branch] [--setup-branch]` | 0: ok, 1: error |
+| `scan.sh` | `STATE_FILE [--fresh] [--json]` | 0: ok, 1: error |
+| `fix-package.sh` | `STATE_FILE PACKAGE VERSION CVE_ID...` | 0: fixed, 2: breaking, 3: persists |
+| `fix-stdlib.sh` | `STATE_FILE GO_VERSION CVE_ID...` | 0: fixed, 2: breaking |
+| `fix-all.sh` | `[repo] [branch]` | 0: all addressed, 1: error, 2: unresolved |
+| `ignore.sh` | `STATE_FILE PACKAGE CVE_ID SEVERITY REASON` | 0: ok |
+| `review.sh` | `STATE_FILE FIX_SUMMARY [SCAN_OUTPUT]` | 0: ok |
+
+Args to `detect.sh` and `fix-all.sh` are order-independent; `0.23`
+auto-expands to `release-0.23`. `fix-package.sh` reverts on Go or K8s
+minor version upgrades. Agent review skips if `claude` CLI is not found.
+
+### Cross-Repo Usage
+
+CVE scanning runs outside Dapper (uses host docker/podman or local grype), so the
+`cve-fix` make target is in shipyard's own Makefile with
+`NON_DAPPER_GOALS`, not the shared include chain. No per-repo config
+needed — `detect.sh` auto-discovers go.mod files, generated files,
+build requirements, and scanner availability.
+
+### SKILL.md
+
+The skill is a thin wrapper that finds the scripts and delegates. The
+`description` field includes TRIGGER conditions so Claude invokes
+it from natural language (e.g., "fix CVEs in release-0.23"):
+
+```bash
+CVE_SCRIPTS="$(pwd)/scripts/cve"
+[[ -d "$CVE_SCRIPTS" ]] || CVE_SCRIPTS="$HOME/go/src/submariner-io/shipyard/scripts/cve"
+exec bash "$CVE_SCRIPTS/fix-all.sh" $ARGUMENTS
+```
+
+### Parallel Execution
+
+The state-file-per-repo design enables parallel execution. For a single
+repo, Claude invokes the skill directly. For multiple repos, Claude
+spawns one agent per repo, each running fix-all.sh via the Bash tool.
+Subagents use the Bash tool because the Skill tool's forked context
+does not stream output, causing the 600s watchdog timeout on long runs.
+
+```text
+User: fix cves in subctl on 0.21
+Claude: invokes /cve-fix (single skill call)
+
+User: fix cves in all repos on 0.21
+Claude: spawns 4 agents, each running fix-all.sh (parallel)
+```
+
+### Future: Jira Integration
+
+A Jira MCP skill could close the loop: look up CVE-tracking Jiras,
+run `/cve-fix` for the affected repos/branches, and link the resulting
+PRs back to the Jira issues automatically.
+
+### Alternatives Considered
+
+**Multi-step SKILL.md calling scripts** -- the agent would orchestrate
+each step (scan, fix, verify, commit) individually. Rejected because CVE
+fixing rarely needs agent judgment; full deterministic orchestration is
+simpler and faster.
+
+**Separate make targets for scan/fix/review** -- rejected for simplicity.
+One target does everything; agent review is automatic.
+
+**Hardcoded root + tools go.mod** -- would miss modules like
+`coredns/go.mod` in lighthouse. Dynamic discovery via `find_gomods()` handles
+any repo layout.
+
+### Backward Compatibility
+
+The `/cve-fix` skill interface is preserved (same arguments, same behavior).
+The `.agents/workflows/cve-fix.md` files in each repo retain the manual
+process as a reference. Shipyard's has a pointer to the automation at
+the top.
+
+## Implementation
+
+Built in five layers: shared library (`lib.sh`) → detection and scanning
+(`detect.sh`, `scan.sh`) → per-package operations (`locate.sh`,
+`fix-package.sh`, `fix-stdlib.sh`, `ignore.sh`) → agent review
+(`review.sh`, `review-prompt.md`) → orchestrator and integration
+(`fix-all.sh`, `test-lib.sh`, `Makefile`, `SKILL.md`).
+
+## Done When
+
+- [x] Scripts shellcheck-clean, 29 unit tests pass
+- [x] End-to-end on all 7 repos, release-0.19 through devel (47 combos)
+- [x] Multi-module repos (lighthouse: root + tools + coredns)
+- [x] Agent review (ignored helm/otel CVEs incompatible with stable branches)
+- [x] Parallel execution (operator + lighthouse via subagents)
+- [x] `make cve-fix REPO=... BRANCH=...` (non-interactive, no dapper)
+- [x] SKILL.md reduced from 696 to 85 lines