diff --git a/seps/SEP-0032-cve-fix-refactoring.md b/seps/SEP-0032-cve-fix-refactoring.md new file mode 100644 index 00000000..e9c8c5cf --- /dev/null +++ b/seps/SEP-0032-cve-fix-refactoring.md @@ -0,0 +1,215 @@ +--- +sep: 32 +title: "CVE Fix Skill Refactoring" +status: implemented +authors: [dfarrell07] +created: 2026-04-09 +components: [shipyard] +repos: [submariner-io/shipyard] +--- + +# CVE Fix Skill Refactoring + +## Summary + +Extract the CVE fix skill from a monolithic SKILL.md into shellcheck-linted +scripts under `scripts/cve/`, with a `make cve-fix` target and Claude subagent +review. The happy path is fully deterministic — scan, fix, verify, commit — with +agent review as a quality layer on all results. + +## Motivation + +The CVE fix skill embeds all bash inline in markdown code blocks. This means +no shellcheck linting, no standalone execution, no composability with make +targets, and the agent orchestrates every step even when no judgment is needed. + +CVE fixing is almost always deterministic: scan with grype, `go get` the +FIXED-IN version, verify, commit. An agent should review the results, not +drive every step. The release-management repo's patterns (extracted scripts, +make targets, pre-fetched evidence + subagent review) solve this well. + +## Design + +### Architecture + +```bash +make cve-fix BRANCH=release-0.23 # Current repo +make cve-fix REPO=../submariner-operator BRANCH=release-0.23 # Cross-repo +/cve-fix release-0.23 # Via skill +/cve-fix release-0.23 ../submariner-operator # Cross-repo via skill +``` + +One orchestrator (`fix-all.sh`) runs three phases: + +1. **Deterministic**: detect config, create fix branch, clean, scan (JSON), + group CVEs by package, fix each, run tests. CVEs without a fix version + are left for agent review. +2. **Agent review**: invoke Claude subagent with pre-fetched evidence (scan + results, locate output, fix summary). The agent verifies fixes, handles + unfixed CVEs, and can call `ignore.sh` or retry with different versions. + Skipped with a warning if `claude` CLI is not available. +3. **Summary**: print what was fixed, ignored, flagged, and the PR command. + +### Key Design Decisions + +**Dynamic go.mod discovery.** Repos like lighthouse have go.mod files beyond +root and tools (`coredns/go.mod`). All scripts discover modules dynamically +via `find_gomods()` rather than hardcoding root + tools. Uses `git ls-files` +to respect `.gitignore` (avoiding build artifacts like `ovn-kubernetes/`), +with a `find` fallback outside git repos. This means locate, fix, tidy, +clean, and staging all work on any repo layout without config. + +**Verify by checking go.mod, not scan output.** After fixing a package, +grype still reports it if any go.sum in the repo references the old version +as a transitive dependency. Verification checks that the go.mod files we +updated now contain the fixed version, not whether the package name vanishes +from the scan. + +**stdout is for data, stderr is for progress.** When `--json` is used, +only the grype JSON goes to stdout. Docker commands, `make clean`, DB updates, +and warnings all go to stderr. Data piped between scripts uses `printf '%s\n'` +instead of `echo` to avoid backslash interpretation. + +**Non-fatal external dependencies.** Docker image pulls (for Go version +detection) and Dapper builds (for binary scanning) can fail without aborting +the workflow. Container-based grype scans fall back to local grype on failure. +Source-level scanning still works. + +**Insert, don't append.** New `.grype.yaml` ignore entries are inserted +before the `exclude:` section, not appended to the end of the file. + +### Scripts + +```text +scripts/cve/ + lib.sh # Shared: find_gomods, run_grype, clean_gomod, state mgmt + detect.sh # Config detection -> state file + scan.sh # Grype scan (container or local), JSON + table output + locate.sh # Find package across all go.mod files + fix-package.sh # Fix one package: update, verify, commit + fix-stdlib.sh # Fix stdlib CVEs: update go directive, commit + fix-all.sh # Orchestrator: phases 1-3 + ignore.sh # Add to .grype.yaml ignore list, commit + review.sh # Pre-fetch evidence, invoke Claude subagent + review-prompt.md # Subagent prompt template (envsubst) + test-lib.sh # Unit tests (29 tests, no docker/network needed) +``` + +### State File + +Scripts share state via a sourced env file keyed by repo, branch, +and PID (so cleanup from one run doesn't delete another's state file): + +```bash +# /tmp/cve-fix-{repo-basename}-{sanitized-branch}-{pid}.env +REPO="/home/user/go/src/submariner-io/submariner-operator" +BRANCH="release-0.23" +FIX_BRANCH="fix-0.23-cves-2026-04-09" +HAS_TOOLS_GOMOD=true +GENERATED_FILE="./pkg/embeddedyamls/yamls.go" +DIFF_IGNORE_ARGS="-Icontroller-gen.kubebuilder.io/version" +NEEDS_BUILD_FOR_SCAN=true +CONTAINER_CMD="docker" +HAS_LOCAL_GRYPE=true +# ... plus image, Go version, fetch state, script path +``` + +Each consumer script calls `load_state "$1"` which sources the file +and `cd`s to `$REPO`. + +### Script Interfaces + +| Script | Usage | Exit codes | +|--------|-------|------------| +| `detect.sh` | `[repo] [branch] [--setup-branch]` | 0: ok, 1: error | +| `scan.sh` | `STATE_FILE [--fresh] [--json]` | 0: ok, 1: error | +| `fix-package.sh` | `STATE_FILE PACKAGE VERSION CVE_ID...` | 0: fixed, 2: breaking, 3: persists | +| `fix-stdlib.sh` | `STATE_FILE GO_VERSION CVE_ID...` | 0: fixed, 2: breaking | +| `fix-all.sh` | `[repo] [branch]` | 0: all addressed, 1: error, 2: unresolved | +| `ignore.sh` | `STATE_FILE PACKAGE CVE_ID SEVERITY REASON` | 0: ok | +| `review.sh` | `STATE_FILE FIX_SUMMARY [SCAN_OUTPUT]` | 0: ok | + +Args to `detect.sh` and `fix-all.sh` are order-independent; `0.23` +auto-expands to `release-0.23`. `fix-package.sh` reverts on Go or K8s +minor version upgrades. Agent review skips if `claude` CLI is not found. + +### Cross-Repo Usage + +CVE scanning runs outside Dapper (uses host docker/podman or local grype), so the +`cve-fix` make target is in shipyard's own Makefile with +`NON_DAPPER_GOALS`, not the shared include chain. No per-repo config +needed — `detect.sh` auto-discovers go.mod files, generated files, +build requirements, and scanner availability. + +### SKILL.md + +The skill is a thin wrapper that finds the scripts and delegates. The +`description` field includes TRIGGER conditions so Claude invokes +it from natural language (e.g., "fix CVEs in release-0.23"): + +```bash +CVE_SCRIPTS="$(pwd)/scripts/cve" +[[ -d "$CVE_SCRIPTS" ]] || CVE_SCRIPTS="$HOME/go/src/submariner-io/shipyard/scripts/cve" +exec bash "$CVE_SCRIPTS/fix-all.sh" $ARGUMENTS +``` + +### Parallel Execution + +The state-file-per-repo design enables parallel execution. For a single +repo, Claude invokes the skill directly. For multiple repos, Claude +spawns one agent per repo, each running fix-all.sh via the Bash tool. +Subagents use the Bash tool because the Skill tool's forked context +does not stream output, causing the 600s watchdog timeout on long runs. + +```text +User: fix cves in subctl on 0.21 +Claude: invokes /cve-fix (single skill call) + +User: fix cves in all repos on 0.21 +Claude: spawns 4 agents, each running fix-all.sh (parallel) +``` + +### Future: Jira Integration + +A Jira MCP skill could close the loop: look up CVE-tracking Jiras, +run `/cve-fix` for the affected repos/branches, and link the resulting +PRs back to the Jira issues automatically. + +### Alternatives Considered + +**Multi-step SKILL.md calling scripts** -- the agent would orchestrate +each step (scan, fix, verify, commit) individually. Rejected because CVE +fixing rarely needs agent judgment; full deterministic orchestration is +simpler and faster. + +**Separate make targets for scan/fix/review** -- rejected for simplicity. +One target does everything; agent review is automatic. + +**Hardcoded root + tools go.mod** -- would miss modules like +`coredns/go.mod` in lighthouse. Dynamic discovery via `find_gomods()` handles +any repo layout. + +### Backward Compatibility + +The `/cve-fix` skill interface is preserved (same arguments, same behavior). +The `.agents/workflows/cve-fix.md` files in each repo retain the manual +process as a reference. Shipyard's has a pointer to the automation at +the top. + +## Implementation + +Built in five layers: shared library (`lib.sh`) → detection and scanning +(`detect.sh`, `scan.sh`) → per-package operations (`locate.sh`, +`fix-package.sh`, `fix-stdlib.sh`, `ignore.sh`) → agent review +(`review.sh`, `review-prompt.md`) → orchestrator and integration +(`fix-all.sh`, `test-lib.sh`, `Makefile`, `SKILL.md`). + +## Done When + +- [x] Scripts shellcheck-clean, 29 unit tests pass +- [x] End-to-end on all 7 repos, release-0.19 through devel (47 combos) +- [x] Multi-module repos (lighthouse: root + tools + coredns) +- [x] Agent review (ignored helm/otel CVEs incompatible with stable branches) +- [x] Parallel execution (operator + lighthouse via subagents) +- [x] `make cve-fix REPO=... BRANCH=...` (non-interactive, no dapper) +- [x] SKILL.md reduced from 696 to 85 lines