Skip to content
Open
Changes from all commits
Commits
File filter

Filter by extension

Filter by extension

Conversations
Failed to load comments.
Loading
Jump to
Jump to file
Failed to load files.
Loading
Diff view
Diff view
215 changes: 215 additions & 0 deletions seps/SEP-0032-cve-fix-refactoring.md
Original file line number Diff line number Diff line change
@@ -0,0 +1,215 @@
---
sep: 32
title: "CVE Fix Skill Refactoring"
status: implemented
authors: [dfarrell07]
created: 2026-04-09
components: [shipyard]
repos: [submariner-io/shipyard]
---

# CVE Fix Skill Refactoring

## Summary

Extract the CVE fix skill from a monolithic SKILL.md into shellcheck-linted
scripts under `scripts/cve/`, with a `make cve-fix` target and Claude subagent
review. The happy path is fully deterministic — scan, fix, verify, commit — with
agent review as a quality layer on all results.

## Motivation

The CVE fix skill embeds all bash inline in markdown code blocks. This means
no shellcheck linting, no standalone execution, no composability with make
targets, and the agent orchestrates every step even when no judgment is needed.

CVE fixing is almost always deterministic: scan with grype, `go get` the
FIXED-IN version, verify, commit. An agent should review the results, not
drive every step. The release-management repo's patterns (extracted scripts,
make targets, pre-fetched evidence + subagent review) solve this well.

## Design

### Architecture

```bash
make cve-fix BRANCH=release-0.23 # Current repo
make cve-fix REPO=../submariner-operator BRANCH=release-0.23 # Cross-repo
/cve-fix release-0.23 # Via skill
/cve-fix release-0.23 ../submariner-operator # Cross-repo via skill
```

One orchestrator (`fix-all.sh`) runs three phases:

1. **Deterministic**: detect config, create fix branch, clean, scan (JSON),
group CVEs by package, fix each, run tests. CVEs without a fix version
are left for agent review.
2. **Agent review**: invoke Claude subagent with pre-fetched evidence (scan
results, locate output, fix summary). The agent verifies fixes, handles
unfixed CVEs, and can call `ignore.sh` or retry with different versions.
Skipped with a warning if `claude` CLI is not available.
3. **Summary**: print what was fixed, ignored, flagged, and the PR command.

### Key Design Decisions

**Dynamic go.mod discovery.** Repos like lighthouse have go.mod files beyond
root and tools (`coredns/go.mod`). All scripts discover modules dynamically
via `find_gomods()` rather than hardcoding root + tools. Uses `git ls-files`
to respect `.gitignore` (avoiding build artifacts like `ovn-kubernetes/`),
with a `find` fallback outside git repos. This means locate, fix, tidy,
clean, and staging all work on any repo layout without config.

**Verify by checking go.mod, not scan output.** After fixing a package,
grype still reports it if any go.sum in the repo references the old version
as a transitive dependency. Verification checks that the go.mod files we
updated now contain the fixed version, not whether the package name vanishes
from the scan.

**stdout is for data, stderr is for progress.** When `--json` is used,
only the grype JSON goes to stdout. Docker commands, `make clean`, DB updates,
and warnings all go to stderr. Data piped between scripts uses `printf '%s\n'`
instead of `echo` to avoid backslash interpretation.

**Non-fatal external dependencies.** Docker image pulls (for Go version
detection) and Dapper builds (for binary scanning) can fail without aborting
the workflow. Container-based grype scans fall back to local grype on failure.
Source-level scanning still works.

**Insert, don't append.** New `.grype.yaml` ignore entries are inserted
before the `exclude:` section, not appended to the end of the file.

### Scripts

```text
scripts/cve/
lib.sh # Shared: find_gomods, run_grype, clean_gomod, state mgmt
detect.sh # Config detection -> state file
scan.sh # Grype scan (container or local), JSON + table output
locate.sh # Find package across all go.mod files
fix-package.sh # Fix one package: update, verify, commit
fix-stdlib.sh # Fix stdlib CVEs: update go directive, commit
fix-all.sh # Orchestrator: phases 1-3
ignore.sh # Add to .grype.yaml ignore list, commit
review.sh # Pre-fetch evidence, invoke Claude subagent
review-prompt.md # Subagent prompt template (envsubst)
test-lib.sh # Unit tests (29 tests, no docker/network needed)
```

### State File

Scripts share state via a sourced env file keyed by repo, branch,
and PID (so cleanup from one run doesn't delete another's state file):

```bash
# /tmp/cve-fix-{repo-basename}-{sanitized-branch}-{pid}.env
REPO="/home/user/go/src/submariner-io/submariner-operator"
BRANCH="release-0.23"
FIX_BRANCH="fix-0.23-cves-2026-04-09"
HAS_TOOLS_GOMOD=true
GENERATED_FILE="./pkg/embeddedyamls/yamls.go"
DIFF_IGNORE_ARGS="-Icontroller-gen.kubebuilder.io/version"
NEEDS_BUILD_FOR_SCAN=true
CONTAINER_CMD="docker"
HAS_LOCAL_GRYPE=true
# ... plus image, Go version, fetch state, script path
```

Each consumer script calls `load_state "$1"` which sources the file
and `cd`s to `$REPO`.

### Script Interfaces

| Script | Usage | Exit codes |
|--------|-------|------------|
| `detect.sh` | `[repo] [branch] [--setup-branch]` | 0: ok, 1: error |
| `scan.sh` | `STATE_FILE [--fresh] [--json]` | 0: ok, 1: error |
| `fix-package.sh` | `STATE_FILE PACKAGE VERSION CVE_ID...` | 0: fixed, 2: breaking, 3: persists |
| `fix-stdlib.sh` | `STATE_FILE GO_VERSION CVE_ID...` | 0: fixed, 2: breaking |
| `fix-all.sh` | `[repo] [branch]` | 0: all addressed, 1: error, 2: unresolved |
| `ignore.sh` | `STATE_FILE PACKAGE CVE_ID SEVERITY REASON` | 0: ok |
| `review.sh` | `STATE_FILE FIX_SUMMARY [SCAN_OUTPUT]` | 0: ok |

Args to `detect.sh` and `fix-all.sh` are order-independent; `0.23`
auto-expands to `release-0.23`. `fix-package.sh` reverts on Go or K8s
minor version upgrades. Agent review skips if `claude` CLI is not found.

### Cross-Repo Usage

CVE scanning runs outside Dapper (uses host docker/podman or local grype), so the
`cve-fix` make target is in shipyard's own Makefile with
`NON_DAPPER_GOALS`, not the shared include chain. No per-repo config
needed — `detect.sh` auto-discovers go.mod files, generated files,
build requirements, and scanner availability.

### SKILL.md

The skill is a thin wrapper that finds the scripts and delegates. The
`description` field includes TRIGGER conditions so Claude invokes
it from natural language (e.g., "fix CVEs in release-0.23"):

```bash
CVE_SCRIPTS="$(pwd)/scripts/cve"
[[ -d "$CVE_SCRIPTS" ]] || CVE_SCRIPTS="$HOME/go/src/submariner-io/shipyard/scripts/cve"
exec bash "$CVE_SCRIPTS/fix-all.sh" $ARGUMENTS
```

### Parallel Execution

The state-file-per-repo design enables parallel execution. For a single
repo, Claude invokes the skill directly. For multiple repos, Claude
spawns one agent per repo, each running fix-all.sh via the Bash tool.
Subagents use the Bash tool because the Skill tool's forked context
does not stream output, causing the 600s watchdog timeout on long runs.

```text
User: fix cves in subctl on 0.21
Claude: invokes /cve-fix (single skill call)

User: fix cves in all repos on 0.21
Claude: spawns 4 agents, each running fix-all.sh (parallel)
```

### Future: Jira Integration

A Jira MCP skill could close the loop: look up CVE-tracking Jiras,
run `/cve-fix` for the affected repos/branches, and link the resulting
PRs back to the Jira issues automatically.

### Alternatives Considered

**Multi-step SKILL.md calling scripts** -- the agent would orchestrate
each step (scan, fix, verify, commit) individually. Rejected because CVE
fixing rarely needs agent judgment; full deterministic orchestration is
simpler and faster.

**Separate make targets for scan/fix/review** -- rejected for simplicity.
One target does everything; agent review is automatic.

**Hardcoded root + tools go.mod** -- would miss modules like
`coredns/go.mod` in lighthouse. Dynamic discovery via `find_gomods()` handles
any repo layout.

### Backward Compatibility

The `/cve-fix` skill interface is preserved (same arguments, same behavior).
The `.agents/workflows/cve-fix.md` files in each repo retain the manual
process as a reference. Shipyard's has a pointer to the automation at
the top.

## Implementation

Built in five layers: shared library (`lib.sh`) → detection and scanning
(`detect.sh`, `scan.sh`) → per-package operations (`locate.sh`,
`fix-package.sh`, `fix-stdlib.sh`, `ignore.sh`) → agent review
(`review.sh`, `review-prompt.md`) → orchestrator and integration
(`fix-all.sh`, `test-lib.sh`, `Makefile`, `SKILL.md`).

## Done When

- [x] Scripts shellcheck-clean, 29 unit tests pass
- [x] End-to-end on all 7 repos, release-0.19 through devel (47 combos)
- [x] Multi-module repos (lighthouse: root + tools + coredns)
- [x] Agent review (ignored helm/otel CVEs incompatible with stable branches)
- [x] Parallel execution (operator + lighthouse via subagents)
- [x] `make cve-fix REPO=... BRANCH=...` (non-interactive, no dapper)
- [x] SKILL.md reduced from 696 to 85 lines
Loading