feat(anc): add check-hotfix subcommand to read hotfix pointer from LPS #8696
Devinwong wants to merge 2 commits into
// lpsTargetFromNodeConfig reads the apiserver FQDN (the forced dial target) and the cluster
Hmm, does this assume we are getting it from the AKS node config? Will it work with the NBC CSE sh file as well?
Yes, we are getting it from the AKSNodeConfig JSON. IIUC, AKSNodeConfig is only available on nodes from phase 2.5 onward, so this approach only works starting from phase 2.5. Before that, it is a fail-open no-op.
Add a fail-open 'check-hotfix' CLI subcommand that reads the base->hotfix
pointer map from the live-patching-service (LPS) over the IMDS-attested SNI
path that is reachable pre-kubelet, and stages the resolved {hotfixes:{...}}
pointer to the path download-hotfix already reads. download-hotfix keeps its
unchanged patch-only, strictly-higher gating; check-hotfix only fetches and
writes the pointer.
- Raw net/http HTTPS GET (no client-go). TLS ServerName pinned to the LPS
SNI host while the TCP dial is forced to the apiserver FQDN (curl --resolve
trick); Authorization is the IMDS attested-data signature; the server cert
is verified against the cluster CA from the provision-config.
- FQDN + cluster CA come from the AKSNodeConfig ANC already parses (the only
credential source present pre-provisioning); caSource is logged.
- Shares the hotfixConfig parser/data contract with download-hotfix.
- Always exits 0; emits CheckHotfix telemetry (lpsRead, noHotfixForBase,
noHotfixAvailable, customDataFallback, failed).
- A reachable LPS with no hotfix published for this node (HTTP 401, 403, 404)
is a benign no-op (noHotfixAvailable): no overlay is staged and it is never
classified as a failure. Only transport/5xx failures fall back.
- PoC cold-start fallback reads a lenient top-level hotfixes object from the
node config when the LPS read fails (TODO: typed contract field).
- Injectable App fields (checkHotfixFetcher, fetchAttestedToken,
nodeConfigPath) for network-free unit tests.
- The LPS route + response schema are a planned-maintenance deliverable that
is not finalized; lpsHotfixPath is a clearly-marked placeholder with a TODO.
The IMDS/LPS client helpers mirror the connectivity prototype and should be
de-duplicated into a shared LPS client when that lands.
Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Failed gate
Run: https://msazure.visualstudio.com/CloudNativeCompute/_build/results?buildId=170741296
Failed job/stage/task:

Detective summary
Known VHD scan/CIS-CAT gate failure recurred.

Likely cause / signature
Likely known CIS-CAT assessor scan tooling failure, not this PR. Signature:
Strongest alternative: a real Ubuntu image compliance/package regression affecting CIS-CAT. Less likely because this exactly matches the existing repeated exit-122 signature and PR #8696 changes ANC/check-hotfix Go code and tests, not image generation, Packer, scan configuration, CSE, or provisioning scripts.

Recommended action
No PR author action recommended. Node Lifecycle/VHD gate owner should continue repair item #38671557.

Evidence
Add a check-hotfix subcommand that reads the hotfix pointer from the live-patching-service

check-hotfix (new aks-node-controller subcommand) fetches a base-to-hotfix-version pointer from the live-patching-service (LPS) over an IMDS-attested HTTPS path reachable before kubelet, stages it to the file download-hotfix already consumes, and falls back to an embedded pointer if the LPS is unreachable. It only fetches and stages - download-hotfix keeps its unchanged patch-only, strictly-higher gating - and it always exits 0, so provisioning is never blocked.

flowchart LR
    START["check-hotfix runs<br/>(pre-kubelet)"] --> FETCH["Fetch hotfix pointer<br/>from LPS"]
    FETCH --> GOT{"Got it?"}
    GOT -- "yes" --> STAGE["Stage pointer file"]
    GOT -- "nothing for me<br/>(401/403/404)" --> NOOP["Skip - benign"]
    GOT -- "LPS unreachable" --> COLD["Fall back to<br/>embedded pointer"]
    COLD --> STAGE
    STAGE --> DH["download-hotfix<br/>reads & applies it later"]
    NOOP --> EXIT["Always exit 0<br/>(never blocks provisioning)"]
    STAGE --> EXIT

Stacking
2.1a (#8694) has merged, so this PR targets main directly (app.go wiring + checkhotfix.go + checkhotfix_test.go). It is always-on by itself; the feature gate arrives later in 2.1d.

Open dependency (placeholder route)
The LPS route/schema for the pointer is a planned-maintenance deliverable that is not finalized; the prototype only proved reachability. The route is a named placeholder (lpsHotfixPath = "/v1/anc-hotfix") with a TODO, and the IMDS/LPS client helpers are flagged in-code to be de-duplicated into a shared LPS client when that lands. This PR is held DRAFT until the route finalizes.

Net effect (examples)
LPS serves {"hotfixes":{"202604.01":"202604.01.1"}}; check-hotfix stages the same JSON to /opt/azure/containers/aks-node-controller-hotfix.json.

hotfixes present

Tests
Network-free unit tests (LPS fetcher + attested-token injected) cover every outcome above, fail-open on parse/transport errors, shared-parser equivalence with download-hotfix, the SNI-pinned TLS client (CA required, no insecure fallback), and the 0644 staged-file mode. All new tests pass; the only go test ./... failures are pre-existing Windows-only environmental ones that pass in Linux CI.