feat(anc): add check-hotfix subcommand to read hotfix pointer from LPS #8696

Open: Devinwong wants to merge 2 commits into main from devinwong/anc-check-hotfix-configmap
@Devinwong Devinwong commented Jun 12, 2026

Collaborator

Add a check-hotfix subcommand that reads the hotfix pointer from the live-patching-service

check-hotfix (new aks-node-controller subcommand) fetches a base-to-hotfix-version pointer from the live-patching-service (LPS) over an IMDS-attested HTTPS path reachable before kubelet, stages it to the file download-hotfix already consumes, and falls back to an embedded pointer if the LPS is unreachable. It only fetches and stages - download-hotfix keeps its unchanged patch-only, strictly-higher gating - and it always exits 0, so provisioning is never blocked.

flowchart LR
    START["check-hotfix runs<br/>(pre-kubelet)"] --> FETCH["Fetch hotfix pointer<br/>from LPS"]
    FETCH --> GOT{"Got it?"}
    GOT -- "yes" --> STAGE["Stage pointer file"]
    GOT -- "nothing for me<br/>(401/403/404)" --> NOOP["Skip - benign"]
    GOT -- "LPS unreachable" --> COLD["Fall back to<br/>embedded pointer"]
    COLD --> STAGE
    STAGE --> DH["download-hotfix<br/>reads & applies it later"]
    NOOP --> EXIT["Always exit 0<br/>(never blocks provisioning)"]
    STAGE --> EXIT
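The branching in the flowchart can be sketched as a small classifier. This is a minimal illustration, not the code in checkhotfix.go: the `classify` helper, the `Outcome` type, and `errTransport` are hypothetical names, and only the outcome strings come from this PR. It assumes any non-200 status other than 401/403/404 is treated like a 5xx. `noHotfixForBase` is not covered here because it is decided after a successful read, when the pointer is matched against the node's base version.

```go
package main

import (
	"errors"
	"fmt"
	"net/http"
)

// Outcome values mirror the CheckHotfix telemetry names listed in this PR.
type Outcome string

const (
	OutcomeLPSRead            Outcome = "lpsRead"
	OutcomeNoHotfixAvailable  Outcome = "noHotfixAvailable"
	OutcomeCustomDataFallback Outcome = "customDataFallback"
	OutcomeFailed             Outcome = "failed"
)

// errTransport stands in for any dial/TLS/timeout error (hypothetical).
var errTransport = errors.New("dial tcp: connection refused")

// classify maps the result of the LPS GET to an outcome (hypothetical helper).
// status is ignored when err is non-nil; hasFallback reports whether an
// embedded pointer is present in the node config.
func classify(status int, err error, hasFallback bool) Outcome {
	if err == nil {
		switch status {
		case http.StatusOK:
			return OutcomeLPSRead
		case http.StatusUnauthorized, http.StatusForbidden, http.StatusNotFound:
			// Reachable LPS with nothing published for this node: benign no-op.
			return OutcomeNoHotfixAvailable
		}
	}
	// Transport errors and remaining statuses (assumed to behave like 5xx) fall back.
	if hasFallback {
		return OutcomeCustomDataFallback
	}
	return OutcomeFailed
}

func main() {
	fmt.Println(classify(http.StatusOK, nil, false))
	fmt.Println(classify(http.StatusNotFound, nil, true))
	fmt.Println(classify(0, errTransport, true))
	fmt.Println(classify(0, errTransport, false))
	// Whatever the outcome, the subcommand itself would still exit 0.
}
```

Keeping the 401/403/404 case ahead of the fallback check reflects the description's point that a reachable LPS with no hotfix never triggers the embedded pointer.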

Stacking

2.1a (#8694) has merged, so this PR targets main directly (app.go wiring + checkhotfix.go + checkhotfix_test.go). It is always-on by itself; the feature gate arrives later in 2.1d.

main  (#8694 2.1a base->version hotfix map - merged)
 \- #8696  2.1b  check-hotfix LPS endpoint reader (Go)        <- this PR
     \- #8715  2.1c  wire check-hotfix into wrapper (shell)
         \- #8717  2.1d  enable_provisioning_hotfix contract field + Go self-gate

Open dependency (placeholder route)

The LPS route/schema for the pointer is a planned-maintenance deliverable that is not finalized; the prototype only proved reachability. The route is a named placeholder (lpsHotfixPath = "/v1/anc-hotfix") with a TODO, and the IMDS/LPS client helpers are flagged in-code to be de-duplicated into a shared LPS client when that lands. This PR is held DRAFT until the route finalizes.

Net effect (examples)

LPS serves {"hotfixes":{"202604.01":"202604.01.1"}}; check-hotfix stages the same JSON to /opt/azure/containers/aks-node-controller-hotfix.json.

| Node baked ANC version | LPS read | Outcome | download-hotfix then does |
| --- | --- | --- | --- |
| 202604.01.0 | OK | lpsRead | base 202604.01 -> 202604.01.1, patch 1 > 0, upgrades |
| 202607.15.0 | OK (no matching base) | noHotfixForBase | no pointer for this base, no-op |
| any | 401 / 403 / 404 | noHotfixAvailable | no overlay staged, benign no-op |
| 202604.01.0 | unreachable, embedded hotfixes present | customDataFallback | reads staged fallback pointer, resolves as above |
| 202604.01.0 | unreachable, no fallback | failed (still exit 0) | nothing staged, no-op |
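The resolution these examples describe (look up the node's base, then accept only a strictly higher patch on the same base) can be sketched as follows. This is an illustration under assumptions, not the shared parser itself: `hotfixPointer`, `splitVersion`, and `resolveHotfix` are hypothetical names, and the version format is assumed to be exactly three dot-separated components as in the examples above.

```go
package main

import (
	"encoding/json"
	"fmt"
	"strconv"
	"strings"
)

// hotfixPointer matches the example payload {"hotfixes":{"<base>":"<version>"}}.
type hotfixPointer struct {
	Hotfixes map[string]string `json:"hotfixes"`
}

// splitVersion splits "202604.01.1" into base "202604.01" and patch 1.
// It assumes exactly three dot-separated components.
func splitVersion(v string) (string, int, error) {
	parts := strings.Split(v, ".")
	if len(parts) != 3 {
		return "", 0, fmt.Errorf("version %q: want 3 components, got %d", v, len(parts))
	}
	patch, err := strconv.Atoi(parts[2])
	if err != nil || patch < 0 {
		return "", 0, fmt.Errorf("version %q: invalid patch component", v)
	}
	return parts[0] + "." + parts[1], patch, nil
}

// resolveHotfix returns the target version and true only when the pointer
// names a same-base version whose patch is strictly higher than the baked one.
func resolveHotfix(raw []byte, baked string) (string, bool, error) {
	var p hotfixPointer
	if err := json.Unmarshal(raw, &p); err != nil {
		return "", false, err
	}
	base, bakedPatch, err := splitVersion(baked)
	if err != nil {
		return "", false, err
	}
	target, ok := p.Hotfixes[base]
	if !ok {
		return "", false, nil // no pointer for this base: no-op
	}
	targetBase, targetPatch, err := splitVersion(target)
	if err != nil {
		return "", false, err
	}
	if targetBase != base || targetPatch <= bakedPatch {
		return "", false, nil // patch-only, strictly-higher gating
	}
	return target, true, nil
}

func main() {
	raw := []byte(`{"hotfixes":{"202604.01":"202604.01.1"}}`)
	for _, baked := range []string{"202604.01.0", "202607.15.0", "202604.01.1"} {
		target, ok, err := resolveHotfix(raw, baked)
		fmt.Println(baked, target, ok, err)
	}
}
```

With the example pointer, only the node baked at 202604.01.0 resolves to an upgrade; a node on another base, or one already at patch 1, is a no-op.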

Tests

Network-free unit tests (LPS fetcher + attested-token injected) cover every outcome above, fail-open on parse/transport errors, shared-parser equivalence with download-hotfix, the SNI-pinned TLS client (CA required, no insecure fallback), and the 0644 staged-file mode. All new tests pass; the only go test ./... failures are pre-existing Windows-only environmental ones that pass in Linux CI.
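For readers unfamiliar with the SNI-pinned client mentioned above, here is a minimal sketch of the general technique using only the Go standard library: the TLS `ServerName` is pinned to one host while the TCP dial is forced to another address, and construction fails when no CA is supplied. The function name `newPinnedClient`, the timeouts, and the example hostnames are assumptions for illustration and are not taken from httpclient.go.

```go
package main

import (
	"context"
	"crypto/tls"
	"crypto/x509"
	"errors"
	"fmt"
	"net"
	"net/http"
	"time"
)

// newPinnedClient (hypothetical) builds an HTTPS client that verifies the
// server against caPEM, sends sniHost as the TLS ServerName, and always dials
// dialAddr regardless of the URL host. There is deliberately no insecure fallback.
func newPinnedClient(caPEM []byte, sniHost, dialAddr string) (*http.Client, error) {
	pool := x509.NewCertPool()
	if len(caPEM) == 0 || !pool.AppendCertsFromPEM(caPEM) {
		return nil, errors.New("cluster CA is required; refusing to build an unverified client")
	}
	dialer := &net.Dialer{Timeout: 5 * time.Second} // timeout is an assumed value
	transport := &http.Transport{
		TLSClientConfig: &tls.Config{
			RootCAs:    pool,
			ServerName: sniHost, // SNI and certificate verification use this name
			MinVersion: tls.VersionTLS12,
		},
		// Ignore the address derived from the URL and force the dial target.
		DialContext: func(ctx context.Context, network, _ string) (net.Conn, error) {
			return dialer.DialContext(ctx, network, dialAddr)
		},
	}
	return &http.Client{Transport: transport, Timeout: 10 * time.Second}, nil
}

func main() {
	// An empty CA must be rejected rather than silently skipping verification.
	// Both hostnames below are placeholders.
	_, err := newPinnedClient(nil, "lps.example.invalid", "apiserver.example.invalid:443")
	if err != nil {
		fmt.Println("refused:", err)
	}
}
```

A caller would then issue a GET against the SNI host with the attested token in the Authorization header; the exact route and header format are not finalized in this PR, so they are left out of the sketch.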

@Devinwong Devinwong changed the title feat(anc): provisioning-hotfix M1 - check-hotfix ConfigMap reader (2.1b) feat(anc): add check-hotfix subcommand to read hotfix pointer from ConfigMap Jun 12, 2026
@Devinwong Devinwong force-pushed the devinwong/laughing-pancake branch from 5fff98d to 061ba60 Compare June 15, 2026 21:53
@Devinwong Devinwong force-pushed the devinwong/anc-check-hotfix-configmap branch from 64e782d to ede050a Compare June 15, 2026 21:56
@Devinwong Devinwong marked this pull request as ready for review June 16, 2026 02:27
@Devinwong Devinwong force-pushed the devinwong/anc-check-hotfix-configmap branch 2 times, most recently from 0c90761 to b33ec66 Compare June 16, 2026 18:08
@Devinwong Devinwong force-pushed the devinwong/anc-check-hotfix-configmap branch from b33ec66 to 07b497b Compare June 19, 2026 21:06
@Devinwong Devinwong requested a review from xuexu6666 as a code owner June 19, 2026 21:06
@Devinwong Devinwong changed the title feat(anc): add check-hotfix subcommand to read hotfix pointer from ConfigMap feat(anc): add check-hotfix subcommand to read hotfix pointer from LPS Jun 19, 2026
@Devinwong Devinwong force-pushed the devinwong/anc-check-hotfix-configmap branch from 07b497b to 0d6f945 Compare June 20, 2026 00:25
Copilot AI review requested due to automatic review settings July 1, 2026 00:20

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 3 out of 3 changed files in this pull request and generated 5 comments.

Comment thread aks-node-controller/checkhotfix.go Outdated
Comment thread aks-node-controller/checkhotfix.go Outdated
Comment thread aks-node-controller/checkhotfix.go Outdated
Comment thread aks-node-controller/checkhotfix_test.go Outdated
Comment thread aks-node-controller/checkhotfix_test.go Outdated
Comment thread aks-node-controller/checkhotfix.go
Comment thread aks-node-controller/checkhotfix.go
}
}

// lpsTargetFromNodeConfig reads the apiserver FQDN (the forced dial target) and the cluster

Contributor

Hmm, does this assume we are getting it from AKSNodeConfig? Will it work with the NBC CSE sh file as well?

@Devinwong Devinwong Jul 1, 2026

Collaborator Author

Yes, we are getting it from the AKSNodeConfig JSON. IIUC, AKSNodeConfig is only available on nodes from phase 2.5 onward, so this approach only works starting with phase 2.5. Before that, it is a fail-open no-op.

@Devinwong Devinwong force-pushed the devinwong/anc-check-hotfix-configmap branch from 1dedfdd to 523604d Compare July 1, 2026 16:57
Copilot AI review requested due to automatic review settings July 1, 2026 17:14
@Devinwong Devinwong force-pushed the devinwong/anc-check-hotfix-configmap branch from 523604d to 438f210 Compare July 1, 2026 17:14

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 2 comments.

Comment thread aks-node-controller/httpclient.go Outdated
Comment thread aks-node-controller/app.go
@Devinwong Devinwong force-pushed the devinwong/anc-check-hotfix-configmap branch from 438f210 to 065bc53 Compare July 1, 2026 17:27
Copilot AI review requested due to automatic review settings July 1, 2026 17:38
@Devinwong Devinwong force-pushed the devinwong/anc-check-hotfix-configmap branch from 065bc53 to 4ec7206 Compare July 1, 2026 17:38

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.

@Devinwong Devinwong force-pushed the devinwong/anc-check-hotfix-configmap branch from 4ec7206 to 13b7457 Compare July 1, 2026 18:46
Copilot AI review requested due to automatic review settings July 1, 2026 23:52
@Devinwong Devinwong force-pushed the devinwong/anc-check-hotfix-configmap branch from 13b7457 to b2a3200 Compare July 1, 2026 23:52

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated 1 comment.

Comment thread aks-node-controller/checkhotfix.go
@Devinwong Devinwong force-pushed the devinwong/anc-check-hotfix-configmap branch from b2a3200 to f3e6691 Compare July 2, 2026 16:47
Add a fail-open 'check-hotfix' CLI subcommand that reads the base->hotfix
pointer map from the live-patching-service (LPS) over the IMDS-attested SNI
path that is reachable pre-kubelet, and stages the resolved {hotfixes:{...}}
pointer to the path download-hotfix already reads. download-hotfix keeps its
unchanged patch-only, strictly-higher gating; check-hotfix only fetches and
writes the pointer.

- Raw net/http HTTPS GET (no client-go). TLS ServerName pinned to the LPS
  SNI host while the TCP dial is forced to the apiserver FQDN (curl --resolve
  trick); Authorization is the IMDS attested-data signature; the server cert
  is verified against the cluster CA from the provision-config.
- FQDN + cluster CA come from the AKSNodeConfig ANC already parses (the only
  credential source present pre-provisioning); caSource is logged.
- Shares the hotfixConfig parser/data contract with download-hotfix.
- Always exits 0; emits CheckHotfix telemetry (lpsRead, noHotfixForBase,
  noHotfixAvailable, customDataFallback, failed).
- A reachable LPS with no hotfix published for this node (HTTP 401, 403, 404)
  is a benign no-op (noHotfixAvailable): no overlay is staged and it is never
  classified as a failure. Only transport/5xx failures fall back.
- PoC cold-start fallback reads a lenient top-level hotfixes object from the
  node config when the LPS read fails (TODO: typed contract field).
- Injectable App fields (checkHotfixFetcher, fetchAttestedToken,
  nodeConfigPath) for network-free unit tests.
- The LPS route + response schema are a planned-maintenance deliverable that
  is not finalized; lpsHotfixPath is a clearly-marked placeholder with a TODO.
  The IMDS/LPS client helpers mirror the connectivity prototype and should be
  de-duplicated into a shared LPS client when that lands.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>
Copilot AI review requested due to automatic review settings July 2, 2026 18:25
@Devinwong Devinwong force-pushed the devinwong/anc-check-hotfix-configmap branch from f3e6691 to 8717ee9 Compare July 2, 2026 18:25

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.

Co-authored-by: Copilot <223556219+Copilot@users.noreply.github.com>

Copilot AI left a comment

Contributor

Pull request overview

Copilot reviewed 5 out of 5 changed files in this pull request and generated no new comments.

@aks-node-assistant

Contributor

Failed gate

Run: https://msazure.visualstudio.com/CloudNativeCompute/_build/results?buildId=170741296

Failed job/stage/task: build2204gen2containerd / Test, Scan, and Cleanup (logId 431).

Detective summary

Known VHD scan/CIS-CAT gate failure recurred. vhd-scanning.sh failed twice after CIS-CAT Pro Assessor v4.57.1 completed Assessment 1 with exit value 122; the wrapper then surfaced task exit code 2. The same build stage shows multiple VHD legs failing Test, Scan, and Cleanup with the same exit-code shape.

Likely cause / signature

Likely known CIS-CAT assessor scan tooling failure, not this PR. Signature: AB-GATE-LINUX-VHD-SCAN-CISCAT-EXIT122. Confidence: High.

Strongest alternative: a real Ubuntu image compliance/package regression affecting CIS-CAT. Less likely because this exactly matches the existing repeated exit-122 signature and PR #8696 changes ANC/check-hotfix Go code and tests, not image generation, Packer, scan configuration, CSE, or provisioning scripts.

Recommended action

No PR author action recommended. Node Lifecycle/VHD gate owner should continue repair item #38671557.

Evidence

  • Timeline/build status: build stage failed in VHD Test, Scan, and Cleanup tasks with exit code 2
  • Log: vhd-scanning.sh failed twice; CIS-CAT Pro v4.57.1 reported Assessment 1 Exit Value: 122
  • PR metadata: changes are ANC/check-hotfix Go code and tests only
  • Wiki signature: AB-GATE-LINUX-VHD-SCAN-CISCAT-EXIT122

4 participants