topologymanager: add CPU-attached NUMA filter option by fanzhangio · Pull Request #138172 · kubernetes/kubernetes

fanzhangio · 2026-04-02T09:19:51Z

What type of PR is this?

Bug Fix

/kind bug

What this PR does / why we need it:

This PR adds an opt-in TopologyManager mode for large NUMA systems with many CPU-less NUMA nodes.
On systems such as NVIDIA Grace/Vera-class platforms (typically GB200), firmware can expose a large NUMA ID space even though only a small subset of NUMA nodes actually have CPUs. When kubelet is configured to allow that large NUMA count, topology hint generation can still scale poorly in the device-manager / topology-manager path and block admission before device plugin Allocate() is reached.
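To give a feel for why the NUMA count matters, here is a rough back-of-the-envelope sketch. It assumes hint providers consider every non-empty combination of NUMA nodes, which is an assumption about the hint path and not something this description states. The value 34 matches max-allowable-numa-nodes in the example config in this description; the value 2 is a hypothetical count of CPU-attached nodes chosen only for illustration.

```go
package main

import "fmt"

// nonEmptySubsets returns how many non-empty NUMA-node combinations exist
// for n nodes, i.e. 2^n - 1. n must be below 64 so the result fits in uint64.
func nonEmptySubsets(n uint) uint64 {
	return (uint64(1) << n) - 1
}

func main() {
	// 34 comes from the example kubelet config; 2 is a hypothetical number
	// of CPU-attached NUMA nodes, not a figure taken from the PR.
	fmt.Println("all NUMA nodes:     ", nonEmptySubsets(34))
	fmt.Println("CPU-attached subset:", nonEmptySubsets(2))
}
```

Under that assumption the candidate space drops from 17179869183 combinations to 3, which is the kind of bound the new option is meant to provide.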

This PR adds a new TopologyManager policy option:

restrict-to-cpu-numa-nodes

When enabled, TopologyManager computes an effective NUMA-node set containing only NUMA nodes with CPUs attached and propagates that same view to all topology-aware hint providers.

Specifically:

TopologyManager uses the filtered NUMA set for policy creation and NUMA accounting.
CPUManager generates hints only across that filtered set.
MemoryManager generates hints only across that filtered set.
DeviceManager projects device topology onto that same filtered set using NUMA distance information so hint generation and aligned allocation stay consistent.
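The filtering step described above can be sketched roughly as follows. This is a minimal illustration, not the code in this PR: the numaNode type and the cpuAttachedNUMANodes helper are hypothetical stand-ins for whatever machine topology data kubelet actually consumes.

```go
package main

import (
	"fmt"
	"sort"
)

// numaNode is a hypothetical, simplified view of one NUMA node.
// It is not a real kubelet or cAdvisor type.
type numaNode struct {
	ID   int
	CPUs []int // logical CPU IDs attached to the node; empty for CPU-less nodes
}

// cpuAttachedNUMANodes returns the sorted IDs of nodes with at least one CPU.
func cpuAttachedNUMANodes(nodes []numaNode) []int {
	ids := make([]int, 0, len(nodes))
	for _, n := range nodes {
		if len(n.CPUs) > 0 {
			ids = append(ids, n.ID)
		}
	}
	sort.Ints(ids)
	return ids
}

func main() {
	nodes := []numaNode{
		{ID: 0, CPUs: []int{0, 1, 2, 3}},
		{ID: 1, CPUs: []int{4, 5, 6, 7}},
		{ID: 2}, // CPU-less node
		{ID: 3}, // CPU-less node
	}
	fmt.Println(cpuAttachedNUMANodes(nodes)) // prints [0 1]
}
```

Computing this set once and sharing it with every hint provider is what keeps CPUManager, MemoryManager, and DeviceManager working from the same view.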

This is intended to address the large-NUMA root cause in #135541.
This PR is separate from the reserved-memory consistency fix. That is a different bug and should be reviewed independently.

Which issue(s) this PR is related to:

Fixes #135541
KEP: KEP-5726

Special notes for your reviewer:

This reworks the earlier approach from Support TopologyManager in Large NUMA System (NVIDIA GB200) #135581 around a TopologyManager option instead of provider-local heuristics.

Behavior is unchanged unless restrict-to-cpu-numa-nodes is explicitly enabled.

Example kubelet config for affected systems:

topologyManagerPolicy: best-effort
topologyManagerScope: pod
topologyManagerPolicyOptions:
  prefer-closest-numa-nodes: "true"
  max-allowable-numa-nodes: "34"
  restrict-to-cpu-numa-nodes: "true"
cpuManagerPolicy: static
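As an illustration of the opt-in behavior, the sketch below reads the option from a string map shaped like topologyManagerPolicyOptions. The parseRestrictOption helper is hypothetical and is not claimed to be how kubelet parses policy options; it only shows that an absent option leaves the default (disabled) and that values are quoted booleans.

```go
package main

import (
	"fmt"
	"strconv"
)

// Option name as given in this PR description.
const restrictToCPUNUMANodes = "restrict-to-cpu-numa-nodes"

// parseRestrictOption is a hypothetical helper. It returns false when the
// option is absent, so behavior stays unchanged unless it is explicitly set.
func parseRestrictOption(opts map[string]string) (bool, error) {
	raw, ok := opts[restrictToCPUNUMANodes]
	if !ok {
		return false, nil
	}
	enabled, err := strconv.ParseBool(raw)
	if err != nil {
		return false, fmt.Errorf("bad value %q for %s: %w", raw, restrictToCPUNUMANodes, err)
	}
	return enabled, nil
}

func main() {
	opts := map[string]string{
		"prefer-closest-numa-nodes":  "true",
		"max-allowable-numa-nodes":   "34",
		"restrict-to-cpu-numa-nodes": "true",
	}
	enabled, err := parseRestrictOption(opts)
	fmt.Println(enabled, err)
}
```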

This PR reuses some of the plumbing carried by a separate PR, the reserved-memory consistency bug fix (fix: exclude fully reserved NUMA nodes consistently from topology hints, not just memory manager decisions #137280), e.g.:

topologymanager.Store.GetNUMANodeIDs()
storing/exporting filtered NUMA IDs from TopologyManager
CPU/Memory hint paths consuming that filtered NUMA set
DeviceManager reading the filtered NUMA set from TopologyManager
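The shape of that shared plumbing might look roughly like the sketch below. Only the method name GetNUMANodeIDs comes from this description; the []int return type, the numaStore interface, the staticStore type, and the restrictToStore helper are assumptions made for illustration.

```go
package main

import "fmt"

// numaStore mirrors the idea of topologymanager.Store.GetNUMANodeIDs().
// The return type is an assumption; the PR does not state the signature here.
type numaStore interface {
	GetNUMANodeIDs() []int
}

// staticStore is a hypothetical store holding a fixed, already filtered set.
type staticStore struct{ ids []int }

func (s staticStore) GetNUMANodeIDs() []int { return s.ids }

// restrictToStore keeps only the candidate NUMA IDs present in the store's
// set, preserving the order of candidates.
func restrictToStore(store numaStore, candidates []int) []int {
	allowed := make(map[int]struct{})
	for _, id := range store.GetNUMANodeIDs() {
		allowed[id] = struct{}{}
	}
	out := make([]int, 0, len(candidates))
	for _, id := range candidates {
		if _, ok := allowed[id]; ok {
			out = append(out, id)
		}
	}
	return out
}

func main() {
	store := staticStore{ids: []int{0, 1}}
	// A hint provider that sees four NUMA nodes only iterates over two.
	fmt.Println(restrictToStore(store, []int{0, 1, 2, 3})) // prints [0 1]
}
```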

As a result, this PR currently duplicates some common infrastructure that the reserved-memory PR also introduces. That is expected from an implementation standpoint.
If the reserved-memory PR merges first, this PR will be rebased on top of it and the duplicated plumbing dropped.

Note for reviewers: the overlap is only shared plumbing; the reserved-memory semantics are intentionally not part of this PR.

Does this PR introduce a user-facing change?

Added an alpha TopologyManager policy option, restrict-to-cpu-numa-nodes, to limit topology hint generation to NUMA nodes with CPUs attached. This helps kubelet handle large CPU-less NUMA topologies such as NVIDIA Grace/Vera-class systems.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

NONE

Large NUMA systems such as NVIDIA Grace-class (GB200/300) can expose many NUMA nodes to the OS even though only a small subset of those nodes have CPUs attached. When TopologyManager is configured to allow the full NUMA count, topology hint generation can still scale poorly across that full NUMA universe and block pod admission before device Allocate() is reached.

Add a new TopologyManager policy option, restrict-to-cpu-numa-nodes, to make this behavior explicit and opt-in. When enabled, TopologyManager computes an effective NUMA-node set from NUMA nodes with CPUs attached and propagates that same view to topology-aware hint providers.

With this change:

- TopologyManager uses the filtered NUMA-node set for policy creation and NUMA accounting.
- CPUManager and MemoryManager generate hints only across that filtered NUMA-node set.
- DeviceManager projects device topology onto the same filtered set using NUMA distances so hint generation and aligned allocation remain consistent.

This bounds hint generation on CPU-less NUMA topologies without changing default behavior.
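One way to picture the DeviceManager projection is to map a device's NUMA node to the closest node in the filtered set. The sketch below is an assumption about how such a projection could work, not the algorithm in this PR: the nearestAllowed helper, the tie-breaking rule, and the distance values are all hypothetical.

```go
package main

import "fmt"

// nearestAllowed returns the node in allowed with the smallest distance from
// node, breaking ties by lowest ID. It returns -1 if allowed is empty.
// dist[i][j] is the NUMA distance from node i to node j; the caller must
// ensure node and every allowed ID are valid indices into dist.
func nearestAllowed(node int, allowed []int, dist [][]int) int {
	best, bestDist := -1, 0
	for _, cand := range allowed {
		d := dist[node][cand]
		if best == -1 || d < bestDist || (d == bestDist && cand < best) {
			best, bestDist = cand, d
		}
	}
	return best
}

func main() {
	// Hypothetical 4-node distance matrix: nodes 0 and 1 have CPUs,
	// nodes 2 and 3 are CPU-less and each sits close to one CPU node.
	dist := [][]int{
		{10, 40, 80, 120},
		{40, 10, 120, 80},
		{80, 120, 10, 255},
		{120, 80, 255, 10},
	}
	allowed := []int{0, 1}
	fmt.Println(nearestAllowed(2, allowed, dist)) // prints 0
	fmt.Println(nearestAllowed(3, allowed, dist)) // prints 1
}
```

With a mapping like this, a device reported on a CPU-less node still contributes hints on a node that CPUManager and MemoryManager also consider.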

k8s-ci-robot · 2026-04-02T09:19:54Z

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

k8s-ci-robot · 2026-04-02T09:20:00Z

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.


k8s-ci-robot · 2026-04-02T09:20:01Z

Hi @fanzhangio. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Tip

We noticed you've done this a few times! Consider joining the org to skip this step and gain /lgtm and other bot rights. We recommend asking approvers on your previous PRs to sponsor you.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.


k8s-ci-robot · 2026-04-02T09:20:13Z

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: fanzhangio
Once this PR has been reviewed and has the lgtm label, please assign ffromani for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details

Needs approval from an approver in each of these files:

pkg/kubelet/cm/OWNERS

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

fanzhangio · 2026-04-02T14:58:43Z

@dims @klueska I opened this PR to introduce an alpha TopologyManager policy option that helps kubelet handle GB/VR systems that expose many CPU-less NUMA nodes. This is the code implementation of KEP-5726.

dims · 2026-04-02T15:06:59Z

/ok-to-test

dims · 2026-04-02T15:07:59Z

/assign @ffromani @klueska

fanzhangio · 2026-04-02T15:40:00Z

/test pull-kubernetes-unit

fanzhangio · 2026-04-02T15:43:25Z

/retest

klueska · 2026-04-02T15:46:19Z

~~This needs a KEP if you are adding a new option to the user facing API.~~

~~I see a KEP issue (created by me), but not an actual KEP (unless I'm missing something).~~

Edit: I see a KEP was opened 5 hours ago. Please update the description with the proper links.

fanzhangio · 2026-04-02T16:14:28Z

Thanks @klueska, I've updated the KEP-5726 PR link.

I also updated the PR description to note that this PR reuses some common infrastructure that a separate PR, the reserved-memory consistency bug fix #137280, also introduces.

The overlap is only shared plumbing; the reserved-memory semantics are intentionally not part of this PR.
If the reserved-memory PR is merged first, we can rebase this PR on top of it and drop the duplicated plumbing.

ffromani · 2026-04-03T07:46:27Z

Thanks. This is 1.37 material, and the implementation may require rework depending on how the KEP review unfolds.

k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 2, 2026

k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 2, 2026

k8s-ci-robot added needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. sig/node Categorizes an issue or PR as relevant to SIG Node. labels Apr 2, 2026

github-project-automation bot added this to SIG Node: code and documentation PRs Apr 2, 2026

k8s-ci-robot removed the do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Apr 2, 2026

github-project-automation bot moved this to Triage in SIG Node: code and documentation PRs Apr 2, 2026

k8s-ci-robot requested review from dchen1107 and klueska April 2, 2026 09:20

k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 2, 2026

k8s-ci-robot assigned ffromani and klueska Apr 2, 2026

fanzhangio mentioned this pull request Apr 2, 2026

Support TopologyManager in Large NUMA System (NVIDIA GB200) #135581

Closed

fanzhangio wants to merge 1 commit into kubernetes:master from fanzhangio:gb200-topologymanager-root-fix
