topologymanager: add CPU-attached NUMA filter option#138172
fanzhangio wants to merge 1 commit into kubernetes:master
Conversation
Large NUMA systems such as NVIDIA Grace-class (GB200/300) can expose many NUMA nodes to the OS even though only a small subset of those nodes have CPUs attached. When TopologyManager is configured to allow the full NUMA count, topology hint generation can still scale poorly across that full NUMA universe and block pod admission before device Allocate() is reached.

Add a new TopologyManager policy option, restrict-to-cpu-numa-nodes, to make this behavior explicit and opt-in. When enabled, TopologyManager computes an effective NUMA-node set from NUMA nodes with CPUs attached and propagates that same view to topology-aware hint providers.

With this change:
- TopologyManager uses the filtered NUMA-node set for policy creation and NUMA accounting.
- CPUManager and MemoryManager generate hints only across that filtered NUMA-node set.
- DeviceManager projects device topology onto the same filtered set using NUMA distances so hint generation and aligned allocation remain consistent.

This bounds hint generation on CPU-less NUMA topologies without changing default behavior.
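As a rough illustration of the filtering step described above, the effective NUMA-node set can be computed by keeping only nodes with at least one CPU attached. The `numaNode` type and `cpuAttachedNUMANodes` helper below are hypothetical simplifications, not the actual kubelet code:

```go
package main

import "fmt"

// numaNode models the machine-topology fields relevant here
// (hypothetical simplification of a cadvisor-style node record).
type numaNode struct {
	ID   int
	CPUs []int
}

// cpuAttachedNUMANodes returns the IDs of NUMA nodes that have at
// least one CPU attached: the "effective NUMA-node set" described
// above. Hypothetical helper for illustration only.
func cpuAttachedNUMANodes(nodes []numaNode) []int {
	var out []int
	for _, n := range nodes {
		if len(n.CPUs) > 0 {
			out = append(out, n.ID)
		}
	}
	return out
}

func main() {
	// A Grace-like layout: 2 CPU-attached nodes, many CPU-less ones.
	nodes := []numaNode{
		{ID: 0, CPUs: []int{0, 1, 2, 3}},
		{ID: 1, CPUs: []int{4, 5, 6, 7}},
		{ID: 2}, {ID: 3}, {ID: 4}, {ID: 5},
	}
	fmt.Println(cpuAttachedNUMANodes(nodes)) // prints [0 1]
}
```

Hint generation then enumerates affinity masks only over this filtered set, which is what bounds the combinatorial cost on CPU-less topologies.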
Adding the do-not-merge/release-note-label-needed label because no release-note block was detected; please follow our release note process to remove it.
This issue is currently awaiting triage. If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label.
Hi @fanzhangio. Thanks for your PR. I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test. Once the patch is verified, the new status will be reflected by the ok-to-test label.
[APPROVALNOTIFIER] This PR is NOT APPROVED. This pull-request has been approved by: fanzhangio. The full list of commands accepted by this bot can be found here.
/ok-to-test |
/test pull-kubernetes-unit |
/retest |
Edit: I see a KEP was opened 5 hours ago. Please update the description with the proper links. |
Thanks @klueska, I've updated the KEP-5726 PR link. I also updated the PR description to note that this PR reuses some common infrastructure that a separate reserved-memory PR also introduced. The overlap is only shared plumbing; the reserved-memory semantics are intentionally not part of this PR.
Thanks. This is 1.37 material, and the implementation may require rework depending on how the KEP review unfolds. |
What type of PR is this?
Bug Fix
/kind bug
What this PR does / why we need it:
This PR adds an opt-in TopologyManager mode for large NUMA systems with many CPU-less NUMA nodes.
On systems such as NVIDIA Grace/Vera-class platforms (typically GB200), firmware can expose a large NUMA ID space even though only a small subset of NUMA nodes actually have CPUs. When kubelet is configured to allow that large NUMA count, topology hint generation can still scale poorly in the device-manager / topology-manager path and block admission before the device plugin's Allocate() is reached.

This PR adds a new TopologyManager policy option: restrict-to-cpu-numa-nodes.

When enabled, TopologyManager computes an effective NUMA-node set containing only NUMA nodes with CPUs attached and propagates that same view to all topology-aware hint providers.
Specifically:
- TopologyManager uses the filtered NUMA-node set for policy creation and NUMA accounting.
- CPUManager and MemoryManager generate hints only across that filtered NUMA-node set.
- DeviceManager projects device topology onto the same filtered set using NUMA distances so hint generation and aligned allocation remain consistent.

This is intended to address the large-NUMA root cause in #135541.
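The DeviceManager projection mentioned above can be sketched as mapping a device's (possibly CPU-less) NUMA node to the closest CPU-attached node by NUMA distance. The `nearestCPUNode` helper below is a hypothetical sketch, assuming a distance matrix indexed by node ID where lower means closer; it is not the actual DeviceManager code:

```go
package main

import "fmt"

// nearestCPUNode maps a NUMA node (possibly CPU-less) to the closest
// CPU-attached node using the NUMA distance matrix (lower = closer).
// Hypothetical sketch for illustration only.
func nearestCPUNode(node int, cpuNodes []int, dist [][]int) int {
	best := cpuNodes[0]
	for _, c := range cpuNodes[1:] {
		if dist[node][c] < dist[node][best] {
			best = c
		}
	}
	return best
}

func main() {
	// 4-node example: nodes 0 and 1 have CPUs, nodes 2 and 3 do not.
	dist := [][]int{
		{10, 20, 30, 40},
		{20, 10, 40, 30},
		{30, 40, 10, 20},
		{40, 30, 20, 10},
	}
	cpuNodes := []int{0, 1}
	fmt.Println(nearestCPUNode(2, cpuNodes, dist)) // node 2 projects onto node 0
	fmt.Println(nearestCPUNode(3, cpuNodes, dist)) // node 3 projects onto node 1
}
```

Projecting this way keeps device hints expressible in the same filtered NUMA-node set that CPUManager and MemoryManager use, so aligned allocation stays consistent.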
This PR is separate from the reserved-memory consistency fix. That is a different bug and should be reviewed independently.
Which issue(s) this PR is related to:
Fixes #135541
KEP: KEP-5726
Special notes for your reviewer:
Behavior is unchanged unless restrict-to-cpu-numa-nodes is explicitly enabled.

Example kubelet config for affected systems:
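A minimal sketch of a kubelet configuration enabling the option. Field names follow the standard KubeletConfiguration; the TopologyManagerPolicyAlphaOptions feature gate is an assumption based on how existing alpha TopologyManager policy options are gated:

```yaml
apiVersion: kubelet.config.k8s.io/v1beta1
kind: KubeletConfiguration
# Assumed gating: alpha policy options must be explicitly enabled.
featureGates:
  TopologyManagerPolicyAlphaOptions: true
topologyManagerPolicy: single-numa-node
topologyManagerScope: container
topologyManagerPolicyOptions:
  # Limit hint generation to NUMA nodes with CPUs attached.
  restrict-to-cpu-numa-nodes: "true"
```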
Related reserved-memory consistency bug fix: "fix: exclude fully reserved NUMA nodes consistently from topology hints, not just memory manager decisions" (#137280).

Currently this PR duplicates some common infrastructure that the reserved-memory PR also introduced. That is expected from an implementation standpoint.
If the reserved-memory PR is likely to merge first, this PR will be rebased on top of it and the duplicated plumbing dropped.

To flag for reviewers: the overlap is only shared plumbing, and the reserved-memory semantics are intentionally not part of this PR.
Does this PR introduce a user-facing change?
Added an alpha TopologyManager policy option,
restrict-to-cpu-numa-nodes, to limit topology hint generation to NUMA nodes with CPUs attached. This helps kubelet handle large CPU-less NUMA topologies such as NVIDIA Grace/Vera-class systems.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.: