topologymanager: add CPU-attached NUMA filter option #138172

Open
fanzhangio wants to merge 1 commit into kubernetes:master from
fanzhangio:gb200-topologymanager-root-fix

@fanzhangio
Contributor

@fanzhangio fanzhangio commented Apr 2, 2026

What type of PR is this?

Bug Fix

/kind bug

What this PR does / why we need it:

This PR adds an opt-in TopologyManager mode for large NUMA systems with many CPU-less NUMA nodes.
On systems such as NVIDIA Grace/Vera-class platforms (typically GB200), firmware can expose a large NUMA ID space even though only a small subset of NUMA nodes actually have CPUs. When kubelet is configured to allow that large NUMA count, topology hint generation can still scale poorly in the device-manager / topology-manager path and block admission before device plugin Allocate() is reached.
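To make the scaling problem concrete, here is a rough back-of-the-envelope sketch. It assumes hint providers enumerate every non-empty subset of NUMA nodes as a candidate affinity mask; that enumeration model, the helper name `candidateMasks`, and the figure of 2 CPU-attached nodes are illustrative assumptions, not taken from this PR. The 34 comes from the example config and the linked issue.

```go
package main

import "fmt"

// candidateMasks returns the number of non-empty subsets of n NUMA nodes,
// i.e. 2^n - 1. Hypothetical helper for illustration; assumes n < 64.
func candidateMasks(n uint) uint64 {
	return (uint64(1) << n) - 1
}

func main() {
	// 34 NUMA nodes exposed by firmware (value from the example config).
	fmt.Println(candidateMasks(34)) // 17179869183
	// 2 CPU-attached NUMA nodes (an assumed count for illustration).
	fmt.Println(candidateMasks(2)) // 3
}
```

Under that assumption, restricting the NUMA set before enumeration shrinks the search space by many orders of magnitude, which is the effect the new option is after.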

This PR adds a new TopologyManager policy option:

  • restrict-to-cpu-numa-nodes

When enabled, TopologyManager computes an effective NUMA-node set containing only NUMA nodes with CPUs attached and propagates that same view to all topology-aware hint providers.

Specifically:

  • TopologyManager uses the filtered NUMA set for policy creation and NUMA accounting.
  • CPUManager generates hints only across that filtered set.
  • MemoryManager generates hints only across that filtered set.
  • DeviceManager projects device topology onto that same filtered set using NUMA distance information so hint generation and aligned allocation stay consistent.
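The filtering step described above can be sketched roughly as follows. This is a minimal illustration, not the PR's code: the function name `cpuAttachedNUMANodes` and the `map[int][]int` input (NUMA node ID to CPU IDs) are hypothetical stand-ins for whatever topology data kubelet actually uses.

```go
package main

import (
	"fmt"
	"sort"
)

// cpuAttachedNUMANodes returns the sorted IDs of NUMA nodes that have at
// least one CPU. cpusByNode is a hypothetical input: NUMA node ID -> CPU IDs.
func cpuAttachedNUMANodes(cpusByNode map[int][]int) []int {
	nodes := make([]int, 0, len(cpusByNode))
	for node, cpus := range cpusByNode {
		if len(cpus) > 0 {
			nodes = append(nodes, node)
		}
	}
	// Map iteration order is random in Go, so sort for a stable result.
	sort.Ints(nodes)
	return nodes
}

func main() {
	topology := map[int][]int{
		0: {0, 1, 2, 3},
		1: {4, 5, 6, 7},
		2: {}, // CPU-less node (for example, device memory)
		3: {}, // CPU-less node
	}
	fmt.Println(cpuAttachedNUMANodes(topology)) // [0 1]
}
```

Computing the set once and handing the same result to every provider is what keeps the CPU, memory, and device hints comparable.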

This is intended to address the large-NUMA root cause in #135541.
This PR is separate from the reserved-memory consistency fix. That is a different bug and should be reviewed independently.

Which issue(s) this PR is related to:

Fixes #135541
KEP: KEP-5726

Special notes for your reviewer:

  1. This reworks the earlier approach from #135581 (Support TopologyManager in Large NUMA System (NVIDIA GB200)) around a TopologyManager option instead of provider-local heuristics.

Behavior is unchanged unless restrict-to-cpu-numa-nodes is explicitly enabled.

Example kubelet config for affected systems:

topologyManagerPolicy: best-effort
topologyManagerScope: pod
topologyManagerPolicyOptions:
  prefer-closest-numa-nodes: "true"
  max-allowable-numa-nodes: "34"
  restrict-to-cpu-numa-nodes: "true"
cpuManagerPolicy: static
  2. This PR reuses some of the plumbing carried by a separate PR, the reserved-memory consistency bug fix (fix: exclude fully reserved NUMA nodes consistently from topology hints, not just memory manager decisions #137280), e.g.:
  • topologymanager.Store.GetNUMANodeIDs()
  • storing/exporting filtered NUMA IDs from TopologyManager
  • CPU/Memory hint paths consuming that filtered NUMA set
  • DeviceManager reading the filtered NUMA set from TopologyManager
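The shared plumbing listed above amounts to one source of truth for the effective NUMA set that every hint provider reads. A minimal sketch of that shape follows. Only the method name `GetNUMANodeIDs` comes from this PR; its return type, and the `numaStore`, `filteredStore`, and `hintProvider` types, are assumptions made up for illustration.

```go
package main

import "fmt"

// numaStore is a hypothetical stand-in for the TopologyManager store.
// The []int return type is an assumption, not the PR's actual signature.
type numaStore interface {
	GetNUMANodeIDs() []int
}

// filteredStore holds the effective (already filtered) NUMA node IDs.
type filteredStore struct {
	ids []int
}

// GetNUMANodeIDs returns a copy so callers cannot mutate the shared set.
func (s filteredStore) GetNUMANodeIDs() []int {
	return append([]int(nil), s.ids...)
}

// hintProvider is a hypothetical provider that reads the NUMA set from the
// store instead of discovering NUMA nodes on its own.
type hintProvider struct {
	name  string
	store numaStore
}

func (p hintProvider) candidateNodes() []int {
	return p.store.GetNUMANodeIDs()
}

func main() {
	store := filteredStore{ids: []int{0, 1}}
	for _, name := range []string{"cpu", "memory", "device"} {
		p := hintProvider{name: name, store: store}
		fmt.Println(p.name, p.candidateNodes())
	}
}
```

Because every provider goes through the same accessor, a change to the filtering rule only has to be made in one place.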

As a result, this PR currently duplicates some common infrastructure that the reserved-memory PR also introduces. That is expected from an implementation standpoint.
If the reserved-memory PR merges first, this PR will be rebased on top of it and the duplicated plumbing dropped.

Note for reviewers: the overlap is only shared plumbing; the reserved-memory semantics are intentionally not part of this PR.

Does this PR introduce a user-facing change?

Added an alpha TopologyManager policy option, restrict-to-cpu-numa-nodes, to limit topology hint generation to NUMA nodes with CPUs attached. This helps kubelet handle large CPU-less NUMA topologies such as NVIDIA Grace/Vera-class systems.

Additional documentation e.g., KEPs (Kubernetes Enhancement Proposals), usage docs, etc.:

NONE

Large NUMA systems such as NVIDIA Grace-class (GB200/300) can expose
many NUMA nodes to the OS even though only a small subset of those nodes
have CPUs attached. When TopologyManager is configured to allow the full
NUMA count, topology hint generation can still scale poorly across that
full NUMA universe and block pod admission before device Allocate() is
reached.

Add a new TopologyManager policy option, restrict-to-cpu-numa-nodes, to
make this behavior explicit and opt-in.
When enabled, TopologyManager computes an effective NUMA-node set from
NUMA nodes with CPUs attached and propagates that same view to
topology-aware hint providers.

With this change:
- TopologyManager uses the filtered NUMA-node set for policy creation
  and NUMA accounting.
- CPUManager and MemoryManager generate hints only across that filtered
  NUMA-node set.
- DeviceManager projects device topology onto the same filtered set
  using NUMA distances so hint generation and aligned allocation remain
  consistent.
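One way to picture the DeviceManager projection is a nearest-node lookup over a NUMA distance matrix. The sketch below assumes a device on a CPU-less node is mapped to the closest CPU-attached node; that nearest-node rule, the function name `projectToCPUNode`, and the sample distances are illustrative assumptions, since the description only says NUMA distances are used.

```go
package main

import "fmt"

// projectToCPUNode maps deviceNode onto the nearest node in cpuNodes using
// distance[from][to]. Ties keep the node that appears first in cpuNodes.
// It returns false if deviceNode is out of range or no candidate is usable.
func projectToCPUNode(deviceNode int, cpuNodes []int, distance [][]int) (int, bool) {
	if deviceNode < 0 || deviceNode >= len(distance) {
		return 0, false
	}
	best, bestDist, found := 0, 0, false
	for _, n := range cpuNodes {
		if n < 0 || n >= len(distance[deviceNode]) {
			continue // skip candidates the matrix does not cover
		}
		d := distance[deviceNode][n]
		if !found || d < bestDist {
			best, bestDist, found = n, d, true
		}
	}
	return best, found
}

func main() {
	// Made-up distances: nodes 0 and 1 have CPUs, nodes 2 and 3 do not.
	distance := [][]int{
		{10, 40, 80, 120},
		{40, 10, 120, 80},
		{80, 120, 10, 255},
		{120, 80, 255, 10},
	}
	cpuNodes := []int{0, 1}
	for _, dev := range []int{2, 3} {
		node, ok := projectToCPUNode(dev, cpuNodes, distance)
		fmt.Println(dev, "->", node, ok)
	}
}
```

With these sample distances, a device on node 2 projects to node 0 and a device on node 3 projects to node 1, so device hints land in the same NUMA set that the CPU and memory hints use.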

This bounds hint generation on CPU-less NUMA topologies without
changing default behavior.
@k8s-ci-robot k8s-ci-robot added the size/L Denotes a PR that changes 100-499 lines, ignoring generated files. label Apr 2, 2026
@k8s-ci-robot
Contributor

Adding the "do-not-merge/release-note-label-needed" label because no release-note block was detected, please follow our release note process to remove it.

Details

Instructions for interacting with me using PR comments are available here. If you have questions or suggestions related to my behavior, please file an issue against the kubernetes-sigs/prow repository.

@k8s-ci-robot k8s-ci-robot added kind/bug Categorizes issue or PR as related to a bug. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. labels Apr 2, 2026
@k8s-ci-robot
Contributor

This issue is currently awaiting triage.

If a SIG or subproject determines this is a relevant issue, they will accept it by applying the triage/accepted label and provide further guidance.

The triage/accepted label can be added by org members by writing /triage accepted in a comment.

@k8s-ci-robot k8s-ci-robot added the needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. label Apr 2, 2026
@k8s-ci-robot
Contributor

Hi @fanzhangio. Thanks for your PR.

I'm waiting for a kubernetes member to verify that this patch is reasonable to test. If it is, they should reply with /ok-to-test on its own line. Until that is done, I will not automatically test new commits in this PR, but the usual testing commands by org members will still work.

Tip

We noticed you've done this a few times! Consider joining the org to skip this step and gain /lgtm and other bot rights. We recommend asking approvers on your previous PRs to sponsor you.

Once the patch is verified, the new status will be reflected by the ok-to-test label.

I understand the commands that are listed here.

@k8s-ci-robot k8s-ci-robot added needs-priority Indicates a PR lacks a `priority/foo` label and requires one. area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. sig/node Categorizes an issue or PR as relevant to SIG Node. labels Apr 2, 2026
@k8s-ci-robot k8s-ci-robot removed the do-not-merge/needs-sig Indicates an issue or PR lacks a `sig/foo` label and requires one. label Apr 2, 2026
@k8s-ci-robot
Contributor

[APPROVALNOTIFIER] This PR is NOT APPROVED

This pull-request has been approved by: fanzhangio
Once this PR has been reviewed and has the lgtm label, please assign ffromani for approval. For more information see the Code Review Process.

The full list of commands accepted by this bot can be found here.

Details Needs approval from an approver in each of these files:

Approvers can indicate their approval by writing /approve in a comment
Approvers can cancel approval by writing /approve cancel in a comment

@fanzhangio
Contributor Author

@dims @klueska I opened this PR to introduce an alpha TopologyManager policy option that helps kubelet handle GB/VR systems that expose many CPU-less NUMA nodes. This is the code implementation of KEP-5726.

@dims
Member

dims commented Apr 2, 2026

/ok-to-test

@k8s-ci-robot k8s-ci-robot added ok-to-test Indicates a non-member PR verified by an org member that is safe to test. and removed needs-ok-to-test Indicates a PR that requires an org member to verify it is safe to test. labels Apr 2, 2026
@dims
Member

dims commented Apr 2, 2026

/assign @ffromani @klueska

@fanzhangio
Contributor Author

/test pull-kubernetes-unit

@fanzhangio
Contributor Author

/retest

@klueska
Contributor

klueska commented Apr 2, 2026

This needs a KEP if you are adding a new option to the user facing API.

I see a KEP issue (created by me), but not an actual KEP (unless I'm missing something).

Edit: I see a KEP was opened 5 hours ago. Please update the description with the proper links.

@fanzhangio
Contributor Author

fanzhangio commented Apr 2, 2026

Thanks @klueska, I've updated the KEP-5726 PR link.

I also updated the PR description to note that this PR reuses some common infrastructure that a separate PR, the reserved-memory consistency bug fix #137280, also introduces.

The overlap is only shared plumbing; the reserved-memory semantics are intentionally not part of this PR.
If the reserved-memory PR is merged first, we can rebase this PR on top of it and drop the duplicated plumbing.

@ffromani
Contributor

ffromani commented Apr 3, 2026

Thanks. This is 1.37 material, and the implementation may require rework depending on how the KEP review unfolds.

Labels

area/kubelet cncf-cla: yes Indicates the PR's author has signed the CNCF CLA. do-not-merge/release-note-label-needed Indicates that a PR should not merge because it's missing one of the release note labels. kind/bug Categorizes issue or PR as related to a bug. needs-priority Indicates a PR lacks a `priority/foo` label and requires one. needs-triage Indicates an issue or PR lacks a `triage/foo` label and requires one. ok-to-test Indicates a non-member PR verified by an org member that is safe to test. sig/node Categorizes an issue or PR as relevant to SIG Node. size/L Denotes a PR that changes 100-499 lines, ignoring generated files.

Development

Successfully merging this pull request may close these issues.

kubelet TopologyManager Stalls on Large NUMA Systems (e.g. Nvidia GB200 Compute Tray with 34 NUMA nodes)

5 participants