Commit 855131c

KEP-5726: TopologyManager CPU-attached NUMA filter option
Signed-off-by: Fan Zhang <fanzhang@nvidia.com>
1 parent c9bb5d4 commit 855131c


2 files changed: +176 -0 lines changed

Lines changed: 147 additions & 0 deletions
@@ -0,0 +1,147 @@
# KEP-5726: TopologyManager CPU-attached NUMA filter option

## Summary

Add an alpha TopologyManager policy option, `restrict-to-cpu-numa-nodes`, to limit topology hint generation to NUMA nodes with CPUs attached.

## Motivation

Some large NUMA systems (typically coherent systems such as NVIDIA Grace Blackwell- or Vera Rubin-class platforms) expose many NUMA nodes to the OS even though only a small subset have CPUs attached. On systems such as NVIDIA GB200/GB300 platforms, kubelet topology hint generation can scale poorly when it considers the full NUMA-node set.

This proposal adds an explicit, opt-in TopologyManager behavior for those systems.

Related context:
- Root issue: `kubernetes/kubernetes#135541`
- Related broader discussion: `kubernetes/enhancements#5726`
- This reworks and supersedes `kubernetes/kubernetes#135581`

### Goals

- Provide an opt-in way to bound topology hint generation on large NUMA systems with many CPU-less NUMA nodes.
- Keep the effective NUMA-node set consistent across TopologyManager, CPUManager, MemoryManager, and DeviceManager.
- Preserve default behavior when the option is not enabled.

### Non-Goals

- Redesign NUMA topology discovery in cadvisor or kubelet.
- Change the semantics of `--reserved-memory`.
- Change default TopologyManager behavior for existing users.

## Proposal

Introduce a new TopologyManager alpha policy option:

- `restrict-to-cpu-numa-nodes`

When enabled:
- TopologyManager computes an effective NUMA-node set from NUMA nodes with CPUs attached.
- CPUManager generates hints only across that effective NUMA-node set.
- MemoryManager generates hints only across that set.
- DeviceManager projects device NUMA topology onto that set using NUMA distance information.

When disabled:
- Existing behavior is unchanged.

Example:

```yaml
topologyManagerPolicy: best-effort
topologyManagerScope: pod
topologyManagerPolicyOptions:
  max-allowable-numa-nodes: "34"
  prefer-closest-numa-nodes: "true"
  restrict-to-cpu-numa-nodes: "true"
```

## User Stories

- As an operator of GB200 / Vera-class systems, I want kubelet topology hint generation to avoid exploring CPU-less NUMA nodes so kubelet can admit topology-aware pods reliably.
- As a Kubernetes user, I want this behavior to be opt-in so existing systems are unchanged unless I explicitly enable it.

## Design Details

### API

New TopologyManager policy option:
- `restrict-to-cpu-numa-nodes: "true|false"`

This is an alpha policy option under the existing `TopologyManagerPolicyAlphaOptions` feature gate.

Feature level:
- Alpha
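
As a rough illustration of how a `"true|false"` option value could be read from the policy options map, here is a minimal Go sketch. The type and function names (`policyOptions`, `parsePolicyOptions`) are hypothetical and are not the kubelet's actual identifiers; only the option key comes from this KEP. Note that `strconv.ParseBool` also accepts spellings such as `"1"` or `"T"`, which may be more lenient than the final implementation.

```go
package main

import (
	"fmt"
	"strconv"
)

// Option key as named in this KEP.
const restrictToCPUNUMANodes = "restrict-to-cpu-numa-nodes"

// policyOptions is a hypothetical holder for the parsed option.
type policyOptions struct {
	RestrictToCPUNUMANodes bool
}

// parsePolicyOptions reads the option from a string map such as the one
// configured under topologyManagerPolicyOptions. A missing key keeps the
// default (false), so existing configurations are unaffected.
func parsePolicyOptions(raw map[string]string) (policyOptions, error) {
	var opts policyOptions
	value, ok := raw[restrictToCPUNUMANodes]
	if !ok {
		return opts, nil
	}
	enabled, err := strconv.ParseBool(value)
	if err != nil {
		return opts, fmt.Errorf("bad value %q for option %q: %w", value, restrictToCPUNUMANodes, err)
	}
	opts.RestrictToCPUNUMANodes = enabled
	return opts, nil
}

func main() {
	opts, err := parsePolicyOptions(map[string]string{restrictToCPUNUMANodes: "true"})
	fmt.Println(opts.RestrictToCPUNUMANodes, err)
}
```

Rejecting unparsable values with an error, instead of silently treating them as `false`, is one possible design choice for surfacing configuration typos at kubelet startup.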

### TopologyManager Behavior

When enabled, TopologyManager filters the discovered topology to CPU-attached NUMA nodes before constructing its effective NUMA view.
If filtering would produce an empty topology, kubelet falls back to the original topology.
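
The filter-with-fallback behavior described above can be sketched as follows. This is a simplified illustration, not the kubelet's code: `numaNode` and `effectiveNUMANodes` are hypothetical names, and the real topology types carry far more detail.

```go
package main

import (
	"fmt"
	"sort"
)

// numaNode is a simplified, hypothetical stand-in for a discovered NUMA node.
type numaNode struct {
	ID   int
	CPUs []int // logical CPU IDs attached to the node; empty for CPU-less nodes
}

// effectiveNUMANodes returns the sorted IDs of nodes that have at least one
// CPU attached. When restrict is false, or when no node has CPUs, it returns
// every discovered node ID, i.e. the original topology.
func effectiveNUMANodes(nodes []numaNode, restrict bool) []int {
	all := make([]int, 0, len(nodes))
	withCPUs := make([]int, 0, len(nodes))
	for _, n := range nodes {
		all = append(all, n.ID)
		if len(n.CPUs) > 0 {
			withCPUs = append(withCPUs, n.ID)
		}
	}
	sort.Ints(all)
	sort.Ints(withCPUs)
	if !restrict || len(withCPUs) == 0 {
		return all
	}
	return withCPUs
}

func main() {
	nodes := []numaNode{
		{ID: 0, CPUs: []int{0, 1}},
		{ID: 1, CPUs: []int{2, 3}},
		{ID: 2}, // CPU-less node, e.g. device-attached memory
		{ID: 3},
	}
	fmt.Println(effectiveNUMANodes(nodes, true))
	fmt.Println(effectiveNUMANodes(nodes, false))
}
```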

### CPUManager and MemoryManager Behavior

Both managers consume the effective NUMA-node set from TopologyManager and generate hints only across that set.
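
To give a feel for why restricting the set matters, the sketch below assumes that a hint provider considers every non-empty subset of the NUMA nodes it is given, so the candidate count grows as 2^n - 1. That enumeration strategy, the function names, and the figure of two CPU-attached nodes are assumptions for illustration; the 34-node figure is borrowed from the `max-allowable-numa-nodes` example earlier in this document.

```go
package main

import "fmt"

// candidateCount returns the number of non-empty subsets of n NUMA nodes.
// It is only meaningful for n < 64.
func candidateCount(n int) uint64 {
	return 1<<uint(n) - 1
}

// candidateMasks enumerates every non-empty subset of nodeIDs as a bitmask
// in which bit i is set when NUMA node i is part of the subset. Node IDs
// must be below 64, and the slice should be small enough to enumerate.
func candidateMasks(nodeIDs []int) []uint64 {
	total := 1 << len(nodeIDs)
	masks := make([]uint64, 0, total-1)
	for subset := 1; subset < total; subset++ {
		var mask uint64
		for i, id := range nodeIDs {
			if subset&(1<<i) != 0 {
				mask |= 1 << uint(id)
			}
		}
		masks = append(masks, mask)
	}
	return masks
}

func main() {
	fmt.Println(candidateCount(34), candidateCount(2))
	fmt.Println(candidateMasks([]int{0, 1}))
}
```

Under these assumptions, 34 nodes yield 17179869183 candidate subsets, while an effective set of 2 nodes yields 3.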

### DeviceManager Behavior

Device plugins may report NUMA nodes outside the effective CPU-attached NUMA set.
When enabled, DeviceManager projects raw device NUMA-node IDs onto the effective NUMA-node set using NUMA distances, so hint generation and aligned allocation share the same reduced placement universe.
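
One way to picture the projection is a nearest-node lookup over a NUMA distance matrix. The sketch below is an assumption about the shape of the logic, not the proposed implementation: `projectToEffective` is a hypothetical name, the matrix layout (`distances[i][j]` is the distance from node `i` to node `j`) is assumed, and ties are broken by the order of the effective set. The caller decides what to do when no projection is possible.

```go
package main

import "fmt"

// projectToEffective maps a raw device NUMA node ID onto the closest node in
// the effective set. It returns ok=false when distance data for the raw node
// is missing, leaving the fallback decision to the caller.
func projectToEffective(raw int, effective []int, distances [][]int) (node int, ok bool) {
	// A device already on an effective node needs no projection.
	for _, id := range effective {
		if id == raw {
			return raw, true
		}
	}
	if raw < 0 || raw >= len(distances) {
		return 0, false
	}
	row := distances[raw]
	best, bestDist, found := 0, 0, false
	for _, id := range effective {
		if id < 0 || id >= len(row) {
			continue // no distance recorded for this pair
		}
		if !found || row[id] < bestDist {
			best, bestDist, found = id, row[id], true
		}
	}
	return best, found
}

func main() {
	// Hypothetical 4-node system: nodes 0 and 1 have CPUs, 2 and 3 do not.
	distances := [][]int{
		{10, 40, 20, 40},
		{40, 10, 40, 20},
		{20, 40, 10, 80},
		{40, 20, 80, 10},
	}
	effective := []int{0, 1}
	fmt.Println(projectToEffective(2, effective, distances))
	fmt.Println(projectToEffective(3, effective, distances))
}
```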

### Failure Modes

- If no CPU-attached NUMA nodes can be derived, fall back to the original topology.
- If device NUMA-distance data is unavailable, fall back conservatively.

## Risks and Mitigations

Risk:
- Some platforms may expect CPU-less NUMA nodes to remain first-class in hint generation.

Mitigation:
- The option is alpha and opt-in.

Risk:
- Device projection may affect aligned allocation choices.

Mitigation:
- DeviceManager uses the same effective NUMA-node set as TopologyManager.
- Unit tests cover projected hint generation behavior.

## Test Plan

- Unit tests for TopologyManager option parsing and effective NUMA topology construction.
- Unit tests for CPUManager and MemoryManager hint generation with filtered NUMA sets.
- Unit tests for DeviceManager NUMA projection and aligned allocation behavior.
- Validation on affected large NUMA hardware such as GB200.

## Graduation Criteria

### Alpha

- KEP merged
- Option implemented behind `TopologyManagerPolicyAlphaOptions`
- Unit tests added
- Validation on representative platform

### Beta

- Additional platform validation
- No major correctness regressions reported

### Stable

- Sufficient production confidence
- Finalized operator guidance

## Production Readiness Review Questions

### Feature Enablement and Rollback

- Feature gate: `TopologyManagerPolicyAlphaOptions`
- Component: kubelet
- Additional enablement: set `restrict-to-cpu-numa-nodes`
- Default behavior is unchanged
- Rollback is supported by removing the option

### Monitoring Requirements

No new metrics are proposed initially. Operators can use kubelet logs and admission behavior on affected nodes.

## Implementation History

- 2026-04-02: Initial draft
- 2026-04-02: Initial implementation PR opened in kubernetes/kubernetes#138172
Lines changed: 29 additions & 0 deletions
@@ -0,0 +1,29 @@
title: TopologyManager CPU-attached NUMA filter option
kep-number: 5726
authors:
  - "@fanzhangio"
owning-sig: sig-node
participating-sigs:
  - sig-node
status: provisional
creation-date: 2026-04-02
reviewers:
  - TBD
approvers:
  - TBD
see-also:
  - "https://github.com/kubernetes/kubernetes/issues/135541"
  - "https://github.com/kubernetes/enhancements/issues/5726"
  - "https://github.com/kubernetes/kubernetes/pull/135581"
stage: alpha
latest-milestone: "v1.36"
milestone:
  alpha: "v1.36"
  beta: "TBD"
  stable: "TBD"
feature-gates:
  - name: TopologyManagerPolicyAlphaOptions
    components:
      - kubelet
disable-supported: true
metrics: []
