Commit 9dcddc8

KEP-6007: Add Topology Manager option to improve workload density for single-numa-node
Signed-off-by: Jing C. Zhang (EXT-Nokia) <jing.c.zhang.ext@nokia.com>
# KEP-6007: Add Topology Manager option to improve workload density for single-numa-node

[enhancement tracking issue]: https://github.com/kubernetes/enhancements/issues/6007

<!-- toc -->
- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
  - [Problem statement](#problem-statement)
  - [Illustrative NUMA packing](#illustrative-numa-packing)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories](#user-stories)
  - [Proposed API](#proposed-api)
  - [Algorithm](#algorithm)
  - [Notes / Constraints / Caveats](#notes--constraints--caveats)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [Kubelet wiring (container manager ↔ topology manager)](#kubelet-wiring-container-manager--topology-manager)
  - [Test Plan](#test-plan)
    - [Unit tests](#unit-tests)
    - [Integration / e2e](#integration--e2e)
  - [Rollout and Documentation](#rollout-and-documentation)
  - [Graduation Criteria](#graduation-criteria)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
- [Implementation History](#implementation-history)
<!-- /toc -->
## Release Signoff Checklist

Items marked with (R) are required *prior to targeting to a milestone / release*.

- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements]
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
- [ ] (R) Design details are appropriately documented
- [ ] (R) Test plan is in place
- [ ] (R) Graduation criteria is in place
- [ ] (R) Production readiness review completed
- [ ] (R) Production readiness review approved

[kubernetes/enhancements]: https://git.k8s.io/enhancements
## Summary

Clusters that enforce strict NUMA locality with `topologyManagerPolicy: single-numa-node`
often give up **workload density**: pods fail to schedule even when the node has **enough
CPUs in total**, because exclusive CPUs are fragmented across NUMA nodes.
Among **tied**, equally valid single-NUMA merged outcomes, Topology Manager today breaks ties
with the **bitmask / `Narrowest`** rule (equal width → **smaller mask / lower NUMA id**), **not**
with **free or contiguous** exclusive-CPU headroom—so it cannot prefer the choice that leaves
**more room for the next** large single-NUMA pod.

This KEP proposes an **optional** Topology Manager policy option, **`prefer-most-allocated-numa-node`**,
so that, under **`single-numa-node`**, kubelet can break ties using **kubelet-local** signals
(static CPU and Memory managers)—in the spirit of **"most allocated"** packing: favor placing the pod
on the NUMA node that **preserves a larger contiguous exclusive-CPU hole** on the other NUMA node
for the **next** workload. Default behavior stays **unchanged** when the option is off.
## Motivation

### Problem statement

- **Why `best-effort` provides "more capacity":** The same Guaranteed workload with
  integer CPUs can be admitted under `best-effort` with **cross-NUMA** CPU placement.
  `single-numa-node` rejects the pod unless the request fits on **one** NUMA node; the
  difference is topology locality, not raw millicores.
- **Observed failure mode:** Workloads pinned to one NUMA node (devices, hugepages, etc.)
  make the node **asymmetric**. For **new** pods whose merged hints allow **either** NUMA
  node, **always preferring the lower NUMA id** reduces the overall workload density on
  the node.
### Illustrative NUMA packing

The two figures below show why **where** work lands on each NUMA node affects **single-NUMA**
workloads that need a **contiguous** block. Pod B is NUMA-pinned due to device affinity.

![Illustration: capacity-not-aware NUMA selection (today's tie-break).](capacity-not-aware.png)

*Capacity-not-aware (today):* among valid NUMA nodes, the merger **prefers the lower NUMA id**
(via bitmask ordering) when there are **no** stronger hints from devices, hugepages, or
static memory—**without** considering the remaining contiguous exclusive-CPU headroom.

![Illustration: prefer-most-allocated style selection leaves a larger contiguous free region on one NUMA node.](capacity-aware.png)

*With `prefer-most-allocated-numa-node` (proposed):* tie-break scoring favors an outcome in the
**most-allocated / consolidate** spirit—**concentrating** use so the **other** NUMA node keeps a
**larger contiguous** exclusive-CPU region—often better for the **next** Guaranteed pod under
the static CPU manager.
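The effect can be made concrete with a small numeric sketch. The CPU counts below are illustrative, not taken from this KEP: consolidating onto the already-loaded NUMA node preserves a block large enough for a later pod that a spread placement cannot satisfy.

```go
package main

import "fmt"

// fits reports whether a single-numa-node pod requesting `request`
// exclusive CPUs can be admitted, i.e. whether at least one NUMA node
// has that many free exclusive CPUs.
func fits(free []int, request int) bool {
	for _, f := range free {
		if f >= request {
			return true
		}
	}
	return false
}

func main() {
	// Illustrative node: 16 exclusive CPUs per NUMA node. Pod B (4 CPUs)
	// is pinned to NUMA1 by device affinity; a 6-CPU pod could go either way.

	// Lower-NUMA-id tie-break (today): the 6-CPU pod lands on NUMA0.
	spread := []int{16 - 6, 16 - 4} // free: NUMA0=10, NUMA1=12

	// Most-allocated tie-break (proposed): consolidate on NUMA1 next to Pod B.
	packed := []int{16, 16 - 4 - 6} // free: NUMA0=16, NUMA1=6

	next := 14 // the next Guaranteed pod wants 14 exclusive CPUs
	fmt.Println(fits(spread, next), fits(packed, next)) // false true
}
```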
### Goals

- **Improve workload density** (better use of nodes) for operators who use
  **`single-numa-node`** for strict locality (as in Telco deployments).
- Integrate via **`TopologyManagerPolicyOptions`**, using the same **feature gate /
  graduation pattern** as other topology policy options.
### Non-Goals

- Reimplementing **`NodeResourcesFit`** or other scheduler scoring. The NUMA-level scoring
  **aligns with** the scheduler's `mostRequestedScore` formula but does not import or
  duplicate scheduler code.

**Side benefit:** Because the default kube-scheduler is **not NUMA-aware**, it may place a
pod on a node whose aggregate resources look sufficient, only for kubelet to **reject** the
pod when no single NUMA node has enough contiguous free space. By acting as a **local
defragmenter**—consolidating smaller workloads on one NUMA node to preserve larger contiguous
free regions on the other—this option reduces such "last-mile" admission failures and the
wasted scheduling cycles they cause.
## Proposal

### User Stories

1. **As a cluster operator** running `single-numa-node` with static CPU and Memory managers,
   I want **better density** under strict single-NUMA locality—without switching to
   **`best-effort`** (which results in cross-NUMA CPUs).
2. **As a platform engineer**, I want an **opt-in** policy option so existing clusters keep
   today's behavior unless they enable the new option.
### Proposed API

Introduce a new Topology Manager policy option, **`prefer-most-allocated-numa-node`**, configurable
via kubelet configuration alongside the existing options:

- Only in effect when **`topologyManagerPolicy`** is **`single-numa-node`**.
- Gated via `TopologyManagerPolicyOptions` alpha/beta feature gates, as for other new options.

When the option is **disabled** (the default), behavior remains **unchanged** from today.

```yaml
kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
featureGates:
  # ...
  TopologyManagerPolicyAlphaOptions: true
topologyManagerPolicyOptions:
  prefer-most-allocated-numa-node: "true"
topologyManagerPolicy: single-numa-node
memoryManagerPolicy: Static
cpuManagerPolicy: static
# ...
```
### Algorithm

**Trigger:** `topologyManagerPolicy` is **`single-numa-node`**, **`prefer-most-allocated-numa-node`**
is set in **`topologyManagerPolicyOptions`** (with the required feature gates enabled), and the hint
merger is comparing **preferred** merged hints whose NUMA affinity is a **single** NUMA node
(bit count 1).

**Scoring (kubelet aggregator):**

Each signal computes a **utilization score** per NUMA node using the same formula as
kube-scheduler's `MostAllocated` plugin: `score = (assigned × 100) / allocatable`,
where **allocatable** accounts for reserved resources so the ratio reflects true
utilization.

1. **CPU signal (static CPU manager):** Score each NUMA node by exclusive-CPU utilization
   (assigned / allocatable, where allocatable excludes reserved CPUs). The higher score wins.
   Equal scores or a non-static policy → undecided.
2. **Memory signal (static memory manager):** Score each NUMA node by regular-memory utilization
   (assigned / allocatable, where allocatable excludes system-reserved memory). The higher score
   wins. Equal scores or a non-static policy → undecided.
3. **Combine:** Neither decides → **`Narrowest`** fallback. One decides → use it.
   Both decide and agree → use it. Both decide but disagree → **`Narrowest`** fallback.
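The scoring formula and combine rules above can be sketched in Go. This is a hedged illustration of the rules as described, not the proposed kubelet code; the function names are invented for the example.

```go
package main

import "fmt"

// mostAllocatedScore mirrors the kube-scheduler MostAllocated formula the
// KEP references: score = assigned * 100 / allocatable, where allocatable
// already excludes reserved resources.
func mostAllocatedScore(assigned, allocatable int64) int64 {
	if allocatable <= 0 {
		return 0
	}
	return assigned * 100 / allocatable
}

// compareNUMA returns +1 if the candidate NUMA node wins, -1 if the current
// one wins, and 0 if the signal is undecided (equal scores).
func compareNUMA(assignedCur, allocCur, assignedCand, allocCand int64) int {
	cur := mostAllocatedScore(assignedCur, allocCur)
	cand := mostAllocatedScore(assignedCand, allocCand)
	switch {
	case cand > cur:
		return 1
	case cur > cand:
		return -1
	default:
		return 0
	}
}

// combine applies the combine rules: neither decides -> 0 (Narrowest
// fallback); one decides -> use it; both agree -> use it; disagree -> 0.
func combine(cpu, mem int) int {
	switch {
	case cpu == 0:
		return mem
	case mem == 0 || mem == cpu:
		return cpu
	default:
		return 0 // disagreement: Narrowest fallback
	}
}

func main() {
	// NUMA0 has 6 of 14 exclusive CPUs assigned; NUMA1 has 10 of 14.
	cpuSignal := compareNUMA(6, 14, 10, 14) // candidate NUMA1 is fuller
	memSignal := 0                          // memory manager undecided
	fmt.Println(cpuSignal, combine(cpuSignal, memSignal)) // 1 1
}
```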
**Interactions:**

- **Scheduler / descheduler:** Not a substitute for this option. The **scheduler** chooses a
  **node**; it does not run Topology Manager or finalize **static** CPU / memory **NUMA** placement
  on the node. The **descheduler** evicts pods so they can be scheduled again. This KEP answers
  the question "which single-NUMA outcome wins when several are equivalent?".
- **Merge:** Still driven by hint providers; this step only affects **which** valid
  single-NUMA preferred outcome wins when multiple exist.
### Notes / Constraints / Caveats

- Requires the **static** CPU Manager and Memory Manager so that **per-NUMA** signals are meaningful.

### Risks and Mitigations

| Risk | Mitigation |
|------|------------|
| Admission latency | Run scoring only on the tie path; reuse cached summaries where possible |
| Behavior surprise when enabled | Opt-in; the `topology_manager_admission_*` metrics already exist |
## Design Details

### Kubelet wiring (container manager ↔ topology manager)

Topology Manager is created **before** the CPU and Memory managers, so those dependencies are
not available at `NewManager` time. Inside the **container manager**, after the CPU and Memory
managers are constructed (and registered as topology hint providers), kubelet builds a **preferred
NUMA tie-breaker** object that **holds references** to those two managers and registers it via
**`TopologyManager.SetPreferredSingleNUMATieBreaker`**. The topology **Manager** forwards the
registration to the **Scope**, which stores the object on **`singleNumaNodePolicy`** when the
active policy is **`single-numa-node`**; other policies ignore the registration.

When a pod reaches topology admission, the hint **merger** takes the tie path (two **preferred**
hints with **single-bit** NUMA masks), and **`PreferMostAllocatedNUMANode`** is **true** in
**`PolicyOptions`**, the merger calls **`ComparePreferredSingleNUMAForTopology`** on the stored
tie-breaker. That implementation delegates to the **CPU** and **Memory** managers'
**`ComparePreferredSingleNUMAForTopology`** methods, then applies the **agree / single-signal /
disagree → fallback** rules described under **Algorithm**.
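A minimal Go sketch of this wiring, assuming the shapes look roughly as described: the method names `SetPreferredSingleNUMATieBreaker` and `ComparePreferredSingleNUMAForTopology` come from the text above, while the interfaces, struct names, and stubs are invented for the example.

```go
package main

import "fmt"

// signalComparer is the shape each of the CPU and Memory managers would
// expose. Convention here: +1 prefer candidate NUMA, -1 prefer current,
// 0 undecided.
type signalComparer interface {
	ComparePreferredSingleNUMAForTopology(current, candidate int) int
}

// preferredNUMATieBreaker holds references to both managers; the container
// manager would build it after constructing them.
type preferredNUMATieBreaker struct {
	cpu, memory signalComparer
}

// ComparePreferredSingleNUMAForTopology applies the agree / single-signal /
// disagree -> fallback rules from the Algorithm section.
func (t *preferredNUMATieBreaker) ComparePreferredSingleNUMAForTopology(cur, cand int) int {
	c := t.cpu.ComparePreferredSingleNUMAForTopology(cur, cand)
	m := t.memory.ComparePreferredSingleNUMAForTopology(cur, cand)
	switch {
	case c == 0:
		return m // memory alone decides, or both undecided
	case m == 0 || m == c:
		return c // CPU alone decides, or both agree
	default:
		return 0 // disagreement: the merger falls back to Narrowest
	}
}

// topologyManager stub showing only the registration hook.
type topologyManager struct{ tieBreaker *preferredNUMATieBreaker }

func (tm *topologyManager) SetPreferredSingleNUMATieBreaker(tb *preferredNUMATieBreaker) {
	tm.tieBreaker = tb // in kubelet, forwarded to the scope / singleNumaNodePolicy
}

// fixed is a stub signal that always returns the same verdict.
type fixed int

func (f fixed) ComparePreferredSingleNUMAForTopology(_, _ int) int { return int(f) }

func main() {
	tm := &topologyManager{}
	tm.SetPreferredSingleNUMATieBreaker(&preferredNUMATieBreaker{cpu: fixed(1), memory: fixed(0)})
	fmt.Println(tm.tieBreaker.ComparePreferredSingleNUMAForTopology(0, 1)) // 1: CPU alone decides
}
```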
### Test Plan

- [x] Owners of involved components may require updates to existing tests before merge.

#### Unit tests

Five test suites cover each layer of the feature:
1. **CPU signal scoring** (`pkg/kubelet/cm/cpumanager/cpu_compare_preferred_scoring_test.go`) —
   Verifies the CPU manager's utilization-based comparison: equal utilization → undecided,
   higher utilization → prefer that NUMA node, non-static policy → undecided, and asymmetric
   reserved CPUs correctly changing the outcome even when raw assigned counts are equal.

2. **Memory signal scoring** (`pkg/kubelet/cm/memorymanager/memory_compare_preferred_scoring_test.go`) —
   Same structure as CPU: equal, higher-on-candidate, higher-on-current, non-static, and
   asymmetric allocatable memory changing the outcome despite equal assigned bytes.

3. **Aggregator combine rules** (`pkg/kubelet/cm/preferred_numa_tiebreak_test.go`) —
   Stub-driven tests for the combine logic: both undecided, CPU-only, memory-only,
   agree, and disagree → fallback.

4. **Topology Manager merge** (`pkg/kubelet/cm/topologymanager/policy_prefer_most_allocated_test.go`) —
   End-to-end merge through `singleNumaNodePolicy`: the tie-breaker overrides the default, and
   an absent tie-breaker falls back to `Narrowest`.

5. **Policy option gating** (`pkg/kubelet/cm/topologymanager/policy_options_test.go`) —
   `prefer-most-allocated-numa-node` is accepted only with `TopologyManagerPolicyAlphaOptions`
   enabled, and rejected otherwise.
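As a sketch of the table-driven style these suites would use — the case names and expected verdicts follow the descriptions above, but the scoring helper is a stand-in, not the actual kubelet function:

```go
package main

import "fmt"

// score is a stand-in for the per-NUMA utilization formula:
// assigned * 100 / allocatable.
func score(assigned, allocatable int64) int64 {
	if allocatable <= 0 {
		return 0
	}
	return assigned * 100 / allocatable
}

// verdict compares current vs candidate NUMA utilization.
func verdict(curAssigned, curAlloc, candAssigned, candAlloc int64) string {
	cur, cand := score(curAssigned, curAlloc), score(candAssigned, candAlloc)
	switch {
	case cand > cur:
		return "candidate"
	case cur > cand:
		return "current"
	default:
		return "undecided"
	}
}

func main() {
	cases := []struct {
		name                    string
		curAssigned, curAlloc   int64
		candAssigned, candAlloc int64
		want                    string
	}{
		{"equal utilization -> undecided", 4, 16, 4, 16, "undecided"},
		{"candidate fuller -> candidate", 4, 16, 8, 16, "candidate"},
		// Asymmetric reserved CPUs: equal assigned counts, smaller allocatable
		// on the candidate flips the outcome.
		{"asymmetric reserved flips outcome", 4, 16, 4, 8, "candidate"},
	}
	for _, c := range cases {
		got := verdict(c.curAssigned, c.curAlloc, c.candAssigned, c.candAlloc)
		fmt.Println(c.name, got == c.want) // each case prints its name followed by true
	}
}
```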
#### Integration / e2e

- Multi-NUMA **static** CPU + `single-numa-node`: ordered admission of Guaranteed pods
  (no devices) to validate the **baseline** vs. the **option** NUMA choice when ties exist.
- Scenarios where **one** NUMA node already holds an exclusive / device-bound pod and pods could
  be admitted to **either** NUMA node—assert that the option changes **which** NUMA node wins vs.
  the low-index default.
### Rollout and Documentation

- Alpha: new option behind `TopologyManagerPolicyAlphaOptions`.
- User-facing docs: kubelet configuration reference, relationship to `single-numa-node` and
  the static managers.
- Release notes per phase.

### Graduation Criteria

- **Alpha:** Implementation + unit tests; documented semantics and known limitations.
- **Beta:** e2e signal on multi-NUMA CI; no major semantic surprises from production feedback.
- **GA:** Sufficient soak; option promotion per SIG Node policy for policy options.

## Drawbacks

- More logic and coupling between Topology Manager and CPU/Memory manager state.

## Alternatives

1. **Status quo:** Keep the bitmask / low-NUMA-id tie-break; accept lower density on **asymmetric**
   nodes when symmetric pods **always** stack on the lower-id NUMA node first.

## Implementation History

- 2026-04-07: Draft created.
title: Add Topology Manager option to improve workload density for single-numa-node
kep-number: 6007
authors:
  - "jingczhang"
  - "saipranav36"
owning-sig: sig-node
participating-sigs: []
status: provisional
creation-date: "2026-04-07"
reviewers: []
approvers:
  - "@sig-node-tech-leads"
see-also: []
replaces: []
latest-milestone: ""
milestone:
  alpha: "v1.37"
  beta: "v1.38"
  stable: "v1.40"
feature-gates:
  - name: "TopologyManagerPolicyAlphaOptions"
    components:
      - kubelet
    disable-supported: true
metrics: []
