Commit a047fa3

KEP-NNNN: Improve workload density for single-numa-node policy

Signed-off-by: Jing C. Zhang (EXT-Nokia) <jing.c.zhang.ext@nokia.com>
# KEP-6007: Add Topology Manager option to improve workload density for single-numa-node

[enhancement tracking issue]: https://github.com/kubernetes/enhancements/issues/new/choose

<!-- toc -->
- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
  - [Problem statement](#problem-statement)
  - [Illustrative NUMA packing](#illustrative-numa-packing)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories](#user-stories)
  - [Proposed API](#proposed-api)
  - [Algorithm](#algorithm)
  - [Notes / Constraints / Caveats](#notes--constraints--caveats)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [Kubelet wiring (container manager ↔ topology manager)](#kubelet-wiring-container-manager--topology-manager)
  - [Test Plan](#test-plan)
  - [Rollout and Documentation](#rollout-and-documentation)
  - [Graduation Criteria](#graduation-criteria)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
- [Implementation History](#implementation-history)
<!-- /toc -->
## Release Signoff Checklist

Items marked with (R) are required *prior to targeting to a milestone / release*.

- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements]
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
- [ ] (R) Design details are appropriately documented
- [ ] (R) Test plan is in place
- [ ] (R) Graduation criteria is in place
- [ ] (R) Production readiness review completed
- [ ] (R) Production readiness review approved

[kubernetes/enhancements]: https://git.k8s.io/enhancements
## Summary

Clusters that enforce strict NUMA locality with `topologyManagerPolicy: single-numa-node`
often give up **workload density**: pods fail to schedule even when the node has **enough
CPUs in total**, because exclusive CPUs are fragmented across NUMA nodes.
Among **tied**, equally valid single-NUMA merged outcomes, Topology Manager today breaks ties
with **bitmask ordering / `Narrowest`** (equal width → **smaller mask / lower NUMA id**), **not**
with **free or contiguous** exclusive-CPU headroom—so it cannot prefer the choice that leaves
**more room for the next** large single-NUMA pod.

This KEP proposes an **optional** Topology Manager policy option **`prefer-most-allocated-numa-node`**
so that, under **`single-numa-node`**, kubelet can break ties using **kubelet-local** signals
(static CPU and Memory managers)—in the spirit of **“most allocated”** packing (favor placing the pod
on the NUMA node that **preserves a larger contiguous exclusive-CPU hole** on the other NUMA node
for the **next** workload). Default behavior stays **unchanged** when the option is off.
## Motivation

### Problem statement

- **Why `best-effort` provides “more capacity”:** The same Guaranteed workload with
  integer CPUs can be admitted under `best-effort` with **cross-NUMA** CPU placement.
  `single-numa-node` rejects it unless the request fits on **one** NUMA node; the difference
  is topology locality, not raw millicores.
- **Observed failure mode:** Workloads pinned to one NUMA node (devices, hugepages, etc.)
  make the node **asymmetric**. For **new** pods whose merged hints allow **either** NUMA
  node, **always preferring the lower NUMA id** reduces the overall workload density on
  the node.
### Illustrative NUMA packing

The two figures below show why **where** work lands on each NUMA node affects **single-NUMA**
workloads that need a **contiguous** block. Pod B is NUMA-pinned due to device affinity.

![Illustration: capacity-not-aware NUMA selection (today’s tie-break).](capacity-not-aware.png)

*Capacity-not-aware (today):* among valid NUMA nodes, the merger **prefers the lower NUMA id**
(via bitmask ordering) when there are **no** stronger hints from devices, hugepages, or
static memory—**without** considering remaining contiguous exclusive-CPU headroom.

![Illustration: prefer-most-allocated style selection leaves a larger contiguous free region on one NUMA.](capacity-aware.png)

*With `prefer-most-allocated-numa-node` (proposed):* tie-break scoring favors an outcome in the
**most-allocated / consolidate** spirit—**concentrating** use so the **other** NUMA node keeps a
**larger contiguous** exclusive-CPU region—often better for the **next** Guaranteed pod under
the static CPU manager.
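The effect of the tie-break can be made concrete with a toy simulation (a minimal sketch with made-up numbers, not values from the figures; the `admit` helper is hypothetical and only models the single-NUMA fit plus tie-break, not real kubelet admission):

```go
package main

import "fmt"

// admit models single-NUMA admission on a toy node: a pod's CPU request must
// fit entirely within one NUMA node's free exclusive CPUs. Among fitting NUMA
// nodes, the default tie-break keeps the lowest id; with preferMostAllocated,
// the more utilized (less free) NUMA node wins, consolidating load.
func admit(free []int, req int, preferMostAllocated bool) int {
	best := -1
	for id, f := range free {
		if f < req {
			continue // does not fit on a single NUMA node
		}
		if best == -1 {
			best = id
		} else if preferMostAllocated && free[id] < free[best] {
			best = id // tie-break: stack onto the fuller NUMA node
		} // default: keep the lower NUMA id
	}
	if best >= 0 {
		free[best] -= req
	}
	return best
}

func main() {
	// Two NUMA nodes with 16 exclusive CPUs each; pod B (4 CPUs) is already
	// device-pinned to NUMA 1, so the node starts asymmetric.
	for _, opt := range []bool{false, true} {
		free := []int{16, 12}
		c := admit(free, 6, opt)  // pod C: both NUMA nodes are valid
		d := admit(free, 14, opt) // pod D: needs a large contiguous block
		fmt.Printf("option=%v: pod C -> NUMA %d, pod D -> NUMA %d (-1 = rejected)\n", opt, c, d)
	}
	// option=false: pod C -> NUMA 0, pod D -> NUMA -1 (-1 = rejected)
	// option=true:  pod C -> NUMA 1, pod D -> NUMA 0 (-1 = rejected)
}
```

With the default tie-break, pod C lands on NUMA 0 and fragments both nodes, so the 14-CPU pod D is rejected; stacking C onto the already-loaded NUMA 1 keeps NUMA 0 whole and D admits.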
### Goals

- **Improve workload density** (better use of nodes) for operators who use
  **`single-numa-node`** for strict locality (as is common in Telco).
- Integrate via **`TopologyManagerPolicyOptions`**, using the same **feature gate /
  graduation pattern** as other topology policy options.

### Non-Goals

- Reimplementing **`NodeResourcesFit`** or other scheduler scoring. The NUMA-level scoring
  **aligns with** the scheduler's `mostRequestedScore` formula but does not import or
  duplicate scheduler code.
## Proposal

### User Stories

1. **As a cluster operator** running `single-numa-node` with static CPU and Memory managers,
   I want **better density** under strict single-NUMA locality—without switching to
   **`best-effort`** (which results in cross-NUMA CPUs).
2. **As a platform engineer**, I want an **opt-in** policy option so existing clusters keep
   today’s behavior unless they enable the new option.

### Proposed API

Introduce a new Topology Manager policy option **`prefer-most-allocated-numa-node`**, configurable
via kubelet configuration alongside the existing options:

- Only in effect when **`topologyManagerPolicy` is `single-numa-node`**.
- Gated by the **`TopologyManagerPolicyOptions`** alpha/beta feature gates, as with other new options.

When the option is **disabled** (default), behavior remains **unchanged** from today.
```yaml
kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
featureGates:
  ...
  TopologyManagerPolicyAlphaOptions: true
topologyManagerPolicyOptions:
  prefer-most-allocated-numa-node: "true"
topologyManagerPolicy: single-numa-node
memoryManagerPolicy: Static
cpuManagerPolicy: static
...
```
### Algorithm

**Trigger:** `topologyManagerPolicy` is **`single-numa-node`**, **`prefer-most-allocated-numa-node`**
is set in **`topologyManagerPolicyOptions`** (with the required feature gates), and the hint merger is
comparing **preferred** merged hints whose NUMA affinity is a **single** NUMA node (bit count 1).

**Scoring (kubelet aggregator):**

Each signal computes a **utilization score** per NUMA node using the same formula as
kube-scheduler's `MostAllocated` plugin: `score = (assigned × 100) / allocatable`,
where **allocatable** accounts for reserved resources so the ratio reflects true
utilization.

1. **CPU signal (static CPU manager):** Score each NUMA node by exclusive-CPU utilization
   (assigned / allocatable, where allocatable excludes reserved CPUs). Higher score wins.
   Equal scores or a non-static policy → undecided.
2. **Memory signal (static memory manager):** Score each NUMA node by regular-memory utilization
   (assigned / allocatable, where allocatable excludes system-reserved memory). Higher score
   wins. Equal scores or a non-static policy → undecided.
3. **Combine:** Neither decides → **`Narrowest`** fallback. One decides → use it.
   Both decide and agree → use it. Both decide but disagree → **`Narrowest`** fallback.
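The per-signal scoring and the combine rules above can be sketched as follows (a minimal sketch; `mostAllocatedScore`, `compareNUMA`, `combine`, and the `decision` type are illustrative names, not the actual kubelet API — only the formula and the combine rules come from this KEP):

```go
package main

import "fmt"

// decision is one signal's verdict when comparing two single-NUMA outcomes.
type decision int

const (
	undecided decision = iota // equal scores or non-static policy
	preferCurrent
	preferCandidate
)

// mostAllocatedScore mirrors kube-scheduler's MostAllocated formula:
// score = assigned * 100 / allocatable (allocatable excludes reserved resources).
func mostAllocatedScore(assigned, allocatable int64) int64 {
	if allocatable <= 0 {
		return 0
	}
	return assigned * 100 / allocatable
}

// compareNUMA scores two NUMA nodes for one signal; higher utilization wins.
func compareNUMA(assignedCur, allocCur, assignedCand, allocCand int64) decision {
	cur := mostAllocatedScore(assignedCur, allocCur)
	cand := mostAllocatedScore(assignedCand, allocCand)
	switch {
	case cand > cur:
		return preferCandidate
	case cur > cand:
		return preferCurrent
	default:
		return undecided
	}
}

// combine applies the KEP's aggregation rules: one signal decides → use it;
// both agree → use it; neither decides, or they disagree → Narrowest fallback.
func combine(cpu, mem decision) (decision, bool) {
	switch {
	case cpu == undecided && mem == undecided:
		return undecided, false // fall back to Narrowest
	case cpu == undecided:
		return mem, true
	case mem == undecided || cpu == mem:
		return cpu, true
	default:
		return undecided, false // disagreement → fall back to Narrowest
	}
}

func main() {
	// NUMA 0 has 6 of 14 allocatable exclusive CPUs assigned; NUMA 1 has 2 of 14.
	cpu := compareNUMA(6, 14, 2, 14) // scores 42 vs 14 → prefer current (NUMA 0)
	mem := compareNUMA(8, 60, 8, 60) // equal memory utilization → undecided
	d, ok := combine(cpu, mem)
	fmt.Println(d == preferCurrent, ok) // CPU signal alone decides → true true
}
```

Integer division matches the scheduler's integer scoring; ties after truncation deliberately stay undecided rather than guessing.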
**Interactions:**

- **Scheduler / descheduler:** Not a substitute for this option. The **scheduler** chooses a
  **node**; it does not run Topology Manager or finalize **static** CPU / memory **NUMA** placement
  on the node. The **descheduler** evicts pods so they can be scheduled again. This KEP answers
  the question "which single-NUMA outcome wins when several are equivalent?".
- **Merge:** Still driven by hint providers; this step only affects **which** valid
  single-NUMA preferred outcome wins when multiple exist.

### Notes / Constraints / Caveats

- Requires the **static** CPU Manager and Memory Manager policies so that **per-NUMA** signals
  are meaningful.

### Risks and Mitigations

| Risk | Mitigation |
|------|------------|
| Admission latency | Run scoring only on the tie path; reuse cached summaries where possible |
| Behavior surprise when enabled | Opt-in; metrics `topology_manager_admission_*` already exist |
## Design Details

### Kubelet wiring (container manager ↔ topology manager)

Topology Manager is created **before** the CPU and Memory managers, so those dependencies are
not available at `NewManager` time. Inside the **container manager**, after the CPU and Memory
managers are constructed (and registered as topology hint providers), kubelet builds a **preferred
NUMA tie-breaker** object that **holds references** to those two managers and registers it via
**`TopologyManager.SetPreferredSingleNUMATieBreaker`**. The topology **Manager** forwards the
registration to the **Scope**, which stores the object on **`singleNumaNodePolicy`** when the
active policy is **`single-numa-node`**; other policies ignore the registration.

When a pod reaches topology admission, the hint **merger** takes the tie path (two **preferred**
hints with **single-bit** NUMA masks), and **`PreferMostAllocatedNUMANode`** is **true** in
**`PolicyOptions`**, the merger calls **`ComparePreferredSingleNUMAForTopology`** on the stored
tie-breaker. That implementation delegates to the **CPU** and **Memory** managers’
**`ComparePreferredSingleNUMAForTopology`** methods, then applies the **agree / single-signal /
disagree→fallback** rules described under **Algorithm**.
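The delegation described above can be sketched as a small interface (the method name `ComparePreferredSingleNUMAForTopology` comes from this KEP; the signatures, the result type, and the `stub` helper are assumptions for illustration, not the actual kubelet types):

```go
package main

import "fmt"

// numaComparison is a manager's verdict when comparing two single-NUMA outcomes.
type numaComparison int

const (
	comparisonUndecided numaComparison = iota
	comparisonPreferCurrent
	comparisonPreferCandidate
)

// signalProvider is the facet of the static CPU and Memory managers the
// tie-breaker needs; both managers would implement it.
type signalProvider interface {
	ComparePreferredSingleNUMAForTopology(currentNUMA, candidateNUMA int) numaComparison
}

// preferredNUMATieBreaker holds references to both managers. The container
// manager builds it after they are constructed and registers it with
// Topology Manager (SetPreferredSingleNUMATieBreaker in the KEP).
type preferredNUMATieBreaker struct {
	cpu, memory signalProvider
}

// ComparePreferredSingleNUMAForTopology delegates to both signals, then applies
// the agree / single-signal / disagree→fallback rules from the Algorithm section.
func (t *preferredNUMATieBreaker) ComparePreferredSingleNUMAForTopology(cur, cand int) numaComparison {
	c := t.cpu.ComparePreferredSingleNUMAForTopology(cur, cand)
	m := t.memory.ComparePreferredSingleNUMAForTopology(cur, cand)
	switch {
	case c == comparisonUndecided:
		return m // memory alone decides (or both undecided → fallback)
	case m == comparisonUndecided || m == c:
		return c // CPU alone decides, or both agree
	default:
		return comparisonUndecided // disagreement → Narrowest fallback
	}
}

// stub is a fixed-verdict signalProvider for demonstration.
type stub numaComparison

func (s stub) ComparePreferredSingleNUMAForTopology(cur, cand int) numaComparison {
	return numaComparison(s)
}

func main() {
	// CPU prefers the candidate NUMA node; memory is undecided → CPU wins the tie.
	tb := &preferredNUMATieBreaker{
		cpu:    stub(comparisonPreferCandidate),
		memory: stub(comparisonUndecided),
	}
	fmt.Println(tb.ComparePreferredSingleNUMAForTopology(0, 1) == comparisonPreferCandidate) // true
}
```

Keeping the tie-breaker behind a narrow interface is what lets the merge tests in the Test Plan drive it with stubs instead of full manager state.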
### Test Plan

[X] Owners of involved components may require updates to existing tests before merge.

#### Unit tests

Five test suites cover each layer of the feature:

1. **CPU signal scoring** (`pkg/kubelet/cm/cpumanager/cpu_compare_preferred_scoring_test.go`) —
   Verifies the CPU manager's utilization-based comparison: equal utilization → undecided,
   higher utilization → prefer that NUMA node, non-static policy → undecided, and asymmetric
   reserved CPUs correctly change the outcome even when raw assigned counts are equal.

2. **Memory signal scoring** (`pkg/kubelet/cm/memorymanager/memory_compare_preferred_scoring_test.go`) —
   Same structure as CPU: equal, higher-on-candidate, higher-on-current, non-static, and
   asymmetric allocatable memory changing the outcome despite equal assigned bytes.

3. **Aggregator combine rules** (`pkg/kubelet/cm/preferred_numa_tiebreak_test.go`) —
   Stub-driven tests for the combine logic: both undecided, CPU-only, memory-only,
   agree, and disagree→fallback.

4. **Topology Manager merge** (`pkg/kubelet/cm/topologymanager/policy_prefer_most_allocated_test.go`) —
   End-to-end merge through `singleNumaNodePolicy`: the tie-breaker overrides the default, and
   an absent tie-breaker falls back to `Narrowest`.

5. **Policy option gating** (`pkg/kubelet/cm/topologymanager/policy_options_test.go`) —
   `prefer-most-allocated-numa-node` is accepted only with `TopologyManagerPolicyAlphaOptions`
   enabled; rejected otherwise.
#### Integration / e2e

- Multi-NUMA **static** CPU + `single-numa-node`: ordered admission of Guaranteed pods
  (no devices) to validate the **baseline** vs. **option** NUMA choice when ties exist.
- Scenarios where **one** NUMA node already holds an exclusive / device-bound pod and pods
  could admit to **either** NUMA node—assert the option changes **which** NUMA node wins
  vs. the low-index default.

### Rollout and Documentation

- Alpha: new option behind `TopologyManagerPolicyAlphaOptions`.
- User-facing docs: kubelet configuration reference, relationship to `single-numa-node` and
  the static managers.
- Release notes per phase.

### Graduation Criteria

- **Alpha:** Implementation + unit tests; documented semantics and known limitations.
- **Beta:** e2e signal on multi-NUMA CI; no major semantic surprises from production feedback.
- **GA:** Sufficient soak; option promotion per SIG Node policy for policy options.

## Drawbacks

- More logic and coupling between Topology Manager and CPU/Memory manager state.

## Alternatives

1. **Status quo:** Keep the bitmask / low-NUMA-id tie-break; accept lower density on
   **asymmetric** nodes when symmetric pods **always** stack on the lower-id NUMA node first.

## Implementation History

- 2026-04-07: Draft created.
KEP metadata (second file in this commit):

```yaml
title: Add Topology Manager option to improve workload density for single-numa-node
kep-number: 6007
authors:
  - "jingczhang"
  - "saipranav36"
owning-sig: sig-node
participating-sigs: []
status: provisional
creation-date: "2026-04-07"
reviewers: []
approvers:
  - "@sig-node-tech-leads"
see-also: []
replaces: []
latest-milestone: ""
milestone:
  alpha: "v1.37"
  beta: "v1.38"
  stable: "v1.40"
feature-gates:
  - name: "TopologyManagerPolicyAlphaOptions"
    components:
      - kubelet
    disable-supported: true
metrics: []
```
