# KEP-6007: Add Topology Manager option to improve workload density for single-numa-node

[enhancement tracking issue]: https://github.com/kubernetes/enhancements/issues/6007

<!-- toc -->
- [Release Signoff Checklist](#release-signoff-checklist)
- [Summary](#summary)
- [Motivation](#motivation)
  - [Problem statement](#problem-statement)
  - [Illustrative NUMA packing](#illustrative-numa-packing)
  - [Goals](#goals)
  - [Non-Goals](#non-goals)
- [Proposal](#proposal)
  - [User Stories](#user-stories)
  - [Proposed API](#proposed-api)
  - [Algorithm](#algorithm)
  - [Notes / Constraints / Caveats](#notes--constraints--caveats)
  - [Risks and Mitigations](#risks-and-mitigations)
- [Design Details](#design-details)
  - [Kubelet wiring (container manager ↔ topology manager)](#kubelet-wiring-container-manager--topology-manager)
  - [Test Plan](#test-plan)
    - [Unit tests](#unit-tests)
    - [Integration / e2e](#integration--e2e)
  - [Rollout and Documentation](#rollout-and-documentation)
  - [Graduation Criteria](#graduation-criteria)
- [Drawbacks](#drawbacks)
- [Alternatives](#alternatives)
- [Implementation History](#implementation-history)
<!-- /toc -->

## Release Signoff Checklist

Items marked with (R) are required *prior to targeting to a milestone / release*.

- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements]
- [ ] (R) KEP approvers have approved the KEP status as `implementable`
- [ ] (R) Design details are appropriately documented
- [ ] (R) Test plan is in place
- [ ] (R) Graduation criteria is in place
- [ ] (R) Production readiness review completed
- [ ] (R) Production readiness review approved

[kubernetes/enhancements]: https://git.k8s.io/enhancements

## Summary

Clusters that enforce strict NUMA locality with `topologyManagerPolicy: single-numa-node`
often give up **workload density**: pods fail to schedule even when the node has **enough CPUs
in total**, because exclusive CPUs are fragmented across NUMA nodes.
Among **tied**, equally valid single-NUMA merged outcomes, Topology Manager today breaks ties
with **bitmask / `Narrowest`** ordering (equal width → **smaller mask / lower NUMA id**), **not**
with **free or contiguous** exclusive-CPU headroom—so it cannot prefer the choice that leaves
**more room for the next** large single-NUMA pod.

This KEP proposes an **optional** Topology Manager policy option **`prefer-most-allocated-numa-node`**
so that, under **`single-numa-node`**, kubelet can break ties using **kubelet-local** signals
(static CPU and Memory managers)—in the spirit of **“most allocated”** packing: favor placing the pod
on the NUMA node that **preserves a larger contiguous exclusive-CPU hole** on the **other** NUMA node
for the **next** workload. Default behavior stays **unchanged** when the option is off.

## Motivation

### Problem statement

- **Why `best-effort` provides “more capacity”:** The same Guaranteed workload with
  integer CPUs can be admitted under `best-effort` with **cross-NUMA** CPU placement.
  `single-numa-node` rejects it unless the request fits on **one** NUMA node; the difference is
  topology locality, not raw millicores.
- **Observed failure mode:** Workloads pinned to one NUMA node (devices, hugepages, etc.) make
  the node **asymmetric**. For **new** pods whose merged hints allow **either** NUMA node,
  **always preferring the lower NUMA id** reduces the overall workload density on the node.

### Illustrative NUMA packing

The two figures below show why **where** work lands on each NUMA node affects **single-NUMA**
workloads that need a **contiguous** block. Pod B is NUMA-pinned due to device affinity.

*Capacity-not-aware (today):* among valid NUMA nodes, the merger **prefers the lower NUMA id**
(via bitmask ordering) when there are **no** stronger hints from devices, hugepages, or
static memory—**without** considering the remaining contiguous exclusive-CPU headroom.

*With `prefer-most-allocated-numa-node` (proposed):* tie-break scoring favors an outcome in the
**most-allocated / consolidate** spirit—**concentrating** use so the **other** NUMA node keeps a
**larger contiguous** exclusive-CPU region—often better for the **next** Guaranteed pod under
the static CPU manager.

### Goals

- **Improve workload density** (better use of nodes)
  for operators who use **`single-numa-node`** for strict locality (as in Telco deployments).
- Integrate via **`TopologyManagerPolicyOptions`**, using the same **feature gate /
  graduation pattern** as other topology policy options.

### Non-Goals

- Reimplementing **`NodeResourcesFit`** or other scheduler scoring. The NUMA-level scoring
  **aligns with** the scheduler's `mostRequestedScore` formula but does not import or
  duplicate scheduler code.

**Side benefit:** Because the default kube-scheduler is **not NUMA-aware**, it may place a
pod on a node whose aggregate resources look sufficient, only for kubelet to **reject** the
pod when no single NUMA node has enough contiguous free space. By acting as a **local
defragmenter**—consolidating smaller workloads on one NUMA node to preserve larger contiguous
free regions on the other—this option reduces such "last-mile" admission failures and the
wasted scheduling cycles they cause.

## Proposal

### User Stories

1. **As a cluster operator** running `single-numa-node` with static CPU and Memory managers,
   I want **better density** under strict single-NUMA locality—without switching to
   **`best-effort`** (which results in cross-NUMA CPUs).
2. **As a platform engineer**, I want an **opt-in** policy option so existing clusters keep
   today’s behavior unless they enable the new option.

### Proposed API

Introduce a new Topology Manager policy option **`prefer-most-allocated-numa-node`**, configurable
via kubelet configuration alongside existing options:
- Only in effect when **`topologyManagerPolicy` is `single-numa-node`**.
- Gated via `TopologyManagerPolicyOptions` with the same alpha / beta feature gates as other new options.

When the option is **disabled** (default), behavior remains **unchanged** from today.

```yaml
kind: KubeletConfiguration
apiVersion: kubelet.config.k8s.io/v1beta1
featureGates:
  ...
  TopologyManagerPolicyAlphaOptions: true
topologyManagerPolicyOptions:
  prefer-most-allocated-numa-node: "true"
topologyManagerPolicy: single-numa-node
memoryManagerPolicy: Static
cpuManagerPolicy: static
...
```

### Algorithm

**Trigger:** `topologyManagerPolicy` is **`single-numa-node`**, **`prefer-most-allocated-numa-node`**
is set in **`topologyManagerPolicyOptions`** (with the required feature gates), and the hint merger is
comparing **preferred** merged hints whose NUMA affinity is a **single** NUMA node (bit count 1).

**Scoring (kubelet aggregator):**

Each signal computes a **utilization score** per NUMA node using the same formula as
kube-scheduler's `MostAllocated` plugin: `score = (assigned × 100) / allocatable`,
where **allocatable** accounts for reserved resources so the ratio reflects true
utilization.

1. **CPU signal (static CPU manager):** Score each NUMA node by exclusive-CPU utilization
   (assigned / allocatable, where allocatable excludes reserved CPUs). The higher score wins.
   Equal scores or a non-static policy → undecided.
2. **Memory signal (static memory manager):** Score each NUMA node by regular-memory utilization
   (assigned / allocatable, where allocatable excludes per-NUMA reserved memory). The higher score
   wins. Equal scores or a non-static policy → undecided.
3. **Combine:** Neither decides → **`Narrowest`** fallback. One decides → use it.
   Both decide and agree → use it. Both decide but disagree → **`Narrowest`** fallback.
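
The scoring and combine rules above can be sketched in Go. This is an illustrative sketch only: all identifiers (`mostAllocatedScore`, `compareSignal`, `combine`, and the `comparisonResult` constants) are hypothetical names, not the actual kubelet symbols.

```go
package main

import "fmt"

// comparisonResult mirrors the three-way outcome each signal can return.
type comparisonResult int

const (
	undecided comparisonResult = iota // signal cannot break the tie
	preferA                           // prefer the first NUMA node
	preferB                           // prefer the second NUMA node
)

// mostAllocatedScore follows kube-scheduler's MostAllocated formula:
// score = assigned * 100 / allocatable, where allocatable excludes
// reserved resources so the ratio reflects true utilization.
func mostAllocatedScore(assigned, allocatable int64) int64 {
	if allocatable <= 0 {
		return 0
	}
	return assigned * 100 / allocatable
}

// compareSignal scores two NUMA nodes and prefers the more utilized one,
// staying undecided on a tie (rules 1 and 2 above).
func compareSignal(assignedA, allocA, assignedB, allocB int64) comparisonResult {
	scoreA := mostAllocatedScore(assignedA, allocA)
	scoreB := mostAllocatedScore(assignedB, allocB)
	switch {
	case scoreA > scoreB:
		return preferA
	case scoreB > scoreA:
		return preferB
	default:
		return undecided
	}
}

// combine applies rule 3: one signal decides, or both agree; a disagreement
// (or a double undecided) is reported as undecided, meaning Narrowest fallback.
func combine(cpu, mem comparisonResult) comparisonResult {
	switch {
	case cpu == undecided:
		return mem
	case mem == undecided:
		return cpu
	case cpu == mem:
		return cpu
	default: // both decided but disagree → fall back to Narrowest
		return undecided
	}
}

func main() {
	// NUMA0: 6 of 14 allocatable exclusive CPUs assigned; NUMA1: 2 of 16.
	fmt.Println(compareSignal(6, 14, 2, 16) == preferA) // true: NUMA0 is more utilized
	fmt.Println(combine(preferA, undecided) == preferA) // true: single signal decides
	fmt.Println(combine(preferA, preferB) == undecided) // true: disagreement → fallback
}
```

Note how asymmetric reserved resources matter: two NUMA nodes with equal assigned counts can still score differently if their allocatable amounts differ, which is exactly why the formula divides by allocatable rather than capacity.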

**Interactions:**

- **Scheduler / descheduler:** Not a substitute for this option. The **scheduler** chooses a
  **node**; it does not run Topology Manager or finalize **static** CPU / memory **NUMA** placement
  on the node. The **descheduler** evicts pods so they can be scheduled again. This KEP answers
  the question "which single-NUMA outcome wins when several are equivalent?".
- **Merge:** Still driven by hint providers; this step only affects **which** valid
  single-NUMA preferred outcome wins when multiple exist.

### Notes / Constraints / Caveats

- Requires the **static** CPU Manager and Memory Manager policies so **per-NUMA** signals are meaningful.

### Risks and Mitigations

| Risk | Mitigation |
|------|------------|
| Admission latency | Run scoring only on the tie path; reuse cached summaries where possible |
| Behavior surprise when enabled | Opt-in; metrics `topology_manager_admission_*` already exist |

## Design Details

### Kubelet wiring (container manager ↔ topology manager)

Topology Manager is created **before** the CPU and Memory managers, so those dependencies are
not available at `NewManager` time. Inside the **container manager**, after the CPU and Memory
managers are constructed (and registered as topology hint providers), kubelet builds a **preferred
NUMA tie-breaker** object that **holds references** to those two managers and registers it via
**`TopologyManager.SetPreferredSingleNUMATieBreaker`**. The topology **Manager** forwards it to the
**Scope**, which stores the object on **`singleNumaNodePolicy`** when the active policy is
**`single-numa-node`**; other policies ignore the registration.

When a pod reaches topology admission and the hint **merger** takes the tie path (two **preferred**
hints with **single-bit** NUMA masks) and **`PreferMostAllocatedNUMANode`** is **true** in
**`PolicyOptions`**, the merger calls **`ComparePreferredSingleNUMAForTopology`** on the stored
tie-breaker. That implementation delegates to the **CPU** and **Memory** managers’
**`ComparePreferredSingleNUMAForTopology`** methods, then applies the **agree / single-signal /
disagree→fallback** rules described under **Algorithm**.
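
A minimal sketch of this wiring, using the method name from the text but with hypothetical types and signatures (`singleNUMAComparer`, `preferredNUMATieBreaker`, `fixedComparer`, and the int-based return convention are all illustrative, not the real kubelet interfaces):

```go
package main

import "fmt"

// singleNUMAComparer is the shape implemented by the static CPU and Memory
// managers in this sketch: negative → prefer numaA, positive → prefer numaB,
// zero → this signal is undecided.
type singleNUMAComparer interface {
	ComparePreferredSingleNUMAForTopology(numaA, numaB int) int
}

// preferredNUMATieBreaker holds references to both managers, as the container
// manager would after constructing them and registering the tie-breaker.
type preferredNUMATieBreaker struct {
	cpu, memory singleNUMAComparer
}

// ComparePreferredSingleNUMAForTopology applies the agree / single-signal /
// disagree→fallback rules; returning zero tells the merger to fall back
// to the default Narrowest behavior.
func (t *preferredNUMATieBreaker) ComparePreferredSingleNUMAForTopology(numaA, numaB int) int {
	c := t.cpu.ComparePreferredSingleNUMAForTopology(numaA, numaB)
	m := t.memory.ComparePreferredSingleNUMAForTopology(numaA, numaB)
	switch {
	case c == 0:
		return m // CPU undecided → memory decides (or stays undecided)
	case m == 0:
		return c // memory undecided → CPU decides
	case (c < 0) == (m < 0):
		return c // both decided and they agree
	default:
		return 0 // both decided but disagree → Narrowest fallback
	}
}

// fixedComparer is a stub signal used only for demonstration.
type fixedComparer int

func (f fixedComparer) ComparePreferredSingleNUMAForTopology(_, _ int) int { return int(f) }

func main() {
	tb := &preferredNUMATieBreaker{cpu: fixedComparer(-1), memory: fixedComparer(0)}
	fmt.Println(tb.ComparePreferredSingleNUMAForTopology(0, 1)) // CPU signal decides alone

	tb = &preferredNUMATieBreaker{cpu: fixedComparer(-1), memory: fixedComparer(1)}
	fmt.Println(tb.ComparePreferredSingleNUMAForTopology(0, 1)) // disagreement → fallback
}
```

The key design point the sketch captures is the inversion of construction order: because Topology Manager exists first, the tie-breaker is injected later via a setter rather than passed to `NewManager`.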

### Test Plan

[X] Owners of involved components may require updates to existing tests before merge.

#### Unit tests

Five test suites cover each layer of the feature:

1. **CPU signal scoring** (`pkg/kubelet/cm/cpumanager/cpu_compare_preferred_scoring_test.go`) —
   Verifies the CPU manager's utilization-based comparison: equal utilization → undecided,
   higher utilization → prefer that NUMA node, non-static policy → undecided, and asymmetric
   reserved CPUs correctly change the outcome even when raw assigned counts are equal.

2. **Memory signal scoring** (`pkg/kubelet/cm/memorymanager/memory_compare_preferred_scoring_test.go`) —
   Verifies the memory manager's utilization-based comparison: equal utilization → undecided,
   higher utilization → prefer that NUMA node, non-static policy → undecided, and asymmetric
   per-NUMA reserved memory correctly changes the outcome even when raw assigned bytes are equal.

3. **Aggregator combine rules** (`pkg/kubelet/cm/preferred_numa_tiebreak_test.go`) —
   Stub-driven tests for the combine logic: both undecided, CPU-only, memory-only,
   agree, and disagree→fallback.

4. **Topology Manager merge** (`pkg/kubelet/cm/topologymanager/policy_prefer_most_allocated_test.go`) —
   End-to-end merge through `singleNumaNodePolicy`: the tie-breaker overrides the default, and
   an absent tie-breaker falls back to Narrowest.

5. **Policy option gating** (`pkg/kubelet/cm/topologymanager/policy_options_test.go`) —
   `prefer-most-allocated-numa-node` is accepted only with `TopologyManagerPolicyAlphaOptions`
   enabled; rejected otherwise.

#### Integration / e2e

- Multi-NUMA **static** CPU + `single-numa-node`: ordered admission of Guaranteed pods
  (no devices) to validate **baseline** vs **option** NUMA choice when ties exist.
- Scenarios where **one** NUMA node already holds an exclusive / device-bound pod and pods could
  be admitted to **either** NUMA node—assert the option changes **which** NUMA node wins vs the
  low-index default.

### Rollout and Documentation

- Alpha: new option behind `TopologyManagerPolicyAlphaOptions`.
- User-facing docs: kubelet configuration reference, relationship to `single-numa-node` and
  the static managers.
- Release notes per phase.

### Graduation Criteria

- **Alpha:** Implementation + unit tests; documented semantics and known limitations.
- **Beta:** e2e signal on multi-NUMA CI; no major semantic surprises from production feedback.
- **GA:** Sufficient soak; option promotion per SIG Node policy for policy options.

## Drawbacks

- More logic and coupling between Topology Manager and CPU/Memory manager state.

## Alternatives

1. **Status quo:** Keep the bitmask / low-NUMA-id tie-break; accept lower density on **asymmetric**
   nodes when symmetric pods **always** stack on the lower-id NUMA node first.

## Implementation History

- 2026-04-07: Draft created.