kubernetes · jingczhang · Apr 7, 2026
diff --git a/keps/sig-node/6007-prefer-most-allocated-numa-node/README.md b/keps/sig-node/6007-prefer-most-allocated-numa-node/README.md
@@ -0,0 +1,267 @@
+# KEP-6007: Add Topology Manager option to improve workload density for single-numa-node
+
+[enhancement tracking issue]: https://github.com/kubernetes/enhancements/issues/6007
+
+<!-- toc -->
+- [Release Signoff Checklist](#release-signoff-checklist)
+- [Summary](#summary)
+- [Motivation](#motivation)
+  - [Problem statement](#problem-statement)
+  - [Illustrative NUMA packing](#illustrative-numa-packing)
+  - [Goals](#goals)
+  - [Non-Goals](#non-goals)
+- [Proposal](#proposal)
+  - [User Stories](#user-stories)
+  - [Proposed API](#proposed-api)
+  - [Algorithm](#algorithm)
+  - [Notes / Constraints / Caveats](#notes--constraints--caveats)
+  - [Risks and Mitigations](#risks-and-mitigations)
+- [Design Details](#design-details)
+  - [Kubelet wiring (container manager ↔ topology manager)](#kubelet-wiring-container-manager--topology-manager)
+  - [Test Plan](#test-plan)
+    - [Unit tests](#unit-tests)
+    - [Integration / e2e](#integration--e2e)
+  - [Rollout and Documentation](#rollout-and-documentation)
+  - [Graduation Criteria](#graduation-criteria)
+- [Drawbacks](#drawbacks)
+- [Alternatives](#alternatives)
+- [Implementation History](#implementation-history)
+<!-- /toc -->
+
+## Release Signoff Checklist
+
+Items marked with (R) are required *prior to targeting to a milestone / release*.
+
+- [ ] (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements]
+- [ ] (R) KEP approvers have approved the KEP status as `implementable`
+- [ ] (R) Design details are appropriately documented
+- [ ] (R) Test plan is in place
+- [ ] (R) Graduation criteria is in place
+- [ ] (R) Production readiness review completed
+- [ ] (R) Production readiness review approved
+
+[kubernetes/enhancements]: https://git.k8s.io/enhancements
+
+## Summary
+
+Clusters that enforce strict NUMA locality with `topologyManagerPolicy: single-numa-node`
+often give up **workload density**: pods are rejected at admission even when the node has
+**enough CPUs in total**, because free exclusive CPUs are fragmented across NUMA nodes.
+Among **tied**, equally valid single-NUMA merged hints, Topology Manager today breaks ties
+by **bitmask comparison (`Narrowest`)**: at equal width, the **smaller mask (lower NUMA id)**
+wins. It does **not** consider **free or contiguous** exclusive-CPU headroom, so it cannot
+prefer the choice that leaves **more room for the next** large single-NUMA pod.
+
+This KEP proposes an **optional** Topology Manager policy option **`prefer-most-allocated-numa-node`**
+so that, under **`single-numa-node`**, kubelet can break ties using **kubelet-local** signals
+from the static CPU and Memory managers, in the spirit of **“most allocated”** packing: favor
+placing the pod on the NUMA node that **preserves a larger contiguous exclusive-CPU hole** on
+the other NUMA node for the **next** workload. Default behavior stays **unchanged** when the
+option is off.
+
+## Motivation
+
+### Problem statement
+
+- **Why `best-effort` provides “more capacity”:** The same Guaranteed workload with
+  integer CPUs can be admitted under `best-effort` with **cross-NUMA** CPU placement.
+  `single-numa-node` rejects the pod unless the request fits on **one** NUMA node; the
+  difference is topology locality, not raw CPU capacity.
+- **Observed failure mode:** Workloads pinned to one NUMA node (devices, hugepages, etc.) make
+  the node **asymmetric**. For **new** pods whose merged hints allow **either** NUMA node,
+  **always preferring the lower NUMA id** reduces the overall workload density on the node.
+
+### Illustrative NUMA packing
+
+The two figures below show why **where** work lands on each NUMA node affects **single-NUMA**
+workloads that need a **contiguous** block. Pod B is NUMA-pinned due to device affinity.
+
+![Illustration: capacity-not-aware NUMA selection (today’s tie-break).](capacity-not-aware.png)
+
+*Capacity-not-aware (today):* among valid NUMAs, the merger **prefers the lower NUMA id**
+(via bitmask ordering) when there are **no** stronger hints from devices, hugepages, or
+static memory—**without** considering remaining contiguous exclusive-CPU headroom.
+
+![Illustration: prefer-most-allocated style selection leaves a larger contiguous free region on one NUMA.](capacity-aware.png)
+
+*With `prefer-most-allocated-numa-node` (proposed):* tie-break scoring favors an outcome in the
+**most-allocated / consolidate** spirit—**concentrating** use so the **other** NUMA keeps a
+**larger contiguous** exclusive-CPU region—often better for the **next** Guaranteed pod under
+static CPU.
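The effect described by the two captions can be reproduced with a small simulation. This is only an illustrative sketch: the CPU counts are made up rather than taken from the figures, the function names are hypothetical, both NUMA nodes are assumed to have equal capacity (so "least free" equals "most allocated"), and contiguity is ignored.

```go
package main

import "fmt"

// pickLowestID mimics today's tie-break: the lowest NUMA id that fits wins.
func pickLowestID(free []int, need int) int {
	for id, f := range free {
		if f >= need {
			return id
		}
	}
	return -1
}

// pickMostAllocated picks the fitting NUMA node with the fewest free CPUs.
func pickMostAllocated(free []int, need int) int {
	best := -1
	for id, f := range free {
		if f >= need && (best == -1 || f < free[best]) {
			best = id
		}
	}
	return best
}

// admit places each request in order and returns the chosen NUMA id per
// request, or -1 when no single NUMA node can hold it.
func admit(pick func([]int, int) int, free []int, requests []int) []int {
	free = append([]int(nil), free...) // work on a copy
	placed := make([]int, 0, len(requests))
	for _, need := range requests {
		id := pick(free, need)
		if id >= 0 {
			free[id] -= need
		}
		placed = append(placed, id)
	}
	return placed
}

func main() {
	// Two NUMA nodes with 8 exclusive CPUs each; pod B already holds 4 CPUs on NUMA 1.
	free := []int{8, 4}
	// Pod A asks for 4 CPUs, then pod C asks for 8 CPUs.
	requests := []int{4, 8}
	fmt.Println("lowest id:     ", admit(pickLowestID, free, requests))
	fmt.Println("most allocated:", admit(pickMostAllocated, free, requests))
}
```

With the lowest-id rule, pod A lands on NUMA 0 and pod C no longer fits on any single NUMA node; with the most-allocated rule, pod A joins pod B on NUMA 1 and pod C fits on NUMA 0.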
+
+### Goals
+
+- **Improve workload density** (better node utilization) for operators who use
+  **`single-numa-node`** for strict locality (for example, in telco deployments).
+- Integrate via **`TopologyManagerPolicyOptions`**, using the same **feature gate /
+  graduation pattern** as for other topology policy options.
+
+### Non-Goals
+
+- Reimplementing **`NodeResourcesFit`** or other scheduler scoring. The NUMA-level scoring
+  **aligns with** the formula behind kube-scheduler's `MostAllocated` scoring strategy
+  (`mostRequestedScore`) but does not import or duplicate scheduler code.
+
+**Side benefit:** Because the default kube-scheduler is **not NUMA-aware**, it may place a
+pod on a node whose aggregate resources look sufficient, only for kubelet to **reject** the
+pod when no single NUMA node has enough contiguous free space. By acting as a **local
+defragmenter**—consolidating smaller workloads on one NUMA to preserve larger contiguous
+free regions on the other—this option reduces such "last-mile" admission failures and the
+wasted scheduling cycles they cause.
+
+## Proposal
+
+### User Stories
+
+1. **As a cluster operator** running `single-numa-node` with static CPU and Memory managers,
+   I want **better density** under strict single-NUMA locality—without switching to
+   **`best-effort`** (which results in cross-NUMA CPUs).
+2. **As a platform engineer**, I want an **opt-in** policy option so existing clusters keep
+   today’s behavior unless they enable the new option.
+
+### Proposed API
+
+Introduce a new Topology Manager policy option **`prefer-most-allocated-numa-node`**, configurable
+via kubelet configuration alongside existing options:
+- Only in effect when **`topologyManagerPolicy` is `single-numa-node`**.
+- Set through `TopologyManagerPolicyOptions` and guarded by the same alpha / beta option
+  feature gates as other new options (`TopologyManagerPolicyAlphaOptions` while alpha).
+
+When the option is **disabled** (default), behavior remains **unchanged** from today.
+
+```yaml
+kind: KubeletConfiguration
+apiVersion: kubelet.config.k8s.io/v1beta1
+featureGates:
+  # ... (other feature gates elided)
+  TopologyManagerPolicyAlphaOptions: true
+topologyManagerPolicyOptions:
+  prefer-most-allocated-numa-node: "true"
+topologyManagerPolicy: single-numa-node
+memoryManagerPolicy: Static
+cpuManagerPolicy: static
+# ... (remaining kubelet settings elided)
+```
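How the option string could be validated against the alpha gate is sketched below. This is a minimal, self-contained illustration, not kubelet code: `parseOptions`, `alphaOptions`, and the boolean `alphaGateEnabled` parameter are hypothetical stand-ins for the real option parsing and feature-gate lookup; only the option name and the `PreferMostAllocatedNUMANode` field come from this KEP.

```go
package main

import (
	"fmt"
	"strconv"
)

const preferMostAllocated = "prefer-most-allocated-numa-node"

// alphaOptions lists the option names this sketch treats as alpha.
var alphaOptions = map[string]bool{preferMostAllocated: true}

// PolicyOptions carries the parsed Topology Manager policy options.
type PolicyOptions struct {
	PreferMostAllocatedNUMANode bool
}

// parseOptions rejects unknown options, rejects alpha options when the alpha
// gate is off, and parses the string value as a boolean.
func parseOptions(opts map[string]string, alphaGateEnabled bool) (PolicyOptions, error) {
	var out PolicyOptions
	for name, value := range opts {
		if !alphaOptions[name] {
			return out, fmt.Errorf("unknown Topology Manager policy option %q", name)
		}
		if !alphaGateEnabled {
			return out, fmt.Errorf("option %q requires TopologyManagerPolicyAlphaOptions", name)
		}
		enabled, err := strconv.ParseBool(value)
		if err != nil {
			return out, fmt.Errorf("bad value %q for option %q: %w", value, name, err)
		}
		if name == preferMostAllocated {
			out.PreferMostAllocatedNUMANode = enabled
		}
	}
	return out, nil
}

func main() {
	opts := map[string]string{preferMostAllocated: "true"}

	// Gate off: the option is rejected.
	_, err := parseOptions(opts, false)
	fmt.Println("gate off, rejected:", err != nil)

	// Gate on: the option is accepted and enabled.
	po, err := parseOptions(opts, true)
	fmt.Println("gate on:", po.PreferMostAllocatedNUMANode, err)
}
```

An empty option map yields the zero value, which matches the "disabled by default" behavior stated above.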
+
+### Algorithm
+
+**Trigger:** `topologyManagerPolicy` is **`single-numa-node`**, **`prefer-most-allocated-numa-node`**
+is set in **`topologyManagerPolicyOptions`** (with required feature gates), and the hint merger is
+comparing **preferred** merged hints whose NUMA affinity is a **single** NUMA node (bit count 1).
+
+**Scoring (kubelet aggregator):**
+
+Each signal computes a **utilization score** per NUMA node using the same formula as the
+`MostAllocated` scoring strategy of kube-scheduler's `NodeResourcesFit` plugin:
+`score = (assigned × 100) / allocatable`, where **allocatable** excludes reserved
+resources so the ratio reflects true utilization.
+
+1. **CPU signal (static CPU manager):** Score each NUMA by exclusive-CPU utilization
+   (assigned / allocatable, where allocatable excludes reserved CPUs). Higher score wins.
+   Equal scores or non-static policy → undecided.
+2. **Memory signal (static memory manager):** Score each NUMA by regular-memory utilization
+   (assigned / allocatable, where allocatable excludes per-NUMA reserved memory). Higher score
+   wins. Equal scores or non-static policy → undecided.
+3. **Combine:** Neither decides → **`Narrowest`** fallback. One decides → use it.
+   Both decide and agree → use it. Both decide but disagree → **`Narrowest`** fallback.
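The scoring formula and the three combine rules above can be sketched as follows. All type and function names here are hypothetical, and integer division is an assumption made to mirror the scheduler-style formula; the KEP text does not specify rounding.

```go
package main

import "fmt"

// Decision is the outcome of comparing two candidate NUMA nodes.
type Decision int

const (
	Undecided Decision = iota // the signal cannot pick a winner
	PreferA                   // the first candidate wins
	PreferB                   // the second candidate wins
)

// Usage holds assigned and allocatable amounts for one NUMA node
// (exclusive CPUs, or bytes of regular memory). Allocatable already
// excludes reserved resources.
type Usage struct {
	Assigned, Allocatable int64
}

// utilizationScore computes (assigned * 100) / allocatable.
func utilizationScore(u Usage) int64 {
	if u.Allocatable <= 0 {
		return 0
	}
	return u.Assigned * 100 / u.Allocatable
}

// compareSignal prefers the NUMA node with the higher utilization score.
// A non-static policy or equal scores yields Undecided.
func compareSignal(static bool, a, b Usage) Decision {
	if !static {
		return Undecided
	}
	sa, sb := utilizationScore(a), utilizationScore(b)
	switch {
	case sa > sb:
		return PreferA
	case sb > sa:
		return PreferB
	default:
		return Undecided
	}
}

// combine applies the single-signal / agree / disagree rules. Undecided
// tells the caller to fall back to the Narrowest comparison.
func combine(cpu, mem Decision) Decision {
	switch {
	case cpu == Undecided:
		return mem
	case mem == Undecided || mem == cpu:
		return cpu
	default:
		return Undecided
	}
}

func main() {
	// CPU: NUMA A has 2 of 14 allocatable CPUs assigned, NUMA B has 6 of 16.
	cpu := compareSignal(true, Usage{2, 14}, Usage{6, 16})
	// Memory: both NUMA nodes are equally utilized, so this signal is undecided.
	mem := compareSignal(true, Usage{4 << 30, 32 << 30}, Usage{4 << 30, 32 << 30})
	fmt.Println(cpu == PreferB, mem == Undecided, combine(cpu, mem) == PreferB)
}
```

Because the score is truncated to an integer in this sketch, two NUMA nodes with slightly different ratios can still tie and fall through to `Narrowest`.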
+
+**Interactions:**
+
+- **Scheduler / descheduler:** Neither is a substitute for this option. The **scheduler** chooses a
+  **node**; it does not run Topology Manager or finalize **static** CPU / memory **NUMA** placement
+  on the node. The **descheduler** only evicts pods so that they can be scheduled again. This KEP
+  answers a different question: "which single-NUMA outcome wins when several are equivalent?"
+- **Merge:** Still driven by hint providers; this step only affects **which** valid
+  single-NUMA preferred outcome wins when multiple exist.
+
+
+### Notes / Constraints / Caveats
+
+- Requires **static** CPU Manager and Memory Manager so **per-NUMA** signals are meaningful.
+
+### Risks and Mitigations
+
+| Risk | Mitigation |
+|------|------------|
+| Admission latency | Run scoring only on the tie path; reuse cached summaries where possible |
+| Behavior surprise when enabled | Opt-in; metrics `topology_manager_admission_*` already exist |
+
+## Design Details
+
+### Kubelet wiring (container manager ↔ topology manager)
+
+Topology Manager is created **before** the CPU and Memory managers, so those dependencies are
+not available at `NewManager` time. Inside the **container manager**, after the CPU and Memory
+managers are constructed (and registered as topology hint providers), kubelet builds a **preferred
+NUMA tie-breaker** object that **holds references** to those two managers and registers it via
+**`TopologyManager.SetPreferredSingleNUMATieBreaker`**. The topology **Manager** forwards to
+**Scope**, which stores the object on **`singleNumaNodePolicy`** when the active policy is
+**`single-numa-node`**; other policies ignore registration.
+
+When a pod reaches topology admission and the hint **merger** takes the tie path (two **preferred**
+hints with **single-bit** NUMA masks) and **`PreferMostAllocatedNUMANode`** is **true** in
+**`PolicyOptions`**, the merger calls **`ComparePreferredSingleNUMAForTopology`** on the stored
+tie-breaker. That implementation delegates to the **CPU** and **Memory** managers’
+**`ComparePreferredSingleNUMAForTopology`** methods, then applies the **agree / single-signal /
+disagree→fallback** rules described under **Algorithm**.
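The late-binding wiring described above can be sketched with a small interface. This is an assumption-laden illustration: the method names `SetPreferredSingleNUMATieBreaker` and `ComparePreferredSingleNUMAForTopology` come from this KEP, but their signatures, the `Decision` type, the `numaComparer` interface, the `fixed` stub, and the lower-id fallback standing in for `Narrowest` are all hypothetical.

```go
package main

import "fmt"

// Decision is the outcome of comparing two candidate NUMA nodes.
type Decision int

const (
	Undecided Decision = iota
	PreferA
	PreferB
)

// numaComparer is the comparison hook assumed to be offered by the CPU
// manager, the Memory manager, and the aggregating tie-breaker.
type numaComparer interface {
	ComparePreferredSingleNUMAForTopology(numaA, numaB int) Decision
}

// tieBreaker holds references to both managers and combines their answers.
type tieBreaker struct{ cpu, mem numaComparer }

func (t *tieBreaker) ComparePreferredSingleNUMAForTopology(a, b int) Decision {
	c := t.cpu.ComparePreferredSingleNUMAForTopology(a, b)
	m := t.mem.ComparePreferredSingleNUMAForTopology(a, b)
	switch {
	case c == Undecided:
		return m
	case m == Undecided || m == c:
		return c
	}
	return Undecided // disagreement: let the caller fall back
}

// singleNumaNodePolicy is created before the managers exist, so the
// tie-breaker is injected later through a setter.
type singleNumaNodePolicy struct {
	preferMostAllocated bool
	tieBreaker          numaComparer
}

func (p *singleNumaNodePolicy) SetPreferredSingleNUMATieBreaker(tb numaComparer) {
	p.tieBreaker = tb
}

// pick resolves a tie between two single-NUMA preferred hints.
func (p *singleNumaNodePolicy) pick(a, b int) int {
	if p.preferMostAllocated && p.tieBreaker != nil {
		switch p.tieBreaker.ComparePreferredSingleNUMAForTopology(a, b) {
		case PreferA:
			return a
		case PreferB:
			return b
		}
	}
	// Fallback standing in for Narrowest: the lower NUMA id wins.
	if a < b {
		return a
	}
	return b
}

// fixed is a stub manager that always returns the same decision.
type fixed Decision

func (f fixed) ComparePreferredSingleNUMAForTopology(_, _ int) Decision { return Decision(f) }

func main() {
	p := &singleNumaNodePolicy{preferMostAllocated: true}
	fmt.Println("before registration:", p.pick(0, 1))

	p.SetPreferredSingleNUMATieBreaker(&tieBreaker{cpu: fixed(PreferB), mem: fixed(Undecided)})
	fmt.Println("after registration: ", p.pick(0, 1))
}
```

Injecting the tie-breaker through a setter keeps Topology Manager construction independent of the CPU and Memory managers, at the cost of a nil check on the tie path.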
+
+### Test Plan
+
+[X] Owners of involved components may require updates to existing tests before merge.
+
+#### Unit tests
+
+Five test suites cover each layer of the feature:
+
+1. **CPU signal scoring** (`pkg/kubelet/cm/cpumanager/cpu_compare_preferred_scoring_test.go`) —
+   Verifies the CPU manager's utilization-based comparison: equal utilization → undecided,
+   higher utilization → prefer that NUMA, non-static policy → undecided, and asymmetric
+   reserved CPUs correctly change the outcome even when raw assigned counts are equal.
+
+2. **Memory signal scoring** (`pkg/kubelet/cm/memorymanager/memory_compare_preferred_scoring_test.go`) —
+   Verifies the memory manager's utilization-based comparison: equal utilization → undecided,
+   higher utilization → prefer that NUMA, non-static policy → undecided, and asymmetric
+   per-NUMA reserved memory correctly changes the outcome even when raw assigned bytes are equal.
+
+3. **Aggregator combine rules** (`pkg/kubelet/cm/preferred_numa_tiebreak_test.go`) —
+   Stub-driven tests for the combine logic: both undecided, CPU-only, memory-only,
+   agree, and disagree→fallback.
+
+4. **Topology Manager merge** (`pkg/kubelet/cm/topologymanager/policy_prefer_most_allocated_test.go`) —
+   End-to-end merge through `singleNumaNodePolicy`: tie-breaker overrides default, and
+   absent tie-breaker falls back to Narrowest.
+
+5. **Policy option gating** (`pkg/kubelet/cm/topologymanager/policy_options_test.go`) —
+   `prefer-most-allocated-numa-node` accepted only with `TopologyManagerPolicyAlphaOptions`
+   enabled; rejected otherwise.
+
+#### Integration / e2e
+
+- Multi-NUMA **static** CPU + `single-numa-node`: ordered admission of Guaranteed pods
+  (no devices) to validate **baseline** vs **option** NUMA choice when ties exist.
+- Scenarios where **one** NUMA already holds an exclusive / device-bound pod and pods could
+  admit to **either** NUMA—assert the option changes **which** NUMA wins vs low-index default.
+
+### Rollout and Documentation
+
+- Alpha: new option behind `TopologyManagerPolicyAlphaOptions`.
+- User-facing docs: kubelet configuration reference, relationship to `single-numa-node` and
+  static managers.
+- Release notes per phase.
+
+### Graduation Criteria
+
+- **Alpha:** Implementation + unit tests; documented semantics and known limitations.
+- **Beta:** e2e signal on multi-NUMA CI; no major semantic surprises from production feedback.
+- **GA:** Sufficient soak; option promotion per SIG Node policy for policy options.
+
+## Drawbacks
+
+- More logic and coupling between Topology Manager and CPU/Memory manager state.
+
+## Alternatives
+
+1. **Status quo:** Keep the bitmask / low-NUMA-id tie-break; accept lower density on **asymmetric**
+   nodes, where pods that could land on either NUMA node **always** stack on the lower-id one first.
+
+## Implementation History
+
+- 2026-04-07: Draft created.
diff --git a/keps/sig-node/6007-prefer-most-allocated-numa-node/capacity-aware.png b/keps/sig-node/6007-prefer-most-allocated-numa-node/capacity-aware.png
diff --git a/keps/sig-node/6007-prefer-most-allocated-numa-node/capacity-not-aware.png b/keps/sig-node/6007-prefer-most-allocated-numa-node/capacity-not-aware.png
diff --git a/keps/sig-node/6007-prefer-most-allocated-numa-node/kep.yaml b/keps/sig-node/6007-prefer-most-allocated-numa-node/kep.yaml
@@ -0,0 +1,25 @@
+title: Add Topology Manager option to improve workload density for single-numa-node
+kep-number: 6007
+authors:
+  - "jingczhang"
+  - "saipranav36"
+owning-sig: sig-node
+participating-sigs: []
+status: provisional
+creation-date: "2026-04-07"
+reviewers: []
+approvers: 
+  - "@sig-node-tech-leads"
+see-also: []
+replaces: []
+latest-milestone: ""
+milestone: 
+  alpha: "v1.37"
+  beta: "v1.38"
+  stable: "v1.40"
+feature-gates:
+  - name: "TopologyManagerPolicyAlphaOptions"
+    components:
+      - kubelet
+disable-supported: true
+metrics: []