diff --git a/keps/sig-node/5963-device-compatibility-groups/README.md b/keps/sig-node/5963-device-compatibility-groups/README.md new file mode 100644 index 000000000000..b1fac46caed3 --- /dev/null +++ b/keps/sig-node/5963-device-compatibility-groups/README.md @@ -0,0 +1,1104 @@ +# KEP-5963: DRA Device Compatibility Groups + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [User Stories](#user-stories) + - [Story 1](#story-1) + - [Story 2](#story-2) + - [Notes/Constraints/Caveats](#notesconstraintscaveats) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [API](#api) + - [CompatibilityGroups Assignment](#compatibilitygroups-assignment) + - [Examples](#examples) + - [Example 1: What the existing API enables](#example-1-what-the-existing-api-enables) + - [Example 2: How the existing API does not solve the problem](#example-2-how-the-existing-api-does-not-solve-the-problem) + - [Example 3: How the proposed API solves the problem](#example-3-how-the-proposed-api-solves-the-problem) + - [Example 4: Multiple compatible groups with an incompatible group](#example-4-multiple-compatible-groups-with-an-incompatible-group) + - [Scheduler Changes](#scheduler-changes) + - [Interaction with Multi-Request Claims and Device Constraints](#interaction-with-multi-request-claims-and-device-constraints) + - [Driver Responsibilities](#driver-responsibilities) + - [Test Plan](#test-plan) + - [Prerequisite testing updates](#prerequisite-testing-updates) + - [Unit tests](#unit-tests) + - [Integration tests](#integration-tests) + - [e2e tests](#e2e-tests) + - [Graduation Criteria](#graduation-criteria) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - 
[Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) + - [Current Workaround: Driver-level Preparation Failure](#current-workaround-driver-level-preparation-failure) + - [Inverted naming: `mutualExclusionGroups`](#inverted-naming-mutualexclusiongroups) +- [Infrastructure Needed (Optional)](#infrastructure-needed-optional) + +## Release Signoff Checklist + +Items marked with (R) are required *prior to targeting to a milestone / release*. + +- (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements](https://git.k8s.io/enhancements) (not the initial KEP PR) +- (R) KEP approvers have approved the KEP status as `implementable` +- (R) Design details are appropriately documented +- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) + - e2e Tests for all Beta API Operations (endpoints) + - (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) + - (R) Minimum Two Week Window for GA e2e tests to prove flake free +- (R) Graduation criteria is in place + - (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) within one minor version of promotion to GA +- (R) Production readiness review completed +- (R) Production readiness review approved +- "Implementation History" section is up-to-date for milestone +- User-facing 
documentation has been created in [kubernetes/website](https://git.k8s.io/website), for publication to [kubernetes.io](https://kubernetes.io/)
+- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes
+
+## Summary
+
+This KEP proposes an extension to the Dynamic Resource Allocation (DRA) API to
+support mutually exclusive device allocation constraints. Hardware devices often
+support multiple partitioning or virtualization schemes (for example, GPU MIG
+slicing vs. MPS sharing) that provide different trade-offs in terms of isolation,
+performance, and resource sharing. These schemes are frequently mutually exclusive
+at the hardware level: once a physical device is partitioned or configured using
+one scheme, it cannot be reconfigured to use a different scheme until all existing
+allocations are released.
+
+The current DRA Partitionable Devices API has no mechanism for drivers to express
+these mutual exclusivity constraints. A shared counter with a capacity of one
+can enforce mutual exclusion in general, but not here: such a counter would
+have to be decremented once when allocating the *first* device from a set of
+compatible devices, not once for *each* device, which the current counter
+model cannot express.
+
+Without a mechanism for this, incompatible allocations are only
+detected during resource preparation, after the scheduler has already made its
+decisions, leading to pod startup failures and resource thrashing. This KEP
+introduces API and scheduler changes so that compatibility constraints can be
+declared in ResourceSlice objects and enforced at scheduling time.
+
+## Motivation
+
+Hardware devices often support multiple partitioning or virtualization schemes
+that are mutually exclusive at the hardware level.
For example, an NVIDIA GPU +can be configured for MIG (Multi-Instance GPU) slicing or MPS (Multi-Process +Service) sharing, but not both simultaneously on the same physical device. + +Without a mechanism to express these constraints in DRA, the following problems +arise: + +1. **Late Failure Detection**: Incompatible allocations are only detected during + resource preparation, after scheduling decisions have already been made. +2. **Scheduler Unawareness**: The scheduler may allocate incompatible devices, + leading to pod startup failures. +3. **Poor User Experience**: Users receive cryptic preparation failures instead + of clear scheduling feedback. +4. **Resource Thrashing**: The scheduler may repeatedly attempt incompatible + allocations before giving up. + +The current workaround—having DRA drivers fail resource preparation when +incompatible allocations are attempted—is insufficient because it provides no +mechanism to inform the scheduler, and does not prevent repeated failed attempts. + +### Goals + +- Allow DRA drivers to specify compatibility between virtual devices within a +single physical device. +- Allow the scheduler to make informed allocation decisions that respect +compatibility rules declared in ResourceSlice objects. +- Provide a generic mechanism applicable to any hardware with partitioning +constraints, not just GPUs. +- Maintain backward compatibility with existing ResourceSlice specifications. + +### Non-Goals + +- Allowing DRA drivers to specify compatibility between devices that do not + share a counter set. The scope of compatibility constraints is limited to + virtual devices consuming from the same counter set (which, by convention, + represents a single underlying physical device). +- Providing a centralized or cluster-wide registry of compatibility group + names. Group names are opaque strings scoped to a single ResourceSlice pool + and are meaningful only to the driver that publishes them. 
+- Enabling the scheduler to *reconfigure* a physical device between + partitioning schemes (e.g., MIG ↔ MPS) as part of scheduling. This KEP only + addresses rejecting incompatible allocations; transitions between schemes + remain a driver concern and typically require draining existing allocations. +- Expressing compatibility constraints on `ResourceClaim` objects. The field + is driver-authored and lives only on `ResourceSlice`. +- Replacing existing counter-capacity checks. `compatibilityGroups` is an + additional predicate; capacity math on `sharedCounters` continues to apply + unchanged. + +## Proposal + +**CompatibilityGroups Assignment** + +Add a `device.consumesCounters[].compatibilityGroups` field. Devices declare which +named groups they belong to. For two devices consuming counters from the same +counter set to be co-allocated, they must share at least one compatibility group. + +Devices that omit this field are compatible only with other devices in the same +counter set that also omit it. Existing ResourceSlices (where no device sets the +field) continue to behave as today; drivers adopting this feature should annotate +every device sharing a counter set. + +### User Stories + +#### Story 1 + +As a GPU operator using NVIDIA GPUs, I want to express in my ResourceSlice +that MIG-partitioned virtual devices and MPS-sharing virtual devices on the +same physical GPU are mutually exclusive. When a pod requesting a MIG partition +is already running on a GPU, I want the scheduler to automatically exclude all +MPS devices on that same GPU from consideration for new allocations, rather than +allowing an allocation that will fail at device preparation time. 
+
+#### Story 2
+
+As a hardware vendor publishing DRA drivers for an accelerator that supports
+multiple exclusive operating modes (for example, exclusive mode, software
+partitioning, and hardware partitioning), I want to declare the compatibility
+constraints directly in my ResourceSlice, so that the Kubernetes scheduler
+can enforce those constraints without requiring my driver to fail pod startup
+with cryptic error messages.
+
+### Notes/Constraints/Caveats
+
+The compatibility relation is **symmetric**: if A can be co-allocated with B,
+then B can be co-allocated with A. It is **not transitive**: A and B sharing a
+group, and B and C sharing a group, does not imply A and C share one.
+Concretely, the scheduler evaluates the pairwise predicate
+`groups(A) ∩ groups(B) ≠ ∅` against every already-allocated device on the same
+counter set; it does not compute transitive closures. Drivers that want three
+device types to be mutually co-allocatable must ensure every pair shares at
+least one group (see Example 4).
+
+### Risks and Mitigations
+
+**Scheduler performance impact**: Evaluating compatibility constraints during
+device selection adds work to each scheduling cycle that involves DRA devices.
+Mitigation: the added work is a pairwise group-intersection check scoped to
+devices that share a counter set, with small group counts in practice (see
+Scheduler Changes → Complexity), and it is a no-op when no device in a slice
+declares the field.
+
+**Older schedulers ignoring new field**: A kube-scheduler that does not
+understand `compatibilityGroups` will ignore the field and may allocate
+incompatible devices. This degrades to the current behavior (driver fails at
+preparation time). Mitigation: document the version skew behavior clearly;
+drivers must still validate at preparation time even when the scheduler
+enforces constraints.
+
+**Incorrect driver declarations**: If a driver declares incorrect compatibility
+constraints, the scheduler may either reject valid allocations or permit invalid
+ones. Mitigation: the API is driver-authored and opt-in; drivers are responsible
+for correctness and documentation of their compatibility matrix.
+
+## Design Details
+
+### API
+
+#### CompatibilityGroups Assignment
+
+A new field `compatibilityGroups` is added inside each entry of
+`device.consumesCounters[]`. It contains a list of string group names.
+For two devices consuming counters from the same counter set to be allocated
+together, either both must omit the field (or set it to an empty list), or both
+must declare the field and share at least one group name. A nil
+`compatibilityGroups` and an empty `compatibilityGroups: []` are treated
+identically. This means a device that declares a non-empty list is never
+co-allocatable with a sibling that omits the field (or sets it empty), since
+the two can share no group name.
+
+The field is placed on each `consumesCounters[]` entry rather than on the
+device itself because compatibility is a physical-hardware property scoped to
+the shared resource represented by the counter set. A single virtual device
+that consumes from multiple counter sets may therefore declare different
+groups per counter set, reflecting different exclusivity constraints on
+different pieces of underlying hardware. Two devices that do not share any
+counter set are never compared via this field, even if they live on the same
+node or in the same `ResourceSlice`.
+
+**Naming convention used in examples.** A device of type `T` lists `T` in its
+groups. When types `T1…Tn` are mutually co-allocatable, every device of those
+types additionally lists a shared composite group (e.g., `t1t2`). A type that
+is compatible with no other type lists only `[T]`. The scheduler does not
+parse group names; this convention is purely for readability, and any opaque
+strings that satisfy the symmetry requirement work.
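The co-allocation rule can be sketched as a small pairwise predicate. The sketch below is illustrative only; the function name `coAllocatable` and the bare `[]string` representation are assumptions for readability, not the scheduler's actual types.

```go
package main

import "fmt"

// coAllocatable sketches the pairwise rule: nil and empty group lists are
// treated identically, two devices with no declared groups are compatible,
// a device that declares groups is never co-allocatable with one that does
// not, and otherwise the two lists must share at least one group name.
func coAllocatable(a, b []string) bool {
	if len(a) == 0 && len(b) == 0 {
		return true // both omit the field (or set it empty)
	}
	if len(a) == 0 || len(b) == 0 {
		return false // declared vs. omitted: no shared group is possible
	}
	set := make(map[string]bool, len(a))
	for _, g := range a {
		set[g] = true
	}
	for _, g := range b {
		if set[g] {
			return true // at least one shared group
		}
	}
	return false
}

func main() {
	mig := []string{"mig"}
	foo := []string{"foo", "foobar"}
	bar := []string{"bar", "foobar"}

	fmt.Println(coAllocatable(mig, foo)) // false: no shared group
	fmt.Println(coAllocatable(foo, bar)) // true: both list "foobar"
	fmt.Println(coAllocatable(nil, nil)) // true: both omit the field
	fmt.Println(coAllocatable(mig, nil)) // false: declared vs. omitted
}
```

Note that the predicate is symmetric but not transitive: `[foo, foobar]` matches `[bar, foobar]`, and `[bar, foobar]` matches `[bar]`, yet `[foo, foobar]` does not match `[bar]`.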
+ +Example showing MIG and FOO partitions on the same physical GPU: + +```yaml +apiVersion: resource.k8s.io/v1 +kind: ResourceSlice +spec: + driver: gpu.example.com + pool: + name: node-1-pool + generation: 1 + resourceSliceCount: 2 + sharedCounters: + - name: gpu-1-cs + counters: + multiprocessors: + value: "152" +--- +apiVersion: resource.k8s.io/v1 +kind: ResourceSlice +spec: + driver: gpu.example.com + pool: + name: node-1-pool + generation: 1 + resourceSliceCount: 2 + nodeName: node-1 + devices: + - name: gpu-1-mig1 + consumesCounters: + - counterSet: gpu-1-cs + compatibilityGroups: + - mig + counters: + multiprocessors: + value: "2" + - name: gpu-1-foo-part + consumesCounters: + - counterSet: gpu-1-cs + compatibilityGroups: + - foo + - foobar + counters: + multiprocessors: + value: "17" + - name: gpu-1-bar-part + consumesCounters: + - counterSet: gpu-1-cs + compatibilityGroups: + - bar + - foobar + counters: + multiprocessors: + value: "17" +``` + +- `gpu-1-mig1` (groups: `mig`) and `gpu-1-foo-part` (groups: `foo`, `foobar`) +share no compatibility group, so they cannot be co-allocated on the same +counter set. +- `gpu-1-foo-part` (groups: `foo`, `foobar`) and `gpu-1-bar-part` (groups: +`bar`, `foobar`) share the `foobar` group, so they can be co-allocated on the +same counter set. + +### Examples + +The following examples demonstrate the problem and the proposed solution using +a GPU that supports two mutually exclusive partitioning schemes: MIG (hardware +partitioning into isolated instances) and MPS (software-level time-sharing). + +#### Example 1: What the existing API enables + +The DRA Partitionable Devices API uses shared counter sets to track the +capacity of a physical device across its virtual partitions. When all virtual +devices on a GPU use the same partitioning scheme, the counter capacity check +is sufficient to ensure correct allocation. 
+ +ResourceSlices — a single GPU advertising three MIG 1g partitions: + +```yaml +apiVersion: resource.k8s.io/v1 +kind: ResourceSlice +metadata: + name: node-1-gpu-0-counters +spec: + driver: gpu.example.com + pool: + name: node-1-pool + generation: 1 + resourceSliceCount: 2 + sharedCounters: + - name: gpu-0-counters + counters: + multiprocessors: + value: "100" +--- +apiVersion: resource.k8s.io/v1 +kind: ResourceSlice +metadata: + name: node-1-gpu-0-devices +spec: + driver: gpu.example.com + pool: + name: node-1-pool + generation: 1 + resourceSliceCount: 2 + nodeName: node-1 + devices: + - name: gpu-0-mig-1g-0 + attributes: + type: + string: "mig-1g" + consumesCounters: + - counterSet: gpu-0-counters + counters: + multiprocessors: + value: "20" + - name: gpu-0-mig-1g-1 + attributes: + type: + string: "mig-1g" + consumesCounters: + - counterSet: gpu-0-counters + counters: + multiprocessors: + value: "20" + - name: gpu-0-mig-1g-2 + attributes: + type: + string: "mig-1g" + consumesCounters: + - counterSet: gpu-0-counters + counters: + multiprocessors: + value: "20" +``` + +ResourceClaims — two pods each requesting a MIG 1g partition: + +```yaml +apiVersion: resource.k8s.io/v1 +kind: ResourceClaim +metadata: + name: pod-a-gpu + namespace: default +spec: + devices: + requests: + - name: gpu + selectors: + - cel: + expression: >- + device.driver == 'gpu.example.com' && + device.attributes['type'].string == 'mig-1g' +--- +apiVersion: resource.k8s.io/v1 +kind: ResourceClaim +metadata: + name: pod-b-gpu + namespace: default +spec: + devices: + requests: + - name: gpu + selectors: + - cel: + expression: >- + device.driver == 'gpu.example.com' && + device.attributes['type'].string == 'mig-1g' +``` + +The scheduler allocates `gpu-0-mig-1g-0` to pod-a and `gpu-0-mig-1g-1` to +pod-b. Both consume from `gpu-0-counters` (20 + 20 = 40 <= 100). Both pods +start successfully because both devices use the same MIG partitioning mode. 
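The admission logic in this example reduces to plain counter arithmetic. The following is a minimal sketch under assumed names (`counterSet`, `fits`); the real implementation operates on `resource.Quantity` values rather than integers.

```go
package main

import "fmt"

// counterSet sketches the shared-counter capacity check from Example 1,
// using plain integers instead of the API's resource.Quantity type.
type counterSet struct {
	capacity  map[string]int64 // e.g. {"multiprocessors": 100}
	allocated map[string]int64 // running totals for admitted devices
}

// fits reports whether a device's counter consumption still fits within
// the remaining capacity, and records the consumption if it does.
func (cs *counterSet) fits(consumes map[string]int64) bool {
	for name, v := range consumes {
		if cs.allocated[name]+v > cs.capacity[name] {
			return false
		}
	}
	for name, v := range consumes {
		cs.allocated[name] += v
	}
	return true
}

func main() {
	gpu0 := &counterSet{
		capacity:  map[string]int64{"multiprocessors": 100},
		allocated: map[string]int64{},
	}
	mig1g := map[string]int64{"multiprocessors": 20}

	fmt.Println(gpu0.fits(mig1g))                  // true: 20 <= 100
	fmt.Println(gpu0.fits(mig1g))                  // true: 40 <= 100
	fmt.Println(gpu0.allocated["multiprocessors"]) // 40
}
```

With both MIG 1g devices admitted (20 + 20 = 40 <= 100), a third would also fit; the capacity check alone suffices here precisely because every device uses the same partitioning scheme.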
+ +#### Example 2: How the existing API does not solve the problem + +When a driver advertises devices from multiple mutually exclusive partitioning +schemes on the same GPU, all sharing the same counter set, the current API has +no way to express that these schemes cannot coexist. + +ResourceSlices — the same GPU now advertising both MIG and MPS devices: + +```yaml +apiVersion: resource.k8s.io/v1 +kind: ResourceSlice +metadata: + name: node-1-gpu-0-counters +spec: + driver: gpu.example.com + pool: + name: node-1-pool + generation: 1 + resourceSliceCount: 2 + sharedCounters: + - name: gpu-0-counters + counters: + multiprocessors: + value: "100" +--- +apiVersion: resource.k8s.io/v1 +kind: ResourceSlice +metadata: + name: node-1-gpu-0-devices +spec: + driver: gpu.example.com + pool: + name: node-1-pool + generation: 1 + resourceSliceCount: 2 + nodeName: node-1 + devices: + # MIG partitions + - name: gpu-0-mig-1g-0 + attributes: + type: + string: "mig-1g" + consumesCounters: + - counterSet: gpu-0-counters + counters: + multiprocessors: + value: "20" + - name: gpu-0-mig-1g-1 + attributes: + type: + string: "mig-1g" + consumesCounters: + - counterSet: gpu-0-counters + counters: + multiprocessors: + value: "20" + # MPS shares + - name: gpu-0-mps-0 + attributes: + type: + string: "mps" + consumesCounters: + - counterSet: gpu-0-counters + counters: + multiprocessors: + value: "50" + - name: gpu-0-mps-1 + attributes: + type: + string: "mps" + consumesCounters: + - counterSet: gpu-0-counters + counters: + multiprocessors: + value: "50" +``` + +ResourceClaims — pod-a requests a MIG partition, pod-b requests an MPS share: + +```yaml +apiVersion: resource.k8s.io/v1 +kind: ResourceClaim +metadata: + name: pod-a-gpu + namespace: default +spec: + devices: + requests: + - name: gpu + selectors: + - cel: + expression: >- + device.driver == 'gpu.example.com' && + device.attributes['type'].string == 'mig-1g' +--- +apiVersion: resource.k8s.io/v1 +kind: ResourceClaim +metadata: + name: 
pod-b-gpu + namespace: default +spec: + devices: + requests: + - name: gpu + selectors: + - cel: + expression: >- + device.driver == 'gpu.example.com' && + device.attributes['type'].string == 'mps' +``` + +The scheduler sees `gpu-0-mig-1g-0` (20 SMs) and `gpu-0-mps-0` (50 SMs). +Total: 70 <= 100 — the counter capacity check passes. The scheduler allocates +both. But at preparation time, the driver fails because MIG and MPS cannot be +active simultaneously on the same physical GPU. Pod-b gets a cryptic +preparation error. The scheduler may retry the same incompatible combination +repeatedly, causing resource thrashing. + +#### Example 3: How the proposed API solves the problem + +With `compatibilityGroups`, the driver declares that MIG devices belong to the +`"mig"` group and MPS devices belong to the `"mps"` group. The scheduler +enforces that devices sharing a counter set must share at least one +compatibility group. + +ResourceSlices — same devices, now with compatibility groups: + +```yaml +apiVersion: resource.k8s.io/v1 +kind: ResourceSlice +metadata: + name: node-1-gpu-0-counters +spec: + driver: gpu.example.com + pool: + name: node-1-pool + generation: 1 + resourceSliceCount: 2 + sharedCounters: + - name: gpu-0-counters + counters: + multiprocessors: + value: "100" +--- +apiVersion: resource.k8s.io/v1 +kind: ResourceSlice +metadata: + name: node-1-gpu-0-devices +spec: + driver: gpu.example.com + pool: + name: node-1-pool + generation: 1 + resourceSliceCount: 2 + nodeName: node-1 + devices: + # MIG partitions + - name: gpu-0-mig-1g-0 + attributes: + type: + string: "mig-1g" + consumesCounters: + - counterSet: gpu-0-counters + compatibilityGroups: + - mig + counters: + multiprocessors: + value: "20" + - name: gpu-0-mig-1g-1 + attributes: + type: + string: "mig-1g" + consumesCounters: + - counterSet: gpu-0-counters + compatibilityGroups: + - mig + counters: + multiprocessors: + value: "20" + # MPS shares + - name: gpu-0-mps-0 + attributes: + type: + string: 
"mps" + consumesCounters: + - counterSet: gpu-0-counters + compatibilityGroups: + - mps + counters: + multiprocessors: + value: "50" + - name: gpu-0-mps-1 + attributes: + type: + string: "mps" + consumesCounters: + - counterSet: gpu-0-counters + compatibilityGroups: + - mps + counters: + multiprocessors: + value: "50" +``` + +ResourceClaims — identical to Example 2: + +```yaml +apiVersion: resource.k8s.io/v1 +kind: ResourceClaim +metadata: + name: pod-a-gpu + namespace: default +spec: + devices: + requests: + - name: gpu + selectors: + - cel: + expression: >- + device.driver == 'gpu.example.com' && + device.attributes['type'].string == 'mig-1g' +--- +apiVersion: resource.k8s.io/v1 +kind: ResourceClaim +metadata: + name: pod-b-gpu + namespace: default +spec: + devices: + requests: + - name: gpu + selectors: + - cel: + expression: >- + device.driver == 'gpu.example.com' && + device.attributes['type'].string == 'mps' +``` + +The scheduler allocates `gpu-0-mig-1g-0` (group: `mig`) to pod-a. When +evaluating `gpu-0-mps-0` (group: `mps`) for pod-b, it checks +compatibility: both devices consume from `gpu-0-counters`, but they share no +compatibility group (`mig` vs `mps`). The scheduler rejects the allocation and +pod-b becomes Unschedulable with event: "claim violates device compatibility +constraints". No cryptic preparation failure, no resource thrashing. + +Two MIG devices (both group: `mig`) or two MPS devices (both group: `mps`) can +still be co-allocated, since they share a group. Each device lists only its +own type because MIG and MPS are not compatible with each other; if they +were, they would also share a composite group like `migmps`. + +#### Example 4: Multiple compatible groups with an incompatible group + +A device may support more than two partitioning schemes, some of which can +coexist. In this example, a device advertises three partition types: `foo`, +`bar`, and `baz`. 
`foo` and `bar` can coexist on the same device, but `baz` +is incompatible with both. + +By convention, each device's `compatibilityGroups` is a composite of the +types it can be co-allocated with: each device lists its own type, plus a +shared composite group for every set of types it is compatible with. So +`foo` devices list `[foo, foobar]`, `bar` devices list `[bar, foobar]`, and +`baz` — compatible with no other type — lists `[baz]`. + +This example is written generically — the counter name `units` stands in for +any hardware-specific resource (SMs, bandwidth, slots). + +ResourceSlices — a device advertising foo, bar, and baz partitions: + +```yaml +apiVersion: resource.k8s.io/v1 +kind: ResourceSlice +metadata: + name: node-1-device-0-counters +spec: + driver: device.example.com + pool: + name: node-1-pool + generation: 1 + resourceSliceCount: 2 + sharedCounters: + - name: device-0-counters + counters: + units: + value: "100" +--- +apiVersion: resource.k8s.io/v1 +kind: ResourceSlice +metadata: + name: node-1-device-0-devices +spec: + driver: device.example.com + pool: + name: node-1-pool + generation: 1 + resourceSliceCount: 2 + nodeName: node-1 + devices: + # foo partitions + - name: device-0-foo-0 + attributes: + type: + string: "foo" + consumesCounters: + - counterSet: device-0-counters + compatibilityGroups: + - foo + - foobar + counters: + units: + value: "25" + # bar partitions + - name: device-0-bar-0 + attributes: + type: + string: "bar" + consumesCounters: + - counterSet: device-0-counters + compatibilityGroups: + - bar + - foobar + counters: + units: + value: "25" + # baz partitions + - name: device-0-baz-0 + attributes: + type: + string: "baz" + consumesCounters: + - counterSet: device-0-counters + compatibilityGroups: + - baz + counters: + units: + value: "50" +``` + +`device-0-foo-0` (groups: `foo`, `foobar`) and `device-0-bar-0` (groups: +`bar`, `foobar`) share the `foobar` group, so they can be co-allocated. 
+`device-0-baz-0` (groups: `baz`) shares no group with either, so it cannot be
+co-allocated with them.
+
+For instance, if pod-a is allocated `device-0-foo-0`, a subsequent pod
+requesting `device-0-bar-0` succeeds (both share `foobar`), but a pod
+requesting `device-0-baz-0` is rejected (`foo`/`foobar` vs `baz`: no shared
+group).
+
+### Scheduler Changes
+
+The DRA scheduler plugin is enhanced to:
+
+1. Maintain a cache of allocated devices per node, including their compatibility
+   fields (`compatibilityGroups` values).
+2. For each candidate device during allocation, evaluate the pairwise
+   compatibility predicate against every device currently allocated on the
+   same counter set. Because the relation is symmetric, a single
+   group-intersection check per pair suffices.
+3. Remove candidate devices from consideration if they violate compatibility
+   constraints.
+4. Emit clear scheduling events when a device is rejected due to compatibility.
+
+**Complexity.** Let *M* be the number of devices already allocated on a
+counter set, *N* the number of candidates under consideration for that
+counter set, and *G* the maximum number of groups declared per counter-set
+consumption entry. The additional filter cost per scheduling cycle is
+O(*N* · *M* · *G*) for pairwise group-intersection checks, with typical *G*
+≤ 4 (hardware partition modes per device are small in practice). The
+existing DRA allocation loop already iterates over candidates per counter
+set, so the new work is a constant-factor-per-candidate addition rather than
+a new outer loop.
+
+### Interaction with Multi-Request Claims and Device Constraints
+
+**Multiple requests within one claim.** The compatibility predicate is
+evaluated pairwise between every device already allocated *on the same counter
+set* and each candidate, regardless of whether the allocated device belongs to
+the same claim, a different claim on the same pod, or a different pod entirely.
+Two devices within a single `ResourceClaim` that land on the same counter set +are therefore subject to the same pairwise check: the second request sees the +first as already-allocated state. + +**Allocation order.** The scheduler does not reorder requests within a claim +to improve feasibility. If requests are ordered such that an early compatible +pick later blocks a mandatory pick, the claim becomes Unschedulable and +standard retry behavior applies. This matches how existing DRA constraints +behave. + +**Composition with `DeviceConstraints`.** `compatibilityGroups` is a +driver-authored, ResourceSlice-side constraint. `DeviceConstraints` (e.g., +`matchAttribute`) is a user-authored, ResourceClaim-side constraint. The two +are evaluated independently and both must pass for a candidate to be +allocated. A claim can never *relax* a driver-declared compatibility group, +and a driver can never *force* a claim-side `matchAttribute`. They compose by +conjunction. + +### Driver Responsibilities + +Resource drivers are responsible for: + +1. Populating `compatibilityGroups` for all devices with compatibility requirements. +2. Ensuring compatibility rules are symmetric and consistent across all devices + in a ResourceSlice. +3. Documenting their compatibility matrix. +4. Continuing to validate at resource preparation time for version-skew safety. + +### Test Plan + +[X] I/we understand the owners of the involved components may require updates to +existing tests to make this code solid enough prior to committing the changes necessary +to implement this enhancement. + +##### Prerequisite testing updates + +None. The DRA scheduler plugin and `ResourceSlice` validation already have +unit and integration coverage; new tests are additive. + +##### Unit tests + +- `k8s.io/dynamic-resource-allocation/structured`: pairwise group-intersection + predicate (empty, nil, single, multiple groups; nil-vs-nil, nil-vs-set, + set-vs-set; `[]` treated as nil). 
+- `k8s.io/kubernetes/pkg/scheduler/framework/plugins/dynamicresources`: filter + behavior with mixed compatible and incompatible candidates on the same + counter set; no-op behavior when the feature gate is disabled; no-op + behavior when no device in the slice declares the field. +- `k8s.io/kubernetes/pkg/apis/resource/validation`: field validation — + accepted shapes, max group-name length, max groups per counter consumption. + +##### Integration tests + +- Feature gate enablement/disablement round-trip: field is persisted when + enabled, dropped on write when disabled. +- Scheduler rejects a claim when the only remaining candidate on a node + belongs to an incompatible group; admits it when a compatible candidate + exists on another node. +- Upgrade → downgrade → upgrade: allocations made during the "upgrade" phase + remain valid after downgrade; re-enabling enforcement does not re-evaluate + existing allocations. + +##### e2e tests + +- Fake DRA driver advertising two mutually exclusive groups (`mig`, `mps`) on + a single counter set. Scheduling a `mig` pod followed by an `mps` pod on + the same node leaves the second pod Unschedulable with the documented + event; reversing the order reproduces the behavior symmetrically. +- Same driver with compatible groups (`foo`, `bar`) — both pods schedule. +- Feature-gate-off baseline: the second pod reaches preparation and the + driver rejects it (pre-KEP behavior preserved). + +### Graduation Criteria +#### Alpha +- API defined and implemented +- All relevant code is merged and placed behind a feature flag +- Unit and integration tests +- Driver-author documentation published under `kubernetes/website` (DRA + drivers section), including the strict nil-matching rule and a worked + MIG/MPS example. 
+
+#### Beta
+- E2E tests passing in CI
+- Validated with at least one production DRA driver (out-of-tree testing)
+
+#### GA
+- At least 2 releases as beta
+
+### Upgrade / Downgrade Strategy
+#### Upgrade
+Upon upgrading, no `ResourceSlice` leverages the new optional field yet, so the current behavior remains unchanged.
+
+#### Downgrade
+When downgrading to a version that does not implement this enhancement, older schedulers and API servers do not know about the added optional field and revert to their pre-enhancement behavior.
+
+Allocated devices that leveraged this new field will remain allocated, and future allocations will not take `compatibilityGroups` into consideration.
+
+
+### Version Skew Strategy
+
+The feature introduces a new optional field on `ResourceSlice` and new
+enforcement logic in the scheduler. Skew behaviors to consider:
+
+**New kube-apiserver + old kube-scheduler.** The apiserver accepts and persists
+`compatibilityGroups`. An old scheduler ignores the field and may allocate
+incompatible devices. This degrades to the pre-KEP behavior: the DRA driver
+rejects the allocation at resource preparation time. Drivers MUST continue to
+validate at preparation time for this reason (see Driver Responsibilities).
+
+**Old kube-apiserver + new kube-scheduler.** The old apiserver drops the unknown
+field on writes. ResourceSlices in etcd therefore do not carry
+`compatibilityGroups`, and the new scheduler sees only nil values, producing the
+pre-KEP behavior. No incorrect allocations result.
+
+**Mixed-version HA kube-scheduler.** If one replica enforces the field and
+another does not, the enforcing replica may reject allocations the
+non-enforcing replica would accept. Both outcomes are safe (either the
+scheduler correctly rejects, or the driver rejects at preparation time).
+Resolution is to complete the scheduler rollout.
+ +**Downgrade with in-flight allocations.** Devices already allocated under the +new rules remain allocated across a downgrade; the post-downgrade scheduler +will not consider `compatibilityGroups` for future allocations, reverting to +pre-KEP behavior. No existing allocations are invalidated. + +**Feature gate off on one component.** If `DRADeviceCompatibilityGroups` is +enabled on kube-apiserver but disabled on kube-scheduler (or vice versa), +behavior matches the corresponding skew row above — apiserver stores the field +but scheduler ignores it, or scheduler enforces on field values that the +apiserver may drop on writes. + +## Production Readiness Review Questionnaire + +### Feature Enablement and Rollback + +###### How can this feature be enabled / disabled in a live cluster? + +- Feature gate + - Feature gate name: DRADeviceCompatibilityGroups + - Components depending on the feature gate: kube-scheduler, kube-apiserver +- Gate behavior per component: + - **kube-apiserver**: when disabled, strips `compatibilityGroups` on writes + and hides it on reads of `ResourceSlice`. Prevents drivers from persisting + values that cannot be enforced. + - **kube-scheduler**: when disabled, ignores the field on read and does not + perform the pairwise intersection check during filtering. +- No control-plane downtime is required to toggle the gate. +- No node downtime or reprovisioning is required. + +###### Does enabling the feature change any default behavior? +No, this KEP proposes an additional optional field to the `ResourceSlice` API + +###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)? +Yes, rolling back the enablement will revert the cluster to its pre-enablement behavior + +###### What happens if we reenable the feature if it was previously rolled back? +Existing `compatibilityGroup` configurations in `ResourceSlice`s will become effective again + +###### Are there any tests for feature enablement/disablement? 
+Yes, integration tests will verify feature enablement and disablement (see
+Test Plan → Integration tests).
+
+### Rollout, Upgrade and Rollback Planning
+
+###### How can a rollout or rollback fail? Can it impact already running workloads?
+Rollout risk is limited to the two components touched by the feature gate
+(kube-apiserver field handling and kube-scheduler filter logic).
+Already-running workloads are not affected: compatibility filtering only runs
+during scheduling of *new* allocations, so disabling the gate or rolling back
+binaries does not disturb existing pod/device bindings.
+
+###### What specific metrics should inform a rollback?
+A new scheduler metric
+`scheduler_dra_compatibility_rejections_total{driver,counter_set,reason}`
+counts claim filter rejections caused by compatibility constraints. A rollback
+is warranted if this metric spikes unexpectedly after a driver update (likely
+an incorrect compatibility matrix — see Risks → Incorrect driver declarations).
+Operators should also watch `scheduler_unschedulable_pods` correlated with
+events matching `Insufficient compatible DRA devices`.
+
+###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
+Upgrade → downgrade → upgrade will be covered by the integration test
+described in Test Plan → Integration tests. At alpha, manual verification on a
+kind cluster with the feature gate flipped is acceptable; CI coverage is a
+Beta graduation criterion.
+
+###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
+No
+
+### Monitoring Requirements
+
+###### How can an operator determine if the feature is in use by workloads?
+This feature is not intended for direct use by workloads; it is intended for
+DRA drivers. An operator can determine whether it is in use by checking for
+`ResourceSlice` objects that set `compatibilityGroups`.
+
+###### How can someone using this feature know that it is working for their instance?
+ +- Events + - Scheduling events: + - When a candidate device is filtered out because its compatibility groups + do not intersect those of an already-allocated device on the same + counter set, the scheduler logs a per-node filter reason of the form: + ``` + device gpu-0-mps-0 (groups [mps]) incompatible with allocated device gpu-0-mig-1g-0 (groups [mig]) on counterSet gpu-0-counters + ``` + - If no node has any allocatable candidate, the standard scheduler + "0/N nodes are available" event aggregates this reason across nodes + (e.g., `4 Insufficient compatible DRA devices`). +- Pod.status + - Condition name: Unschedulable + +###### What are the reasonable SLOs (Service Level Objectives) for the enhancement? +N/A + +###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? +Operators can use `scheduler_dra_compatibility_rejections_total` together with +`scheduler_unschedulable_pods` and the standard DRA scheduler plugin latency +metrics to determine whether compatibility filtering is contributing to +scheduling failures or latency regressions. + +###### Are there any missing metrics that would be useful to have to improve observability of this feature? +No — `scheduler_dra_compatibility_rejections_total{driver,counter_set,reason}` +(introduced by this KEP; see Rollout → What specific metrics should inform a +rollback) covers the primary observability need. Additional breakdowns can be +added post-alpha if field feedback justifies them. + +### Dependencies +DRA Partitionable Devices enabled + +###### Does this feature depend on any specific services running in the cluster? +No + +### Scalability + +###### Will enabling / using this feature result in any new API calls? +No + +###### Will enabling / using this feature result in introducing new API types? +No, only a new API field + +###### Will enabling / using this feature result in any new calls to the cloud provider? 
+No
+
+###### Will enabling / using this feature result in increasing size or count of the existing API objects?
+Yes, this adds one optional field to the `ResourceSlice` API; the per-object
+size increase is bounded by the validation limits on group count and
+group-name length.
+
+###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
+Scheduling cycles that involve DRA devices incur an additional
+O(*N* · *M* · *G*) group-intersection check per counter set (see Design
+Details → Scheduler Changes), where *M* is devices already allocated on that
+counter set, *N* is candidates considered, and *G* is groups per device. For
+realistic values (*M* ≤ 16, *N* ≤ 64, *G* ≤ 4) the added work is in the low
+thousands of string comparisons per counter set per cycle and is not expected
+to be measurable against existing DRA scheduling cost. Benchmarks will be run
+during alpha and reported in the Implementation History.
+
+###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
+No
+
+###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
+No
+
+### Troubleshooting
+
+###### How does this feature react if the API server and/or etcd is unavailable?
+No new side effects
+
+###### What are other known failure modes?
+N/A
+
+###### What steps should be taken if SLOs are not being met to determine the problem?
+TBD
+
+## Implementation History
+
+- 2026-03-17: KEP opened, status `provisional`.
+- 2026-04-18: KEP under review; API shape and default semantics settled.
+
+## Drawbacks
+
+Adding compatibility constraint support to the scheduler increases the
+complexity of the DRA scheduling logic. The new field must be evaluated for
+every device candidate during every scheduling cycle that involves DRA
+resources, which adds latency and memory overhead.
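+
+To make that per-cycle evaluation concrete, the per-counter-set filter loop
+whose cost is estimated under Scalability above can be sketched as follows
+(illustrative Go, not the actual `dynamicresources` plugin code; the function
+names are invented for this sketch):
+
+```go
+package main
+
+import "fmt"
+
+// intersects reports whether two group lists share at least one name
+// (up to G×G string comparisons for lists of length G).
+func intersects(a, b []string) bool {
+	for _, x := range a {
+		for _, y := range b {
+			if x == y {
+				return true
+			}
+		}
+	}
+	return false
+}
+
+// filterCompatible keeps only candidates whose groups intersect those of
+// every device already allocated on the counter set: N candidates × M
+// allocated devices × per-pair group comparisons, the cost estimated above.
+func filterCompatible(candidates, allocated [][]string) [][]string {
+	var kept [][]string
+	for _, cand := range candidates {
+		compatible := true
+		for _, alloc := range allocated {
+			if !intersects(cand, alloc) {
+				compatible = false
+				break
+			}
+		}
+		if compatible {
+			kept = append(kept, cand)
+		}
+	}
+	return kept
+}
+
+func main() {
+	allocated := [][]string{{"mig"}}
+	candidates := [][]string{{"mig"}, {"mps"}}
+	fmt.Println(filterCompatible(candidates, allocated)) // [[mig]]
+}
+```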
+ +## Alternatives + +### Current Workaround: Driver-level Preparation Failure + +The existing workaround is for DRA drivers to fail resource preparation when +incompatible allocations are attempted. This approach is insufficient because: + +- It detects incompatibilities only after scheduling has committed to the +allocation, leading to pod startup failures. +- It provides no mechanism to inform the scheduler so it can try other nodes +or device combinations. +- It results in resource thrashing as the scheduler retries the same failing +combination. + +### Inverted naming: `mutualExclusionGroups` + +An alternative API would invert the semantics: instead of declaring which +groups a device *belongs to* (co-allocation predicate), declare which groups +a device is *incompatible with* (exclusion predicate). Two devices would then +be co-allocatable if and only if the intersection of their exclusion sets and +their own group memberships is empty. + +The inverted model is arguably more intuitive for the motivating case — a MIG +device "excludes MPS," full stop — and does not require drivers to list each +peer group in their own entry (as Example 4 does, where `foo` devices must +include `bar` in their group list). It was rejected because: + +- The co-allocation framing composes naturally with the existing DRA model, + where counter-set membership already expresses "can share resources." A + group is a finer-grained membership within the same model. +- Exclusion semantics require two fields to express the same information (the + groups you *are* in, and the groups you *exclude*), or a global registry of + group names. Membership-only is simpler. +- Symmetry is easier to validate: a driver that forgets to include `foo` in a + `bar` device's groups produces a diagnosable allocation failure, rather + than silent incorrect behavior under exclusion semantics. 
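+
+To make the comparison concrete, the two predicates can be sketched side by
+side (illustrative Go; the `device` struct and function names are invented
+for this sketch, and the exclusion form needs two lists per device, as noted
+above):
+
+```go
+package main
+
+import "fmt"
+
+// intersects reports whether two group-name lists share a member.
+func intersects(a, b []string) bool {
+	for _, x := range a {
+		for _, y := range b {
+			if x == y {
+				return true
+			}
+		}
+	}
+	return false
+}
+
+// Membership model (this KEP): one list per device; devices co-allocate
+// iff their group lists intersect.
+func membershipOK(a, b []string) bool {
+	return intersects(a, b)
+}
+
+// Exclusion model (rejected): each device carries memberships and
+// exclusions; co-allocation requires that neither side excludes the other.
+type device struct{ groups, excludes []string }
+
+func exclusionOK(a, b device) bool {
+	return !intersects(a.excludes, b.groups) && !intersects(b.excludes, a.groups)
+}
+
+func main() {
+	// The same MIG/MPS incompatibility expressed in both models.
+	fmt.Println(membershipOK([]string{"mig"}, []string{"mps"})) // false
+	mig := device{groups: []string{"mig"}, excludes: []string{"mps"}}
+	mps := device{groups: []string{"mps"}, excludes: []string{"mig"}}
+	fmt.Println(exclusionOK(mig, mps)) // false
+}
+```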
+
+## Infrastructure Needed (Optional)
+
diff --git a/keps/sig-node/5963-device-compatibility-groups/kep.yaml b/keps/sig-node/5963-device-compatibility-groups/kep.yaml
new file mode 100644
index 000000000000..4ccbeae4913b
--- /dev/null
+++ b/keps/sig-node/5963-device-compatibility-groups/kep.yaml
@@ -0,0 +1,40 @@
+title: DRA Device Compatibility Groups
+kep-number: 5963
+authors:
+  - "@omeryahud"
+owning-sig: sig-node
+participating-sigs:
+  - sig-scheduling
+status: provisional
+creation-date: 2026-03-17
+reviewers:
+  - TBD
+approvers:
+  - TBD
+
+# The target maturity stage in the current dev cycle for this KEP.
+# If the purpose of this KEP is to deprecate a user-visible feature
+# and a Deprecated feature gate is added, it should be deprecated|disabled|removed.
+stage: alpha
+
+# The most recent milestone for which work toward delivery of this KEP has been
+# done. This can be the current (upcoming) milestone, if it is being actively
+# worked on.
+latest-milestone: v1.37
+
+# The milestone at which this feature was, or is targeted to be, at each stage.
+milestone:
+  alpha: v1.37
+  beta: v1.38
+  stable: v1.39
+
+# The following PRR answers are required at alpha release
+# List the feature gate name and the components for which it must be enabled
+feature-gates:
+  - name: DRADeviceCompatibilityGroups
+    components:
+      - kube-scheduler
+      - kube-apiserver
+disable-supported: true
+
+metrics: []