From 89aef0f4d5966e5cf1067028b34a152c1a2fa52a Mon Sep 17 00:00:00 2001 From: Omer Yahud Date: Tue, 14 Apr 2026 22:53:15 +0000 Subject: [PATCH 1/5] KEP-5963: Device Compatibility Groups Signed-off-by: Omer Yahud --- .../README.md | 894 ++++++++++++++++++ .../5963-device-compatibility-groups/kep.yaml | 40 + 2 files changed, 934 insertions(+) create mode 100644 keps/sig-node/5963-device-compatibility-groups/README.md create mode 100644 keps/sig-node/5963-device-compatibility-groups/kep.yaml diff --git a/keps/sig-node/5963-device-compatibility-groups/README.md b/keps/sig-node/5963-device-compatibility-groups/README.md new file mode 100644 index 000000000000..feaf44d13d67 --- /dev/null +++ b/keps/sig-node/5963-device-compatibility-groups/README.md @@ -0,0 +1,894 @@ +# KEP-5963: DRA Device Compatibility Groups + +- [Release Signoff Checklist](#release-signoff-checklist) +- [Summary](#summary) +- [Motivation](#motivation) + - [Goals](#goals) + - [Non-Goals](#non-goals) +- [Proposal](#proposal) + - [User Stories](#user-stories) + - [Story 1](#story-1) + - [Story 2](#story-2) + - [Notes/Constraints/Caveats](#notesconstraintscaveats) + - [Risks and Mitigations](#risks-and-mitigations) +- [Design Details](#design-details) + - [API](#api) + - [CompatibilityGroups Assignment](#compatibilitygroups-assignment) + - [Examples](#examples) + - [Example 1: What the existing API enables](#example-1-what-the-existing-api-enables) + - [Example 2: How the existing API does not solve the problem](#example-2-how-the-existing-api-does-not-solve-the-problem) + - [Example 3: How the proposed API solves the problem](#example-3-how-the-proposed-api-solves-the-problem) + - [Example 4: Multiple compatible groups with an incompatible group](#example-4-multiple-compatible-groups-with-an-incompatible-group) + - [Scheduler Changes](#scheduler-changes) + - [Driver Responsibilities](#driver-responsibilities) + - [Test Plan](#test-plan) + - [Prerequisite testing updates](#prerequisite-testing-updates) + 
- [Unit tests](#unit-tests) + - [Integration tests](#integration-tests) + - [e2e tests](#e2e-tests) + - [Graduation Criteria](#graduation-criteria) + - [Upgrade / Downgrade Strategy](#upgrade--downgrade-strategy) + - [Version Skew Strategy](#version-skew-strategy) +- [Production Readiness Review Questionnaire](#production-readiness-review-questionnaire) + - [Feature Enablement and Rollback](#feature-enablement-and-rollback) + - [Rollout, Upgrade and Rollback Planning](#rollout-upgrade-and-rollback-planning) + - [Monitoring Requirements](#monitoring-requirements) + - [Dependencies](#dependencies) + - [Scalability](#scalability) + - [Troubleshooting](#troubleshooting) +- [Implementation History](#implementation-history) +- [Drawbacks](#drawbacks) +- [Alternatives](#alternatives) +- [Infrastructure Needed (Optional)](#infrastructure-needed-optional) + +## Release Signoff Checklist + +Items marked with (R) are required *prior to targeting to a milestone / release*. + +- (R) Enhancement issue in release milestone, which links to KEP dir in [kubernetes/enhancements](https://git.k8s.io/enhancements) (not the initial KEP PR) +- (R) KEP approvers have approved the KEP status as `implementable` +- (R) Design details are appropriately documented +- (R) Test plan is in place, giving consideration to SIG Architecture and SIG Testing input (including test refactors) + - e2e Tests for all Beta API Operations (endpoints) + - (R) Ensure GA e2e tests meet requirements for [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) + - (R) Minimum Two Week Window for GA e2e tests to prove flake free +- (R) Graduation criteria is in place + - (R) [all GA Endpoints](https://github.com/kubernetes/community/pull/1806) must be hit by [Conformance Tests](https://github.com/kubernetes/community/blob/master/contributors/devel/sig-architecture/conformance-tests.md) within one minor version of promotion to GA +- (R) 
Production readiness review completed +- (R) Production readiness review approved +- "Implementation History" section is up-to-date for milestone +- User-facing documentation has been created in [kubernetes/website](https://git.k8s.io/website), for publication to [kubernetes.io](https://kubernetes.io/) +- Supporting documentation—e.g., additional design documents, links to mailing list discussions/SIG meetings, relevant PRs/issues, release notes + +## Summary + +This KEP proposes an extension to the Dynamic Resource Allocation (DRA) API to +support mutually exclusive device allocation constraints. Hardware devices often +support multiple partitioning or virtualization schemes (for example, GPU MIG +slicing vs. MPS sharing) that provide different trade-offs in terms of isolation, +performance, and resource sharing. These schemes are frequently mutually exclusive +at the hardware level: once a physical device is partitioned or configured using +one scheme, it cannot be reconfigured to use a different scheme until all existing +allocations are released. + +The current DRA Partitionable Devices API has no mechanism for drivers to express +these mutual exclusivity constraints. Without it, incompatible allocations are only +detected during resource preparation, after the scheduler has already made its +decisions, leading to pod startup failures and resource thrashing. This KEP +introduces API and scheduler changes so that compatibility constraints can be +declared in ResourceSlice objects and enforced at scheduling time. + +## Motivation + +Hardware devices often support multiple partitioning or virtualization schemes +that are mutually exclusive at the hardware level. For example, an NVIDIA GPU +can be configured for MIG (Multi-Instance GPU) slicing or MPS (Multi-Process +Service) sharing, but not both simultaneously on the same physical device. + +Without a mechanism to express these constraints in DRA, the following problems +arise: + +1. 
**Late Failure Detection**: Incompatible allocations are only detected during + resource preparation, after scheduling decisions have already been made. +2. **Scheduler Unawareness**: The scheduler may allocate incompatible devices, + leading to pod startup failures. +3. **Poor User Experience**: Users receive cryptic preparation failures instead + of clear scheduling feedback. +4. **Resource Thrashing**: The scheduler may repeatedly attempt incompatible + allocations before giving up. + +The current workaround—having DRA drivers fail resource preparation when +incompatible allocations are attempted—is insufficient because it provides no +mechanism to inform the scheduler, and does not prevent repeated failed attempts. + +### Goals + +- Allow DRA drivers to specify compatibility between virtual devices within a +single physical device. +- Allow the scheduler to make informed allocation decisions that respect +compatibility rules declared in ResourceSlice objects. +- Provide a generic mechanism applicable to any hardware with partitioning +constraints, not just GPUs. +- Maintain backward compatibility with existing ResourceSlice specifications. + +### Non-Goals + +- Allow DRA drivers to specify compatibility between physical or virtual devices +across different physical devices. The scope of compatibility constraints is limited to virtual devices sharing the same +underlying physical device. + +## Proposal + +**CompatibilityGroups Assignment** + +Add a `device.consumesCounters[].compatibilityGroups` field. Devices declare which +named groups they belong to. For two devices consuming counters from the same +counter set to be co-allocated, they must share at least one compatibility group. +Devices without this field are considered compatible with all groups. This +approach is simpler and has minimal API surface. 
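The co-allocation rule above reduces to a set-intersection check. A minimal sketch of the rule (hypothetical helper name, not the actual scheduler code):

```python
def share_compatibility_group(a_groups, b_groups):
    """Pairwise rule from this proposal: two devices consuming counters from
    the same counter set may be co-allocated only if they share at least one
    compatibility group. (Handling of devices that omit the field follows the
    backward-compatibility rule described in the text and is not modeled
    here.)"""
    return bool(set(a_groups) & set(b_groups))

print(share_compatibility_group(["mig"], ["mps"]))         # False
print(share_compatibility_group(["foo", "bar"], ["bar"]))  # True
```

Because set intersection is symmetric, the result does not depend on argument order.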
+
+### User Stories
+
+#### Story 1
+
+As a GPU operator using NVIDIA GPUs, I want to express in my ResourceSlice
+that MIG-partitioned virtual devices and MPS-sharing virtual devices on the
+same physical GPU are mutually exclusive. When a pod requesting a MIG partition
+is already running on a GPU, I want the scheduler to automatically exclude all
+MPS devices on that same GPU from consideration for new allocations, rather than
+allowing an allocation that will fail at device preparation time.
+
+#### Story 2
+
+As a hardware vendor publishing DRA drivers for an accelerator that supports
+multiple exclusive operating modes (for example, exclusive mode, software
+partitioning, and hardware partitioning), I want to declare the compatibility
+constraints directly in my ResourceSlice, so that the Kubernetes scheduler
+can enforce those constraints without requiring my driver to fail pod startup
+with cryptic error messages.
+
+### Notes/Constraints/Caveats
+
+The compatibility constraint is symmetric: if device A specifies a constraint
+that excludes device B, then allocating A must prevent B from being allocated,
+and vice versa. It is not transitive: device A may share a group with B, and B
+with C, while A and C share no group. This proposal implements the pairwise
+symmetric check in the scheduler.
+
+### Risks and Mitigations
+
+**Scheduler performance impact**: Evaluating compatibility constraints during
+device selection adds work to each scheduling cycle that involves DRA devices.
+Mitigation: the check is scoped to devices that consume from the same counter
+set, and the number of groups per device is expected to be small.
+
+**Older schedulers ignoring new field**: A kube-scheduler that does not
+understand `compatibilityGroups` will ignore this field and may allocate
+incompatible devices. This degrades to the current behavior (driver fails at
+preparation time). Mitigation: document the version skew behavior clearly;
+drivers must still validate at preparation time even when the scheduler
+enforces constraints.
+
+**Incorrect driver declarations**: If a driver declares incorrect compatibility
+constraints, the scheduler may either reject valid allocations or permit invalid
+ones.
Mitigation: the API is driver-authored and opt-in; drivers are responsible +for correctness and documentation of their compatibility matrix. + +## Design Details + +### API + +#### CompatibilityGroups Assignment + +A new field `compatibilityGroups` is added inside each entry of +`device.consumesCounters[]`. It contains a list of string group names. +For two devices consuming counters from the same counter set to be allocated +together, they must share at least one group name. Devices that omit this +field are considered compatible with all groups. + +Example showing MIG and FOO partitions on the same physical GPU: + +```yaml +apiVersion: resource.k8s.io/v1 +kind: ResourceSlice +spec: + driver: gpu.example.com + pool: + name: node-1-pool + generation: 1 + resourceSliceCount: 2 + sharedCounters: + - name: gpu-1-cs + counters: + multiprocessors: + value: "152" +--- +apiVersion: resource.k8s.io/v1 +kind: ResourceSlice +spec: + driver: gpu.example.com + pool: + name: node-1-pool + generation: 1 + resourceSliceCount: 2 + nodeName: node-1 + devices: + - name: gpu-1-mig1 + consumesCounters: + - counterSet: gpu-1-cs + compatibilityGroups: + - mig + counters: + multiprocessors: + value: "2" + - name: gpu-1-foo-part + consumesCounters: + - counterSet: gpu-1-cs + compatibilityGroups: + - foo + - bar + counters: + multiprocessors: + value: "17" + - name: gpu-1-bar-part + consumesCounters: + - counterSet: gpu-1-cs + compatibilityGroups: + - foo + - bar + counters: + multiprocessors: + value: "17" +``` + +- `gpu-1-mig1` and `gpu-1-foo-part` share no compatibility group (`mig` vs +`foo`/`bar`), so they cannot be co-allocated on the same counter set. +- `gpu-1-foo-part` and `gpu-1-bar-part` share compatibility groups (`foo`, `bar`), +so they can be co-allocated on the same counter set. 
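Applying the rule to the example devices above yields the pairwise results in the two bullets. A small illustrative check (device names taken from the slice; helper name hypothetical):

```python
# Compatibility groups declared by each device in the example ResourceSlice.
groups = {
    "gpu-1-mig1":     {"mig"},
    "gpu-1-foo-part": {"foo", "bar"},
    "gpu-1-bar-part": {"foo", "bar"},
}

def co_allocatable(a, b):
    # Devices on the same counter set need at least one shared group.
    return bool(groups[a] & groups[b])

print(co_allocatable("gpu-1-mig1", "gpu-1-foo-part"))      # False: mig vs foo/bar
print(co_allocatable("gpu-1-foo-part", "gpu-1-bar-part"))  # True: share foo and bar
```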
+ +### Examples + +The following examples demonstrate the problem and the proposed solution using +a GPU that supports two mutually exclusive partitioning schemes: MIG (hardware +partitioning into isolated instances) and MPS (software-level time-sharing). + +#### Example 1: What the existing API enables + +The DRA Partitionable Devices API uses shared counter sets to track the +capacity of a physical device across its virtual partitions. When all virtual +devices on a GPU use the same partitioning scheme, the counter capacity check +is sufficient to ensure correct allocation. + +ResourceSlices — a single GPU advertising three MIG 1g partitions: + +```yaml +apiVersion: resource.k8s.io/v1 +kind: ResourceSlice +metadata: + name: node-1-gpu-0-counters +spec: + driver: gpu.example.com + pool: + name: node-1-pool + generation: 1 + resourceSliceCount: 2 + sharedCounters: + - name: gpu-0-counters + counters: + multiprocessors: + value: "100" +--- +apiVersion: resource.k8s.io/v1 +kind: ResourceSlice +metadata: + name: node-1-gpu-0-devices +spec: + driver: gpu.example.com + pool: + name: node-1-pool + generation: 1 + resourceSliceCount: 2 + nodeName: node-1 + devices: + - name: gpu-0-mig-1g-0 + attributes: + type: + string: "mig-1g" + consumesCounters: + - counterSet: gpu-0-counters + counters: + multiprocessors: + value: "20" + - name: gpu-0-mig-1g-1 + attributes: + type: + string: "mig-1g" + consumesCounters: + - counterSet: gpu-0-counters + counters: + multiprocessors: + value: "20" + - name: gpu-0-mig-1g-2 + attributes: + type: + string: "mig-1g" + consumesCounters: + - counterSet: gpu-0-counters + counters: + multiprocessors: + value: "20" +``` + +ResourceClaims — two pods each requesting a MIG 1g partition: + +```yaml +apiVersion: resource.k8s.io/v1 +kind: ResourceClaim +metadata: + name: pod-a-gpu + namespace: default +spec: + devices: + requests: + - name: gpu + selectors: + - cel: + expression: >- + device.driver == 'gpu.example.com' && + 
device.attributes['type'].string == 'mig-1g' +--- +apiVersion: resource.k8s.io/v1 +kind: ResourceClaim +metadata: + name: pod-b-gpu + namespace: default +spec: + devices: + requests: + - name: gpu + selectors: + - cel: + expression: >- + device.driver == 'gpu.example.com' && + device.attributes['type'].string == 'mig-1g' +``` + +The scheduler allocates `gpu-0-mig-1g-0` to pod-a and `gpu-0-mig-1g-1` to +pod-b. Both consume from `gpu-0-counters` (20 + 20 = 40 <= 100). Both pods +start successfully because both devices use the same MIG partitioning mode. + +#### Example 2: How the existing API does not solve the problem + +When a driver advertises devices from multiple mutually exclusive partitioning +schemes on the same GPU, all sharing the same counter set, the current API has +no way to express that these schemes cannot coexist. + +ResourceSlices — the same GPU now advertising both MIG and MPS devices: + +```yaml +apiVersion: resource.k8s.io/v1 +kind: ResourceSlice +metadata: + name: node-1-gpu-0-counters +spec: + driver: gpu.example.com + pool: + name: node-1-pool + generation: 1 + resourceSliceCount: 2 + sharedCounters: + - name: gpu-0-counters + counters: + multiprocessors: + value: "100" +--- +apiVersion: resource.k8s.io/v1 +kind: ResourceSlice +metadata: + name: node-1-gpu-0-devices +spec: + driver: gpu.example.com + pool: + name: node-1-pool + generation: 1 + resourceSliceCount: 2 + nodeName: node-1 + devices: + # MIG partitions + - name: gpu-0-mig-1g-0 + attributes: + type: + string: "mig-1g" + consumesCounters: + - counterSet: gpu-0-counters + counters: + multiprocessors: + value: "20" + - name: gpu-0-mig-1g-1 + attributes: + type: + string: "mig-1g" + consumesCounters: + - counterSet: gpu-0-counters + counters: + multiprocessors: + value: "20" + # MPS shares + - name: gpu-0-mps-0 + attributes: + type: + string: "mps" + consumesCounters: + - counterSet: gpu-0-counters + counters: + multiprocessors: + value: "50" + - name: gpu-0-mps-1 + attributes: + type: + 
string: "mps" + consumesCounters: + - counterSet: gpu-0-counters + counters: + multiprocessors: + value: "50" +``` + +ResourceClaims — pod-a requests a MIG partition, pod-b requests an MPS share: + +```yaml +apiVersion: resource.k8s.io/v1 +kind: ResourceClaim +metadata: + name: pod-a-gpu + namespace: default +spec: + devices: + requests: + - name: gpu + selectors: + - cel: + expression: >- + device.driver == 'gpu.example.com' && + device.attributes['type'].string == 'mig-1g' +--- +apiVersion: resource.k8s.io/v1 +kind: ResourceClaim +metadata: + name: pod-b-gpu + namespace: default +spec: + devices: + requests: + - name: gpu + selectors: + - cel: + expression: >- + device.driver == 'gpu.example.com' && + device.attributes['type'].string == 'mps' +``` + +The scheduler sees `gpu-0-mig-1g-0` (20 SMs) and `gpu-0-mps-0` (50 SMs). +Total: 70 <= 100 — the counter capacity check passes. The scheduler allocates +both. But at preparation time, the driver fails because MIG and MPS cannot be +active simultaneously on the same physical GPU. Pod-b gets a cryptic +preparation error. The scheduler may retry the same incompatible combination +repeatedly, causing resource thrashing. + +#### Example 3: How the proposed API solves the problem + +With `compatibilityGroups`, the driver declares that MIG devices belong to the +`"mig"` group and MPS devices belong to the `"mps"` group. The scheduler +enforces that devices sharing a counter set must share at least one +compatibility group. 
+ +ResourceSlices — same devices, now with compatibility groups: + +```yaml +apiVersion: resource.k8s.io/v1 +kind: ResourceSlice +metadata: + name: node-1-gpu-0-counters +spec: + driver: gpu.example.com + pool: + name: node-1-pool + generation: 1 + resourceSliceCount: 2 + sharedCounters: + - name: gpu-0-counters + counters: + multiprocessors: + value: "100" +--- +apiVersion: resource.k8s.io/v1 +kind: ResourceSlice +metadata: + name: node-1-gpu-0-devices +spec: + driver: gpu.example.com + pool: + name: node-1-pool + generation: 1 + resourceSliceCount: 2 + nodeName: node-1 + devices: + # MIG partitions + - name: gpu-0-mig-1g-0 + attributes: + type: + string: "mig-1g" + consumesCounters: + - counterSet: gpu-0-counters + compatibilityGroups: + - mig + counters: + multiprocessors: + value: "20" + - name: gpu-0-mig-1g-1 + attributes: + type: + string: "mig-1g" + consumesCounters: + - counterSet: gpu-0-counters + compatibilityGroups: + - mig + counters: + multiprocessors: + value: "20" + # MPS shares + - name: gpu-0-mps-0 + attributes: + type: + string: "mps" + consumesCounters: + - counterSet: gpu-0-counters + compatibilityGroups: + - mps + counters: + multiprocessors: + value: "50" + - name: gpu-0-mps-1 + attributes: + type: + string: "mps" + consumesCounters: + - counterSet: gpu-0-counters + compatibilityGroups: + - mps + counters: + multiprocessors: + value: "50" +``` + +ResourceClaims — identical to Example 2: + +```yaml +apiVersion: resource.k8s.io/v1 +kind: ResourceClaim +metadata: + name: pod-a-gpu + namespace: default +spec: + devices: + requests: + - name: gpu + selectors: + - cel: + expression: >- + device.driver == 'gpu.example.com' && + device.attributes['type'].string == 'mig-1g' +--- +apiVersion: resource.k8s.io/v1 +kind: ResourceClaim +metadata: + name: pod-b-gpu + namespace: default +spec: + devices: + requests: + - name: gpu + selectors: + - cel: + expression: >- + device.driver == 'gpu.example.com' && + device.attributes['type'].string == 'mps' +``` + 
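To make the difference between the two checks concrete, here is a small simulation of this example (values taken from the slices above; illustrative only, not scheduler code):

```python
capacity = 100  # multiprocessors in gpu-0-counters
allocated = [{"name": "gpu-0-mig-1g-0", "cost": 20, "groups": {"mig"}}]  # pod-a's device
candidate = {"name": "gpu-0-mps-0", "cost": 50, "groups": {"mps"}}       # pod-b's request

# Counter capacity check (what the existing API already enforces):
fits = sum(d["cost"] for d in allocated) + candidate["cost"] <= capacity

# Compatibility group check (what this proposal adds):
compatible = all(candidate["groups"] & d["groups"] for d in allocated)

print(fits)        # True: 20 + 50 <= 100, so capacity alone would allow it
print(compatible)  # False: no shared group, so the allocation is rejected
```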
+The scheduler allocates `gpu-0-mig-1g-0` (group: `mig`) to pod-a. When +evaluating `gpu-0-mps-0` (group: `mps`) for pod-b, it checks +compatibility: both devices consume from `gpu-0-counters`, but they share no +compatibility group (`mig` vs `mps`). The scheduler rejects the allocation and +pod-b becomes Unschedulable with event: "claim violates device compatibility +constraints". No cryptic preparation failure, no resource thrashing. + +Two MIG devices (both group: `mig`) or two MPS devices (both group: `mps`) can +still be co-allocated, since they share a group. + +#### Example 4: Multiple compatible groups with an incompatible group + +A device may support more than two partitioning schemes, some of which can +coexist. In this example, a device advertises three partition types: `foo`, +`bar`, and `baz`. `foo` and `bar` can coexist on the same device, but `baz` +is incompatible with both. To express this, `foo` devices include `bar` in +their compatibility groups and vice versa, while `baz` devices only list +their own group. 
+ +ResourceSlices — a device advertising foo, bar, and baz partitions: + +```yaml +apiVersion: resource.k8s.io/v1 +kind: ResourceSlice +metadata: + name: node-1-device-0-counters +spec: + driver: device.example.com + pool: + name: node-1-pool + generation: 1 + resourceSliceCount: 2 + sharedCounters: + - name: device-0-counters + counters: + units: + value: "100" +--- +apiVersion: resource.k8s.io/v1 +kind: ResourceSlice +metadata: + name: node-1-device-0-devices +spec: + driver: device.example.com + pool: + name: node-1-pool + generation: 1 + resourceSliceCount: 2 + nodeName: node-1 + devices: + # foo partitions + - name: device-0-foo-0 + attributes: + type: + string: "foo" + consumesCounters: + - counterSet: device-0-counters + compatibilityGroups: + - foo + - bar + counters: + units: + value: "25" + # bar partitions + - name: device-0-bar-0 + attributes: + type: + string: "bar" + consumesCounters: + - counterSet: device-0-counters + compatibilityGroups: + - bar + - foo + counters: + units: + value: "25" + # baz partitions + - name: device-0-baz-0 + attributes: + type: + string: "baz" + consumesCounters: + - counterSet: device-0-counters + compatibilityGroups: + - baz + counters: + units: + value: "50" +``` + +`device-0-foo-0` (groups: `foo`, `bar`) and `device-0-bar-0` (groups: `bar`, +`foo`) share compatibility groups, so they can be co-allocated. `device-0-baz-0` +belongs only to `baz`, which shares no group with either foo or bar devices, so +it cannot be co-allocated with them. + +For instance, if pod-a is allocated `device-0-foo-0`, a subsequent pod +requesting `device-0-bar-0` succeeds (both share `foo` and `bar`), but a pod +requesting `device-0-baz-0` is rejected (`foo`/`bar` vs `baz` — no shared +group). + +### Scheduler Changes + +The DRA scheduler plugin is enhanced to: + +1. Maintain a cache of allocated devices per node, including their compatibility + fields (`compatibilityGroups` values). +2. 
For each candidate device during allocation, evaluate whether it is compatible
+   with all currently allocated devices on the node, and whether all allocated
+   devices are compatible with it (bidirectional check).
+3. Remove candidate devices from consideration if they violate compatibility
+   constraints.
+4. Emit clear scheduling events when a device is rejected due to compatibility.
+
+### Driver Responsibilities
+
+Resource drivers are responsible for:
+
+1. Populating `compatibilityGroups` for all devices with compatibility requirements.
+2. Ensuring compatibility rules are symmetric and consistent across all devices
+   in a ResourceSlice.
+3. Documenting their compatibility matrix.
+4. Continuing to validate at resource preparation time for version-skew safety.
+
+### Test Plan
+
+[X] I/we understand the owners of the involved components may require updates to
+existing tests to make this code solid enough prior to committing the changes necessary
+to implement this enhancement.
+
+##### Prerequisite testing updates
+
+##### Unit tests
+
+- TBD
+
+##### Integration tests
+
+- TBD
+
+##### e2e tests
+
+- TBD
+
+### Graduation Criteria
+
+#### Alpha
+
+- API defined and implemented
+- All relevant code is merged and placed behind a feature flag
+- Unit and integration tests
+- Documentation
+
+#### Beta
+
+- E2E tests passing in CI
+- Validated with at least one production DRA driver (out-of-tree testing)
+
+#### GA
+
+- At least 2 releases as beta
+
+### Upgrade / Downgrade Strategy
+
+#### Upgrade
+
+Upon upgrading, no existing `ResourceSlice` uses the new optional field yet, so
+the current behavior remains unchanged.
+
+#### Downgrade
+
+When downgrading to a version that does not implement this enhancement, older
+schedulers and API servers do not know about the added optional field and revert
+to their behavior prior to this enhancement.
+
+Allocated devices that leveraged this new field will remain allocated, and
+future allocations will not take `compatibilityGroups` into
consideration.
+
+
+### Version Skew Strategy
+An older kube-scheduler that does not understand `compatibilityGroups` will
+ignore the field and may allocate incompatible devices; drivers must therefore
+continue to validate at preparation time (see Risks and Mitigations).
+
+## Production Readiness Review Questionnaire
+
+### Feature Enablement and Rollback
+
+###### How can this feature be enabled / disabled in a live cluster?
+
+- Feature gate
+  - Feature gate name: DRADeviceCompatibilityGroups
+  - Components depending on the feature gate: kube-scheduler, kube-apiserver
+- Other
+  - Describe the mechanism:
+  - Will enabling / disabling the feature require downtime of the control
+    plane?
+  - Will enabling / disabling the feature require downtime or reprovisioning
+    of a node?
+
+###### Does enabling the feature change any default behavior?
+No, this KEP proposes an additional optional field to the `ResourceSlice` API.
+
+###### Can the feature be disabled once it has been enabled (i.e. can we roll back the enablement)?
+Yes, rolling back the enablement will revert the cluster to its pre-enablement behavior.
+
+###### What happens if we reenable the feature if it was previously rolled back?
+Existing `compatibilityGroups` configurations in `ResourceSlice`s will become effective again.
+
+###### Are there any tests for feature enablement/disablement?
+Yes, there will be integration tests to verify feature enablement/disablement.
+
+### Rollout, Upgrade and Rollback Planning
+
+###### How can a rollout or rollback fail? Can it impact already running workloads?
+This feature requires code changes in `kube-apiserver` and `kube-scheduler`, so
+a rollout or rollback could fail if those components misbehave.
+There is no impact on already running workloads.
+
+###### What specific metrics should inform a rollback?
+TBD
+
+###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested?
+TBD
+
+###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.?
+No
+
+### Monitoring Requirements
+
+###### How can an operator determine if the feature is in use by workloads?
+This feature is not intended for direct use by workloads; it is intended for
+use by DRA drivers.
+
+###### How can someone using this feature know that it is working for their instance?
+
+- Events
+  - Scheduling events:
+    - When every device considered for allocation is incompatible with the
+      devices already allocated on all nodes, the scheduler will emit the
+      following event for each node: "No available nodes found: claim violates
+      device compatibility constraints"
+- Pod.status
+  - Condition name: Unschedulable
+
+###### What are the reasonable SLOs (Service Level Objectives) for the enhancement?
+N/A
+
+###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service?
+N/A
+
+###### Are there any missing metrics that would be useful to have to improve observability of this feature?
+No
+
+### Dependencies
+This feature requires the DRA Partitionable Devices feature to be enabled.
+
+###### Does this feature depend on any specific services running in the cluster?
+No
+
+### Scalability
+
+###### Will enabling / using this feature result in any new API calls?
+No
+
+###### Will enabling / using this feature result in introducing new API types?
+No, only a new API field.
+
+###### Will enabling / using this feature result in any new calls to the cloud provider?
+No
+
+###### Will enabling / using this feature result in increasing size or count of the existing API objects?
+Yes, an additional field on the `ResourceSlice` API.
+
+###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
+Scheduling cycles will take slightly longer to complete due to the additional
+compatibility checks the scheduler performs; I expect the overhead to be
+negligible.
+
+###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
+No
+
+###### Can enabling / using this feature result in resource exhaustion of some node resources (PIDs, sockets, inodes, etc.)?
+No + +### Troubleshooting + +###### How does this feature react if the API server and/or etcd is unavailable? +No new side effects + +###### What are other known failure modes? +N/A + +###### What steps should be taken if SLOs are not being met to determine the problem? +TBD + +## Implementation History + +## Drawbacks + +Adding compatibility constraint support to the scheduler increases the +complexity of the DRA scheduling logic. The new field must be evaluated for +every device candidate during every scheduling cycle that involves DRA +resources, which adds latency and memory overhead. + +## Alternatives + +### Current Workaround: Driver-level Preparation Failure + +The existing workaround is for DRA drivers to fail resource preparation when +incompatible allocations are attempted. This approach is insufficient because: + +- It detects incompatibilities only after scheduling has committed to the +allocation, leading to pod startup failures. +- It provides no mechanism to inform the scheduler so it can try other nodes +or device combinations. +- It results in resource thrashing as the scheduler retries the same failing +combination. + +## Infrastructure Needed (Optional) + diff --git a/keps/sig-node/5963-device-compatibility-groups/kep.yaml b/keps/sig-node/5963-device-compatibility-groups/kep.yaml new file mode 100644 index 000000000000..4ccbeae4913b --- /dev/null +++ b/keps/sig-node/5963-device-compatibility-groups/kep.yaml @@ -0,0 +1,40 @@ +title: DRA Device Compatibility Groups +kep-number: 5963 +authors: + - "@omeryahud" +owning-sig: sig-node +participating-sigs: + - sig-scheduling +status: provisional +creation-date: 2026-03-17 +reviewers: + - TBD +approvers: + - TBD + +# The target maturity stage in the current dev cycle for this KEP. +# If the purpose of this KEP is to deprecate a user-visible feature +# and a Deprecated feature gates are added, they should be deprecated|disabled|removed. 
stage: alpha
+
+# The most recent milestone for which work toward delivery of this KEP has been
+# done. This can be the current (upcoming) milestone, if it is being actively
+# worked on.
+latest-milestone: v1.37
+
+# The milestone at which this feature was, or is targeted to be, at each stage.
+milestone:
+  alpha: v1.37
+  beta: v1.38
+  stable: v1.39
+
+# The following PRR answers are required at alpha release
+# List the feature gate name and the components for which it must be enabled
+feature-gates:
+  - name: DRADeviceCompatibilityGroups
+    components:
+      - kube-scheduler
+      - kube-apiserver
+disable-supported: true
+
+metrics: []

From 25c0ae6cec567e3e25f4dbf98a911a3237e15bb8 Mon Sep 17 00:00:00 2001
From: Omer Yahud
Date: Tue, 14 Apr 2026 23:01:56 +0000
Subject: [PATCH 2/5] Clarify backwards compatibility for devices without
 compatibilityGroups

Signed-off-by: Omer Yahud
---
 keps/sig-node/5963-device-compatibility-groups/README.md | 7 ++++---
 1 file changed, 4 insertions(+), 3 deletions(-)

diff --git a/keps/sig-node/5963-device-compatibility-groups/README.md b/keps/sig-node/5963-device-compatibility-groups/README.md
index feaf44d13d67..76899d42897b 100644
--- a/keps/sig-node/5963-device-compatibility-groups/README.md
+++ b/keps/sig-node/5963-device-compatibility-groups/README.md
@@ -123,9 +123,10 @@ underlying physical device.
 
 Add a `device.consumesCounters[].compatibilityGroups` field. Devices declare which
 named groups they belong to. For two devices consuming counters from the same
-counter set to be co-allocated, they must share at least one compatibility group.
-Devices without this field are considered compatible with all groups. This
-approach is simpler and has minimal API surface.
+counter set to be co-allocated, they must share at least one compatibility group.
+
+Devices without this field are considered compatible with other devices that don't
+specify this field, for backwards compatibility.
### User Stories From f2daedf0363e406ad067a9dc72fad400cd40b4bb Mon Sep 17 00:00:00 2001 From: Omer Yahud <4971966+omeryahud@users.noreply.github.com> Date: Wed, 15 Apr 2026 15:58:28 +0300 Subject: [PATCH 3/5] Apply suggestion from @pohly Co-authored-by: Patrick Ohly --- keps/sig-node/5963-device-compatibility-groups/README.md | 8 +++++++- 1 file changed, 7 insertions(+), 1 deletion(-) diff --git a/keps/sig-node/5963-device-compatibility-groups/README.md b/keps/sig-node/5963-device-compatibility-groups/README.md index 76899d42897b..bc1e139cf176 100644 --- a/keps/sig-node/5963-device-compatibility-groups/README.md +++ b/keps/sig-node/5963-device-compatibility-groups/README.md @@ -72,7 +72,13 @@ one scheme, it cannot be reconfigured to use a different scheme until all existi allocations are released. The current DRA Partitionable Devices API has no mechanism for drivers to express -these mutual exclusivity constraints. Without it, incompatible allocations are only +these mutual exclusivity constraints. A shared counter with a capacity of one +can ensure mutual exclusion, but this cannot be used here: such a counter +would have to be decremented once when allocating the *first* device from a set +of compatible devices, not once for *each* device. This cannot be expressed +at the moment. + +Without a mechanism for this, incompatible allocations are only detected during resource preparation, after the scheduler has already made its decisions, leading to pod startup failures and resource thrashing. 
This KEP introduces API and scheduler changes so that compatibility constraints can be From df0a13c152c58143d56386a7250e3531880a736b Mon Sep 17 00:00:00 2001 From: Omer Yahud Date: Sun, 19 Apr 2026 12:02:04 +0000 Subject: [PATCH 4/5] KEP-5963: address review feedback Tighten API semantics, production readiness, and version skew coverage ahead of the implementable review: - Resolve missing-field contradiction in favor of the strict "nil only matches nil" rule, stated consistently in Proposal and API. - Correct the transitivity claim to a symmetric-but-not-transitive pairwise group-intersection predicate. - Flesh out Version Skew Strategy with five concrete skew scenarios. - Populate Test Plan, rollback metric, upgrade-testing plan, and Implementation History. - Explain field placement on consumesCounters[] as an intentional hardware-scoped choice; add a multi-request/DeviceConstraints interaction subsection. - Document scheduler complexity as O(N*M*G) with expected bounds. - Expand Non-Goals, improve the scheduling event format to name the conflicting groups, and align Monitoring section with the new metric. - Clarify per-component behavior of the DRADeviceCompatibilityGroups feature gate. - Add Inverted-naming alternative, regenerate ToC. 
--- .../README.md | 248 +++++++++++++++--- 1 file changed, 218 insertions(+), 30 deletions(-) diff --git a/keps/sig-node/5963-device-compatibility-groups/README.md b/keps/sig-node/5963-device-compatibility-groups/README.md index bc1e139cf176..168cdebf2b3b 100644 --- a/keps/sig-node/5963-device-compatibility-groups/README.md +++ b/keps/sig-node/5963-device-compatibility-groups/README.md @@ -20,6 +20,7 @@ - [Example 3: How the proposed API solves the problem](#example-3-how-the-proposed-api-solves-the-problem) - [Example 4: Multiple compatible groups with an incompatible group](#example-4-multiple-compatible-groups-with-an-incompatible-group) - [Scheduler Changes](#scheduler-changes) + - [Interaction with Multi-Request Claims and Device Constraints](#interaction-with-multi-request-claims-and-device-constraints) - [Driver Responsibilities](#driver-responsibilities) - [Test Plan](#test-plan) - [Prerequisite testing updates](#prerequisite-testing-updates) @@ -39,6 +40,8 @@ - [Implementation History](#implementation-history) - [Drawbacks](#drawbacks) - [Alternatives](#alternatives) + - [Current Workaround: Driver-level Preparation Failure](#current-workaround-driver-level-preparation-failure) + - [Inverted naming: `mutualExclusionGroups`](#inverted-naming-mutualexclusiongroups) - [Infrastructure Needed (Optional)](#infrastructure-needed-optional) ## Release Signoff Checklist @@ -119,9 +122,22 @@ constraints, not just GPUs. ### Non-Goals -- Allow DRA drivers to specify compatibility between physical or virtual devices -across different physical devices. The scope of compatibility constraints is limited to virtual devices sharing the same -underlying physical device. +- Allowing DRA drivers to specify compatibility between devices that do not + share a counter set. The scope of compatibility constraints is limited to + virtual devices consuming from the same counter set (which, by convention, + represents a single underlying physical device). 
+- Providing a centralized or cluster-wide registry of compatibility group + names. Group names are opaque strings scoped to a single ResourceSlice pool + and are meaningful only to the driver that publishes them. +- Enabling the scheduler to *reconfigure* a physical device between + partitioning schemes (e.g., MIG ↔ MPS) as part of scheduling. This KEP only + addresses rejecting incompatible allocations; transitions between schemes + remain a driver concern and typically require draining existing allocations. +- Expressing compatibility constraints on `ResourceClaim` objects. The field + is driver-authored and lives only on `ResourceSlice`. +- Replacing existing counter-capacity checks. `compatibilityGroups` is an + additional predicate; capacity math on `sharedCounters` continues to apply + unchanged. ## Proposal @@ -131,8 +147,10 @@ Add a `device.consumesCounters[].compatibilityGroups` field. Devices declare whi named groups they belong to. For two devices consuming counters from the same counter set to be co-allocated, they must share at least one compatibility group. -Devices without this field are considered compatible with other devices that dont -specify this field, for backwards compatibility. +Devices that omit this field are compatible only with other devices in the same +counter set that also omit it. Existing ResourceSlices (where no device sets the +field) continue to behave as today; drivers adopting this feature should annotate +every device sharing a counter set. ### User Stories @@ -156,10 +174,14 @@ with cryptic error messages. ### Notes/Constraints/Caveats -The compatibility constraint is bidirectional and transitive: if device A -specifies a constraint that excludes device B, then allocating A must prevent -B from being allocated, and vice versa. This proposal implements this -bidirectional check in the scheduler. +The compatibility relation is **symmetric**: if A can be co-allocated with B, +then B can be co-allocated with A. 
It is **not transitive**: A and B sharing a +group, and B and C sharing a group, does not imply A and C share one. +Concretely, the scheduler evaluates the pairwise predicate +`groups(A) ∩ groups(B) ≠ ∅` against every already-allocated device on the same +counter set; it does not compute transitive closures. Drivers that want three +device types to be mutually co-allocatable must ensure every pair shares at +least one group (see Example 4). ### Risks and Mitigations @@ -187,8 +209,20 @@ for correctness and documentation of their compatibility matrix. A new field `compatibilityGroups` is added inside each entry of `device.consumesCounters[]`. It contains a list of string group names. For two devices consuming counters from the same counter set to be allocated -together, they must share at least one group name. Devices that omit this -field are considered compatible with all groups. +together, either both must omit the field (or set it to an empty list), or both +must declare the field and share at least one group name. A nil +`compatibilityGroups` and an empty `compatibilityGroups: []` are treated +identically. This means a device that declares the field is never co-allocatable +with a sibling that omits it. + +The field is placed on each `consumesCounters[]` entry rather than on the +device itself because compatibility is a physical-hardware property scoped to +the shared resource represented by the counter set. A single virtual device +that consumes from multiple counter sets may therefore declare different +groups per counter set, reflecting different exclusivity constraints on +different pieces of underlying hardware. Two devices that do not share any +counter set are never compared via this field, even if they live on the same +node or in the same `ResourceSlice`. Example showing MIG and FOO partitions on the same physical GPU: @@ -618,6 +652,9 @@ is incompatible with both. 
To express this, `foo` devices include `bar` in their compatibility groups and vice versa, while `baz` devices only list their own group. +This example is written generically — the counter name `units` stands in for +any hardware-specific resource (SMs, bandwidth, slots). + ResourceSlices — a device advertising foo, bar, and baz partitions: ```yaml @@ -712,6 +749,40 @@ The DRA scheduler plugin is enhanced to: constraints. 4. Emit clear scheduling events when a device is rejected due to compatibility. +**Complexity.** Let *M* be the number of devices already allocated on a +counter set, *N* the number of candidates under consideration for that +counter set, and *G* the maximum number of groups declared per counter-set +consumption entry. The additional filter cost per scheduling cycle is +O(*N* · *M* · *G*) for pairwise group-intersection checks, with typical *G* +≤ 4 (hardware partition modes per device are small in practice). The +existing DRA allocation loop already iterates over candidates per counter +set, so the new work is a constant-factor-per-candidate addition rather than +a new outer loop. + +### Interaction with Multi-Request Claims and Device Constraints + +**Multiple requests within one claim.** The compatibility predicate is +evaluated pairwise between every device already allocated *on the same counter +set* and each candidate, regardless of whether the allocated device belongs to +the same claim, a different claim on the same pod, or a different pod entirely. +Two devices within a single `ResourceClaim` that land on the same counter set +are therefore subject to the same pairwise check: the second request sees the +first as already-allocated state. + +**Allocation order.** The scheduler does not reorder requests within a claim +to improve feasibility. If requests are ordered such that an early compatible +pick later blocks a mandatory pick, the claim becomes Unschedulable and +standard retry behavior applies. 
This matches how existing DRA constraints +behave. + +**Composition with `DeviceConstraints`.** `compatibilityGroups` is a +driver-authored, ResourceSlice-side constraint. `DeviceConstraints` (e.g., +`matchAttribute`) is a user-authored, ResourceClaim-side constraint. The two +are evaluated independently and both must pass for a candidate to be +allocated. A claim can never *relax* a driver-declared compatibility group, +and a driver can never *force* a claim-side `matchAttribute`. They compose by +conjunction. + ### Driver Responsibilities Resource drivers are responsible for: @@ -730,24 +801,50 @@ to implement this enhancement. ##### Prerequisite testing updates +None. The DRA scheduler plugin and `ResourceSlice` validation already have +unit and integration coverage; new tests are additive. + ##### Unit tests -- TBD +- `k8s.io/dynamic-resource-allocation/structured`: pairwise group-intersection + predicate (empty, nil, single, multiple groups; nil-vs-nil, nil-vs-set, + set-vs-set; `[]` treated as nil). +- `k8s.io/kubernetes/pkg/scheduler/framework/plugins/dynamicresources`: filter + behavior with mixed compatible and incompatible candidates on the same + counter set; no-op behavior when the feature gate is disabled; no-op + behavior when no device in the slice declares the field. +- `k8s.io/kubernetes/pkg/apis/resource/validation`: field validation — + accepted shapes, max group-name length, max groups per counter consumption. ##### Integration tests -- TBD +- Feature gate enablement/disablement round-trip: field is persisted when + enabled, dropped on write when disabled. +- Scheduler rejects a claim when the only remaining candidate on a node + belongs to an incompatible group; admits it when a compatible candidate + exists on another node. +- Upgrade → downgrade → upgrade: allocations made during the "upgrade" phase + remain valid after downgrade; re-enabling enforcement does not re-evaluate + existing allocations. 
##### e2e tests -- TBD +- Fake DRA driver advertising two mutually exclusive groups (`mig`, `mps`) on + a single counter set. Scheduling a `mig` pod followed by an `mps` pod on + the same node leaves the second pod Unschedulable with the documented + event; reversing the order reproduces the behavior symmetrically. +- Same driver with compatible groups (`foo`, `bar`) — both pods schedule. +- Feature-gate-off baseline: the second pod reaches preparation and the + driver rejects it (pre-KEP behavior preserved). ### Graduation Criteria #### Alpha - API defined and implemented - All relevant code is merged and placed behind a feature flag - Unit and integration tests -- Documentation +- Driver-author documentation published under `kubernetes/website` (DRA + drivers section), including the strict nil-matching rule and a worked + MIG/MPS example. #### Beta - E2E tests passing in CI @@ -767,7 +864,37 @@ Allocated devices that leveraged this new field will remain allocated, and futur ### Version Skew Strategy -No version skew concerns + +The feature introduces a new optional field on `ResourceSlice` and new +enforcement logic in the scheduler. Skew behaviors to consider: + +**New kube-apiserver + old kube-scheduler.** The apiserver accepts and persists +`compatibilityGroups`. An old scheduler ignores the field and may allocate +incompatible devices. This degrades to the pre-KEP behavior: the DRA driver +rejects the allocation at resource preparation time. Drivers MUST continue to +validate at preparation time for this reason (see Driver Responsibilities). + +**Old kube-apiserver + new kube-scheduler.** The old apiserver drops the unknown +field on writes. ResourceSlices in etcd therefore do not carry +`compatibilityGroups`, and the new scheduler sees only nil values, producing the +pre-KEP behavior. No incorrect allocations result. 
+ +**Mixed-version HA kube-scheduler.** If one replica enforces the field and +another does not, the enforcing replica may reject allocations the +non-enforcing replica would accept. Both outcomes are safe (either the +scheduler correctly rejects, or the driver rejects at preparation time). +Resolution is to complete the scheduler rollout. + +**Downgrade with in-flight allocations.** Devices already allocated under the +new rules remain allocated across a downgrade; the post-downgrade scheduler +will not consider `compatibilityGroups` for future allocations, reverting to +pre-KEP behavior. No existing allocations are invalidated. + +**Feature gate off on one component.** If `DRADeviceCompatibilityGroups` is +enabled on kube-apiserver but disabled on kube-scheduler (or vice versa), +behavior matches the corresponding skew row above — apiserver stores the field +but scheduler ignores it, or scheduler enforces on field values that the +apiserver may drop on writes. ## Production Readiness Review Questionnaire @@ -778,12 +905,14 @@ No version skew concerns - Feature gate - Feature gate name: DRADeviceCompatibilityGroups - Components depending on the feature gate: kube-scheduler, kube-apiserver -- Other - - Describe the mechanism: - - Will enabling / disabling the feature require downtime of the control - plane? - - Will enabling / disabling the feature require downtime or reprovisioning - of a node? +- Gate behavior per component: + - **kube-apiserver**: when disabled, strips `compatibilityGroups` on writes + and hides it on reads of `ResourceSlice`. Prevents drivers from persisting + values that cannot be enforced. + - **kube-scheduler**: when disabled, ignores the field on read and does not + perform the pairwise intersection check during filtering. +- No control-plane downtime is required to toggle the gate. +- No node downtime or reprovisioning is required. ###### Does enabling the feature change any default behavior? 
No, this KEP proposes an additional optional field to the `ResourceSlice` API @@ -800,14 +929,26 @@ Yes, there will be integration tests to verify feature enablement/disablement ### Rollout, Upgrade and Rollback Planning ###### How can a rollout or rollback fail? Can it impact already running workloads? -I expect code changes in `kube-apiserver` and `kube-scheduler`, so something can go wrong with those. -No impact on already running workloads. +Rollout risk is limited to the two components touched by the feature gate +(kube-apiserver field handling and kube-scheduler filter logic). +Already-running workloads are not affected: compatibility filtering only runs +during scheduling of *new* allocations, so disabling the gate or rolling back +binaries does not disturb existing pod/device bindings. ###### What specific metrics should inform a rollback? -TBD +A new scheduler metric +`scheduler_dra_compatibility_rejections_total{driver,counter_set,reason}` +counts claim filter rejections caused by compatibility constraints. A rollback +is warranted if this metric spikes unexpectedly after a driver update (likely +an incorrect compatibility matrix — see Risks → Incorrect driver declarations). +Operators should also watch `scheduler_unschedulable_pods` correlated with +events matching `Insufficient compatible DRA devices`. ###### Were upgrade and rollback tested? Was the upgrade->downgrade->upgrade path tested? -TBD +Upgrade → downgrade → upgrade will be covered by the integration test +described in Test Plan → Integration tests. At alpha, manual verification on a +kind cluster with the feature gate flipped is acceptable; CI coverage is a +Beta graduation criterion. ###### Is the rollout accompanied by any deprecations and/or removals of features, APIs, fields of API types, flags, etc.? 
No @@ -821,7 +962,15 @@ This feature is not intended for use by workload usage, it is intended for DRA D - Events - Scheduling events: - - When all allocated devices in all Nodes are not compatible with any device that is considered for allocation the following event will be emitted by the scheduler for each Node: "No available nodes found: claim violates device compatibility constraints" + - When a candidate device is filtered out because its compatibility groups + do not intersect those of an already-allocated device on the same + counter set, the scheduler logs a per-node filter reason of the form: + ``` + device gpu-0-mps-0 (groups [mps]) incompatible with allocated device gpu-0-mig-1g-0 (groups [mig]) on counterSet gpu-0-counters + ``` + - If no node has any allocatable candidate, the standard scheduler + "0/N nodes are available" event aggregates this reason across nodes + (e.g., `4 Insufficient compatible DRA devices`). - Pod.status - Condition name: Unschedulable @@ -829,10 +978,16 @@ This feature is not intended for use by workload usage, it is intended for DRA D N/A ###### What are the SLIs (Service Level Indicators) an operator can use to determine the health of the service? -N/A +Operators can use `scheduler_dra_compatibility_rejections_total` together with +`scheduler_unschedulable_pods` and the standard DRA scheduler plugin latency +metrics to determine whether compatibility filtering is contributing to +scheduling failures or latency regressions. ###### Are there any missing metrics that would be useful to have to improve observability of this feature? -No +No — `scheduler_dra_compatibility_rejections_total{driver,counter_set,reason}` +(introduced by this KEP; see Rollout → What specific metrics should inform a +rollback) covers the primary observability need. Additional breakdowns can be +added post-alpha if field feedback justifies them. 
### Dependencies

DRA Partitionable Devices enabled
@@ -855,7 +1010,14 @@ No
Yes, additional field to the `ResourceSlice` API

###### Will enabling / using this feature result in increasing time taken by any operations covered by existing SLIs/SLOs?
-Scheduling cycles will take longer to complete due to the additional responsibility the scheduler will recieve, I expect it to be negligible
+Scheduling cycles that involve DRA devices incur an additional
+O(*N* · *M* · *G*) group-intersection check per counter set (see Design
+Details → Scheduler Changes), where *M* is devices already allocated on that
+counter set, *N* is candidates considered, and *G* is groups per device. For
+realistic values (*M* ≤ 16, *N* ≤ 64, *G* ≤ 4) the added work is in the low
+thousands of string comparisons per counter set per cycle and is not expected
+to be measurable against existing DRA scheduling cost. Benchmarks will be run
+during alpha and reported in the Implementation History.

###### Will enabling / using this feature result in non-negligible increase of resource usage (CPU, RAM, disk, IO, ...) in any components?
No
@@ -876,6 +1038,9 @@ TBD

## Implementation History

+- 2026-03-17: KEP opened, status `provisional`.
+- 2026-04-18: KEP under review; API shape and default semantics settled.
+
## Drawbacks

Adding compatibility constraint support to the scheduler increases the
@@ -897,5 +1062,28 @@ or device combinations.

- It results in resource thrashing as the scheduler retries the same failing
combination.

+### Inverted naming: `mutualExclusionGroups`
+
+An alternative API would invert the semantics: instead of declaring which
+groups a device *belongs to* (co-allocation predicate), declare which groups
+a device is *incompatible with* (exclusion predicate). Two devices A and B
+would then be co-allocatable if and only if neither device's exclusion set
+intersects the other's group memberships.
+
+The inverted model is arguably more intuitive for the motivating case — a MIG
+device "excludes MPS," full stop — and does not require drivers to encode each
+compatible pairing in their own entry (as Example 4 does, where `foo` and
+`bar` devices must carry a common group). It was rejected because:
+
+- The co-allocation framing composes naturally with the existing DRA model,
+  where counter-set membership already expresses "can share resources." A
+  group is a finer-grained membership within the same model.
+- Exclusion semantics require two fields to express the same information (the
+  groups you *are* in, and the groups you *exclude*), or a global registry of
+  group names. Membership-only is simpler.
+- Symmetry is easier to validate: a driver that forgets to include `foo` in a
+  `bar` device's groups produces a diagnosable allocation failure, rather
+  than silent incorrect behavior under exclusion semantics.
+
## Infrastructure Needed (Optional)

From 7f26f29e1c6f9bdc2e9a4977381f965a2e897012 Mon Sep 17 00:00:00 2001
From: Omer Yahud
Date: Sat, 25 Apr 2026 22:46:03 +0000
Subject: [PATCH 5/5] clarify grouping conventions

Signed-off-by: Omer Yahud
---
 .../README.md | 51 ++++++++++++-------
 1 file changed, 33 insertions(+), 18 deletions(-)

diff --git a/keps/sig-node/5963-device-compatibility-groups/README.md b/keps/sig-node/5963-device-compatibility-groups/README.md
index 168cdebf2b3b..b1fac46caed3 100644
--- a/keps/sig-node/5963-device-compatibility-groups/README.md
+++ b/keps/sig-node/5963-device-compatibility-groups/README.md
@@ -224,6 +224,13 @@ different pieces of underlying hardware. Two devices that do not share any
counter set are never compared via this field, even if they live on the same
node or in the same `ResourceSlice`.

+**Naming convention used in examples.** A device of type `T` lists `T` in its
+groups. When types `T1…Tn` are mutually co-allocatable, every device of those
+types additionally lists a shared composite group (e.g., `t1t2`). 
A type that +is compatible with no other type lists only `[T]`. The scheduler does not +parse group names — this convention is purely for readability; any opaque +strings that satisfy the symmetry requirement work. + Example showing MIG and FOO partitions on the same physical GPU: ```yaml @@ -264,7 +271,7 @@ spec: - counterSet: gpu-1-cs compatibilityGroups: - foo - - bar + - foobar counters: multiprocessors: value: "17" @@ -272,17 +279,19 @@ spec: consumesCounters: - counterSet: gpu-1-cs compatibilityGroups: - - foo - bar + - foobar counters: multiprocessors: value: "17" ``` -- `gpu-1-mig1` and `gpu-1-foo-part` share no compatibility group (`mig` vs -`foo`/`bar`), so they cannot be co-allocated on the same counter set. -- `gpu-1-foo-part` and `gpu-1-bar-part` share compatibility groups (`foo`, `bar`), -so they can be co-allocated on the same counter set. +- `gpu-1-mig1` (groups: `mig`) and `gpu-1-foo-part` (groups: `foo`, `foobar`) +share no compatibility group, so they cannot be co-allocated on the same +counter set. +- `gpu-1-foo-part` (groups: `foo`, `foobar`) and `gpu-1-bar-part` (groups: +`bar`, `foobar`) share the `foobar` group, so they can be co-allocated on the +same counter set. ### Examples @@ -641,16 +650,22 @@ pod-b becomes Unschedulable with event: "claim violates device compatibility constraints". No cryptic preparation failure, no resource thrashing. Two MIG devices (both group: `mig`) or two MPS devices (both group: `mps`) can -still be co-allocated, since they share a group. +still be co-allocated, since they share a group. Each device lists only its +own type because MIG and MPS are not compatible with each other; if they +were, they would also share a composite group like `migmps`. #### Example 4: Multiple compatible groups with an incompatible group A device may support more than two partitioning schemes, some of which can coexist. In this example, a device advertises three partition types: `foo`, `bar`, and `baz`. 
`foo` and `bar` can coexist on the same device, but `baz` -is incompatible with both. To express this, `foo` devices include `bar` in -their compatibility groups and vice versa, while `baz` devices only list -their own group. +is incompatible with both. + +By convention, each device's `compatibilityGroups` is a composite of the +types it can be co-allocated with: each device lists its own type, plus a +shared composite group for every set of types it is compatible with. So +`foo` devices list `[foo, foobar]`, `bar` devices list `[bar, foobar]`, and +`baz` — compatible with no other type — lists `[baz]`. This example is written generically — the counter name `units` stands in for any hardware-specific resource (SMs, bandwidth, slots). @@ -695,7 +710,7 @@ spec: - counterSet: device-0-counters compatibilityGroups: - foo - - bar + - foobar counters: units: value: "25" @@ -708,7 +723,7 @@ spec: - counterSet: device-0-counters compatibilityGroups: - bar - - foo + - foobar counters: units: value: "25" @@ -726,14 +741,14 @@ spec: value: "50" ``` -`device-0-foo-0` (groups: `foo`, `bar`) and `device-0-bar-0` (groups: `bar`, -`foo`) share compatibility groups, so they can be co-allocated. `device-0-baz-0` -belongs only to `baz`, which shares no group with either foo or bar devices, so -it cannot be co-allocated with them. +`device-0-foo-0` (groups: `foo`, `foobar`) and `device-0-bar-0` (groups: +`bar`, `foobar`) share the `foobar` group, so they can be co-allocated. +`device-0-baz-0` (groups: `baz`) shares no group with either, so it cannot be +co-allocated with them. For instance, if pod-a is allocated `device-0-foo-0`, a subsequent pod -requesting `device-0-bar-0` succeeds (both share `foo` and `bar`), but a pod -requesting `device-0-baz-0` is rejected (`foo`/`bar` vs `baz` — no shared +requesting `device-0-bar-0` succeeds (both share `foobar`), but a pod +requesting `device-0-baz-0` is rejected (`foo`/`foobar` vs `baz` — no shared group). ### Scheduler Changes
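
---

For readers of the series, the co-allocation rule the patches converge on — nil or empty `compatibilityGroups` matches only nil or empty, otherwise the two group lists must intersect — can be sketched as a small standalone predicate. This is an illustrative sketch only: the function name and signature are invented for the example and are not the scheduler-plugin implementation proposed by the KEP.

```go
package main

import "fmt"

// compatible reports whether two devices consuming from the same counter set
// may be co-allocated under the KEP's rules:
//   - devices that omit compatibilityGroups (nil or empty — treated
//     identically) match only other devices that also omit it;
//   - otherwise the two group lists must share at least one name.
// The check is pairwise and symmetric; it is intentionally not transitive.
func compatible(a, b []string) bool {
	if len(a) == 0 || len(b) == 0 {
		// Strict nil-matching rule: a device that declares groups is never
		// co-allocatable with a sibling that omits them.
		return len(a) == 0 && len(b) == 0
	}
	seen := make(map[string]bool, len(a))
	for _, g := range a {
		seen[g] = true
	}
	for _, g := range b {
		if seen[g] {
			return true // at least one shared compatibility group
		}
	}
	return false
}

func main() {
	// Example 4's composite-group convention: foo and bar share "foobar".
	fmt.Println(compatible([]string{"foo", "foobar"}, []string{"bar", "foobar"})) // true
	fmt.Println(compatible([]string{"baz"}, []string{"foo", "foobar"}))           // false
	fmt.Println(compatible([]string{"mig"}, []string{"mps"}))                     // false
	fmt.Println(compatible(nil, nil))                                             // true (backwards compatibility)
	fmt.Println(compatible([]string{"mig"}, nil))                                 // false (strict rule)
}
```

A scheduler-side filter would evaluate such a predicate between each candidate device and every already-allocated device on the same counter set, which is where the O(*N* · *M* · *G*) cost discussed in the series comes from.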